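📌 Concept summary
Measures of central tendency
: how the data concentrate around a central value, summarized by measures such as the mean and median
Measures of variability
: how widely the data are spread out, summarized by measures such as the range, variance, and standard deviation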
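To make the distinction concrete, here is a minimal sketch with two made-up samples (`narrow` and `wide` are hypothetical names, not part of the survey data). Both share the same center but differ a lot in spread.

```r
# Two hypothetical samples with the same center but different spread
narrow <- c(48, 49, 50, 51, 52)
wide   <- c(10, 30, 50, 70, 90)

# Central tendency: both samples are centered on 50
mean(narrow); mean(wide)
median(narrow); median(wide)

# Variability: the second sample is far more dispersed
sd(narrow); sd(wide)
```

A central-tendency measure alone cannot tell these two samples apart, which is why it is usually reported together with a variability measure.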
📌 Loading the data and checking its structure
> library(MASS)
> data(survey)
> str(survey)
'data.frame': 237 obs. of 12 variables:
$ Sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ...
$ Wr.Hnd: num 18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
$ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
$ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
$ Fold : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ...
$ Pulse : int 92 104 87 NA 35 64 83 74 72 90 ...
$ Clap : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ...
$ Exer : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
$ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
$ Height: num 173 178 NA 160 165 ...
$ M.I : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ...
$ Age : num 18.2 17.6 16.9 20.3 23.7 ...
📌 median() : median
# Median (50th percentile) of Pulse, the respondent's pulse rate
median(survey$Pulse) # returns NA because Pulse contains missing values
median(survey$Pulse, na.rm=T)
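As a rough illustration of what median() does, the sketch below recomputes it by hand on a short made-up vector (`pulse` here is hypothetical, not the survey column): drop the missing values, sort, then take the middle value or the mean of the two middle values.

```r
# Hypothetical pulse readings with one missing value
pulse <- c(92, 104, 87, NA, 35, 64)

# Drop NAs and sort, which is what na.rm = TRUE amounts to here
x <- sort(pulse[!is.na(pulse)])
n <- length(x)

# Odd n: the middle value; even n: the mean of the two middle values
manual <- if (n %% 2 == 1) x[(n + 1) / 2] else mean(x[c(n / 2, n / 2 + 1)])

manual
median(pulse, na.rm = TRUE)  # same value as the manual calculation
median(pulse)                # NA, because the missing value is kept
```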
📌 quantile() : percentiles
quantile(data, probs=)
# 5th percentile (5%)
> quantile(survey$Pulse, probs=0.05, na.rm=T)
5%
59.55
> quantile(survey$Pulse, probs=0.5, na.rm=T) # matches the median
50%
72.5
# Passing a vector of two or more probabilities returns the percentile for each one
> quantile(survey$Pulse, probs=c(0.05, 0.95), na.rm=T) # 5th and 95th percentiles
5% 95%
59.55 92.00
> seq(0, 1, 0.25) # this is the default value of the probs argument
[1] 0.00 0.25 0.50 0.75 1.00
> quantile(survey$Pulse, na.rm=T)
0% 25% 50% 75% 100%
35.0 66.0 72.5 80.0 104.0
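Percentiles such as 59.55 fall between observed values because quantile() interpolates. The sketch below reproduces R's default rule (type 7) by hand on a small made-up sample. `q7` is a hypothetical helper written only for this illustration.

```r
# Hypothetical, already NA-free sample
x <- sort(c(60, 64, 70, 72, 75, 80, 88, 92))
n <- length(x)

# Hand-rolled version of R's default (type 7) quantile rule
q7 <- function(p) {
  h  <- (n - 1) * p + 1        # fractional position in the sorted data
  lo <- floor(h)
  hi <- min(lo + 1, n)         # guard for p = 1
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

q7(0.05)
quantile(x, probs = 0.05)      # same value, labelled "5%"
```

Other interpolation rules are available through the `type` argument of quantile(), so results can differ slightly from other software.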
# What proportion of respondents have a pulse rate of 80 or below?
> mean(survey$Pulse <= 80, na.rm=T)
[1] 0.7552083
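The mean() trick works because a logical vector is treated as 0/1, so its mean is the share of TRUE values. A small sketch on made-up numbers (`pulse` is hypothetical), with the empirical CDF from ecdf() as a cross-check:

```r
pulse <- c(92, 104, 87, NA, 35, 64, 83, 74, 72, 90)  # hypothetical values

at_most_80 <- pulse <= 80          # logical vector; NA stays NA
prop <- mean(at_most_80, na.rm = TRUE)   # share of TRUE among non-missing values
prop

# The empirical CDF evaluated at 80 gives the same proportion
ok <- pulse[!is.na(pulse)]
ecdf(ok)(80)
```

This is the inverse question to quantile(): quantile() maps a proportion to a value, while the proportion at or below a value is what the empirical CDF returns.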
📌 summary() : summary statistics
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> summary(iris$Sepal.Width) # numeric variable
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.800 3.000 3.057 3.300 4.400
> summary(iris$Species) # factor variable
setosa versicolor virginica
50 50 50
> summary(as.character(iris$Species)) # character variable
Length Class Mode
150 character character
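Since summary() on a character vector only reports length, class, and mode, one way to still get counts is table(), or converting back to a factor first. A short sketch (`species_chr` is just an illustrative name):

```r
species_chr <- as.character(iris$Species)

counts <- table(species_chr)   # counts per distinct string
counts
summary(factor(species_chr))   # same counts via a factor
```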
> summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
# summary() on a list does not compute summary statistics; it only reports each element's length, class, and mode
> iris.lst <- as.list(iris)
> summary(iris.lst)
Length Class Mode
Sepal.Length 150 -none- numeric
Sepal.Width 150 -none- numeric
Petal.Length 150 -none- numeric
Petal.Width 150 -none- numeric
Species 150 factor numeric
📌 lapply()
Summary statistics for each element of a list
> lapply(iris.lst, summary)
$Sepal.Length
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
$Sepal.Width
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.800 3.000 3.057 3.300 4.400
$Petal.Length
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.600 4.350 3.758 5.100 6.900
$Petal.Width
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.100 0.300 1.300 1.199 1.800 2.500
$Species
setosa versicolor virginica
50 50 50
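If a compact table is preferred over a list, sapply() can simplify the per-column results into a matrix. This is a sketch that assumes only the numeric columns are kept, because the factor column returns a result of a different shape (`num_cols` and `stats` are illustrative names).

```r
num_cols <- iris[sapply(iris, is.numeric)]   # drop the factor column
stats <- sapply(num_cols, summary)           # one column per variable, one row per statistic
stats
stats["Median", ]                            # pick a single statistic across variables
```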
📌 range() : range
Returns the minimum and the maximum
> range(survey$Pulse, na.rm=T)
[1] 35 104
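range() returns the two endpoints rather than a single number. To get the width of the range, diff() can be applied to the result, and IQR() gives a spread measure that is less affected by extreme values. A sketch on made-up data (`pulse` is hypothetical):

```r
pulse <- c(92, 104, 87, NA, 35, 64)   # hypothetical values

r <- range(pulse, na.rm = TRUE)       # c(min, max)
width <- diff(r)                      # max - min
width

iqr <- IQR(pulse, na.rm = TRUE)       # 75th percentile minus 25th percentile
iqr
```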
📌 var(), sd() : variance, standard deviation
> var(survey$Pulse, na.rm=T)
[1] 136.5896
> sd(survey$Pulse, na.rm=T)
[1] 11.68716
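Both functions compute the sample versions, dividing by n - 1 rather than n. The sketch below checks this by hand on a made-up vector (`x` and `manual_var` are illustrative names).

```r
x <- c(92, 104, 87, 35, 64)   # hypothetical values, no NAs
n <- length(x)

# Sample variance: sum of squared deviations divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

all.equal(manual_var, var(x))        # TRUE
all.equal(sqrt(manual_var), sd(x))   # TRUE: sd is the square root of var
```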
📌 stat.desc()
stat.desc() from the pastecs package : summary statistics
> install.packages("pastecs")
> library(pastecs)
# Summary statistics for the mpg, hp, and wt variables of the mtcars data set
> stat.desc(mtcars[c('mpg', 'hp', 'wt')])
mpg hp wt
nbr.val 32.0000000 32.0000000 32.0000000
nbr.null 0.0000000 0.0000000 0.0000000
nbr.na 0.0000000 0.0000000 0.0000000
min 10.4000000 52.0000000 1.5130000
max 33.9000000 335.0000000 5.4240000
range 23.5000000 283.0000000 3.9110000
sum 642.9000000 4694.0000000 102.9520000
median 19.2000000 123.0000000 3.3250000
mean 20.0906250 146.6875000 3.2172500
SE.mean 1.0654240 12.1203173 0.1729685
CI.mean.0.95 2.1729465 24.7195501 0.3527715
var 36.3241028 4700.8669355 0.9573790
std.dev 6.0269481 68.5628685 0.9784574
coef.var 0.2999881 0.4674077 0.3041285
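Some of the less familiar rows can be reproduced with base R, which helps in reading the table. The sketch below recomputes SE.mean, CI.mean.0.95, and coef.var for mpg. It assumes the usual definitions (standard error of the mean, half-width of the t-based 95% confidence interval, and sd divided by mean), which agree with the values printed above.

```r
x <- mtcars$mpg
n <- sum(!is.na(x))                      # nbr.val

se   <- sd(x) / sqrt(n)                  # SE.mean
half <- qt(0.975, df = n - 1) * se       # CI.mean.0.95 (half-width of the 95% CI)
cv   <- sd(x) / mean(x)                  # coef.var

round(c(SE.mean = se, CI.mean.0.95 = half, coef.var = cv), 7)
```

So the 95% confidence interval for the mean of mpg is mean(x) - half to mean(x) + half.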
📌 describe()
describe() from the psych package : descriptive statistics
> install.packages('psych')
> library(psych)
> describe(mtcars[c('mpg', 'hp', 'wt')])
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61 -0.37 1.07
hp 2 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73 -0.14 12.12
wt 3 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42 -0.02 0.17
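The trimmed, mad, and se columns can be approximated in base R. This sketch assumes a 10% trim on each side and the default scaling constant of mad(), which appear to match the mpg row above.

```r
x <- mtcars$mpg

trimmed <- mean(x, trim = 0.1)       # mean after dropping 10% from each tail
mad_x   <- mad(x)                    # median absolute deviation, scaled by 1.4826
se      <- sd(x) / sqrt(length(x))   # standard error of the mean

round(c(trimmed = trimmed, mad = mad_x, se = se), 2)
```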
📌 tapply()
tapply(vector, grouping variable, summary function applied to each group) : the grouping variable is categorical
> levels(survey$Exer) # exercise-habit categories
[1] "Freq" "None" "Some"
# Mean pulse rate by exercise habit
> tapply(survey$Pulse, INDEX=survey$Exer, FUN=mean, na.rm=T)
Freq None Some
71.96842 76.76471 76.18750
# Mean pulse rate by sex
> tapply(survey$Pulse, INDEX=survey$Sex, FUN=mean, na.rm=T)
Female Male
75.12632 73.19792
# Mean pulse rate by both grouping variables; pass them together as a list
> tapply(survey$Pulse, INDEX=list(survey$Exer, survey$Sex), FUN=mean, na.rm=T)
Female Male
Freq 73.60976 70.67925
None 71.42857 80.50000
Some 77.00000 75.03030
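Extra arguments such as na.rm are passed on to FUN for each group, and an anonymous function can be used for statistics that have no ready-made function. A sketch on made-up data (`pulse` and `exer` are hypothetical stand-ins for the survey columns):

```r
pulse <- c(70, NA, 80, 90, 60, NA)
exer  <- factor(c("Freq", "Freq", "None", "None", "Some", "Some"))

raw <- tapply(pulse, exer, mean)                    # groups containing NA come back as NA
means <- tapply(pulse, exer, mean, na.rm = TRUE)    # na.rm is forwarded to mean()
counts <- tapply(pulse, exer, function(x) sum(!is.na(x)))   # non-missing n per group

raw
means
counts
```

Checking the per-group counts alongside the means is useful when some groups are small, as with the None group in the survey data.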
📌 aggregate()
# aggregate() gives the same results, though the output format differs
> aggregate(survey$Pulse, by=list(survey$Exer), FUN=mean, na.rm=T)
Group.1 x
1 Freq 71.96842
2 None 76.76471
3 Some 76.18750
> aggregate(survey$Pulse, by=list(Exercise=survey$Exer), FUN=mean, na.rm=T) # name the grouping column
Exercise x
1 Freq 71.96842
2 None 76.76471
3 Some 76.18750
> aggregate(survey$Pulse, by=list(Exercise=survey$Exer, Sex=survey$Sex), FUN=mean, na.rm=T)
Exercise Sex x
1 Freq Female 73.60976
2 None Female 71.42857
3 Some Female 77.00000
4 Freq Male 70.67925
5 None Male 80.50000
6 Some Male 75.03030
# Unlike tapply(), aggregate() can process several columns of a data frame at once
> aggregate(survey[c('Pulse', 'Age')],
+ by=list(survey$Exer), FUN=mean, na.rm=T)
Group.1 Pulse Age
1 Freq 71.96842 20.34495
2 None 76.76471 21.47575
3 Some 76.18750 20.13952
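aggregate() also has a formula interface, but it treats missing values differently: by default it drops every row that has an NA in any variable of the formula before grouping. The sketch below uses a tiny made-up data frame (`df` and its columns are hypothetical) to show how the two forms can disagree.

```r
df <- data.frame(exer  = c("Freq", "Freq", "None", "None"),
                 pulse = c(70, NA, 80, 90),
                 age   = c(20, 30, 40, 50))

# Data-frame form: na.rm is applied per column, so age uses all rows
a1 <- aggregate(df[c("pulse", "age")], by = list(exer = df$exer),
                FUN = mean, na.rm = TRUE)

# Formula form: the row with a missing pulse is dropped for age as well
a2 <- aggregate(cbind(pulse, age) ~ exer, data = df, FUN = mean)

a1
a2
```

In this example the Freq group's mean age is 25 in the first form and 20 in the second, so the choice matters whenever columns have different missing patterns.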
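📌 by()
by(dataset, INDICES=grouping variable, FUN)
FUN may be a user-defined function. Because by() passes each group's subset to FUN as a data frame, a function that expects a vector must be applied column by column through sapply()
e.g. FUN=function(x) sapply(x, fn, na.rm=T)   # fn: the function to apply, such as mean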
> by(survey[c('Pulse', 'Age')],
+ INDICES=list(Exercise=survey$Exer),
+ FUN=summary)
Exercise: Freq
Pulse Age
Min. : 40.00 Min. :16.92
1st Qu.: 65.00 1st Qu.:17.62
Median : 71.00 Median :18.50
Mean : 71.97 Mean :20.34
3rd Qu.: 78.00 3rd Qu.:20.33
Max. :104.00 Max. :70.42
NA's :20
------------------------------------------------------------------------------------------------------------------------
Exercise: None
Pulse Age
Min. : 50.00 Min. :16.92
1st Qu.: 68.00 1st Qu.:18.29
Median : 76.00 Median :19.33
Mean : 76.76 Mean :21.48
3rd Qu.: 86.00 3rd Qu.:20.37
Max. :104.00 Max. :43.83
NA's :7
------------------------------------------------------------------------------------------------------------------------
Exercise: Some
Pulse Age
Min. : 35.00 Min. :16.75
1st Qu.: 69.50 1st Qu.:17.52
Median : 76.00 Median :18.54
Mean : 76.19 Mean :20.14
3rd Qu.: 84.25 3rd Qu.:19.67
Max. :100.00 Max. :73.00
NA's :18
> aggregate(survey[c('Pulse', 'Age')],
+ by=list(Exercise=survey$Exer),
+ FUN=summary)
Exercise Pulse.Min. Pulse.1st Qu. Pulse.Median Pulse.Mean Pulse.3rd Qu. Pulse.Max. Pulse.NA's Age.Min. Age.1st Qu. Age.Median Age.Mean Age.3rd Qu. Age.Max.
1 Freq 40.00000 65.00000 71.00000 71.96842 78.00000 104.00000 20.00000 16.91700 17.62500 18.50000 20.34495 20.33300 70.41700
2 None 50.00000 68.00000 76.00000 76.76471 86.00000 104.00000 7.00000 16.91700 18.29150 19.33350 21.47575 20.37475 43.83300
3 Some 35.00000 69.50000 76.00000 76.18750 84.25000 100.00000 18.00000 16.75000 17.52075 18.54150 20.13952 19.66700 73.00000
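The sapply() pattern mentioned for by() can be sketched as follows. The example uses mtcars grouped by cyl rather than the survey data, purely so that it runs with base R alone. `res` and `tab` are illustrative names.

```r
# Each group's subset arrives as a data frame, so mean() is applied per column
res <- by(mtcars[c("mpg", "hp")],
          INDICES = list(cyl = mtcars$cyl),
          FUN = function(d) sapply(d, mean, na.rm = TRUE))
res

# Stack the per-group vectors into one matrix (one row per group)
tab <- do.call(rbind, res)
tab
```

Passing mean directly as FUN would not work here, because mean() expects a vector and not a data frame.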
📌 describeBy()
describeBy() from the psych package
describeBy(dataset, group=grouping variable)
> describeBy(survey[c('Pulse', 'Age')],
+ group=list(Exercise=survey$Exer))
Descriptive statistics by group
Exercise: Freq
vars n mean sd median trimmed mad min max range skew kurtosis se
Pulse 1 95 71.97 10.93 71.0 71.57 10.38 40.00 104.00 64.0 0.28 0.73 1.12
Age 2 115 20.34 6.18 18.5 19.08 1.61 16.92 70.42 53.5 5.32 36.42 0.58
------------------------------------------------------------------------------------------------------------------------
Exercise: None
vars n mean sd median trimmed mad min max range skew kurtosis se
Pulse 1 17 76.76 14.14 76.00 76.73 11.86 50.00 104.00 54.00 0.20 -0.79 3.43
Age 2 24 21.48 7.06 19.33 19.80 1.61 16.92 43.83 26.92 2.32 4.09 1.44
------------------------------------------------------------------------------------------------------------------------
Exercise: Some
vars n mean sd median trimmed mad min max range skew kurtosis se
Pulse 1 80 76.19 11.67 76.00 76.66 11.86 35.00 100 65.00 -0.52 0.63 1.30
Age 2 98 20.14 6.70 18.54 18.80 1.54 16.75 73 56.25 5.69 38.61 0.68
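When psych is not available, a few of these per-group columns can be approximated with split() and sapply(). This is only a sketch covering n, mean, sd, median, and se, shown on mtcars grouped by cyl. `group_stats` is a hypothetical helper, not a psych function.

```r
# Hypothetical helper: a handful of describeBy()-style statistics for one vector
group_stats <- function(x) {
  x <- x[!is.na(x)]                      # n counts non-missing values only
  c(n = length(x), mean = mean(x), sd = sd(x),
    median = median(x), se = sd(x) / sqrt(length(x)))
}

res <- sapply(split(mtcars$mpg, mtcars$cyl), group_stats)
t(res)   # one row per cyl group
```

Note that n here is the number of non-missing values, which is why n differs between Pulse and Age within the same group in the output above.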