Hypothesis test (T-test)

seongyong · March 11, 2021

Tags: T-test, apply, hypothesis test, lambda, comma removal


What I learned

hypothesis test

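Null hypothesis: the existing claim (the status quo).
Alternative hypothesis: the claim I want to argue for.
p-value: the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one in the sample. A small p-value is evidence against the null hypothesis; it is not the probability that the null hypothesis is true.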

T-test

one sample t-test

from scipy import stats
stats.ttest_1samp(data, popmean)  # popmean: the hypothesized mean to test the data against
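
A minimal sketch of the one sample t-test above on made-up data. The sample, the seed, and the hypothesized mean of 5.0 are assumptions for illustration only. It also recomputes the t statistic by hand to show what the function measures.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 50 draws from a normal distribution (illustrative only).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.2, scale=1.0, size=50)

# H0: the population mean equals 5.0 (the value passed as popmean).
result = stats.ttest_1samp(data, popmean=5.0)
t_stat, p_value = result.statistic, result.pvalue

# The same statistic by hand: (sample mean - popmean) / (sample std / sqrt(n)),
# where the sample std uses ddof=1.
manual_t = (data.mean() - 5.0) / (data.std(ddof=1) / np.sqrt(len(data)))

print(t_stat, manual_t, p_value)
```

If p_value falls below the chosen significance level (0.05 is a common choice), the null hypothesis is rejected.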

two sample t-test

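First determine whether the two data sets being compared are independent or paired (dependent).
1) If they are independent, use stats.ttest_ind()
Check whether the variances are equal with Levene's test (stats.levene)
With many samples (roughly 30 or more) the normality check is often skipped (central limit theorem); with few samples, test for normality
-> normality holds and the variances are equal: equal_var=True (Student's t-test)
-> normality holds and the variances differ: equal_var=False (Welch's t-test)
2) If the data are paired, use stats.ttest_rel()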
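
The decision steps above can be sketched roughly as follows. The data, the seed, and the 0.05 significance level are assumptions, and the Shapiro-Wilk test (stats.shapiro) is just one possible normality check, not necessarily the one used in the original study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05  # assumed significance level

# Hypothetical independent groups (small samples, so normality is checked).
group_a = rng.normal(loc=10.0, scale=2.0, size=20)
group_b = rng.normal(loc=11.0, scale=2.0, size=20)

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution.
normal_a = stats.shapiro(group_a).pvalue > alpha
normal_b = stats.shapiro(group_b).pvalue > alpha

# Levene's test: H0 is that the groups have equal variances.
equal_var = bool(stats.levene(group_a, group_b).pvalue > alpha)

# Student's t-test when the variances look equal, Welch's t-test otherwise.
ind_result = stats.ttest_ind(group_a, group_b, equal_var=equal_var)

# Paired case: the same subjects measured twice (hypothetical before/after).
before = rng.normal(loc=10.0, scale=2.0, size=20)
after = before + rng.normal(loc=0.5, scale=1.0, size=20)
rel_result = stats.ttest_rel(before, after)

print(normal_a, normal_b, equal_var, ind_result.pvalue, rel_result.pvalue)
```

The paired test is equivalent to a one sample t-test on the differences before - after against a mean of 0.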

one-sided two sample t-test
stats.ttest_ind(data1, data2, alternative='greater')  # 'two-sided' (the default) and 'less' are also available
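
A small sketch comparing the three alternative options on made-up data (the arrays and seed are assumptions). The alternative parameter requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data1 = rng.normal(loc=1.0, scale=1.0, size=30)  # hypothetical group 1
data2 = rng.normal(loc=0.0, scale=1.0, size=30)  # hypothetical group 2

# H1 for 'greater': mean(data1) > mean(data2); for 'less': mean(data1) < mean(data2).
p_two = stats.ttest_ind(data1, data2, alternative='two-sided').pvalue
p_greater = stats.ttest_ind(data1, data2, alternative='greater').pvalue
p_less = stats.ttest_ind(data1, data2, alternative='less').pvalue

print(p_two, p_greater, p_less)
```

Because the t distribution is symmetric, the two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of them.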

one way anova

stats.f_oneway(g1, g2, g3)
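
A minimal sketch of the call above with three made-up groups (the data and seed are assumptions). The null hypothesis is that all group means are equal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
g1 = rng.normal(loc=5.0, scale=1.0, size=25)  # hypothetical groups
g2 = rng.normal(loc=5.5, scale=1.0, size=25)
g3 = rng.normal(loc=6.0, scale=1.0, size=25)

# One-way ANOVA across the three groups.
f_stat, p_value = stats.f_oneway(g1, g2, g3)

# With only two groups, one-way ANOVA agrees with Student's t-test:
# the F statistic equals the squared t statistic.
f_two = stats.f_oneway(g1, g2).statistic
t_two = stats.ttest_ind(g1, g2, equal_var=True).statistic

print(f_stat, p_value, f_two, t_two ** 2)
```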

Data sampling

import numpy as np
p = [0.1, 0, 0.3, 0.6, 0]
np.random.choice(5, 3, p=p, replace=True)  # sampling with replacement; if p is omitted, every value is equally likely
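
The same sampling can be sketched with NumPy's newer Generator interface, which makes the result reproducible through a seed. The seed and the without-replacement variant are additions for illustration, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, for reproducibility only
p = [0.1, 0, 0.3, 0.6, 0]        # probabilities for the values 0..4

# With replacement: the same value may be drawn more than once.
with_replacement = rng.choice(5, size=3, p=p, replace=True)

# Without replacement: every drawn value is distinct.
without_replacement = rng.choice(5, size=2, p=p, replace=False)

# No p given: each of 0..4 is equally likely.
uniform = rng.choice(5, size=3)

print(with_replacement, without_replacement, uniform)
```

Values 1 and 4 have probability 0, so they never appear in the weighted draws.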

Data preprocessing (additional study)

Using isin

df2[df2['자치구'].isin(sample_area)]['이팝나무']
# select the rows whose '자치구' (district) is in sample_area, then take the '이팝나무' (fringe tree) column
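
A small sketch of the isin filter on a toy table. The district names and tree counts below are invented for illustration and are not the original data. The .loc form does the row filter and the column selection in one step.

```python
import pandas as pd

# Hypothetical stand-in for df2: tree counts per district (made-up numbers).
df2 = pd.DataFrame({
    '자치구': ['종로구', '중구', '용산구', '성동구'],
    '이팝나무': [120, 80, 45, 200],
})
sample_area = ['종로구', '성동구']  # hypothetical sampled districts

# Boolean mask: True where the district is one of the sampled ones.
mask = df2['자치구'].isin(sample_area)

selected = df2[mask]['이팝나무']      # chained form, as in the notes
same = df2.loc[mask, '이팝나무']      # equivalent single-step form

print(selected.tolist())
```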

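Removing commas

import re
import pandas as pd
# four alternative ways to do the same thing; use one of them
df = df.replace(r'[^\d.]', '', regex=True).astype(float)
df.applymap(lambda x: re.sub(r'[^\d.]+', '', x)).astype(float)
df.transform(lambda x: x.str.replace(r'[^\d.]+', '', regex=True)).astype(float)
tree1 = pd.to_numeric(trees['은행나무'].str.replace(',', ''))  # '은행나무' (ginkgo) data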

-> replace works fine when used through df.apply()
-> This raised the question of why df.apply(lambda x: re.sub(r'[^\d.]+', '', x)).astype(float) does not work
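
A minimal sketch of two of the comma-removal approaches on a toy table. The values are invented, and trees here is only a stand-in for the original data.

```python
import pandas as pd

# Hypothetical stand-in for the trees data: counts stored as strings with commas.
trees = pd.DataFrame({
    '은행나무': ['1,234', '56', '7,890'],
    '이팝나무': ['12', '3,456', '789'],
})

# Whole DataFrame: drop every character that is not a digit or a dot, then cast.
cleaned = trees.replace(r'[^\d.]', '', regex=True).astype(float)

# Single column: plain (non-regex) comma removal, then numeric conversion.
tree1 = pd.to_numeric(trees['은행나무'].str.replace(',', '', regex=False))

print(cleaned)
print(tree1.tolist())
```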

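Why use map, apply, and applymap

To apply functionality that pandas does not provide, that is, a custom function I wrote myself, to a DataFrame, use map, apply, or applymap.

map: Series only (1-D array); visits each value one by one and applies the function to it
apply: works on both DataFrames and Series; on a DataFrame the function is applied to each row or column
applymap: DataFrame only; applied element by element, running a custom function (which must return a single value) on every element of the DataFrame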
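
The three functions can be compared on a toy DataFrame as sketched below (the data and helper function are made up). One caveat: since pandas 2.1, DataFrame.applymap is deprecated in favor of DataFrame.map, so the sketch picks whichever one the installed version provides.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})  # hypothetical data

def add_one(x):
    """Custom function that takes and returns a single value."""
    return x + 1

# Series.map: the function receives one element at a time.
mapped = df['a'].map(add_one)

# DataFrame.apply: the function receives a whole column (or row) as a Series.
col_range = df.apply(lambda col: col.max() - col.min())      # per column
row_sum = df.apply(lambda row: row['a'] + row['b'], axis=1)  # per row

# Element-wise over the whole DataFrame: DataFrame.map on pandas >= 2.1,
# DataFrame.applymap on older versions.
elementwise = df.map if hasattr(df, 'map') else df.applymap
squared = elementwise(lambda x: x ** 2)

print(mapped.tolist(), col_range.tolist(), row_sum.tolist())
print(squared)
```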

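What I learned about replace and re.sub()

replace can also be applied at the DataFrame level

re.sub(): works only on a single string

Conclusion

replace works at the Series level, so the commas can be removed by using it with apply.
re.sub() cannot operate on a Series, so to apply it to the whole DataFrame at once, use applymap instead of apply. Alternatively, apply per row or column and run re.sub() on each element inside.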
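
The conclusion can be checked with a short sketch on made-up data. df.apply passes each column to the function as a Series, while re.sub() expects a string, so the call raises a TypeError. The two working alternatives shown are assumptions about how one might fix it, not the only options.

```python
import re
import pandas as pd

df = pd.DataFrame({'x': ['1,000', '2,500'], 'y': ['30', '4,000']})  # hypothetical
pattern = r'[^\d.]+'

# df.apply passes a whole column (a Series) to the function; re.sub needs a str.
try:
    df.apply(lambda col: re.sub(pattern, '', col))
    failed = False
except TypeError:
    failed = True

# Option 1: stay at the Series level with the vectorized .str.replace.
by_column = df.apply(lambda col: col.str.replace(pattern, '', regex=True)).astype(float)

# Option 2: go element by element, so re.sub receives one string at a time
# (DataFrame.map on pandas >= 2.1, DataFrame.applymap on older versions).
elementwise = df.map if hasattr(df, 'map') else df.applymap
by_element = elementwise(lambda s: re.sub(pattern, '', s)).astype(float)

print(failed)
print(by_column)
```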

Future study plans

anova
various sampling methods
