Bootstrapping

Angie An · December 1, 2023

What is bootstrapping?

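Like cross-validation, bootstrapping is a resampling method.
It draws samples with replacement (each item drawn is put back before the next draw, so duplicates are allowed) and uses them to make inferences about a value of the population.
Example: suppose we want the average height of a country's population, but our sample only contains the heights of 100 people. We draw 60 people from the sample data and compute their mean height. Each person drawn is put back into the sample rather than removed, so they can be drawn again, and we then draw a fresh set of 60 (replacement). Repeating this gives a series of mean heights, each computed from a differently composed set of 60 people.
- Repeating this process lets us infer a range for the population mean even from a limited amount of sample data. In other words, we learn how much our estimate varies as the data vary (the standard deviation of the estimate, i.e. its standard error, and a confidence interval). → This quantifies the uncertainty of the estimate.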

https://dgarcia-eu.github.io/SocialDataScience/2_SocialDynamics/025_Bootstrapping/Bootstrapping.html
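
The height example above can be sketched in a few lines of NumPy. This is only an illustration: the heights are simulated (the mean of 170 cm and spread of 8 cm are assumptions, not real data), and each resample draws 100 values, the same size as the sample, which is the usual convention. Drawing 60 as in the text only requires changing `size`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: heights (cm) of 100 people; the values are simulated.
heights = rng.normal(loc=170, scale=8, size=100)

n_resamples = 2000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Draw with replacement, so the same person can appear more than once.
    drawn = rng.choice(heights, size=heights.size, replace=True)
    boot_means[i] = drawn.mean()

# The spread of the bootstrap means estimates the standard error of the mean.
boot_se = boot_means.std(ddof=1)
analytic_se = heights.std(ddof=1) / np.sqrt(heights.size)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"sample mean: {heights.mean():.2f}")
print(f"bootstrap SE: {boot_se:.3f}, analytic SE: {analytic_se:.3f}")
print(f"95% percentile interval: [{ci_low:.2f}, {ci_high:.2f}]")
```

For the mean, the bootstrap standard error should land close to the textbook formula s/√n. The bootstrap becomes more useful for statistics that have no such simple formula.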

Uses of bootstrapping in AI

In machine learning, bootstrapping is widely used in ensemble methods, most notably through bagging in Random Forest.
- Bagging is short for Bootstrap + Aggregating
- (Bagging itself will be covered in a separate post.)
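
As a rough idea of what "Bootstrap + Aggregating" means in practice, here is a minimal hand-rolled sketch. The synthetic dataset, the choice of 25 trees, and the variable names are assumptions made for illustration. It is not how scikit-learn implements bagging internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic binary classification data, only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_estimators = 25  # odd, so a majority vote cannot tie
trees = []
for seed in range(n_estimators):
    # Bootstrap: resample the training set with replacement.
    X_boot, y_boot = resample(X_train, y_train, replace=True, random_state=seed)
    trees.append(DecisionTreeClassifier(random_state=seed).fit(X_boot, y_boot))

# Aggregating: majority vote over the trees' predictions (labels are 0/1).
votes = np.stack([tree.predict(X_test) for tree in trees])  # (n_estimators, n_test)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("bagged accuracy:", (bagged_pred == y_test).mean())
```

Random Forest follows the same pattern and additionally considers only a random subset of features at each split.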
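Because each model is trained on a different resample, bootstrapping can act somewhat like having more data. Sampling with replacement can also be used to even out a dataset whose overall distribution is uneven, for example by oversampling an under-represented class.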
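
One way to read the "evening out" effect is oversampling a minority class with replacement. The sketch below assumes a made-up dataset with a 90/10 class split. It shows one possible use of `resample`, not a recommended recipe.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 90 samples of class 0, 10 of class 1.
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

X_major, X_minor = X[y == 0], X[y == 1]

# Resample the minority class with replacement up to the majority size.
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)

X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.array([0] * len(X_major) + [1] * len(X_minor_up))

print(np.bincount(y), "->", np.bincount(y_balanced))  # [90 10] -> [90 90]
```

The added rows are duplicates of existing minority samples, so they rebalance the class counts without adding new information.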

Python code example

sklearn.utils.resample (official documentation)

from sklearn.utils import resample 
import numpy as np

# Generate some example data
np.random.seed(2023)
original_data = np.random.normal(loc=10, scale=2, size=100)

# Number of bootstrap samples
num_samples = 1000

# Perform bootstrap resampling
bootstrap_samples = []
for _ in range(num_samples):
    resampled_data = resample(original_data)  # replace=True is the default
    bootstrap_samples.append(resampled_data)

# Now 'bootstrap_samples' contains 1000 bootstrap samples

# You can then use these samples to compute confidence intervals or other statistics
# For example, let's compute the mean and 95% confidence interval 
means = np.mean(bootstrap_samples, axis=1)
confidence_interval = np.percentile(means, [2.5, 97.5])

print("Original data mean:", np.mean(original_data))
print("Bootstrap resampled means 95% CI:", confidence_interval)

"""Output
Original data mean: 9.901016443098726
Bootstrap resampled means 95% CI: [ 9.45744398 10.32271447]"""
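
The same percentile approach works for statistics other than the mean. The helper below, `bootstrap_ci`, is a hypothetical function written for this post (it is not part of scikit-learn), and the data are generated the same way as above only for illustration.

```python
import numpy as np
from sklearn.utils import resample


def bootstrap_ci(data, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    stats = np.array([
        # Each resample has the same size as `data` and is drawn with replacement.
        statistic(resample(data, replace=True, random_state=rng))
        for _ in range(n_resamples)
    ])
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


data = np.random.RandomState(2023).normal(loc=10, scale=2, size=100)
print("median 95% CI:", bootstrap_ci(data, np.median))
print("std 95% CI:", bootstrap_ci(data, np.std))
```

Passing a seeded `RandomState` through `random_state` keeps the run reproducible without touching the global NumPy seed.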

