260310 [ Day 43 ] - ML, DL - Part 2 (3)

TaeHyun · March 10, 2026

TIL


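Today marked the end of the machine learning part. The anomaly detection section was full of difficult concepts and wore me out, but I figured that the harder it is, the more useful it will be later, so I tried to follow it closely.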

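SPE (Squared Prediction Error) is the sum of squared differences between the actual values and the values reconstructed by the PCA model
- In practice, the threshold is set at an upper quantile based on the actual failure rate
Reconstruct the full dataset with the PCA model
- inverse_transform : maps the principal component scores back into the shape of the original data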

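import numpy as np

# Reconstruct the scaled data from the principal component scores
X_hat = model_pca.inverse_transform(pca_score)

# Reconstruction residual per observation and feature
residual = X_scaled - X_hat

# SPE: sum of squared residuals across features (one value per row)
spe = (residual ** 2).sum(axis=1)

# Threshold at the upper quantile matching the observed failure rate
spe_threshold = np.quantile(a=spe, q=1-fail_rate)

# Flag rows whose SPE exceeds the threshold, then count them
spe_outlier = (spe > spe_threshold)

spe_outlier.sum()
# np.int64(339)

# Share of actual failures (Target = 1) among the flagged rows
df.loc[spe_outlier, 'Target'].value_counts(normalize=True).sort_index()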
# Target
# 0    0.666667
# 1    0.333333
# Name: proportion, dtype: float64
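
The snippet above depends on variables from earlier in the course (model_pca, pca_score, X_scaled, fail_rate, df). As a rough self-contained sketch of the same SPE idea, the example below uses made-up synthetic data: five features driven by two latent factors, with the first five rows deliberately pushed off that structure. All names ending in _demo, the loadings matrix, and the 2% rate are assumptions for illustration, not values from the lecture.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 5 features generated from 2 latent factors plus small noise
n = 500
z = rng.normal(size=(n, 2))
loadings = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 1.0, 1.0]])
X_demo = z @ loadings + 0.05 * rng.normal(size=(n, 5))

# Break the correlation structure in the first 5 rows (injected anomalies)
anomaly_idx = np.arange(5)
X_demo[anomaly_idx, 0] += 3.0

demo_rate = 0.02  # assumed anomaly share, standing in for fail_rate

# Scale, fit a 2-component PCA, and reconstruct the data from the scores
X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_demo_scaled)
scores_demo = pca_demo.transform(X_demo_scaled)
X_demo_hat = pca_demo.inverse_transform(scores_demo)

# SPE per row, then an upper-quantile threshold
spe_demo = ((X_demo_scaled - X_demo_hat) ** 2).sum(axis=1)
thr_demo = np.quantile(spe_demo, 1 - demo_rate)
flagged_demo = spe_demo > thr_demo

print(int(flagged_demo.sum()), bool(flagged_demo[anomaly_idx].all()))
```

Because the threshold is a quantile of the SPE values themselves, the number of flagged rows is fixed by the chosen rate. What the method actually decides is which rows land above the cutoff.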

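Isolation Forest detects anomalies by building random trees, under the assumption that anomalies are easier to isolate in feature space than normal data
- Many trees are built, and for each observation the depth at which it becomes isolated is averaged across the trees
- Anomalies often take extreme values far from the rest of the data, so they tend to be isolated at a shallow depth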

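from sklearn.ensemble import IsolationForest

# contamination sets the expected share of anomalies, here the failure rate
iso = IsolationForest(
    n_estimators=100,
    contamination=fail_rate,
    random_state=0
)

iso.fit(X=train_num)
# predict returns -1 for anomalies and 1 for normal observations
iso_pred = iso.predict(X=train_num)

iso_outlier = (iso_pred == -1)

iso_outlier.sum()
# np.int64(339)

# Share of actual failures (Target = 1) among the flagged rows
df.loc[iso_outlier, 'Target'].value_counts(normalize=True).sort_index()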
# Target
# 0    0.855457
# 1    0.144543
# Name: proportion, dtype: float64
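
To make the "average depth until isolated" idea concrete, here is a simplified sketch that counts how many random splits it takes to separate a single point from the rest. It is not how scikit-learn implements IsolationForest (which subsamples the data and normalizes the scores). The function isolation_depth, the synthetic data, and the point at (6, 6) are all assumptions made for illustration.

```python
import numpy as np

def isolation_depth(X, idx, rng, max_depth=50):
    """Count random axis-aligned splits until row `idx` is alone in its partition."""
    mask = np.ones(len(X), dtype=bool)
    for depth in range(max_depth):
        if mask.sum() <= 1:
            return depth
        feature = rng.integers(X.shape[1])
        values = X[mask, feature]
        lo, hi = values.min(), values.max()
        if lo == hi:
            continue  # no split possible on this feature; the attempt still counts as a level
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target point
        if X[idx, feature] < split:
            mask &= X[:, feature] < split
        else:
            mask &= X[:, feature] >= split
    return max_depth

rng = np.random.default_rng(0)

# Hypothetical data: 256 normal points plus one extreme point
X_normal = rng.normal(size=(256, 2))
X_iso_demo = np.vstack([X_normal, [[6.0, 6.0]]])

outlier_idx = len(X_iso_demo) - 1
inlier_idx = int(np.argmin(np.linalg.norm(X_normal, axis=1)))  # point closest to the center

n_trees = 200
outlier_depth = np.mean([isolation_depth(X_iso_demo, outlier_idx, rng) for _ in range(n_trees)])
inlier_depth = np.mean([isolation_depth(X_iso_demo, inlier_idx, rng) for _ in range(n_trees)])

print(f"outlier: {outlier_depth:.1f} splits, inlier: {inlier_depth:.1f} splits")
```

The extreme point should need far fewer splits on average than the central one, which is the signal Isolation Forest turns into an anomaly score.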

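One-Class SVM learns the boundary of the normal data and treats observations outside that boundary as anomalies
- Normal observations are transformed so that they lie far from the origin
- Anomalies are mapped to positions close to the origin
- A kernel function is used to learn nonlinear patterns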

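from sklearn.svm import OneClassSVM

# nu is an upper bound on the fraction of training errors, set here to the failure rate
ocs = OneClassSVM(nu=fail_rate)

ocs.fit(X=X_scaled)
# predict returns -1 for anomalies and 1 for normal observations
ocs_pred = ocs.predict(X=X_scaled)

ocs_outlier = (ocs_pred == -1)

ocs_outlier.sum()
# np.int64(340)

# Share of actual failures (Target = 1) among the flagged rows
df.loc[ocs_outlier, 'Target'].value_counts(normalize=True).sort_index()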
# Target
# 0    0.711765
# 1    0.288235
# Name: proportion, dtype: float64
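
As a small self-contained sketch of the boundary idea, the example below fits a One-Class SVM on made-up normal data and then scores two points placed far outside it. The data, the nu value of 0.05, and the names ending in _demo are assumptions for illustration. It also shows why the flagged count above (340) does not have to match the quantile-based count exactly: nu only bounds the share of training errors, so the flagged rate on the training data is approximate.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical data: 500 normal points and 2 points far outside them
X_norm_demo = rng.normal(size=(500, 2))
X_far_demo = np.array([[8.0, 8.0], [-9.0, 7.0]])

scaler_demo = StandardScaler().fit(X_norm_demo)
nu_demo = 0.05  # assumed anomaly share, standing in for fail_rate

ocs_demo = OneClassSVM(kernel="rbf", gamma="scale", nu=nu_demo)
ocs_demo.fit(scaler_demo.transform(X_norm_demo))

# Share of training points that fall outside the learned boundary
train_pred_demo = ocs_demo.predict(scaler_demo.transform(X_norm_demo))
train_outlier_rate = float((train_pred_demo == -1).mean())

# Far-away points: negative decision values mean "outside the boundary"
far_pred_demo = ocs_demo.predict(scaler_demo.transform(X_far_demo))
far_score_demo = ocs_demo.decision_function(scaler_demo.transform(X_far_demo))

print(train_outlier_rate, far_pred_demo, far_score_demo)
```

The sign of decision_function is what predict reports as 1 or -1, so the raw scores can be used to rank observations by how far outside the boundary they fall.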

Tomorrow there is a special lecture on GitHub. So far I have only used GitHub like Google Drive, so I hope to learn a lot about the features used for collaboration.
