๐Ÿซ ๋น…๋ถ„๊ธฐ ์‹ค๊ธฐ ์ค€๋น„

m_ngyeong · June 15, 2025

๋น…๋ฐ์ดํ„ฐ ๋ถ„์„œ ๊ธฐ์‚ฌ ์‹ค๊ธฐ

Make heavy use of help() and dir()!

  • help(): prints an object's documentation (docstring)
  • dir(): lists an object's attributes and methods (functions)
import pandas as pd

# check which functions are available via dir
print(dir(pd))
print(dir(pd.DataFrame))

import sklearn
print(sklearn.__all__)

# what preprocessing is available?
import sklearn.preprocessing
print(sklearn.preprocessing.__all__)
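
help() works the same way; for instance, to pull up the docstring of a specific method:

# print the docstring for DataFrame.fillna
help(pd.DataFrame.fillna)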

🦋 Type 1 (10 points, 3 questions): work through the steps and compute the answer

IQR (outliers)

๋ฐ์ดํ„ฐ ์ค‘์—์„œ ๋„ˆ๋ฌด ํฌ๊ฑฐ๋‚˜ ์ž‘์€ ๊ฐ’(์ด์ƒ์น˜) ๋“ค์„ ๊ฑธ๋Ÿฌ๋‚ด๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•œ๋‹ค.

Q1 = df[col].quantile(0.25)   # 1st quartile (bottom 25%)
Q3 = df[col].quantile(0.75)   # 3rd quartile (top 25%)
IQR = Q3 - Q1                 # IQR: range of the middle 50%

# outlier cutoff range
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
  • quantile(): computes quantiles of the data
    0.25 quantile (= 1st quartile), 0.5 quantile (= median), 0.75 quantile (= 3rd quartile)
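
A minimal sketch of the full filtering step, continuing with the lower/upper bounds computed above:

# keep only the rows inside the outlier cutoff range
df_clean = df[(df[col] >= lower) & (df[col] <= upper)]

# or look at the outliers themselves
outliers = df[(df[col] < lower) | (df[col] > upper)]
print(len(df), len(df_clean), len(outliers))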

🦋 Type 2 (40 points, 1 question):

1. Identify the data types

โ–ช๏ธ .info() / .info

print(train.info())

โ–ช๏ธ .shape : ํŠœํ”Œ ํ˜•ํƒœ๋กœ ๋ฐฐ์—ด ์ •๋ณด ํ™•์ธ

print(X_train.shape)

2. Data preprocessing

(1) ๋…๋ฆฝ๋ณ€์ˆ˜/์ข…์†๋ณ€์ˆ˜ ๋ถ„๋ฆฌ, train/test set ๋ถ„๋ฆฌ

  • axis = 0: computes going downward → along the columns (vertical)
  • axis = 1: computes going sideways → along the rows (horizontal)
    ⚠️ Mnemonic: 행 (row) is horizontal, like a 행거 (clothes hanger) is horizontal.
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
print(df)
'''
   A  B
0  1  4
1  2  5
2  3  6
'''

print(df.sum(axis=0)) # → sum of each column
print(df.sum(axis=1)) # → sum of each row
'''
A     6
B    15

0     5
1     7
2     9
'''

(2) Handling missing values: fillna()

# ํ™˜๋ถˆ๊ธˆ์•ก์— ๊ฒฐ์ธก์น˜๊ฐ€ ์žˆ๋‹ค๋Š” ๊ฑด ํ™˜๋ถˆ์„ ํ•˜์ง€ ์•Š์•˜๋‹ค๋Š” ์˜๋ฏธ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Œ
X_trian['ํ™˜๋ถˆ๊ธˆ์•ก'] = X_trian['ํ™˜๋ถˆ๊ธˆ์•ก'].fillna(0)
X_test['ํ™˜๋ถˆ๊ธˆ์•ก'] = X_test['ํ™˜๋ถˆ๊ธˆ์•ก'].fillna(0)

print(X_trian.isna().sum())
print(X_test.isna().sum())
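
When 0 is not a sensible default, filling with a statistic from the training data is common; a sketch assuming a hypothetical numeric column 'Age':

# compute the median on the training set and reuse it on the test set
age_median = X_train['Age'].median()   # 'Age' is a made-up column for illustration
X_train['Age'] = X_train['Age'].fillna(age_median)
X_test['Age'] = X_test['Age'].fillna(age_median)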

(3) Scaling numeric variables

Why? A large value (say, a rainfall amount) does not mean the feature is more important, so features must be put on a common scale.

  • Min-Max Scaling (min-max normalization): rescales values into the 0 to 1 range

    X_{\text{scaled}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
    # 🔧 scikit-learn
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    num_columns = X_train.select_dtypes(exclude='object').columns # keep only the numeric columns, excluding categorical ones
    X_train[num_columns] = scaler.fit_transform(X_train[num_columns]) # fit_transform(): learn the scaling → apply it
    X_test[num_columns] = scaler.transform(X_test[num_columns])       # transform(): apply only
    • Range: 0 to 1 (or -1 to 1)
    • Sensitivity: sensitive to outliers
    • Typical uses: images, deep learning, etc.
  • Standard Scaling (standardization, Z-score normalization): based on the normal distribution

    X_{\text{scaled}} = \frac{X - \mu}{\sigma}

    (ํ‰๊ท : ฮผ, ํ‘œ์ค€ํŽธ์ฐจ: ฯƒ)

    # 🔧 scikit-learn
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    • Range: mean 0, standard deviation 1
    • Sensitivity: relatively stable; most effective when the data is close to normally distributed
    • Typical uses: regression, PCA, etc.

(4) Encoding categorical (object) variables

why? "Sunny", "Rainy"๊ฐ€ ๊ฐ™์€ ๋ฌธ์ž์—ด์€ ์ปดํ“จํ„ฐ๊ฐ€ ์ดํ•ดํ•˜์ง€ ๋ชปํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ˆซ์ž๋ฅผ ๋ถ€์—ฌํ•  ํ•„์š”๊ฐ€ ์žˆ์Œ.

  • Label Encoding:

    ['pizza', 'chicken', 'cola'] → [0, 1, 2]
    
    # 🔧 scikit-learn
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    X_train['encoded'] = le.fit_transform(X_train['food'])
    X_test['encoded'] = le.transform(X_test['food'])  # transform only: never re-fit on the test set
    • No ordering is implied
    • Mainly used with tree-based models (RandomForest, XGBoost, etc.)
    • Assigns a numeric ID to each category
  • One-Hot Encoding:

    • fit → transform → DataFrame → concat
    'pizza' → [1, 0, 0]
    'chicken' → [0, 1, 0]
    'cola' → [0, 0, 1]
    
    pd.get_dummies(df['food'])
    # 🔧 scikit-learn
    from sklearn.preprocessing import OneHotEncoder
    ohe = OneHotEncoder(sparse_output=False)
    cat_cols = ['C1', 'C2', 'C3', 'C4']
    
    ohe.fit(X_train[cat_cols])
    # get_feature_names_out(): restores the encoded column names
    X_train_ohe = pd.DataFrame(ohe.transform(X_train[cat_cols]), columns=ohe.get_feature_names_out(cat_cols), index=X_train.index)
    X_test_ohe = pd.DataFrame(ohe.transform(X_test[cat_cols]), columns=ohe.get_feature_names_out(cat_cols), index=X_test.index)
    
    # drop the original categorical columns and append the encoded ones
    X_train = pd.concat([X_train.drop(columns=cat_cols), X_train_ohe], axis=1)
    X_test = pd.concat([X_test.drop(columns=cat_cols), X_test_ohe], axis=1)
    • Converts each category into a binary vector
    • Mainly used with linear models (Linear Regression, Logistic, etc.)
    • Creates a new column for every category
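
If the test set can contain categories that never appeared during fit, the encoder above raises an error at transform time; a sketch of the built-in option that avoids this:

# unseen categories become all-zero rows instead of raising an error
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')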

3. Splitting the data: train_test_split()

  • Split the full dataset into a training part and a validation part:
    • train the model on X_train and y_train,
    • then evaluate the trained model's performance on the validation data X_val and y_val.

Element | Description
X_train | training input data (independent variables)
X_val | validation input data
y_train | training target values (dependent variable, e.g. number of subway riders)
y_val | validation target values
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train, y, test_size=0.2)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
  • train_test_split(X_train, y, test_size=0.2):
    - X_train: the independent variables (feature matrix)
    - y: the dependent variable (target, e.g. number of subway riders)
    - test_size=0.2: use 20% of the data for validation
    ➡️ i.e. 80% for training, 20% for validation
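
To make the split reproducible across runs, a fixed random_state is usually passed as well (42 here is an arbitrary choice):

# the same rows end up in train/validation every time
X_train, X_val, y_train, y_val = train_test_split(X_train, y, test_size=0.2, random_state=42)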

4. Model training and validation

โ–ช๏ธ ๋ถ„๋ฅ˜(RandomForestClassifier)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train) # train the model
y_val_pred = model.predict(X_val)
  • Multiclass classification: LabelEncoder → A B C D E → 0 1 2 3 4 → inverse_transform → A B C D E (sketch below)
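
A minimal sketch of that encode → fit → predict → decode round trip, assuming y_train holds string labels:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)      # 'A'..'E' → 0..4

model = RandomForestClassifier()
model.fit(X_train, y_train_enc)

pred = model.predict(X_test)
pred_labels = le.inverse_transform(pred)     # 0..4 → back to 'A'..'E'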

โ–ช๏ธ ํšŒ๊ท€(RandomForestRegressor)

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_val_pred = model.predict(X_val)

Logistic / multiple linear regression models

import statsmodels.api as sm
X = sm.add_constant(X) # add the constant (intercept) term

model = sm.OLS(y, X).fit()  # or sm.Logit(y, X).fit() for logistic regression
print(f"coefficients:\n{model.params}")
  • Logistic: sm.Logit(y, X).fit()
  • Multiple linear: sm.OLS(y, X).fit()

🆚 Differences

Criterion | Regression | Classification
Form of y | continuous numbers (real or integer) | categorical (classes, labels)
Examples | house price, temperature, revenue, passenger counts | spam/ham, disease or not, digits 0-9
Prediction output | real values | a class (category)
Metrics | MAE, MSE, RMSE, R², etc. | accuracy, F1-score, ROC-AUC, etc.

5. ํ‰๊ฐ€

โ–ช๏ธ ๋ถ„๋ฅ˜: roc_auc_score, accuracy_score

  • Area under the ROC curve (ROC AUC score):
    • The ROC curve (Receiver Operating Characteristic curve) plots the true positive rate (TPR) against the false positive rate (FPR); the area under it quantifies how well the classifier separates the classes.
    • The maximum is 1.0; the closer to 1, the better the performance.
  • Accuracy score:
    • The fraction of samples predicted correctly.
    • The maximum is 1.0; the closer to 1, the better the performance.
\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of samples}}
from sklearn.metrics import roc_auc_score, accuracy_score
auc_score = roc_auc_score(y_val, y_val_pred)  # for AUC, predicted probabilities (model.predict_proba) are usually preferred over hard labels
acc = accuracy_score(y_val, y_val_pred)
print(f'auc_score: {auc_score}, acc: {acc}')

โ–ช๏ธ ํšŒ๊ท€: rmse, r2_score

  • RMSE: the average magnitude of the error between predicted and actual values;
    lower is better.
  • R² (coefficient of determination): maximum 1.0; the closer to 1, the better the performance.
from sklearn.metrics import root_mean_squared_error, r2_score
rmse = root_mean_squared_error(y_val, y_val_pred)
r2 = r2_score(y_val, y_val_pred)
print(rmse, r2)
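
Note that root_mean_squared_error was added in scikit-learn 1.4; on older versions an equivalent is the square root of the MSE:

from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_val, y_val_pred) ** 0.5  # same value as root_mean_squared_error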

6. Saving the results

y_pred = model.predict(X_test)
result = pd.DataFrame(y_pred, columns=['pred'])
result.to_csv('result.csv', index=False)

7. Checking the saved output

result = pd.read_csv('result.csv')
print(result)

🦋 Type 3 (15 points, 2 problems (3 sub-questions)): statistical hypothesis testing

โ–ช๏ธ ์ƒ๊ด€๊ณ„์ˆ˜: df.corr()

correlations = df.corr(numeric_only=True)['Target'].drop('Target') # exclude Target itself

# the variable with the highest absolute correlation, and its value
max_corr_var = correlations.abs().idxmax()
max_corr_value = correlations[max_corr_var]

print(f"๐Ÿ“Œ Target๊ณผ ๊ฐ€์žฅ ์„ ํ˜•๊ด€๊ณ„๊ฐ€ ํฐ ๋ณ€์ˆ˜: {max_corr_var}")
print(f"๐Ÿ”ข ์ƒ๊ด€๊ณ„์ˆ˜: {max_corr_value:.3f}")

โ–ช๏ธ F-๊ฒ€์ •(F-test): ๋‘ ์ง‘๋‹จ์˜ ๋ถ„์‚ฐ์ด ๊ฐ™์€๊ฐ€

F = \frac{s_1^2}{s_2^2}
  • s_1^2: the larger of the two sample variances (a smaller denominator inflates F, pushing toward significance)
  • s_2^2: the smaller sample variance

👉 This statistic follows an F distribution (the degrees of freedom are each group's sample size minus 1).

  • Null hypothesis (H₀): the two groups have equal variances: σ₁² = σ₂²
  • Alternative hypothesis (H₁): the variances differ: σ₁² ≠ σ₂²
var1 = group1.var()
var2 = group2.var()

# degrees of freedom = number of observations - 1
dof_1 = len(group1) - 1
dof_2 = len(group2) - 1
print(dof_1, dof_2) # 51 63 → here var2 is larger: group2 goes in the numerator, group1 in the denominator

f_stat = var2 / var1
print(round(f_stat, 3))
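
To turn the statistic into a p-value, scipy's F distribution can be used; a sketch following one common two-sided convention (doubling the upper-tail probability), with the numerator's degrees of freedom given first:

from scipy import stats

# P(F >= f_stat) under H0, doubled for a two-sided test
p_value = 2 * stats.f.sf(f_stat, dof_2, dof_1)
print(round(p_value, 4))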

โ–ช๏ธ ๋ถ„์‚ฐ ์ถ”์ •๋Ÿ‰(Sample Variance)

s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2
  • n: the sample size
  • x_i: the i-th data value
  • \bar{x}: the sample mean
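
Worth knowing for the exam: pandas' .var() already divides by n - 1 (the formula above), while NumPy's default divides by n; a quick check:

import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 4])
print(x.var())            # 1.667: sample variance, ddof=1 (divides by n-1)
print(np.var(x))          # 1.25: population variance, ddof=0 (divides by n)
print(np.var(x, ddof=1))  # 1.667: matches pandas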

โ–ช๏ธ ํ•ฉ๋™ ๋ถ„์‚ฐ ์ถ”์ •๋Ÿ‰ (Pooled Variance) : ๋“ฑ๋ถ„์‚ฐ์ผ ๋•Œ ๊ฐ€๋Šฅ

s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
  • s_1^2, s_2^2: the sample variances of the two groups
  • n_1, n_2: the sample size of each group
var1 = group1.var()
var2 = group2.var()
n1 = len(group1)
n2 = len(group2)

pooled_var = ((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2)
print(round(pooled_var, 3)) 

โ–ช๏ธ t-๊ฒ€์ •(t-test)

t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
import numpy as np

mean1 = group1.mean()
mean2 = group2.mean()

t_stat = (mean1 - mean2) / np.sqrt(pooled_var * (1/n1 + 1/n2))
  • ๋…๋ฆฝํ‘œ๋ณธ โ†’ ttest_ind()
  • ๋Œ€์‘ํ‘œ๋ณธ โ†’ ttest_rel()
  • ์›์ƒ˜ํ”Œ โ†’ ttest_1samp()
  • ํ•œํ‘œ๋ณธ โ†’

โ–ช๏ธ p-value(์œ ์˜ํ™•๋ฅ )

  • p < 0.05: significant at the 5% level → Significant
  • p ≥ 0.05: Not Significant
  • model.pvalues.max(): the largest p-value among the fitted model's coefficients
from scipy import stats

ttest_result = stats.ttest_ind(group1, group2, equal_var=True)
print(ttest_result.statistic, ttest_result.pvalue)

โ–ช๏ธ ์˜ค์ฆˆ๋น„(Odds Ratio):

The odds of an event are the ratio of the probability that it occurs to the probability that it does not; the odds ratio compares these odds.

import numpy as np
coef = model.params['age']
# odds ratio = exp(regression coefficient)
print(np.exp(coef))

"์˜ค์ฆˆ๋น„๊ฐ€ ๋ช‡ ๋ฐฐ๋กœ ๋ณ€ํ™”ํ•˜๋Š”๊ฐ€?" โ†’ ์˜ค์ฆˆ๋น„๋Š” ๊ณ„์ˆ˜์˜ ์ง€์ˆ˜ํ•จ์ˆ˜:

\text{odds ratio} = e^{\beta}

If the predictor increases by 5 units:

\text{new odds ratio} = e^{\beta \times 5}
  • odds_ratio > 1: the odds of the event increase
  • odds_ratio < 1: the odds of the event decrease
  • odds_ratio ≈ 1: little to no effect
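
Continuing the age example above, the 5-unit case is a one-liner:

# odds ratio for a 5-unit increase in age
print(np.exp(coef * 5))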

โ–ช๏ธ ์ž”์ฐจ ์ดํƒˆ๋„(residual deviance): .deviance

residual_deviance = -2 * model.llf
# equivalently, on GLM results: model.deviance

โ–ช๏ธ ๋กœ์ง“ ์šฐ๋„๊ฐ’(Log-Likelihood of the model: .llf

print(model.llf)

Linear regression models

Regression Coefficient

A regression coefficient measures the size of each independent variable's effect on the dependent variable.

\text{PIQ} = \beta_0 + \beta_1 \cdot \text{Brain} + \beta_2 \cdot \text{Height}
  • β₀: the constant term (intercept)
  • β₁: the regression coefficient of Brain
  • β₂: the regression coefficient of Height

Interpretation:

  • ฮฒโ‚ = 1.2์ด๋ฉด โ†’ Brain ๊ฐ’์ด 1 ๋‹จ์œ„ ์ฆ๊ฐ€ํ•  ๋•Œ, PIQ๋Š” ํ‰๊ท ์ ์œผ๋กœ 1.2 ์ฆ๊ฐ€.
  • ฮฒโ‚‚ = -3.5์ด๋ฉด โ†’ Height๊ฐ€ 1 ์ฆ๊ฐ€ํ•˜๋ฉด PIQ๋Š” ํ‰๊ท ์ ์œผ๋กœ 3.5 ๊ฐ์†Œ.
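
A minimal sketch of fitting this model, assuming a DataFrame df with PIQ, Brain, and Height columns:

import statsmodels.api as sm

X = sm.add_constant(df[['Brain', 'Height']])  # adds the β0 intercept column
model = sm.OLS(df['PIQ'], X).fit()
print(model.params)  # const (β0), Brain (β1), Height (β2)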

์ƒ์ˆ˜ํ•ญ(constant or intercept) ์ถ”๊ฐ€

📌 Without a constant term:

๋ชจ๋ธ์ด (0, 0)์„ ๋ฐ˜๋“œ์‹œ ์ง€๋‚˜์•ผ ํ•œ๋‹ค๋Š” ์ œ์•ฝ์ด ์ƒ๊ธด๋‹ค. ์ฆ‰,

\text{PIQ} = \beta_1 \cdot \text{Brain} + \beta_2 \cdot \text{Height}

โ†’ ๋…๋ฆฝ๋ณ€์ˆ˜๊ฐ€ ๋ชจ๋‘ 0์ผ ๋•Œ ์ข…์†๋ณ€์ˆ˜๋„ ๋ฌด์กฐ๊ฑด 0์ด์–ด์•ผ ํ•จ.

📌 With a constant term:

\text{PIQ} = \beta_0 + \beta_1 \cdot \text{Brain} + \beta_2 \cdot \text{Height}

โ†’ ๋ฐ์ดํ„ฐ์— ๋” ์ž˜ ๋งž๋Š” ์œ ์—ฐํ•œ ๋ชจ๋ธ์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ๊ณ , ์‹ค์ œ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ์—๋„ ๋” ์ ํ•ฉํ•จ.

  • When fitting a regression with statsmodels or scikit-learn, including a constant term is standard practice.
  • statsmodels does not include one by default, so it must be added explicitly with sm.add_constant(X).