⑦ 🤖 Machine Learning Day 2 - Normalization (Scaling)

JItzel · December 11, 2025

Normalization (Normalization/Scaling)

1. Why normalization (scaling) is needed

Problem: differences in scale (units)

For example, suppose we analyze features using 'height (170 cm)' and 'weight (65 kg)'.

  • Height: 150 ~ 190 (larger values)
  • Weight: 40 ~ 100 (relatively smaller values)
    When features differ greatly in scale, the feature with the larger numbers (height) ends up with an outsized influence on the result.
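The effect is easiest to see with a distance calculation. The following is a minimal sketch with made-up points and assumed feature ranges (none of these numbers come from the article): one feature lives on a 0 to 1 scale, the other on a 0 to 1,000,000 scale.

```python
import math

# Hypothetical samples: (feature on a 0-1 scale, feature on a 0-1,000,000 scale)
a = (0.1, 500_000.0)
b = (0.9, 500_100.0)  # very different from a on the small-scale feature
c = (0.1, 520_000.0)  # identical to a on the small-scale feature

def min_max(point, lows, highs):
    """Rescale each coordinate to 0-1 using the given per-feature min and max."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))

# Assumed per-feature ranges, used only for this illustration
lows, highs = (0.0, 0.0), (1.0, 1_000_000.0)

raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)
a_s, b_s, c_s = (min_max(p, lows, highs) for p in (a, b, c))
scaled_ab, scaled_ac = math.dist(a_s, b_s), math.dist(a_s, c_s)

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values the large-scale feature decides almost everything, so b looks like the nearest neighbor of a. After rescaling, both features contribute comparably and c becomes the nearest one.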

Effects

  1. Faster training: with gradient descent (SGD), the loss contours become closer to circles than to elongated ellipses, so training converges to the optimum (global minimum) more quickly.
  2. Helps against overfitting: keeps the weights from being dominated by one particular feature.
  3. Algorithms that need it: SVM, linear regression, logistic regression, KNN, neural networks (models based on gradient descent or on distances)
  • Note: tree-based models (Decision Tree, Random Forest) are barely affected by scaling.
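The convergence claim can be checked with a small experiment. This is a sketch under stated assumptions: it uses plain batch gradient descent written in NumPy (not scikit-learn's SGD), synthetic noise-free data, and hand-picked learning rates. The function name and all numbers are illustrative.

```python
import numpy as np

def gradient_descent_mse(X, y, lr, n_steps):
    """Plain batch gradient descent on mean squared error; returns the final loss."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        residual = Xb @ w - y
        w -= lr * (2.0 / len(y)) * (Xb.T @ residual)
    return float(np.mean((Xb @ w - y) ** 2))

rng = np.random.default_rng(0)
x_small = rng.uniform(0, 1, size=200)      # feature on a 0-1 scale
x_large = rng.uniform(0, 1000, size=200)   # feature on a 0-1000 scale
X_raw = np.column_stack([x_small, x_large])
y = 3.0 * x_small + 0.002 * x_large        # exactly linear target, no noise

# Standardize each column: mean 0, standard deviation 1
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# The raw features force a tiny learning rate to stay stable;
# the standardized features tolerate a much larger one.
loss_raw = gradient_descent_mse(X_raw, y, lr=1e-6, n_steps=1000)
loss_std = gradient_descent_mse(X_std, y, lr=0.1, n_steps=1000)

print(f"loss after 1000 steps, raw features:          {loss_raw:.4f}")
print(f"loss after 1000 steps, standardized features: {loss_std:.2e}")
```

With the same number of steps, the run on raw features is still far from the optimum, while the standardized run has effectively converged.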

2. Three representative scaling techniques

  • Provided by Scikit-learn's preprocessing module

| Type | Description | Formula | Characteristics |
|---|---|---|---|
| Min-Max Scaling | Maps values into the 0 ~ 1 range | x' = (x - min) / (max - min) | Preserves the shape of the distribution but is very sensitive to outliers |
| Standard Scaling | Transforms to mean 0, standard deviation 1 | x' = (x - μ) / σ | The most widely used standardization; suits models that assume normally distributed data |
| Robust Scaling | Uses the median and the IQR | x' = (x - Q2) / (Q3 - Q1) | Minimizes the influence of outliers; useful when the distribution is skewed |
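To make the three formulas concrete, here is a small sketch that applies each one by hand with NumPy to a made-up column containing one outlier, then compares the result with the corresponding scikit-learn scaler. The sample values are an assumption chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature column; 100 is a deliberate outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-Max: x' = (x - min) / (max - min)
minmax = (x - x.min()) / (x.max() - x.min())

# Standard: x' = (x - mean) / std  (population standard deviation)
standard = (x - x.mean()) / x.std()

# Robust: x' = (x - Q2) / (Q3 - Q1)
q1, q2, q3 = np.percentile(x, [25, 50, 75])
robust = (x - q2) / (q3 - q1)

# Cross-check against scikit-learn's implementations
same_minmax = np.allclose(minmax, MinMaxScaler().fit_transform(x))
same_standard = np.allclose(standard, StandardScaler().fit_transform(x))
same_robust = np.allclose(robust, RobustScaler().fit_transform(x))

print("min-max :", minmax.ravel().round(4))
print("standard:", standard.ravel().round(4))
print("robust  :", robust.ravel().round(4))
print(same_minmax, same_standard, same_robust)
```

In the Min-Max output the four ordinary values are squeezed into roughly 0 to 0.03 because the outlier defines the maximum, whereas the robust version keeps them spread between -1 and 0.5.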

3. Exercise 1: SGDRegressor and normalization (MinMax)

SGDRegressor, the linear regression model trained with gradient descent, is very sensitive to the scale of the data, so normalization is essential.

1) Prepare the data (data with large scale differences)

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import SGDRegressor

# Create the data (the 3rd column is on a much larger scale)
data = [[828, 920, 1234567, 1020, 1111],
        [824, 910, 2345612, 1090, 1234],
        [880, 900, 3456123, 1010, 1000],
        [870, 990, 2312123, 1001, 1122],
        [860, 980, 3223123, 1008, 1133],
        [850, 970, 2432123, 1100, 1221]]

# Convert to float32 for computation
df = pd.DataFrame(np.array(data, dtype=np.float32))

x_data = df.iloc[:, :-1].values   # features (independent variables)
y_data = df.iloc[:, [-1]].values  # label (dependent variable), kept 2-D for the scaler

2) Apply normalization (fit_transform)

  • For the training data, fit (find the reference values) and transform (apply the conversion) are performed in one call.
# 1. Normalize the feature data (X)
scaleF = MinMaxScaler()
x_dataN = scaleF.fit_transform(x_data)

print(x_dataN[:2])
# Every value is now between 0 and 1

# 2. Normalize the label data (Y) (recommended when using SGD on a regression problem)
# Y is normally not scaled in classification problems,
# but in regression problems with a large value range, Y is sometimes scaled too to help convergence.
scaleL = MinMaxScaler()
y_dataN = scaleL.fit_transform(y_data)
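As a sketch of what fit actually stores, the snippet below fits a MinMaxScaler on the first three feature rows of the data above and inspects the learned per-column minimum and maximum (the `data_min_` and `data_max_` attributes). It also checks that calling fit and then transform gives the same result as fit_transform. Using only three rows is a simplification for brevity.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# First three feature rows of the exercise data
x = np.array([[828, 920, 1234567, 1020],
              [824, 910, 2345612, 1090],
              [880, 900, 3456123, 1010]], dtype=np.float64)

scaler = MinMaxScaler()
scaler.fit(x)                      # learn the per-column min and max
print("min:", scaler.data_min_)
print("max:", scaler.data_max_)

two_steps = scaler.transform(x)               # apply the learned min/max
one_step = MinMaxScaler().fit_transform(x)    # same thing in a single call
print(np.allclose(two_steps, one_step))
```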

3) Train and predict (caution: feed in transformed values)

  • The model was trained on normalized data, so the inputs at prediction time must be normalized as well.
# Train
model = SGDRegressor(verbose=True, max_iter=200)
model.fit(x_dataN, y_dataN.ravel())  # flattening the label to 1-D with .ravel() is recommended

# Prediction scenario:
# what if we want a prediction for [828, 920, 1234567, 1020]?

# 1. Normalize the input data (use transform only!)
# fit has already fixed the reference values (min, max), so only transform is applied.
new_data = [[828.0, 920.0, 1234567.0, 1020.0]]
xN = scaleF.transform(new_data)

# 2. Predict with the model
pred = model.predict(xN)
print(f"Predicted scaled value: {pred}")
# e.g. [0.30035559] -> a value in the 0~1 space, so it cannot be read directly
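To illustrate why only transform is used on new inputs, here is a sketch of the pitfall: calling fit_transform on a single new row refits the scaler on that row alone. With scikit-learn's MinMaxScaler a column whose min equals its max maps to 0, so every feature collapses to 0 and the information is lost. Variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Feature columns of the exercise data
x_train = np.array([[828, 920, 1234567, 1020],
                    [824, 910, 2345612, 1090],
                    [880, 900, 3456123, 1010],
                    [870, 990, 2312123, 1001],
                    [860, 980, 3223123, 1008],
                    [850, 970, 2432123, 1100]], dtype=np.float64)
new_row = np.array([[828.0, 920.0, 1234567.0, 1020.0]])

scaler = MinMaxScaler().fit(x_train)

correct = scaler.transform(new_row)            # uses the training min/max
wrong = MinMaxScaler().fit_transform(new_row)  # refits on the single row

print("transform only      :", correct.round(4))
print("fit_transform (bad) :", wrong)
```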

4) Denormalization (Inverse Transform)

  • The 0.30... value returned by the model lives in the normalized space, so it has to be converted back into the real-world value we understand.
# 3. Denormalize (restore the original units)
# The prediction is 1-D, so reshape it to 2-D (n_samples, 1) before passing it in
original_val = scaleL.inverse_transform(pred.reshape(-1, 1))

print(f"Actual predicted value: {original_val}")
# e.g. [[1070.28315213]] -> now it reads as a real price/figure!
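For Min-Max scaling, the inverse is simply y = y' * (max - min) + min. The sketch below applies that formula by hand to the example output 0.30035559 and compares it with inverse_transform, using the label column of the exercise data. The scaler here is fitted in float64, so the last digits can differ slightly from the float32 run above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Label column of the exercise data
y = np.array([[1111.0], [1234.0], [1000.0], [1122.0], [1133.0], [1221.0]])
label_scaler = MinMaxScaler().fit(y)

pred_scaled = np.array([0.30035559])  # example model output in the 0-1 space

# Manual inverse of min-max scaling
y_min, y_max = label_scaler.data_min_[0], label_scaler.data_max_[0]
manual = pred_scaled * (y_max - y_min) + y_min

# Same operation through scikit-learn (expects a 2-D array)
via_sklearn = label_scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()

print(manual, via_sklearn)
```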

💡 Pipeline
If the scale -> fit -> predict -> inverse sequence above feels cumbersome, Scikit-learn's Pipeline can chain the feature scaler and the model into a single object. Scaling the target and inverting the predictions is handled by TransformedTargetRegressor.
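One possible way to wire this up is sketched below; it is not the article's own code. It assumes `make_pipeline` for the feature side and `TransformedTargetRegressor` for the label side, with `random_state=0` and `max_iter=1000` chosen arbitrarily.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

data = np.array([[828, 920, 1234567, 1020, 1111],
                 [824, 910, 2345612, 1090, 1234],
                 [880, 900, 3456123, 1010, 1000],
                 [870, 990, 2312123, 1001, 1122],
                 [860, 980, 3223123, 1008, 1133],
                 [850, 970, 2432123, 1100, 1221]], dtype=np.float64)
X, y = data[:, :-1], data[:, -1]

model = TransformedTargetRegressor(
    # Features: scaled inside the pipeline before reaching the regressor
    regressor=make_pipeline(MinMaxScaler(), SGDRegressor(max_iter=1000, random_state=0)),
    # Target: scaled for training, inverse-transformed automatically on predict
    transformer=MinMaxScaler(),
)
model.fit(X, y)

# Raw inputs go in, predictions come back in the original units
pred = model.predict([[828.0, 920.0, 1234567.0, 1020.0]])
print(pred)
```

With this arrangement there is no separate transform or inverse_transform call to forget at prediction time.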

4. Exercise 2: Classification model (Logistic) and normalization

A classification exercise using the Pima Indians Diabetes dataset.
In classification the label (y) is 0 or 1, so it is not scaled.

1) Prepare and scale the data

df = pd.read_csv('data/pima-indians-diabetes.data.csv')

x_data = df.iloc[:, :-1].values
y_data = df.iloc[:, -1].values

# Create the scaler
scaler = MinMaxScaler()

# Scale the entire dataset
x_dataN = scaler.fit_transform(x_data)

2) Split the data and train

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split the normalized data (x_dataN).
x_train, x_test, y_train, y_test = train_test_split(
    x_dataN, y_data, 
    test_size=0.3, 
    stratify=y_data
)

# Train
model = LogisticRegression(max_iter=500, verbose=True)
model.fit(x_train, y_train)

3) Predict on new data

# Real-world sample to predict
new_sample = [[6, 148, 72, 35, 0, 33.6, 0.627, 50]]

# Always transform with the scaler used during training before feeding it in!
xN = scaler.transform(new_sample)

result = model.predict(xN)
print(f"Diabetes prediction: {result}")  # e.g. [1] -> predicted as diabetic (1)

Data Leakage
Note: for convenience, the example above ran fit_transform on the entire dataset and only then split it. This is not the approach recommended in practice. The recommended order is:
1. Run train_test_split first.
2. Fit the scaler on the x_train data only. (scaler.fit(x_train))
3. Use that fitted scaler to transform x_train and x_test separately.
- Reason: to keep information from the test data (future data), such as min, max, and mean, from being reflected in training ahead of time.
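The three steps can be sketched as follows. Because the CSV file is not available here, the snippet uses `make_classification` to generate a stand-in dataset with 8 features; that dataset, the `random_state` values, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the diabetes data (8 features, binary label)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 1. Split first
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# 2. Fit the scaler on the training data only
scaler = MinMaxScaler()
scaler.fit(x_train)

# 3. Transform train and test with the same fitted scaler
x_train_n = scaler.transform(x_train)
x_test_n = scaler.transform(x_test)

model = LogisticRegression(max_iter=500)
model.fit(x_train_n, y_train)
print("test accuracy:", model.score(x_test_n, y_test))

# Test values may fall slightly outside 0-1; that is expected,
# because the min and max came from the training set alone.
print("test range:", x_test_n.min(), x_test_n.max())
```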

Summary

  • Normalization (scaling) removes unit differences between features and improves training performance.
  • The main options are MinMax (0~1), Standard (mean 0, standard deviation 1), and Robust (median).
  • If the training data was normalized, the data to predict must go through the same normalization. (transform)
  • If the target (y) was normalized too, inverse normalization (inverse_transform) is needed when reading the results.