⑧ 🤖 Machine Learning Day 2 - Pipeline 1

JItzel · December 11, 2025

🏡 Machine_learning


Pipeline: Handling Preprocessing and Training in One Step

1. What Is a Pipeline?

A pipeline bundles the data preprocessing steps and the model training step into a single object and runs them in sequence.

Why use it

  1. Simpler code: several steps can be managed as one object, often in a single line of code.
  2. Fewer mistakes (data leakage): when you call fit on the training set, the transformers are fitted on that data only, and the same fitted transform is then applied to the test data, so the two sets are never scaled inconsistently.
  3. Better reproducibility: the whole workflow is spelled out in one place, which makes it easier to get the same result when you rerun it later.
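To make the data leakage point concrete, here is a minimal sketch. The synthetic data and the variable names are my own assumptions for illustration, not part of the original example. It checks that the scaler inside the pipeline learns its statistics from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] + X[:, 1] > 100).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)  # the scaler only ever sees the training rows

scaler = pipe.named_steps["standardscaler"]

# Compare the learned mean with the training-set mean and the full-data mean
matches_train = np.allclose(scaler.mean_, X_train.mean(axis=0))
matches_full = np.allclose(scaler.mean_, X.mean(axis=0))
print(matches_train, matches_full)

# At test time the pipeline calls transform (not fit_transform) on X_test
print(pipe.score(X_test, y_test))
```

The first comparison should hold and the second should not, which is the behavior that keeps test-set statistics out of the fitted scaler.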

2. Method 1: make_pipeline (the convenient way)

  • Just list the objects (class instances) you want to use, in order, as arguments to the function. Each step's name is assigned automatically as the lowercase class name.

1) Preparing the data (features with very different scales)
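A quick way to see the automatic naming is to read the names out of the steps attribute. The second pipeline below is a hypothetical case I added to show what happens, as far as I know, when the same class appears twice: a numeric suffix is appended.

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pipe = make_pipeline(MinMaxScaler(), SGDRegressor())
names = [name for name, _ in pipe.steps]
print(names)  # ['minmaxscaler', 'sgdregressor']

# Hypothetical pipeline with a repeated class, only to show the name suffixes
dup = make_pipeline(StandardScaler(), StandardScaler(), SGDRegressor())
dup_names = [name for name, _ in dup.steps]
print(dup_names)
```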

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline # ✨ the key import!

# Create the data
data = [[828, 920, 1234567, 1020, 1111],
        [824, 910, 2345612, 1090, 1234],
        [880, 900, 3456123, 1010, 1000],
        [870, 990, 2312123, 1001, 1122],
        [860, 980, 3223123, 1008, 1133],
        [850, 970, 2432123, 1100, 1221]]
data = np.float32(data)
df = pd.DataFrame(data)

x_data = df.iloc[:, :-1].values
y_data = df.iloc[:, -1].values
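To see how different the feature scales really are, the sketch below recomputes the per-column range of the same feature values and then checks what MinMaxScaler does to them. The array is copied from the data above (target column dropped); the variable names raw_range and scaled are my own.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Feature columns copied from the data above (the last column, the target, is dropped)
x_data = np.array([[828, 920, 1234567, 1020],
                   [824, 910, 2345612, 1090],
                   [880, 900, 3456123, 1010],
                   [870, 990, 2312123, 1001],
                   [860, 980, 3223123, 1008],
                   [850, 970, 2432123, 1100]], dtype=np.float64)

# Range (max - min) of each column before scaling
raw_range = x_data.max(axis=0) - x_data.min(axis=0)
print(raw_range)

# After MinMax scaling every column spans the same interval
scaled = MinMaxScaler().fit_transform(x_data)
print(scaled.min(axis=0), scaled.max(axis=0))
```

The third column has a range of 2,221,556 while the others stay below 100, so without scaling it would dominate the gradient updates of SGDRegressor.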

2) Creating and fitting the pipeline

  • Order matters! (preprocessing → ... → model)
# Build the pipeline: MinMax scaling -> SGD regression
model_pipeline = make_pipeline(MinMaxScaler(), SGDRegressor(max_iter=500))

print(model_pipeline)
# Example output:
# Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
#                 ('sgdregressor', SGDRegressor(max_iter=500))])

# Fit
# Internally, x_data is scaled (fit_transform) and then passed to the model
model_pipeline.fit(x_data, y_data)

3) Predict

  • You can pass raw, unscaled data as is; the pipeline scales it (transform) before handing it to the model.
# New data (original, unscaled values)
new_data = [[828.0, 920.0, 1234567.0, 1020.0]]

# The pipeline transforms the input, then predicts
pred = model_pipeline.predict(new_data)
print(pred)
# Example result: [762.95549553] (varies between runs because SGDRegressor has no random_state set)
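What fit and predict do inside the pipeline can be spelled out by hand. The sketch below is my own comparison on synthetic data, with random_state fixed (an addition, so both versions are deterministic). It runs the pipeline and the manual steps side by side and checks that the predictions agree.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(50, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + 10.0
X_new = rng.uniform(0, 1000, size=(3, 4))

# Pipeline version
pipe = make_pipeline(MinMaxScaler(), SGDRegressor(max_iter=500, random_state=0))
pipe.fit(X, y)
pipe_pred = pipe.predict(X_new)

# Manual version of the same steps
scaler = MinMaxScaler()
model = SGDRegressor(max_iter=500, random_state=0)
X_scaled = scaler.fit_transform(X)                    # fit: learn min/max, then scale
model.fit(X_scaled, y)
manual_pred = model.predict(scaler.transform(X_new))  # predict: transform only, no refit

same = np.allclose(pipe_pred, manual_pred)
print(same)
```

The detail that matters is the last manual line: new data goes through transform, never fit_transform, and the pipeline takes care of that for you.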

3. Pulling out the inner model (named_steps)

  • Even though the model is wrapped in a pipeline, you can still inspect its slope (w) and intercept (b).
# make_pipeline names each step after its class in lowercase ('sgdregressor')
model_reg = model_pipeline.named_steps['sgdregressor']

print("Slope:", model_reg.coef_)
print("Intercept:", model_reg.intercept_)

# Example output
# Slope: [175.15834 240.64082 246.2967  280.61044]
# Intercept: [643.11381433]
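named_steps is not the only way in. As a small sketch (the variable names are mine), recent scikit-learn versions also let you index a pipeline by step name or by position, and all of these return the very same estimator object.

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = make_pipeline(MinMaxScaler(), SGDRegressor(max_iter=500))

by_named_steps = pipe.named_steps["sgdregressor"]  # dictionary-style lookup
by_name = pipe["sgdregressor"]                     # index the pipeline by step name
by_position = pipe[-1]                             # index by position (last step)
by_steps_list = pipe.steps[-1][1]                  # raw (name, estimator) list

all_same = (by_named_steps is by_name) and (by_name is by_position) and (by_position is by_steps_list)
print(all_same)
```

Keep in mind that coef_ and intercept_ read this way describe the model in the scaled feature space, because the regressor only ever sees the scaler's output.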

4. Method 2: Pipeline (custom step names)

With the Pipeline class you can name each step however you like.
It takes a list of (name, object) tuples. This form is usually preferred for hyperparameter tuning (GridSearchCV) and for managing more complex models, since tuning addresses parameters through the step names.

1) Preparing the data (Pima Indians)
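Here is a rough sketch of why the step names matter for GridSearchCV: parameters are addressed as step name, two underscores, parameter name. The data comes from make_classification and the grid values are arbitrary choices of mine, not from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (an assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regress", LogisticRegression(max_iter=500)),
])

# "<step name>__<parameter>" targets a parameter of that step
param_grid = {"regress__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)  # the scaler is refitted inside every CV fold

print(search.best_params_)
print(search.best_score_)
```

With make_pipeline the same thing works, but the key would have to be the auto-generated name (for example logisticregression__C), which is longer and changes if you swap the class.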

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load the data
df = pd.read_csv('data/pima-indians-diabetes.data.csv')
x_data = df.iloc[:, :-1].values
y_data = df.iloc[:, -1].values

# Train/test split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

2) Building and fitting the pipeline

# Define the steps as (name, object) tuples inside a list
model_pipe = Pipeline([
    ('scaler', StandardScaler()),                  # Step 1: standardization
    ('regress', LogisticRegression(max_iter=500))  # Step 2: logistic regression
])

# Fit
model_pipe.fit(x_train, y_train)

3) Prediction and inspecting the internals

# Predict (raw, unscaled input)
print(model_pipe.predict([[6, 148, 72, 35, 0, 33.6, 0.627, 50]]))
# Result: [1]

# Access the inner model by the name I chose ('regress')
model_logi = model_pipe.named_steps['regress']

print("Weights:\n", model_logi.coef_)
print("Intercept:", model_logi.intercept_)
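The train/test split above is a natural place to measure accuracy as well. Since the CSV file may not be at hand, the sketch below uses synthetic data from make_classification as a stand-in (my assumption), but the calls would be the same with the real x_train and x_test.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data with 8 features, like the Pima dataset (an assumption for illustration only)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regress", LogisticRegression(max_iter=500)),
])
model_pipe.fit(x_train, y_train)

# Accuracy on the held-out test set (raw x_test goes in, scaling happens inside)
test_acc = model_pipe.score(x_test, y_test)
print(test_acc)

# Cross-validation on the training set: the whole pipeline is refitted per fold
cv_scores = cross_val_score(model_pipe, x_train, y_train, cv=5)
print(cv_scores.mean())
```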

5. make_pipeline vs Pipeline: Differences

| Feature | make_pipeline | Pipeline |
|---|---|---|
| Usage | List the objects only → make_pipeline(A(), B()) | A list of (name, object) tuples → Pipeline([('a', A()), ('b', B())]) |
| Step names | Generated automatically (lowercase class name) | User-defined (custom names) |
| Best for | Quick, simple experiments | Full project development, parameter tuning (GridSearch), and similar |

Summary

  1. A Pipeline is a tool that chains preprocessing (scaler) and the model (estimator) together.

  2. A single fit call handles both preprocessing and training, and predict applies the preprocessing automatically (fewer mistakes).

  3. make_pipeline() generates step names automatically, while Pipeline([...]) lets you set them yourself.

  4. To inspect inner attributes, access a step with .named_steps['name'].
