⑪ 🤖 Machine Learning Day 2 - Pipeline 2

JItzel · December 13, 2025


Advanced Pipelines (ColumnTransformer & Imputer)

A plain Pipeline applies the same preprocessing to every column of the data.
Real datasets, however, mix numeric columns (which need scaling) with string columns (which need encoding), so they need different preprocessing for different columns.

1. Limits of a Plain Pipeline

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder

# Prepare the data
data = {
    '수치형_특징': [10, 20, 30, 40, 50],       # numeric feature
    '범주형_특징': ['A', 'B', 'A', 'C', 'B'],  # categorical feature
    '그대로_유지': [1, 0, 1, 0, 1]             # left as is (used as the target below)
}
df = pd.DataFrame(data)

# Problem case: the pipeline contains only an encoder
model_pipe = Pipeline([('encode', OrdinalEncoder()),
                       ('logi', LogisticRegression())])

x_data = df.iloc[:, :-1].values
y_data = df.iloc[:, -1].values

model_pipe.fit(x_data, y_data)

# Check the categories the encoder learned
enc = model_pipe.named_steps['encode']
print(enc.categories_)
# Output:
# [array([10, 20, 30, 40, 50], dtype=object),  <-- even the numeric column was encoded!
#  array(['A', 'B', 'C'], dtype=object)]
  • Problem: a Pipeline tries to apply each transform to all of the incoming data (x_data) at once,
    so the numeric column is encoded as well.

2. ColumnTransformer

  • Applies different preprocessing steps to different columns independently, then concatenates the results into a single array.

Structure
transformers list: holds tuples of the form (name, transformer, [columns to apply it to]).
remainder='passthrough': passes the columns you did not list through untouched (the default is 'drop', which discards them all, so be careful!)
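
To make the effect of remainder concrete, here is a minimal sketch on a three-row toy frame. The column names (cat, num1, num2) and values are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame: one categorical column and two numeric columns
toy = pd.DataFrame({
    'cat': ['A', 'B', 'A'],
    'num1': [10, 20, 30],
    'num2': [1, 0, 1],
})

transformers = [('enc', OrdinalEncoder(), ['cat'])]

# remainder defaults to 'drop': only the encoded column survives
dropped = ColumnTransformer(transformers).fit_transform(toy)

# remainder='passthrough': the unlisted columns are appended after the encoded one
kept = ColumnTransformer(transformers, remainder='passthrough').fit_transform(toy)

print(dropped.shape)  # (3, 1)
print(kept.shape)     # (3, 3)
print(kept[0])        # encoded 'cat' first, then num1 and num2
```

Note that the output column order follows the transformers list, with the passthrough columns placed last, so it can differ from the input column order.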

1) How to apply it, depending on the data format

  • Case A: NumPy array (use indices)
from sklearn.compose import ColumnTransformer

# Encode only the column at index 1 (범주형_특징) and pass the rest through
# The index must be wrapped in a list, like [1]
column_preprocessor = ColumnTransformer(
    [('enc', OrdinalEncoder(), [1])], 
    remainder='passthrough'
)

# Combine the preprocessor and the model in a pipeline
model_cpipe = Pipeline([
    ('ct', column_preprocessor),
    ('logi', LogisticRegression(max_iter=500))
])

# Pass array-style input at prediction time as well
model_cpipe.fit(x_data, y_data)
print(model_cpipe.predict([[10, 'A']])) # array([1])
  • Case B: Pandas DataFrame (use column names)
# Split into X and y, staying in pandas
x_df = df.iloc[:, :-1] # still a DataFrame
y_df = df.iloc[:, -1]

# Columns can now be selected by name
column_preprocessor = ColumnTransformer(
    [('enc', OrdinalEncoder(), ['범주형_특징'])],
    remainder='passthrough'
)

model_cpipe = Pipeline([
    ('ct', column_preprocessor),
    ('logi', LogisticRegression(max_iter=500))
])

model_cpipe.fit(x_df, y_df)

# At prediction time, build a DataFrame as input (column names must match)
new_df = pd.DataFrame({'수치형_특징':[10], '범주형_특징':['A']})
print(model_cpipe.predict(new_df)) # array([1])

2) Applying combined preprocessing (Encoding + Scaling)

  • Scale the numeric column and encode the categorical column at the same time
from sklearn.preprocessing import StandardScaler

# Define the list of preprocessing steps
transformers_list = [
    ('enc', OrdinalEncoder(), ['범주형_특징']),      # categorical -> encoding
    ('scale', StandardScaler(), ['수치형_특징'])     # numeric -> scaling
]

# Create the ColumnTransformer
ct = ColumnTransformer(transformers_list, remainder='passthrough')

3. ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ: SimpleImputer

SimpleImputer : ๋ฐ์ดํ„ฐ์— ์กด์žฌํ•˜๋Š” NaN์„ ์ฑ„์›Œ์คŒ.
์ „๋žต (Strategy)

  • mean: ํ‰๊ท ๊ฐ’ (์ˆ˜์น˜ํ˜•)
  • median: ์ค‘์•™๊ฐ’ (์ˆ˜์น˜ํ˜•, ์ด์ƒ์น˜์— ๊ฐ•ํ•จ)
  • most_frequent: ์ตœ๋นˆ๊ฐ’ (๋ฒ”์ฃผํ˜•/์ˆ˜์น˜ํ˜• ๋ชจ๋‘ ๊ฐ€๋Šฅ)
  • constant: ํŠน์ • ์ƒ์ˆ˜ ๊ฐ’ (0, 'Unknown' ๋“ฑ)
from sklearn.impute import SimpleImputer

# ์ „์ฒ˜๋ฆฌ๊ธฐ ์•ˆ์— Imputer ํฌํ•จ์‹œํ‚ค๊ธฐ ์˜ˆ์‹œ
# (์ฃผ์˜: Imputer๋Š” ๋ณดํ†ต ๋‹จ๋…์œผ๋กœ ์“ฐ๊ฑฐ๋‚˜ Pipeline ๋‚ด๋ถ€์— ํฌํ•จ๋จ)

# ์˜ˆ: NaN ์ปฌ๋Ÿผ์€ ์ค‘์•™๊ฐ’์œผ๋กœ ์ฑ„์šฐ๊ณ , ๋ฒ”์ฃผํ˜•์€ ์ธ์ฝ”๋”ฉ, ์ˆ˜์น˜ํ˜•์€ ์Šค์ผ€์ผ๋ง
model_list = [
    ('nan', SimpleImputer(strategy='median'), ['NaN์ปฌ๋Ÿผ']),
    ('enc', OrdinalEncoder(), ['๋ฒ”์ฃผํ˜•_ํŠน์ง•']),
    ('scale', StandardScaler(), ['์ˆ˜์น˜ํ˜•_ํŠน์ง•'])
]
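
One detail worth spelling out: a ColumnTransformer applies each entry to the original input columns in parallel, not one after another. If the same column needs to be imputed and then scaled, one common approach is to chain those steps in a nested Pipeline. The sketch below assumes a hypothetical toy frame (columns num and cat) and is only one way to arrange it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data: 'num' has one missing value
toy = pd.DataFrame({
    'num': [10.0, np.nan, 30.0, 40.0, 50.0],
    'cat': ['A', 'B', 'A', 'C', 'B'],
})

# Impute first, then scale, both on the same column
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

ct = ColumnTransformer([
    ('num', numeric_steps, ['num']),
    ('enc', OrdinalEncoder(), ['cat']),
])

out = ct.fit_transform(toy)

# The median is learned from the non-missing values 10, 30, 40, 50
filled = ct.named_transformers_['num'].named_steps['impute'].statistics_
print(filled)               # [35.]
print(np.isnan(out).any())  # False
```

Because the imputer sits inside the fitted pipeline, the median learned from the training data is reused when new rows are transformed.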

Example: Predicting Hyundai Car Prices

Goal: predict the price (가격) from the model year (년식), type (종류), fuel efficiency (연비), horsepower (마력), torque (토크), and fuel (연료).
Data: hyundaiCar.xlsx

1) Load and split the data

import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load the data
hDF = pd.read_excel('data/hyundaiCar.xlsx', sheet_name='train')

# Separate the features (X) and the target (y)
# 종류, 연료 -> categorical (OneHotEncoder)
# 년식, 연비, 마력, 토크 -> numeric (StandardScaler)
x_data = hDF[['년식', '종류', '연비', '마력', '토크', '연료']]
y_data = hDF['가격']  # 1-D target: scikit-learn regressors expect y as a 1-D array, not a one-column DataFrame

2) Build the preprocessor

# Define the preprocessing rules
# 1. Categorical (종류, 연료) -> one-hot encoding
# 2. Numeric (년식, 연비, 마력, 토크) -> standardization (StandardScaler)
#    (a larger 년식 means a newer car, so it is treated as numeric and scaled)

m_list = [
    ('enc', OneHotEncoder(), ['종류', '연료']),
    ('scale', StandardScaler(), ['년식', '연비', '마력', '토크'])
]

# No columns are left over, so remainder can be omitted or set to 'passthrough'
h_preprocessor = ColumnTransformer(m_list, remainder='passthrough')

3) Create and train the pipeline

  • The model is SGDRegressor. It is fitted by stochastic gradient descent, which is sensitive to feature scale, so scaling (StandardScaler) is essential.
# Pipeline: preprocessing -> model
h_pipe = Pipeline([
    ('ctp', h_preprocessor),
    ('model', SGDRegressor(max_iter=500, verbose=1))
])

# Train (the preprocessing runs automatically inside the pipeline)
h_pipe.fit(x_data, y_data)

4) Predict the price of a new car

  • Predict the price of a 2015 compact (준중형) gasoline car
# The input must also be a DataFrame so that the column names match
new_car = pd.DataFrame({
    '년식': [2015],
    '종류': ['준중형'],
    '연비': [12.3],
    '마력': [204],
    '토크': [27],
    '연료': ['가솔린']
})

# Predict
predicted_price = h_pipe.predict(new_car)
print(f"Predicted price: {predicted_price}")

# Example output
# array([2788.91...]) -> a predicted price of about 27.88 million won (2,788만 원)
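
One caveat when predicting on new rows: with its default settings, OneHotEncoder raises an error if a category appears that was not seen during fit. The sketch below shows this on a tiny hypothetical 연료 column, along with the handle_unknown='ignore' option as one possible way around it. The 'LPG' value is an assumption for illustration and may or may not occur in the real data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories and an unseen category at prediction time
train = pd.DataFrame({'연료': ['가솔린', '디젤', '가솔린']})
new = pd.DataFrame({'연료': ['LPG']})

# Default behaviour (handle_unknown='error'): transform fails on the unseen value
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown='ignore': the unseen value becomes an all-zero row
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
encoded = lenient.transform(new).toarray()

print(strict_failed)  # True
print(encoded)        # [[0. 0.]]
```

Whether silently encoding unknown categories as zeros is acceptable depends on the use case, so treat this as an option to consider, not a default recommendation.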

Summary

  1. ColumnTransformer: the essential tool for applying different preprocessing (encoding, scaling) to different columns
  2. remainder='passthrough': an important option that keeps the columns you did not list instead of dropping them
  3. SimpleImputer: fills missing values (NaN) with the mean, median, most frequent value, and so on
  4. Input format: if you trained on a DataFrame (with column names), it is safest to pass a DataFrame at prediction time too