⑨ 🤖 Machine Learning Day 2 - Understanding Encoding

JItzel · December 11, 2025

๐Ÿก Machine_learning

๋ชฉ๋ก ๋ณด๊ธฐ
9/14

Preprocessing Categorical Data: Label Encoding vs. One-Hot Encoding

Most machine learning models operate on numbers and cannot take text directly, so categorical data stored as strings must be converted to numbers before training (e.g., passing "apple" or "banana" as-is → Error!).
This conversion is called encoding. Let's look at the two most common approaches. 🔠 → 🔢

1. Comparing the Two Encoding Methods

  • The deciding factor is whether or not the data has an inherent ranking (order).

1) Label Encoding (Label/Ordinal Encoding)

  • Maps each categorical value 1:1 to a unique integer.
  1. Method: A → 0, B → 1, C → 2
  2. Characteristic: an order is imposed. The integers carry a mathematical magnitude, 0 < 1 < 2.
  3. Recommended for: data with a real ranking (letter grades A/B/C, tiers such as grade 1/2/3)
  4. Recommended models: tree-based models (Decision Tree, Random Forest, XGBoost, etc.)
    Caution: if applied to unordered data (apple, pear), the model may wrongly learn that "pear (1) is greater than apple (0)".

2) One-Hot Encoding

  • Converts each category into its own independent binary feature (column).
  1. Method: Male → [1, 0]
    Female → [0, 1]
  2. Characteristics:
    1) Removes the order relationship: all categories are treated equally.
    2) Curse of dimensionality: 1,000 distinct categories produce 1,000 columns. (Watch out for running out of memory.)
    3) Recommended for: nominal data (gender, region, color, and other unordered data)
    4) Recommended models: models that use feature values as magnitudes, such as distance-based or weight-based models (Linear/Logistic Regression, KNN, Neural Network)
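
To make the column growth concrete, here is a small sketch with a made-up column of 1,000 distinct labels (the `cat_i` names are hypothetical). It relies on OneHotEncoder returning a SciPy sparse matrix by default, which stores only the non-zero entries and so eases the memory concern somewhat:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality column: 1,000 distinct category labels
labels = np.array([f"cat_{i}" for i in range(1000)]).reshape(-1, 1)

encoder = OneHotEncoder()            # sparse output by default
encoded = encoder.fit_transform(labels)

print(encoded.shape)                 # (1000, 1000)
print(sparse.issparse(encoded))      # True
print(encoded.nnz)                   # 1000 stored values

# Number of cells a dense array of the same shape would hold
dense_cells = encoded.shape[0] * encoded.shape[1]
print(dense_cells)                   # 1000000
```

Each row contains exactly one 1, so the sparse matrix stores 1,000 values where a dense array would hold 1,000,000 cells.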

2. Preparing the Practice Data (Mixed-Type Data)

  • Real datasets often mix numeric and categorical columns. ColumnTransformer lets you handle them in a single step.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Create the data
data = {
    '수치형_특징': [10, 20, 30, 40, 50],       # numeric feature: not transformed (left as-is)
    '범주형_특징': ['A', 'B', 'A', 'C', 'B'],   # categorical feature: transformed (needs encoding)
    '그대로_유지': [1, 0, 1, 0, 1]              # keep as-is: not transformed
}
df = pd.DataFrame(data)
print(df)

3. Practice 1: Ordinal Encoding (When There Is an Order)

  • Use scikit-learn's OrdinalEncoder to convert only the 범주형_특징 (categorical feature) column to numbers.
# Define the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        # (name, transformer object, [list of column names to apply it to])
        ('cat', OrdinalEncoder(), ['범주형_특징'])
    ],
    remainder='passthrough' # Important! Pass the unlisted columns through instead of dropping them (the default is 'drop')
)

print(preprocessor)

# Run the transformation
X_transformed = preprocessor.fit_transform(df)
print(X_transformed)

# Interpreting the result
# Column order: the encoded column comes first, followed by the passthrough columns
# [[ 0. 10.  1.]   -> A is converted to 0
#  [ 1. 20.  0.]   -> B is converted to 1
#  [ 0. 30.  1.]   -> A is 0
#  [ 2. 40.  0.]   -> C is converted to 2
#  [ 1. 50.  1.]]

4. Practice 2: One-Hot Encoding (When There Is No Order)

  • The most commonly used approach. The 범주형_특징 column takes three values (A, B, C), so it is split into three new columns.
from sklearn.preprocessing import OneHotEncoder

# Switch to OneHotEncoder
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), ['범주형_특징'])
    ],
    remainder='passthrough'
)

X_transformed = preprocessor.fit_transform(df)
print(X_transformed)

# Interpreting the result
# [[ 1.  0.  0. 10.  1.]  -> A: [1, 0, 0]
#  [ 0.  1.  0. 20.  0.]  -> B: [0, 1, 0]
#  [ 1.  0.  0. 30.  1.]  -> A: [1, 0, 0]
#  [ 0.  0.  1. 40.  0.]  -> C: [0, 0, 1]
#  [ 0.  1.  0. 50.  1.]]

→ Analysis of the result
The first 3 columns (1. 0. 0.) are the columns created by OneHotEncoder, indicating whether the value is A, B, or C.
The last 2 columns (10. 1.) are the original columns passed through unchanged by passthrough.
Dimensionality increase: the original 3 columns grew to 5 in total (3 one-hot + 2 passthrough).


Summary

| Feature | Label (Ordinal) Encoding | One-Hot Encoding |
|---|---|---|
| Conversion | A → 0, B → 1 (integer conversion) | A → [1,0], B → [0,1] (vector conversion) |
| Advantage | Dimensionality does not increase | No order distortion |
| Disadvantage | The numeric magnitude (order) affects the model | More columns (memory burden) |
| Use cases | Ordinal data (tiers, letter grades); tree models (Random Forest, XGBoost) | Nominal data (gender, region); linear models (Regression, SVM) |
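
The two rows of guidance in the table can also be applied together. As a rough sketch on made-up data (the `grade`, `region`, and `score` columns are hypothetical), a single ColumnTransformer can apply OrdinalEncoder to an ordered column and OneHotEncoder to a nominal one:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: "grade" is ordinal (A is best), "region" is nominal
df_mixed = pd.DataFrame({
    "grade": ["B", "A", "C", "A"],
    "region": ["Seoul", "Busan", "Seoul", "Jeju"],
    "score": [70, 95, 55, 88],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Explicit ranking so that C=0 < B=1 < A=2
        ("ord", OrdinalEncoder(categories=[["C", "B", "A"]]), ["grade"]),
        # One column per region, in alphabetical order: Busan, Jeju, Seoul
        ("nom", OneHotEncoder(), ["region"]),
    ],
    remainder="passthrough",   # keep "score" unchanged
    sparse_threshold=0,        # always return a dense NumPy array
)

X = preprocessor.fit_transform(df_mixed)
print(X.shape)  # (4, 5)
print(X)
```

The output has 1 ordinal column, 3 one-hot columns, and 1 passthrough column, in that order.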