⑩ 🤖 Machine Learning Day 2 - Understanding Encoding

JItzel · December 13, 2025


Encoding in Practice: Label vs One-Hot (Understanding Manual Conversion)

1. Preparing the practice data

  • A simple example that mixes numeric and categorical data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Create the data
# Column names are Korean: 수치형_특징 = numeric feature,
# 범주형_특징 = categorical feature, 그대로_유지 = kept as-is (the label)
data = {
    '수치형_특징': [10, 20, 30, 40, 50],       # used as-is
    '범주형_특징': ['A', 'B', 'A', 'C', 'B'],   # needs encoding
    '그대로_유지': [1, 0, 1, 0, 1]              # label (y)
}
df = pd.DataFrame(data)
print(df)

#    수치형_특징 범주형_특징  그대로_유지
# 0      10      A       1
# 1      20      B       0
# 2      30      A       1
# 3      40      C       0
# 4      50      B       1

2. Label Encoding (Ordinal Encoding), implemented by hand

  • Convert the categorical data to integers (0, 1, 2, ...).

1) Conversion (Fit & Transform)

# dtype=int: return integers for readability (the default is float)
encode = OrdinalEncoder(dtype=int)

# Note: the input must always be 2-D, so select with df[['column']], not df['column']
# fit: learns the categories A, B, C
# transform: maps A->0, B->1, C->2
rst = encode.fit_transform(df[['범주형_특징']])

rst
# array([[0],
#        [1],
#        [0],
#        [2],
#        [1]])
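
To see where the A->0, B->1, C->2 mapping comes from, you can inspect the fitted encoder. This is a minimal sketch that rebuilds the same column on its own. The names `mapping` and `restored` are just illustrative variables, not part of the original post.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same categorical column as in the example above
df = pd.DataFrame({'범주형_특징': ['A', 'B', 'A', 'C', 'B']})

encode = OrdinalEncoder(dtype=int)
codes = encode.fit_transform(df[['범주형_특징']])

# categories_ holds one array per input column;
# the position of a category in that array is its integer code
learned = list(encode.categories_[0])
mapping = {category: index for index, category in enumerate(learned)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
restored = encode.inverse_transform(codes).ravel().tolist()
print(restored)
```

The categories are sorted before codes are assigned, which is why 'A' ends up as 0 regardless of the order in which the values appear in the data.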

2) Applying it to the DataFrame and training

# rst has shape (5, 1), so flatten it before storing it as a column
df['인코딩'] = rst.ravel()   # 인코딩 = "encoded"

# Build the training data (numeric + encoded categorical)
x_data = df[['수치형_특징', '인코딩']].values
y_data = df['그대로_유지'].values

# Train the model
model = LogisticRegression(max_iter=500)
model.fit(x_data, y_data)
print(df)

#    수치형_특징 범주형_특징  그대로_유지  인코딩
# 0      10      A       1    0
# 1      20      B       0    1
# 2      30      A       1    0
# 3      40      C       0    2
# 4      50      B       1    1

3) The hassle of predicting

What if we want a prediction for a sample with numeric value 10 and category A?
The model does not know what 'A' is, so we have to convert it to a number ourselves.

# 1. Ask the encoder which number 'A' maps to
# Pass a one-row DataFrame with the same column name used at fit time:
# a bare string or 1-D list raises an error, and a plain 2-D list triggers a feature-name warning
rst = int(encode.transform(pd.DataFrame({'범주형_특징': ['A']}))[0, 0])
print(f"A is encoded as {rst}.")

# 2. Predict with the encoded value (0)
print(model.predict([[10, rst]]))
# [1]

→ Every single prediction has to go through encode.transform first, which gets tedious.
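
One way to hide that repetition without a pipeline is to wrap the two steps in a small function. The sketch below is only an illustration: `predict_raw` is a hypothetical helper name, and the setup simply repeats the training code from this section so the block runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    '수치형_특징': [10, 20, 30, 40, 50],
    '범주형_특징': ['A', 'B', 'A', 'C', 'B'],
    '그대로_유지': [1, 0, 1, 0, 1],
})

# Fit the encoder and the model exactly as in the steps above
encode = OrdinalEncoder(dtype=int)
df['인코딩'] = encode.fit_transform(df[['범주형_특징']]).ravel()

model = LogisticRegression(max_iter=500)
model.fit(df[['수치형_특징', '인코딩']].values, df['그대로_유지'].values)


def predict_raw(numeric_value, category):
    """Hypothetical helper: encode one raw category, then predict."""
    frame = pd.DataFrame({'범주형_특징': [category]})
    code = int(encode.transform(frame)[0, 0])
    return int(model.predict([[numeric_value, code]])[0])


prediction = predict_raw(10, 'A')
print(prediction)
```

The helper still depends on the encoder and the model living side by side as separate objects, which is the bookkeeping a pipeline takes over.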

3. One-Hot Encoding, implemented by hand

  • Let's split A, B, and C into separate columns that each hold 0 or 1.

Method 1: Using scikit-learn's OneHotEncoder

  • sparse_output=False: return a dense array with every 0 and 1 visible instead of a compressed sparse matrix (before scikit-learn 1.2 this parameter was called sparse)
oencode = OneHotEncoder(sparse_output=False)

# Run the transformation
result = oencode.fit_transform(df[['범주형_특징']])
print(result)

# Output
# [[1. 0. 0.]   -> A
#  [0. 1. 0.]   -> B
#  [1. 0. 0.]   -> A
#  ...

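Combining the data (NumPy hstack): attach the one-hot encoded columns to the right of the existing data.

# original columns + result (one-hot)
# Select the three original columns explicitly so the indices below stay valid
# even if df already carries the extra encoded column from the previous section
base = df[['수치형_특징', '범주형_특징', '그대로_유지']].values
arr = np.hstack((base, result))

# array([[10, 'A', 1, 1.0, 0.0, 0.0],
#        [20, 'B', 0, 0.0, 1.0, 0.0],
#        [30, 'A', 1, 1.0, 0.0, 0.0],
#        [40, 'C', 0, 0.0, 0.0, 1.0],
#        [50, 'B', 1, 0.0, 1.0, 0.0]], dtype=object)
# Pick only the needed columns by index (numeric + 3 one-hot)
# 0: numeric, 3, 4, 5: the one-hot encoded columns
# arr has dtype=object because of the string column, so cast back to numbers
x_data = arr[:, [0, 3, 4, 5]].astype(float)
y_data = arr[:, 2].astype(int)  # label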

Method 2: Using pandas get_dummies

# dtype=int: return 1/0 instead of True/False
result = pd.get_dummies(df['범주형_특징'], dtype=int)

# Attach the columns side by side with pd.concat (axis=1)
cdf = pd.concat([df, result], axis=1)

print(cdf)
#    수치형_특징 범주형_특징  ...  A  B  C
# 0      10      A  ...  1  0  0
# ...
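
One thing to keep in mind with get_dummies: it only creates columns for the categories present in the data you hand it, so a single new row produces fewer columns than the training data had. A possible workaround, sketched here with illustrative variable names, is to reindex the new dummies against the training columns.

```python
import pandas as pd

# Training data: get_dummies sees A, B, and C
train = pd.DataFrame({'범주형_특징': ['A', 'B', 'A', 'C', 'B']})
train_dummies = pd.get_dummies(train['범주형_특징'], dtype=int)
train_columns = train_dummies.columns

# New data containing only 'A': get_dummies produces a single column
new = pd.DataFrame({'범주형_특징': ['A']})
new_dummies = pd.get_dummies(new['범주형_특징'], dtype=int)
print(list(new_dummies.columns))

# Align to the training layout; categories that are absent become 0
aligned = new_dummies.reindex(columns=train_columns, fill_value=0)
print(aligned.values.tolist())
```

A fitted OneHotEncoder does not have this problem, because it remembers the categories it saw during fit.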

The difficulty of training and predicting

# Prepare the training data
x_data = cdf[['수치형_특징', 'A', 'B', 'C']].values
y_data = cdf['그대로_유지']

model = LogisticRegression(max_iter=500)
model.fit(x_data, y_data)

  • Prediction scenario: predict for (numeric: 10, category: A). The model expects four inputs in the order [numeric, A, B, C].

# ❌ model.predict([[10, 'A']]) -> raises an error!

# ✅ We have to build [10, 1, 0, 0] ourselves

# 1. Use the encoder to get the vector for 'A' ([1, 0, 0])
res = oencode.transform(pd.DataFrame({'범주형_특징': ['A']}))[0]

# 2. Join the pieces into one list
my_input = [10]
my_input.extend(res)  # [10, 1.0, 0.0, 0.0]

# 3. Predict
print(model.predict([my_input]))
# [1]

4. Conclusion: why use a pipeline?

Every time new data comes in, you have to fetch the encoder object used during training,
transform the strings into numbers,
and merge the result with the numeric data before handing it to the model. A pipeline handles all of these steps in one go.

What if we use a Pipeline?

# Pipeline example (pipeline is assumed to be built already, e.g. a ColumnTransformer followed by a model)
pipeline.fit(x_train, y_train)

# Predict
# Pass the raw values and it transforms and predicts for you!
pipeline.predict([[10, 'A']])
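
The snippet above leaves out how `pipeline` is built. Here is one possible construction, offered as a sketch rather than the author's exact setup: a ColumnTransformer one-hot encodes the categorical column and passes the numeric column through, followed by LogisticRegression. The step names ('onehot', 'preprocess', 'model') and the choice of handle_unknown='ignore' are assumptions made for this example, and the prediction input is a DataFrame so the columns can be selected by name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    '수치형_특징': [10, 20, 30, 40, 50],
    '범주형_특징': ['A', 'B', 'A', 'C', 'B'],
    '그대로_유지': [1, 0, 1, 0, 1],
})
x_train = df[['수치형_특징', '범주형_특징']]
y_train = df['그대로_유지']

# One-hot encode the categorical column; pass every other column through untouched
preprocess = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['범주형_특징']),
    ],
    remainder='passthrough',
)

pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=500)),
])
pipeline.fit(x_train, y_train)

# Raw values go in; encoding, merging, and prediction all happen inside
new = pd.DataFrame({'수치형_특징': [10], '범주형_특징': ['A']})
pred = pipeline.predict(new)
print(pred)
```

With handle_unknown='ignore', a category that was never seen during fit is encoded as all zeros instead of raising an error, which is often convenient at prediction time.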

Key takeaway:
Encoding by hand is a great way to understand what actually happens during training.
In real projects, though, Pipeline and ColumnTransformer are recommended for the sake of your sanity and concise code!


Summary

  • Label Encoding: OrdinalEncoder (suited to tree-based models)
  • One-Hot Encoding: OneHotEncoder or pd.get_dummies (suited to linear and distance-based models)
  • Combining: np.hstack (joins arrays) or pd.concat (joins DataFrames)
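
As a rough illustration of why the encoding choice matters for distance-based models, compare the distances between categories under each scheme. The dictionaries below are written out by hand for the three categories in this post. They are not produced by any encoder.

```python
import numpy as np

# Hand-written codes for A, B, C under each scheme
ordinal = {'A': np.array([0.0]), 'B': np.array([1.0]), 'C': np.array([2.0])}
onehot = {
    'A': np.array([1.0, 0.0, 0.0]),
    'B': np.array([0.0, 1.0, 0.0]),
    'C': np.array([0.0, 0.0, 1.0]),
}


def dist(codes, a, b):
    """Euclidean distance between the codes of two categories."""
    return float(np.linalg.norm(codes[a] - codes[b]))


ordinal_ab = dist(ordinal, 'A', 'B')
ordinal_ac = dist(ordinal, 'A', 'C')
onehot_ab = dist(onehot, 'A', 'B')
onehot_ac = dist(onehot, 'A', 'C')

print(ordinal_ab, ordinal_ac)  # C looks twice as far from A as B does
print(onehot_ab, onehot_ac)    # every pair is equally far apart
```

Ordinal codes imply an order and spacing that the categories may not really have, while one-hot vectors keep every pair of categories equally distant.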