ML_Feature Scaling, One-hot Encoding

wnsdnl · March 6, 2025

ML Machine Learning

📌 Feature Scaling

Feature Scaling means adjusting the magnitudes of the input variables fed to a machine learning model so that they fall within a fixed range. Feature Scaling lets gradient descent converge faster. Why is that?

📌 Feature Scaling and Gradient Descent

First, a loss function plotted as a 3D surface can be drawn in 2D using contour lines.

All points on a single contour line are at the same height, that is, they have the same loss. One key idea here is that at any given point, the direction of steepest slope is perpendicular to the contour line through that point. (This is the direction we have been computing as the gradient vector.)
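As a quick numerical check of this perpendicularity, the sketch below uses a made-up bowl-shaped loss $J(\theta_0, \theta_1) = \theta_0^2 + 4\theta_1^2$ (an assumption for illustration, not a loss taken from this post). It takes one small step along the gradient and one equally small step perpendicular to it, then compares how much the loss changes.

```python
def loss(theta0, theta1):
    # Hypothetical loss, chosen only for illustration
    return theta0 ** 2 + 4 * theta1 ** 2

def gradient(theta0, theta1):
    # Partial derivatives of the hypothetical loss above
    return (2 * theta0, 8 * theta1)

point = (3.0, 1.0)
grad = gradient(*point)

# Rotate the gradient by 90 degrees to get a perpendicular direction
perpendicular = (-grad[1], grad[0])

step = 1e-4
change_along_gradient = (
    loss(point[0] + step * grad[0], point[1] + step * grad[1]) - loss(*point)
)
change_along_perpendicular = (
    loss(point[0] + step * perpendicular[0], point[1] + step * perpendicular[1])
    - loss(*point)
)

print("change along gradient:     ", change_along_gradient)
print("change along perpendicular:", change_along_perpendicular)
```

The step perpendicular to the gradient leaves the loss almost unchanged, so it runs along the contour line, while the step along the gradient changes the loss far more.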

Suppose we run a linear regression that predicts age from the input variable annual salary, with the hypothesis function $h_{\theta}(x) = \theta_0 + \theta_1 x$.

์—ฐ๋ด‰์€ ๋‹จ์œ„๊ฐ€ ์ฒœ ๋งŒ ๋‹จ์œ„์ด๊ธฐ ๋•Œ๋ฌธ์— ฮธ1\theta_1 ๊ฐ’์ด ์กฐ๊ธˆ๋งŒ ๋ฐ”๋€Œ์–ด๋„ ๊ฐ€์„ค ํ•จ์ˆ˜์˜ ๊ฒฐ๊ณผ๊ฐ’์ด ํฌ๊ฒŒ ๋‹ฌ๋ผ์ง„๋‹ค.
์˜ˆ๋ฅผ ๋“ค์–ด ์—ฐ๋ด‰์ด 3000๋งŒ์›์ผ ๋•Œ๋ฅผ ๋ณด๊ณ  ์žˆ๋‹ค๊ณ  ํ•˜๋ฉด, ฮธ1=1\theta_1 = 1์ผ ๋•Œ๋Š” ฮธ1x\theta_1x๊ฐ€ 3000๋งŒ์›์ธ๋ฐ, ฮธ1=3\theta_1 = 3์ผ ๋•Œ๋Š” ฮธ1x\theta_1x๊ฐ€ 9000๋งŒ์›์ด๋‹ค. ฮธ1\theta_1์ด 1์—์„œ 3์œผ๋กœ ๋ฐ”๋€Œ์—ˆ์„ ๋ฟ์ธ๋ฐ ์˜ˆ์ธก๊ฐ’์€ 6000๋งŒ์› ์ด์ƒ์ด๋‚˜ ์ฐจ์ด๊ฐ€ ๋‚œ๋‹ค. ๋”ฐ๋ผ์„œ ฮธ1\theta_1๊ฐ’์ด ์กฐ๊ธˆ๋งŒ ๋ฐ”๋€Œ์–ด๋„ ์ž…๋ ฅ ๋ณ€์ˆ˜ ์—ฐ๋ด‰์˜ ์ฒœ๋งŒ ๋‹จ์œ„ ์ˆ˜ ๋•Œ๋ฌธ์— ๊ฐ€์„ค ํ•จ์ˆ˜์˜ ์•„์›ƒํ’‹์ด ํฌ๊ฒŒ ์ฐจ์ด๊ฐ€ ๋‚œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์•„์›ƒํ’‹์— ํฐ ์˜ํ–ฅ์„ ์ค€๋‹ค๋Š” ๊ฒƒ์€ ๊ฒฐ๊ตญ ํ‰๊ท  ์ œ๊ณฑ ์˜ค์ฐจ, ํ˜น์€ ์†์‹ค ํ•จ์ˆ˜์—๋„ ํฐ ์˜ํ–ฅ์„ ์ฃผ๊ฒŒ ๋œ๋‹ค.

In contrast, $\theta_0$ is always multiplied by 1, so changing $\theta_0$ does not change the output of the hypothesis function much.
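The difference in sensitivity can be made concrete with a few lines of Python. This is a minimal sketch with made-up salaries (in units of 10,000 won) and ages. It changes $\theta_0$ and $\theta_1$ by the same amount, one at a time, and compares how far the mean squared error moves.

```python
def mse(theta0, theta1, xs, ys):
    """Mean squared error of the hypothesis h(x) = theta0 + theta1 * x."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

salaries = [3000.0, 4500.0, 6000.0]  # hypothetical, in units of 10,000 won
ages = [28.0, 35.0, 47.0]            # hypothetical targets

base = mse(0.0, 0.0, salaries, ages)

# Change each parameter by the same amount (+1), one at a time
delta_theta0 = abs(mse(1.0, 0.0, salaries, ages) - base)
delta_theta1 = abs(mse(0.0, 1.0, salaries, ages) - base)

print("loss change from theta0 + 1:", delta_theta0)
print("loss change from theta1 + 1:", delta_theta1)
```

With inputs in the thousands, the same change in $\theta_1$ moves the loss several orders of magnitude more than it does for $\theta_0$.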

This can be pictured as two contour plots of the loss: one before Feature Scaling (left) and one after (right).

In the left graph, before Feature Scaling, the loss contours change quickly when $\theta_1$ grows even slightly; in other words, the loss is sensitive to $\theta_1$. At many points the steepest direction does not point toward the minimum, so gradient descent zigzags on its way down and takes longer to converge.

In the right graph, after Feature Scaling, $\theta_0$ and $\theta_1$ are multiplied by numbers of similar size (1 and a value between 0 and 1), so they affect the loss to a similar degree. The contours become closer to circles, so from almost any point the steepest direction points roughly toward the minimum. Gradient descent therefore descends more directly and converges faster.
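The effect on convergence can be tried out with a small experiment. The sketch below is a plain-Python batch gradient descent on made-up salary and age data. The data, learning rates, tolerance, and iteration cap are all assumptions chosen for illustration. It runs once on the raw salaries and once on min-max scaled salaries and reports how many iterations each run used.

```python
def gradient_descent(xs, ys, lr, tol=1e-6, max_iters=10_000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent on the MSE.

    Stops when the gradient norm drops below tol or max_iters is reached.
    Returns (theta0, theta1, iterations_used).
    """
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for it in range(max_iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = 2 * sum(errors) / n
        grad1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        if (grad0 ** 2 + grad1 ** 2) ** 0.5 < tol:
            return theta0, theta1, it
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1, max_iters

salaries = [3000.0, 4000.0, 5000.0, 6000.0, 7000.0]  # hypothetical, units of 10,000 won
ages = [30.0, 35.0, 40.0, 45.0, 50.0]                # hypothetical targets

# Raw input: the learning rate has to be tiny, otherwise the theta1 updates diverge
_, _, unscaled_iters = gradient_descent(salaries, ages, lr=1e-8)

# Min-max scaled input: a much larger learning rate is stable
lo, hi = min(salaries), max(salaries)
scaled_salaries = [(x - lo) / (hi - lo) for x in salaries]
theta0, theta1, scaled_iters = gradient_descent(scaled_salaries, ages, lr=0.5)

print("iterations without scaling:", unscaled_iters)
print("iterations with scaling:   ", scaled_iters)
print("fitted parameters on scaled input:", theta0, theta1)
```

On this toy data the unscaled run hits the iteration cap without meeting the tolerance, because the tiny learning rate that keeps $\theta_1$ stable barely moves $\theta_0$. The scaled run stops well before the cap.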

Feature Scaling methods fall broadly into Normalization and Standardization.

📌 Normalization

Normalization rescales values to lie between 0 and 1. Let's look at Min-Max Normalization, the most intuitive normalization method.

๐Ÿ“ Min-Max Normalization

Min-Max Normalization uses the maximum and minimum of the data to rescale it to between 0 and 1. Writing $x_{old}$ for a value before normalization, $x_{new}$ for the value after normalization, $x_{max}$ for the maximum of the data, and $x_{min}$ for the minimum, Min-Max Normalization is defined as follows.

$$x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}}$$

์ด๊ฑธ ์ด์šฉํ•˜๋ฉด ๋ฐ์ดํ„ฐ๋ฅผ 0๊ณผ 1 ์‚ฌ์ด ๋ฒ”์œ„์˜ ์ˆซ์ž๋“ค๋กœ ํฌ๊ธฐ ์กฐ์ • ์ฆ‰, scaling ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค.

Let's try it with scikit-learn.

from sklearn import preprocessing
import pandas as pd
import numpy as np

nba_player_of_the_week_df = pd.read_csv('../data/NBA_player_of_the_week.csv')

# Keep only the numeric columns we want to scale
height_weight_age_df = nba_player_of_the_week_df[['Height CM', 'Weight KG', 'Age']]

# Rescale each column to [0, 1] using that column's own minimum and maximum
scaler = preprocessing.MinMaxScaler()
normalized_data = scaler.fit_transform(height_weight_age_df)  # returns a NumPy array
normalized_df = pd.DataFrame(normalized_data, columns=['Height', 'Weight', 'Age'])

📌 Standardization

์ด nn๊ฐœ์˜ ๋ฐ์ดํ„ฐ x1,x2,x3,โ‹ฏโ€‰,xnx_1, x_2, x_3, \cdots, x_n ์˜ ํ‰๊ท ์„ xห‰\bar{x}์ด๋ผ ํ•˜๋ฉด ํ‘œ์ค€ ํŽธ์ฐจ ฯƒ\sigma๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

$$\sigma = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}}$$

In statistics, the standard deviation describes how far the data are spread from the mean. The more values lie close to the mean, the smaller the standard deviation; the more values lie far from the mean, the larger it is.

Standardization is built on this standard deviation. Its formula is:

$$x_{new} = \frac{x_{old} - \bar{x}}{\sigma}$$

Here $x_{new}$ is a value after standardization and $x_{old}$ is the value before standardization.

After standardization the data always have mean 0 and standard deviation 1 (provided $\sigma \neq 0$). In statistics a standardized value is also called a z-score; it tells how many standard deviations a data point lies from the mean.
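The claim about mean 0 and standard deviation 1 is easy to verify by hand. The sketch below uses only Python's `statistics` module; the function name `standardize` and the sample ages are made up for illustration.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x_old - mean) / sigma, with sigma the population standard deviation."""
    mean = fmean(values)
    sigma = pstdev(values)  # divides by n, matching the formula above
    if sigma == 0:
        raise ValueError("standard deviation is zero, so z-scores are undefined")
    return [(x - mean) / sigma for x in values]

ages = [22.0, 25.0, 28.0, 31.0, 34.0]  # hypothetical sample values
z_scores = standardize(ages)

print(z_scores)
print("mean:", fmean(z_scores), "standard deviation:", pstdev(z_scores))
```

Note the choice of `pstdev` (divide by $n$) rather than `stdev` (divide by $n - 1$). scikit-learn's StandardScaler likewise uses the divide-by-$n$ estimate, whereas pandas' `DataFrame.std()` divides by $n - 1$ by default, so the two can give slightly different numbers on small samples.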

Let's try it with scikit-learn.

from sklearn import preprocessing
import pandas as pd
import numpy as np

nba_player_of_the_week_df = pd.read_csv('../data/NBA_player_of_the_week.csv')

# Keep only the numeric columns we want to scale
height_weight_age_df = nba_player_of_the_week_df[['Height CM', 'Weight KG', 'Age']]

# Rescale each column to mean 0 and standard deviation 1
scaler = preprocessing.StandardScaler()
standardized_data = scaler.fit_transform(height_weight_age_df)  # returns a NumPy array
standardized_df = pd.DataFrame(standardized_data, columns=['Height', 'Weight', 'Age'])

📌 One-hot Encoding

๋จธ์‹  ๋Ÿฌ๋‹์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๋ฐ์ดํ„ฐ๋Š” ํฌ๊ฒŒ 2๊ฐ€์ง€๊ฐ€ ์žˆ๋‹ค.

  • Numerical data: age, weight, height, etc.
  • Categorical data: blood type, sex, etc.

๋งŽ์€ ๋จธ์‹  ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ธํ’‹ ๋ฐ์ดํ„ฐ, ์ฆ‰ ์ž…๋ ฅ ๋ณ€์ˆ˜์˜ ๊ฐ’์ด ์ˆ˜์น˜ํ˜• ๋ฐ์ดํ„ฐ์—ฌ์•ผ ํ•˜๋Š”๋ฐ, ๋งŒ์•ฝ ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ๋ผ๋ฉด ์–ด๋–ป๊ฒŒ ํ• ๊นŒ?

The first idea that comes to mind is simply to label the categories with the numbers 1, 2, 3, and so on. But doing so introduces a notion of larger and smaller among the categories. Applied to blood types, for example, type A would be the smallest because it is 1 and type O the largest because it is 4.
A machine learning model may learn this spurious ordering, so this approach is not a good one.

๋”ฐ๋ผ์„œ One-hot Encoding ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋ฉด ์ข‹๋‹ค. One-hot encoding์€ ๊ฐ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ํ•˜๋‚˜์˜ ์ƒˆ๋กœ์šด ์—ด๋กœ ๋งŒ๋“ค์–ด ์ฃผ๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์•„๋ž˜ ์‚ฌ์ง„์„ ํ†ตํ•ด ๋ณด๋Š” ๊ฒƒ์ด ์ดํ•ด๊ฐ€ ๋น ๋ฅด๋‹ค.

Each category of the blood type data (A, B, O, AB) becomes a column, and for each data point the new columns are filled with 0 or 1 according to its blood type, so that the data point is represented as a one-hot vector.

This prevents an ordering from arising among the categories.
In short, One-hot Encoding converts categorical data into numerical data without introducing a spurious larger/smaller relationship.

Let's practice One-hot Encoding with pandas.

import pandas as pd

titanic_df = pd.read_csv('../data/titanic.csv')
titanic_sex_embarked = titanic_df[['Sex', 'Embarked']]  # extract only the columns to One-hot Encode

one_hot_encoded_df = pd.get_dummies(titanic_sex_embarked, dtype=int)  # perform One-hot Encoding

์œ„์—์„œ๋Š” One-hot Encoding์„ ํ•  ์ปฌ๋Ÿผ์„ ๋”ฐ๋กœ ๋ถ„๋ฆฌํ•œ ๋‹ค์Œ์— ์ง„ํ–‰์„ ํ•˜์˜€๋Š”๋ฐ, ์ „์ฒด ๋ฐ์ดํ„ฐ titanic_df ์—์„œ ๋ฐ”๋กœ One-hot Encoding ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์•„๋ž˜์ฒ˜๋Ÿผ ํ•˜๋ฉด ๋œ๋‹ค.

one_hot_encoded_df = pd.get_dummies(data=titanic_df, columns=['Sex', 'Embarked'])

Source: Codeit
