[Deep Learning] Data Preprocessing

김희진 · April 8, 2021

📖 Reference: 케라스 창시자에게 배우는 딥러닝 (프랑소와 숄레, translated by 박해선, 길벗), the Korean edition of François Chollet's Deep Learning with Python

Data Preprocessing

Vectorization

In a neural network, every input and target must be a tensor of floating-point data. The step that converts the raw data to be processed into tensors is called data vectorization.
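As a minimal sketch of vectorization, the snippet below turns lists of integer word indices into a 2-D float32 tensor by multi-hot encoding. The helper name `vectorize_sequences`, the sample data, and the dimension of 10 are illustrative assumptions, not something taken from this post.

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    """Multi-hot encode lists of integer indices into a float32 matrix."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # set the positions listed in this sample to 1
    return results

# Hypothetical data: each sample is a list of word indices below 10.
samples = [[1, 3, 5], [0, 2], [9]]
x = vectorize_sequences(samples, dimension=10)

# Targets are converted to a float tensor as well.
labels = np.asarray([1, 0, 1]).astype("float32")
```

Each row of `x` now has the same length regardless of how many indices the original sample contained, which is what lets the samples be stacked into one tensor.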

Normalization

When the features of a dataset span very different ranges, training is affected, so each feature should be rescaled to a similar range. Min-max normalization maps each feature to the range [0, 1], while standardization shifts and scales each feature so that its mean is 0 and its standard deviation is 1.

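  • Normalization (min-max scaling)
from sklearn.preprocessing import MinMaxScaler

# X: 2-D array-like of shape (n_samples, n_features)
scaler1 = MinMaxScaler()  # rescales each feature to [0, 1] by default
X_normalization = scaler1.fit_transform(X)
  • Standardization
from sklearn.preprocessing import StandardScaler

scaler2 = StandardScaler()  # rescales each feature to mean 0, standard deviation 1
X_standardization = scaler2.fit_transform(X)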
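To make explicit what the two scalers compute, here is a NumPy-only sketch of both rescalings. The arrays `X_train` and `X_test` are made-up data, and computing the statistics on the training split only (then reusing them on the test split) is a common convention assumed here rather than something stated in this post.

```python
import numpy as np

# Made-up data: two features with very different ranges.
X_train = np.array([[1.0, 100.0],
                    [2.0, 300.0],
                    [3.0, 500.0]])
X_test = np.array([[2.0, 200.0]])

# Min-max normalization: (x - min) / (max - min), computed per feature.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_train_minmax = (X_train - col_min) / (col_max - col_min)
X_test_minmax = (X_test - col_min) / (col_max - col_min)

# Standardization: (x - mean) / std, computed per feature.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std
```

With scikit-learn, the same convention corresponds to calling `fit_transform` on the training data and `transform` on the test data.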

Missing Value

Some values in a dataset are often missing. These missing values have to be handled before training.

DF.isnull() # True wherever a value is missing

DF.isnull().sum(axis=0) # number of missing values in each column (sums down the rows)
DF.isnull().sum(axis=1) # number of missing values in each row (sums across the columns)

DF.dropna(thresh=100, axis=1) # keep only columns with at least 100 non-missing values
DF.dropna(subset=['column_name'], how='any', axis=0) # drop rows where 'column_name' is missing

# Replace missing values with the column mean (int() truncates the mean to a whole number)
DF['column_name'] = DF['column_name'].fillna(int(DF['column_name'].mean()))

# Replace missing values with the most frequent value
most_freq = DF['column_name'].value_counts(dropna=True).idxmax()
DF['column_name'] = DF['column_name'].fillna(most_freq)

# Replace missing values with the previous data point (forward fill)
DF['column_name'] = DF['column_name'].ffill()

# Replace missing values with the next data point (backward fill)
DF['column_name'] = DF['column_name'].bfill()
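
The calls above can be tried end to end on a tiny DataFrame. The column names `a`, `b`, `c` and the values are hypothetical, chosen only so that each strategy gives a visibly different result. Assignment to new variables is used here because each of these methods returns a new object.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in every column.
DF = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "b": ["x", "y", None, "y"],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Count the missing values in each column.
missing_per_column = DF.isnull().sum(axis=0)

# thresh counts NON-missing values: keep columns that have at least 3 of them.
kept = DF.dropna(thresh=3, axis=1)

# Fill the gap in "a" with the mean of the observed values (1, 3, 5).
mean_filled = DF["a"].fillna(DF["a"].mean())

# Fill the gap in "b" with its most frequent value.
mode_filled = DF["b"].fillna(DF["b"].value_counts().idxmax())

# Fill the gap in "a" from the previous row, then from the next row.
forward = DF["a"].ffill()
backward = DF["a"].bfill()
```

Note that forward fill cannot fill a gap in the first row and backward fill cannot fill one in the last row, since there is no neighbor to copy from.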
