python -6. data preprocessing(5)

dbwls2 · November 16, 2023

python


Scaling


📌 1. Scikit-Learn

  • The leading machine learning library for Python

  • Provides a very wide range of preprocessing tools and algorithms, which makes it well suited to learning machine learning techniques

    • Offers a broad set of machine learning algorithms, including classification, regression, clustering, and dimensionality reduction
    • Well-written examples and a thorough user guide make it easy to write code by reference
  • Provides simple and efficient tools for data analysis

    • The simple, intuitive API is accessible to users with varying levels of expertise
    • fit(), transform(), predict(), and related methods form a systematic, consistent framework for analysis and model training
    • Many other packages follow the same conventions as scikit-learn, so they can be used within a similar framework
  • Built on NumPy, SciPy, and matplotlib, and works with Pandas data, so it is easy to use alongside other Python packages

    • NumPy: the basic package for multidimensional arrays
    • Pandas: the basic package for dataframes
    • SciPy: a package collecting functions for scientific computing
    • matplotlib: a package for data visualization
  • Disadvantages
    • Very weak support for deep learning, reinforcement learning, and time series models
    • Poor integration with recently developed dataframe libraries for large data, such as Polars

  • Key features

    • Classification: logistic regression, decision trees, support vector machines (SVM)
    • Regression: linear regression, ridge regression, etc.
    • Clustering: k-means clustering, hierarchical clustering, etc.
    • Dimensionality reduction: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), etc.
    • Preprocessing: data normalization, scaling, encoding, etc.
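
The consistent fit() / transform() / predict() interface mentioned in the list above can be sketched as follows. The dataset (the iris data bundled with scikit-learn) and the choice of StandardScaler and LogisticRegression are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative data: the iris dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# A transformer learns its statistics with fit() and applies them with transform()
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# A predictor learns with fit() and produces labels with predict()
model = LogisticRegression(max_iter=1000)
model.fit(X_scaled, y)
y_pred = model.predict(X_scaled)

print(X_scaled.shape, y_pred.shape)
```

Both objects are driven by the same small set of method names, which is why swapping in a different scaler or model usually changes only the line that constructs it.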

📌 2. Scikit-Learn preprocessing

  • Preprocessing features in Scikit-Learn
    • Scaling: bringing the value ranges of different variables to a consistent level
    • Binarization: splitting continuous values into 0 or 1 (continuous variable -> binary variable)
    • Encoding: converting categorical values into a suitable numeric form (categorical variable -> numeric variable)
    • Transformation: changing the distribution of the data to obtain normality
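
Binarization and encoding are listed above but not demonstrated later in this post, so here is a minimal sketch of both. The values, the threshold of 2.5, and the choice of OneHotEncoder as the encoder are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, OneHotEncoder

# Binarization: values strictly greater than the threshold become 1, the rest 0
lengths = np.array([[1.4], [4.7], [6.0]])            # made-up values
binary = Binarizer(threshold=2.5).fit_transform(lengths)

# Encoding: each category gets its own 0/1 column (categories are sorted)
species = np.array([["setosa"], ["virginica"], ["setosa"]])  # made-up values
encoder = OneHotEncoder()
onehot = encoder.fit_transform(species).toarray()    # sparse result made dense

print(binary.ravel())
print(encoder.categories_[0])
print(onehot)
```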
  1) Scaling
  • Bringing the value ranges of different variables (features) to a consistent level through a linear transformation
    : When the ranges of the independent variables (features) differ,
    their influence on the dependent variable (target) varies greatly with each feature's range -> training becomes less effective
    : Prevents overflow and underflow, where a value is misrepresented because of the computer's limited bit width
    : Scaling is especially important for distance-based models such as k-means
  • Standardization: rescaling toward a standard distribution
    • StandardScaler(): the default scaler; uses the mean and standard deviation
    • RobustScaler(): uses the median and IQR (Q3-Q1); minimizes the influence of outliers
  • Normalization: rescaling to a fixed range (usually [0,1])
    • MinMaxScaler(): scales so that the range becomes [0,1]
    • MaxAbsScaler(): scales all-positive data to [0,1], all-negative data to [-1,0], and mixed-sign data to [-1,1]
  • Transformation: scaling so that the data follow a particular distribution or shape
    • PowerTransformer(): makes the data more Gaussian (Box-Cox or Yeo-Johnson transform)
    • QuantileTransformer(): maps the data to a uniform or normal (Gaussian) distribution
    • Normalizer(): rescales each row so that its feature vector has Euclidean (L2) norm 1
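
The four linear scalers in the list above each reduce to a one-line formula. The sketch below writes the formulas out with NumPy and compares them with the scikit-learn classes; the five-value feature with one outlier is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler
)

# Made-up single feature containing one outlier (100.0)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: (x - mean) / std, using the population std (ddof=0)
standard = (x - x.mean(axis=0)) / x.std(axis=0)

# RobustScaler: (x - median) / IQR, where IQR = Q3 - Q1
q1, q3 = np.percentile(x, [25, 75], axis=0)
robust = (x - np.median(x, axis=0)) / (q3 - q1)

# MinMaxScaler: (x - min) / (max - min)
minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# MaxAbsScaler: x / max(|x|)
maxabs = x / np.abs(x).max(axis=0)

# Each manual result should agree with the corresponding scaler
checks = {
    "standard": np.allclose(standard, StandardScaler().fit_transform(x)),
    "robust": np.allclose(robust, RobustScaler().fit_transform(x)),
    "minmax": np.allclose(minmax, MinMaxScaler().fit_transform(x)),
    "maxabs": np.allclose(maxabs, MaxAbsScaler().fit_transform(x)),
}
print(checks)
print(standard.ravel())
print(robust.ravel())
```

Comparing the last two printed arrays shows the effect of the outlier: the mean and standard deviation are pulled toward 100, squeezing the four ordinary values together, while the median and IQR leave them evenly spaced.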
  2) Scaling procedure
  • Use a scaler object
  • fit(): learn from the given data
    • Computes and stores the reference statistics needed for the transformation
  • transform(): apply the scaler, transforming data with the statistics learned by fit()
  • fit_transform(): run fit and transform in one call
  • Training data: apply both fit() and transform()
  • Evaluation data: apply transform() only
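
The train/evaluation rule above can be sketched as follows. The synthetic data, the 75/25 split, and the use of StandardScaler are assumptions for illustration; the same pattern applies to the other scalers.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 2 features centered near 50
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data, then apply
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std, no refit

print(scaler.mean_, scaler.scale_)              # statistics stored by fit()
print(X_train_scaled.mean(axis=0))              # numerically zero
print(X_test_scaled.mean(axis=0))               # close to zero, but not exactly
```

Calling fit() on the evaluation data would let information from that data leak into the preprocessing step, which is why only transform() is applied to it.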
  3) Standardization
  • Models such as support vector machines (SVM) with an RBF (Radial Basis Function) kernel and linear regression are commonly described as assuming normally distributed data; in practice they work best when features are centered near zero with comparable variance
  • Sensitive to outliers, and more useful for regression than for classification

📋 Basis functions and kernels

1) Basis functions
: Nonlinear data are a poor fit for a linear regression model. Building a nonlinear model that fits the data requires coming up with a nonlinear function suited to the data. The basis function model was created for this purpose.
1. Polynomial basis function
: A global function; because it covers a single region, changing one data point affects the entire region.

Example function for testing

import numpy as np
import matplotlib.pyplot as plt

# Target function: two periods of a sine wave on [-1, 1]
num = 100
X = np.linspace(-1, 1, num).reshape(num, 1)
Y = np.sin(2 * np.pi * X)
plt.plot(X, Y, 'g-')
plt.show()

# Polynomial basis functions x**0 ... x**M on the same grid
M = 9
for i in range(M + 1):
    y = X ** i
    plt.plot(X, y)
plt.title('polynomial curve fitting')
plt.show()


  • #loss : 24.098
  2. Gaussian radial basis function (RBF)
    : μ_i governs the locations of the basis functions in input space
    : As with splines, adjusting μ_i lets a separate basis function be used for each interval.
# Gaussian basis functions with width 0.1, evenly spaced on [0, 1]
num = 100
X = np.linspace(0, 1, num).reshape(num, 1)

M = 9
for interval in range(2, M + 1):
    # interval = number of basis functions; centers are j / (interval - 1)
    for j in range(interval):
        y = np.exp(-(X - j / (interval - 1)) ** 2 / (2 * 0.1 ** 2))
        plt.plot(X, y)
plt.title("Radial basis function")


  • #loss : 10.192
  • The basis functions resemble the original function, and increasing the model complexity reduces the loss
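
The post does not show how the loss values were computed, so the following is only a sketch of one common way to fit a basis function model: build a design matrix from the basis functions and solve a least-squares problem. The helper names (polynomial_design, gaussian_design, fit_mse), the mean-squared-error loss, and the numbers of centers are all hypothetical choices, and the results will not reproduce the loss figures quoted above.

```python
import numpy as np

def polynomial_design(x, degree):
    """Columns x**0, x**1, ..., x**degree."""
    return np.hstack([x ** i for i in range(degree + 1)])

def gaussian_design(x, centers, width):
    """A bias column followed by one Gaussian bump per center."""
    bumps = np.exp(-(x - centers.reshape(1, -1)) ** 2 / (2 * width ** 2))
    return np.hstack([np.ones_like(x), bumps])

def fit_mse(design, y):
    """Least-squares fit of y on the design matrix; returns the mean squared error."""
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.mean((design @ weights - y) ** 2))

# Same target as the example function: sin(2*pi*x) on [-1, 1]
x = np.linspace(-1, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * x)

mse_poly = fit_mse(polynomial_design(x, degree=9), y)
mse_rbf_few = fit_mse(gaussian_design(x, np.linspace(-1, 1, 5), width=0.1), y)
mse_rbf_many = fit_mse(gaussian_design(x, np.linspace(-1, 1, 30), width=0.1), y)

print(f"polynomial (degree 9): {mse_poly:.6f}")
print(f"Gaussian RBF, 5 centers: {mse_rbf_few:.6f}")
print(f"Gaussian RBF, 30 centers: {mse_rbf_many:.6f}")
```

With only 5 centers the Gaussian bumps sit on the zeros of the sine wave and capture little of it; adding centers lets the local bumps follow the curve, which is the "more complexity, less loss" behavior described above.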

2) Kernel
: Mapping data to a higher dimension, finding the support vectors there, and reducing back to a lower dimension is complex and computationally expensive, so the kernel trick is used.

  • Kernel trick: mapping low-dimensional data that are not linearly separable into a higher dimension where they can be separated linearly
  • The high-dimensional mapping and the inner product in that space are computed in a single step.
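
A small worked example may make the "single step" point concrete. For the degree-2 polynomial kernel k(a, b) = (a·b)^2 on 2-D inputs, the explicit feature map is φ(x) = (x1^2, √2·x1·x2, x2^2), and the kernel value equals the inner product of the mapped vectors. The function names and the two vectors below are made up for illustration; an RBF kernel works on the same principle but has an infinite-dimensional feature map.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return float(np.dot(a, b) ** 2)

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = float(np.dot(phi(a), phi(b)))  # map to 3-D first, then take the inner product
implicit = poly_kernel(a, b)              # same value, computed in the original 2-D space

print(explicit, implicit)
```

Both numbers agree, but the kernel version never constructs the higher-dimensional vectors.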


  • Standardization example in Python
import pandas as pd
import seaborn as sns

# Display floats rounded to 4 decimal places
pd.set_option("display.float_format", lambda x: f'{x:.4f}')

# Load the iris data
iris = sns.load_dataset('iris')

# Keep only the numeric variables of iris
iris = iris.select_dtypes(include = 'number')

# Check the descriptive statistics of iris
iris.describe()

# Joint plot of petal_length and petal_width
sns.jointplot(data = iris, x = 'petal_length', y= 'petal_width', kind = 'reg')

  • Standardizing
from sklearn.preprocessing import StandardScaler, RobustScaler

# Create the scaler objects
standard_scaler = StandardScaler()
robust_scaler   = RobustScaler()

# Transform the data
iris_standard = pd.DataFrame(standard_scaler.fit_transform(iris), columns = iris.columns)
iris_robust   = pd.DataFrame(robust_scaler.fit_transform(iris), columns = iris.columns)

# Print the results
print("Standard Scaled : \n", iris_standard.describe())
print()
print("Robust Scaled : \n", iris_robust.describe())
Standard Scaled : 
        sepal_length  sepal_width  petal_length  petal_width
count      150.0000     150.0000      150.0000     150.0000
mean        -0.0000      -0.0000       -0.0000      -0.0000
std          1.0034       1.0034        1.0034       1.0034
min         -1.8700      -2.4339       -1.5676      -1.4471
25%         -0.9007      -0.5924       -1.2266      -1.1838
50%         -0.0525      -0.1320        0.3365       0.1325
75%          0.6745       0.5586        0.7628       0.7907
max          2.4920       3.0908        1.7858       1.7121

Robust Scaled : 
        sepal_length  sepal_width  petal_length  petal_width
count      150.0000     150.0000      150.0000     150.0000
mean         0.0333       0.1147       -0.1691      -0.0671
std          0.6370       0.8717        0.5044       0.5082
min         -1.1538      -2.0000       -0.9571      -0.8000
25%         -0.5385      -0.4000       -0.7857      -0.6667
50%          0.0000       0.0000        0.0000       0.0000
75%          0.4615       0.6000        0.2143       0.3333
max          1.6154       2.8000        0.7286       0.8000
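
The std row of the Standard Scaled output shows 1.0034 rather than exactly 1. This is expected: StandardScaler divides by the population standard deviation (ddof=0), while pandas describe() reports the sample standard deviation (ddof=1), and the two differ by a factor of sqrt(n / (n - 1)). The sketch below reproduces the ratio with NumPy on random data of the same length as iris; the random data are an assumption, since the ratio depends only on n.

```python
import numpy as np

n = 150                                   # number of rows in iris
rng = np.random.default_rng(0)
x = rng.normal(size=n)                    # any data with n samples gives the same ratio

z = (x - x.mean()) / x.std(ddof=0)        # standardize with the population std

std_population = z.std(ddof=0)            # what the scaler normalizes to 1
std_sample = z.std(ddof=1)                # what describe() reports
ratio = np.sqrt(n / (n - 1))

print(f"{std_population:.4f} {std_sample:.4f} {ratio:.4f}")
```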
  • seaborn's jointplot is difficult to place in subplots
  • The patchworklib library is used to draw the subplots
#pip install patchworklib

import seaborn as sns
import patchworklib as pw
pw.overwrite_axisgrid()

g1 = sns.jointplot(data = iris_standard, x = "petal_length", y = "petal_width", kind = "reg")
g1 = pw.load_seaborngrid(g1)
g1.set_suptitle("Standard Scaled")

g2 = sns.jointplot(data = iris_robust, x = "petal_length", y = "petal_width", kind = "reg")
g2 = pw.load_seaborngrid(g2)
g2.set_suptitle("Robust Scaled")

g3 = (g1|g2)
g3

  4) Normalization
  • MinMaxScaler(): scales so that the range becomes [0,1]

  • MaxAbsScaler(): scales all-positive data to [0,1], all-negative data to [-1,0], and mixed-sign data to [-1,1]

  • Normalization example in Python

from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

# Create the scaler objects
minmax_scaler = MinMaxScaler()
maxabs_scaler = MaxAbsScaler()

# Transform the data
iris_minmax = pd.DataFrame(minmax_scaler.fit_transform(iris), columns=iris.columns)
iris_maxabs = pd.DataFrame(maxabs_scaler.fit_transform(iris), columns=iris.columns)

# Print the results
print("MinMax Scaled : \n", iris_minmax.describe())
print()
print("MaxAbs Scaled : \n", iris_maxabs.describe())
MinMax Scaled : 
        sepal_length  sepal_width  petal_length  petal_width
count      150.0000     150.0000      150.0000     150.0000
mean         0.4287       0.4406        0.4675       0.4581
std          0.2300       0.1816        0.2992       0.3176
min          0.0000       0.0000        0.0000       0.0000
25%          0.2222       0.3333        0.1017       0.0833
50%          0.4167       0.4167        0.5678       0.5000
75%          0.5833       0.5417        0.6949       0.7083
max          1.0000       1.0000        1.0000       1.0000

MaxAbs Scaled : 
        sepal_length  sepal_width  petal_length  petal_width
count      150.0000     150.0000      150.0000     150.0000
mean         0.7397       0.6948        0.5446       0.4797
std          0.1048       0.0991        0.2558       0.3049
min          0.5443       0.4545        0.1449       0.0400
25%          0.6456       0.6364        0.2319       0.1200
50%          0.7342       0.6818        0.6304       0.5200
75%          0.8101       0.7500        0.7391       0.7200
max          1.0000       1.0000        1.0000       1.0000
  • Plotting
g4 = sns.jointplot(data = iris_minmax, x = "petal_length", y = "petal_width", kind = "reg")
g4 = pw.load_seaborngrid(g4)
g4.set_suptitle("MinMax Scaled")

g5 = sns.jointplot(data = iris_maxabs, x = "petal_length", y = "petal_width", kind = "reg")
g5 = pw.load_seaborngrid(g5)
g5.set_suptitle("MaxAbs Scaled")

g6 = (g4|g5)
g6

5) Transformation

  • PowerTransformer(): makes the data more Gaussian (Box-Cox or Yeo-Johnson transform)
  • QuantileTransformer(): maps the data to a uniform or normal (Gaussian) distribution
  • Normalizer(): rescales each row so that its feature vector has Euclidean (L2) norm 1
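
PowerTransformer() uses the Yeo-Johnson method by default, and the Box-Cox method has to be requested explicitly and accepts only strictly positive data. The sketch below illustrates the difference on synthetic log-normal data, which is an assumption for illustration and not part of the original post.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
skewed = rng.lognormal(size=(200, 1))     # synthetic, strictly positive, right-skewed
shifted = skewed - 1.0                    # same shape, but now contains negative values

yeo = PowerTransformer()                  # default method is 'yeo-johnson'
box = PowerTransformer(method="box-cox")

yeo_out = yeo.fit_transform(shifted)      # Yeo-Johnson accepts negative values
box_out = box.fit_transform(skewed)       # Box-Cox needs strictly positive input

# Box-Cox rejects data that are not strictly positive
try:
    box.fit_transform(shifted)
    boxcox_failed = False
except ValueError:
    boxcox_failed = True

print(yeo.method, boxcox_failed)
print(box_out.mean(), box_out.std())      # output is standardized by default
```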
import numpy as np
from sklearn.preprocessing import PowerTransformer, Normalizer

# Create the scaler objects
power_scaler  = PowerTransformer()
normal_scaler = Normalizer()

# Transform the data
iris_pow  = pd.DataFrame(power_scaler.fit_transform(iris), columns=iris.columns)
iris_norm = pd.DataFrame(normal_scaler.fit_transform(iris), columns=iris.columns)

# Print the results
print("Power Scaled : \n", iris_pow.describe())
print()
print("Normalizer Scaled : \n", iris_norm.describe())
# Check that the vector length of each row is 1
print("Euclidean Distance from 0 : \n", np.linalg.norm(iris_norm, axis = 1))
Power Scaled : 
        sepal_length  sepal_width  petal_length  petal_width
count      150.0000     150.0000      150.0000     150.0000
mean         0.0000      -0.0000       -0.0000       0.0000
std          1.0034       1.0034        1.0034       1.0034
min         -2.1378      -2.7591       -1.5456      -1.4768
25%         -0.8957      -0.5615       -1.2244      -1.1896
50%          0.0264      -0.0819        0.3226       0.1597
75%          0.7222       0.5959        0.7598       0.7965
max          2.1770       2.7432        1.8288       1.6585

Normalizer Scaled : 
        sepal_length  sepal_width  petal_length  petal_width
count      150.0000     150.0000      150.0000     150.0000
mean         0.7514       0.4052        0.4548       0.1411
std          0.0444       0.1056        0.1600       0.0780
min          0.6539       0.2384        0.1678       0.0147
25%          0.7153       0.3267        0.2509       0.0487
50%          0.7549       0.3544        0.5364       0.1641
75%          0.7869       0.5276        0.5800       0.1975
max          0.8609       0.6071        0.6370       0.2804
Euclidean Distance from 0 : 
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
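
Unlike the other scalers, which work column by column, Normalizer() works row by row: each sample is divided by its own L2 norm, which is why every row in the output above has length 1. A minimal sketch with made-up rows:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up samples; each row is one observation
X = np.array([[3.0, 4.0], [1.0, 0.0], [5.0, 12.0]])

# Manual version: divide each row by its Euclidean (L2) norm
manual = X / np.linalg.norm(X, axis=1, keepdims=True)

# scikit-learn version; Normalizer learns nothing in fit()
sk = Normalizer(norm="l2").fit_transform(X)

print(manual)
print(np.linalg.norm(sk, axis=1))
```

Because each row is rescaled independently, the result for one sample does not depend on any other sample, so there is no train/evaluation distinction for this transformer.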
  • Plotting
g7 = sns.jointplot(data = iris_pow, x = "petal_length", y = "petal_width", kind = "reg")
g7 = pw.load_seaborngrid(g7)
g7.set_suptitle("PowerTransformer Scaled")

g8 = sns.jointplot(data = iris_norm, x = "petal_length", y = "petal_width", kind = "reg")
g8 = pw.load_seaborngrid(g8)
g8.set_suptitle("Normalizer Scaled")

g9 = (g7|g8)
g9

from sklearn.preprocessing import QuantileTransformer

# Create the scaler objects
gaussian_scaler = QuantileTransformer(output_distribution = 'normal')
uniform_scaler  = QuantileTransformer(output_distribution = 'uniform')

# Transform the data
iris_gaussian = pd.DataFrame(gaussian_scaler.fit_transform(iris), columns = iris.columns)
iris_uniform  = pd.DataFrame(uniform_scaler.fit_transform(iris), columns = iris.columns)

# Print the results
print("QuantileTransformer_Gaussian Scaled : \n", iris_gaussian.describe())
print()
print("QuantileTransformer_Uniform Scaled : \n", iris_uniform.describe())
 QuantileTransformer_Gaussian Scaled : 
        sepal_length  sepal_width  petal_length  petal_width
count      150.0000     150.0000      150.0000     150.0000
mean        -0.0012       0.0014        0.0021      -0.0339
std          1.1311       1.1328        1.1331       1.4616
min         -5.1993      -5.1993       -5.1993      -5.1993
25%         -0.7011      -0.6175       -0.6175      -0.6798
50%          0.0252      -0.0842        0.0084      -0.0589
75%          0.6587       0.6277        0.6692       0.6277
max          5.1993       5.1993        5.1993       5.1993

QuantileTransformer_Uniform Scaled : 
        sepal_length  sepal_width  petal_length  petal_width
count      150.0000     150.0000      150.0000     150.0000
mean         0.5002       0.5002        0.5004       0.5001
std          0.2914       0.2900        0.2914       0.2912
min          0.0000       0.0000        0.0000       0.0000
25%          0.2416       0.2685        0.2685       0.2483
50%          0.5101       0.4664        0.5034       0.4765
75%          0.7450       0.7349        0.7483       0.7349
max          1.0000       1.0000        1.0000       1.0000
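
Two details of the output above are worth a short sketch. First, the default n_quantiles (1000) is larger than the 150 iris rows, so scikit-learn warns and falls back to the sample count; passing n_quantiles explicitly avoids the warning. Second, the ±5.1993 extremes in the Gaussian output come from clipping: the smallest and largest samples would otherwise map to minus and plus infinity. The synthetic 150-row column below is an assumption standing in for one iris column.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic column with 150 rows, the same length as iris
x = np.arange(150, dtype=float).reshape(-1, 1)

# n_quantiles set to the sample count, so no fallback warning is raised
gaussian = QuantileTransformer(n_quantiles=len(x), output_distribution="normal")
uniform = QuantileTransformer(n_quantiles=len(x), output_distribution="uniform")

g = gaussian.fit_transform(x)
u = uniform.fit_transform(x)

print(g.min(), g.max())   # extremes are clipped to a finite value
print(u.min(), u.max())   # uniform output spans [0, 1]
```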
  • Plotting
g10 = sns.jointplot(data = iris_gaussian, x = "petal_length", y = "petal_width", kind = "reg")
g10 = pw.load_seaborngrid(g10)
g10.set_suptitle("QuantileTransformer_Gaussian Scaled")

g11 = sns.jointplot(data = iris_uniform, x = "petal_length", y = "petal_width", kind = "reg")
g11 = pw.load_seaborngrid(g11)
g11.set_suptitle("QuantileTransformer_Uniform Scaled")

g12 = (g10|g11)
g12

  • Combining the plots
(g1|g2|g4|g5)/(g7|g8|g10|g11)
