📒 Spark(10) - ML(1)

Kimdongki · June 21, 2024

📌 Spark ML

Introduction to Spark ML

  • A library of machine learning algorithms and utilities.

    • Classification, Regression, Clustering, Collaborative Filtering, Dimensionality Reduction - see reference
    • Support for deep learning is still limited.
  • There are two versions: one RDD-based and one DataFrame-based.

    • spark.mllib vs. spark.ml
      • spark.mllib -> RDD-based
      • spark.ml -> DataFrame-based
    • spark.mllib is the older library that runs on RDDs. It is in maintenance mode and no longer receives new features.
    • Prefer spark.ml.
      -> import pyspark.ml

Advantages of Spark ML

  • Preprocessing is done with DataFrames and Spark SQL.
  • Models are built with Spark MLlib.
  • Model building is automated through ML Pipeline.
  • Models are managed and served with MLflow.
  • Large-scale data can be processed as well.

MLflow

  • Ops-related features for managing and serving models are also available.
  • MLflow
    • An end-to-end framework that covers model development, testing, management, and serving.
    • MLflow provides Python, Java, and R APIs.
    • MLflow supports Tracking, Models, and Projects.

Algorithms Provided by Spark ML

  • Classification
    -> Logistic regression, Decision tree, Random forest, Gradient-boosted tree, ...
    -> A supervised learning technique: it takes a labeled dataset and learns from it how to assign labels to new data. It handles classification problems such as Yes/No.

  • Collaborative Filtering

    • A technique for generating recommendations. The "You might also like..." suggestions commonly shown on shopping websites are a typical example.
    • It processes a large number of observations to find entities with similar traits or characteristics, then makes recommendations or suggestions for newly observed entities based on earlier observations.
    • Unlike classification, it is an unsupervised learning technique, meaning it can derive patterns from data without labels.
  • Clustering
    -> K-means, LDA, GMM...

    • The process of discovering structure within a collection of observations.
    • It is especially useful when the format and structure of the data are unclear.
    • It discovers groups that occur naturally in the given data.

📌 Basic Structure of Model Building

  • It is not much different from building a model with any other library.

    • Preprocess the training dataset
    • Build the model
    • Validate the model (confusion matrix)
  • Advantages compared to Scikit-Learn

    • Data size

      • Scikit-Learn builds models that run on a single machine.
      • Spark MLlib builds models across multiple servers.
    • When the training set is large, Spark has a big advantage in both preprocessing and model building.

    • Spark's ML Pipeline makes it easier to iterate on model development.
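
The validation step listed above mentions a confusion matrix. As a minimal plain-Python sketch of what that matrix counts for a binary classifier (the labels and predictions are made-up illustration data, and `confusion_matrix` is a hypothetical helper, not a Spark API):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary classifier."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    # Counter returns 0 for cells that never occurred.
    return {cell: counts[cell] for cell in ("TP", "FP", "FN", "TN")}

# Hypothetical ground-truth labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
accuracy = (cm["TP"] + cm["TN"]) / len(y_true)
precision = cm["TP"] / (cm["TP"] + cm["FP"])
recall = cm["TP"] / (cm["TP"] + cm["FN"])
print(cm, accuracy, precision, recall)
```

Accuracy, precision, and recall all fall out of the same four counts, which is why the matrix is a common starting point for validation.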

📌 Spark ML Feature Transformation

Feature Extractor & Transformer

  • This refers to converting feature values into a form suitable for model training.

  • What a Feature Transformer does

    • First, feature values must be numeric fields.
      -> Text fields (categorical values) must be converted to numeric fields.
    • Standardize the range of numeric field values.
      • Even for a numeric field, the range of possible values should be converted to a fixed range (0~1).
      • This is called Feature Scaling & Normalization.
    • How should empty fields be filled?
      -> There are several options: the mean, the maximum, the minimum, and so on.
  • What a Feature Extractor does

    • Extracts new features from existing features.
    • TF-IDF, Word2Vec, ...
      -> In many cases, this means encoding text data into some numeric form.
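
To make the TF-IDF idea above concrete, here is a small plain-Python sketch. The toy corpus and the `tf_idf` helper are hypothetical, and the smoothed IDF formula log((N + 1) / (df + 1)) is assumed to mirror the one documented for Spark's IDF. It illustrates the arithmetic only and is not Spark's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [
    ["spark", "ml", "pipeline"],
    ["spark", "sql", "dataframe"],
    ["ml", "model", "spark"],
]

def tf_idf(docs):
    n_docs = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {term: math.log((n_docs + 1) / (d + 1)) for term, d in df.items()}
    # Term frequency is the raw count of the term within one document.
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]

weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

A term that appears in every document ("spark" here) gets weight 0, while a term unique to one document ("pipeline") gets the largest weight.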

Feature Transformer - StringIndexer

  • Converts text categories to numbers.
  • If there is a feature named Color with values like those in the left column below, the purpose of the Feature Transformer is to convert them to numbers. (The numbers in the table are illustrative. Spark's StringIndexer assigns indices starting at 0, ordered by descending label frequency by default.)

    Before    After
    Red       1
    Blue      2
    Orange    3
    White     4
    Black     5
    Gray      6
    Yellow    7

  • Scikit-Learn provides several encoders in the sklearn.preprocessing module.
    -> OneHotEncoder, LabelEncoder, OrdinalEncoder, ...

  • Spark MLlib provides two encoders in the pyspark.ml.feature module.

    • StringIndexer, OneHotEncoder
    • Usage: create the indexer model (fit), then convert the DataFrame with the indexer model (transform).
from pyspark.ml.feature import StringIndexer

# Fit: learn the label-to-index mapping from the 'Gender' column
gender_indexer = StringIndexer(inputCol='Gender', outputCol='GenderIndexed')
gender_indexer_model = gender_indexer.fit(final_data)

# Transform: add the 'GenderIndexed' column to the DataFrame
final_data_with_transformed_gender = gender_indexer_model.transform(final_data)
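
As a rough mental model of the fit/transform split, the following plain-Python sketch mimics what a string indexer does under the default frequency-descending ordering. The helper names and the sample column are hypothetical, and the alphabetical tie-break is an assumption about Spark's behavior, so treat this as an illustration, not as Spark's implementation.

```python
from collections import Counter

def fit_string_indexer(values):
    """'Fit': map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (assumed behavior).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}

def transform(values, mapping):
    """'Transform': replace every label with its learned index."""
    return [mapping[value] for value in values]

# Hypothetical 'Gender' column.
genders = ["Male", "Female", "Female", "Male", "Female"]

mapping = fit_string_indexer(genders)
indexed = transform(genders, mapping)
print(mapping)
print(indexed)
```

The mapping is learned once during fit and can then be reused on other data, which is the same reason Spark separates the indexer from the fitted indexer model.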

Feature Transformer - Scaler

  • Standardizes the range of numeric field values.
  • It converts the range of a numeric field to a fixed range (0~1).
  • This is called Feature Scaling & Normalization.

    Before    After
    -20       0
    100       1
    40        0.5
    25        0.375
    15        0.292

  • Scikit-Learn's sklearn.preprocessing module includes, among others, these two scalers.
    -> StandardScaler, MinMaxScaler

  • Spark MLlib's pyspark.ml.feature module includes, among others, these two scalers.

    • StandardScaler, MinMaxScaler
    • Usage: create the scaler model (fit), then convert the DataFrame with the scaler model (transform).
  • StandardScaler
    -> Subtracts the mean from each value and divides the result by the standard deviation. Use it when the values follow a normal distribution. Note that Spark's StandardScaler only divides by the standard deviation by default; set withMean=True to also subtract the mean.

  • MinMaxScaler
    -> Scales all values to between 0 and 1. It subtracts the minimum from each value and divides by (maximum - minimum).
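
The two formulas above can be checked by hand. This plain-Python sketch applies them to the input values from the scaler table earlier in this section. The helper names are hypothetical, and using the sample standard deviation is an assumption made to match how Spark's StandardScaler is documented.

```python
from statistics import mean, stdev

# "Before" values from the scaler table.
values = [-20, 100, 40, 25, 15]

def min_max_scale(xs):
    """(x - min) / (max - min); assumes the values are not all equal."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standard_scale(xs):
    """(x - mean) / std, using the sample standard deviation."""
    mu, sigma = mean(xs), stdev(xs)
    return [(x - mu) / sigma for x in xs]

scaled = min_max_scale(values)
standardized = standard_scale(values)
print([round(v, 3) for v in scaled])
print([round(v, 3) for v in standardized])
```

After min-max scaling the smallest value maps to 0 and the largest to 1. After standardization the values have mean 0 and standard deviation 1, but they are not confined to a fixed range.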

Feature Transformer - Imputer

  • Fills in fields that have missing values.
  • When a field has records with no value, a default value is chosen and filled in. This is called imputing.

    Before      After
    10          10
    (missing)   25
    20          20
    30          30
    40          40

  • Scikit-Learn provides sklearn.impute.SimpleImputer. (The older sklearn.preprocessing.Imputer was removed in version 0.22.)

  • Spark MLlib provides its imputer in the pyspark.ml.feature module.

    • Imputer
    • Usage: create the imputer model (fit), then convert the DataFrame with the imputer model (transform).
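from pyspark.ml.feature import Imputer

# Fit: compute the mean of 'Age' over the non-missing records
imputer = Imputer(strategy='mean', inputCols=['Age'], outputCols=['AgeImputed'])
imputer_model = imputer.fit(final_data)
# Transform: add 'AgeImputed', with missing values replaced by that mean
final_data_age_transformed = imputer_model.transform(final_data)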
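
For intuition, mean imputation can be sketched in a few lines of plain Python. This reuses the values from the imputer table earlier in this section, with None standing in for the missing record. `impute_mean` is a hypothetical helper, not part of Spark.

```python
from statistics import mean

def impute_mean(xs):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in xs if x is not None]
    fill_value = mean(observed)
    return [fill_value if x is None else x for x in xs]

# "Before" column from the imputer table; None marks the missing record.
ages = [10, None, 20, 30, 40]

imputed = impute_mean(ages)
print(imputed)
```

The fill value is computed only from the records that have a value, so the missing entry does not distort the mean.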
