Machine Learning - Decision Tree

화이티 · December 18, 2023
1. Problem definition
- Decide which problem to solve with machine learning
- Here: distinguish edible mushrooms from poisonous ones

2. Data collection
- Collect the data needed to solve the defined problem
- Download it from a website, find it by crawling, or fetch it from a DB

3. Data preprocessing
- Check the data size
- Check for missing values
- Split into questions (features) and answers (labels)
- Check summary statistics
- Convert values to numbers
 > Label encoding, one-hot encoding (widely used)

4. Exploratory data analysis (optional)
- Look at the data in more detail
- Use statistical techniques
- Draw graphs

5. Model selection and hyperparameter tuning
- Model selection: choose a model that fits the goal and the data
- Hyperparameter tuning: adjust the model so that it fits well
- train_test_split > run after data preprocessing is complete > step 5 starts once preprocessing is done
- Load the DecisionTree model

6. Training
- Train the model created in step 5 on the preprocessed data

7. Prediction and evaluation
- Evaluate the model's performance
- Predict on the test data

Encoding: from character → to number

→ One-hot encoding separates a column such as embarked into one column per category, each holding just 0 or 1

The weakness is that it needs many columns
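As a minimal sketch of the note above, the snippet below one-hot encodes a toy `embarked` column. The values `S`, `C`, `Q` and the frame itself are made up for illustration and are not taken from the post's data.

```python
import pandas as pd

# Hypothetical toy column; the values S, C, Q are illustrative only.
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# Each distinct category becomes its own column holding 0 or 1.
one_hot = pd.get_dummies(df, columns=["embarked"], dtype=int)

print(one_hot)
print(list(one_hot.columns))
```

One column with three categories turns into three columns, which is the "many columns" weakness mentioned above.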

  1. Import library
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # split into training and test sets
from sklearn.metrics import accuracy_score  # measure accuracy during evaluation
from sklearn.tree import DecisionTreeClassifier  # decision tree model
  2. Data collection
  • Collect the data needed to solve the defined problem
  • Download it from a website, find it by crawling, or fetch it from a DB
# Load the data
data = pd.read_csv('./data/mushroom.csv')
data
  3. Data preprocessing
  • Inspect the data, identify outliers and missing values, then fix them
  • Convert the data into a form that is suitable for modeling
# Check the size of the data
data.shape
# Check for missing values and column types
# Missing values > delete or fill
# Types > convert to numeric (strings are not allowed)
data.info()
# Split into question (feature) and answer (label) data
y = data['poisonous']  # answers (labels)
x = data.iloc[:, 1:].copy()  # questions (features); copy so later edits do not touch data
print(x)
print(y)
print(x.shape)
print(y.shape)
# Check summary statistics
# Mean, standard deviation, quartiles > available for numeric values only
data.describe()
# Count each answer label
# e (edible): edible mushroom
# p (poisonous): poisonous mushroom
y.value_counts()
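The comment above mentions that missing values are either deleted or filled. A minimal sketch of both options follows, on a made-up toy frame. The column names and values are hypothetical, and filling with the most frequent value is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "cap_color": ["n", "y", np.nan, "n"],
    "odor": ["a", np.nan, "a", "l"],
})

# Count missing values per column.
missing_counts = toy.isnull().sum()

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2: fill each column with its most frequent value (the mode).
filled = toy.fillna(toy.mode().iloc[0])

print(missing_counts)
print(dropped)
print(filled)
```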

Converting values to numbers: Label encoding

  • Maps each category to a plain numeric value
  • Because the numbers carry an order (larger or smaller), prediction performance can drop in some cases
x['habitat'].unique()
# Create a dictionary that maps each value to a number
habitat_dic = {
    'u':0,
    'g':1,
    'm':2,
    'd':3,
    'p':4,
    'w':5,
    'l':6
}
x['habitat']=x['habitat'].map(habitat_dic)
x['habitat']
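Writing the dictionary by hand works for a few categories. As a sketch of an alternative, the mapping can be built from the values that actually occur. The toy series below is a hypothetical stand-in for `x['habitat']`, and the codes follow the order of first appearance rather than the hand-written order above.

```python
import pandas as pd

# Hypothetical toy column standing in for x['habitat'].
habitat = pd.Series(["u", "g", "m", "g", "d", "u"])

# Build the mapping from the observed values, in order of first appearance.
mapping = {value: code for code, value in enumerate(habitat.unique())}
encoded = habitat.map(mapping)

# A value that is missing from the dictionary would map to NaN, so check for that.
all_mapped = bool(encoded.notna().all())

print(mapping)
print(encoded.tolist())
```

The `notna()` check is useful with a hand-written dictionary as well, since a forgotten category silently becomes NaN after `map`.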

One-hot encoding

  • The most basic way to represent words (categories)
  • Splits a feature into finer-grained columns, one per category
  • The required space keeps growing, so it is inefficient in terms of storage
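# Use the get_dummies() function
# Only string columns are encoded; 'habitat' is already numeric after label encoding and stays as-is
x_one_hot = pd.get_dummies(x)
x_one_hot.head()
print("Original features:", list(x.columns))
print("Features after one-hot encoding:", list(x_one_hot.columns))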
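To make the storage point concrete, here is a small sketch that compares the number of columns before and after encoding. The toy frame and its column names are hypothetical. The column count after encoding equals the total number of distinct values across the encoded columns.

```python
import pandas as pd

# Hypothetical toy frame with two categorical features.
toy = pd.DataFrame({
    "cap_shape": ["x", "b", "x", "f"],
    "odor": ["a", "l", "a", "n"],
})

toy_one_hot = pd.get_dummies(toy)

n_before = toy.shape[1]
n_after = toy_one_hot.shape[1]
# One new column per distinct value in each original column.
expected = int(toy.nunique().sum())

print(n_before, n_after, expected)
```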
  4. Exploratory data analysis
  • Look at the data in more detail
  • Use statistical techniques
  • Draw graphs
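The post lists this step without code. One possible sketch, under the assumption that a simple cross-tabulation is enough, counts how often each feature value appears with each label. The toy rows below are made up and do not come from mushroom.csv.

```python
import pandas as pd

# Hypothetical toy rows; 'e' = edible, 'p' = poisonous.
toy = pd.DataFrame({
    "poisonous": ["p", "e", "e", "p", "e", "p"],
    "odor":      ["f", "a", "a", "f", "n", "n"],
})

# Counts of each label per odor value.
table = pd.crosstab(toy["odor"], toy["poisonous"])

# The same table as row-wise shares, so each row sums to 1.
share = pd.crosstab(toy["odor"], toy["poisonous"], normalize="index")

print(table)
print(share)
```

Calling `table.plot(kind="bar")` would draw the same counts as a bar chart with matplotlib.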
  5. Modeling: model selection and hyperparameter tuning
  • Model selection: choose a model that fits the goal and the data
  • Hyperparameter tuning: adjust the model so that it fits well
# Split into a training set and an evaluation (test) set
# Inputs: x_one_hot, y
# Uses the train_test_split function

x_train, x_test, y_train, y_test = train_test_split(x_one_hot, y,
                                                    test_size=0.3)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
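The split above is random, so the shapes stay the same but the rows differ on every run. As a sketch of a reproducible variant, `random_state` fixes the shuffle and `stratify` keeps the label ratio similar in both sets. The toy data and the seed value 42 are assumptions for illustration, not settings from the post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 6 edible ('e') and 4 poisonous ('p').
X = pd.DataFrame({"feature": range(10)})
y = pd.Series(["e"] * 6 + ["p"] * 4)

# random_state fixes the shuffle; stratify=y keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.value_counts())
print(y_test.value_counts())
```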
  • Load the decision tree model
# Create the DecisionTree model
tree = DecisionTreeClassifier(max_depth=3)
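`max_depth` is the hyperparameter tuned in this step. A rough sketch of how different depths could be compared is shown below. It uses synthetic data from `make_classification` as a stand-in for the mushroom features, and the depth list is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the mushroom dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train one tree per depth and record (train accuracy, test accuracy).
scores = {}
for depth in [1, 3, 5, None]:  # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    scores[depth] = (
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_test, model.predict(X_test)),
    )

for depth, (train_acc, test_acc) in scores.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

A large gap between training and test accuracy at greater depths is a sign of overfitting, which is one reason to limit `max_depth`.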
  6. Training
tree.fit(x_train,y_train)
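After fitting, it can help to look at the rules the tree learned. The sketch below uses `export_text` from `sklearn.tree` on a tiny made-up dataset. The feature names `odor_f` and `cap_x` are hypothetical one-hot columns, not columns confirmed to exist in the post's data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical one-hot features; the label depends only on odor_f here.
X = pd.DataFrame({
    "odor_f": [1, 1, 0, 0, 0, 1],
    "cap_x":  [0, 1, 0, 1, 1, 0],
})
y = ["p", "p", "e", "e", "e", "p"]

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
toy_tree.fit(X, y)

# Print the learned split rules as indented text.
rules = export_text(toy_tree, feature_names=list(X.columns))
print(rules)
```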
  7. Prediction and evaluation
pre = tree.predict(x_test)
pre
accuracy_score(y_test, pre)  # arguments are (y_true, y_pred)
print("Prediction accuracy: {0:.4f}".format(accuracy_score(y_test, pre)))
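Accuracy alone does not show which kind of mistake the model makes. As a small sketch, a confusion matrix splits the predictions by true label. The label lists below are made up for illustration and are not results from the post.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions; 'e' = edible, 'p' = poisonous.
y_true = ["e", "e", "p", "p", "p", "e"]
y_pred = ["e", "p", "p", "p", "e", "e"]

acc = accuracy_score(y_true, y_pred)

# Rows are true labels, columns are predicted labels, both in the order ['e', 'p'].
cm = confusion_matrix(y_true, y_pred, labels=["e", "p"])

print(f"accuracy: {acc:.4f}")
print(cm)
```

In the second row, the first column counts poisonous mushrooms predicted as edible, which is arguably the more costly error for this problem.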

mushroom.csv
