Last time we practiced building a decision tree on categorical features. So can we also build a classification decision tree that splits on a continuous feature? Yes, and the steps below walk through how.
ID | Stream | Slope | Elevation | Vegetation |
---|---|---|---|---|
1 | false | steep | 3,900 | chapparal |
2 | true | moderate | 300 | riparian |
3 | true | steep | 1,500 | riparian |
4 | false | steep | 1,200 | chapparal |
5 | false | flat | 4,450 | conifer |
6 | true | steep | 5,000 | conifer |
7 | true | steep | 3,000 | chapparal |
Sorting the rows by Elevation gives:

ID | Stream | Slope | Elevation | Vegetation |
---|---|---|---|---|
2 | true | moderate | 300 | riparian |
4 | false | steep | 1,200 | chapparal |
3 | true | steep | 1,500 | riparian |
7 | true | steep | 3,000 | chapparal |
1 | false | steep | 3,900 | chapparal |
5 | false | flat | 4,450 | conifer |
6 | true | steep | 5,000 | conifer |
How do we find the points where the target level changes? Sort the data by the continuous feature, Elevation, and look at the boundaries where the target class switches, as in the sorted table above. Taking the midpoint of the two adjacent Elevation values at each such boundary gives a total of four candidate thresholds: 750, 1,350, 2,250, and 4,175. A quick code sketch of this procedure follows.
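This is a minimal sketch of how those candidate thresholds can be computed programmatically (the variable names are my own); it sorts by the continuous feature and takes the midpoint wherever the class changes:

```python
# (elevation, vegetation) pairs from the table above, already sorted by elevation
rows = [(300, "riparian"), (1200, "chapparal"), (1500, "riparian"), (3000, "chapparal"),
        (3900, "chapparal"), (4450, "conifer"), (5000, "conifer")]

thresholds = [(a + b) / 2
              for (a, ca), (b, cb) in zip(rows, rows[1:])
              if ca != cb]                 # midpoint wherever the class changes
print(thresholds)                          # [750.0, 1350.0, 2250.0, 4175.0]
```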
# Step 1: find the root node -> compute the IG of splitting on each feature
import numpy as np

# Entropy of the full dataset: 3 chapparal, 2 riparian, 2 conifer (7 instances)
ev_H = -((3/7)*np.log2(3/7)+(2/7)*np.log2(2/7)+(2/7)*np.log2(2/7))
print(ev_H)
# elevation >= 750: left {300} is pure; right 6 -> 3 chapparal, 1 riparian, 2 conifer
ev_750_H_ = -((6/7)*(1/2*np.log2(1/2)+1/6*np.log2(1/6)+1/3*np.log2(1/3)))
print(ev_750_H_)
ev_750_IG = ev_H - ev_750_H_
print("ev_750_IG: ", ev_750_IG)
# elevation >= 1350: left {300, 1200} -> 1 riparian, 1 chapparal; right 5 -> 1 riparian, 2 chapparal, 2 conifer
ev_1350_H_ = -((2/7*2*1/2*np.log2(1/2))+((5/7)*((1/5*np.log2(1/5))+(2*2/5*np.log2(2/5)))))
print(ev_1350_H_)
ev_1350_IG = ev_H - ev_1350_H_
print("ev_1350_IG: ", ev_1350_IG)
# elevation >= 2250: left 3 -> 2 riparian, 1 chapparal; right 4 -> 2 chapparal, 2 conifer
ev_2250_H_ = -(3/7*(2/3*np.log2(2/3)+1/3*np.log2(1/3))+4/7*2*1/2*np.log2(1/2))
print(ev_2250_H_)
ev_2250_IG = ev_H - ev_2250_H_
print("ev_2250_IG: ", ev_2250_IG)
# elevation >= 4175: left 5 -> 3 chapparal, 2 riparian; right 2 -> pure conifer
ev_4175_h = -5/7*((2/5*np.log2(2/5)+3/5*np.log2(3/5)))
print(ev_4175_h)
ev_4175_IG = ev_H - ev_4175_h
print("ev_4175_IG: ", ev_4175_IG)
# stream: true 4 -> 2 riparian, 1 chapparal, 1 conifer; false 3 -> 2 chapparal, 1 conifer
stream_h = -(4/7*((1/2*np.log2(1/2))+(1/4*2*np.log2(1/4)))+3/7*(2/3*np.log2(2/3)+(1/3)*np.log2(1/3)))
print(stream_h)
stream_IG = ev_H - stream_h
print("stream_IG: ", stream_IG)
# slope: steep 5 -> 3 chapparal, 1 riparian, 1 conifer; moderate and flat are pure
slope_h = -(5/7*(3/5*np.log2(3/5)+1/5*np.log2(1/5)+1/5*np.log2(1/5)))
print(slope_h)
slope_IG = ev_H - slope_h
print("slope_IG: ", slope_IG)
ev_750_IG:  0.3060
ev_1350_IG: 0.1839
ev_2250_IG: 0.5917
ev_4175_IG: 0.8631
stream_IG:  0.3060
slope_IG:   0.5774
⇒ The feature for the root node is elevation ≥ 4175, whose IG of 0.8631 is the highest. (A small reusable helper that reproduces these hand calculations is sketched below.) Next, recompute the IGs within the child node that still needs splitting.
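Typing every fraction by hand is error-prone, so here is a minimal sketch of a helper that reproduces the numbers above; `entropy` and `information_gain` are names I made up, and each child partition is passed as a vector of class counts:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a class-count vector, e.g. [3, 2, 2]."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # treat 0 * log2(0) as 0
    return float(-(p * np.log2(p)).sum())

def information_gain(parent_counts, child_counts_list):
    """IG = H(parent) - weighted average of the child entropies."""
    n = sum(sum(c) for c in child_counts_list)
    rem = sum(sum(c) / n * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - rem

# Root node: [chapparal, riparian, conifer] = [3, 2, 2]
print(information_gain([3, 2, 2], [[0, 1, 0], [3, 1, 2]]))  # elevation >= 750  -> ~0.3060
print(information_gain([3, 2, 2], [[3, 2, 0], [0, 0, 2]]))  # elevation >= 4175 -> ~0.8631
```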
# Child node (elevation < 4175): 5 instances -> 3 chapparal, 2 riparian
H = -((3/5*np.log2(3/5))+(2/5*np.log2(2/5)))
print(H)
# elevation >= 750: {300} is pure; remaining 4 -> 3 chapparal, 1 riparian
ev_750_h = -(4/5*(3/4*np.log2(3/4)+1/4*np.log2(1/4)))
print(ev_750_h)
ev_750_IG = H - ev_750_h
print("ev_750_IG: ", ev_750_IG)
# elevation >= 1350: left {300, 1200} -> 1 each; right 3 -> 1 riparian, 2 chapparal
ev_1350_h = -((2/5*2*1/2*np.log2(1/2))+(3/5*(1/3*np.log2(1/3)+2/3*np.log2(2/3))))
print(ev_1350_h)
ev_1350_IG = H - ev_1350_h
print("ev_1350_IG: ", ev_1350_IG)
# elevation >= 2250: left 3 -> 2 riparian, 1 chapparal; right 2 -> pure chapparal
ev_2250_h = -(3/5*(2/3*np.log2(2/3)+1/3*np.log2(1/3)))
ev_2250_IG = H - ev_2250_h
print("ev_2250_IG: ", ev_2250_IG)
# stream: true 3 -> 2 riparian, 1 chapparal; false 2 -> pure chapparal
stream_h = -(3/5*(2/3*np.log2(2/3)+1/3*np.log2(1/3)))
stream_IG = H - stream_h
print(stream_h)
print("stream_IG: ", stream_IG)
# slope: steep 4 -> 3 chapparal, 1 riparian; moderate 1 -> pure riparian
slope_h = -(4/5*(3/4*np.log2(3/4)+1/4*np.log2(1/4)))
slope_IG = H - slope_h
print(slope_h)
print("slope_IG: ", slope_IG)
It turns out that the IG for elevation ≥ 2250 and the IG for stream are identical (≈ 0.4200).
So let's work through both options, computing the next split under each of the two features in turn.
Case 1: Stream first
# After splitting on stream: the stream = false branch {1200, 3900} is pure chapparal.
# The stream = true branch has 3 instances -> 2 riparian, 1 chapparal
H = -(1/3*np.log2(1/3)+2/3*np.log2(2/3))
# elevation >= 750: {300} pure; {1500, 3000} -> 1 riparian, 1 chapparal
ev_750_h = -(2/3*2*1/2*np.log2(1/2))
ev_750_IG = H - ev_750_h
print("ev_750_IG: ", ev_750_IG)
ev_1350_h = ev_750_h      # same 1-vs-2 partition as the 750 threshold
ev_1350_IG = H - ev_1350_h
print("ev_1350_IG: ", ev_1350_IG)
# elevation >= 2250 separates this node perfectly -> remaining entropy 0
ev_2250_h = -(2/3*np.log2(1)+1/3*np.log2(1))
ev_2250_IG = H - ev_2250_h
print(H)
print("ev_2250_IG: ", ev_2250_IG)
# slope: moderate {300} pure; steep {1500, 3000} -> 1 riparian, 1 chapparal
slope_h = -(2/3*2*1/2*np.log2(1/2))
slope_IG = H - slope_h
print("slope_IG: ", slope_IG)
That is case 1: split on stream first, the stream = false branch is already pure, and inside the stream = true branch elevation ≥ 2250 has the highest IG and separates it perfectly.
Case 2: elevation ≥ 2250 first
# After splitting on elevation >= 2250: the >= 2250 branch {3000, 3900} is pure chapparal.
# The < 2250 branch has 3 instances -> 2 riparian, 1 chapparal
H = -(1/3*np.log2(1/3)+2/3*np.log2(2/3))
# elevation >= 750: {300} pure; {1200, 1500} -> 1 chapparal, 1 riparian
ev_750_h = -(2/3*2*1/2*np.log2(1/2))
ev_750_IG = H - ev_750_h
print(ev_750_h)
print("ev_750_IG: ", ev_750_IG)
# elevation >= 1350: {300, 1200} -> 1 riparian, 1 chapparal; {1500} pure
ev_1350_h = -(2/3*2*1/2*np.log2(1/2))
ev_1350_IG = H - ev_1350_h
print(ev_1350_h)
print("ev_1350_IG: ", ev_1350_IG)
# stream separates this node perfectly -> remaining entropy 0
stream_h = -((2/3*np.log2(1))+(1/3*np.log2(1)))
stream_IG = H - stream_h
print(stream_h)
print("stream_IG: ", stream_IG)
# slope: moderate {300} pure; steep {1200, 1500} -> 1 chapparal, 1 riparian
slope_h = -(2/3*2*1/2*np.log2(1/2))
print(slope_h)
slope_IG = H - slope_h
print("slope_IG: ", slope_IG)
So that's the final tree, and the splits separate the data really cleanly. (A code rendering of the final tree is sketched below.)
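Since the original visualization is not reproduced here, this is a minimal stand-in: the case-2 tree written as plain if/else rules (case 1 simply tests stream before elevation ≥ 2250; both classify the seven training rows identically):

```python
def classify(stream, elevation):
    """Hand-written version of the final tree (case 2 ordering)."""
    if elevation >= 4175:
        return "conifer"
    if elevation >= 2250:
        return "chapparal"
    return "riparian" if stream else "chapparal"

# e.g. ID 1 and ID 2 from the table above
print(classify(False, 3900))  # chapparal
print(classify(True, 300))    # riparian
```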
When the target feature is continuous, the task is regression, so we need a regression decision tree. In a regression tree, after a split, the mean of the target values remaining in each subset becomes the representative value (the prediction) of that leaf node. (For example, a leaf containing the values 800, 826, and 900 would predict their mean, 842.)
#️⃣ Dataset
The example data used below, reconstructed from the calculations that follow (column names and row order are assumed; the target is the numeric value in the last column):

Season | Day | Target |
---|---|---|
winter | false | 800 |
winter | false | 826 |
winter | true | 900 |
spring | false | 2,100 |
spring | true | 4,740 |
spring | true | 4,900 |
summer | false | 3,000 |
summer | true | 5,800 |
summer | true | 6,200 |
autumn | false | 2,910 |
autumn | false | 2,880 |
autumn | true | 2,820 |
#️⃣ How to compute the "IG" (impurity reduction) when the target is continuous
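Concretely, the quantity the code below computes is the sample variance of the target within each partition, weighted by partition size; the notation here follows the standard sample-variance definition:

$$
\operatorname{var}(t,\mathcal{D}) \;=\; \frac{\sum_{i=1}^{n}\bigl(t_i-\bar{t}\,\bigr)^2}{n-1},
\qquad
\text{weighted var after splitting on } d \;=\; \sum_{l\,\in\,\text{levels}(d)} \frac{|\mathcal{D}_l|}{|\mathcal{D}|}\,\operatorname{var}\!\bigl(t,\mathcal{D}_l\bigr)
$$

The split that leaves the smallest weighted variance (i.e., removes the most variance) is chosen.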
❓ Why is there a 1 in the denominator? Because this is the sample variance: dividing by n − 1 rather than n (Bessel's correction) gives an unbiased estimate of the variance from a sample.
#️⃣ Regression Decision Tree (for a continuous target feature)
# Weighted variance if we split on season: each season holds 3 of the 12 values, so weight 3/12 = 1/4
winter_mean = (800+826+900)/3
#print(winter_mean)
winter_var = (((winter_mean-800)**2)+((winter_mean-826)**2)+((winter_mean-900)**2))/(3-1)*1/4
print("winter_var: ", winter_var)
spring_mean = (2100+4740+4900)/3
#print(spring_mean)
spring_var = (((spring_mean-2100)**2)+((spring_mean-4740)**2)+((spring_mean-4900)**2))/(3-1)*1/4
print("spring_var: ", spring_var)
summer_mean = (3000+5800+6200)/3
summer_var = (((summer_mean-3000)**2)+((summer_mean-5800)**2)+((summer_mean-6200)**2))/(3-1)*1/4
print("summer_var: ", summer_var)
autumn_mean = (2910+2880+2820)/3
autumn_var = (((autumn_mean-2910)**2)+((autumn_mean-2880)**2)+((autumn_mean-2820)**2))/(3-1)*1/4
print("autumn_var: ", autumn_var)
season_var = winter_var + spring_var + summer_var + autumn_var
print("season_var: ", season_var)
# Weighted variance if we split on day: each level holds 6 of the 12 values, so weight 6/12 = 1/2
true_mean = (900+4740+4900+5800+6200+2820)/6
true_var = (((true_mean-900)**2)+((true_mean-4740)**2)+((true_mean-4900)**2)+((true_mean-5800)**2)+((true_mean-6200))**2+((true_mean-2820)**2))/(6-1)*1/2
print("true_var: ", true_var)
false_mean = (800+826+2100+3000+2910+2880)/6
false_var = (((false_mean-800)**2)+((false_mean-826)**2)+((false_mean-2100)**2)+((false_mean-3000)**2)+((false_mean-2910)**2)+((false_mean-2880)**2))/(6-1)*1/2
print("false_var: ", false_var)
day_var = true_var + false_var
print(day_var)
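The split is then chosen by comparing these two weighted variances: the one that leaves less variance behind wins. A small sketch that generalizes the hand calculation above (the helper name `weighted_variance` and the dictionaries are my own):

```python
import numpy as np

def weighted_variance(groups):
    """Weighted sample variance left after a split; `groups` maps level -> list of target values."""
    n = sum(len(v) for v in groups.values())
    return sum(len(v) / n * np.var(v, ddof=1) for v in groups.values())

season_groups = {"winter": [800, 826, 900], "spring": [2100, 4740, 4900],
                 "summer": [3000, 5800, 6200], "autumn": [2910, 2880, 2820]}
day_groups = {"true": [900, 4740, 4900, 5800, 6200, 2820],
              "false": [800, 826, 2100, 3000, 2910, 2880]}

print(weighted_variance(season_groups))  # ~1,379,331
print(weighted_variance(day_groups))     # ~2,551,813
```

Since season leaves far less weighted variance than day, season would be the better first split for this data.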
#️⃣ Drawing a decision tree for the iris data with sklearn
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import tree
iris = load_iris()
data, targets = iris.data, iris.target
print("data / target shape")
print(data.shape, targets.shape, '\n')
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
test_size=0.2, random_state=11)
# print(f"{type(X_train) = } / {X_train.shape = }")
# print(f"{type(X_test) = } / {X_test.shape = }")
# print(f"{type(y_train) = } / {y_train.shape = }")
# print(f"{type(y_test) = } / {y_test.shape = }")
model = DecisionTreeClassifier()
# for attr in dir(model):
# if not attr.startswith("__"):
# print(attr)
model.fit(X_train, y_train)
print("depth: ", model.get_depth())
print("number of leaves: ", model.get_n_leaves())
accuracy = model.score(X_test, y_test)
print(f"{accuracy = :.4f}")
plt.figure(figsize=(20, 15))
tree.plot_tree(model,
class_names=iris.target_names,
feature_names=iris.feature_names,
impurity=True, filled=True,
rounded=True)
plt.show()
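One note: DecisionTreeClassifier splits on Gini impurity by default. To make sklearn use the same entropy/IG criterion as the hand calculations above, and variance reduction (squared error) for a regression tree, the criterion can be set explicitly; a minimal sketch:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Entropy-based splits, i.e. the same IG criterion used in the manual calculations
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

# Regression tree: splits minimize the weighted squared error (variance reduction)
reg = DecisionTreeRegressor(criterion="squared_error", random_state=0)
```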