Importance of Splitting Train & Test Set

Ji Kim · January 5, 2021

Machine Learning


In the previous post, I emphasized the importance of splitting data into training and test sets. In this post, let us see what happens to the accuracy estimate when the data is not split.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()

dt_clf = DecisionTreeClassifier()

train_data = iris.data
train_label = iris.target

dt_clf.fit(train_data, train_label)

pred = dt_clf.predict(train_data)
print('Accuracy Score : ', accuracy_score(train_label, pred))

Output
Accuracy Score : 1.0

The model returned 100% accuracy because it was asked to predict the very same data it had already been trained on.

In other words, it is like giving students an exam that is identical to the practice problems they were handed beforehand.
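To make the memorization point concrete, here is a minimal sketch under my own assumptions: the labels are shuffled so they no longer relate to the features, and the tree is still scored on the same rows it was fitted on. The names `rng`, `shuffled_label`, and `memorized_acc`, as well as the fixed seeds, are illustrative choices and not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()

# Assumption for illustration: shuffle the labels so that they carry
# no real relationship to the features any more.
rng = np.random.default_rng(0)
shuffled_label = rng.permutation(iris.target)

# Seed fixed only to make the sketch repeatable.
dt_clf = DecisionTreeClassifier(random_state=0)
dt_clf.fit(iris.data, shuffled_label)

# Score on exactly the rows used for fitting.
memorized_acc = accuracy_score(shuffled_label, dt_clf.predict(iris.data))
print('Accuracy on shuffled labels (same data) : ', memorized_acc)
```

If the score stays close to 1.0 even though the labels are meaningless, that suggests the number mostly reflects the tree memorizing its training rows rather than learning anything it could reuse on new data.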

Hence, we must split the data with the train_test_split() API so that accuracy is measured on data the model has not seen during training.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

dt_clf = DecisionTreeClassifier()
iris_data = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, test_size=0.3, random_state=121)

dt_clf.fit(X_train, y_train)

pred = dt_clf.predict(X_test)
print('Accuracy : {0:4f}'.format(accuracy_score(y_test, pred)))

Output

Accuracy : 0.955556
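As a rough follow-up sketch, the same split can be scored on both halves to see the gap between training and test accuracy side by side. The names `train_acc` and `test_acc` and the `random_state=0` passed to the classifier are my own additions for repeatability; the split parameters are the ones used above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris_data = load_iris()

# Same split as above: 30% of the rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.3, random_state=121)

# random_state on the tree is an assumption added for repeatability.
dt_clf = DecisionTreeClassifier(random_state=0)
dt_clf.fit(X_train, y_train)

# Accuracy on rows the model was fitted on vs. rows it has never seen.
train_acc = accuracy_score(y_train, dt_clf.predict(X_train))
test_acc = accuracy_score(y_test, dt_clf.predict(X_test))

print('Train accuracy : {0:.4f}'.format(train_acc))
print('Test accuracy  : {0:.4f}'.format(test_acc))
```

The training score should still come out perfect, while the test score is the one that tells us how the model is likely to behave on new samples.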
