On previous post, I had emphasized the importance of splitting data into train and test data-sets. On this post, let us see what happens to the estimation without splitting the data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
iris = load_iris()
dt_clf = DecisionTreeClassifier()
train_data = iris.data
train_label = iris.target
dt_clf.fit(train_data, train_label)
pred = dt_clf.predict(train_data)
print('Accuracy Score : ', accuracy_score(train_label, pred))
Output
Accuracy Score : 1.0
The reason that the model has returned 100% accuracy is because the model performed prediction based on the train data-set that the model has already trained through.
In other words, it is simply giving an exam which is identical to the given practice problem sets.
Hence, we must split the data using train_test_split() API to accurately perform the prediction.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
dt_clf = DecisionTreeClassifier()
iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, test_size=0.3, random_state=121)
dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)
print('Accuracy : {0:4f}'.format(accuracy_score(y_test, pred)))
Output
Accuracy : 0.955556