Scikit-Learn is one of the most-used open-source machine learning library for Python. Scikit-Learn provides various unsupervised and supervised learning algorithms which many data-scientists rely on.
conda install scikit-learn
pip install scikit-learn
import sklearn
print(sklearn.__version__)
Output
0.21.3
We will try to classify types of irises based on the imported feature dataset (i.e - sepal length, sepal width, petal length, petal width).
Classification
supervised-learning problem where a class label is predicted for a given exmaple of input data (i.e - classify COVID-19, classify spam mails)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
# load iris dataset
iris = load_iris()
# iris.data contains feature-data in a numpy format
iris_data = iris.data
# iris.target contains label-data in a numpy format
iris_label = iris.target
print('Iris Target Values : \n', iris_label)
print('Iris Target Names : \n', iris.target_names)
# convert data-set to DataFrame
iris_df = pd.DataFrame(data=iris_data, columns=iris.feature_names)
iris_df['label'] = iris.target
iris_df.head(3)
Output
Iris Target Values :
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
Iris Target Names :
['setosa' 'versicolor' 'virginica']
Train and test data must be splitted in order to evaluate the performance of the trained model. Scikit-Learn provies train_test_split() API to easily split dataset.
X_train, X_test, y_train, y_test = train_test_split(iris_data, iris_label, test_size=0.2, random_state=11)
# craete Decision Tree Classifier object
dt_clf = DecisionTreeClassifier(random_state=11)
# perform train
# fit() calls train feature data set & train label data set
dt_clf.fit(X_train, y_train)
Now, DecisionTreeClassifier has completed its training on data based on train data-set. Prediction must use another dataset (test data-set) by calling predict().
# perform prediction on dt_clf using test data-set
pred = dt_clf.predict(X_test)
Now import accuracy_score to evaluate the performance of the model
from sklearn.metrics import accuracy_score
print('Accuracy Score : {0:4f}'.format(accuracy_score(y_test, pred)))
Output
Accuracy Score : 0.933333
The trained algorithm of decision tree classifer is measured to have 93.33% of accuracy.
- Split Data-set : split data to train and test data-set
- Train Model : train the model by applying ML-algorithm based on the train data-set
- Perform Prediction : Predict classification based on the trained ML-model
- Evaluation : Evaluate the accuracy of the prediction by comparing results to label test data