Introducing Scikit-Learn

Ji Kim·2021년 1월 5일
0

Machine Learning

목록 보기
1/15

About Scikit-Learn

Scikit-Learn is one of the most-used open-source machine learning library for Python. Scikit-Learn provides various unsupervised and supervised learning algorithms which many data-scientists rely on.

Install Scikit-Learn

conda install scikit-learn
pip install scikit-learn
import sklearn

print(sklearn.__version__) 

Output

0.21.3

Predict Types of Irises

We will try to classify types of irises based on the imported feature dataset (i.e - sepal length, sepal width, petal length, petal width).

Classification

supervised-learning problem where a class label is predicted for a given exmaple of input data (i.e - classify COVID-19, classify spam mails)

from sklearn.datasets import load_iris 
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split 
import pandas as pd

# load iris dataset
iris = load_iris()

# iris.data contains feature-data in a numpy format
iris_data = iris.data

# iris.target contains label-data in a numpy format
iris_label = iris.target
print('Iris Target Values : \n', iris_label)
print('Iris Target Names : \n', iris.target_names)

# convert data-set to DataFrame
iris_df = pd.DataFrame(data=iris_data, columns=iris.feature_names)
iris_df['label'] = iris.target
iris_df.head(3)

Output

Iris Target Values : 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
Iris Target Names : 
 ['setosa' 'versicolor' 'virginica']

Split to Train & Test Data

Train and test data must be splitted in order to evaluate the performance of the trained model. Scikit-Learn provies train_test_split() API to easily split dataset.

X_train, X_test, y_train, y_test = train_test_split(iris_data, iris_label, test_size=0.2, random_state=11)
# craete Decision Tree Classifier object
dt_clf = DecisionTreeClassifier(random_state=11)

# perform train 

# fit() calls train feature data set & train label data set 
dt_clf.fit(X_train, y_train)

Now, DecisionTreeClassifier has completed its training on data based on train data-set. Prediction must use another dataset (test data-set) by calling predict().

# perform prediction on dt_clf using test data-set

pred = dt_clf.predict(X_test)

Now import accuracy_score to evaluate the performance of the model

from sklearn.metrics import accuracy_score

print('Accuracy Score : {0:4f}'.format(accuracy_score(y_test, pred)))

Output

Accuracy Score : 0.933333

The trained algorithm of decision tree classifer is measured to have 93.33% of accuracy.

To Summarize

  1. Split Data-set : split data to train and test data-set
  2. Train Model : train the model by applying ML-algorithm based on the train data-set
  3. Perform Prediction : Predict classification based on the trained ML-model
  4. Evaluation : Evaluate the accuracy of the prediction by comparing results to label test data
profile
if this then that

0개의 댓글