[ML] sklearn.pipeline

강동연 · January 7, 2022

Pipeline in sklearn

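👀 Today's post covers scikit-learn's Pipeline class. Training a model on its own is usually not enough to get the best predictions, so we typically add steps such as normalization and cross-validation around it. Pipeline lets us express those steps as a single workflow: it bundles every stage, from data preprocessing to classification, into one estimator object.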
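To make the "single object" idea concrete, here is a minimal sketch that compares a hand-written scale-then-classify workflow with the same steps wrapped in a Pipeline. The iris data, the split settings, and the variable names are assumptions chosen for illustration; they are not from a specific project.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: the bundled iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Manual workflow: each step is fitted and applied by hand
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on the training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
manual_clf = LogisticRegression(random_state=42)
manual_clf.fit(X_train_scaled, y_train)
manual_preds = manual_clf.predict(X_test_scaled)

# Pipeline workflow: the same two steps behind one fit/predict interface
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=42)),
])
pipe.fit(X_train, y_train)
pipe_preds = pipe.predict(X_test)

# Both workflows run the same computations, so the predictions should match
same_predictions = bool(np.array_equal(manual_preds, pipe_preds))
print("identical predictions:", same_predictions)
```

With the Pipeline version there is no separate scaled copy of the data to keep track of, which removes one common source of mistakes (for example, forgetting to scale the test set).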

The talk below includes an explanation of Pipeline.
Liz Sander - Software Library APIs: Lessons Learned from scikit-learn - PyCon 2018

Pipeline Example

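✔ A basic Pipeline example. A Pipeline() object wraps a scaler and a model so they can be used as a single estimator. As the code below shows, this makes the code more readable and concise, and it also improves reproducibility.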

  
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# add your data here
data = load_iris()
X_train,X_test,y_train, y_test = train_test_split(data.data, data.target,
                                  test_size = 0.2, random_state = 42)

# it takes a list of tuples as parameter
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # standardize features (zero mean, unit variance)
    ('clf', LogisticRegression(random_state=42))  # classifier
])

# use the pipeline object as you would
# a regular classifier
pipeline.fit(X_train,y_train)
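The example above stops at fit(). As a small follow-up sketch, the fitted pipeline can be scored on the held-out data, and each fitted step can be reached by its name through named_steps. The variable names accuracy, scaler_means, and coefs are chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=42)),
])
pipeline.fit(X_train, y_train)

# score() scales X_test with the fitted scaler, then calls the classifier's score()
accuracy = pipeline.score(X_test, y_test)
print("test accuracy:", accuracy)

# Fitted steps are available by the names given in the list of tuples
scaler_means = pipeline.named_steps['scaler'].mean_  # per-feature training means
coefs = pipeline.named_steps['clf'].coef_            # one row of weights per class
print("feature means:", scaler_means)
print("coefficient matrix shape:", coefs.shape)
```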

Text Classification/NLP with pipeline

  
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import classification_report

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)


X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

vect = CountVectorizer()  # turns each text into a vector of word counts

tfidf = TfidfTransformer()  # reweights the counts with TF-IDF to compensate for the weakness of raw counts

# this is a linear SVM classifier
clf = LinearSVC()

pipeline = Pipeline([
    ('vect',vect),
    ('tfidf',tfidf),
    ('clf',clf)
])



# call fit as you would on any classifier
pipeline.fit(X_train,y_train)

# predict test instances
y_preds = pipeline.predict(X_test)

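# calculate the micro-averaged F1 score
mean_f1 = f1_score(y_test, y_preds, average='micro')

# classification_report expects (y_true, y_pred); the report printed below came
# from a call with the two arguments swapped, so its precision and recall columns
# are interchanged and its support column counts predictions, not true labels
print(classification_report(y_test, y_preds))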
print("mean_f1 :", mean_f1)
  
>>>
              precision    recall  f1-score   support

           0       0.95      0.99      0.97       307
           1       0.99      0.96      0.98       406

    accuracy                           0.97       713
   macro avg       0.97      0.98      0.97       713
weighted avg       0.98      0.97      0.97       713

mean_f1 : 0.9747545582047685
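As a side note on the 'vect' and 'tfidf' steps: scikit-learn also provides TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in one step. The sketch below checks that equivalence on a tiny made-up corpus (the docs list is purely hypothetical) with default settings for every class.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only for illustration
docs = [
    "the rocket reached orbit",
    "the shuttle left orbit",
    "people argue about belief",
    "belief and doubt are debated",
]

# Two-step version: raw counts first, then TF-IDF reweighting
two_step = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
X_two_step = two_step.fit_transform(docs)

# One-step version with default parameters
X_one_step = TfidfVectorizer().fit_transform(docs)

# Both produce the same TF-IDF matrix
same_matrix = bool(np.allclose(X_two_step.toarray(), X_one_step.toarray()))
print("same TF-IDF matrix:", same_matrix)
```

Keeping the two steps separate, as in the example above, is still useful when the count and TF-IDF stages need to be tuned or inspected independently.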

Pipeline with Cross-Validation (GridSearchCV)

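✔ As the code below shows, a Pipeline can also be combined with cross-validation and GridSearchCV. Personally, I think the biggest advantage of Pipeline is how visible it makes the whole workflow.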

  
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)

X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

pipeline = Pipeline([
    ('vect',CountVectorizer()),
    ('tfidf',TfidfTransformer()),
    ('clf',LinearSVC())
])

# this is where you define the values for
# GridSearchCV to iterate over

# the l1 penalty only works with dual=False, so it gets its own grid
param_grid = [
    {
        'vect__max_df':[0.8,0.9,1.0],
        'clf__penalty':['l2'],
        'clf__dual':[True,False]
    },
    {
        'vect__max_df':[0.8,0.9,1.0],
        'clf__penalty':['l1'],
        'clf__dual': [False]
    }
]

# do 3-fold cross validation for each of the 9 parameter
# combinations above (6 with the l2 penalty, 3 with l1)
grid = GridSearchCV(pipeline, cv=3, param_grid=param_grid,scoring='f1_micro')
grid.fit(X_train,y_train)

# summarize results
print("Best: %f using %s" % (grid.best_score_, 
    grid.best_params_)) 

# print("result : \n",  grid.cv_results_ )

means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
params = grid.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
    
# now train and predict test instances
# using the best configs
pipeline.set_params(clf__penalty='l2',vect__max_df=0.9,clf__dual=True)
pipeline.fit(X_train,y_train)
y_preds = pipeline.predict(X_test)

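# classification_report expects (y_true, y_pred); the report printed below came
# from a call with the two arguments swapped, so its precision and recall columns
# are interchanged and its support column counts predictions, not true labels
print(classification_report(y_test, y_preds))

# micro-averaged F1 on the test set
print(f1_score(y_test, y_preds, average='micro'))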
>>>
Best: 0.992546 using {'clf__dual': True, 'clf__penalty': 'l2', 'vect__max_df': 0.9}
0.990681 (0.001311) with: {'clf__dual': True, 'clf__penalty': 'l2', 'vect__max_df': 0.8}
0.992546 (0.001309) with: {'clf__dual': True, 'clf__penalty': 'l2', 'vect__max_df': 0.9}
0.990681 (0.001311) with: {'clf__dual': True, 'clf__penalty': 'l2', 'vect__max_df': 1.0}
0.990681 (0.001311) with: {'clf__dual': False, 'clf__penalty': 'l2', 'vect__max_df': 0.8}
0.992546 (0.001309) with: {'clf__dual': False, 'clf__penalty': 'l2', 'vect__max_df': 0.9}
0.990681 (0.001311) with: {'clf__dual': False, 'clf__penalty': 'l2', 'vect__max_df': 1.0}
0.971112 (0.003459) with: {'clf__dual': False, 'clf__penalty': 'l1', 'vect__max_df': 0.8}
0.972051 (0.008202) with: {'clf__dual': False, 'clf__penalty': 'l1', 'vect__max_df': 0.9}
0.971115 (0.004719) with: {'clf__dual': False, 'clf__penalty': 'l1', 'vect__max_df': 1.0}
              precision    recall  f1-score   support

           0       0.96      0.99      0.97       308
           1       0.99      0.97      0.98       405

    accuracy                           0.98       713
   macro avg       0.97      0.98      0.98       713
weighted avg       0.98      0.98      0.98       713

0.9761570827489481
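The code above copies the best parameters into the pipeline by hand and refits it. As an alternative sketch: GridSearchCV refits the best configuration on the full training data by default (refit=True), so the fitted pipeline is available as grid.best_estimator_. The example below uses the iris data and a small grid picked only for illustration; the parameter values are assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Grid keys follow the pattern <step name>__<parameter name>
param_grid = {
    'clf__C': [0.1, 1.0, 10.0],
    'scaler__with_mean': [True, False],
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='f1_micro')
grid.fit(X_train, y_train)

# Number of parameter combinations that were evaluated (3 * 2)
n_candidates = len(grid.cv_results_['params'])

# With refit=True (the default), the best pipeline is already fitted
best_pipe = grid.best_estimator_
y_preds = best_pipe.predict(X_test)

print("best params:", grid.best_params_)
print("candidates evaluated:", n_candidates)
```

Calling grid.predict(X_test) delegates to the same refitted pipeline, so no separate set_params() and fit() calls are needed.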

References
https://queirozf.com/entries/scikit-learn-pipeline-examples
