[ML] sklearn.pipeline

๊ฐ•๋™์—ฐยท2022๋…„ 1์›” 7์ผ

[ML]

๋ชฉ๋ก ๋ณด๊ธฐ
2/2

Pipeline in sklearn

๐Ÿ‘€ ์˜ค๋Š˜์€ sklearn์˜ Pipeline ๋ฉ”์„œ๋“œ์— ๋Œ€ํ•ด ์ด์•ผ๊ธฐ ํ•ด๋ณผ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ๋‹จ์ˆœ ๋ชจ๋ธ ํ•™์Šต๋งŒ์œผ๋กœ๋Š” ์ตœ์ ์˜ ์˜ˆ์ธก๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•˜๊ธฐ ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ์šฐ๋ฆฌ๋Š” ์ •๊ทœํ™”, ๊ต์ฐจ๊ฒ€์ฆ ๋“ฑ ๋ฐฉ๋ฒ•์„ ๋ชจ๋ธ์— ์ ์šฉํ•ด ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. Pipeline๋Š” ์ด๋Ÿฌํ•œ ๊ณผ์ •์„ ํ•˜๋‚˜์˜ ์›Œํฌํ”Œ๋กœ์šฐ๋กœ ๊ฐ€๋Šฅ์ผ€ ํ•ด์ค๋‹ˆ๋‹ค. ๊ฒฐ๋ก ์ ์œผ๋กœ ํŒŒ์ดํ”„๋ผ์ธ์„ ์‚ฌ์šฉํ•˜๋ฉด ๋ฐ์ดํ„ฐ ์‚ฌ์ „ ์ฒ˜๋ฆฌ ๋ฐ ๋ถ„๋ฅ˜์˜ ๋ชจ๋“  ๋‹จ๊ณ„๋ฅผ ํฌํ•จํ•˜๋Š” ๋‹จ์ผ ๊ฐœ์ฒด๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์•„๋ž˜์˜ ๋งํฌ์—์„œ Pipeline์— ๋Œ€ํ•œ ์„ค๋ช…์„ ๋“ค์œผ์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
Liz Sander - Software Library APIs: Lessons Learned from scikit-learn - PyCon 2018

Pipeline Example

โœ” ๊ธฐ๋ณธ์ ์ธ Pipeline ์˜ˆ์ œ ์ž…๋‹ˆ๋‹ค. Pipeline() ๊ฐ์ฒด์— Scaler์™€ ๋ชจ๋ธ์„ ๋‹จ์ผ ๊ฐ์ฒด๋กœ์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜์˜ ์ฝ”๋“œ๋ฅผ ๋ณด์‹œ๋ฉด ์ฝ”๋“œ์˜ ๊ฐ€๋…์„ฑ ๋†’์•„์ง€๊ณ , ๊ฐ„๊ฒฐํ•ด์ง‘๋‹ˆ๋‹ค. ๋˜ํ•œ ์žฌํ˜„์„ฑ์ด ์ฆ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

  
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# add your data here
data = load_iris()
X_train,X_test,y_train, y_test = train_test_split(data.data, data.target,
                                  test_size = 0.2, random_state = 42)

# it takes a list of tuples as parameter
pipeline = Pipeline([
    ('scaler',StandardScaler()), # ํ‘œ์ค€ํ™”
    ('clf', LogisticRegression(random_state = 42)) # 
])

# use the pipeline object as you would
# a regular classifier
pipeline.fit(X_train,y_train)
  

Text Classification/NLP with pipeline

  
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import classification_report

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)


X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

vect = CountVectorizer() # ๊ฐ ํ…์ŠคํŠธ์—์„œ ๋‹จ์–ด ์ถœํ˜„ ํšŸ์ˆ˜๋ฅผ ์นด์šดํŒ…ํ•œ ๋ฒกํ„ฐ
 
tfidf = TfidfTransformer() # TF-IDF๋ผ๋Š” ๊ฐ’์„ ์‚ฌ์šฉํ•˜์—ฌ CountVectorizer์˜ ๋‹จ์ ์„ ๋ณด์™„ํ•จ

# this is a linear SVM classifier
clf = LinearSVC()

pipeline = Pipeline([
    ('vect',vect),
    ('tfidf',tfidf),
    ('clf',clf)
])



# call fit as you would on any classifier
pipeline.fit(X_train,y_train)

# predict test instances
y_preds = pipeline.predict(X_test)

# calculate f1
mean_f1 = f1_score(y_test, y_preds, average='micro')

print(classification_report(y_preds,y_test))
print("mean_f1 :", mean_f1)
  
>>>
precision    recall  f1-score   support

           0       0.95      0.99      0.97       307
           1       0.99      0.96      0.98       406

    accuracy                           0.97       713
   macro avg       0.97      0.98      0.97       713
weighted avg       0.98      0.97      0.97       713

mean_f1 : 0.9747545582047685
  

Pipeline with Cross-Validation (GridSearchCV)

โœ” ์•„๋ž˜ ์ฝ”๋“œ์™€ ๊ฐ™์ด ๊ต์ฐจ ๊ฒ€์ฆ๊ณผ, GridSearchCV ๋ฐฉ๋ฒ•๋“ค ๋˜ํ•œ ํ•จ๊ป˜ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ฐœ์ธ์ ์œผ๋กœ Pipeline์˜ ๊ฐ€์žฅ ํฐ ์žฅ์ ์€ ๊ฐ€์‹œ์„ฑ์ด๋ผ๊ณ  ์ƒ๊ฐํ•œ๋‹ค.

  
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)

X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

pipeline = Pipeline([
    ('vect',CountVectorizer()),
    ('tfidf',TfidfTransformer()),
    ('clf',LinearSVC())
])

# this is where you define the values for
# GridSearchCV to iterate over

# l1 penalty is incompatible with other configs
param_grid = [
    {
        'vect__max_df':[0.8,0.9,1.0],
        'clf__penalty':['l2'],
        'clf__dual':[True,False]
    },
    {
        'vect__max_df':[0.8,0.9,1.0],
        'clf__penalty':['l1'],
        'clf__dual': [False]
    }
]

# do 3-fold cross validation for each of the 6 possible
# combinations of the parameter values above
grid = GridSearchCV(pipeline, cv=3, param_grid=param_grid,scoring='f1_micro')
grid.fit(X_train,y_train)

# summarize results
print("Best: %f using %s" % (grid.best_score_, 
    grid.best_params_)) 

# print("result : \n",  grid.cv_results_ )

means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
params = grid.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
    
# now train and predict test instances
# using the best configs
pipeline.set_params(clf__penalty='l2',vect__max_df=0.9,clf__dual=True)
pipeline.fit(X_train,y_train)
y_preds = pipeline.predict(X_test)

# calculate f1
print(classification_report(y_preds,y_test))
f1_score(y_test, y_preds, average='micro')
>>>
Best: 0.992546 using {'clf__dual': True, 'clf__penalty': 'l2', 'vect__max_df': 0.9}
0.990681 (0.001311) with: {'clf__dual': True, 'clf__penalty': 'l2', 'vect__max_df': 0.8}
0.992546 (0.001309) with: {'clf__dual': True, 'clf__penalty': 'l2', 'vect__max_df': 0.9}
0.990681 (0.001311) with: {'clf__dual': True, 'clf__penalty': 'l2', 'vect__max_df': 1.0}
0.990681 (0.001311) with: {'clf__dual': False, 'clf__penalty': 'l2', 'vect__max_df': 0.8}
0.992546 (0.001309) with: {'clf__dual': False, 'clf__penalty': 'l2', 'vect__max_df': 0.9}
0.990681 (0.001311) with: {'clf__dual': False, 'clf__penalty': 'l2', 'vect__max_df': 1.0}
0.971112 (0.003459) with: {'clf__dual': False, 'clf__penalty': 'l1', 'vect__max_df': 0.8}
0.972051 (0.008202) with: {'clf__dual': False, 'clf__penalty': 'l1', 'vect__max_df': 0.9}
0.971115 (0.004719) with: {'clf__dual': False, 'clf__penalty': 'l1', 'vect__max_df': 1.0}
              precision    recall  f1-score   support

           0       0.96      0.99      0.97       308
           1       0.99      0.97      0.98       405

    accuracy                           0.98       713
   macro avg       0.97      0.98      0.98       713
weighted avg       0.98      0.98      0.98       713

0.9761570827489481
  

profile
Maybe I will be an AI Engineer?

0๊ฐœ์˜ ๋Œ“๊ธ€