Cross Validation

been_29 · July 30, 2024

Korea Economic Daily with Toss Bank MLOps Course


💡 Cross Validation

A technique used to evaluate the performance and generalizability of a machine learning model.

Training Set : The subset of the data used to train the model.
Validation Set : The subset of the data used to evaluate the model's performance during development, for example when comparing models or tuning hyperparameters.
Test Set : An independent subset of data used for the final evaluation of the model after cross-validation.
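
The three roles above can be sketched in code. This is a minimal illustration, not a prescribed workflow: the 80/20 split, the Iris dataset, and the variable names are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the test set; it is not touched during cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=200)

# Cross-validation repeatedly splits X_train into training and validation folds.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final evaluation: fit on all training data, then score once on the test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print("CV accuracy per fold:", cv_scores)
print("Test accuracy:", test_accuracy)
```

The validation folds come only from X_train, so the test accuracy is measured on data that played no part in model selection.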

K-Fold Cross Validation

The dataset is divided into $k$ equal-sized folds.
The model is trained $k$ times, each time using $k-1$ folds for training and the remaining fold for validation.
The performance metric is averaged across all $k$ trials.
Example : If $k=5$, the data is split into 5 folds. The model is trained 5 times, each time with a different fold as the validation set.
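
To see the fold assignment concretely, the sketch below splits a toy array of 10 samples with scikit-learn's KFold. The toy data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 toy samples, one feature each

kf = KFold(n_splits=5)  # no shuffling, so folds are contiguous blocks

# Each element of folds is a (train_indices, validation_indices) pair.
folds = list(kf.split(X_toy))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx}, validation={val_idx}")
```

Every sample appears in exactly one validation fold and in the training set of the other four rounds.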

Stratified K-Fold

Similar to K-Fold Cross-Validation, but ensures each fold has a proportional representation of different classes.
This method is particularly useful for imbalanced datasets.
Example : In a binary classification problem with a 70-30 class split, each fold will maintain this ratio.
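
The 70-30 example can be checked with a small sketch using StratifiedKFold. The toy labels and the all-zero feature matrix are assumptions; only the labels matter for how the folds are formed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with a 70-30 class split: 70 samples of class 0, 30 of class 1.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.zeros((100, 1))  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

val_ratios = []
for train_idx, val_idx in skf.split(X_toy, y_toy):
    ratio = y_toy[val_idx].mean()  # fraction of class 1 in this validation fold
    val_ratios.append(ratio)
    print(f"validation size={len(val_idx)}, class-1 fraction={ratio:.2f}")
```

Each validation fold holds 20 samples, 6 of them from the minority class, so the 30% share is preserved in every fold.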

Leave-One-Out Cross-Validation (LOOCV)

Each data point is used once as the validation set while the remaining data points form the training set.
The process is repeated for each data point.
This method can be computationally expensive for large datasets.
Example : If the dataset has 100 points, the model is trained 100 times, each time with one point as the validation set.
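
A short sketch with scikit-learn's LeaveOneOut shows why the cost grows with the dataset: the number of training rounds equals the number of samples. The toy array is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_toy = np.arange(100).reshape(-1, 1)  # 100 toy data points

loo = LeaveOneOut()
n_rounds = loo.get_n_splits(X_toy)  # one round per data point
print("Number of training rounds:", n_rounds)

# Inspect the first round: 99 points for training, 1 for validation.
train_idx, val_idx = next(loo.split(X_toy))
print("First round: train size =", len(train_idx), ", validation index =", val_idx)
```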

Time Series Cross-Validation

Used for time series data where the order of data points is important.
The training set always includes observations from the past, and the validation set includes observations from the future.
Example : For time series data split into 5 folds, the first fold might use data from the first 3 months for training and the next month for validation.
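
One way to get this behavior in scikit-learn is TimeSeriesSplit, sketched below on 12 toy observations assumed to be in time order. The array and the number of splits are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=5)

# The training window grows over time; validation always comes after it.
splits = list(tscv.split(X_toy))
for i, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Split {i}: train={train_idx}, validation={val_idx}")
```

In every split the largest training index is smaller than the smallest validation index, so the model is never validated on data older than what it was trained on.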

Example Code for K-Fold Cross-Validation

This example uses the Iris dataset and a LogisticRegression model.

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Define the model
model = LogisticRegression(max_iter=200)

# Define the K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

# Print the results
print("Mean Accuracy:", np.mean(scores))
print("Standard Deviation:", np.std(scores))

💡 cross_val_score

A utility for evaluating the performance of a model using cross-validation.

Automates Cross-Validation : Handles the splitting of data into folds, training, and evaluating the model for each fold.
Flexible Scoring : Allows the use of various scoring metrics to evaluate the performance of the model.
Supports Different Cross-Validation Strategies : Works with different cross-validation strategies such as K-Fold, Stratified K-Fold, Leave-One-Out, and more.

Function Signature

cross_val_score(estimator, X, y=None, *, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)

Parameters

estimator : The model to be evaluated, e.g., LogisticRegression(), DecisionTreeClassifier(), etc.
X : Features data (array-like or matrix).
y : Target variable (array-like), default is None for unsupervised learning tasks.
scoring : A string or callable that defines the metric to evaluate the model, e.g., 'accuracy', 'precision', 'recall', etc. Default is None, which uses the estimator's default scorer.
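cv : Cross-validation splitting strategy. This can be an integer (number of folds; StratifiedKFold is used when the estimator is a classifier and y is binary or multiclass, KFold otherwise), a cross-validation object (like KFold or StratifiedKFold), or an iterable yielding (train, test) index splits. Default is None, which uses 5-fold cross-validation.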
n_jobs : Number of CPU cores to use for parallel processing. $-1$ means using all available cores.
verbose : Controls the verbosity of the output.
fit_params : Parameters to pass to the fit method of the estimator. Recent scikit-learn releases deprecate this argument in favor of params.
pre_dispatch : Controls the number of jobs dispatched during parallel processing.
error_score : Value to assign to the score if an error occurs in an estimator’s fit method.
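
As a small illustration of the scoring and cv parameters, the sketch below passes an integer for cv and a non-default metric. The choice of 'f1_macro' is an assumption made for this example: Iris has three classes, and the plain 'precision' and 'recall' scorers expect a binary target, so an averaged variant is used instead.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# cv=5: for a classifier, an integer is interpreted as a stratified 5-fold split.
# scoring="f1_macro": macro-averaged F1 instead of the estimator's default accuracy.
f1_scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")

print("Macro-F1 per fold:", f1_scores)
print("Mean macro-F1:", f1_scores.mean())
```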

Example Usage

A simple example using the Iris dataset and a LogisticRegression model.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Define the model
model = LogisticRegression(max_iter=200)

# Define the cross-validation strategy
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

# Print the results
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", np.mean(scores))
print("Standard Deviation:", np.std(scores))

💡 GridSearch

A technique used in machine learning to systematically explore a predefined space of hyperparameters for a model, in order to find the optimal combination that results in the best performance.

Key Concepts

Hyperparameters : These are parameters that are not learned from the data but set prior to the training process. Examples include the learning rate for training neural networks, the depth of a decision tree, or the number of clusters in K-means.
Parameter Grid : A predefined set of hyperparameter values to be explored. For example, a parameter grid might include different values for the number of trees in a random forest and different maximum depths.
Exhaustive Search : Grid search evaluates every possible combination of hyperparameters in the parameter grid, ensuring that the best combination within that grid is found.
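
The size of an exhaustive search can be checked before running it. The sketch below uses scikit-learn's ParameterGrid to enumerate a grid; the hyperparameter values are illustrative assumptions for a random forest.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}

# ParameterGrid expands the dictionary into every combination of values.
combinations = list(ParameterGrid(param_grid))
print("Number of combinations:", len(combinations))  # 3 * 4 * 3 = 36
print("One combination:", combinations[0])
```

With 5-fold cross-validation, each of the 36 combinations is trained 5 times, giving 180 model fits in total. The cost grows multiplicatively with each added hyperparameter.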

Using GridSearchCV

A tool in the scikit-learn library that helps in tuning hyperparameters of an estimator (model) to find the best possible combination for the highest performance.

Parameters

estimator : The model (estimator) for which hyperparameters need to be tuned, e.g., LogisticRegression(), RandomForestClassifier().
param_grid : Dictionary or list of dictionaries with hyperparameters to try, e.g., {'C': [0.1, 1, 10], 'solver': ['liblinear', 'saga']}.
scoring : A string or callable to evaluate the performance, e.g., 'accuracy', 'f1', make_scorer().
n_jobs : Number of CPU cores to use for parallel processing. $-1$ means using all available cores.
cv : Cross-validation splitting strategy, like K-Fold. This can be an integer (number of folds) or a cross-validation object.
verbose : Controls the verbosity of the output.
refit : Whether to refit the model with the best parameters on the entire dataset after search.
error_score : Value to assign to the score if an error occurs in an estimator's fit method.
return_train_score : If True, the training scores will be returned along with the validation scores.
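
To illustrate refit and return_train_score, the sketch below tunes only the C parameter of a LogisticRegression and reads the per-combination results from cv_results_. The grid values and max_iter=1000 are assumptions chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10]},
    scoring="accuracy",
    cv=5,
    refit=True,               # retrain the best model on all of X, y
    return_train_score=True,  # also record training-fold scores
)
search.fit(X, y)

# cv_results_ holds one entry per parameter combination.
results = search.cv_results_
for params, train, val in zip(
    results["params"], results["mean_train_score"], results["mean_test_score"]
):
    print(params, f"train={train:.3f}", f"validation={val:.3f}")

print("Refitted best model:", search.best_estimator_)
```

A large gap between the training and validation scores for a combination can hint at overfitting, which is the main reason to turn on return_train_score.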

Example Usage

An example using GridSearchCV to tune a RandomForestClassifier on the Iris dataset.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Define the model
model = RandomForestClassifier(random_state=42)  # fixed seed so repeated runs give the same result

# Define the parameter grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Define the cross-validation strategy
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=kf, scoring='accuracy', n_jobs=-1, verbose=2)

# Fit GridSearchCV
grid_search.fit(X, y)

# Print the best parameters and the best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
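
The best score above is a cross-validated score on the same data used for tuning. A possible variation, sketched below, keeps a separate test set for the final evaluation. The reduced grid and the 80/20 split are assumptions made to keep the run short, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A reduced grid (an assumption, chosen only to keep the run short).
small_grid = {"n_estimators": [10, 50], "max_depth": [None, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)  # tuning sees only the training portion

# best_estimator_ was refit on X_train with the best parameters (refit=True is the default).
test_accuracy = search.best_estimator_.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", test_accuracy)
```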