💡 Cross Validation
A technique used to evaluate the performance and generalizability of a machine learning model.
- Training Set : The subset of the data used to train the model.
- Validation Set : The subset of the data used to evaluate the model's performance.
- Test Set : An independent subset of data used for the final evaluation of the model after cross-validation.
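The three subsets above can be sketched with two calls to train_test_split. The 60/20/20 proportions, the stratification, and the random seed are assumptions chosen for illustration, not values taken from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 3 balanced classes

# Step 1: hold out 20% as the final test set (assumed proportion).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: split the remaining 80% into training (60% overall)
# and validation (20% overall); 0.25 * 0.8 = 0.2.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # prints: 90 30 30
```

The test set is split off first so that nothing done during model selection can touch it.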
K-Fold Cross Validation
- The dataset is divided into k equal-sized folds.
- The model is trained k times, each time using k-1 folds for training and the remaining fold for validation.
- The performance metric is averaged across all k trials.
- Example : if k=5, the data is split into 5 folds. The model is trained 5 times, each time with a different fold as the validation set.
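To make the fold rotation concrete, here is a minimal sketch that prints the index split for each of the 5 folds. The 10-sample toy array is an assumption used only to keep the output readable.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 toy samples (illustrative data)
kf = KFold(n_splits=5)            # no shuffling, so folds are contiguous

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Each sample appears in the validation set of exactly one fold.
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```

Every fold trains on 8 samples and validates on the remaining 2.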
Stratified K-Fold
- Similar to K-Fold Cross-Validation, but ensures each fold has a proportional representation of different classes.
- This method is particularly useful for imbalanced datasets.
- Example : In a binary classification problem with a 70-30 class split, each fold will maintain this ratio.
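The 70-30 example can be checked with a small sketch. The label array below is synthetic (70 samples of class 0, 30 of class 1), and the feature matrix is a placeholder because only the labels affect the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 70 + [1] * 30)  # synthetic 70-30 labels (assumed data)
X = np.zeros((len(y), 1))          # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

val_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    # Count how many samples of each class land in the validation fold.
    counts = np.bincount(y[val_idx], minlength=2).tolist()
    val_counts.append(counts)
    print(f"Fold {fold}: class counts in validation = {counts}")
```

Each validation fold holds 14 samples of class 0 and 6 of class 1, which is the same 70-30 ratio as the full dataset.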
Leave-One-Out Cross-Validation (LOOCV)
- Each data point is used once as the validation set while the remaining data points form the training set.
- The process is repeated for each data point.
- This method can be computationally expensive for large datasets.
- Example : If the dataset has 100 points, the model is trained 100 times, each time with one point as the validation set.
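A minimal LOOCV sketch, assuming the Iris dataset (150 points) and a LogisticRegression model as in the other examples in these notes. With one point per validation set, each per-fold accuracy is either 0 or 1.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

loo = LeaveOneOut()
# One fit per data point: 150 fits for the 150 Iris samples.
scores = cross_val_score(model, X, y, cv=loo)

print("Number of fits:", len(scores))
print("LOOCV accuracy:", scores.mean())
```

The number of fits grows linearly with the dataset size, which is why the method becomes expensive on large datasets.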
Time Series Cross-Validation
- Used for time series data where the order of data points is important.
- The training set always includes observations from the past, and the validation set includes observations from the future.
- Example : For time series data split into 5 folds, the first fold might use data from the first 3 months for training and the next month for validation.
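scikit-learn offers TimeSeriesSplit for this kind of split. The sketch below assumes 12 toy observations (for instance, 12 months) and shows that the training window always ends before the validation window starts. Note that TimeSeriesSplit grows the training window with each fold, which is one possible scheme and not necessarily the exact one described above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 ordered observations (illustrative)
tscv = TimeSeriesSplit(n_splits=5)

splits = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    # Training indices always precede validation indices in time.
    splits.append((train_idx.tolist(), val_idx.tolist()))
    print(f"Fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```

The first fold trains on observations 0 and 1 and validates on 2 and 3; later folds extend the training window forward.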
Example Code for K-Fold Cross-Validation
- Uses the Iris dataset and a LogisticRegression model.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np
iris = load_iris()
X = iris.data
y = iris.target
model = LogisticRegression(max_iter=200)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print("Mean Accuracy:", np.mean(scores))
print("Standard Deviation:", np.std(scores))
💡 cross_val_score
A utility for evaluating the performance of a model using cross-validation.
- Automates Cross-Validation : Handles the splitting of data into folds, training, and evaluating the model for each fold.
- Flexible Scoring : Allows the use of various scoring metrics to evaluate the performance of the model.
- Supports Different Cross-Validation Strategies : Works with different cross-validation strategies such as K-Fold, stratified K-Fold, Leave-One-Out, and more.
Function Signature
cross_val_score(estimator, X, y=None, *, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
Parameters
- estimator : The model to be evaluated, e.g., LogisticRegression(), DecisionTreeClassifier(), etc.
- X : Features data (array-like or matrix).
- y : Target variable (array-like), default is None for unsupervised learning tasks.
- scoring : A string or callable that defines the metric to evaluate the model, e.g., 'accuracy', 'precision', 'recall', etc. Default is None, which uses the estimator's default scorer.
- cv : Cross-validation splitting strategy. This can be an integer (number of folds for K-Fold), a cross-validation object (like KFold or StratifiedKFold), or a generator.
- n_jobs : Number of CPU cores to use for parallel processing. -1 means using all available cores.
- verbose : Controls the verbosity of the output.
- fit_params : Parameters to pass to the fit method of the estimator.
- pre_dispatch : Controls the number of jobs dispatched during parallel processing.
- error_score : Value to assign to the score if an error occurs in an estimator's fit method.
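As a rough illustration of the cv and scoring parameters, the sketch below passes an integer for cv and compares two metrics. The choice of 'f1_macro' is an assumption made because Iris has three classes, so a metric that averages over classes is needed; for a classifier, an integer cv makes scikit-learn use stratified folds.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# cv=5 requests 5 folds; scoring selects the metric by name.
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy")
f1 = cross_val_score(model, X, y, cv=5, scoring="f1_macro")

print("Accuracy per fold:", acc)
print("Macro F1 per fold:", f1)
```

Both calls return one score per fold, so the arrays have length 5.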
Example Usage
A simple example using the Iris dataset and a LogisticRegression model.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
import numpy as np
iris = load_iris()
X = iris.data
y = iris.target
model = LogisticRegression(max_iter=200)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", np.mean(scores))
print("Standard Deviation:", np.std(scores))
💡 GridSearch
A technique used in machine learning to systematically explore a predefined space of hyperparameters for a model, in order to find the optimal combination that results in the best performance.
Key Concepts
- Hyperparameters : These are parameters that are not learned from the data but set prior to the training process. Examples include the learning rate for training neural networks, the depth of a decision tree, or the number of clusters in K-means.
- Parameter Grid : A predefined set of hyperparameter values to be explored. For example, a parameter grid might include different values for the number of trees in a random forest and different maximum depths.
- Exhaustive Search : Grid search evaluates every possible combination of hyperparameters in the parameter grid, ensuring that the best combination within the grid is found.
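To see what "every possible combination" means in practice, scikit-learn's ParameterGrid can enumerate a grid without training anything. The grid values below are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: 3 values x 2 values = 6 combinations.
param_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [None, 10],
}

combos = list(ParameterGrid(param_grid))
for params in combos:
    print(params)

print("Total combinations:", len(combos))  # 6
```

The number of combinations is the product of the list lengths, so each added hyperparameter multiplies the search cost.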
Using GridSearchCV
A tool in the scikit-learn library that helps in tuning hyperparameters of an estimator (model) to find the best possible combination for the highest performance.
- Parameters
- estimator : The model (estimator) whose hyperparameters need to be tuned, e.g., LogisticRegression(), RandomForestClassifier().
- param_grid : Dictionary or list of dictionaries with hyperparameters to try, e.g., {'C': [0.1, 1, 10], 'solver': ['liblinear', 'saga']}.
- scoring : A string or callable used to evaluate performance, e.g., 'accuracy', 'f1', or a scorer built with make_scorer().
- n_jobs : Number of CPU cores to use for parallel processing. -1 means using all available cores.
- cv : Cross-validation splitting strategy, like K-Fold. This can be an integer (number of folds) or a cross-validation object.
- verbose : Controls the verbosity of the output.
- refit : Whether to refit the model with the best parameters on the entire dataset after the search.
- error_score : Value to assign to the score if an error occurs in an estimator's fit method.
- return_train_score : If True, the training scores are returned along with the validation scores.
- Example Usage
An example using GridSearchCV to tune a RandomForestClassifier on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold
iris = load_iris()
X = iris.data
y = iris.target
model = RandomForestClassifier()
param_grid = {
'n_estimators': [10, 50, 100],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10]
}
kf = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=kf, scoring='accuracy', n_jobs=-1, verbose=2)
grid_search.fit(X, y)
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
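Beyond best_params_ and best_score_, the fitted search object also exposes the full results table and the refitted model. The sketch below uses a deliberately small grid (an assumption, to keep the run short) with refit=True and return_train_score=True.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50], "max_depth": [None, 3]},
    cv=5,
    scoring="accuracy",
    refit=True,               # retrain the best setting on all of X, y
    return_train_score=True,  # also record training scores per setting
)
search.fit(X, y)

# cv_results_ holds one entry per hyperparameter combination.
results = search.cv_results_
for params, val, train in zip(
    results["params"], results["mean_test_score"], results["mean_train_score"]
):
    print(params, "validation:", round(val, 3), "train:", round(train, 3))

# Because refit=True, best_estimator_ is ready for prediction.
best_model = search.best_estimator_
print("Predictions for first 3 samples:", best_model.predict(X[:3]))
```

A large gap between the training and validation scores for a setting is a common sign of overfitting.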