Sequential Forward Selection (SFS) is a heuristic algorithm used in machine learning for feature selection. It is designed to reduce the dimensionality of the input feature set to improve the efficiency and performance of machine learning models. SFS iteratively adds features to a model based on a specified criterion until adding new features no longer improves the model's performance or until a predetermined number of features is reached.
Feature selection is crucial in machine learning to eliminate irrelevant or redundant features, reduce computational complexity, and enhance model interpretability. Among feature selection techniques, Sequential Forward Selection is a simple greedy wrapper method: it evaluates candidate subsets by the performance of the model itself and keeps the features that contribute most to prediction accuracy.
Sequential Forward Selection starts with an empty set of features and adds one feature at a time. At each step, it selects the feature that, when added to the current set, yields the largest improvement in model performance. The process repeats until the specified number of features is selected or no remaining feature improves model performance.
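The loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `sequential_forward_selection`, `toy_score`, and the integer utilities in `UTILITY` are hypothetical names and values invented for the example, and `toy_score` stands in for training a model and measuring its performance.

```python
from typing import Callable, List, Sequence, Tuple


def sequential_forward_selection(
    features: Sequence[str],
    score: Callable[[List[str]], float],
    max_features: int,
) -> Tuple[List[str], float]:
    """Greedily grow a feature subset, one feature per iteration."""
    selected: List[str] = []
    remaining = list(features)
    best_score = float("-inf")  # so the first feature is always accepted

    while remaining and len(selected) < max_features:
        # Score each subset formed by adding one not-yet-selected feature.
        scored = [(score(selected + [f]), f) for f in remaining]
        # Keep the candidate with the highest score (the first one wins ties).
        candidate_score, candidate = max(scored, key=lambda pair: pair[0])
        if candidate_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score

    return selected, best_score


# Hypothetical per-feature utilities (assumption, for illustration only).
UTILITY = {"a": 50, "b": 30, "c": 30, "d": 5}


def toy_score(subset: List[str]) -> float:
    """Stand-in for 'train a model on these features and evaluate it'."""
    total = sum(UTILITY[f] for f in subset) - 10 * len(subset)  # cost per feature
    if "b" in subset and "c" in subset:
        total -= 25  # "b" and "c" are treated as redundant with each other
    return total


selected, best = sequential_forward_selection(
    ["a", "b", "c", "d"], toy_score, max_features=4
)
print(selected, best)  # ['a', 'b'] 60
```

In this toy run the search stops after two features even though up to four are allowed, because neither remaining feature raises the score. That is the "no further improvement" stopping rule in action.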
Let $X$ represent the dataset with features $F = \{f_1, f_2, \ldots, f_n\}$. The objective is to find a subset $S \subseteq F$ that maximizes the performance metric $J(S)$, subject to $|S| = k$, where $k$ is the desired number of features.
The selection of the feature to add at each iteration can be mathematically defined as:
$$f^{*} = \underset{f \in F \setminus S_t}{\arg\max}\; J(S_t \cup \{f\})$$
where $S_t$ is the set of features selected up to the $t$-th iteration, and $J(S_t)$ is the performance metric of the model trained with the features in $S_t$.
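Because each iteration scores only the features that have not been selected yet, the number of candidate subsets grows far more slowly than in an exhaustive search. The short calculation below makes that concrete for a dataset with 13 features (the size of the wine dataset) when 3 are kept. `sfs_subsets_scored` is a hypothetical helper written for this illustration. It counts subsets scored, not model fits, so with cross-validation each subset would likely cost several fits.

```python
from math import comb


def sfs_subsets_scored(n_total: int, n_select: int) -> int:
    """Subsets scored by forward selection when it runs for n_select iterations."""
    # Iteration t (starting at 0) scores one subset per feature not yet selected.
    return sum(n_total - t for t in range(n_select))


n_total, n_select = 13, 3  # 13 available features, 3 to keep
greedy = sfs_subsets_scored(n_total, n_select)  # 13 + 12 + 11
exhaustive = comb(n_total, n_select)            # every possible 3-feature subset
print(greedy, exhaustive)  # 36 286
```

The trade-off is that the greedy search is a heuristic: it never revisits an earlier choice, so it may miss the subset an exhaustive search would find.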
- estimator : Estimator
- n_features : float | int, default = 1
- metric : Evaluator
- test_size : float, default = 0.2
- cv : int, default = 5
- shuffle : bool, default = True
- stratify : bool, default = False
- fold_type : FoldType, default = KFold
- random_state : float
from luma.reduction.selection import SFS
from luma.classifier.discriminant import LDAClassifier
from luma.model_selection.fold import StratifiedKFold
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
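# Load the wine dataset (178 samples, 13 features, 3 classes)
X, y = load_wine(return_X_y=True)
# Select 3 features, using an LDA classifier as the estimator
model = SFS(estimator=LDAClassifier(),
            n_features=3,
            test_size=0.3,
            cv=10,
            shuffle=True,
            stratify=True,
            fold_type=StratifiedKFold,
            random_state=42)
X_3d = model.fit_transform(X, y)
# Repeat the selection with 2 features for the 2D plot
model.set_params(n_features=2)
X_2d = model.fit_transform(X, y)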
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax2 = fig.add_subplot(1, 2, 2)
ax1.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=y)
ax1.set_xlabel(r"$x_1$")
ax1.set_ylabel(r"$x_2$")
ax1.set_zlabel(r"$x_3$")
ax1.set_title(f"3 Features Selected via {type(model).__name__}")
ax2.scatter(X_2d[:, 0], X_2d[:, 1], c=y, alpha=0.8)
ax2.set_xlabel(r"$x_1$")
ax2.set_ylabel(r"$x_2$")
ax2.set_title(f"2 Features Selected via {type(model).__name__}")
ax2.grid(alpha=0.2)
plt.tight_layout()
plt.show()
Sequential Forward Selection is versatile and can be applied across a wide range of fields.