Suppose you have trained a few classifiers, each one achieving about 80% accuracy.
A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes.
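For example, here is a minimal sketch of such a majority-vote (hard voting) ensemble using Scikit-Learn's VotingClassifier; the moons dataset is just an illustrative stand-in for whatever data you are working with:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; any classification data would do.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Aggregate three diverse classifiers by majority ("hard") vote.
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("svc", SVC(random_state=42)),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ],
    voting="hard",
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))  # often a bit higher than any single classifier
```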
Similarly, suppose you build an ensemble containing 1,000 classifiers that are individually correct only 51% of the time (barely better than random guessing). If you predict the majority-voted class, the ensemble can be correct roughly 73% of the time.
However, this is only true if all classifiers are perfectly independent, making uncorrelated errors, which is clearly not the case because they are trained on the same data.
Ensemble methods work best when the predictors are as independent from one another as possible.
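To see where that figure comes from, here is a quick sketch of the binomial calculation under the (unrealistic) assumption of perfectly independent classifiers; it uses SciPy, which is an extra dependency assumed to be available:

```python
from scipy.stats import binom

n, p = 1000, 0.51  # 1,000 classifiers, each correct with probability 0.51
# The majority vote is correct when more than half of the classifiers are correct.
p_majority_correct = 1 - binom.cdf(n // 2, n, p)
print(round(p_majority_correct, 2))  # ~0.73, but only if the errors were independent
```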
You can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers.
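This is known as soft voting. A minimal sketch, reusing the classifiers and training data from the hard-voting snippet above; the only changes are voting="soft" and probability=True for the SVC so that it can estimate class probabilities:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Same ensemble as before, but averaging predicted class probabilities ("soft" voting).
# SVC does not expose predict_proba() unless probability=True.
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ],
    voting="soft",
)
voting_clf.fit(X_train, y_train)  # X_train, y_train from the hard-voting snippet
```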
Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set.
When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance.
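As a sketch, here is a bagging ensemble of 500 Decision Trees built with Scikit-Learn's BaggingClassifier, reusing the training data from the earlier snippet; switching bootstrap to False would give pasting instead:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 500 trees, each trained on 100 instances sampled with replacement (bagging).
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,    # set to False for pasting (sampling without replacement)
    n_jobs=-1,
    random_state=42,
)
bag_clf.fit(X_train, y_train)   # X_train, y_train from the earlier snippet
y_pred = bag_clf.predict(X_test)
```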
With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all.
The training instances that are not sampled for a given predictor are called out-of-bag (oob) instances. Since a predictor never sees the oob instances during training, it can be evaluated on these instances without the need for a separate validation set.
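A sketch of requesting this automatic out-of-bag evaluation with the oob_score parameter, again on the earlier training set:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# oob_score=True asks for an automatic out-of-bag evaluation after training.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,
    n_jobs=-1,
    random_state=42,
)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)  # accuracy estimated from the out-of-bag instances
```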
The BaggingClassifier class supports sampling the features as well. Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
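A sketch using the max_features and bootstrap_features parameters; the 0.8 and 0.5 ratios are arbitrary illustrative choices, and feature sampling is most useful on datasets with many features:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Sample both training instances and features for each predictor.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.8, bootstrap=True,             # sample 80% of the instances
    max_features=0.5, bootstrap_features=True,   # sample 50% of the features, with replacement
    n_jobs=-1,
    random_state=42,
)
bag_clf.fit(X_train, y_train)
```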
The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features.
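A minimal Random Forest sketch (the hyperparameter values are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

# 500 trees, each limited to 16 leaf nodes; n_jobs=-1 uses all CPU cores.
rnd_clf = RandomForestClassifier(
    n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42
)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
```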
Another great quality of Random Forests is that they make it easy to measure the relative importance of each feature.
Scikit-Learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1.
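For example, on the iris dataset (used here purely as a familiar illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# The importances sum to 1; a higher score means the feature reduced impurity more.
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(name, round(score, 3))
```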
Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner.
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted.
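This is the idea behind AdaBoost. A sketch of an AdaBoost ensemble of shallow Decision Trees ("decision stumps"); the hyperparameter values are illustrative:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 200 decision stumps; each round increases the weight of misclassified instances.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train, y_train)
```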
Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor.
However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.
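A minimal sketch of that idea using Decision Tree regressors on a small, made-up quadratic dataset; Scikit-Learn's GradientBoostingRegressor automates the same procedure (and adds a learning rate):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up 1D quadratic data, just for illustration.
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

# The first tree fits the data; each subsequent tree fits the previous residuals.
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

y2 = y - tree_reg1.predict(X)          # residuals of the first tree
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

y3 = y2 - tree_reg2.predict(X)         # residuals of the second tree
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)

# The ensemble predicts by summing the predictions of all the trees.
X_new = np.array([[0.2]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
```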
Stacking (short for stacked generalization) is based on a simple idea: instead of using a trivial function (such as hard voting) to aggregate the predictions of all the predictors in an ensemble, train a model to perform this aggregation.
Consider a simple regression task: each of three base predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0).
To train the blender, a common approach is to use a hold-out set.
It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression), to get a whole layer of blenders.
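Scikit-Learn's StackingRegressor implements this pattern; note that by default it uses cross-validated (out-of-fold) predictions to train the blender, which plays the same role as a hold-out set. A sketch on made-up regression data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Made-up regression data, just for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

stacking_reg = StackingRegressor(
    estimators=[
        ("lin", LinearRegression()),
        ("rf", RandomForestRegressor(random_state=42)),
        ("svr", SVR()),
    ],
    final_estimator=LinearRegression(),  # the blender / meta learner
    cv=5,  # out-of-fold predictions play the role of a hold-out set
)
stacking_reg.fit(X, y)
```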