One of the most powerful ways of training models is to train multiple models and aggregate their predictions. This is the main concept of Ensemble Learning. While many flavours of Ensemble Learning exist, some of the most powerful are Boosting Algorithms. In my previous article, I broke down one of the most popular Boosting Algorithms: Adaptive Boosting. Today, I want to talk about its equally powerful twin: Gradient Boosting.
Boosting & Adaptive Boosting vs Gradient Boosting
Boosting refers to any Ensemble Method that can combine several weak learners (predictors with poor accuracy) into a strong learner (a predictor with high accuracy). The idea behind boosting is to train models sequentially, each trying to correct its predecessor.
An Overview Of Adaptive Boosting
In Adaptive Boosting, the algorithm assigns a weight to each training instance and trains a weak learner on the weighted data. The predictor is then assigned its own separate weight, based on its weighted error rate. The higher the accuracy of the predictor, the higher its weight, and the more “say” it will have on the final prediction.
Once the predictor has made predictions, AdaBoost looks at the misclassified instances and boosts their instance weights. After normalising the instance weights so that they sum to 1, a new predictor is trained, and the process is repeated until a desirable output is reached or a threshold is reached.
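One round of this reweighting can be sketched with NumPy. The labels and predictions are made up, and the update rule (a predictor weight of log((1 - r) / r) for weighted error rate r, with misclassified weights multiplied by its exponential) is one common formulation that I am assuming here, not the only one.

```python
import numpy as np

# Made-up labels and predictions from one weak learner (one mistake, index 2).
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Every instance starts with the same weight.
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error rate of the predictor.
misclassified = y_pred != y_true
error_rate = weights[misclassified].sum() / weights.sum()

# Predictor weight ("say"): larger when the error rate is lower.
predictor_weight = np.log((1 - error_rate) / error_rate)

# Boost the weights of the misclassified instances, then normalise to sum to 1.
weights[misclassified] *= np.exp(predictor_weight)
weights /= weights.sum()

print(error_rate, predictor_weight, weights)
```

After the update, the single misclassified instance carries far more weight than the others, so the next weak learner is pushed to get it right.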
The final classification is done by taking a weighted vote. In other words, if we were predicting heart disease in a patient, and 60 stumps predicted 1 and 40 predicted 0, but the predictors voting 0 had a higher cumulative weight (i.e. those predictors had more “say”), then the final prediction would be 0.
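Here is a small sketch of that weighted vote. The predictor weights are made-up numbers, chosen so that the 40 stumps voting 0 outweigh the 60 voting 1; they do not come from a trained AdaBoost model.

```python
import numpy as np

# Hypothetical ensemble: 60 stumps vote 1, 40 stumps vote 0.
votes = np.array([1] * 60 + [0] * 40)

# Made-up predictor weights: the stumps voting 0 are assumed to be more
# accurate, so each of them carries a larger weight.
predictor_weights = np.array([0.5] * 60 + [1.0] * 40)

# Add up the weights behind each class and pick the heavier side.
weight_for_1 = predictor_weights[votes == 1].sum()
weight_for_0 = predictor_weights[votes == 0].sum()
final_prediction = int(weight_for_1 > weight_for_0)

print(weight_for_1, weight_for_0, final_prediction)
```

Even though class 1 wins the head count, class 0 wins on cumulative weight, so the final prediction is 0.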
In contrast to Adaptive Boosting, instead of sequentially boosting the weights of misclassified instances, Gradient Boosting trains each new predictor on its predecessor’s residuals. Woah, hold it. What?
Ok, so let’s break down the model’s steps:
- The first thing Gradient Boosting does is start off with a Dummy Estimator. Basically, it calculates the mean of the target values and uses it as the initial prediction for every instance. Using these predictions, it calculates the difference between the actual value and the predicted value for each instance. These differences are called the residuals.
- Next, instead of training a new estimator on the data to predict the target, it trains an estimator to predict the residuals of the first predictor. This predictor is usually a Decision Tree with certain limits, such as the maximum number of leaf nodes allowed. If several instances’ residuals fall in the same leaf node, it takes their average and uses that as the leaf node’s value.
- Next, to make predictions, for each instance, it adds the base estimator’s value onto the Decision Tree’s predicted residual value of the instance to make a new prediction. It then calculates the residuals again between the predicted and actual value.
- This process is repeated until a certain threshold is reached or the residual difference is very small.
- To make a prediction for an unseen instance, it gives the instance to each and every decision tree made, sums their predictions and adds the base estimator’s value.
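The first three steps can be sketched on a toy problem. The data is synthetic and the limit of 4 leaf nodes is an arbitrary choice of mine. The loop in the middle checks that each leaf’s value really is the average residual of the training instances that land in it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy straight line.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=2.0, size=100)

# Step 1: the dummy estimator predicts the mean for every instance.
base = y.mean()
residuals = y - base

# Step 2: a small tree is trained on the residuals, not on y itself.
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0)
tree.fit(X, residuals)

# Each leaf's value is the average residual of the instances inside it.
leaf_ids = tree.apply(X)
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    assert np.isclose(tree.predict(X[in_leaf])[0], residuals[in_leaf].mean())

# Step 3: new prediction = base value + predicted residual, then new residuals.
new_prediction = base + tree.predict(X)
new_residuals = y - new_prediction

print(np.mean(residuals ** 2), np.mean(new_residuals ** 2))
```

The second printed number is the training MSE after one tree, which is lower than the MSE of the mean-only baseline printed first.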
An important hyperparameter to take note of here is the learning rate. It scales the contribution of each tree, essentially trading a little extra bias for lower variance. So in steps 3 and 4, each tree’s predicted residual is multiplied by the learning rate before it is added to the running prediction, which tends to give better generalisation on unseen data.
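To see what that scaling does to a single boosting step, here is a minimal sketch on synthetic data (the sine-shaped target and `max_depth=2` are assumptions of mine). It compares the training MSE after one tree when the tree’s output is added in full versus when it is scaled by 0.1.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Mean baseline and one tree trained on its residuals.
base = y.mean()
residuals = y - base
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)

def mse_after_one_tree(learning_rate):
    """Training MSE after adding one tree scaled by the learning rate."""
    prediction = base + learning_rate * tree.predict(X)
    return np.mean((y - prediction) ** 2)

baseline_mse = np.mean(residuals ** 2)
full_step_mse = mse_after_one_tree(1.0)
small_step_mse = mse_after_one_tree(0.1)

print(baseline_mse, small_step_mse, full_step_mse)
```

With a learning rate of 0.1 the tree only corrects a small part of the error, so the training MSE lands between the baseline and the full-step value. The remaining error is left for later trees to pick up, which is why a small learning rate usually goes hand in hand with more trees.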
A hands-on example of Gradient Boosting Regression with Python & Scikit-Learn
Some of the concepts might still be unfamiliar in your mind, so, in order to learn, one must apply! Let’s build a Gradient Boosting Regressor to predict house prices using the infamous Boston Housing Dataset. Without further ado, let’s get started!
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
Ok, so we do some basic imports, along with our dataset, which is conveniently built into scikit-learn, and KFold cross-validation for splitting our data into a train set and a validation set. We also import the DecisionTreeRegressor as well as the GradientBoostingRegressor.
df = pd.DataFrame(load_boston()['data'],columns=load_boston()['feature_names'])
df['y'] = load_boston()['target']
Here, we just convert our data into a DataFrame for convenience
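One practical caveat: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the lines above only run on older releases. As a stand-in, the sketch below builds the same kind of DataFrame from `load_diabetes`, a different regression dataset that ships with scikit-learn. The swap is my own assumption, and results on it will not match anything reported for the Boston data.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Load once and reuse the result instead of calling the loader repeatedly.
data = load_diabetes()

# Same layout as in the article: feature columns plus a target column 'y'.
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["y"] = data["target"]

X, y = df.drop("y", axis=1), df["y"]
print(df.shape)
```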
X,y = df.drop('y',axis=1),df['y']
kf = KFold(n_splits=5,random_state=42,shuffle=True)
for train_index,val_index in kf.split(X):
    X_train,X_val = X.iloc[train_index],X.iloc[val_index]
    y_train,y_val = y.iloc[train_index],y.iloc[val_index]
Here, we initialise our features and our target, and use 5 Fold cross validation to split our dataset into a training set and a validation set.
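Note that the loop above overwrites `X_train`, `X_val`, `y_train` and `y_val` on every iteration, so only the last of the five splits is actually used afterwards. If you want a score from every fold instead, a sketch along these lines would do it. It uses synthetic data from `make_regression` as a stand-in, which is an assumption of mine.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mse = []
for train_index, val_index in kf.split(X):
    # Train a fresh model on this fold's training part...
    model = GradientBoostingRegressor(random_state=42)
    model.fit(X[train_index], y[train_index])
    # ...and score it on the part that was held out.
    fold_mse.append(mean_squared_error(y[val_index], model.predict(X[val_index])))

print(fold_mse, np.mean(fold_mse))
```

Averaging over the folds gives a steadier estimate than a single split, at the cost of training five models.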
Before I go ahead and implement scikit-learn’s GradientBoostingRegressor, I would like to make a custom one of my own, just to help illustrate the concepts I wrote about earlier.
First, we create our initial predictions to be just the average of the training label values and assign our learning rate to be 0.1:
base = [y_train.mean()] * len(y_train)
learning_rate = 0.1
Then, we calculate the residuals and get the MSE:
residuals_1 = y_train - base
mean_squared_error(y_train,base)
Well, not so bad, considering we just predicted the mean value the whole time!
After that, we create our first tree and train it on the residuals. Again, we will get the MSE of our predictions, still on the training set:
dtree_1 = DecisionTreeRegressor(random_state=42)
dtree_1.fit(X_train,residuals_1)
predictions_dtree_1 = base + learning_rate * dtree_1.predict(X_train)
Ok, so a slight improvement, already showing the power of Gradient Boosting! Again, we get the residuals:
residuals_2 = y_train - predictions_dtree_1
dtree_2 = DecisionTreeRegressor(random_state=42)
dtree_2.fit(X_train,residuals_2)
And we get the MSE by making predictions, but note how we combined the predicted value of the first predictor with the new predictor’s values:
predictions_dtree_2 = ((dtree_2.predict(X_train) * learning_rate)) + predictions_dtree_1
Wow, that was a big leap indeed! Now, let’s keep training a few more predictors to see what we can achieve:
residuals_3 = y_train - predictions_dtree_2
dtree_3 = DecisionTreeRegressor(random_state=42)
dtree_3.fit(X_train,residuals_3)
predictions_dtree_3 = (dtree_3.predict(X_train) * learning_rate) + predictions_dtree_2
residuals_4 = y_train - predictions_dtree_3
dtree_4 = DecisionTreeRegressor(random_state=42)
dtree_4.fit(X_train,residuals_4)
predictions_dtree_4 = (dtree_4.predict(X_train) * learning_rate) + predictions_dtree_3
So we definitely improved our score, but now it’s time for the ultimate test: the validation set!
To make a final prediction, we do the following:
initial prediction (the mean of the target values) + learning rate * the sum of:
- predicted residual values from tree 1
- predicted residual values from tree 2
- predicted residual values from tree 3
- predicted residual values from tree 4
# base[0] is the training mean; it is added once and is not scaled
y_pred = (
    base[0]
    + learning_rate * dtree_1.predict(X_val)
    + learning_rate * dtree_2.predict(X_val)
    + learning_rate * dtree_3.predict(X_val)
    + learning_rate * dtree_4.predict(X_val)
)
And the result (Drumroll please…):
Fantastic! Not only did it fit the training set well, it also generalised smoothly on the validation set! Why? Because we used a learning rate to control the size of each tree’s contribution, which helps keep the ensemble from overfitting the data.
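Writing out every tree by hand gets tedious, so here is a sketch that folds the same procedure into a loop. Several things here are assumptions of mine rather than the setup used above: the data is synthetic (`make_regression`), the split is a single `train_test_split`, the trees are limited to `max_depth=3`, and 50 trees are trained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data and a single train/validation split (assumptions).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

learning_rate = 0.1
n_trees = 50

# Start from the mean of the training targets.
base = y_train.mean()
train_pred = np.full(len(y_train), base)

trees = []
for _ in range(n_trees):
    # Each tree is trained on what the ensemble still gets wrong.
    residuals = y_train - train_pred
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X_train, residuals)
    # Add the scaled correction to the running prediction.
    train_pred += learning_rate * tree.predict(X_train)
    trees.append(tree)

# Unseen data: base value plus the scaled sum of every tree's prediction.
val_pred = base + learning_rate * sum(tree.predict(X_val) for tree in trees)

baseline_val_mse = mean_squared_error(y_val, np.full(len(y_val), base))
boosted_val_mse = mean_squared_error(y_val, val_pred)
print(baseline_val_mse, boosted_val_mse)
```

Comparing the two printed values shows how much the ensemble improves on simply predicting the mean for the held-out data.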
Gradient Boosting with Scikit-Learn’s GradientBoostingRegressor
We have now manually made a basic gradient boosting algorithm, but now let us code out a Gradient Boosting Regressor using scikit-learn!
Using the same data as above:
gradient_booster = GradientBoostingRegressor(loss='ls',learning_rate=0.1)
Now, this model has a lot of parameters, so it is worth mentioning the most important ones:
learning_rate: exactly the same parameter as we discussed above; it scales the contribution of each tree
init: the initial estimator, which equates to the DummyEstimator by default (aka it predicts the mean for everything)
max_depth: the maximum depth you want your trees to grow
n_estimators: the number of trees you want to create
criterion: the function used to measure the quality of a split when each decision tree is searching for the best feature and threshold to split the data on
loss: the loss function to optimise, which determines how the residuals are calculated (the default is “ls”, or least squares; scikit-learn 1.0 and later call it “squared_error”)
max_leaf_nodes: the maximum number of leaf nodes you want to have for each tree. If this number is smaller than the number of training instances, and if two or more instances are in the same leaf, then the leaf’s value will be the average of all the training instance values in that leaf.
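As a rough illustration, here is how several of these parameters look when set explicitly. The values are arbitrary choices of mine and the data is synthetic. I have left `loss` and `criterion` at their defaults on purpose, because the accepted string values changed between scikit-learn releases.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

model = GradientBoostingRegressor(
    learning_rate=0.1,   # scales each tree's contribution
    n_estimators=100,    # number of trees to build
    max_depth=4,         # depth limit for every tree
    max_leaf_nodes=6,    # leaf limit for every tree
    init=None,           # default initial estimator (predicts the mean)
    random_state=42,
)
model.fit(X, y)

# estimators_ holds the fitted trees, one per boosting stage.
leaf_counts = [stage[0].get_n_leaves() for stage in model.estimators_]
print(model.estimators_.shape, max(leaf_counts))
```

Checking the leaf counts confirms that no tree grows beyond the `max_leaf_nodes` limit.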
Let’s fit our model to the dataset and get its R2 score:
And let’s get its R2 Score on the validation set:
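A sketch of this fit-and-score step is below. For a regressor, `score` returns the R2 of the predictions. The synthetic data and the single `train_test_split` are assumptions of mine, so the numbers it prints say nothing about the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

gradient_booster = GradientBoostingRegressor(learning_rate=0.1, random_state=42)
gradient_booster.fit(X_train, y_train)

# R2 on the data the model was trained on, and on data it has not seen.
train_r2 = gradient_booster.score(X_train, y_train)
val_r2 = gradient_booster.score(X_val, y_val)

print(train_r2, val_r2)
```

A training score that sits noticeably above the validation score is the usual sign of overfitting.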
Ok, so our algorithm is slightly overfitting. Let’s adjust the learning_rate parameter to see if we can get better results:
gradient_booster = GradientBoostingRegressor(loss='ls',learning_rate=0.25)
gradient_booster.fit(X_train,y_train)
gradient_booster.score(X_train,y_train)
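This is a much better result on the training set. Bear in mind, though, that raising the learning rate lets each tree contribute more, which generally lowers bias at the cost of higher variance, so it is the validation score that tells us whether the change really helped. Finally, let’s get the ensemble’s MSE on the validation set: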
predictions = gradient_booster.predict(X_val)
mean_squared_error(y_val,predictions)
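Rather than trying one learning rate at a time, you could compare a handful of them by their validation MSE. Here is a minimal sketch, assuming synthetic data and a candidate list that I picked arbitrarily.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train one model per candidate learning rate and record its validation MSE.
results = {}
for lr in (0.05, 0.1, 0.25, 0.5):
    model = GradientBoostingRegressor(learning_rate=lr, random_state=42)
    model.fit(X_train, y_train)
    results[lr] = mean_squared_error(y_val, model.predict(X_val))

# The candidate with the lowest validation error.
best_lr = min(results, key=results.get)
print(results, best_lr)
```

Which value wins depends on the data and on the number of trees, so treat the winner as a starting point and not as a rule.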
Ok, so that wraps up this one, folks! I hope you enjoyed it, but my Gradient Boosting days are not over yet! I will be back, and in the next articles I will be talking about Gradient Boosting for Classification and, more importantly, the big boy: XTREME GRADIENT BOOSTING!