Adaptive Boosting, Simply Explained through Python

Vagif Aliyev
7 min read · Sep 3, 2020

Introduction to Ensemble Learning

Adaptive Boosting, or AdaBoost (as it’s known in Machine Learning lingo), is a specific type of Ensemble Learning technique. What is that? An Ensemble Learning technique basically refers to any model that aggregates the predictions of multiple predictors to come up with better predictions. This includes the likes of:

  • Random Forests: an ensemble of Decision Trees
  • Stacking: a group of multiple models, where the predictions are aggregated and another model is used to blend the predictions of all the predictors
  • Voting: similar to Stacking, but instead of using a blender model, it simply aggregates the predictions and picks the class (or class probability, the user can choose which) that gets the most votes
  • Bagging: this works by training multiple predictors on different random subsets of the dataset, where the same instance can be sampled several times for the same predictor. The advantage is that each predictor does not see all the data, so you can evaluate your model using an Out-Of-Bag score, which acts as a validation set. If you have not noticed, a Random Forest is an ensemble of bagged Decision Trees.
  • Pasting: same as Bagging, except that each predictor cannot sample the same instance several times, so there is no Out-Of-Bag score (a short sketch contrasting the two follows this list).
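To make the Bagging vs. Pasting distinction concrete, here is a minimal sketch (my own illustration, not part of the original article) using scikit-learn’s BaggingClassifier on the same breast cancer dataset used later in this tutorial; bootstrap=True gives Bagging with an Out-Of-Bag score, while bootstrap=False gives Pasting:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: sampling with replacement, so an Out-Of-Bag score is available
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, oob_score=True, random_state=42)
bagging.fit(X, y)
print("OOB score:", bagging.oob_score_)

# Pasting: sampling without replacement, so there is no Out-Of-Bag score
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=False, max_samples=0.8, random_state=42)
pasting.fit(X, y)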

This is a high-level overview of some of the ensemble techniques. If you want a deeper understanding, please refer to this link.

Boosting

AdaBoost is actually part of a specific branch of Ensemble Learning called Boosting. The main idea behind boosting is to combine the predictions of several basic models with low accuracy, or weak learners, to eventually build a model with high accuracy, or a strong learner.

In AdaBoost, the main concept is that each predictor is trained to correct its predecessor’s mistakes. This results in a sequence of predictors that focus more and more on the mistakes. This is the idea behind AdaBoost.

Now that you have some understanding of ensemble learning and boosting, let’s discuss how AdaBoost works.

Getting Technical with Adaptive Boosting

Some things to note about Adaptive Boosting:

  1. Adaptive Boosting predictors do not all have the same “say” in the final prediction. The amount of “say”, or in other words, the predictor’s weight, is determined by how accurate the predictor was. A higher accuracy results in a higher predictor weight.
  2. Adaptive Boosting works sequentially, so the order of the predictors matters.
  3. Adaptive Boosting combines multiple weak learners together to make predictions. These weak learners are most often stumps, which are decision trees that have a maximum depth of 1.

Ok, so bearing that in mind, let’s uncover the algorithm:

  1. First, AdaBoost gives each instance in the training set a certain weight (this is different from the predictor’s weight). To begin with, each instance is given an equal weight; specifically, the initial weight is 1/m (m being the number of training instances).
  2. After the instance weights have been set, AdaBoost trains the first weak learner on the data. From the predictor’s performance, it calculates the predictor’s weighted error rate, i.e. the weighted proportion of instances it misclassified. The formula for that is point 1 in the formulas shown after this list.

3. After that, it calculates the predictor’s weight using the error rate calculated before (point 2), where η is the learning rate hyperparameter and ln the natural logarithm. The more accurate the predictor, the higher its weight and the more say it has in the final prediction. If the predictor is close to random guessing, its weight will be close to 0. If it is worse than random guessing, its weight will be negative.

4. Next, the instance weights are updated (using the formula at point 3). The weights of the misclassified instances are boosted, and all weights are then normalised so that they sum to 1, which shrinks the relative weight of the correctly classified instances.

5. A new predictor is trained on the data with the updated weights, and the process is repeated. The algorithm stops when the desired number of predictors has been trained, or when a perfect predictor is found.

6. To make predictions, AdaBoost computes the predictions of all the predictors, weights them by their predictor weights, and picks the class that gets the most weighted votes.
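Since the original formula images are not reproduced here, below is a sketch of the standard AdaBoost formulas that points 1–3 refer to (in the usual textbook notation): w^(i) is the weight of instance i, m the number of instances, ŷ_j^(i) the j-th predictor’s prediction for instance i, y^(i) the true label, r_j the weighted error rate, α_j the predictor weight, and η the learning rate.

Point 1, the weighted error rate:

r_j = \frac{\sum_{i=1,\ \hat{y}_j^{(i)} \neq y^{(i)}}^{m} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}

Point 2, the predictor weight:

\alpha_j = \eta \, \ln\!\left(\frac{1 - r_j}{r_j}\right)

Point 3, the instance weight update, followed by normalisation so that the weights sum to 1:

w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\ w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)} \end{cases}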

To clarify the weighted-vote step (point 6), let’s say that AdaBoost has to classify whether a person has breast cancer or not, and let’s say that out of 100 predictors, 60 predicted 1 and 40 predicted 0.

To get the final prediction, AdaBoost sums the weights of the predictors that predicted 1 and the weights of the predictors that predicted 0, and makes the final classification based on the class with the larger weighted vote.
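As a toy sketch of that weighted vote (the weights and predictions here are made up purely for illustration), the final class is simply the one whose predictors’ weights sum highest:

import numpy as np

# hypothetical predictor weights and their class predictions (0 or 1)
predictor_weights = np.array([0.9, 0.4, 0.7, 0.2, 0.5])
predictions = np.array([1, 0, 1, 0, 1])

votes_for_1 = predictor_weights[predictions == 1].sum()  # 0.9 + 0.7 + 0.5 = 2.1
votes_for_0 = predictor_weights[predictions == 0].sum()  # 0.4 + 0.2 = 0.6
final_class = 1 if votes_for_1 > votes_for_0 else 0      # -> 1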

An important hyperparameter to note here is the learning rate: this basically controls how much the misclassified instances get boosted. In other words, a learning rate of 0.5 means that the misclassified instances are only boosted half as much at each iteration.

I know that some people like videos, so to get a visual understanding, please check out this amazing video by StatQuest.

Cementing our Knowledge by Coding an AdaBoost Classifier in Python

Ok, so now we have a theoretical and conceptual understanding of how AdaBoost works. Now, let’s take it to the code! Whip up your IDE of choice, and let’s get typing!

For this example, I am going to be using the breast cancer dataset that is conveniently built into scikit-learn:

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report

I’m also importing the AdaBoostClassifier class from sklearn, as well as KFold cross-validation and some evaluation metrics.

data = load_breast_cancer()
df = pd.DataFrame(data['data'],columns=data['feature_names'])
df['target'] = data['target']
df.head(5)

Here, I’m loading the data and converting it to a DataFrame just for simplicity, and because I personally like working with DataFrames! Feel free to skip this step if you prefer.

Usually, at this step, you would do some EDA, feature cleaning, feature engineering, etc., but the purpose of this tutorial is not an end-to-end ML project; it’s simply to demonstrate AdaBoost through Python. So for that reason, I am going straight to splitting the data for validation:

X = df.drop('target',axis=1)
y = df['target']

# 5-fold cross-validation splitter
kf = KFold(n_splits=5)
for train_index, val_index in kf.split(X):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]

Ok, so here I have defined my features and my target, and split my data using 5-fold cross-validation. Why is this better than good old train_test_split? Because K-Fold cross-validation splits the data into n parts (here 5) and rotates through them, so every fold gets a turn as the validation set. A single train-test split gives a much noisier estimate of performance, which makes it too primitive for most real-world use.
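If you want to see what the splitter is actually doing, here is a quick, purely illustrative sketch that prints each fold’s size. Note that the loop above keeps only the last fold’s split, and that is what gets used for training and validation below:

# Each of the 5 folds takes a turn as the validation set,
# so every instance is validated exactly once.
for fold, (train_index, val_index) in enumerate(kf.split(X)):
    print(f"Fold {fold}: {len(train_index)} training rows, {len(val_index)} validation rows")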

God damn, all this code and no AdaBoost? Well folks, the moment has arrived: modelling time!

ada_clf = AdaBoostClassifier()
ada_clf.get_params()
OUTPUT:
{'algorithm': 'SAMME.R',
'base_estimator': None,
'learning_rate': 1.0,
'n_estimators': 50,
'random_state': None}

It’s usually a good idea to put out a baseline model first, see what errors it makes, and then tune your parameters. But before we move on, let me break down the parameters and their meanings:

  • algorithm: the algorithm that I described up above is actually known as SAMME, or Stagewise Additive Modeling using a Multi-class Exponential loss function (I’ll stick with SAMME). However, if the predictors can estimate class probabilities, then we can use SAMME.R (R meaning “Real”). This relies on class probabilities and generally performs better than our friend SAMME.
  • base_estimator: this is the base estimator we will be using. More than likely, and by default, this will be a DecisionTreeClassifier with a max depth of 1 (a stump). A configured example follows this list.
  • learning_rate: as we have discussed before, this controls how much the misclassified instances are boosted.
  • n_estimators: the number of weak learners to train
  • random_state: a seed we set so that our results can be obtained over and over and are not stochastic.
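To make those parameters concrete, here is a sketch of an explicitly configured classifier. The values are purely illustrative, not tuned, and this custom_clf is not used in the rest of the tutorial:

from sklearn.tree import DecisionTreeClassifier

# a "stump": a Decision Tree with max_depth=1, which is also AdaBoost's default base estimator
stump = DecisionTreeClassifier(max_depth=1)

custom_clf = AdaBoostClassifier(base_estimator=stump,
                                n_estimators=200,
                                learning_rate=0.5,
                                algorithm='SAMME.R',
                                random_state=42)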

Let’s fit our model to the training set and make predictions on the validation set:

ada_clf.fit(X_train, y_train)
predictions = ada_clf.predict(X_val)

Another great thing about AdaBoost is that it does not require feature scaling, and really requires very little data preparation.

Ok, let’s evaluate our model:

print(classification_report(y_val,predictions))

OUTPUT:
              precision    recall  f1-score   support

           0       0.98      0.87      0.92        46
           1       0.92      0.99      0.95        67

    accuracy                           0.94       113
   macro avg       0.95      0.93      0.93       113
weighted avg       0.94      0.94      0.94       113

Wow! 94% accuracy with a vanilla AdaBoostClassifier! Awesome! In a real-world project, you would need to do more than this. I recommend you try RandomizedSearchCV or GridSearchCV to improve the model’s performance. Think of it as a little lockdown project!
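As a starting point for that lockdown project, here is a sketch of a RandomizedSearchCV over the two most impactful hyperparameters. The parameter ranges below are just guesses for illustration, not recommendations:

from sklearn.model_selection import RandomizedSearchCV

# illustrative search space; adjust the ranges to your own problem
param_distributions = {
    'n_estimators': [50, 100, 200, 500],
    'learning_rate': [0.01, 0.1, 0.5, 1.0],
}

search = RandomizedSearchCV(AdaBoostClassifier(random_state=42),
                            param_distributions,
                            n_iter=10,
                            cv=5,
                            scoring='accuracy',
                            random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)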

Again, I thank you for reading my article and I will be sure to keep ’em coming so stay tuned and safe :)
