Boosting and XGBoost

In this video we talk about boosting, gradient boosting, and XGBoost (Extreme Gradient Boosting).

Like bagging and pasting, boosting is an ensemble method that combines a number of weak learners to create a strong one.

A weak learner is simply a model that performs at least slightly better than chance.

Boosting algorithms are additive. Consider this example from Gradient boosting: Distance to target by Terence Parr and Jeremy Howard. They ask us to imagine writing the formula for y that matches this plot:

Like the decision tree example above, our first approximation might be simple, perhaps just the y-intercept:

\[y = 30\]

as shown in the leftmost picture below.

Next, we may want to add in the slope of the line and get

\[y = 30 + x\]

which gives us the middle graph. Finally, we add in the squiggle:

\[y = 30 + x + \sin(x)\]

We have decomposed a complex task into subtasks, each refining the previous approximation. So, again, we have an additive algorithm.
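
To see the additive idea in code, here is a minimal sketch (the x-range, the grid of points, and the error printout are my own illustration, not from the article) showing how each added term shrinks the error left by the previous approximation:

```python
import numpy as np

# Minimal sketch of the additive decomposition above; the x-range and the
# error metric are illustrative assumptions.
x = np.linspace(0, 10, 200)
y = 30 + x + np.sin(x)          # the curve we are trying to match

f1 = np.full_like(x, 30.0)      # first approximation: just the y-intercept
f2 = f1 + x                     # add the slope
f3 = f2 + np.sin(x)             # add the squiggle

for name, approx in [("y = 30", f1), ("y = 30 + x", f2), ("y = 30 + x + sin(x)", f3)]:
    print(f"{name:22} mean absolute error: {np.mean(np.abs(y - approx)):.3f}")
```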

This approach shouldn’t be surprising to us since this is how we typically develop programs. We get some skeleton code working and then incrementally add to it.

Boosting

Boosting algorithms work in a similar additive fashion. We first develop a simple model that roughly classifies the data. Next, we add another simple model that is focused on ameliorating the errors of the first. And then we add another and another.

\[\text{boosting} = \text{model}_1 + \text{model}_2 + \text{model}_3 + \dots + \text{model}_n\]
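
One way to picture that sum in code is gradient boosting on residuals: start with a constant model, then repeatedly fit a small tree to whatever the current ensemble still gets wrong and add it in. This is only a sketch; the synthetic data, tree depth, learning rate, and number of rounds are assumptions, not the exact recipe of any particular library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Sketch of boosting = model_1 + model_2 + ... + model_n for regression:
# each small tree is fit to the residuals left by the models before it.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 30 + X[:, 0] + np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

prediction = np.full(len(y), y.mean())   # model_1: just a constant
learning_rate = 0.1                      # shrink each correction a little
trees = []

for _ in range(100):                     # model_2 ... model_n
    residuals = y - prediction           # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MAE:", np.mean(np.abs(y - prediction)))
```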

How boosting differs from bagging and pasting

With bagging and pasting we created a number of decision trees each of which was trained on different data. One tree did not influence the construction of another. Thus, each classifier was independent of the others.
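
For contrast, this is roughly what that independent setup looks like with scikit-learn's BaggingClassifier (the synthetic dataset and parameter values here are assumptions; by default the base estimator is a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Each of the 50 trees is trained on its own bootstrap sample, independently
# of the others; with bootstrap=False this becomes pasting.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X, y)
print("training accuracy:", bag.score(X, y))
```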

Boosting

Boosting is different.

Imagine that we create one decision tree classifier. Let’s call it Classifier 1. Classifier 1 doesn’t perform with 100% accuracy.

Next we create a second decision tree classifier, and as part of its training data we will use the instances that Classifier 1 got wrong. Classifier 2 isn't perfect either, and there will be some instances that both Classifier 1 and Classifier 2 got wrong. You guessed it: we will use those instances as part of the training data for Classifier 3.
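
Here is a rough sketch of that procedure taken literally: each new classifier's training set gets an extra copy of the instances every previous classifier missed. (Real boosting algorithms such as AdaBoost reweight instances instead of copying them; the dataset, the depth-1 trees, and the three rounds below are illustrative assumptions.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

missed_by_all = np.ones(len(y), dtype=bool)   # instances every classifier so far got wrong
classifiers = []

for round_number in range(1, 4):
    # duplicate the "hard" instances so the new classifier focuses on them
    extra = np.where(missed_by_all)[0] if classifiers else np.array([], dtype=int)
    train_idx = np.concatenate([np.arange(len(y)), extra])

    clf = DecisionTreeClassifier(max_depth=1, random_state=round_number)
    clf.fit(X[train_idx], y[train_idx])
    classifiers.append(clf)

    missed_by_all &= clf.predict(X) != y      # still wrong after this round too
    print(f"Classifier {round_number}: {missed_by_all.sum()} instances missed by every classifier so far")
```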

