FastAI Lesson 6: Random Forests

Practical Deep Learning for Coders: Lesson 6
Categories: AI, Engineering, FastAI
Author: Gurpreet Johl
Published: October 9, 2023

Random Forests

These are notes from lesson 6 of Fast AI Practical Deep Learning for Coders.

1. Background on Random Forests

It’s worth starting with a random forest as a baseline model because “it’s very hard to screw up”. Decision trees only consider the ordering of the values within each column, so they are not sensitive to outliers, skewed distributions, dummy variables, etc.

A random forest is an ensemble of trees. A tree is an ensemble of binary splits.

binary split -> tree -> forest

1.1. Binary Split

Pick a column of the dataset and split the rows into two groups based on its value. Predict the outcome based just on that.

For example, in the Titanic dataset, pick the Sex column. This splits the passengers into male vs female. If we predict that all female passengers survived and all male passengers died, that’s a reasonably accurate prediction.

A model which iterates through the variables and finds the best binary split is called a “OneR” or one rule model.

This was actually the state-of-the-art in the 90s and performs reasonably well across a lot of problems.
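As a rough sketch of what a single split looks like in code (assuming the Kaggle Titanic training CSV, with its Sex and Survived columns; the file path is illustrative):

```python
# A minimal sketch of the Sex split on the Titanic data.
import pandas as pd

df = pd.read_csv("titanic/train.csv")  # path is illustrative

# Predict: all female passengers survived, all male passengers died.
pred = df["Sex"].map({"female": 1, "male": 0})
accuracy = (pred == df["Survived"]).mean()
print(f"Split on Sex: {accuracy:.1%} accuracy")
```

A OneR model would repeat this for every candidate column (and, for numeric columns, every candidate threshold) and keep whichever split scores best.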

1.2. Decision Tree

This extends the “OneR” idea to two or more splits.

For example, we first split by sex, then within each group find the next best variable to split on, creating a two-layer tree.

Leaf nodes are the terminating nodes of the tree. OneR creates 2 leaf nodes. If we split one side again, we’ll have 3 leaf nodes. If we split the other side too, creating a balanced two-layer tree, we’ll have 4 leaf nodes.
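Scikit-learn can build this kind of small tree directly; a sketch reusing `df` from above (the column choices and preprocessing are illustrative, not the lesson’s exact setup):

```python
# A small two-layer tree on the Titanic data.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

X = pd.DataFrame({
    "Sex": (df["Sex"] == "female").astype(int),
    "Age": df["Age"].fillna(df["Age"].median()),
})
y = df["Survived"]

# max_leaf_nodes=4 caps the tree at four terminating nodes,
# i.e. the balanced two-layer tree described above.
tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=42).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```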

1.3. Random Forest

The idea behind bagging is that if we have many unbiased, uncorrelated models, we can average their predictions to get an aggregate model that is better than any of them.

Each individual model will overfit, so its prediction for a given point will be either too high or too low, i.e. it will have a positive or negative error term. Because the models are uncorrelated, these errors average out towards zero when we combine them.

An easy way to get many uncorrelated models is to only use a random subset of the data each time. Then build a decision tree for each subset.

This is the idea behind random forests.
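A hand-rolled sketch of that idea, using `X` and `y` from the earlier tree example (each tree sees a random 75% of the rows; a real random forest like scikit-learn’s also samples columns at each split, which is omitted here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, n_trees=30, frac=0.75):
    # Train each tree on a different random subset of the rows.
    trees = []
    for _ in range(n_trees):
        idx = np.random.choice(len(X), int(len(X) * frac), replace=False)
        trees.append(DecisionTreeClassifier(min_samples_leaf=5).fit(X.iloc[idx], y.iloc[idx]))
    return trees

def predict_bagged(trees, X):
    # Average the per-tree predicted probabilities.
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
```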

The error decreases with the number of trees, with diminishing returns.

Jeremy’s rule of thumb: improvements level off after about 30 trees and he doesn’t often use >100.

2. Attributes of Random Forests

2.1 Feature Importance

A nice side effect of using decision trees is that we get feature importance plots for free.

Gini is a measure of inequality, similar to the score used in the notebook to quantify how good a split was.

Intuitively, this measures the probability that, if you picked two samples at random from a group, they would belong to the same class. If the group is all one class, the probability is 1. If every item is different, the probability approaches 0.
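A sketch of that probability for a binary target like survival (drawing the two samples independently, i.e. with replacement), using the Titanic DataFrame from earlier:

```python
def same_class_probability(survived):
    # Probability that two samples drawn at random from the group
    # belong to the same class: p^2 + (1 - p)^2 for a binary target.
    p = survived.mean()
    return p**2 + (1 - p)**2

# e.g. compare the two sides of the Sex split from earlier
print(same_class_probability(df.loc[df["Sex"] == "female", "Survived"]))
print(same_class_probability(df.loc[df["Sex"] == "male", "Survived"]))
```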

The idea is for each split of a decision tree, we can track the column used to split and the amount that the gini decreased by. If we loop through the tree and accumulate the “gini reduction” for each column, we have a metric of how important that column was in splitting the dataset.

This makes random forests a useful first model when tackling a new big dataset, as it can give an insight into how useful each column is.
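With scikit-learn, this accumulated importance is exposed directly as `feature_importances_`; a sketch using the toy features from earlier:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=42)
rf.fit(X, y)

# One importance value per column, normalised to sum to 1.
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
importances.plot.barh()
```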

2.2. Explainability

A comment on explainability, comparing feature importance with techniques like SHAP and LIME: these all explain the model, so for the explanation to be useful the model needs to be accurate.

If you use feature importance from a bad model, the columns it claims are important might not actually be. So the usefulness of an explainability technique boils down to which models it can explain and how accurate those models are.

2.3. Out-of-bag Error

Each tree was trained on only a subset of the rows, so we can see how it performs on its held-out rows; this is the out-of-bag (OOB) error per tree.

We can average this over all of the trees to get an overall OOB error for the forest.
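Scikit-learn exposes this via `oob_score=True` (it aggregates the out-of-bag predictions across trees rather than averaging per-tree errors, but the idea is the same):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            min_samples_leaf=5, random_state=42)
rf.fit(X, y)

# Accuracy measured only on rows each tree did not see during training.
print(f"OOB accuracy: {rf.oob_score_:.1%}")
```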

2.4. Partial Dependence Plots

These are not specific to random forests and can be applied to any ML model including deep neural networks.

If we want to know how an independent variable affects the dependent variable, a naive approach would be to just plot them.

But this could be skewed by a confounding variable. For example, the price of bulldozers increases over time, but part of that increase is driven by the presence of air conditioning, which also became more common over time.

A partial dependence plot sets the column(s) of interest to a single value for every row, say year=1950, then uses our model to predict the target variable for each row. We repeat this for year=1951, 1952, etc.

We can then average the predictions for each year to get a view of how the target depends on that variable, all else being equal.
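A hand-rolled sketch of that loop; the model, feature table and "YearMade" column here are placeholders, not the lesson notebook’s exact names:

```python
import numpy as np

def partial_dependence(model, X, col, values):
    averages = []
    for v in values:
        X_mod = X.copy()
        X_mod[col] = v                    # force every row to the same value
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

# e.g. average predicted sale price if every bulldozer were made in each year
years = np.arange(1950, 2012)
curve = partial_dependence(bulldozer_model, X_bulldozers, "YearMade", years)
```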

3. Gradient-Boosting Forests

We again make multiple trees, but instead of fitting them all to different subsets of the data, we fit each tree to the residuals of the ones before it.

So we fit a very small tree (OneR even) to the data to get a first prediction. Then we calculate the error term. Then we fit another tree to predict the error term. Then calculate the second order error term. Then fit a tree to this, etc.

Then our prediction is the sum of these trees, rather than the average like with a random forest.
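A bare-bones sketch of that loop (the learning-rate shrinkage is an extra that most boosting libraries add, not part of the description above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=50, lr=0.1):
    trees, pred = [], np.zeros(len(y))
    for _ in range(n_trees):
        resid = y - pred                                         # current error term
        tree = DecisionTreeRegressor(max_depth=1).fit(X, resid)  # tiny (OneR-like) tree
        pred += lr * tree.predict(X)
        trees.append(tree)
    return trees

def predict_boosted(trees, X, lr=0.1):
    # The prediction is the (shrunken) sum of the trees, not their average.
    return sum(lr * t.predict(X) for t in trees)
```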

This is “boosting”: calculating an error term, then fitting another model to predict it.

Contrast this with “bagging”, which is when we fit multiple models to different subsets of the data and average them.
