Gradient Boosted Regression Trees 2
Regularization
GBRT provides three knobs to control overfitting: tree structure, shrinkage, and randomization.
Tree Structure
The depth of the individual trees is one aspect of model complexity. It controls the degree of feature interactions that your model can fit. For example, if you want to capture the interaction between a feature latitude and a feature longitude, your trees need a depth of at least two. Unfortunately, the degree of feature interactions is not known in advance, but it is usually fine to assume that it is fairly low; in practice, a depth of 4-6 usually gives the best results. In scikit-learn you can constrain the depth of the trees using the max_depth argument.
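As a rough illustration of why tree depth bounds the interaction order, the sketch below fits GBRT with max_depth=1 and max_depth=2 to synthetic data whose target depends only on the interaction of two features. The data, the names X_demo, y_demo and scores, and the hyperparameters are assumptions made for this example; they are not part of the original tutorial.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (assumption for illustration): the target is determined
# purely by the interaction of the two features, similar to an XOR pattern.
rng = np.random.RandomState(0)
X_demo = rng.uniform(-1, 1, size=(2000, 2))
y_demo = np.sign(X_demo[:, 0] * X_demo[:, 1])

X_tr, X_te = X_demo[:1500], X_demo[1500:]
y_tr, y_te = y_demo[:1500], y_demo[1500:]

scores = {}
for depth in (1, 2):
    est = GradientBoostingRegressor(n_estimators=200, max_depth=depth,
                                    learning_rate=0.1, random_state=0)
    est.fit(X_tr, y_tr)
    scores[depth] = est.score(X_te, y_te)  # R^2 on held-out data

print(scores)
```

With max_depth=1 every tree splits on a single feature, so the ensemble is additive in the features and cannot represent the product term; one would expect its held-out R^2 to stay close to zero here, while trees of depth two should do markedly better.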
Another way to control the depth of the trees is by enforcing a lower bound on the number of samples in a leaf: this avoids imbalanced splits where a leaf is formed for just one extreme data point. In scikit-learn you can do this using the argument min_samples_leaf. This is effectively a means to introduce bias into your model in the hope of also reducing variance, as shown in the example below:
# plt, GradientBoostingRegressor, deviance_plot, n_estimators and the
# train/test arrays are assumed to be defined earlier in this tutorial.
def fmt_params(params):
    # dict.items() replaces the Python 2-only dict.iteritems()
    return ", ".join("{0}={1}".format(key, val) for key, val in params.items())

fig = plt.figure(figsize=(8, 5))
ax = plt.gca()
# Compare the default settings against min_samples_leaf=3
for params, (test_color, train_color) in [({}, ('#d7191c', '#2c7bb6')),
                                          ({'min_samples_leaf': 3},
                                           ('#fdae61', '#abd9e9'))]:
    est = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=1, learning_rate=1.0)
    est.set_params(**params)
    est.fit(X_train, y_train)
    test_dev, ax = deviance_plot(est, X_test, y_test, ax=ax, label=fmt_params(params),
                                 train_color=train_color, test_color=test_color)

# Annotations refer to the last model fitted in the loop (min_samples_leaf=3)
ax.annotate('Higher bias', xy=(900, est.train_score_[899]), xycoords='data',
            xytext=(600, 0.3), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"),
            )
ax.annotate('Lower variance', xy=(900, test_dev[899]), xycoords='data',
            xytext=(600, 0.4), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"),
            )
plt.legend(loc='upper right')
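To see what min_samples_leaf constrains in the fitted trees, the following sketch inspects the leaf sizes of every tree in a small ensemble. The synthetic data with one extreme point, the helper smallest_leaf, and the names X_demo, y_demo and results are hypothetical and only serve this illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic 1-D data with a single extreme point at the edge (assumption).
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)
X_demo[0, 0] = 10.5
y_demo[0] = 25.0

def smallest_leaf(est):
    """Smallest number of training samples that ends up in any leaf."""
    sizes = []
    for tree in est.estimators_.ravel():
        t = tree.tree_
        is_leaf = t.children_left == -1  # leaves have no children
        sizes.append(t.n_node_samples[is_leaf].min())
    return int(min(sizes))

results = {}
for min_leaf in (1, 3):
    est = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                    min_samples_leaf=min_leaf, random_state=0)
    est.fit(X_demo, y_demo)
    results[min_leaf] = smallest_leaf(est)

print(results)
```

With min_samples_leaf=3 no leaf can hold fewer than three training samples, so a tree can no longer dedicate a leaf to the single extreme point.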
Shrinkage
The most important regularization technique for GBRT is shrinkage: the idea is to learn slowly by shrinking the predictions of each individual tree by some small scalar, the learning_rate. By doing so, the model has to reinforce concepts over many trees instead of relying on any single one. A lower learning_rate requires a higher number of n_estimators to reach the same level of training error, so you are trading runtime against accuracy.
fig = plt.figure(figsize=(8, 5))
ax = plt.gca()
# Compare learning_rate=1.0 (no shrinkage) against learning_rate=0.1
for params, (test_color, train_color) in [({}, ('#d7191c', '#2c7bb6')),
                                          ({'learning_rate': 0.1},
                                           ('#fdae61', '#abd9e9'))]:
    est = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=1, learning_rate=1.0)
    est.set_params(**params)
    est.fit(X_train, y_train)
    test_dev, ax = deviance_plot(est, X_test, y_test, ax=ax, label=fmt_params(params),
                                 train_color=train_color, test_color=test_color)

ax.annotate('Requires more trees', xy=(200, est.train_score_[199]), xycoords='data',
            xytext=(300, 1.0), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"),
            )
ax.annotate('Lower test error', xy=(900, test_dev[899]), xycoords='data',
            xytext=(600, 0.5), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"),
            )
plt.legend(loc='upper right')
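The mechanics of shrinkage can be sketched with a hand-written least-squares boosting loop. This is a simplified stand-in, not scikit-learn's implementation: the function boost, the synthetic data, and the names fast and slow are assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_estimators, learning_rate):
    """Least-squares boosting with stumps; returns training MSE per stage."""
    pred = np.full(len(y), y.mean())  # initial model: predict the mean
    errors = []
    for _ in range(n_estimators):
        residual = y - pred  # negative gradient of the squared loss
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        # Shrinkage: only a fraction of each tree's prediction is added.
        pred += learning_rate * stump.predict(X)
        errors.append(np.mean((y - pred) ** 2))
    return np.array(errors)

# Synthetic 1-D regression problem (assumption for illustration).
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=200)

fast = boost(X_demo, y_demo, n_estimators=100, learning_rate=1.0)
slow = boost(X_demo, y_demo, n_estimators=100, learning_rate=0.1)

print("training MSE after 10 trees:", fast[9], "vs", slow[9])
```

After the same number of trees, the shrunk model has a higher training error, which is why a lower learning_rate has to be paired with more n_estimators.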
Stochastic Gradient Boosting
Similar to RandomForest, introducing randomization into the tree building process can lead to higher accuracy. Scikit-learn provides two ways to introduce randomization: a) subsampling the training set before growing each tree (subsample) and b) subsampling the features before finding the best split node (max_features). Experience shows that the latter works better if there is a sufficiently large number of features (>30). One thing worth noting is that both options reduce runtime.
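A minimal sketch of how the two options are set in scikit-learn follows. The synthetic data with 40 features and the particular values subsample=0.5 and max_features=10 are assumptions chosen for illustration, not recommendations from the tutorial.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data with 40 features, only two of them informative (assumption).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 40))
y_demo = X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

est = GradientBoostingRegressor(n_estimators=100, max_depth=1, learning_rate=0.1,
                                subsample=0.5,    # each tree sees 50% of the rows
                                max_features=10,  # each split considers 10 features
                                random_state=1)
est.fit(X_demo, y_demo)

# With subsample < 1.0, scikit-learn also records the per-stage improvement
# of the loss on the samples left out of each tree's subsample.
print(est.oob_improvement_.shape)
print(est.max_features_)
```

Fixing random_state makes the row and feature subsampling reproducible between runs.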
Below we show the effect of using subsample=0.5, i.e. growing each tree on 50% of the training data, on our toy example:
fig = plt.figure(figsize=(8, 5))
ax = plt.gca()
# Compare the defaults against shrinkage combined with subsampling
for params, (test_color, train_color) in [({}, ('#d7191c', '#2c7bb6')),
                                          ({'learning_rate': 0.1, 'subsample': 0.5},
                                           ('#fdae61', '#abd9e9'))]:
    est = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=1, learning_rate=1.0,
                                    random_state=1)
    est.set_params(**params)
    est.fit(X_train, y_train)
    test_dev, ax = deviance_plot(est, X_test, y_test, ax=ax, label=fmt_params(params),
                                 train_color=train_color, test_color=test_color)

ax.annotate('Even lower test error', xy=(400, test_dev[399]), xycoords='data',
            xytext=(500, 0.5), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"),
            )

# Subsampling without shrinkage, for comparison
est = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=1, learning_rate=1.0,
                                subsample=0.5)
est.fit(X_train, y_train)
test_dev, ax = deviance_plot(est, X_test, y_test, ax=ax, label=fmt_params({'subsample': 0.5}),
                             train_color='#abd9e9', test_color='#fdae61', alpha=0.5)
ax.annotate('Subsample alone does poorly', xy=(300, test_dev[299]), xycoords='data',
            xytext=(250, 1.0),