Since their inception in the 1980s, regression trees have been one of the more widely used non-parametric prediction methods. Tree-structured methods yield a histogram reconstruction of the regression surface, where the bins correspond to the terminal nodes of a recursive partition. Trees are powerful, yet susceptible to overfitting. Strategies against overfitting have traditionally relied on pruning greedily grown trees. The Bayesian framework offers an alternative remedy through priors.
Roughly speaking, a good prior charges smaller trees, where overfitting does not occur. In the papers below, we take a step towards understanding why and when Bayesian trees and their ensembles do not overfit. We studied the speed at which the posterior concentrates around Hölder-smooth regression functions. We proposed a spike-and-tree variant of the popular Bayesian CART prior and established new theoretical results showing that regression trees (and their ensembles)
are capable of recovering smooth regression surfaces, achieving optimal rates up to a log factor;
can adapt to the unknown level of smoothness; and
can perform effective dimension reduction when p > n.
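For context, the benchmark behind "optimal rates up to a log factor" is the standard minimax rate for α-Hölder functions of p variables; the notation below is generic (a sketch of the usual concentration statement, not a verbatim restatement of the theorems):

```latex
% Minimax estimation rate for an \alpha-Hölder regression function of p variables
\varepsilon_n \asymp n^{-\alpha/(2\alpha + p)}

% Posterior concentration at this rate, up to a log factor, means that for
% some constant M and power t (details depend on the specific theorem),
\Pi\bigl( f : \|f - f_0\|_n > M \,\varepsilon_n \log^{t} n \mid Y^{(n)} \bigr)
\;\longrightarrow\; 0 \quad \text{in } P_{f_0}\text{-probability.}
```

Adaptation then means the prior attains this rate without knowledge of α, and dimension reduction means p can be replaced by the number of truly active covariates.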
These results provide a missing piece of theoretical evidence explaining why Bayesian trees (and additive variants thereof) have worked so well in practice. They also show that, from a posterior contraction point of view, there is in general no meaningful difference between single trees and forests: what ultimately matters is the total number of leaves. Other considerations, such as computational stability, may then guide the choice between the two. This conclusion may change when there is additional structure. For example, when the underlying regression function is additive, a Bayesian forest is preferable to a single tree.
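The way a tree prior "charges smaller trees" can be illustrated with a short sketch. The decaying split probability γ(1 + depth)^(−δ) below follows the familiar Bayesian CART convention of Chipman, George, and McCulloch; it is an illustration of the mechanism only, not the spike-and-tree prior studied in the papers, and the parameter values are arbitrary:

```python
import random

def sample_tree_depths(gamma=0.95, delta=2.0, depth=0, max_depth=10, rng=None):
    """Sample a binary tree from a Bayesian CART-style prior; return leaf depths.

    A node at the given depth splits with probability gamma * (1 + depth)**(-delta),
    so deep splits become increasingly unlikely and the prior favors small trees.
    (Illustrative convention; the spike-and-tree prior differs in detail.)
    """
    rng = rng if rng is not None else random.Random(0)
    p_split = gamma * (1.0 + depth) ** (-delta)
    if depth < max_depth and rng.random() < p_split:
        # Internal node: recurse into both children one level deeper.
        left = sample_tree_depths(gamma, delta, depth + 1, max_depth, rng)
        right = sample_tree_depths(gamma, delta, depth + 1, max_depth, rng)
        return left + right
    return [depth]  # terminal node (leaf)

# The number of leaves of one draw is len(sample_tree_depths(...));
# averaging over many draws shows the prior mass sits on small trees.
```

Because every internal node has exactly two children, the returned leaf depths of any draw satisfy the Kraft equality, which is a quick sanity check on the sampler.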
Ročková, V., & van der Pas, S. (2020). Posterior concentration for Bayesian regression trees and forests. The Annals of Statistics, 48(4), 2108-2131. [link]
van der Pas, S., & Ročková, V. (2017). Bayesian dyadic trees and histograms for regression. Advances in Neural Information Processing Systems, 30. [link]