Skip to content

forestry-labs/Rforestry

Repository files navigation

R

Rforestry: Random Forests, Linear Trees, and Gradient Boosting for Inference and Interpretability

Sören Künzel, Theo Saarinen, Simon Walter, Edward Liu, Sam Antonyan, Allen Tang, Jasjeet Sekhon

Introduction

Rforestry is a fast implementation of Honest Random Forests, Gradient Boosting, and Linear Random Forests, with an emphasis on inference and interpretability.

How to install

  1. The GFortran compiler has to be up to date. GFortran Binaries can be found here.
  2. The devtools package has to be installed. You can install it using, install.packages("devtools").
  3. The package contains compiled code, and you must have a development environment to install the development version. You can use devtools::has_devel() to check whether you do. If no development environment exists, Windows users download and install Rtools and macOS users download and install Xcode.
  4. The latest development version can then be installed using devtools::install_github("forestry-labs/Rforestry"). For Windows users, you'll need to skip 64-bit compilation devtools::install_github("forestry-labs/Rforestry", INSTALL_opts = c('--no-multiarch')) due to an outstanding gcc issue.

Documentation

For the Python package, see the documentation here and install from PyPI here. For the R package, see the documentation here and install from CRAN here. For the source code for both packages, see the Github here

Usage

library(Rforestry)

set.seed(292315)
test_idx <- sample(nrow(iris), 3)
x_train <- iris[-test_idx, -1]
y_train <- iris[-test_idx, 1]
x_test <- iris[test_idx, -1]

rf <- forestry(x = x_train, y = y_train, nthread = 2)

predict(rf, x_test)

Monotonic Constraints

The parameter monotonicConstraints strictly enforces monotonicity of partition averages when evaluating potential splits on the indicated features. This parameter can be used to specify both monotone increasing and monotone decreasing constraints.

library(Rforestry)

set.seed(49)
x <- rnorm(150)+5
y <- .15*x + .5*sin(3*x)
data_train <- data.frame(x1 = x, x2 = rnorm(150)+5, y = y + rnorm(150, sd = .4))

monotone_rf <- forestry(x = data_train[,-3],
                        y = data_train$y,
                        monotonicConstraints = c(1,1),
                        nodesizeStrictSpl = 5,
                        nthread = 1,
                        ntree = 25)
                        
predict(monotone_rf, newdata = data_train[,-3])

OOB Predictions

We can return the predictions for the training data set using only the trees in which each observation was out-of-bag (OOB). Note that when there are few trees, or a high proportion of the observations sampled, there may be some observations which are not out-of-bag for any trees. The predictions for these are returned as NaN.

library(Rforestry)

# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500)

# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")

# This should be equal to the OOB error
mean((oob_preds -  iris[,1])^2)
getOOB(rf)

If OOB predictions are going to be used, it is advised that one use OOB honesty during training (OOBhonest=true). In this version of honesty, the OOB observations for each tree are used as the honest (averaging) set. OOB honesty also changes how predictions are constructed. When predicting for observations that are out-of-sample (using Predict(..., aggregation = "average")), all the trees in the forest are used to construct predictions. When predicting for an observation that was in-sample (using predict(..., aggregation = "oob")), only the trees for which that observation was not in the averaging set are used to construct the prediction for that observation. aggregation="oob" (out-of-bag) ensures that the outcome value for an observation is never used to construct predictions for a given observation even when it is in sample. This property does not hold in standard honesty, which relies on an asymptotic subsampling argument. OOB honesty, when used in combination with aggregation="oob" at the prediction stage, cannot overfit IID data, at either the training or prediction stage. The outputs of such models are also more stable and more easily interpretable. One can observe this if one queries the model using interpretation tools such as ALEs, PDPs, LIME, etc.

library(Rforestry)

# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500,
               OOBhonest=TRUE)

# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")

# This should be equal to the OOB error
mean((oob_preds -  iris[,1])^2)
getOOB(rf)

Saving + Loading a model

In order to save a trained model, we include two functions in order to save and load a model we have built. The following code shows how to use saveForestry and loadForestry to save and load a forestry model.

library(Rforestry)

# Train a forest
forest <- forestry(x = iris[,-1],
                   y = iris[,1],
                   nthread = 2,
                   ntree = 500,
                   OOBhonest=TRUE)
               
# Get predictions before save the forest
y_pred_before <- predict(forest, iris[,-1])

# Save the forest
saveForestry(forest, filename = file.path("forest.Rda"))

# Delete the forest
rm(forest)

# Load the forest
forest_after <- loadForestry(file.path("forest.Rda"))

# Predict after loading the forest
y_pred_after <- predict(forest_after, iris[,-1])

Ridge Random Forest

A fast implementation of random forests using ridge penalized splitting and ridge regression for predictions. In order to use this version of random forests, set the linear option to TRUE.

library(Rforestry)

set.seed(49)
n <- c(100)
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 4*a + 5.5*b - .78*c
x <- data.frame(a,b,c)
forest <- forestry(x, y, linear = TRUE, nthread = 2)
predict(forest, x)