Sören Künzel, Theo Saarinen, Simon Walter, Edward Liu, Sam Antonyan, Allen Tang, Jasjeet Sekhon
Rforestry is a fast implementation of Honest Random Forests, Gradient Boosting, and Linear Random Forests, with an emphasis on inference and interpretability.
- The GFortran compiler has to be up to date. GFortran binaries can be found here.
- The devtools package has to be installed. You can install it using:
install.packages("devtools")
- The package contains compiled code, and you must have a development environment to install the development version. To check whether you do, run:
devtools::has_devel()
If no development environment exists, Windows users should download and install Rtools, and macOS users should download and install Xcode.
- The latest development version can then be installed using:
devtools::install_github("forestry-labs/Rforestry")
Windows users need to skip 64-bit compilation due to an outstanding gcc issue:
devtools::install_github("forestry-labs/Rforestry", INSTALL_opts = c('--no-multiarch'))
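The prerequisites above can be checked from an R session before attempting the install. The following is a minimal sketch using base R plus the devtools call already mentioned; it assumes the compiler is available on the PATH under the name gfortran, which may not hold on every system.

```r
# Check the installation prerequisites described above.
# Sys.which() returns "" when the executable is not found on the PATH.
# Assumption: the Fortran compiler is installed under the name "gfortran".
has_gfortran <- unname(nzchar(Sys.which("gfortran")))

# requireNamespace() returns FALSE instead of erroring when a package is missing.
has_devtools <- requireNamespace("devtools", quietly = TRUE)

prereqs <- c(gfortran = has_gfortran, devtools = has_devtools)
print(prereqs)

# Only query the build environment when devtools is actually available.
if (has_devtools) {
  print(devtools::has_devel())
}
```

If either entry is FALSE, install the missing piece before calling devtools::install_github.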
For the Python package, see the documentation here and install from PyPI here. For the R package, see the documentation here and install from CRAN here. For the source code for both packages, see the GitHub repository here.
library(Rforestry)
set.seed(292315)
test_idx <- sample(nrow(iris), 3)
x_train <- iris[-test_idx, -1]
y_train <- iris[-test_idx, 1]
x_test <- iris[test_idx, -1]
rf <- forestry(x = x_train, y = y_train, nthread = 2)
predict(rf, x_test)
The parameter monotonicConstraints strictly enforces monotonicity of partition averages when evaluating potential splits on the indicated features. This parameter can be used to specify both monotone increasing and monotone decreasing constraints.
library(Rforestry)
set.seed(49)
x <- rnorm(150)+5
y <- .15*x + .5*sin(3*x)
data_train <- data.frame(x1 = x, x2 = rnorm(150)+5, y = y + rnorm(150, sd = .4))
monotone_rf <- forestry(x = data_train[,-3],
                        y = data_train$y,
                        monotonicConstraints = c(1,1),
                        nodesizeStrictSpl = 5,
                        nthread = 1,
                        ntree = 25)
predict(monotone_rf, newdata = data_train[,-3])
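To make "monotonicity of partition averages" concrete, here is a simplified base R sketch of the check applied to a single candidate split. The helper split_respects_constraint is hypothetical and not part of Rforestry. It assumes constraints are encoded as 1 (increasing), -1 (decreasing), and 0 (none), and it ignores any bounds inherited from splits higher up in the tree.

```r
# Hypothetical helper, not part of Rforestry: decide whether a candidate split
# respects a monotonicity constraint on the splitting feature.
# constraint: 1 = monotone increasing, -1 = monotone decreasing, 0 = none.
split_respects_constraint <- function(x, y, split_value, constraint) {
  left_mean  <- mean(y[x <  split_value])
  right_mean <- mean(y[x >= split_value])
  if (constraint == 1)  return(left_mean <= right_mean)
  if (constraint == -1) return(left_mean >= right_mean)
  TRUE
}

# Made-up data in which the outcome rises with x.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.0, 1.2, 0.9, 2.1, 2.4, 2.2)

# The left partition average is below the right one, so the split is
# acceptable under an increasing constraint but not a decreasing one.
split_respects_constraint(x, y, split_value = 3.5, constraint = 1)
split_respects_constraint(x, y, split_value = 3.5, constraint = -1)
```

A split that fails the check would simply not be considered for that feature.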
We can return the predictions for the training data set using only the trees in which each observation was out-of-bag (OOB). Note that when there are few trees, or a high proportion of the observations sampled, there may be some observations which are not out-of-bag for any trees. The predictions for these are returned as NaN.
library(Rforestry)
# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500)
# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")
# This should be equal to the OOB error
mean((oob_preds - iris[,1])^2)
getOOB(rf)
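The OOB aggregation rule, including the NaN case, can be illustrated with a toy base R sketch. The numbers and the helper oob_average are made up for illustration and do not reflect how Rforestry stores its trees.

```r
# Toy illustration of out-of-bag aggregation (made-up numbers, base R only).
# Rows are observations, columns are trees.
tree_preds <- matrix(c(5.0, 5.2, 4.8,
                       6.1, 5.9, 6.0,
                       4.7, 4.9, 5.1),
                     nrow = 3, byrow = TRUE)

# in_bag[i, t] is TRUE when observation i was sampled to train tree t.
in_bag <- matrix(c(TRUE,  FALSE, FALSE,
                   TRUE,  TRUE,  TRUE,
                   FALSE, TRUE,  FALSE),
                 nrow = 3, byrow = TRUE)

# Hypothetical helper: average each observation's predictions over the trees
# in which it was out-of-bag. mean() of an empty vector is NaN, which matches
# the behavior described above for observations that are never out-of-bag.
oob_average <- function(preds, in_bag) {
  sapply(seq_len(nrow(preds)), function(i) mean(preds[i, !in_bag[i, ]]))
}

oob_average(tree_preds, in_bag)
```

The second observation is in-bag for every tree, so its OOB prediction is NaN.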
If OOB predictions are going to be used, it is advisable to use OOB honesty during training (OOBhonest = TRUE). In this version of honesty, the OOB observations for each tree are used as the honest (averaging) set. OOB honesty also changes how predictions are constructed. When predicting for observations that are out-of-sample (using predict(..., aggregation = "average")), all the trees in the forest are used to construct predictions. When predicting for an observation that was in-sample (using predict(..., aggregation = "oob")), only the trees for which that observation was not in the averaging set are used to construct the prediction for that observation.
Using aggregation = "oob" (out-of-bag) ensures that an observation's outcome value is never used to construct its own prediction, even when the observation is in sample. This property does not hold in standard honesty, which relies on an asymptotic subsampling argument. OOB honesty, when used in combination with aggregation = "oob" at the prediction stage, cannot overfit IID data, at either the training or prediction stage. The outputs of such models are also more stable and more easily interpretable, which can be observed by querying the model with interpretation tools such as ALEs, PDPs, and LIME.
library(Rforestry)
# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500,
               OOBhonest = TRUE)
# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")
# This should be equal to the OOB error
mean((oob_preds - iris[,1])^2)
getOOB(rf)
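As a rough picture of how OOB honesty partitions the data for one tree, consider the base R sketch below. It is a simplification under stated assumptions: the variable names and the helper tree_used_for are hypothetical, and the package may apply further resampling that is not shown here.

```r
# Toy illustration of the data partition for a single tree under OOB honesty
# (base R only; all names are hypothetical).
set.seed(1)
n <- 10

# Bootstrap sample: these observations are used to choose the splits.
splitting_idx <- sort(unique(sample(n, size = n, replace = TRUE)))

# The observations left out of the bootstrap sample form the averaging set.
averaging_idx <- setdiff(seq_len(n), splitting_idx)

# The two sets never overlap, and together they cover the training data.
length(intersect(splitting_idx, averaging_idx)) == 0
setequal(union(splitting_idx, averaging_idx), seq_len(n))

# With aggregation = "oob", this tree would contribute to the prediction for
# observation i only if i is not in the tree's averaging set.
tree_used_for <- function(i) !(i %in% averaging_idx)
tree_used_for(splitting_idx[1])
```

Because an observation's outcome only enters a tree through its averaging set, excluding those trees keeps the outcome out of its own prediction.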
Two functions are included for saving and loading a trained model. The following code shows how to use saveForestry and loadForestry to save and load a forestry model.
library(Rforestry)
# Train a forest
forest <- forestry(x = iris[,-1],
                   y = iris[,1],
                   nthread = 2,
                   ntree = 500,
                   OOBhonest = TRUE)
# Get predictions before saving the forest
y_pred_before <- predict(forest, iris[,-1])
# Save the forest
saveForestry(forest, filename = file.path("forest.Rda"))
# Delete the forest
rm(forest)
# Load the forest
forest_after <- loadForestry(file.path("forest.Rda"))
# Predict after loading the forest
y_pred_after <- predict(forest_after, iris[,-1])
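One way to confirm that the round trip preserved the model is to compare the predictions made before and after loading. The sketch below uses only the functions shown above plus base R; writing to a temporary file and using a smaller forest are choices made here for illustration, and it assumes a reloaded forest reproduces the original predictions up to numerical tolerance.

```r
library(Rforestry)

set.seed(1)
forest <- forestry(x = iris[, -1], y = iris[, 1], nthread = 1, ntree = 50)
y_pred_before <- predict(forest, iris[, -1])

# Save to a temporary file so the working directory is left untouched.
path <- tempfile(fileext = ".Rda")
saveForestry(forest, filename = path)
rm(forest)

forest_after <- loadForestry(path)
y_pred_after <- predict(forest_after, iris[, -1])

# TRUE if the reloaded forest reproduces the original predictions.
isTRUE(all.equal(y_pred_before, y_pred_after))

# Clean up the temporary file.
unlink(path)
```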
A fast implementation of random forests using ridge penalized splitting and ridge regression for predictions. In order to use this version of random forests, set the linear option to TRUE.
library(Rforestry)
set.seed(49)
n <- 100
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 4*a + 5.5*b - .78*c
x <- data.frame(a,b,c)
forest <- forestry(x, y, linear = TRUE, nthread = 2)
predict(forest, x)
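For readers unfamiliar with ridge regression, the base R sketch below fits the closed-form ridge estimate on the same kind of data. It is not Rforestry's internal code: the helper ridge_coef and the penalty value lambda are hypothetical, and the sketch fits one global model rather than one model per leaf.

```r
# Closed-form ridge regression on data generated as in the example above
# (base R only). `lambda` is a hypothetical penalty chosen for illustration.
set.seed(49)
n <- 100
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 4*a + 5.5*b - .78*c

# Design matrix with an explicit intercept column.
X <- cbind(intercept = 1, a = a, b = b, c = c)

# Hypothetical helper: solve (X'X + penalty) beta = X'y.
ridge_coef <- function(X, y, lambda) {
  penalty <- diag(lambda, ncol(X))
  penalty[1, 1] <- 0  # leave the intercept unpenalized
  drop(solve(crossprod(X) + penalty, crossprod(X, y)))
}

ridge_coef(X, y, lambda = 0)   # no penalty: ordinary least squares
ridge_coef(X, y, lambda = 10)  # penalized: slopes are shrunk toward zero
```

With no penalty and noiseless data, the fit recovers the generating coefficients up to numerical precision; a positive penalty trades some of that fit for smaller coefficients.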