A convenient wrapper for ranger that extends its output with the Moran's I of the residuals for different distance thresholds, the RMSE and NRMSE (as computed by root_mean_squared_error()), and, optionally, variable importance scores computed on a version of the data scaled with scale.

rf(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  scaled.importance = FALSE,
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)

Arguments

data

Data frame with a response variable and a set of predictors. Default: NULL

dependent.variable.name

Character string with the name of the response variable. Must be in the column names of data. If the dependent variable is binary with values 1 and 0, the argument case.weights of ranger is populated by the function case_weights(). Default: NULL

predictor.variable.names

Character vector with the names of the predictive variables. Every element of this vector must be in the column names of data. Optionally, the result of auto_cor() or auto_vif(). Default: NULL

distance.matrix

Square matrix with the distances among the records in data. The number of rows of distance.matrix and data must be the same. If not provided, the computation of the Moran's I of the residuals is omitted. Default: NULL

distance.thresholds

Numeric vector with neighborhood distances. All distances in the distance matrix below each value in distance.thresholds are set to 0 for the computation of Moran's I. If NULL, it defaults to seq(0, max(distance.matrix), length.out = 4). Default: NULL
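As a rough illustration of the default described above, the following base R sketch builds a toy distance matrix (hypothetical values, not package data), derives the default thresholds, and zeroes the distances below one of them. It only mirrors the description in this help file and is not the package's internal code.

```r
# Toy distance matrix among four records (hypothetical values).
distance.matrix <- as.matrix(dist(c(0, 10, 25, 60)))

# Default thresholds as described above: four evenly spaced values
# from 0 to the maximum distance in the matrix.
distance.thresholds <- seq(0, max(distance.matrix), length.out = 4)
distance.thresholds

# Distances below a given threshold are set to 0 before Moran's I is computed.
threshold <- distance.thresholds[2]
thresholded <- distance.matrix
thresholded[thresholded < threshold] <- 0
thresholded
```

Passing a short vector such as c(0, 1000) instead of the default is a way to control exactly which neighborhood distances are evaluated.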

xy

(optional) Data frame or matrix with two columns containing coordinates and named "x" and "y". It is not used by this function, but it is stored in the slot ranger.arguments$xy of the model, so it can be used by rf_evaluate() and rf_tuning(). Default: NULL

ranger.arguments

Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', which is set to 'permutation' rather than 'none'. The ranger arguments x, y, and formula are disabled. Please consult the help file of ranger if you are not familiar with its arguments.

scaled.importance

Logical, if TRUE, the function scales data with scale and fits a new model to compute scaled variable importance scores. This makes variable importance scores of different models somewhat comparable. Default: FALSE
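To show what scaling with scale does to the predictors, here is a minimal base R sketch on a hypothetical data frame (the column names and values are made up for illustration). After scaling, every column has mean 0 and standard deviation 1, which is what makes importance scores from different models more comparable.

```r
# Hypothetical predictors measured on very different scales.
predictors <- data.frame(
  temperature = c(12.1, 15.3, 9.8, 20.4, 17.6),
  elevation = c(150, 2300, 870, 40, 1200)
)

# scale() centers each column on its mean and divides it by its
# standard deviation.
predictors.scaled <- as.data.frame(scale(predictors))

# Column means are (numerically) zero and standard deviations are one.
round(colMeans(predictors.scaled), 10)
apply(predictors.scaled, 2, sd)
```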

seed

Integer, random seed to facilitate reproducibility. If set to a given number, the returned model is always the same. Default: 1

verbose

Boolean. If TRUE, messages and plots generated during the execution of the function are displayed. Default: TRUE

n.cores

Integer, number of cores to use. Default: parallel::detectCores() - 1

cluster

A cluster definition generated with parallel::makeCluster(). This function does not use the cluster, but can pass it on to other functions when using the %>% pipe. It will be stored in the slot cluster of the output list. Default: NULL

Value

A ranger model with several extra slots:

  • ranger.arguments: Stores the values of the arguments used to fit the ranger model.

  • importance: A list containing a data frame with the predictors ordered by their importance, a ggplot showing the importance values, and local importance scores (difference in accuracy between permuted and non-permuted variables for every case, computed on the out-of-bag data).

  • performance: performance scores: R squared on out-of-bag data, R squared (cor(observed, predicted) ^ 2), pseudo R squared (cor(observed, predicted)), RMSE, and normalized RMSE (NRMSE).

  • residuals: residuals, normality test of the residuals computed with residuals_test(), and spatial autocorrelation of the residuals computed with moran_multithreshold().
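The performance scores listed above can be sketched in base R as follows, using hypothetical observed and predicted values. The R squared and pseudo R squared follow the definitions given in this help file. The normalization of the RMSE by the interquartile range is an assumption made for illustration, since this page does not state which normalization root_mean_squared_error() applies.

```r
# Hypothetical observed and predicted values.
observed <- c(10, 12, 15, 20, 24, 30)
predicted <- c(11, 13, 14, 18, 25, 27)

# Pseudo R squared and R squared, as defined in this help file.
pseudo.r.squared <- cor(observed, predicted)
r.squared <- pseudo.r.squared^2

# Root mean squared error.
rmse <- sqrt(mean((observed - predicted)^2))

# Normalized RMSE. Assumption: normalization by the interquartile range
# of the observations; other normalizations (range, mean, sd) are possible.
nrmse <- rmse / IQR(observed)

c(
  r.squared = r.squared,
  pseudo.r.squared = pseudo.r.squared,
  rmse = rmse,
  nrmse = nrmse
)
```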

Details

Please read the help file of ranger for further details. Note that the formula interface of ranger is supported through ranger.arguments, but variable interactions are not allowed (see the_feature_engineer() for an alternative).

Examples

if(interactive()){

 #loading example data
 data("plant_richness_df")
 data("distance_matrix")

 #fitting random forest model
 out <- rf(
   data = plant_richness_df,
   dependent.variable.name = "richness_species_vascular",
   predictor.variable.names = colnames(plant_richness_df)[5:21],
   distance.matrix = distance_matrix,
   distance.thresholds = 0,
   n.cores = 1
 )

 class(out)

 #data frame with ordered variable importance
 out$importance$per.variable

 #variable importance plot
 out$importance$per.variable.plot

 #performance
 out$performance

 #spatial correlation of the residuals (stored in the residuals slot)
 out$residuals$autocorrelation$per.distance

 #plot of the Moran's I of the residuals for different distance thresholds
 out$residuals$autocorrelation$plot

 #predictions for new data as done with ranger models:
 predicted <- stats::predict(
   object = out,
   data = plant_richness_df,
   type = "response"
 )$predictions

 #alternative data input methods
 ###############################

 #ranger.arguments can contain ranger arguments and any other rf argument
 my.ranger.arguments <- list(
   data = plant_richness_df,
   dependent.variable.name = "richness_species_vascular",
   predictor.variable.names = colnames(plant_richness_df)[8:21],
   distance.matrix = distance_matrix,
   distance.thresholds = c(0, 1000)
 )

 #fitting model with these ranger arguments
 out <- rf(
   ranger.arguments = my.ranger.arguments,
   n.cores = 1
 )

}