Fits spatial random forest models — rf

Fits spatial random forest models using different methods to generate, rank, and select spatial predictors acting as proxies of spatial processes not considered by the non-spatial predictors. The end goal is providing the model with information about the spatial structure of the data to minimize the spatial correlation (Moran's I) of the model residuals and generate honest variable importance scores.

rf_spatial(
  model = NULL,
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  scaled.importance = TRUE,
  method = c("mem.moran.sequential", "mem.effect.sequential", "mem.effect.recursive",
    "hengl", "hengl.moran.sequential", "hengl.effect.sequential",
    "hengl.effect.recursive", "pca.moran.sequential", "pca.effect.sequential",
    "pca.effect.recursive"),
  max.spatial.predictors = NULL,
  weight.r.squared = NULL,
  weight.penalization.n.predictors = NULL,
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)

Arguments

model

A model fitted with rf(). If used, the arguments data, dependent.variable.name, predictor.variable.names, distance.matrix, distance.thresholds, ranger.arguments, and scaled.importance are taken directly from the model definition. Default: NULL

data

Data frame with a response variable and a set of predictors. Default: NULL

dependent.variable.name

Character string with the name of the response variable. Must be in the column names of data. If the dependent variable is binary with values 1 and 0, the argument case.weights of ranger is populated by the function case_weights(). Default: NULL

predictor.variable.names

Character vector with the names of the predictive variables. Every element of this vector must be in the column names of data. Default: NULL

distance.matrix

Squared matrix with the distances among the records in data. The number of rows of distance.matrix and data must be the same. If not provided, the computation of the Moran's I of the residuals is omitted. Default: NULL

distance.thresholds

Numeric vector with distances in the same units as distance.matrix Distances below each distance threshold are set to 0 on separated copies of the distance matrix to compute Moran's I at different neighborhood distances. If NULL, it defaults to seq(0, max(distance.matrix)/2, length.out = 4) (defined by default_distance_thresholds()). Default: NULL

xy

(optional) Data frame or matrix with two columns containing coordinates and named "x" and "y". It is not used by this function, but it is stored in the slot ranger.arguments$xy of the model, so it can be used by rf_evaluate() and rf_tuning(). Default: NULL

ranger.arguments

Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function.

scaled.importance

Logical. If TRUE, and 'importance = "permutation', the function scales 'data' with scale and fits a new model to compute scaled variable importance scores. Default: TRUE

method

Character, method to build, rank, and select spatial predictors. One of:

"hengl"
"hengl.moran.sequential" (experimental)
"hengl.effect.sequential" (experimental)
"hengl.effect.recursive" (experimental)
"pca.moran.sequential" (experimental)
"pca.effect.sequential" (experimental)
"pca.effect.recursive" (experimental)
"mem.moran.sequential"
"mem.effect.sequential"
"mem.effect.recursive"

max.spatial.predictors

Integer, maximum number of spatial predictors to generate. Useful when memory problems arise due to a large number of spatial predictors, Default: NULL

weight.r.squared

Numeric between 0 and 1, weight of R-squared in the selection of spatial components. See Details, Default: NULL

weight.penalization.n.predictors

Numeric between 0 and 1, weight of the penalization for adding an increasing number of spatial predictors during selection. Default: NULL

seed

Integer, random seed to facilitate reproducibility. Default: 1.

verbose

Logical. If TRUE, messages and plots generated during the execution of the function are displayed, Default: TRUE

n.cores

Integer, number of cores to use for parallel execution. Creates a socket cluster with parallel::makeCluster(), runs operations in parallel with foreach and %dopar%, and stops the cluster with parallel::clusterStop() when the job is done. Default: parallel::detectCores() - 1

cluster

A cluster definition generated with parallel::makeCluster(). If provided, overrides n.cores. When cluster = NULL (default value), and model is provided, the cluster in model, if any, is used instead. If this cluster is NULL, then the function uses n.cores instead. The function does not stop a provided cluster, so it should be stopped with parallel::stopCluster() afterwards. The cluster definition is stored in the output list under the name "cluster" so it can be passed to other functions via the model argument, or using the %>% pipe. Default: NULL

Value

A ranger model with several new slots:

ranger.arguments: Values of the arguments used to fit the ranger model.
importance: A list containing the vector of variable importance as originally returned by ranger (scaled or not depending on the value of 'scaled.importance'), a data frame with the predictors ordered by their importance, and a ggplot showing the importance values.
performance: With the out-of-bag R squared, pseudo R squared, RMSE and NRMSE of the model.
residuals: residuals, normality test of the residuals computed with residuals_test(), and spatial autocorrelation of the residuals computed with moran_multithreshold().
spatial: A list with four slots:
- method: Character, method used to generate, rank, and select spatial predictors.
- names: Character vector with the names of the selected spatial predictors. Not returned if the method is "hengl".
- optimization: Criteria used to select the spatial predictors. Not returned if the method is "hengl".
- plot: Plot of the criteria used to select the spatial predictors. Not returned if the method is "hengl".

Details

The function uses three different methods to generate spatial predictors ("hengl", "pca", and "mem"), two methods to rank them in order to define in what order they are introduced in the model ("effect" and "moran), and two methods to select the spatial predictors that minimize the spatial correlation of the model residuals ("sequential" and "recursive"). All method names but "hengl" (that uses the complete distance matrix as predictors in the spatial model) are named by combining a method to generate the spatial predictors, a method to rank them, and a method to select them, separated by a point. Examples are "mem.moran.sequential" or "mem.effect.recursive". All combinations are not possible, since the ranking method "moran" cannot be used with the selection method "recursive" (because the logics behind them are very different, see below). Methods to generate spatial predictors:

"hengl": named after Tomislav Hengl and the paper Hengl et al. (2018), where the authors propose to use the distance matrix among records as predictors in spatial random forest models (RFsp method). In this function, all methods starting with "hengl" use either the complete distance matrix, or select columns of the distance matrix as spatial predictors.
"mem": Generates Moran's Eigenvector Maps, that is, the eigenvectors of the double-centered weights of the distance matrix. The method is described in Dray, Legendre and Peres-Neto (2006) and Legendre and Gauthier (2014).
"pca": Computes spatial predictors from the principal component analysis of a weighted distance matrix (see weights_from_distance_matrix()). This is an experimental method, use with caution.

Methods to rank spatial predictors (see rank_spatial_predictors()):

"moran": Computes the Moran's I of each spatial predictor, selects the ones with positive values, and ranks them from higher to lower Moran's I.
"effect": If a given non-spatial random forest model is defined as y = p1 + ... + pn, being p1 + ... + pn the set of predictors, for every spatial predictor generated (spX) a spatial model y = p1 + ... + pn + spX is fitted, and the Moran's I of its residuals is computed. The spatial predictors are then ranked by how much they help to reduce spatial autocorrelation between the non-spatial and the spatial model.

Methods to select spatial predictors:

"sequential" (see select_spatial_predictors_sequential()): The spatial predictors are added one by one in the order they were ranked, and once all spatial predictors are introduced, the best first n predictors are selected. This method is similar to the one employed in the MEM methodology (Moran's Eigenvector Maps) described in Dray, Legendre and Peres-Neto (2006) and Legendre and Gauthier (2014). This method generally introduces tens of predictors into the model, but usually offers good results.
"recursive" (see select_spatial_predictors_recursive()): This method tries to find the smallest combination of spatial predictors that reduce the spatial correlation of the model's residuals the most. The algorithm goes as follows: 1. The first ranked spatial predictor is introduced into the model; 2. the remaining predictors are ranked again using the "effect" method, using the model in 1. as reference. The first spatial predictor in the resulting ranking is then introduced into the model, and the steps 1. and 2. are repeated until spatial predictors stop having an effect in reducing the Moran's I of the model residuals. This method takes longer to compute, but generates smaller sets of spatial predictors. This is an experimental method, use with caution.

Once ranking procedure is completed, an algorithm is used to select the minimal subset of spatial predictors that reduce the most the Moran's I of the residuals: for each new spatial predictor introduced in the model, the Moran's I of the residuals, it's p-value, a binary version of the p-value (0 if < 0.05 and 1 if >= 0.05), the R-squared of the model, and a penalization linear with the number of spatial predictors introduced (computed as (1 / total spatial predictors) * introduced spatial predictors) are rescaled between 0 and 1. Then, the optimization criteria is computed as max(1 - Moran's I, p-value binary) + (weight.r.squared * R-squared) - (weight.penalization.n.predictors * penalization). The predictors from the first one to the one with the highest optimization criteria are then selected as the best ones in reducing the spatial correlation of the model residuals, and used along with data to fit the final spatial model.

Examples

if(interactive()){

 #loading example data
 data(distance_matrix)
 data(plant_richness_df)

 #names of the response and predictors
 dependent.variable.name <- "richness_species_vascular"
 predictor.variable.names <- colnames(plant_richness_df)[5:21]

 #hengl
 model <- rf_spatial(
   data = plant_richness_df,
   dependent.variable.name = dependent.variable.name,
   predictor.variable.names = predictor.variable.names,
   distance.matrix = distance_matrix,
   distance.thresholds = 0,
   method = "hengl",
   n.cores = 1
 )

 #mem.moran.sequential
 model <- rf_spatial(
   data = plant_richness_df,
   dependent.variable.name = dependent.variable.name,
   predictor.variable.names = predictor.variable.names,
   distance.matrix = distance_matrix,
   distance.thresholds = 0,
   method = "mem.moran.sequential",
   n.cores = 1
 )

 #fitting an rf_spatial model from an rf model
 rf.model <- rf(
   data = plant_richness_df,
   dependent.variable.name = "richness_species_vascular",
   predictor.variable.names = colnames(plant_richness_df)[5:21],
   distance.matrix = distance_matrix,
   distance.thresholds = 0,
   n.cores = 1,
   verbose = FALSE
 )
 rf.model$spatial.correlation.residuals$plot

 #spatial version of the rf model
 rf.spatial <- rf_spatial(model = rf.model)
 rf.spatial$spatial.correlation.residuals$plot


}