Suggest variable interactions for random forest models

Suggests candidate variable interactions by selecting the variables of a model with a variable importance above a given threshold (set with the argument importance.threshold) and combining them in pairs through multiplication (a * b). The interacting variables are rescaled between 1 and 100 before multiplication to avoid artifacts when a variable has a 0 somewhere in the middle of its range (e.g. temperature).
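The selection, rescaling, and multiplication described above can be sketched in base R as follows. This is a minimal illustration, not the internal code of rf_interactions(): the data, the importance scores, the threshold value, the helper rescale_1_100(), and the "_x_" naming of the interaction columns are all assumptions made for the example.

```r
# Toy predictors (made-up data); temperature crosses 0 within its range
set.seed(1)
df <- data.frame(
  temperature = runif(50, min = -10, max = 30),
  rainfall    = runif(50, min = 0, max = 2000),
  elevation   = runif(50, min = 0, max = 3000),
  slope       = runif(50, min = 0, max = 45)
)

# Hypothetical importance scores and threshold, for illustration only
importance <- c(temperature = 0.9, rainfall = 0.8, elevation = 0.6, slope = 0.1)
threshold <- 0.5

# Keep the variables with importance above the threshold
selected <- names(importance)[importance > threshold]

# Hypothetical helper: linear rescaling so that min(x) -> 1 and max(x) -> 100
rescale_1_100 <- function(x) {
  (x - min(x)) / (max(x) - min(x)) * 99 + 1
}

# All pairs of selected variables, multiplied after rescaling
pairs <- combn(selected, 2, simplify = FALSE)
interactions <- lapply(pairs, function(p) {
  rescale_1_100(df[[p[1]]]) * rescale_1_100(df[[p[2]]])
})
names(interactions) <- vapply(pairs, paste, character(1), collapse = "_x_")
interactions <- as.data.frame(interactions)

head(interactions)
```

Without the rescaling, any row where temperature is 0 would produce an interaction value of 0 regardless of the other variable, which is the artifact the rescaling is meant to avoid.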

For each variable interaction, a model including all the predictors plus the interaction is fitted, and its R squared is compared with the R squared of the model without interactions. This model without interactions can either be provided through the argument model, or is fitted on the fly with rf_repeat() if the user provides the data.
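The comparison described above can be approximated with two plain ranger models, as in the sketch below. It is a simplification under stated assumptions: the function itself relies on rf_repeat() rather than on a single ranger() call, and the simulated data, the helper rescale_1_100(), and the column name a_x_b are hypothetical.

```r
library(ranger)

# Simulated data where the response depends on the product a * b
set.seed(1)
n <- 200
df <- data.frame(a = runif(n), b = runif(n), c = runif(n))
df$y <- df$a * df$b + rnorm(n, sd = 0.05)

# Hypothetical helper: linear rescaling so that min(x) -> 1 and max(x) -> 100
rescale_1_100 <- function(x) {
  (x - min(x)) / (max(x) - min(x)) * 99 + 1
}

# Model without interactions
m0 <- ranger(
  dependent.variable.name = "y",
  data = df,
  importance = "permutation",
  seed = 1
)

# Model with all the predictors plus one candidate interaction
df_int <- df
df_int$a_x_b <- rescale_1_100(df$a) * rescale_1_100(df$b)
m1 <- ranger(
  dependent.variable.name = "y",
  data = df_int,
  importance = "permutation",
  seed = 1
)

# Difference in out-of-bag R squared attributable to the interaction
r2_gain <- m1$r.squared - m0$r.squared
r2_gain
```

A positive r2_gain only indicates that the interaction improved the out-of-bag fit in this run; whether the interaction is meaningful remains a domain question.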

Users should not adopt the suggested variable interactions uncritically. Most likely, only one or a few of them will make sense from a domain expertise standpoint.

rf_interactions(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  ranger.arguments = NULL,
  importance.threshold = NULL,
  point.color = viridis::viridis(100, option = "F"),
  seed = NULL,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster.ips = NULL,
  cluster.cores = NULL,
  cluster.user = Sys.info()[["user"]],
  cluster.port = "11000"
)

Arguments

data	Data frame with a response variable and a set of predictors. Default: `NULL`
dependent.variable.name	Character string with the name of the response variable. Must be in the column names of `data`. If the dependent variable is binary with values 1 and 0, the argument `case.weights` of `ranger` is populated by the function `case_weights()`. Default: `NULL`
predictor.variable.names	Character vector with the names of the predictive variables. Every element of this vector must be in the column names of `data`. Default: `NULL`
ranger.arguments	Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', which is set to 'permutation' rather than 'none'. Please consult the help file of ranger if you are not familiar with the arguments of this function.
importance.threshold	Value of variable importance from `model` used as threshold to select variables to generate candidate interactions. Default: Quantile 0.75 of the variable importance in `model`.
point.color	Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. `viridis::viridis(100)`). Default: `viridis::viridis(100, option = "F")`
seed	Integer, random seed to facilitate reproducibility. If set to a given number, the results of the function are always the same.
verbose	Logical. If `TRUE`, messages and plots generated during the execution of the function are displayed. Default: `TRUE`
n.cores	Integer, number of cores to use. Default: `parallel::detectCores() - 1`
cluster.ips	Character vector with the IPs of the machines in a cluster. The machine with the first IP will be considered the main node of the cluster, and will generally be the machine on which the R code is being executed.
cluster.cores	Integer vector, number of cores to use on each machine.
cluster.user	Character string, name of the user (should be the same throughout machines). Defaults to the current system user.
cluster.port	Integer, port used by the machines in the cluster to communicate. The firewall on every machine must allow traffic to and from this port. Default: `"11000"`

Value

A list with four slots: screening, with the complete screening results; selected, with the names and the R squared improvement produced by each variable interaction; columns, a data frame with the interactions computed from the training data; and plot, with the list of plots of the selected interactions versus the response variable. The list of plots in the slot plot (named p here) can be plotted all at once with patchwork::wrap_plots(p) or cowplot::plot_grid(plotlist = p), or one by one by extracting each plot from the list.
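The plotting options mentioned above can be tried on a stand-in list of ggplot objects, as in the following sketch. The data frame and the column names are made up, and the list p is built by hand here; with real output it would presumably be the plot slot of the returned list.

```r
library(ggplot2)
library(patchwork)

# Made-up response and interaction columns, for illustration only
set.seed(1)
df <- data.frame(
  y     = rnorm(50),
  a_x_b = runif(50, min = 1, max = 10000),
  a_x_c = runif(50, min = 1, max = 10000)
)

# Stand-in for the list of plots: one scatterplot per interaction
p <- lapply(c("a_x_b", "a_x_c"), function(v) {
  ggplot(df, aes(x = .data[[v]], y = y)) +
    geom_point() +
    labs(x = v)
})

# All plots at once
patchwork::wrap_plots(p)

# One plot at a time
p[[1]]
```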

Examples

if(interactive()){

 data(plant_richness_df)

 interactions <- rf_interactions(
   data = plant_richness_df,
   dependent.variable.name = "richness_species_vascular",
   predictor.variable.names = colnames(plant_richness_df)[5:21],
   verbose = TRUE
 )

 interactions$screening
 interactions$selected
 interactions$columns

}