Suggests candidate variable interactions by selecting the variables above a given variable importance threshold (given by the argument importance.threshold) from a model and combining them in pairs through multiplication (a * b). The interacting variables are rescaled between 1 and 100 before multiplication to avoid artifacts when a variable takes the value 0 somewhere in the middle of its range (e.g. temperature).
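
To illustrate why the rescaling matters, here is a minimal base R sketch. The helper rescale_1_100() and the toy vectors are hypothetical and are not taken from the package; the function's internal rescaling may differ in detail.

```r
# Hypothetical helper: linearly rescale a numeric vector to [1, 100].
# Assumes x is not constant (otherwise the range is zero).
rescale_1_100 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  1 + (x - rng[1]) / (rng[2] - rng[1]) * 99
}

# Toy predictors: temperature crosses zero in the middle of its range.
temperature <- c(-10, -5, 0, 5, 10)
rainfall <- c(200, 400, 600, 800, 1000)

# Raw product: zero wherever temperature is zero, whatever the rainfall,
# and the sign flips with the sign of temperature.
raw_product <- temperature * rainfall

# Rescaled product: both inputs are strictly positive, so the product
# increases whenever both variables increase.
scaled_product <- rescale_1_100(temperature) * rescale_1_100(rainfall)
```

In this toy case the raw product is flat over the first two observations and zero at the third, while the rescaled product increases steadily.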

For each variable interaction, a model including all the predictors plus the interaction is fitted, and its R squared is compared with the R squared of the model without interactions. This model without interactions can either be provided through the argument model, or be fitted on the fly with rf_repeat() if the user provides the data.
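
The comparison described above can be sketched in base R as follows. This is only an illustration of the logic: it swaps the random forest for lm() to stay dependency-free, and the simulated data and the column name a_x_b are hypothetical.

```r
set.seed(1)
n <- 200
a <- runif(n, 1, 100)
b <- runif(n, 1, 100)

# Toy response that truly depends on the product a * b.
y <- 0.01 * a * b + rnorm(n, sd = 5)
df <- data.frame(y = y, a = a, b = b)

# Reference model without the interaction (lm() as a stand-in).
r2_without <- summary(lm(y ~ a + b, data = df))$r.squared

# Same model with the candidate interaction added as an extra column.
df$a_x_b <- df$a * df$b
r2_with <- summary(lm(y ~ a + b + a_x_b, data = df))$r.squared

# R squared improvement attributable to the interaction.
r2_gain <- r2_with - r2_without
```

Note that the in-sample R squared of a linear model never decreases when a predictor is added, whereas the R squared of a random forest can, so a real screening is more discriminating than this stand-in suggests.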

Users should not adopt the suggested variable interactions uncritically. Most likely, only one or a few of them will make sense from a domain expertise standpoint.

rf_interactions(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  ranger.arguments = NULL,
  importance.threshold = NULL,
  point.color = viridis::viridis(100, option = "F"),
  seed = NULL,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster.ips = NULL,
  cluster.cores = NULL,
  cluster.user = Sys.info()[["user"]],
  cluster.port = "11000"
)

Arguments

data

Data frame with a response variable and a set of predictors. Default: NULL

dependent.variable.name

Character string with the name of the response variable. Must be in the column names of data. If the dependent variable is binary with values 1 and 0, the argument case.weights of ranger is populated by the function case_weights(). Default: NULL

predictor.variable.names

Character vector with the names of the predictive variables. Every element of this vector must be in the column names of data. Default: NULL

ranger.arguments

Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', which is set to 'permutation' rather than 'none'. Please consult the help file of ranger if you are not familiar with its arguments.
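
As a brief illustration, such a list might look like the sketch below. The values are arbitrary and hypothetical; num.trees, mtry, and min.node.size are arguments of ranger::ranger().

```r
# Hypothetical settings, chosen only for illustration.
ranger.arguments <- list(
  num.trees = 1000,    # number of trees in the forest
  mtry = 3,            # variables tried at each split
  min.node.size = 10   # minimum size of terminal nodes
)

# The list would then be passed on as:
# rf_interactions(..., ranger.arguments = ranger.arguments)
```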

importance.threshold

Value of variable importance from model used as threshold to select variables to generate candidate interactions. Default: Quantile 0.75 of the variable importance in model.
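
The default threshold and the resulting candidate pairs can be sketched in base R as follows. The importance scores and variable names are hypothetical, and the use of a strict "greater than" comparison is an assumption about how "above the threshold" is evaluated.

```r
# Hypothetical permutation importance scores for twelve predictors.
importance <- c(
  v1 = 0.02, v2 = 0.03, v3 = 0.05, v4 = 0.06, v5 = 0.08, v6 = 0.10,
  v7 = 0.12, v8 = 0.15, v9 = 0.20, v10 = 0.30, v11 = 0.35, v12 = 0.42
)

# Default threshold: quantile 0.75 of the importance scores.
importance.threshold <- unname(quantile(importance, probs = 0.75))

# Variables above the threshold (strict comparison is an assumption).
selected <- names(importance)[importance > importance.threshold]

# All unordered pairs of selected variables, one pair per row.
candidate.pairs <- t(combn(selected, 2))
```

Lowering the threshold increases the number of selected variables, and the number of candidate pairs grows quadratically with it (choose(k, 2) pairs for k selected variables).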

point.color

Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. c("#440154FF", "#21908CFF", "#FDE725FF")), or a function generating a palette (e.g. viridis::viridis(100)). Default: viridis::viridis(100, option = "F")

seed

Integer, random seed to facilitate reproducibility. If set to a given number, the results of the function are always the same. Default: NULL

verbose

Logical. If TRUE, messages and plots generated during the execution of the function are displayed. Default: TRUE

n.cores

Integer, number of cores to use. Default: parallel::detectCores() - 1
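
When setting this argument by hand, a defensive variant of the default can be useful, because parallel::detectCores() may return NA on some platforms and returns 1 on single-core machines. The following is a sketch of one way to guard against both cases, not the function's own logic.

```r
# Number of cores reported by the system (may be NA on some platforms).
available <- parallel::detectCores()

# Leave one core free, but never go below one.
n.cores <- if (is.na(available)) 1L else max(1L, available - 1L)
```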

cluster.ips

Character vector with the IPs of the machines in a cluster. The machine with the first IP will be considered the main node of the cluster, and will generally be the machine on which the R code is being executed.

cluster.cores

Integer vector, number of cores to use on each machine.

cluster.user

Character string, name of the user (should be the same across all machines). Defaults to the current system user.

cluster.port

Integer, port used by the machines in the cluster to communicate. The firewall on every machine must allow traffic to and from this port. Default: 11000
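
A cluster definition might be assembled as in the sketch below. The IP addresses and core counts are hypothetical, and pairing cluster.ips with cluster.cores element by element is an assumption based on the argument descriptions above.

```r
# Hypothetical cluster of three machines; the first IP is the main node.
cluster.ips <- c("10.42.0.1", "10.42.0.34", "10.42.0.104")
cluster.cores <- c(7, 4, 4)

# One core count per machine is assumed.
stopifnot(length(cluster.ips) == length(cluster.cores))

# Summary of the cluster layout.
cluster.spec <- data.frame(
  ip = cluster.ips,
  cores = cluster.cores,
  role = c("main", rep("worker", length(cluster.ips) - 1))
)

# Total number of cores across the cluster.
total.cores <- sum(cluster.cores)
```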

Value

A list with four slots: screening, with the complete screening results; selected, with the names of the selected variable interactions and the R squared improvement produced by each; columns, a data frame with the interactions computed from the training data; and plot, a list of plots of the selected interactions versus the response variable. The plots can be displayed all at once with patchwork::wrap_plots(p) or cowplot::plot_grid(plotlist = p), where p is the plot slot of the output, or one by one by extracting each plot from the list.

Examples

if(interactive()){

  data(plant_richness_df)

  interactions <- rf_interactions(
    data = plant_richness_df,
    dependent.variable.name = "richness_species_vascular",
    predictor.variable.names = colnames(plant_richness_df)[5:21],
    verbose = TRUE
  )

  interactions$screening
  interactions$selected
  interactions$columns

}