R/rf_interactions.R
rf_interactions.Rd
Suggests candidate variable interactions by selecting the variables above a given variable importance threshold (given by the argument importance.threshold) from a model and combining them in pairs through multiplication (a * b). The interacting variables are scaled between 1 and 100 before multiplication to avoid artifacts when a variable has 0 somewhere in the middle of its range (e.g. temperature).
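The rescaling step can be illustrated with a minimal base R sketch. The helper rescale_range() and the toy vectors are hypothetical and only meant to show the idea; they are not the package's internal code.

```r
# Hypothetical helper (not part of the package): linearly rescale a numeric
# vector to [new_min, new_max]. Assumes x is not constant.
rescale_range <- function(x, new_min = 1, new_max = 100) {
  old_range <- range(x, na.rm = TRUE)
  (x - old_range[1]) / (old_range[2] - old_range[1]) * (new_max - new_min) + new_min
}

# Toy predictors: temperature crosses zero in the middle of its range.
temperature <- c(-10, -5, 0, 5, 10)
rainfall    <- c(200, 400, 600, 800, 1000)

# Raw product: collapses to 0 wherever temperature is 0, and changes sign
# at that point, whatever the rainfall value is.
raw_product <- temperature * rainfall

# Product after rescaling both variables to [1, 100]: strictly positive,
# so no observation is zeroed out by the multiplication.
scaled_product <- rescale_range(temperature) * rescale_range(rainfall)

print(raw_product)
print(scaled_product)
```

With the raw values, the third observation contributes nothing to the interaction term only because of where zero falls on the temperature scale; after rescaling, the product reflects the relative position of each observation within both ranges.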
For each variable interaction, a model including all the predictors plus the interaction is fitted, and its R squared is compared with the R squared of the model without interactions. This model without interactions can either be provided through the argument model, or is fitted on the fly with rf_repeat() if the user provides the data.
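The comparison described above can be sketched for a single candidate interaction as follows. This is a simplified illustration under several assumptions: it calls ranger::ranger() directly on simulated data instead of rf_repeat(), it uses the out-of-bag R squared reported by ranger, and the names oob_r_squared(), rescale_1_100(), and a_x_b are hypothetical. The actual screening in rf_interactions() may apply further criteria.

```r
library(ranger)  # assumes the ranger package is installed

set.seed(1)

# Simulated data: the response depends on the product of a and b.
n  <- 300
df <- data.frame(a = runif(n), b = runif(n), c = runif(n))
df$y <- df$a * df$b + rnorm(n, sd = 0.05)

# Hypothetical helper: out-of-bag R squared of a random forest fitted on
# all columns of d other than the response y.
oob_r_squared <- function(d) {
  ranger::ranger(y ~ ., data = d, importance = "permutation", seed = 1)$r.squared
}

# Hypothetical helper: rescale to [1, 100] before multiplying.
rescale_1_100 <- function(x) (x - min(x)) / (max(x) - min(x)) * 99 + 1

# Model without interactions.
r2_without <- oob_r_squared(df)

# Model with all predictors plus the candidate interaction a * b.
df_int <- df
df_int$a_x_b <- rescale_1_100(df$a) * rescale_1_100(df$b)
r2_with <- oob_r_squared(df_int)

# R squared improvement attributable to the candidate interaction.
r2_gain <- r2_with - r2_without
print(c(without = r2_without, with = r2_with, gain = r2_gain))
```

A positive gain suggests that the interaction is worth a closer look, but a single fit is noisy, which is presumably why the function relies on repeated fits through rf_repeat().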
Users should not adopt the suggested variable interactions hastily. Most likely, only one or a few of them will make sense from a domain expertise standpoint.
rf_interactions(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  ranger.arguments = NULL,
  importance.threshold = NULL,
  point.color = viridis::viridis(100, option = "F"),
  seed = NULL,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster.ips = NULL,
  cluster.cores = NULL,
  cluster.user = Sys.info()[["user"]],
  cluster.port = "11000"
)
data | Data frame with a response variable and a set of predictors. Default: NULL |
---|---|
dependent.variable.name | Character string with the name of the response variable. Must be in the column names of data. |
predictor.variable.names | Character vector with the names of the predictive variables. Every element of this vector must be in the column names of data. |
ranger.arguments | Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', which is set to 'permutation' rather than 'none'. Please consult the help file of ranger if you are not familiar with the arguments of this function. |
importance.threshold | Value of variable importance from |
point.color | Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
seed | Integer, random seed to facilitate reproducibility. If set to a given number, the results of the function are always the same. |
verbose | Logical If |
n.cores | Integer, number of cores to use. Default = parallel::detectCores() - 1 |
cluster.ips | Character vector with the IPs of the machines in a cluster. The machine with the first IP will be considered the main node of the cluster, and will generally be the machine on which the R code is being executed. |
cluster.cores | Numeric integer vector, number of cores to use on each machine. |
cluster.user | Character string, name of the user (should be the same throughout machines). Defaults to the current system user. |
cluster.port | Integer, port used by the machines in the cluster to communicate. The firewall on every machine must allow traffic to and from this port. Default: "11000" |
A list with four slots: screening, with the complete screening results; selected, with the names and the R squared improvement produced by each variable interaction; columns, a data frame with the interactions computed from the training data; and plot, with the list of plots of the selected interactions versus the response variable. Given that list of plots as p, all plots can be drawn at once with patchwork::wrap_plots(p) or cowplot::plot_grid(plotlist = p), or one by one by extracting each plot from the list.
# \donttest{
if(interactive()){

  data(plant_richness_df)

  interactions <- rf_interactions(
    data = plant_richness_df,
    dependent.variable.name = "richness_species_vascular",
    predictor.variable.names = colnames(plant_richness_df)[5:21],
    verbose = TRUE
  )

  interactions$screening
  interactions$selected
  interactions$columns

}
# }