
Ranks a set of predictors by the strength of their association with a response. Aims to minimize the loss of important predictors during multicollinearity filtering.

The strength of association between the response and each predictor is computed by the function f. The f functions available are:

The name of the function used is stored in the attribute "f_name" of the output data frame. It can be retrieved via attributes(df)$f_name.

Additionally, any custom function accepting a data frame with the columns "x" (predictor) and "y" (response) and returning a numeric indicator of association where higher numbers indicate higher association will work.
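As an illustration of the custom-function contract described above, the sketch below defines a hypothetical f_abs_spearman() (not part of the package). It takes a data frame with the columns "x" and "y" and returns the absolute Spearman correlation, so that higher values mean stronger association. It assumes both columns are numeric.

```r
# Hypothetical custom association function (not part of collinear).
# Assumes df has numeric columns "x" (predictor) and "y" (response).
f_abs_spearman <- function(df) {
  # drop incomplete rows so cor() does not return NA
  df <- df[stats::complete.cases(df[, c("x", "y")]), ]
  # absolute value: strong negative association also ranks high
  abs(stats::cor(x = df$x, y = df$y, method = "spearman"))
}

# toy data: y decreases monotonically with x
toy <- data.frame(x = 1:10, y = -(1:10)^3)
f_abs_spearman(df = toy)
#> [1] 1
```

Under the contract stated above, a function like this could then be supplied as f = f_abs_spearman in preference_order().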

This function returns a data frame in which the column "predictor" holds the predictor names and the column "preference" holds the result of f. Rows are sorted by "preference", from highest to lowest. This data frame, or the column "predictor" alone, can be used as input for the argument preference_order in collinear(), cor_select(), and vif_select().
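The two input forms mentioned above can be sketched with a toy data frame that mimics the documented output structure. The values are copied from the first rows of the example output on this page. The downstream call is left commented out, and its argument names other than preference_order are assumed to mirror those of collinear().

```r
# Toy preference data frame mimicking the documented output structure;
# values copied from the first rows of the example output on this page.
df_preference <- data.frame(
  response = "vi_numeric",
  predictor = c("swi_mean", "soil_temperature_max", "swi_max"),
  f = "f_r2_pearson",
  preference = c(0.74333418, 0.61129999, 0.59335158),
  stringsAsFactors = FALSE
)

# character vector of predictor names, highest preference first
preference_vector <- df_preference$predictor

# either the data frame or the vector could then be passed on, e.g.:
# cor_select(
#   df = df,
#   predictors = predictors,
#   preference_order = preference_vector
# )
```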

Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).

Accepts a character vector of response variables as input for the argument response. When more than one response is provided, the output is a named list of preference data frames.

Usage

preference_order(
  df = NULL,
  response = NULL,
  predictors = NULL,
  f = "auto",
  warn_limit = NULL,
  quiet = FALSE
)

Arguments

df

(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.

response

(optional; character string or vector) Name/s of response variable/s in df. Used in target encoding when it names a numeric variable and there are categorical predictors, and to compute preference order. Default: NULL.

predictors

(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. If argument response is not provided, non-numeric variables are ignored. Default: NULL.

f

(optional; function) Function to compute preference order. If "auto" (default) or NULL, the output of f_auto() for the given data is used.

Default: "auto"

warn_limit

(optional; numeric) Preference value (R-squared, AUC, or Cramer's V) above which a warning flagging suspicious predictors is issued. Disabled if NULL. Default: NULL.

quiet

(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE.
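To make the role of warn_limit concrete, the following base R sketch mimics the kind of check that argument describes: it flags predictors whose preference value exceeds a threshold. This is an illustration only. The package's internal logic and warning text may differ, and the threshold of 0.7 is an arbitrary value chosen for the example.

```r
# toy preference table; values copied from the example output on this page
df_preference <- data.frame(
  predictor = c("swi_mean", "soil_temperature_max", "swi_max"),
  preference = c(0.74333418, 0.61129999, 0.59335158),
  stringsAsFactors = FALSE
)

# arbitrary threshold, for illustration only
warn_limit <- 0.7

# predictors whose association with the response exceeds the threshold
suspicious <- df_preference$predictor[df_preference$preference > warn_limit]
suspicious
#> [1] "swi_mean"

# emit a message when any predictor is flagged
if (length(suspicious) > 0) {
  message(
    "Suspiciously strong predictors: ",
    paste(suspicious, collapse = ", ")
  )
}
```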

Value

data frame: columns are "response", "predictor", "f" (function name), and "preference". When more than one response is provided, a named list of such data frames is returned instead.

Author

Blas M. Benito, PhD

Examples

#subsets to limit example run time
df <- vi[1:1000, ]
predictors <- vi_predictors[1:10]
predictors_numeric <- vi_predictors_numeric[1:10]

#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)

#progress bar
# progressr::handlers(global = TRUE)

#numeric response and predictors
#------------------------------------------------
#selects f automatically depending on data features
#applies f_r2_pearson() to compute correlation between response and predictors
df_preference <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors_numeric,
  f = NULL
  )
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#> 
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.

#returns data frame ordered by preference
df_preference
#>      response             predictor            f preference
#> 1  vi_numeric              swi_mean f_r2_pearson 0.74333418
#> 2  vi_numeric  soil_temperature_max f_r2_pearson 0.61129999
#> 3  vi_numeric               swi_max f_r2_pearson 0.59335158
#> 4  vi_numeric             swi_range f_r2_pearson 0.41849723
#> 5  vi_numeric               swi_min f_r2_pearson 0.26066257
#> 6  vi_numeric        topo_diversity f_r2_pearson 0.11688815
#> 7  vi_numeric  soil_temperature_min f_r2_pearson 0.08964740
#> 8  vi_numeric            topo_slope f_r2_pearson 0.04638905
#> 9  vi_numeric        topo_elevation f_r2_pearson 0.02936205
#> 10 vi_numeric soil_temperature_mean f_r2_pearson 0.02388867


#several responses
#------------------------------------------------
responses <- c(
  "vi_categorical",
  "vi_counts"
)

preference_list <- preference_order(
  df = df,
  response = responses,
  predictors = predictors
)
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_counts'.
#> 
#> collinear::f_auto(): selected function: 'f_r2_rf()'.
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_categorical'.
#> 
#> collinear::f_auto(): selected function: 'f_v_rf_categorical()'.

#returns a named list
names(preference_list)
#> [1] "vi_counts"      "vi_categorical"
preference_list[[1]]
#>     response          predictor       f preference
#> 1  vi_counts           swi_mean f_r2_rf 0.84397411
#> 2  vi_counts        koppen_zone f_r2_rf 0.81258327
#> 3  vi_counts koppen_description f_r2_rf 0.79691235
#> 4  vi_counts            swi_max f_r2_rf 0.72791918
#> 5  vi_counts       koppen_group f_r2_rf 0.71508036
#> 6  vi_counts            swi_min f_r2_rf 0.64426396
#> 7  vi_counts          soil_type f_r2_rf 0.63949456
#> 8  vi_counts     topo_elevation f_r2_rf 0.27154187
#> 9  vi_counts     topo_diversity f_r2_rf 0.13091316
#> 10 vi_counts         topo_slope f_r2_rf 0.07353939
preference_list[[2]]
#>          response          predictor                  f preference
#> 1  vi_categorical           swi_mean f_v_rf_categorical  0.5771390
#> 2  vi_categorical        koppen_zone f_v_rf_categorical  0.5568756
#> 3  vi_categorical koppen_description f_v_rf_categorical  0.5499681
#> 4  vi_categorical       koppen_group f_v_rf_categorical  0.5385112
#> 5  vi_categorical            swi_max f_v_rf_categorical  0.5257116
#> 6  vi_categorical          soil_type f_v_rf_categorical  0.4521790
#> 7  vi_categorical            swi_min f_v_rf_categorical  0.4515519
#> 8  vi_categorical     topo_elevation f_v_rf_categorical  0.2705244
#> 9  vi_categorical     topo_diversity f_v_rf_categorical  0.1734628
#> 10 vi_categorical         topo_slope f_v_rf_categorical  0.1578257

#can be used in collinear()
# x <- collinear(
#   df = df,
#   response = responses,
#   predictors = predictors,
#   preference_order = preference_list
# )

#f function selected by user
#for binomial response and numeric predictors
# preference_order(
#   df = vi,
#   response = "vi_binomial",
#   predictors = predictors_numeric,
#   f = f_auc_glm_binomial
# )


#disable parallelization
future::plan(future::sequential)