Quantitative Variable Prioritization for Multicollinearity Filtering

Ranks a set of predictors by the strength of their association with a response. Aims to minimize the loss of important predictors during multicollinearity filtering.

The strength of association between the response and each predictor is computed by the function f. The f functions available are:

Numeric response vs. numeric predictor:
- f_r2_pearson(): Pearson's R-squared.
- f_r2_spearman(): Spearman's R-squared.
- f_r2_glm_gaussian(): Pearson's R-squared of response versus the predictions of a Gaussian GLM.
- f_r2_glm_gaussian_poly2(): Gaussian GLM with second degree polynomial.
- f_r2_gam_gaussian(): GAM model fitted with mgcv::gam().
- f_r2_rpart(): Recursive Partition Tree fitted with rpart::rpart().
- f_r2_rf(): Random Forest model fitted with ranger::ranger().

Integer counts response vs. numeric predictor:
- f_r2_glm_poisson(): Pearson's R-squared of a Poisson GLM.
- f_r2_glm_poisson_poly2(): Poisson GLM with second degree polynomial.
- f_r2_gam_poisson(): Poisson GAM.

Binomial response (1 and 0) vs. numeric predictor:
- f_auc_glm_binomial(): AUC of quasibinomial GLM with weighted cases.
- f_auc_glm_binomial_poly2(): As above with second degree polynomial.
- f_auc_gam_binomial(): Quasibinomial GAM with weighted cases.
- f_auc_rpart(): Recursive Partition Tree with weighted cases.
- f_auc_rf(): Random Forest model with weighted cases.

Categorical response (character or factor) vs. categorical predictor:
- f_v(): Cramer's V between two categorical variables.

Categorical response vs. categorical or numerical predictor:
- f_v_rf_categorical(): Cramer's V of a Random Forest model.

The name of the function used is stored in the attribute "f_name" of the output data frame, and can be retrieved via attributes(df)$f_name.

Additionally, any custom function accepting a data frame with the columns "x" (predictor) and "y" (response) and returning a numeric indicator of association where higher numbers indicate higher association will work.
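
As an illustration of the custom function contract described above, the following is a minimal sketch using base R only. The function name f_kendall_abs is hypothetical and not part of the package; it assumes, as stated above, that the input is a data frame with the numeric columns "x" and "y".

```r
# hypothetical custom f: absolute Kendall rank correlation
# assumes df has the numeric columns "x" (predictor) and "y" (response)
f_kendall_abs <- function(df) {
  abs(
    stats::cor(
      x = df$x,
      y = df$y,
      method = "kendall",
      use = "complete.obs"
    )
  )
}

# toy check on a monotonic relationship
toy <- data.frame(x = 1:10, y = (1:10)^2)
f_kendall_abs(df = toy)

# with the package loaded, it would presumably be passed as:
# preference_order(
#   df = df,
#   response = "vi_numeric",
#   predictors = predictors_numeric,
#   f = f_kendall_abs
# )
```

The absolute value keeps the result consistent with the requirement that higher numbers indicate higher association, regardless of the sign of the relationship.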

This function returns a data frame in which the column "predictor" lists the predictor names in descending order of the column "preference", which holds the result of f. This data frame, or the column "predictor" alone, can be used as input for the argument preference_order in collinear(), cor_select(), and vif_select().
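
The shape of the output can be mimicked with base R alone. The sketch below is not the package's implementation: it uses the built-in mtcars data as a stand-in, a plain squared Pearson correlation in place of an f function, and the column names listed in the Value section.

```r
# toy data shipped with base R, standing in for the user's data frame
df_toy <- mtcars
response <- "mpg"
predictors <- c("wt", "hp", "qsec", "drat")

# squared Pearson correlation of each predictor with the response
preference <- vapply(
  X = predictors,
  FUN = function(p) stats::cor(df_toy[[p]], df_toy[[response]])^2,
  FUN.VALUE = numeric(1)
)

# assemble a data frame with the same columns as the documented output
df_preference_toy <- data.frame(
  response = response,
  predictor = predictors,
  f = "squared_pearson", #label for this sketch only
  preference = unname(preference),
  stringsAsFactors = FALSE
)

# order by decreasing preference
df_preference_toy <- df_preference_toy[
  order(df_preference_toy$preference, decreasing = TRUE),
]
rownames(df_preference_toy) <- NULL

# the column "predictor" alone is a valid ranking
df_preference_toy$predictor
```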

Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).

Accepts a character vector of response variables as input for the argument response. When more than one response is provided, the output is a named list of preference data frames.

Usage

preference_order(
  df = NULL,
  response = NULL,
  predictors = NULL,
  f = "auto",
  warn_limit = NULL,
  quiet = FALSE
)

Arguments

df

(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.

response

(optional; character string or vector) Name/s of response variable/s in df. Used in target encoding when it names a numeric variable and there are categorical predictors, and to compute preference order. Default: NULL.

predictors

(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. If argument response is not provided, non-numeric variables are ignored. Default: NULL.

f

(optional; function) Function to compute preference order. If "auto" (default) or NULL, the function selected by f_auto() for the given data is used:

f_auc_rf(): if response is binomial.
f_r2_pearson(): if response and predictors are numeric.
f_v(): if response and predictors are categorical.
f_v_rf_categorical(): if response is categorical and predictors are numeric or mixed.
f_r2_rf(): in all other cases.

Default: "auto".
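
The selection rules listed above can be read as a simple decision sequence. The sketch below restates them in base R for illustration only; choose_f_name() is a hypothetical helper, not the package's f_auto(), and the test for a binomial response (numeric values restricted to 0 and 1) is an assumption.

```r
# hypothetical helper restating the documented selection rules
choose_f_name <- function(df, response, predictors) {

  y <- df[[response]]

  # which predictors are numeric
  x_numeric <- vapply(
    X = df[predictors],
    FUN = is.numeric,
    FUN.VALUE = logical(1)
  )

  y_numeric <- is.numeric(y)

  # assumption: binomial means numeric values in 0 and 1 only
  y_binomial <- y_numeric && all(stats::na.omit(y) %in% c(0, 1))

  if (y_binomial) return("f_auc_rf")
  if (y_numeric && all(x_numeric)) return("f_r2_pearson")
  if (!y_numeric && !any(x_numeric)) return("f_v")
  if (!y_numeric) return("f_v_rf_categorical")

  #all other cases
  "f_r2_rf"

}

# toy data covering the documented cases
toy <- data.frame(
  y_num = c(1.5, 2.0, 3.1, 4.0),
  y_bin = c(0, 1, 1, 0),
  y_cat = c("a", "b", "a", "b"),
  x_num = c(1, 2, 3, 4),
  x_cat = c("u", "v", "u", "v"),
  stringsAsFactors = FALSE
)

choose_f_name(toy, "y_num", "x_num")
choose_f_name(toy, "y_cat", c("x_num", "x_cat"))
```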

warn_limit

(optional; numeric) Preference value (R-squared, AUC, or Cramer's V) over which a warning flagging suspicious predictors is issued. Disabled if NULL. Default: NULL.
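
The kind of check that warn_limit enables can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the preference values are made up, the threshold of 0.85 is arbitrary, and the wording of the warning is hypothetical.

```r
# made-up preference values for illustration
df_pref_toy <- data.frame(
  predictor = c("a", "b", "c"),
  preference = c(0.97, 0.60, 0.20),
  stringsAsFactors = FALSE
)

#arbitrary threshold for this sketch
warn_limit <- 0.85

# predictors with a preference above the threshold
suspicious <- df_pref_toy$predictor[df_pref_toy$preference > warn_limit]

# flag them only when a limit is set and something exceeds it
if (!is.null(warn_limit) && length(suspicious) > 0) {
  warning(
    "predictors with preference above ", warn_limit, ": ",
    paste(suspicious, collapse = ", "),
    call. = FALSE
  )
}
```

A predictor with an association this close to the maximum is often worth inspecting before modelling, for example because it may be a near copy of the response.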

quiet

(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE.

Value

data frame: columns are "response", "predictor", "f" (function name), and "preference". When more than one response is provided, the output is a named list of such data frames.

Author

Blas M. Benito, PhD

Examples

#subsets to limit example run time
df <- vi[1:1000, ]
predictors <- vi_predictors[1:10]
predictors_numeric <- vi_predictors_numeric[1:10]

#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)

#progress bar
# progressr::handlers(global = TRUE)

#numeric response and predictors
#------------------------------------------------
#selects f automatically depending on data features
#applies f_r2_pearson() to compute the R-squared between response and predictors
df_preference <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors_numeric,
  f = NULL
  )
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#> 
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.

#returns data frame ordered by preference
df_preference
#>      response             predictor            f preference
#> 1  vi_numeric              swi_mean f_r2_pearson 0.74333418
#> 2  vi_numeric  soil_temperature_max f_r2_pearson 0.61129999
#> 3  vi_numeric               swi_max f_r2_pearson 0.59335158
#> 4  vi_numeric             swi_range f_r2_pearson 0.41849723
#> 5  vi_numeric               swi_min f_r2_pearson 0.26066257
#> 6  vi_numeric        topo_diversity f_r2_pearson 0.11688815
#> 7  vi_numeric  soil_temperature_min f_r2_pearson 0.08964740
#> 8  vi_numeric            topo_slope f_r2_pearson 0.04638905
#> 9  vi_numeric        topo_elevation f_r2_pearson 0.02936205
#> 10 vi_numeric soil_temperature_mean f_r2_pearson 0.02388867


#several responses
#------------------------------------------------
responses <- c(
  "vi_categorical",
  "vi_counts"
)

preference_list <- preference_order(
  df = df,
  response = responses,
  predictors = predictors
)
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_counts'.
#> 
#> collinear::f_auto(): selected function: 'f_r2_rf()'.
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_categorical'.
#> 
#> collinear::f_auto(): selected function: 'f_v_rf_categorical()'.

#returns a named list
names(preference_list)
#> [1] "vi_counts"      "vi_categorical"
preference_list[[1]]
#>     response          predictor       f preference
#> 1  vi_counts           swi_mean f_r2_rf 0.84397411
#> 2  vi_counts        koppen_zone f_r2_rf 0.81258327
#> 3  vi_counts koppen_description f_r2_rf 0.79691235
#> 4  vi_counts            swi_max f_r2_rf 0.72791918
#> 5  vi_counts       koppen_group f_r2_rf 0.71508036
#> 6  vi_counts            swi_min f_r2_rf 0.64426396
#> 7  vi_counts          soil_type f_r2_rf 0.63949456
#> 8  vi_counts     topo_elevation f_r2_rf 0.27154187
#> 9  vi_counts     topo_diversity f_r2_rf 0.13091316
#> 10 vi_counts         topo_slope f_r2_rf 0.07353939
preference_list[[2]]
#>          response          predictor                  f preference
#> 1  vi_categorical           swi_mean f_v_rf_categorical  0.5771390
#> 2  vi_categorical        koppen_zone f_v_rf_categorical  0.5568756
#> 3  vi_categorical koppen_description f_v_rf_categorical  0.5499681
#> 4  vi_categorical       koppen_group f_v_rf_categorical  0.5385112
#> 5  vi_categorical            swi_max f_v_rf_categorical  0.5257116
#> 6  vi_categorical          soil_type f_v_rf_categorical  0.4521790
#> 7  vi_categorical            swi_min f_v_rf_categorical  0.4515519
#> 8  vi_categorical     topo_elevation f_v_rf_categorical  0.2705244
#> 9  vi_categorical     topo_diversity f_v_rf_categorical  0.1734628
#> 10 vi_categorical         topo_slope f_v_rf_categorical  0.1578257

#can be used in collinear()
# x <- collinear(
#   df = df,
#   response = responses,
#   predictors = predictors,
#   preference_order = preference_list
# )

#f function selected by user
#for binomial response and numeric predictors
# preference_order(
#   df = vi,
#   response = "vi_binomial",
#   predictors = predictors_numeric,
#   f = f_auc_glm_binomial
# )


#disable parallelization
future::plan(future::sequential)