Automated Multicollinearity Filtering with Pairwise Correlations

Implements a recursive forward selection algorithm that keeps only the predictors whose maximum pairwise correlation with all other selected predictors is lower than a given threshold. It uses cor_df() underneath and can therefore handle different combinations of predictor types.

See the section Pairwise Correlation Filtering below for further details.

Usage

cor_select(
  df = NULL,
  predictors = NULL,
  preference_order = NULL,
  max_cor = 0.75,
  quiet = FALSE
)

Arguments

df

(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.

predictors

(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. If argument response is not provided, non-numeric variables are ignored. Default: NULL

preference_order

(optional; string, character vector, or output of preference_order()) Defines a priority order, from first to last, to preserve predictors during the selection process. Accepted inputs are:

"auto": if response is not NULL, calls preference_order() for internal computation.
character vector: predictor names in a custom preference order.
data frame: output of preference_order() from response of length one.
named list: output of preference_order() from response of length two or more.
NULL (default): no preference order is applied, and predictors are ranked from lower to higher sum of absolute pairwise correlations with all other predictors.

Default: NULL

max_cor

(optional; numeric) Maximum correlation allowed between any pair of variables in predictors. Recommended values are between 0.5 and 0.9. Higher values return a larger number of predictors with higher multicollinearity. If NULL, the pairwise correlation analysis is disabled. Default: 0.75.

quiet

(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE.

Value

character vector if response is NULL or is a string.
named list if response is a character vector.

Pairwise Correlation Filtering

The function cor_select() applies a recursive forward selection algorithm to keep predictors with a maximum Pearson correlation with all other selected predictors lower than max_cor.

If the argument preference_order is NULL, the predictors are ranked from the lowest to the highest sum of absolute pairwise correlations with all other predictors.
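The default ranking described above can be illustrated with a minimal base R sketch. It is not the package's internal code: the data frame toy and the object default_order are hypothetical names, and stats::cor() stands in for cor_df(), so the sketch only covers numeric predictors.

```r
# toy data (hypothetical): b is nearly a copy of a, c is independent
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),
  c = rnorm(n)
)

# absolute Pearson correlation matrix with the diagonal zeroed,
# so a predictor's correlation with itself does not count
m <- abs(stats::cor(toy))
diag(m) <- 0

# rank predictors from lower to higher sum of absolute correlations
default_order <- names(sort(colSums(m)))
default_order
```

In this toy data, c is ranked first because it is nearly uncorrelated with a and b, which gives it the highest priority during selection.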

If preference_order is defined, whenever the correlation between two or more variables is above max_cor, the one higher in preference_order is preserved. For example, for two predictors \(a\) and \(b\), given in that preference order, if their correlation is higher than max_cor, then \(a\) is preserved and \(b\) is removed. If their correlation is lower than max_cor, both are preserved.
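The selection logic described in this section can be sketched in base R as follows. This is an illustration under stated assumptions, not the implementation of cor_select(): the helper select_by_cor() and the data frame toy are hypothetical, stats::cor() replaces cor_df() (numeric predictors only), and keeping a candidate whose correlation equals max_cor exactly is an assumption.

```r
# hypothetical helper, not part of the package API
select_by_cor <- function(df, preference_order = NULL, max_cor = 0.75) {
  # absolute Pearson correlations, diagonal zeroed
  m <- abs(stats::cor(df))
  diag(m) <- 0

  # default rank: lower to higher sum of absolute correlations
  default_rank <- names(sort(colSums(m)))

  # preferred predictors first, remaining ones in default rank
  candidates <- c(
    intersect(preference_order, default_rank),
    setdiff(default_rank, preference_order)
  )

  # forward selection: a candidate is kept only if its maximum
  # correlation with the predictors already selected is within max_cor
  # (treating equality as acceptable is an assumption)
  selected <- candidates[1]
  for (candidate in candidates[-1]) {
    if (max(m[candidate, selected]) <= max_cor) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

# toy data (hypothetical): b is nearly a copy of a, c is independent
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(a = a, b = a + rnorm(n, sd = 0.1), c = rnorm(n))

# a is preferred over b, so b is removed
sel_ab <- select_by_cor(toy, preference_order = c("a", "b"))

# b is preferred over a, so a is removed
sel_ba <- select_by_cor(toy, preference_order = c("b", "a"))
```

Swapping the preference order changes which member of the correlated pair survives, while the uncorrelated predictor c is kept in both cases.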

Author

Blas M. Benito, PhD

Examples

#subset to limit example run time
df <- vi[1:1000, ]

#numeric predictors only to speed up examples
#categorical predictors are supported, but result in a slower analysis
predictors <- vi_predictors_numeric[1:8]

#predictors are of classes integer and numeric
sapply(
  X = df[, predictors, drop = FALSE],
  FUN = class
)
#>            topo_slope        topo_diversity        topo_elevation 
#>             "integer"             "integer"             "integer" 
#>              swi_mean               swi_max               swi_min 
#>             "numeric"             "numeric"             "numeric" 
#>             swi_range soil_temperature_mean 
#>             "numeric"             "numeric" 

#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)

#progress bar
# progressr::handlers(global = TRUE)

#without preference order
x <- cor_select(
  df = df,
  predictors = predictors,
  max_cor = 0.75
)
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): argument 'preference_order' is NULL, ranking predictors from lower to higher multicollinearity.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - topo_elevation
#>  - topo_slope
#>  - topo_diversity
#>  - swi_range
#>  - soil_temperature_mean
#>  - swi_min
#>  - swi_mean


#with custom preference order
x <- cor_select(
  df = df,
  predictors = predictors,
  preference_order = c(
    "swi_mean",
    "soil_type"
  ),
  max_cor = 0.75
)
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - swi_mean
#>  - topo_elevation
#>  - topo_slope
#>  - topo_diversity
#>  - swi_range
#>  - soil_temperature_mean
#>  - swi_min


#with automated preference order
df_preference <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors
)
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#> 
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.

x <- cor_select(
  df = df,
  predictors = predictors,
  preference_order = df_preference,
  max_cor = 0.75
)
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - swi_mean
#>  - swi_range
#>  - swi_min
#>  - topo_diversity
#>  - topo_slope
#>  - topo_elevation
#>  - soil_temperature_mean

#resetting to sequential processing
future::plan(future::sequential)