Automated Multicollinearity Filtering with Pairwise Correlations
Source: R/cor_select.R
cor_select.Rd
Implements a recursive forward selection algorithm that keeps only the predictors whose maximum pairwise correlation with all other selected predictors is lower than a given threshold. Uses cor_df() underneath and can therefore handle different combinations of predictor types.
Please check the section Pairwise Correlation Filtering at the end of this help file for further details.
Usage
cor_select(
  df = NULL,
  predictors = NULL,
  preference_order = NULL,
  max_cor = 0.75,
  quiet = FALSE
)
Arguments
- df
(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.
- predictors
(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. If argument response is not provided, non-numeric variables are ignored. Default: NULL
- preference_order
(optional; string, character vector, output of preference_order()). Defines a priority order, from first to last, to preserve predictors during the selection process. Accepted inputs are:
  - "auto" (default): if response is not NULL, calls preference_order() for internal computation.
  - character vector: predictor names in a custom preference order.
  - data frame: output of preference_order() from response of length one.
  - named list: output of preference_order() from response of length two or more.
  - NULL: disabled.
Default: "auto"
- max_cor
(optional; numeric) Maximum correlation allowed between any pair of variables in predictors. Recommended values are between 0.5 and 0.9. Higher values return a larger number of predictors with higher multicollinearity. If NULL, the pairwise correlation analysis is disabled. Default: 0.75
- quiet
(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE
Value
- character vector if response is NULL or is a string.
- named list if response is a character vector.
Pairwise Correlation Filtering
The function cor_select() applies a recursive forward selection algorithm to keep predictors with a maximum Pearson correlation with all other selected predictors lower than max_cor.
If the argument preference_order is NULL, the predictors are ranked from lower to higher sum of absolute pairwise correlation with all other predictors.
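The default ranking described above can be illustrated with a minimal base R sketch. The helper rank_by_collinearity() and the toy data frame are hypothetical names introduced here for illustration; this is not the package's internal code, and it assumes all columns are numeric.

```r
# Hypothetical helper (not part of the package API): ranks the columns of a
# numeric data frame from lower to higher sum of absolute pairwise
# Pearson correlations with all other columns.
rank_by_collinearity <- function(df) {
  m <- abs(stats::cor(df, method = "pearson"))
  diag(m) <- 0 # ignore the correlation of each column with itself
  names(sort(colSums(m)))
}

# toy data: b is nearly a copy of a, c is independent of both
set.seed(1)
a <- rnorm(100)
toy <- data.frame(
  a = a,
  b = a + rnorm(100, sd = 0.1),
  c = rnorm(100)
)

ranking <- rank_by_collinearity(toy)
ranking
```

In this toy case the independent column c should come first in the ranking, because its summed absolute correlation with a and b is the lowest.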
If preference_order is defined, whenever two or more variables are correlated above max_cor, the one higher in preference_order is preserved. For example, for the predictors \(a\) and \(b\), ranked in that preference order, if their correlation is higher than max_cor, then \(b\) is removed and \(a\) is preserved. If their correlation is lower than max_cor, then both are preserved.
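The selection rule can be sketched in a few lines of base R. The function select_by_cor() below is a hypothetical illustration, not the implementation used by cor_select(): it assumes numeric columns, uses Pearson correlations from stats::cor(), and treats a correlation exactly equal to max_cor as a rejection, which is an assumption about the boundary case.

```r
# Hypothetical sketch of threshold-based forward selection.
# preference_order: column names of df, from most to least preferred.
select_by_cor <- function(df, preference_order, max_cor = 0.75) {
  m <- abs(stats::cor(df[, preference_order, drop = FALSE]))
  # the most preferred predictor is always kept
  selected <- preference_order[1]
  for (candidate in preference_order[-1]) {
    # keep the candidate only if its correlation with every
    # already selected predictor is below max_cor
    if (max(m[candidate, selected]) < max_cor) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

# toy data: b is nearly a copy of a, c is independent of both
set.seed(1)
a <- rnorm(100)
toy <- data.frame(
  a = a,
  b = a + rnorm(100, sd = 0.1),
  c = rnorm(100)
)

# a is preferred over b, so b is dropped
select_by_cor(toy, preference_order = c("a", "b", "c"))
#> [1] "a" "c"

# b is preferred over a, so a is dropped
select_by_cor(toy, preference_order = c("b", "a", "c"))
#> [1] "b" "c"
```

Swapping the positions of a and b in the preference order changes which member of the correlated pair survives, while the uncorrelated c is kept in both cases.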
See also
Other pairwise_correlation: cor_clusters(), cor_cramer_v(), cor_df(), cor_matrix()
Examples
#subset to limit example run time
df <- vi[1:1000, ]
#numeric predictors only to speed up examples
#categorical predictors are supported, but result in a slower analysis
predictors <- vi_predictors_numeric[1:8]
#predictors have mixed numeric types (integer and numeric)
sapply(
  X = df[, predictors, drop = FALSE],
  FUN = class
)
#>            topo_slope        topo_diversity        topo_elevation
#>             "integer"             "integer"             "integer"
#>              swi_mean               swi_max               swi_min
#>             "numeric"             "numeric"             "numeric"
#>             swi_range soil_temperature_mean
#>             "numeric"             "numeric"
#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)
#progress bar
# progressr::handlers(global = TRUE)
#without preference order
x <- cor_select(
  df = df,
  predictors = predictors,
  max_cor = 0.75
)
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): argument 'preference_order' is NULL, ranking predictors from lower to higher multicollinearity.
#>
#> collinear::cor_select(): selected predictors:
#> - topo_elevation
#> - topo_slope
#> - topo_diversity
#> - swi_range
#> - soil_temperature_mean
#> - swi_min
#> - swi_mean
#with custom preference order
x <- cor_select(
  df = df,
  predictors = predictors,
  preference_order = c(
    "swi_mean",
    "soil_type"
  ),
  max_cor = 0.75
)
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): selected predictors:
#> - swi_mean
#> - topo_elevation
#> - topo_slope
#> - topo_diversity
#> - swi_range
#> - soil_temperature_mean
#> - swi_min
#with automated preference order
df_preference <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors
)
#>
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#>
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.
x <- cor_select(
  df = df,
  predictors = predictors,
  preference_order = df_preference,
  max_cor = 0.75
)
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): selected predictors:
#> - swi_mean
#> - swi_range
#> - swi_min
#> - topo_diversity
#> - topo_slope
#> - topo_elevation
#> - soil_temperature_mean
#resetting to sequential processing
future::plan(future::sequential)
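A quick way to sanity-check a selection is to compute the largest absolute pairwise correlation among the selected predictors and compare it with max_cor. The helper max_pairwise_cor() below is hypothetical and not part of the package; it assumes numeric columns and Pearson correlations, and is demonstrated on toy data so that it runs on its own.

```r
# Hypothetical helper: largest absolute pairwise Pearson correlation
# among the selected columns of a numeric data frame.
max_pairwise_cor <- function(df, selected) {
  # a single predictor has no pairs to compare
  if (length(selected) < 2) {
    return(0)
  }
  m <- abs(stats::cor(df[, selected, drop = FALSE]))
  max(m[upper.tri(m)])
}

# toy data: b is nearly a copy of a, c is independent of both
set.seed(1)
a <- rnorm(100)
toy <- data.frame(
  a = a,
  b = a + rnorm(100, sd = 0.1),
  c = rnorm(100)
)

# a valid selection stays below the threshold
max_pairwise_cor(toy, c("a", "c")) < 0.75

# a selection keeping both a and b does not
max_pairwise_cor(toy, c("a", "b")) < 0.75
```

Under the same assumptions, the helper could be applied to the objects from the examples above as max_pairwise_cor(df, x), since the predictors used there are all numeric.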