Automated multicollinearity management

Automates multicollinearity management in data frames with numeric and non-numeric predictors by combining four methods:

Target Encoding: When a numeric response is provided and encoding_method is not NULL, it transforms categorical predictors (classes "character" and "factor") to numeric using the response values as reference. See target_encoding_lab() for further details.
Preference Order: When a response of any type is provided via response, the association between the response and each predictor is computed with an appropriate function (see preference_order() and f_auto()), and all predictors are ranked from higher to lower association. This rank is used to preserve important predictors during the multicollinearity filtering.
Pairwise Correlation Filtering: Automated multicollinearity filtering via pairwise correlation. Correlations between numeric and categorical predictors are computed by target-encoding the categorical predictor against the numeric one, and correlations between categorical predictors are computed via Cramer's V. See cor_select(), cor_df(), and cor_cramer_v() for further details.
VIF filtering: Automated algorithm to identify and remove numeric predictors that are linear combinations of other predictors. See vif_select() and vif_df().
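
As an illustration of the association measure used between categorical predictors, the sketch below computes an uncorrected Cramer's V in base R. The function name cramer_v_sketch() is hypothetical, and this is not the code of cor_cramer_v(), which may differ (for example, by applying a bias correction).

```r
# Hypothetical helper: uncorrected Cramer's V between two categorical vectors
cramer_v_sketch <- function(x, y) {
  tab <- table(x, y)
  # Pearson chi-squared statistic without continuity correction;
  # warnings about small expected counts are silenced in this toy example
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  sqrt(as.numeric(chi2) / (n * k))
}

# identical variables: perfect association
x <- rep(c("a", "b", "c"), each = 10)
v_same <- cramer_v_sketch(x, x)

# balanced, independent variables: no association
u <- rep(c("a", "b"), each = 10)
w <- rep(c("u", "v"), times = 10)
v_indep <- cramer_v_sketch(u, w)
```

Values close to 1 indicate that one categorical predictor is largely redundant with the other.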

Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).

Accepts a character vector of response variables as input for the argument response. When more than one response is provided, the output is a named list of character vectors, one per response.

Usage

collinear(
  df = NULL,
  response = NULL,
  predictors = NULL,
  encoding_method = "loo",
  preference_order = "auto",
  f = "auto",
  max_cor = 0.75,
  max_vif = 5,
  quiet = FALSE
)

Arguments

df

(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.

response

(optional; character string or vector) Name/s of response variable/s in df. Used in target encoding when it names a numeric variable and there are categorical predictors, and to compute preference order. Default: NULL.

predictors

(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. If argument response is not provided, non-numeric variables are ignored. Default: NULL

encoding_method

(optional; character string) Name of the target encoding method. One of: "loo", "mean", or "rank". If NULL, target encoding is disabled. Default: "loo"

preference_order

(optional; string, character vector, output of preference_order()). Defines a priority order, from first to last, to preserve predictors during the selection process. Accepted inputs are:

"auto" (default): if response is not NULL, calls preference_order() for internal computation.
character vector: predictor names in a custom preference order.
data frame: output of preference_order() from response of length one.
named list: output of preference_order() from response of length two or more.
NULL: disabled.

Default: "auto"

f

(optional; function) Function to compute preference order. If "auto" (default) or NULL, the output of f_auto() for the given data is used:

f_auc_rf(): if response is binomial.
f_r2_pearson(): if response and predictors are numeric.
f_v(): if response and predictors are categorical.
f_v_rf_categorical(): if response is categorical and predictors are numeric or mixed.
f_r2_rf(): in all other cases.

Default: "auto"

max_cor

(optional; numeric) Maximum correlation allowed between any pair of variables in predictors. Recommended values are between 0.5 and 0.9. Higher values return a larger number of predictors with higher multicollinearity. If NULL, the pairwise correlation analysis is disabled. Default: 0.75

max_vif

(optional; numeric) Maximum Variance Inflation Factor allowed during variable selection. Recommended values are between 2.5 and 10. Higher values return a larger number of predictors with higher multicollinearity. If NULL, the variance inflation analysis is disabled. Default: 5.

quiet

(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE

Value

Character vector of selected predictor names if response is NULL or a single string.
Named list of character vectors, one per response, if response names two or more variables.

Target Encoding

When the argument response names a numeric response variable, categorical predictors in predictors (or in the columns of df if predictors is NULL) are converted to numeric via target encoding with the function target_encoding_lab(). When response is NULL or names a categorical variable, target-encoding is skipped. This feature facilitates multicollinearity filtering in data frames with mixed column types.
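
To make the idea concrete, here is a minimal base R sketch of two target-encoding flavors, mean and leave-one-out. It is not the code of target_encoding_lab(): the toy data, the column names, and the fallback to the global mean for single-row categories are assumptions made for illustration.

```r
# Toy data (hypothetical): numeric response y and categorical predictor soil
toy <- data.frame(
  y = c(1, 3, 2, 6, 10, 20),
  soil = c("a", "a", "b", "b", "b", "c")
)

# Mean encoding: each category is replaced by the mean response of its group
toy$soil_mean <- ave(toy$y, toy$soil, FUN = mean)

# Leave-one-out encoding: group mean computed without the current row.
# Assumption of this sketch: categories with a single row get the global mean.
group_sum <- ave(toy$y, toy$soil, FUN = sum)
group_n <- ave(toy$y, toy$soil, FUN = length)
toy$soil_loo <- ifelse(
  group_n > 1,
  (group_sum - toy$y) / (group_n - 1),
  mean(toy$y)
)

toy
```

Once encoded, the categorical predictor can enter Pearson correlation and VIF computations like any other numeric column.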

Preference Order

This feature is designed to help protect important predictors during the multicollinearity filtering. It involves the arguments preference_order and f.

The argument preference_order accepts:

A character vector of predictor names in a custom order of preference, from first to last. This vector does not need to contain all predictor names, only the ones relevant to the user.
A data frame returned by preference_order(), which ranks predictors based on their association with a response variable.
"auto" (default): if response is provided, preference_order() is used internally to rank the predictors using the function f. If f is "auto" or NULL, f_auto() selects a suitable function based on the properties of the data.
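
The ranking idea can be sketched in base R for the simplest case, a numeric response with numeric predictors, by scoring each predictor with its squared Pearson correlation with the response. This is only an approximation of what preference_order() does with a function such as f_r2_pearson(); the toy data and object names are hypothetical.

```r
set.seed(1)
n <- 200

# Toy data (hypothetical): y depends strongly on a, moderately on b, not on c
toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
toy$y <- 2 * toy$a + toy$b + rnorm(n)

predictors <- c("a", "b", "c")

# association score: squared Pearson correlation with the response
r2 <- sapply(predictors, function(p) cor(toy$y, toy[[p]])^2)

# preference order: from higher to lower association
preference <- names(sort(r2, decreasing = TRUE))
preference
```

Predictors at the top of this vector are the ones a filtering step would try to preserve first.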

Variance Inflation Factors

The Variance Inflation Factor for a given variable \(a\) is computed as \(1/(1-R^2)\), where \(R^2\) is the multiple R-squared of a multiple regression model fitted using \(a\) as response and all other predictors in the input data frame as predictors, as in \(a = b + c + ...\).

The square root of the VIF of \(a\) is the factor by which the confidence interval of the estimate for \(a\) in the linear model \(y = a + b + c + ...\) is widened by multicollinearity in the model predictors.

VIF values range from 1 (the variable is linearly independent of the other predictors) to Inf (the variable is a perfect linear combination of them). Recommended thresholds for the maximum VIF vary across sources; the most common values are 2.5, 5, and 10.
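
The definition above can be checked with a short base R sketch. The helper vif_one() and the toy data are hypothetical and are not part of the package; the second computation uses the known equivalence between VIFs and the diagonal of the inverse correlation matrix.

```r
set.seed(1)
n <- 200

# Toy data (hypothetical): a is close to a linear combination of b and c
toy <- data.frame(b = rnorm(n), c = rnorm(n))
toy$a <- toy$b + toy$c + rnorm(n, sd = 0.3)

# VIF of one variable: regress it on all the others and apply 1 / (1 - R^2)
vif_one <- function(df, variable) {
  others <- setdiff(names(df), variable)
  model <- lm(reformulate(others, response = variable), data = df)
  1 / (1 - summary(model)$r.squared)
}

vif_lm <- sapply(names(toy), function(v) vif_one(toy, v))

# Equivalent shortcut: diagonal of the inverse of the correlation matrix
vif_inv <- diag(solve(cor(toy)))

round(cbind(vif_lm, vif_inv), 2)
```

Both columns should agree up to numerical precision. The matrix shortcut fails when the correlation matrix is singular, which is the case of perfect collinearity.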

VIF-based Filtering

The function vif_select() computes Variance Inflation Factors and removes variables iteratively, until all variables in the resulting selection have a VIF below max_vif.

If the argument preference_order is not provided, all variables are ranked from lower to higher VIF, as returned by vif_df(), and the variable with the highest VIF above max_vif is removed on each iteration.

If preference_order is defined, whenever two or more variables have a VIF above max_vif, the one ranked higher in preference_order is preserved, and a lower-ranked one is removed instead. For example, given the predictors \(a\) and \(b\), in that order of preference, if either of their VIFs is higher than max_vif, then \(b\) is removed regardless of whether its VIF is lower or higher than that of \(a\). If both VIFs are lower than max_vif, both predictors are preserved.
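
One way to read the behavior described above is as a greedy pass over the predictors in order of preference, in which a candidate is kept only if the maximum VIF of the selection stays at or below max_vif. The base R sketch below implements that reading. It is not the source of vif_select(); vif_values() and vif_filter_sketch() are hypothetical names, and only numeric, non-singular data is handled.

```r
# Hypothetical helper: VIFs from the inverse of the correlation matrix
vif_values <- function(df) diag(solve(cor(df)))

vif_filter_sketch <- function(df, max_vif = 5, preference_order = NULL) {
  if (is.null(preference_order)) {
    # default ranking: from lower to higher VIF
    preference_order <- names(sort(vif_values(df)))
  }
  # predictors missing from preference_order go last
  candidates <- c(preference_order, setdiff(names(df), preference_order))
  selected <- candidates[1]
  for (candidate in candidates[-1]) {
    trial <- c(selected, candidate)
    # keep the candidate only if no VIF in the trial set exceeds max_vif
    if (max(vif_values(df[, trial, drop = FALSE])) <= max_vif) {
      selected <- trial
    }
  }
  selected
}

set.seed(1)
n <- 200
toy <- data.frame(b = rnorm(n), c = rnorm(n))
toy$a <- toy$b + toy$c + rnorm(n, sd = 0.3)

# without preference order, a (highest VIF) is the predictor left out
sel_default <- vif_filter_sketch(toy, max_vif = 5)

# protecting a and b forces the removal of c instead
sel_custom <- vif_filter_sketch(toy, max_vif = 5, preference_order = c("a", "b"))
```

The two calls return different selections from the same data, which is the practical effect of supplying a preference order.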

Pairwise Correlation Filtering

The function cor_select() applies a recursive forward selection algorithm to keep predictors with a maximum Pearson correlation with all other selected predictors lower than max_cor.

If the argument preference_order is NULL, the predictors are ranked from lower to higher sum of absolute pairwise correlation with all other predictors.

If preference_order is defined, whenever two or more variables are correlated above max_cor, the one ranked higher in preference_order is preserved. For example, given the predictors \(a\) and \(b\), in that order of preference, if their correlation is higher than max_cor, then \(b\) is removed and \(a\) is preserved. If their correlation is lower than max_cor, both are preserved.
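
The forward selection described in this section can be sketched in base R as follows. This is an illustration for numeric predictors only, not the source of cor_select(); cor_filter_sketch() and the toy data are hypothetical.

```r
cor_filter_sketch <- function(df, max_cor = 0.75, preference_order = NULL) {
  cor_abs <- abs(cor(df))
  diag(cor_abs) <- 0
  if (is.null(preference_order)) {
    # default ranking: lower to higher sum of absolute pairwise correlations
    preference_order <- names(sort(rowSums(cor_abs)))
  }
  # predictors missing from preference_order go last
  candidates <- c(preference_order, setdiff(colnames(cor_abs), preference_order))
  selected <- candidates[1]
  for (candidate in candidates[-1]) {
    # keep the candidate only if it is not too correlated with any selected one
    if (max(cor_abs[candidate, selected]) <= max_cor) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

set.seed(1)
n <- 200
toy <- data.frame(a = rnorm(n), c = rnorm(n))
toy$b <- toy$a + rnorm(n, sd = 0.2) # b is nearly a copy of a

# default ranking: c plus one of the two redundant predictors
sel_default <- cor_filter_sketch(toy, max_cor = 0.75)

# custom preference: b is protected, so a is the one removed
sel_custom <- cor_filter_sketch(toy, max_cor = 0.75, preference_order = "b")
```

As in the VIF case, the preference order decides which member of a correlated pair survives.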

References

Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons. DOI: 10.1002/0471725153.
Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32. DOI: 10.1145/507533.507538.

Examples

#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)

#progress bar
#progressr::handlers(global = TRUE)

#subset to limit example run time
df <- vi[1:500, ]

#predictors has mixed types
#small subset to speed example up
predictors <- c(
  "swi_mean",
  "soil_type",
  "soil_temperature_mean",
  "growing_season_length",
  "rainfall_mean"
  )


#with numeric responses
#--------------------------------
#  target encoding
#  automated preference order
#  all predictors filtered by correlation and VIF
x <- collinear(
  df = df,
  response = c(
    "vi_numeric",
    "vi_binomial"
    ),
  predictors = predictors
)
#> 
#> collinear::collinear(): processing response 'vi_numeric'.
#> ---------------------------------------------------------------
#> 
#> collinear::target_encoding_lab(): using response 'vi_numeric' to encode categorical predictors:
#>  - soil_type
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#> 
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - growing_season_length
#>  - soil_type
#>  - soil_temperature_mean
#> 
#> collinear::vif_select(): maximum VIF in 'predictors' is <= 5. skipping VIF-based filtering.
#> 
#> collinear::collinear(): processing response 'vi_binomial'.
#> ---------------------------------------------------------------
#> 
#> collinear::target_encoding_lab(): using response 'vi_binomial' to encode categorical predictors:
#>  - soil_type
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_binomial'.
#> 
#> collinear::f_auto(): selected function: 'f_auc_rf()'.
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - soil_type
#>  - rainfall_mean
#>  - swi_mean
#>  - soil_temperature_mean
#> 
#> collinear::vif_select(): maximum VIF in 'predictors' is <= 5. skipping VIF-based filtering.

x
#> $vi_numeric
#> [1] "growing_season_length" "soil_type"             "soil_temperature_mean"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"
#> 
#> $vi_binomial
#> [1] "soil_type"             "rainfall_mean"         "swi_mean"             
#> [4] "soil_temperature_mean"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_binomial"
#> 


#with custom preference order
#--------------------------------
x <- collinear(
  df = df,
  response = "vi_numeric",
  predictors = predictors,
  preference_order = c(
    "swi_mean",
    "soil_type"
  )
)
#> 
#> collinear::collinear(): processing response 'vi_numeric'.
#> ---------------------------------------------------------------
#> 
#> collinear::target_encoding_lab(): using response 'vi_numeric' to encode categorical predictors:
#>  - soil_type
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#> 
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - swi_mean
#>  - soil_type
#>  - rainfall_mean
#>  - soil_temperature_mean
#> 
#> collinear::vif_select(): maximum VIF in 'predictors' is <= 5. skipping VIF-based filtering.


#pre-computed preference order
#--------------------------------
preference_df <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors
)
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#> 
#> collinear::f_auto(): selected function: 'f_r2_rf()'.

x <- collinear(
  df = df,
  response = "vi_numeric",
  predictors = predictors,
  preference_order = preference_df
)
#> 
#> collinear::collinear(): processing response 'vi_numeric'.
#> ---------------------------------------------------------------
#> 
#> collinear::target_encoding_lab(): using response 'vi_numeric' to encode categorical predictors:
#>  - soil_type
#> 
#> collinear::collinear(): using preference order data frame.
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - rainfall_mean
#>  - swi_mean
#>  - soil_type
#>  - soil_temperature_mean
#> 
#> collinear::vif_select(): maximum VIF in 'predictors' is <= 5. skipping VIF-based filtering.

#resetting to sequential processing
future::plan(future::sequential)