Automates multicollinearity management in data frames with numeric and non-numeric predictors by combining four methods:
- Target Encoding: When a numeric response is provided and encoding_method is not NULL, it transforms categorical predictors (classes "character" and "factor") to numeric using the response values as reference. See target_encoding_lab() for further details.
- Preference Order: When a response of any type is provided via response, the association between the response and each predictor is computed with an appropriate function (see preference_order() and f_auto()), and all predictors are ranked from higher to lower association. This rank is used to preserve important predictors during the multicollinearity filtering.
- Pairwise Correlation Filtering: Automated multicollinearity filtering via pairwise correlation. Correlations between numeric and categorical predictors are computed by target-encoding the categorical predictor against the numeric one, and correlations between categorical predictors are computed via Cramer's V. See cor_select(), cor_df(), and cor_cramer_v() for further details.
- VIF Filtering: Automated algorithm to identify and remove numeric predictors that are linear combinations of other predictors. See vif_select() and vif_df().
Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).
Accepts a character vector of response variables as input for the argument response. When more than one response is provided, the output is a named list of character vectors.
Usage
collinear(
  df = NULL,
  response = NULL,
  predictors = NULL,
  encoding_method = "loo",
  preference_order = "auto",
  f = "auto",
  max_cor = 0.75,
  max_vif = 5,
  quiet = FALSE
)
Arguments
- df
(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.
- response
(optional; character string or vector) Name/s of response variable/s in df. Used in target encoding when it names a numeric variable and there are categorical predictors, and to compute preference order. Default: NULL.
- predictors
(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. If argument response is not provided, non-numeric variables are ignored. Default: NULL.
- encoding_method
(optional; character string) Name of the target encoding method. One of: "loo", "mean", or "rank". If NULL, target encoding is disabled. Default: "loo".
- preference_order
(optional; string, character vector, or output of preference_order()) Defines a priority order, from first to last, to preserve predictors during the selection process. Accepted inputs are:
  - "auto" (default): if response is not NULL, calls preference_order() for internal computation.
  - character vector: predictor names in a custom preference order.
  - data frame: output of preference_order() from response of length one.
  - named list: output of preference_order() from response of length two or more.
  - NULL: disabled.
Default: "auto".
- f
(optional; function) Function to compute preference order. If "auto" (default) or NULL, the output of f_auto() for the given data is used:
  - f_auc_rf(): if response is binomial.
  - f_r2_pearson(): if response and predictors are numeric.
  - f_v(): if response and predictors are categorical.
  - f_v_rf_categorical(): if response is categorical and predictors are numeric or mixed.
  - f_r2_rf(): in all other cases.
Default: "auto".
- max_cor
(optional; numeric) Maximum correlation allowed between any pair of variables in predictors. Recommended values are between 0.5 and 0.9. Higher values return a larger number of predictors with higher multicollinearity. If NULL, the pairwise correlation analysis is disabled. Default: 0.75.
- max_vif
(optional; numeric) Maximum Variance Inflation Factor allowed during variable selection. Recommended values are between 2.5 and 10. Higher values return a larger number of predictors with higher multicollinearity. If NULL, the variance inflation analysis is disabled. Default: 5.
- quiet
(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE.
Value
- character vector: if response is NULL or is a single string.
- named list of character vectors, one per response: if response is a character vector of length two or more.
Target Encoding
When the argument response names a numeric response variable, categorical predictors in predictors (or in the columns of df if predictors is NULL) are converted to numeric via target encoding with the function target_encoding_lab(). When response is NULL or names a categorical variable, target encoding is skipped. This feature facilitates multicollinearity filtering in data frames with mixed column types.
Preference Order
This feature is designed to help protect important predictors during the multicollinearity filtering. It involves the arguments preference_order and f.
The argument preference_order accepts:
- A character vector of predictor names in a custom order of preference, from first to last. This vector does not need to contain all predictor names, but only the ones relevant to the user.
- A data frame returned by preference_order(), which ranks predictors based on their association with a response variable.
- "auto" (default): if response is provided, preference_order() is used internally to rank the predictors using the function f. If f is "auto" or NULL, f_auto() selects a suitable function based on the data properties.
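As a rough illustration of what a preference ranking looks like, the base R sketch below ranks numeric predictors by their squared Pearson correlation with a numeric response. It is only an approximation of the idea behind f_r2_pearson(), not the package's implementation; the simulated data and the names x1, x2, and x3 are hypothetical.

```r
# simulated data (hypothetical): x1 drives y strongly, x2 weakly, x3 not at all
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 3 * x1 + x2 + rnorm(n)
df <- data.frame(y, x1, x2, x3)

predictors <- c("x1", "x2", "x3")

# association between each predictor and the response (squared Pearson r)
r2 <- sapply(predictors, function(p) cor(df[[p]], df$y)^2)

# preference order: from higher to lower association with the response
ranking <- names(sort(r2, decreasing = TRUE))
ranking
```

A character vector such as ranking has the same shape as a custom preference_order input: predictor names ordered from first to last priority.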
Variance Inflation Factors
The Variance Inflation Factor for a given variable \(a\) is computed as \(1/(1-R^2)\), where \(R^2\) is the multiple R-squared of a multiple regression model fitted using \(a\) as response and all other predictors in the input data frame as predictors, as in \(a = b + c + ...\).
The square root of the VIF of \(a\) is the factor by which the confidence interval of the estimate for \(a\) in the linear model \(y = a + b + c + ...\) is widened by multicollinearity in the model predictors.
VIF values range from 1 (when \(R^2 = 0\)) to Inf (as \(R^2\) approaches 1). The recommended thresholds for maximum VIF vary depending on the source consulted, the most common values being 2.5, 5, and 10.
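The definition above can be checked with base R. The sketch below is not the code used by vif_df(); the simulated data and the helper vif_one() are hypothetical. It computes each VIF from the R-squared of a linear model and compares the result with the diagonal of the inverse correlation matrix, which is a known equivalent formulation.

```r
# simulated data (hypothetical): x1 is close to a linear combination of x2 and x3
set.seed(1)
n <- 100
x2 <- rnorm(n)
x3 <- rnorm(n)
x1 <- x2 + x3 + rnorm(n, sd = 0.5)
df <- data.frame(x1, x2, x3)

# VIF of one variable: 1 / (1 - R^2) of the model variable ~ all other columns
vif_one <- function(df, variable) {
  others <- setdiff(colnames(df), variable)
  model <- lm(reformulate(others, response = variable), data = df)
  1 / (1 - summary(model)$r.squared)
}

vif_lm <- sapply(colnames(df), function(v) vif_one(df, v))

# equivalent shortcut: diagonal of the inverse of the correlation matrix
vif_inv <- diag(solve(cor(df)))

# both approaches agree up to numerical tolerance
all.equal(unname(vif_lm), unname(vif_inv))
```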
VIF-based Filtering
The function vif_select() computes Variance Inflation Factors and removes variables iteratively, until all variables in the resulting selection have a VIF below max_vif.
If the argument preference_order is not provided, all variables are ranked from lower to higher VIF, as returned by vif_df(), and the variable with the highest VIF above max_vif is removed on each iteration.
If preference_order is defined, whenever two or more variables are above max_vif, the one higher in preference_order is preserved, and the next one with a higher VIF is removed. For example, for the predictors and preference order \(a\) and \(b\), if any of their VIFs is higher than max_vif, then \(b\) will be removed regardless of whether its VIF is lower or higher than that of \(a\). If both VIF scores are lower than max_vif, then both are preserved.
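The iterative logic described above can be sketched in base R. This is one possible reading of the text, not the code of vif_select(): the function vif_filter(), its preference argument, and the simulated data are hypothetical. On each iteration it drops, among the variables above max_vif, either the one with the highest VIF (no preference) or the one lowest in the preference order.

```r
# simulated data (hypothetical): x2 is almost a copy of x1, x3 is independent
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
df <- data.frame(x1, x2, x3)

# VIF of every column, via the inverse of the correlation matrix
vif_all <- function(df) {
  setNames(diag(solve(cor(df))), colnames(df))
}

vif_filter <- function(df, max_vif = 5, preference = NULL) {
  selected <- colnames(df)
  while (length(selected) > 1) {
    vif <- vif_all(df[, selected, drop = FALSE])
    above <- names(vif)[vif > max_vif]
    if (length(above) == 0) break
    if (is.null(preference)) {
      # no preference: drop the variable with the highest VIF
      to_drop <- names(which.max(vif))
    } else {
      # with preference: drop the variable above max_vif that is lowest
      # in the preference order (unlisted variables rank last)
      position <- match(above, preference, nomatch = length(preference) + 1)
      to_drop <- above[which.max(position)]
    }
    selected <- setdiff(selected, to_drop)
  }
  selected
}

vif_filter(df, max_vif = 5)
vif_filter(df, max_vif = 5, preference = "x2")
```

In this sketch the preference order only decides which of the collinear variables survives; variables already below max_vif are never removed.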
Pairwise Correlation Filtering
The function cor_select() applies a recursive forward selection algorithm to keep predictors with a maximum Pearson correlation with all other selected predictors lower than max_cor.
If the argument preference_order is NULL, the predictors are ranked from lower to higher sum of absolute pairwise correlation with all other predictors.
If preference_order is defined, whenever two or more variables are above max_cor, the one higher in preference_order is preserved. For example, for the predictors and preference order \(a\) and \(b\), if their correlation is higher than max_cor, then \(b\) will be removed and \(a\) preserved. If their correlation is lower than max_cor, then both are preserved.
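A forward selection of this kind can be sketched in base R for numeric predictors. The function cor_filter() and the simulated data below are hypothetical, and the sketch is not the code of cor_select(): it walks through the predictors in preference order and keeps a candidate only if its absolute correlation with every predictor already selected is lower than max_cor.

```r
# simulated data (hypothetical): x2 is almost a copy of x1, x3 is independent
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
df <- data.frame(x1, x2, x3)

cor_filter <- function(df, max_cor = 0.75, preference = NULL) {
  cor_abs <- abs(cor(df))
  if (is.null(preference)) {
    # default ranking: lower to higher sum of absolute correlations
    # (the diagonal adds 1 to every column, so it does not change the order)
    preference <- names(sort(colSums(cor_abs)))
  }
  # predictors not listed in preference are evaluated last
  candidates <- c(preference, setdiff(colnames(df), preference))
  selected <- candidates[1]
  for (candidate in candidates[-1]) {
    # keep the candidate only if it is not too correlated with any selected one
    if (max(cor_abs[candidate, selected]) < max_cor) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

cor_filter(df, max_cor = 0.75)
cor_filter(df, max_cor = 0.75, preference = "x2")
```

Because the first predictor in the ranking is always kept, placing a variable at the top of the preference order guarantees that it survives the filtering.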
References
Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons. DOI: 10.1002/0471725153.
Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32. DOI: 10.1145/507533.507538.
Examples
#load the package (provides collinear() and the example data frame vi)
library(collinear)
#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)
#progress bar
#progressr::handlers(global = TRUE)
#subset to limit example run time
df <- vi[1:500, ]
#predictors has mixed types
#small subset to speed example up
predictors <- c(
  "swi_mean",
  "soil_type",
  "soil_temperature_mean",
  "growing_season_length",
  "rainfall_mean"
)
#with numeric responses
#--------------------------------
# target encoding
# automated preference order
# all predictors filtered by correlation and VIF
x <- collinear(
  df = df,
  response = c(
    "vi_numeric",
    "vi_binomial"
  ),
  predictors = predictors
)
#>
#> collinear::collinear(): processing response 'vi_numeric'.
#> ---------------------------------------------------------------
#>
#> collinear::target_encoding_lab(): using response 'vi_numeric' to encode categorical predictors:
#> - soil_type
#>
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#>
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): selected predictors:
#> - growing_season_length
#> - soil_type
#> - soil_temperature_mean
#>
#> collinear::vif_select(): maximum VIF in 'predictors' is <= 5. skipping VIF-based filtering.
#>
#> collinear::collinear(): processing response 'vi_binomial'.
#> ---------------------------------------------------------------
#>
#> collinear::target_encoding_lab(): using response 'vi_binomial' to encode categorical predictors:
#> - soil_type
#>
#> collinear::preference_order(): ranking predictors for response 'vi_binomial'.
#>
#> collinear::f_auto(): selected function: 'f_auc_rf()'.
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): selected predictors:
#> - soil_type
#> - rainfall_mean
#> - swi_mean
#> - soil_temperature_mean
#>
#> collinear::vif_select(): maximum VIF in 'predictors' is <= 5. skipping VIF-based filtering.
x
#> $vi_numeric
#> [1] "growing_season_length" "soil_type" "soil_temperature_mean"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"
#>
#> $vi_binomial
#> [1] "soil_type" "rainfall_mean" "swi_mean"
#> [4] "soil_temperature_mean"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_binomial"
#>
#with custom preference order
#--------------------------------
x <- collinear(
  df = df,
  response = "vi_numeric",
  predictors = predictors,
  preference_order = c(
    "swi_mean",
    "soil_type"
  )
)
#>
#> collinear::collinear(): processing response 'vi_numeric'.
#> ---------------------------------------------------------------
#>
#> collinear::target_encoding_lab(): using response 'vi_numeric' to encode categorical predictors:
#> - soil_type
#>
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#>
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): selected predictors:
#> - swi_mean
#> - soil_type
#> - rainfall_mean
#> - soil_temperature_mean
#>
#> collinear::vif_select(): maximum VIF in 'predictors' is <= 5. skipping VIF-based filtering.
#pre-computed preference order
#--------------------------------
preference_df <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors
)
#>
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#>
#> collinear::f_auto(): selected function: 'f_r2_rf()'.
x <- collinear(
  df = df,
  response = "vi_numeric",
  predictors = predictors,
  preference_order = preference_df
)
#>
#> collinear::collinear(): processing response 'vi_numeric'.
#> ---------------------------------------------------------------
#>
#> collinear::target_encoding_lab(): using response 'vi_numeric' to encode categorical predictors:
#> - soil_type
#>
#> collinear::collinear(): using preference order data frame.
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): selected predictors:
#> - rainfall_mean
#> - swi_mean
#> - soil_type
#> - soil_temperature_mean
#>
#> collinear::vif_select(): maximum VIF in 'predictors' is <= 5. skipping VIF-based filtering.
#resetting to sequential processing
future::plan(future::sequential)