Automated Multicollinearity Filtering with Variance Inflation Factors

This function automates multicollinearity filtering in data frames with numeric predictors by combining two methods:

Preference Order: method to rank and preserve relevant variables during multicollinearity filtering. See argument preference_order and function preference_order().
VIF-based filtering: recursive algorithm to identify and remove predictors with a VIF above a given threshold.

When the argument preference_order is not provided, the predictors are ranked from lower to higher VIF. The resulting selection, although diverse and uncorrelated, might not be the one with the highest overall predictive power when used in a model.

Please check the sections Preference Order, Variance Inflation Factors, and VIF-based Filtering at the end of this help file for further details.

Usage

vif_select(
  df = NULL,
  predictors = NULL,
  preference_order = NULL,
  max_vif = 5,
  quiet = FALSE
)

Arguments

df

(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.

predictors

(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. Non-numeric variables are ignored. Default: NULL.

preference_order

(optional; string, character vector, output of preference_order()). Defines a priority order, from first to last, to preserve predictors during the selection process. Accepted inputs are:

"auto": if response is not NULL, calls preference_order() for internal computation.
character vector: predictor names in a custom preference order.
data frame: output of preference_order() from response of length one.
named list: output of preference_order() from response of length two or more.
NULL (default): disabled; predictors are ranked from lower to higher VIF.

Default: NULL

max_vif

(optional; numeric) Maximum Variance Inflation Factor allowed during variable selection. Recommended values are between 2.5 and 10. Higher values return a larger number of predictors with higher multicollinearity. If NULL, the variance inflation analysis is disabled. Default: 5.

quiet

(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE.

Value

character vector if response is NULL or is a string.
named list if response is a character vector.

Preference Order

This feature is designed to help protect important predictors during the multicollinearity filtering. It involves the arguments preference_order and f.

The argument preference_order accepts:

A character vector of predictor names in a custom order of preference, from first to last. This vector does not need to contain all predictor names, only the ones relevant to the user.
A data frame returned by preference_order(), which ranks predictors based on their association with a response variable.
If NULL, and response is provided, then preference_order() is used internally to rank the predictors using the function f. If f is NULL as well, then f_auto() selects a proper function based on the data properties.
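
As a rough illustration of ranking predictors by their association with a response, the sketch below orders a few columns of the base R dataset mtcars by their squared Pearson correlation with mpg. The dataset, the column choice, and the data frame layout are assumptions made for this example; this is not the internal code of preference_order(), and its actual output may be structured differently.

```r
# Sketch: rank predictors by squared Pearson correlation with a response.
# 'mtcars', 'mpg', and the predictor names are illustrative assumptions.
response <- "mpg"
predictors <- c("disp", "hp", "wt", "drat", "qsec")

# squared correlation between each predictor and the response
r2 <- sapply(
  X = predictors,
  FUN = function(p) stats::cor(mtcars[[p]], mtcars[[response]])^2
)

# hypothetical ranking table, from stronger to weaker association
ranking <- data.frame(
  predictor = names(r2),
  preference = unname(r2)
)
ranking <- ranking[order(ranking$preference, decreasing = TRUE), ]

# character vector usable as a custom preference order
ranking$predictor
```

A character vector like ranking$predictor is the kind of input described in the first option above.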

Variance Inflation Factors

The Variance Inflation Factor for a given variable \(a\) is computed as \(1/(1-R^2)\), where \(R^2\) is the multiple R-squared of a multiple regression model fitted using \(a\) as response and all other predictors in the input data frame as predictors, as in \(a = b + c + ...\).

The square root of the VIF of \(a\) is the factor by which the confidence interval of the estimate for \(a\) in the linear model \(y = a + b + c + ...\) is widened by multicollinearity in the model predictors.

VIF values start at 1, reached when \(R^2 = 0\), and have no upper bound. The recommended thresholds for maximum VIF vary depending on the source consulted, the most common values being 2.5, 5, and 10.
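
The definition above can be checked with base R alone. The following sketch computes the VIF of each predictor from the R-squared of a regression against the remaining predictors, and compares it with the diagonal of the inverse correlation matrix, which is a standard equivalent shortcut. The dataset mtcars and the selected columns are assumptions made for this example; this is not the code used by vif_df().

```r
# Sketch: VIF computed from its definition, using base R only.
# 'mtcars' and the chosen columns are illustrative assumptions.
predictors <- c("disp", "hp", "wt", "drat")
df <- mtcars[, predictors]

vif_from_lm <- sapply(
  X = predictors,
  FUN = function(p) {
    # regress predictor p against all other predictors
    m <- stats::lm(
      formula = stats::reformulate(setdiff(predictors, p), response = p),
      data = df
    )
    # VIF = 1 / (1 - R^2)
    1 / (1 - summary(m)$r.squared)
  }
)

# equivalent shortcut: diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(stats::cor(df)))

# both approaches should agree up to numerical tolerance
all.equal(unname(vif_from_lm), unname(vif_from_cor))

# square root of the VIF: widening factor of the confidence interval
sqrt(vif_from_lm)
```

For instance, a VIF of 4 has a square root of 2, meaning a confidence interval twice as wide as it would be without multicollinearity.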

VIF-based Filtering

The function vif_select() computes Variance Inflation Factors and removes variables iteratively, until all variables in the resulting selection have a VIF below max_vif.

If the argument preference_order is not provided, all variables are ranked from lower to higher VIF, as returned by vif_df(), and the variable with the highest VIF above max_vif is removed on each iteration.

If preference_order is defined, whenever two or more variables are above max_vif, the one higher in preference_order is preserved, and the next one with the highest VIF is removed. For example, for the predictors \(a\) and \(b\), ranked in that order of preference, if either of their VIFs is higher than max_vif, then \(b\) is removed regardless of whether its VIF is lower or higher than that of \(a\). If both VIFs are lower than max_vif, then both predictors are preserved.
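
The iterative logic described above can be sketched in base R as follows. This is a simplified reading of the description, not the implementation of vif_select(): the function name vif_filter_sketch() is hypothetical, VIFs are taken from the inverse correlation matrix (which fails under perfect collinearity), and predictors missing from preference_order are assumed to rank last.

```r
# Sketch of VIF-based filtering; NOT the package's implementation.
# vif_filter_sketch() is a hypothetical name used only for illustration.
vif_filter_sketch <- function(df, predictors, preference_order = NULL, max_vif = 5) {
  selected <- predictors
  while (length(selected) > 1) {
    # VIFs of the current selection (assumes a non-singular correlation matrix)
    vif <- diag(solve(stats::cor(df[, selected, drop = FALSE])))
    above <- names(vif)[vif > max_vif]
    if (length(above) == 0) break
    if (is.null(preference_order)) {
      # no preference: remove the predictor with the highest VIF
      to_remove <- names(which.max(vif))
    } else {
      # position in preference_order; unranked predictors go last (assumption)
      pref_rank <- match(above, preference_order, nomatch = length(preference_order) + 1L)
      # among the least preferred candidates, remove the one with the highest VIF
      candidates <- above[pref_rank == max(pref_rank)]
      to_remove <- candidates[which.max(vif[candidates])]
    }
    selected <- setdiff(selected, to_remove)
  }
  selected
}

# illustrative usage on a base R dataset (column choice is an assumption)
predictors <- c("disp", "hp", "wt", "drat", "qsec", "cyl")
x_default <- vif_filter_sketch(mtcars, predictors, max_vif = 5)
x_custom <- vif_filter_sketch(
  mtcars,
  predictors,
  preference_order = c("cyl", "disp"),
  max_vif = 5
)
```

In both cases the loop stops once no remaining predictor has a VIF above max_vif, or when a single predictor is left.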

References

Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons. DOI: 10.1002/0471725153.

Author

Blas M. Benito, PhD

Examples

#subset to limit example run time
df <- vi[1:1000, ]
predictors <- vi_predictors[1:10]
predictors_numeric <- vi_predictors_numeric[1:10]

#predictors has mixed types
sapply(
  X = df[, predictors, drop = FALSE],
  FUN = class
)
#>        koppen_zone       koppen_group koppen_description          soil_type 
#>        "character"        "character"        "character"           "factor" 
#>         topo_slope     topo_diversity     topo_elevation           swi_mean 
#>          "integer"          "integer"          "integer"          "numeric" 
#>            swi_max            swi_min 
#>          "numeric"          "numeric" 

#categorical predictors are ignored
x <- vif_select(
  df = df,
  predictors = predictors,
  max_vif = 2.5
)
#> 
#> collinear::vif_select(): these predictors are not numeric and will be ignored: 
#>  - koppen_zone
#>  - koppen_group
#>  - koppen_description
#>  - soil_type.
#> 
#> collinear::vif_select(): argument 'preference_order' is NULL, ranking predictors from lower to higher multicollinearity.
#> 
#> collinear::vif_select(): selected predictors: 
#>  - topo_elevation
#>  - topo_diversity
#>  - topo_slope
#>  - swi_min
#>  - swi_max

x
#> [1] "topo_elevation" "topo_diversity" "topo_slope"     "swi_min"       
#> [5] "swi_max"       
#> attr(,"validated")
#> [1] TRUE

#all these have a VIF lower than max_vif (2.5)
vif_df(
  df = df,
  predictors = x
)
#>        predictor    vif
#> 4        swi_min 1.7846
#> 5        swi_max 1.7242
#> 3     topo_slope 1.6045
#> 2 topo_diversity 1.4499
#> 1 topo_elevation 1.1994


#higher max_vif results in larger selection
x <- vif_select(
  df = df,
  predictors = predictors_numeric,
  max_vif = 10
)
#> 
#> collinear::vif_select(): argument 'preference_order' is NULL, ranking predictors from lower to higher multicollinearity.
#> 
#> collinear::vif_select(): selected predictors: 
#>  - topo_elevation
#>  - topo_diversity
#>  - topo_slope
#>  - swi_mean
#>  - soil_temperature_max
#>  - soil_temperature_min
#>  - swi_min

x
#> [1] "topo_elevation"       "topo_diversity"       "topo_slope"          
#> [4] "swi_mean"             "soil_temperature_max" "soil_temperature_min"
#> [7] "swi_min"             
#> attr(,"validated")
#> [1] TRUE


#smaller max_vif results in smaller selection
x <- vif_select(
  df = df,
  predictors = predictors_numeric,
  max_vif = 2.5
)
#> 
#> collinear::vif_select(): argument 'preference_order' is NULL, ranking predictors from lower to higher multicollinearity.
#> 
#> collinear::vif_select(): selected predictors: 
#>  - topo_elevation
#>  - topo_diversity
#>  - topo_slope
#>  - swi_mean
#>  - soil_temperature_max

x
#> [1] "topo_elevation"       "topo_diversity"       "topo_slope"          
#> [4] "swi_mean"             "soil_temperature_max"
#> attr(,"validated")
#> [1] TRUE


#custom preference order
x <- vif_select(
  df = df,
  predictors = predictors_numeric,
  preference_order = c(
    "swi_mean",
    "soil_temperature_mean",
    "topo_elevation"
  ),
  max_vif = 2.5
)
#> 
#> collinear::vif_select(): selected predictors: 
#>  - swi_mean
#>  - soil_temperature_mean
#>  - topo_elevation
#>  - topo_diversity
#>  - topo_slope

x
#> [1] "swi_mean"              "soil_temperature_mean" "topo_elevation"       
#> [4] "topo_diversity"        "topo_slope"           
#> attr(,"validated")
#> [1] TRUE

#using automated preference order
df_preference <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors_numeric
)
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#> 
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.

x <- vif_select(
  df = df,
  predictors = predictors_numeric,
  preference_order = df_preference,
  max_vif = 2.5
)
#> 
#> collinear::vif_select(): selected predictors: 
#>  - swi_mean
#>  - soil_temperature_max
#>  - topo_diversity
#>  - topo_slope
#>  - topo_elevation

x
#> [1] "swi_mean"             "soil_temperature_max" "topo_diversity"      
#> [4] "topo_slope"           "topo_elevation"      
#> attr(,"validated")
#> [1] TRUE


#categorical predictors are ignored
#the function returns an empty character vector
x <- vif_select(
  df = df,
  predictors = vi_predictors_categorical
)
#> 
#> collinear::vif_select(): no numeric predictors available, skipping VIF-based filtering.

x
#> character(0)


#if predictors has length 1
#selection is skipped
#and the predictor name is returned
x <- vif_select(
  df = df,
  predictors = predictors_numeric[1]
)
#> 
#> collinear::vif_select(): only one predictor available, skipping VIF-based filtering.

x
#> [1] "topo_slope"