
Automates multicollinearity management by selecting variables based on their Variance Inflation Factor (VIF).

Warning: predictors with perfect correlation might cause errors, please use cor_select() to remove perfect correlations first.

The vif_select() function is designed to automate the reduction of multicollinearity in a set of predictors by using Variance Inflation Factors.

If the 'response' argument is provided, categorical predictors are converted to numeric via target encoding (see target_encoding_lab()). If the 'response' argument is not provided, categorical variables are ignored.
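As an illustration of what mean-based target encoding does, here is a minimal base R sketch. The toy data frame and the column names y, soil, and soil_encoded are made up for this example, and target_encoding_lab() may differ in its details.

```r
# Toy data: numeric response "y" and a categorical predictor "soil".
df <- data.frame(
  y = c(10, 12, 20, 22, 30),
  soil = c("clay", "clay", "sand", "sand", "loam")
)

# Replace each category with the mean of the response within that category.
df$soil_encoded <- ave(df$y, df$soil, FUN = mean)

df$soil_encoded
#> [1] 11 11 21 21 30
```

The encoded column is numeric, so it can enter a VIF computation alongside the other numeric predictors.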

The Variance Inflation Factor for a given variable y is computed as 1/(1-R2), where R2 is the multiple R-squared of a multiple regression model fitted using y as response and all other predictors in the input data frame as predictors. The VIF equation can be interpreted as the "ratio of a perfect model's R-squared (1) to the unexplained variance of this model (1-R2)".
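The definition above can be checked by hand with base R. The sketch below uses simulated data and a hypothetical helper named vif_one(); it is not the package's vif_df() implementation. It also compares the result against the diagonal of the inverse correlation matrix, which gives the same values.

```r
# Simulated predictors; "c" is built to be collinear with "a" and "b".
set.seed(1)
n <- 200
x <- data.frame(a = rnorm(n), b = rnorm(n))
x$c <- x$a + x$b + rnorm(n, sd = 0.5)

# Hypothetical helper: regress one column on all the others
# and apply 1 / (1 - R2).
vif_one <- function(df, y) {
  others <- setdiff(colnames(df), y)
  model <- lm(reformulate(others, response = y), data = df)
  1 / (1 - summary(model)$r.squared)
}

vif_manual <- vapply(colnames(x), vif_one, numeric(1), df = x)

# Equivalent shortcut: diagonal of the inverse correlation matrix.
vif_matrix <- setNames(diag(solve(cor(x))), colnames(x))

round(vif_manual, 2)
all.equal(vif_manual, vif_matrix)
```

Note that solve() fails on a singular correlation matrix, which is what happens when two predictors are perfectly correlated.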

The possible range of VIF values is [1, Inf]. A VIF higher than 10 suggests that removing y from the data set would reduce overall multicollinearity. Recommended thresholds for the maximum VIF vary depending on the source consulted; the most common values are 2.5, 5, and 10.

The function vif_select() applies a recursive algorithm to remove variables with a VIF higher than a given threshold (defined by the argument max_vif).
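The general idea of recursive VIF-based removal can be sketched in base R as follows. The function drop_by_vif() is a hypothetical name used only for this illustration, and the actual vif_select() code may proceed differently (for example, in how it orders candidates for removal).

```r
# Simulated predictors; "c" is almost a linear combination of "a" and "b".
set.seed(1)
n <- 200
x <- data.frame(a = rnorm(n), b = rnorm(n))
x$c <- x$a + x$b + rnorm(n, sd = 0.1)

# Hypothetical helper: drop the predictor with the highest VIF,
# then recurse until every remaining VIF is at or below max_vif.
drop_by_vif <- function(df, max_vif = 5) {
  if (ncol(df) < 2) return(colnames(df))
  v <- setNames(diag(solve(cor(df))), colnames(df))
  if (max(v) <= max_vif) return(colnames(df))
  worst <- names(which.max(v))
  drop_by_vif(df[, setdiff(colnames(df), worst), drop = FALSE], max_vif)
}

drop_by_vif(x, max_vif = 5)
```

In this simulated case "c" has the highest VIF, so it is removed first, and the remaining pair is nearly uncorrelated.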

If the argument response is provided, all non-numeric variables in predictors are transformed into numeric using target encoding (see target_encoding_lab()). Otherwise, non-numeric variables are ignored.

The argument preference_order allows defining a preference selection order to preserve (when possible) variables that might be interesting or even required for a given analysis.

For example, if predictors is c("a", "b", "c") and preference_order is c("a", "b"), there are two possibilities:

  • If the VIF of "a" is higher than the VIF of "b", and both VIF values are above max_vif, then "a" is selected and "b" is removed.

  • If their correlation is equal to or above max_cor, then "a" is selected, regardless of its correlation with "c".

If preference_order is not provided, then the predictors are ranked by their variance inflation factor as computed by vif_df().
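One possible way to honor a preference order is a forward pass: take the predictors in order of preference and keep each candidate only if the VIFs of the already selected set plus the candidate stay at or below the threshold. The sketch below is an assumption made for illustration, with a hypothetical function select_with_preference(); it is not necessarily how vif_select() works internally.

```r
# Simulated predictors; "c" is almost a linear combination of "a" and "b".
set.seed(1)
n <- 200
x <- data.frame(a = rnorm(n), b = rnorm(n))
x$c <- x$a + x$b + rnorm(n, sd = 0.1)

# Hypothetical helper: forward selection following preference_order.
select_with_preference <- function(df, preference_order, max_vif = 5) {
  selected <- preference_order[1]
  for (candidate in preference_order[-1]) {
    trial <- c(selected, candidate)
    v <- diag(solve(cor(df[, trial, drop = FALSE])))
    # Keep the candidate only if no VIF in the trial set exceeds max_vif.
    if (max(v) <= max_vif) selected <- trial
  }
  selected
}

# "c" comes first, so it is kept and a lower-ranked predictor is dropped.
select_with_preference(x, preference_order = c("c", "a", "b"))

# With "c" last, it is the one that gets dropped.
select_with_preference(x, preference_order = c("a", "b", "c"))
```

The two calls show how the same data can yield different selections depending only on the preference order.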

Usage

vif_select(
  df = NULL,
  response = NULL,
  predictors = NULL,
  preference_order = NULL,
  max_vif = 5,
  encoding_method = "mean"
)

Arguments

df

(required; data frame) A data frame with numeric and/or character predictors, and optionally, a response variable. Default: NULL.

response

(recommended; character string) Name of a numeric response variable. Character response variables are ignored. See 'Details' to understand how providing or omitting this argument leads to different results when 'predictors' contains character variables. Default: NULL.

predictors

(optional; character vector) Names of the predictors in 'df'. If omitted, all columns of 'df' are used as predictors. Default: NULL.

preference_order

(optional; character vector) vector with column names in 'predictors' in the desired preference order, or result of the function preference_order(). Allows defining a priority order for selecting predictors, which can be particularly useful when some predictors are more critical for the analysis than others. Predictors not included in this argument are ranked by their Variance Inflation Factor. Default: NULL.

max_vif

(optional; numeric) Maximum VIF allowed for any given predictor in the output dataset. Recommended values are between 2.5 and 10. Higher VIF thresholds should result in a higher number of selected variables. Default: 5.

encoding_method

(optional; character string) Name of the target encoding method used to convert character and factor predictors to numeric. One of "mean", "rank", "loo", "rnorm" (see target_encoding_lab() for further details). Default: "mean".

Value

Character vector with the names of the selected predictors.
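Because the return value is a plain character vector, it can be used directly to subset the data frame or to build a model formula. The sketch below uses simulated data and a hand-written vector named selected as a stand-in for the output of vif_select().

```r
# Simulated data with a response "y" and three candidate predictors.
set.seed(1)
df <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
df$y <- df$a - df$b + rnorm(50, sd = 0.1)

# Stand-in for the character vector returned by vif_select().
selected <- c("a", "b")

# Keep the response plus the selected predictors.
model_df <- df[, c("y", selected), drop = FALSE]

# Build the model formula from the selected names.
model <- lm(reformulate(selected, response = "y"), data = model_df)
coef(model)
```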

Author

Blas M. Benito

References

  • Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons. doi:10.1002/0471725153.

Examples


data(
  vi,
  vi_predictors
)

#subset to limit example run time
vi <- vi[1:1000, ]
vi_predictors <- vi_predictors[1:10]

#reduce correlation in predictors with cor_select()
vi_predictors <- cor_select(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  max_cor = 0.75
)

#without response
#without preference_order
#permissive max_vif
#only numeric predictors are processed
selected.predictors <- vif_select(
  df = vi,
  predictors = vi_predictors,
  max_vif = 10
)

selected.predictors
#> [1] "topo_elevation" "topo_diversity" "topo_slope"     "swi_max"       
#> [5] "swi_min"       

#without response
#without preference_order
#restrictive max_vif
#only numeric predictors are processed
selected.predictors <- vif_select(
  df = vi,
  predictors = vi_predictors,
  max_vif = 2.5
)

selected.predictors
#> [1] "topo_elevation" "topo_diversity" "topo_slope"     "swi_max"       
#> [5] "swi_min"       

#with response
#without preference_order
#restrictive max_vif
#slightly different solution than previous one
#because categorical variables are target-encoded
selected.predictors <- vif_select(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  max_vif = 2.5
)

selected.predictors
#> [1] "topo_elevation" "topo_diversity" "topo_slope"     "swi_min"       
#> [5] "soil_type"     

#with response
#with user-defined preference_order
#restrictive max_vif
#numerics and categorical variables in output
selected.predictors <- vif_select(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  preference_order = c(
    "soil_type", #categorical variable
    "soil_temperature_mean",
    "swi_mean",
    "rainfall_mean",
    "evapotranspiration_mean"
  ),
  max_vif = 2.5
)

selected.predictors
#> [1] "soil_type"      "topo_elevation" "topo_diversity" "topo_slope"    
#> [5] "swi_min"       


#with response
#with automated preference_order
#restrictive max_cor and max_vif
#numerics and categorical variables in output
preference.order <- preference_order(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  f = f_rsquared #R-squared between response and each predictor
)

head(preference.order)
#>        predictor preference
#> 1      soil_type 0.64655761
#> 2        swi_max 0.59335158
#> 3        swi_min 0.26066257
#> 4 topo_diversity 0.11688815
#> 5     topo_slope 0.04638905
#> 6 topo_elevation 0.02936205

selected.predictors <- vif_select(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  preference_order = preference.order,
  max_vif = 2.5
)

selected.predictors
#> [1] "soil_type"      "swi_min"        "topo_diversity" "topo_slope"    
#> [5] "topo_elevation"