Computes the Variance Inflation Factor of all variables in a training data frame.

Warning: predictors with perfect correlation might cause errors. Please use cor_select() to remove perfect correlations first.

The Variance Inflation Factor for a given variable y is computed as 1/(1-R2), where R2 is the multiple R-squared of a multiple regression model fitted using y as response and all the remaining variables of the input data set as predictors. The equation can be interpreted as "the ratio of a perfect model's R-squared (1) to the unexplained variance of this model (1-R2)".
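The definition above can be reproduced by hand with a single regression. The sketch below is only an illustration of the formula, not the code used by vif_df(); the built-in mtcars data set and the choice of columns are assumptions made for the example.

```r
# Illustrative input (assumption): four numeric columns of the built-in mtcars data
x <- mtcars[, c("disp", "hp", "wt", "drat")]

# Fit a multiple regression with "wt" as response
# and all remaining columns as predictors
model <- stats::lm(wt ~ ., data = x)

# Multiple R-squared of that model
r2 <- summary(model)$r.squared

# Variance Inflation Factor of "wt": 1 / (1 - R2)
vif_wt <- 1 / (1 - r2)
vif_wt
```

Repeating this for every column gives one VIF per variable, at the cost of one model fit per variable.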

The possible range of VIF values is [1, Inf). A VIF higher than 10 suggests that removing y from the data set would reduce overall multicollinearity.
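To relate VIF values to the underlying R-squared, the formula 1/(1-R2) can be evaluated for a few values. The R-squared values below are hypothetical and chosen only for illustration.

```r
# Hypothetical R-squared values of a predictor regressed on all the others
r2 <- c(0, 0.5, 0.8, 0.9, 0.99)

# Corresponding Variance Inflation Factors
vif <- 1 / (1 - r2)

data.frame(r2 = r2, vif = round(vif, 3))
```

Under this formula, a VIF of 10 corresponds to a predictor whose variance is 90% explained by the remaining predictors.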

This function computes the Variance Inflation Factor (VIF) in two steps:

  • Applies solve() to obtain the precision matrix, which is the inverse of the covariance matrix (for standardized variables, the correlation matrix).

  • Uses diag() to extract the diagonal of the precision matrix, which contains the variance inflation of each predictor given all other predictors.
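The two steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the internal code of vif_df(): it assumes the matrix being inverted is the Pearson correlation matrix of numeric predictors, and it uses the built-in mtcars data set as hypothetical input.

```r
# Illustrative input (assumption): four numeric columns of the built-in mtcars data
x <- mtcars[, c("disp", "hp", "wt", "drat")]

# Correlation matrix of the predictors
cor_matrix <- stats::cor(x)

# Step 1: invert it to obtain the precision matrix
# solve() fails if the matrix is singular (perfectly correlated predictors)
precision_matrix <- solve(cor_matrix)

# Step 2: the diagonal holds one VIF per predictor
vif <- diag(precision_matrix)

# Arrange as a data frame sorted by increasing VIF
vif_table <- data.frame(
  variable = names(vif),
  vif = round(vif, 3),
  row.names = NULL
)
vif_table <- vif_table[order(vif_table$vif), ]
vif_table

# Cross-check for one variable against the regression-based definition 1 / (1 - R2)
r2_wt <- summary(stats::lm(wt ~ ., data = x))$r.squared
all.equal(vif[["wt"]], 1 / (1 - r2_wt))
```

The matrix approach yields all VIF values from a single inversion instead of one regression per variable, which is why perfectly correlated predictors need to be removed beforehand.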

Usage

vif_df(df = NULL, response = NULL, predictors = NULL, encoding_method = "mean")

Arguments

df

(required; data frame) A data frame with numeric and/or character predictors, and optionally, a response variable. Default: NULL.

response

(recommended; character string) Name of a numeric response variable. Character response variables are ignored. Please see 'Details' to better understand how providing this argument or not leads to different results when there are character variables in 'predictors'. Default: NULL.

predictors

(optional; character vector) Character vector with predictor names in 'df'. If omitted, all columns of 'df' are used as predictors. Default: NULL.

encoding_method

(optional; character string) Name of the target encoding method used to convert character and factor predictors to numeric. One of "mean", "rank", "loo", "rnorm" (see target_encoding_lab() for further details). Default: "mean".

Value

Data frame with predictor names (column "variable") and VIF values (column "vif").

Author

Blas M. Benito

References

  • Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons. doi:10.1002/0471725153.

Examples


data(
  vi,
  vi_predictors
)

#subset to limit example run time
vi <- vi[1:1000, ]

#reduce correlation in predictors with cor_select()
vi_predictors <- cor_select(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  max_cor = 0.75
)

#without response
#only numeric predictors are returned
df <- vif_df(
  df = vi,
  predictors = vi_predictors
)

df
#>                      variable    vif
#> 1                   soil_clay  1.392
#> 2              topo_diversity  1.582
#> 3                 country_gdp  1.833
#> 4                  topo_slope  1.841
#> 5                   soil_silt  1.968
#> 6          country_population  1.999
#> 7                    soil_soc  2.615
#> 8                rainfall_min  2.789
#> 9              rainfall_range  3.034
#> 10             humidity_range  3.242
#> 11                  swi_range  3.287
#> 12             topo_elevation  3.570
#> 13                    swi_min  3.890
#> 14              soil_nitrogen  4.149
#> 15 growing_season_temperature  4.594
#> 16          cloud_cover_range  4.773
#> 17    temperature_seasonality  6.025
#> 18              solar_rad_max  6.622
#> 19            temperature_max  8.678
#> 20            cloud_cover_min  8.871
#> 21               humidity_max  9.117
#> 22             solar_rad_mean 10.467

#with response
#categorical and numeric predictors are returned
df <- vif_df(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors
)

df
#>                      variable    vif
#> 1              country_income  1.331
#> 2                   soil_clay  1.441
#> 3              topo_diversity  1.616
#> 4                  topo_slope  1.884
#> 5                 country_gdp  1.932
#> 6          country_population  2.177
#> 7                   soil_silt  2.199
#> 8                   continent  2.327
#> 9                    soil_soc  2.689
#> 10               rainfall_min  2.806
#> 11                  soil_type  3.174
#> 12             rainfall_range  3.242
#> 13                  subregion  3.361
#> 14             humidity_range  3.471
#> 15               biogeo_realm  3.611
#> 16             topo_elevation  3.675
#> 17              soil_nitrogen  4.286
#> 18                  swi_range  4.647
#> 19                    swi_min  4.769
#> 20               koppen_group  4.923
#> 21          cloud_cover_range  4.967
#> 22 growing_season_temperature  5.220
#> 23    temperature_seasonality  6.711
#> 24              solar_rad_max  6.981
#> 25            cloud_cover_min  9.341
#> 26            temperature_max  9.371
#> 27               humidity_max  9.400
#> 28             solar_rad_mean 10.671