Automated Multicollinearity Filtering with Variance Inflation Factors
Source: R/vif_select.R
This function automates multicollinearity filtering in data frames with numeric predictors by combining two methods:
- Preference Order: method to rank and preserve relevant variables during multicollinearity filtering. See argument preference_order and function preference_order().
- VIF-based filtering: recursive algorithm to identify and remove predictors with a VIF above a given threshold.
When the argument preference_order is not provided, the predictors are ranked from lower to higher VIF. The predictor selection resulting from this option, albeit diverse and uncorrelated, might not be the one with the highest overall predictive power when used in a model.
Please check the sections Preference Order, Variance Inflation Factors, and VIF-based Filtering at the end of this help file for further details.
Usage
vif_select(
df = NULL,
predictors = NULL,
preference_order = NULL,
max_vif = 5,
quiet = FALSE
)
Arguments
- df
(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.
- predictors
(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. If argument response is not provided, non-numeric variables are ignored. Default: NULL.
- preference_order
(optional; string, character vector, or output of preference_order()) Defines a priority order, from first to last, to preserve predictors during the selection process. Accepted inputs are:
  - "auto": if response is not NULL, calls preference_order() for internal computation.
  - character vector: predictor names in a custom preference order.
  - data frame: output of preference_order() from response of length one.
  - named list: output of preference_order() from response of length two or more.
  - NULL: disabled.
Default: NULL.
- max_vif
(optional; numeric) Maximum Variance Inflation Factor allowed during variable selection. Recommended values are between 2.5 and 10. Higher values return a larger number of predictors with higher multicollinearity. If NULL, the variance inflation analysis is disabled. Default: 5.
- quiet
(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE.
Value
- character vector if response is NULL or is a string.
- named list if response is a character vector.
Preference Order
This feature is designed to help protect important predictors during the multicollinearity filtering. It involves the arguments preference_order and f.
The argument preference_order accepts:
- A character vector of predictor names in a custom order of preference, from first to last. This vector does not need to contain all predictor names, but only the ones relevant to the user.
- A data frame returned by preference_order(), which ranks predictors based on their association with a response variable.
- If NULL, and response is provided, then preference_order() is used internally to rank the predictors using the function f. If f is NULL as well, then f_auto() selects a proper function based on the data properties.
Variance Inflation Factors
The Variance Inflation Factor for a given variable \(a\) is computed as \(1/(1-R^2)\), where \(R^2\) is the multiple R-squared of a multiple regression model fitted using \(a\) as response and all other predictors in the input data frame as predictors, as in \(a = b + c + ...\).
The square root of the VIF of \(a\) is the factor by which the confidence interval of the estimate for \(a\) in the linear model \(y = a + b + c + ...\) is widened by multicollinearity in the model predictors.
VIF values start at 1 (when \(R^2 = 0\), that is, no multicollinearity) and grow without bound as \(R^2\) approaches 1. The recommended thresholds for maximum VIF vary depending on the source consulted; the most common values are 2.5, 5, and 10.
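The definition above can be checked by hand with base R. The sketch below is only an illustration and is not the package's internal code: it assumes the built-in mtcars dataset and an arbitrary choice of four columns, fits one linear model per predictor, and compares the result with the diagonal of the inverse correlation matrix, which is a known equivalent shortcut.

```r
# Illustrative sketch, not the implementation used by vif_df().
# mtcars ships with R; the column choice is arbitrary.
df <- mtcars[, c("disp", "hp", "wt", "drat")]

vif_manual <- sapply(
  X = colnames(df),
  FUN = function(a) {
    # regress predictor `a` against all other columns
    m <- stats::lm(
      formula = stats::reformulate(
        termlabels = setdiff(colnames(df), a),
        response = a
      ),
      data = df
    )
    # VIF = 1 / (1 - R^2)
    1 / (1 - summary(m)$r.squared)
  }
)

# equivalent shortcut: diagonal of the inverse correlation matrix
vif_matrix <- diag(solve(stats::cor(df)))

round(vif_manual, 4)
round(vif_matrix, 4)

# factor by which each confidence interval is widened
round(sqrt(vif_manual), 4)
```

Both vectors should agree up to numerical precision. The matrix shortcut fails when two columns are perfectly collinear, because the correlation matrix cannot be inverted.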
VIF-based Filtering
The function vif_select() computes Variance Inflation Factors and removes variables iteratively until all variables in the resulting selection have a VIF below max_vif.
If the argument preference_order is not provided, all variables are ranked from lower to higher VIF, as returned by vif_df(), and the variable with the highest VIF above max_vif is removed on each iteration.
If preference_order is defined, whenever two or more variables are above max_vif, the one higher in preference_order is preserved, and the next one with a higher VIF is removed. For example, for the predictors \(a\) and \(b\), given in that preference order, if either of their VIFs is higher than max_vif, then \(b\) is removed regardless of whether its VIF is lower or higher than that of \(a\). If both VIF scores are lower than max_vif, then both predictors are preserved.
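The iterative logic described in this section can be sketched in base R as follows. This is one simplified reading of the description, not the code of vif_select(): the helper name vif_filter_sketch() is hypothetical, VIFs are taken from the inverse correlation matrix (which assumes no perfectly collinear columns), and on each iteration the sketch drops the lowest-ranked predictor among those exceeding max_vif.

```r
# Hypothetical helper for illustration; not part of the collinear package.
vif_filter_sketch <- function(df, preference_order = NULL, max_vif = 5) {

  selected <- colnames(df)

  repeat {

    # a single predictor cannot be collinear with anything
    if (length(selected) < 2) break

    # VIFs as the diagonal of the inverse correlation matrix
    v <- diag(solve(stats::cor(df[, selected, drop = FALSE])))
    names(v) <- selected

    # predictors exceeding the threshold
    offenders <- names(v)[v > max_vif]
    if (length(offenders) == 0) break

    if (is.null(preference_order)) {
      # no preference: drop the predictor with the highest VIF
      to_remove <- names(which.max(v))
    } else {
      # preferred predictors first, remaining ones by increasing VIF
      ranking <- c(
        intersect(preference_order, selected),
        setdiff(names(sort(v)), preference_order)
      )
      # drop the offender ranked last
      to_remove <- ranking[max(match(offenders, ranking))]
    }

    selected <- setdiff(selected, to_remove)
  }

  selected
}

# usage with a built-in dataset (arbitrary column choice)
df_sketch <- mtcars[, c("disp", "hp", "wt", "drat", "qsec")]
vif_filter_sketch(df = df_sketch, max_vif = 2.5)
vif_filter_sketch(
  df = df_sketch,
  preference_order = c("wt", "hp"),
  max_vif = 2.5
)
```

Recomputing the VIFs after every removal matters because dropping one predictor changes the VIF of all the others. The actual function may order its comparisons differently, so its selections can differ from those of this sketch.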
References
Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons. DOI: 10.1002/0471725153.
See also
Other vif:
vif_df()
Examples
#subset to limit example run time
df <- vi[1:1000, ]
predictors <- vi_predictors[1:10]
predictors_numeric <- vi_predictors_numeric[1:10]
#predictors has mixed types
sapply(
X = df[, predictors, drop = FALSE],
FUN = class
)
#> koppen_zone koppen_group koppen_description soil_type
#> "character" "character" "character" "factor"
#> topo_slope topo_diversity topo_elevation swi_mean
#> "integer" "integer" "integer" "numeric"
#> swi_max swi_min
#> "numeric" "numeric"
#categorical predictors are ignored
x <- vif_select(
df = df,
predictors = predictors,
max_vif = 2.5
)
#>
#> collinear::vif_select(): these predictors are not numeric and will be ignored:
#> - koppen_zone
#> - koppen_group
#> - koppen_description
#> - soil_type.
#>
#> collinear::vif_select(): argument 'preference_order' is NULL, ranking predictors from lower to higher multicollinearity.
#>
#> collinear::vif_select(): selected predictors:
#> - topo_elevation
#> - topo_diversity
#> - topo_slope
#> - swi_min
#> - swi_max
x
#> [1] "topo_elevation" "topo_diversity" "topo_slope" "swi_min"
#> [5] "swi_max"
#> attr(,"validated")
#> [1] TRUE
#all these have a VIF lower than max_vif (2.5)
vif_df(
df = df,
predictors = x
)
#> predictor vif
#> 4 swi_min 1.7846
#> 5 swi_max 1.7242
#> 3 topo_slope 1.6045
#> 2 topo_diversity 1.4499
#> 1 topo_elevation 1.1994
#higher max_vif results in larger selection
x <- vif_select(
df = df,
predictors = predictors_numeric,
max_vif = 10
)
#>
#> collinear::vif_select(): argument 'preference_order' is NULL, ranking predictors from lower to higher multicollinearity.
#>
#> collinear::vif_select(): selected predictors:
#> - topo_elevation
#> - topo_diversity
#> - topo_slope
#> - swi_mean
#> - soil_temperature_max
#> - soil_temperature_min
#> - swi_min
x
#> [1] "topo_elevation" "topo_diversity" "topo_slope"
#> [4] "swi_mean" "soil_temperature_max" "soil_temperature_min"
#> [7] "swi_min"
#> attr(,"validated")
#> [1] TRUE
#smaller max_vif results in smaller selection
x <- vif_select(
df = df,
predictors = predictors_numeric,
max_vif = 2.5
)
#>
#> collinear::vif_select(): argument 'preference_order' is NULL, ranking predictors from lower to higher multicollinearity.
#>
#> collinear::vif_select(): selected predictors:
#> - topo_elevation
#> - topo_diversity
#> - topo_slope
#> - swi_mean
#> - soil_temperature_max
x
#> [1] "topo_elevation" "topo_diversity" "topo_slope"
#> [4] "swi_mean" "soil_temperature_max"
#> attr(,"validated")
#> [1] TRUE
#custom preference order
x <- vif_select(
df = df,
predictors = predictors_numeric,
preference_order = c(
"swi_mean",
"soil_temperature_mean",
"topo_elevation"
),
max_vif = 2.5
)
#>
#> collinear::vif_select(): selected predictors:
#> - swi_mean
#> - soil_temperature_mean
#> - topo_elevation
#> - topo_diversity
#> - topo_slope
x
#> [1] "swi_mean" "soil_temperature_mean" "topo_elevation"
#> [4] "topo_diversity" "topo_slope"
#> attr(,"validated")
#> [1] TRUE
#using automated preference order
df_preference <- preference_order(
df = df,
response = "vi_numeric",
predictors = predictors_numeric
)
#>
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#>
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.
x <- vif_select(
df = df,
predictors = predictors_numeric,
preference_order = df_preference,
max_vif = 2.5
)
#>
#> collinear::vif_select(): selected predictors:
#> - swi_mean
#> - soil_temperature_max
#> - topo_diversity
#> - topo_slope
#> - topo_elevation
x
#> [1] "swi_mean" "soil_temperature_max" "topo_diversity"
#> [4] "topo_slope" "topo_elevation"
#> attr(,"validated")
#> [1] TRUE
#categorical predictors are ignored
#the function returns an empty character vector
x <- vif_select(
df = df,
predictors = vi_predictors_categorical
)
#>
#> collinear::vif_select(): no numeric predictors available, skipping VIF-based filtering.
x
#> character(0)
#if predictors has length 1
#selection is skipped
#and the name of the predictor is returned
x <- vif_select(
df = df,
predictors = predictors_numeric[1]
)
#>
#> collinear::vif_select(): only one predictor available, skipping VIF-based filtering.
x
#> [1] "topo_slope"