Summary
This article explains how the package collinear handles responses and predictors of different types to facilitate multicollinearity filtering. The explanation is centered on the inputs, logic, and output of the function collinear(), and the main functions it calls: target_encoding_lab(), preference_order(), cor_select(), and vif_select().
Parallelization Setup and Progress Bars
Most functions in the package now accept a parallelization setup via future::plan() and progress bars via progressr::handlers(). However, progress bars are disabled in this tutorial because they don’t work in Rmarkdown.
future::plan(
  future::multisession,
  #leave one core free, but never request fewer than one worker
  workers = max(1, parallelly::availableCores() - 1)
)
#progress bar (does not work in Rmarkdown)
#progressr::handlers(global = TRUE)
Example Data
The package collinear includes the example data frame collinear::vi, with 30000 rows, 68 columns, and 20108 NA values. It contains several numeric and categorical responses and predictors.
The response columns, all derived from the same data, have descriptive names: vi_numeric, vi_counts (integers), vi_binomial (1s and 0s), vi_categorical (five categories), and vi_factor (factor version of the previous one).
Predictor names are grouped in character vectors: collinear::vi_predictors_numeric (49 numeric and integer predictors), collinear::vi_predictors_categorical (12 character and factor predictors), and collinear::vi_predictors containing them all.
The code below keeps only the first 5000 rows of collinear::vi to accelerate the examples that follow.
df <- collinear::vi[1:5000, ]
The Function collinear()
This function serves as a single entry point to the full functionality of the package. It aims to facilitate multicollinearity filtering for any combination of categorical and/or numeric responses and predictors.
The code below runs a full multicollinearity filtering for a numeric and a categorical response, and a set of predictors with mixed types (numeric, integer, character and factor).
The meaning of the function arguments is explained in the next sections.
selection <- collinear::collinear(
df = df,
response = c(
"vi_numeric", #numeric response
"vi_categorical" #categorical response
),
predictors = vi_predictors, #numeric and categorical predictors
encoding_method = "loo", #leave-one-out target encoding
preference_order = "auto", #automatic ranking of predictors
f = "auto", #automatic selection of ranking function
quiet = FALSE, #enable messages
max_cor = 0.75, #maximum correlation threshold
max_vif = 5 #maximum VIF threshold
)
#>
#> collinear::collinear(): processing response 'vi_numeric'.
#> ---------------------------------------------------------------
#>
#> collinear::target_encoding_lab(): using response 'vi_numeric' to encode categorical predictors:
#> - koppen_zone
#> - koppen_group
#> - koppen_description
#> - soil_type
#> - biogeo_ecoregion
#> - biogeo_biome
#> - biogeo_realm
#> - country_name
#> - country_income
#> - continent
#> - region
#> - subregion
#>
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#>
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): selected predictors:
#> - growing_season_length
#> - soil_temperature_max
#> - soil_temperature_range
#> - solar_rad_max
#> - rainfall_max
#> - subregion
#> - biogeo_realm
#> - swi_range
#> - rainfall_min
#> - solar_rad_mean
#> - soil_nitrogen
#> - continent
#> - soil_soc
#> - solar_rad_range
#> - cloud_cover_range
#> - topo_diversity
#> - soil_clay
#> - humidity_range
#> - country_income
#> - topo_elevation
#> - soil_sand
#> - topo_slope
#> - temperature_mean
#> - country_population
#> - country_gdp
#>
#> collinear::vif_select(): selected predictors:
#> - growing_season_length
#> - soil_temperature_max
#> - soil_temperature_range
#> - solar_rad_max
#> - rainfall_max
#> - subregion
#> - biogeo_realm
#> - swi_range
#> - rainfall_min
#> - soil_nitrogen
#> - continent
#> - cloud_cover_range
#> - topo_diversity
#>
#> collinear::collinear(): processing response 'vi_categorical'.
#> ---------------------------------------------------------------
#>
#> collinear::target_encoding_lab(): argument 'response' is not numeric, skipping target-encoding.
#>
#> collinear::preference_order(): ranking predictors for response 'vi_categorical'.
#>
#> collinear::f_auto(): selected function: 'f_v_rf_categorical()'.
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): selected predictors:
#> - rainfall_mean
#> - swi_mean
#> - soil_temperature_max
#> - soil_type
#> - humidity_max
#> - solar_rad_max
#> - country_gdp
#> - swi_range
#> - rainfall_range
#> - country_population
#> - soil_soc
#> - rainfall_min
#> - temperature_range
#> - evapotranspiration_mean
#> - soil_nitrogen
#> - region
#> - growing_season_temperature
#> - country_income
#> - cloud_cover_range
#> - humidity_range
#> - soil_sand
#> - soil_clay
#> - topo_elevation
#> - topo_diversity
#> - topo_slope
#>
#> collinear::vif_select(): selected predictors:
#> - rainfall_mean
#> - swi_mean
#> - soil_temperature_max
#> - humidity_max
#> - solar_rad_max
#> - country_gdp
#> - swi_range
#> - rainfall_range
#> - country_population
#> - soil_soc
#> - topo_diversity
#> - topo_slope
#>
#> collinear::collinear(): selected predictors:
#> - rainfall_mean
#> - swi_mean
#> - soil_temperature_max
#> - soil_type
#> - humidity_max
#> - solar_rad_max
#> - country_gdp
#> - swi_range
#> - rainfall_range
#> - country_population
#> - soil_soc
#> - region
#> - country_income
#> - topo_diversity
#> - topo_slope
The output is a named list of vectors with predictor names when more than one response is provided, and a character vector otherwise.
selection
#> $vi_numeric
#> [1] "growing_season_length" "soil_temperature_max" "soil_temperature_range"
#> [4] "solar_rad_max" "rainfall_max" "subregion"
#> [7] "biogeo_realm" "swi_range" "rainfall_min"
#> [10] "soil_nitrogen" "continent" "cloud_cover_range"
#> [13] "topo_diversity"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"
#>
#> $vi_categorical
#> [1] "rainfall_mean" "swi_mean" "soil_temperature_max"
#> [4] "soil_type" "humidity_max" "solar_rad_max"
#> [7] "country_gdp" "swi_range" "rainfall_range"
#> [10] "country_population" "soil_soc" "region"
#> [13] "country_income" "topo_diversity" "topo_slope"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_categorical"
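As a hedged illustration of how this output might be used downstream, the sketch below builds a small mock list with the same structure as the printed result, then extracts one element, reads an attribute, and finds the predictors shared by both selections. The object selection_mock and its abbreviated contents are made up for the example; only base R is used.

```r
# Mock object mimicking the structure printed above.
# The predictor names are a shortened, illustrative subset.
selection_mock <- list(
  vi_numeric = structure(
    c("growing_season_length", "soil_temperature_max", "swi_range"),
    validated = TRUE,
    response = "vi_numeric"
  ),
  vi_categorical = structure(
    c("rainfall_mean", "soil_temperature_max", "swi_range"),
    validated = TRUE,
    response = "vi_categorical"
  )
)

# selection for one response
selection_mock$vi_numeric

# response name stored as an attribute
response_name <- attr(selection_mock$vi_categorical, "response")

# predictors selected for every response
common <- Reduce(intersect, selection_mock)
common
```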
The variable selections are different because collinear() follows different paths for numeric and categorical responses. The table below summarizes the differences between these alternate paths.
Functionality | numeric response | categorical response |
---|---|---|
Target-encoding | executed: 12 categorical predictors transformed to numeric | skipped: categorical predictors continue to next steps |
Preference order | f_r2_pearson(): R-squared between response and predictors | f_v_rf_categorical(): Cramer’s V of response against univariate random forest predictions |
Pairwise correlation filtering | numeric vs numeric: Pearson correlation | numeric vs numeric: Pearson correlation; numeric vs categorical: target-encoding + R-squared; categorical vs categorical: Cramer’s V |
VIF filtering | Applied to all remaining predictors | Applied to the remaining numeric predictors |
The following sections explain key differences between these paths, and provide hints on the logic behind target encoding, preference order, and multicollinearity filtering.
How It Works
The functionalities of collinear() (target encoding, preference order, pairwise correlation filtering, and VIF filtering) are provided by other functions that have specific data requirements. Additionally, aiming to cover most use cases, collinear() allows disabling each functionality separately. The table below summarizes these details.
Function | Functionality | Requirements | Disabled |
---|---|---|---|
target_encoding_lab() | transform categorical predictors to numeric | numeric response; categorical predictors | response = NULL; encoding_method = NULL |
preference_order() | rank and preserve important predictors | any response | response = NULL; preference_order = NULL |
cor_select() | reduce pairwise correlation | any predictors | max_cor = NULL |
vif_select() | reduce variance inflation | numeric predictors | max_vif = NULL |
The following sections focus on these functions and explain how their respective functionalities are implemented.
Target Encoding
Target-encoding transforms categorical predictors to numeric by using the values of a numeric response across groups as reference. This transformation enables the application of the same multicollinearity filtering (and modelling) methods to categorical and numeric predictors at once. This section explains the method in brief, but there is a lengthier article about target-encoding here.
In collinear(), this functionality is provided by the function target_encoding_lab(). The argument encoding_method of collinear() defines how categorical predictors are transformed to numeric, or disables the functionality entirely when NULL.
The example data frame below, used to explain how target encoding works, has two levels of the categorical predictor “koppen_zone” and the response “vi_numeric”.
When introducing this data frame into target_encoding_lab() with the method “loo” (from leave-one-out), it is first grouped by the levels of “koppen_zone”, and then each case is encoded as the average of the response across all other cases within the same level.
The result shows “koppen_zone” encoded as numeric.
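The leave-one-out logic described above can be sketched in base R. This is a minimal illustration on made-up values, not the code used by target_encoding_lab(); the data frame toy and the column koppen_zone_loo are hypothetical names introduced for the example.

```r
# Toy data: hypothetical values, not taken from collinear::vi
toy <- data.frame(
  koppen_zone = c("Af", "Af", "Af", "BSh", "BSh"),
  vi_numeric = c(0.8, 0.6, 0.7, 0.2, 0.4),
  stringsAsFactors = FALSE
)

# per-level sum and number of cases, repeated for every row
group_sum <- ave(toy$vi_numeric, toy$koppen_zone, FUN = sum)
group_n <- ave(toy$vi_numeric, toy$koppen_zone, FUN = length)

# leave-one-out mean: average of the response across
# all OTHER cases within the same level
toy$koppen_zone_loo <- (group_sum - toy$vi_numeric) / (group_n - 1)

toy
```

Levels with a single case have no other cases to average, so this naive version returns NaN for them; how the package handles that situation is not covered here.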
Due to the requirement for a numeric response, in the example call to collinear() target encoding is only applied for the response “vi_numeric” as follows:
df_vi_numeric <- collinear::target_encoding_lab(
df = df,
response = "vi_numeric",
predictors = vi_predictors,
method = "loo",
overwrite = TRUE,
quiet = TRUE
)
This operation results in zero categorical predictors in the data frame df_vi_numeric:
collinear::identify_predictors_categorical(
df = df_vi_numeric,
predictors = vi_predictors
)
#> character(0)
On the other hand, target encoding is skipped for the categorical response “vi_categorical”, resulting in 12 categorical predictors.
df_vi_categorical <- df
collinear::identify_predictors_categorical(
df = df_vi_categorical,
predictors = vi_predictors
)
#> [1] "koppen_zone" "koppen_group" "koppen_description"
#> [4] "soil_type" "biogeo_ecoregion" "biogeo_biome"
#> [7] "biogeo_realm" "country_name" "country_income"
#> [10] "continent" "region" "subregion"
If your data comprises numeric responses and a mixture of numeric and categorical predictors, it is preferable to target-encode your data frame before multicollinearity filtering with target_encoding_lab() or any other similar function, so the encoded predictors are also available for data exploration and modelling purposes.
Preference Order
The multicollinearity filtering method implemented in collinear() is devised to preserve as many relevant predictors as possible. This principle aims for a good balance between multicollinearity and predictive power in the resulting selection of predictors.
This functionality is implemented as follows.
The functions cor_select() and vif_select() have the argument preference_order (also in collinear()), which accepts a ranking of predictors. This ranking is then considered by the multicollinearity filtering methods implemented in these functions to preserve important predictors.
The argument preference_order accepts different inputs.
Custom Preference Vector
Valid input in collinear(), cor_select(), and vif_select().
A custom preference vector has predictor names ordered by the user’s criteria. This option allows targeting specific predictors for particular purposes. For example, the code below shows a hypothetical case focused on preserving soil temperature variables over all others.
selection_from_vector <- collinear::collinear(
df = df,
response = "vi_numeric",
predictors = vi_predictors_numeric,
preference_order = c(
"soil_temperature_mean",
"soil_temperature_range",
"soil_temperature_min",
"soil_temperature_max"
),
quiet = TRUE
)
selection_from_vector
#> [1] "soil_temperature_mean" "soil_temperature_range" "growing_season_length"
#> [4] "rainfall_max" "swi_range" "rainfall_min"
#> [7] "soil_nitrogen" "soil_soc" "cloud_cover_range"
#> [10] "topo_diversity" "soil_clay" "topo_elevation"
#> [13] "topo_slope"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"
Notice that the first predictor in preference_order should always appear first in the variable selection output. The appearance of all other targeted predictors depends on their correlation with the first one.
Preference Data Frame
Valid input in collinear(), cor_select(), and vif_select().
A preference data frame has a column named “predictor”, and it is arranged from higher to lower values of a quantitative criterion.
The function preference_order() generates this data frame by computing the association between each predictor and the response using a given f function. The names and features of these functions can be found in the data frame returned by f_functions().
collinear::f_functions()
These functions take a data frame named df with the columns “x” (predictor) and “y” (response) as input, so preparing a custom one for your own purposes is simple enough. But keep in mind that preference_order() arranges the resulting data frame from higher to lower preference values, so a custom function must return larger values for stronger associations.
#custom f function
f_lm <- function(df){
summary(lm(y ~ x, data = df))$r.squared
}
#using it in preference_order()
preference_df <- collinear::preference_order(
df = vi,
response = "vi_numeric",
predictors = vi_predictors_numeric,
f = f_lm,
quiet = TRUE
)
The output data frame contains the names of the response, the predictors, the f function, and the column “preference” with the output of the f function.
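To make the ranking idea concrete, here is a simplified base R sketch that applies an f function to each predictor and sorts the result from higher to lower preference. The helper rank_predictors() is hypothetical and only mimics the behaviour described in this section; it is not the implementation of preference_order(), and the built-in mtcars data stands in for a real dataset.

```r
# f function following the contract described above:
# input is a data frame with columns "x" (predictor) and "y" (response)
f_lm <- function(df){
  summary(lm(y ~ x, data = df))$r.squared
}

# Hypothetical helper, not part of collinear
rank_predictors <- function(df, response, predictors, f){
  # apply f to one predictor at a time
  preference <- vapply(
    X = predictors,
    FUN = function(p){
      f(data.frame(x = df[[p]], y = df[[response]]))
    },
    FUN.VALUE = numeric(1)
  )
  out <- data.frame(
    response = response,
    predictor = predictors,
    preference = unname(preference),
    stringsAsFactors = FALSE
  )
  # arrange from higher to lower preference
  out <- out[order(out$preference, decreasing = TRUE), ]
  rownames(out) <- NULL
  out
}

ranking <- rank_predictors(
  df = mtcars,
  response = "mpg",
  predictors = c("hp", "qsec", "wt"),
  f = f_lm
)
ranking
```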
The resulting data frame can be plugged into the preference_order argument of collinear() (and also cor_select() and vif_select()):
selection_from_df <- collinear(
df = vi,
response = "vi_numeric",
predictors = vi_predictors_numeric,
preference_order = preference_df,
quiet = TRUE
)
selection_from_df
#> [1] "growing_season_length" "soil_temperature_max" "soil_temperature_range"
#> [4] "solar_rad_max" "rainfall_max" "aridity_index"
#> [7] "swi_range" "soil_nitrogen" "topo_diversity"
#> [10] "soil_clay" "soil_sand" "topo_elevation"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"
But collinear() can also compute preference order by itself.
selection_auto <- collinear::collinear(
df = vi,
response = "vi_numeric",
predictors = vi_predictors_numeric,
preference_order = "auto",
f = "auto",
quiet = TRUE
)
selection_auto
#> [1] "growing_season_length" "soil_temperature_max" "soil_temperature_range"
#> [4] "solar_rad_max" "rainfall_max" "aridity_index"
#> [7] "swi_range" "soil_nitrogen" "topo_diversity"
#> [10] "soil_clay" "soil_sand" "topo_elevation"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"
When f = "auto", preference_order() calls f_auto() and f_auto_rules() to select a function appropriate for the data. The example below shows how the function choice changes when both the response and the predictors are categorical.
preference_auto <- collinear::preference_order(
df = vi,
response = "vi_categorical",
predictors = vi_predictors_categorical,
f = "auto",
quiet = TRUE
)
Here f_auto() selects f_v(), which computes Cramer’s V between categorical responses and predictors.
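For reference, Cramer’s V can be computed from a contingency table with the textbook formula sqrt(chi2 / (n * (min(rows, columns) - 1))). The base R sketch below applies it through a hypothetical helper named cramers_v(); it is not the implementation behind f_v(), which may differ in details such as bias correction.

```r
# Hypothetical helper, not part of collinear
cramers_v <- function(x, y){
  # contingency table of both categorical variables
  tab <- table(x, y)
  # chi-squared statistic without continuity correction;
  # warnings about small expected counts are silenced
  chi2 <- suppressWarnings(
    stats::chisq.test(tab, correct = FALSE)$statistic
  )
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  sqrt(as.numeric(chi2) / (n * k))
}

# perfectly associated variables
a <- c("x", "x", "y", "y", "z", "z")
v_perfect <- cramers_v(a, a)

# independent variables
v_independent <- cramers_v(
  c("a", "a", "b", "b"),
  c("c", "d", "c", "d")
)

# two categorical-like columns from the built-in mtcars data
v_mtcars <- cramers_v(mtcars$cyl, mtcars$gear)
```

The statistic ranges from 0 (no association) to 1 (perfect association), which is what allows it to be used as a preference score.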
Preference List
Valid input in collinear() only.
The function preference_order(), like collinear(), accepts more than one response.
preference_list <- collinear::preference_order(
df = vi,
response = c(
"vi_categorical",
"vi_numeric"
),
predictors = vi_predictors,
f = "auto",
quiet = TRUE
)
The output is a named list.
names(preference_list)
#> [1] "vi_numeric" "vi_categorical"
This list can be plugged into the preference_order argument of collinear(). If a response is not in the preference order list, then its preference order is computed automatically. This action is described in the function messages when quiet = FALSE.
selection_list <- collinear::collinear(
df = vi,
response = c(
"vi_categorical",
"vi_numeric",
"vi_binomial" #not in preference_list
),
predictors = vi_predictors,
preference_order = preference_list,
quiet = FALSE
)
#>
#> collinear::collinear(): processing response 'vi_categorical'.
#> ---------------------------------------------------------------
#>
#> collinear::target_encoding_lab(): argument 'response' is not numeric, skipping target-encoding.
#>
#> collinear::collinear(): selecting data frame 'vi_categorical' fron preference order list.
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): selected predictors:
#> - rainfall_mean
#> - koppen_group
#> - soil_type
#> - humidity_max
#> - humidity_min
#> - evapotranspiration_max
#> - solar_rad_max
#> - rainfall_range
#> - swi_range
#> - subregion
#> - rainfall_min
#> - soil_soc
#> - biogeo_biome
#> - soil_nitrogen
#> - cloud_cover_range
#> - humidity_range
#> - soil_sand
#> - topo_diversity
#> - soil_clay
#> - topo_slope
#> - topo_elevation
#>
#> collinear::vif_select(): selected predictors:
#> - rainfall_mean
#> - humidity_max
#> - humidity_min
#> - evapotranspiration_max
#> - solar_rad_max
#> - swi_range
#> - rainfall_min
#> - soil_soc
#> - soil_nitrogen
#> - soil_sand
#> - topo_diversity
#> - soil_clay
#> - topo_slope
#> - topo_elevation
#>
#> collinear::collinear(): selected predictors:
#> - rainfall_mean
#> - koppen_group
#> - soil_type
#> - humidity_max
#> - humidity_min
#> - evapotranspiration_max
#> - solar_rad_max
#> - swi_range
#> - subregion
#> - rainfall_min
#> - soil_soc
#> - biogeo_biome
#> - soil_nitrogen
#> - soil_sand
#> - topo_diversity
#> - soil_clay
#> - topo_slope
#> - topo_elevation
#>
#> collinear::collinear(): processing response 'vi_numeric'.
#> ---------------------------------------------------------------
#>
#> collinear::target_encoding_lab(): using response 'vi_numeric' to encode categorical predictors:
#> - koppen_zone
#> - koppen_group
#> - koppen_description
#> - soil_type
#> - biogeo_ecoregion
#> - biogeo_biome
#> - biogeo_realm
#> - country_name
#> - country_income
#> - continent
#> - region
#> - subregion
#>
#> collinear::collinear(): selecting data frame 'vi_numeric' fron preference order list.
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): selected predictors:
#> - rainfall_mean
#> - swi_mean
#> - soil_temperature_max
#> - humidity_max
#> - soil_type
#> - soil_temperature_range
#> - solar_rad_max
#> - country_name
#> - rainfall_range
#> - country_gdp
#> - swi_range
#> - country_population
#> - rainfall_min
#> - soil_soc
#> - biogeo_biome
#> - soil_nitrogen
#> - evapotranspiration_mean
#> - growing_season_temperature
#> - solar_rad_range
#> - continent
#> - cloud_cover_range
#> - humidity_range
#> - topo_diversity
#> - soil_clay
#> - topo_elevation
#> - soil_sand
#> - country_income
#> - topo_slope
#>
#> collinear::vif_select(): selected predictors:
#> - rainfall_mean
#> - swi_mean
#> - soil_temperature_max
#> - humidity_max
#> - soil_type
#> - soil_temperature_range
#> - solar_rad_max
#> - country_name
#> - rainfall_range
#> - country_gdp
#> - country_population
#> - biogeo_biome
#> - soil_nitrogen
#> - continent
#> - topo_diversity
#> - topo_elevation
#> - country_income
#> - topo_slope
#>
#> collinear::collinear(): processing response 'vi_binomial'.
#> ---------------------------------------------------------------
#>
#> collinear::target_encoding_lab(): using response 'vi_binomial' to encode categorical predictors:
#> - koppen_zone
#> - koppen_group
#> - koppen_description
#> - soil_type
#> - biogeo_ecoregion
#> - biogeo_biome
#> - biogeo_realm
#> - country_name
#> - country_income
#> - continent
#> - region
#> - subregion
#>
#> collinear::collinear(): input preference order list does not have data for the response 'vi_binomial'.
#>
#> collinear::preference_order(): ranking predictors for response 'vi_binomial'.
#>
#> collinear::f_auto(): selected function: 'f_auc_rf()'.
#>
#> collinear::cor_select(): computing pairwise correlation matrix.
#>
#> collinear::cor_select(): selected predictors:
#> - biogeo_realm
#> - region
#> - koppen_group
#> - country_income
#> - biogeo_biome
#> - subregion
#> - soil_type
#> - rainfall_mean
#> - soil_temperature_range
#> - soil_temperature_max
#> - solar_rad_max
#> - humidity_max
#> - country_gdp
#> - country_population
#> - rainfall_min
#> - rainfall_range
#> - soil_soc
#> - soil_nitrogen
#> - evapotranspiration_mean
#> - swi_range
#> - growing_season_temperature
#> - solar_rad_range
#> - humidity_range
#> - topo_diversity
#> - soil_clay
#> - cloud_cover_range
#> - topo_elevation
#> - soil_sand
#> - topo_slope
#>
#> collinear::vif_select(): selected predictors:
#> - biogeo_realm
#> - region
#> - koppen_group
#> - country_income
#> - biogeo_biome
#> - subregion
#> - soil_type
#> - rainfall_mean
#> - soil_temperature_range
#> - soil_temperature_max
#> - solar_rad_max
#> - humidity_max
#> - country_gdp
#> - country_population
#> - soil_soc
#> - soil_nitrogen
#> - swi_range
#> - humidity_range
#> - topo_diversity
#> - soil_clay
selection_list
#> $vi_categorical
#> [1] "rainfall_mean" "koppen_group" "soil_type"
#> [4] "humidity_max" "humidity_min" "evapotranspiration_max"
#> [7] "solar_rad_max" "swi_range" "subregion"
#> [10] "rainfall_min" "soil_soc" "biogeo_biome"
#> [13] "soil_nitrogen" "soil_sand" "topo_diversity"
#> [16] "soil_clay" "topo_slope" "topo_elevation"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_categorical"
#>
#> $vi_numeric
#> [1] "rainfall_mean" "swi_mean" "soil_temperature_max"
#> [4] "humidity_max" "soil_type" "soil_temperature_range"
#> [7] "solar_rad_max" "country_name" "rainfall_range"
#> [10] "country_gdp" "country_population" "biogeo_biome"
#> [13] "soil_nitrogen" "continent" "topo_diversity"
#> [16] "topo_elevation" "country_income" "topo_slope"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"
#>
#> $vi_binomial
#> [1] "biogeo_realm" "region" "koppen_group"
#> [4] "country_income" "biogeo_biome" "subregion"
#> [7] "soil_type" "rainfall_mean" "soil_temperature_range"
#> [10] "soil_temperature_max" "solar_rad_max" "humidity_max"
#> [13] "country_gdp" "country_population" "soil_soc"
#> [16] "soil_nitrogen" "swi_range" "humidity_range"
#> [19] "topo_diversity" "soil_clay"
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_binomial"
Pairwise Correlation Filtering
Pairwise correlation is commonly used to detect and reduce multicollinearity by identifying pairs of predictors that are highly correlated with each other.
The function cor_select() builds upon this concept and improves it by integrating categorical predictors into the correlation analysis, and implementing an automated selection algorithm designed to preserve important predictors.
This function performs the following steps:
- Computes a pairwise correlation matrix using methods able to integrate categorical predictors.
- Applies a forward stepwise multicollinearity filtering to select predictors below a certain correlation threshold.
Pairwise Correlation Matrix
Pairwise correlations are computed with cor_df().
df_cor <- collinear::cor_df(
df = vi,
predictors = vi_predictors
)
Notice that the original sign of the correlation is kept in the output, but the correlation column is arranged using absolute values instead.
There are three possible cases to handle when building the correlation matrix:
- Numeric vs numeric (cor_numeric_vs_numeric()): Computes the R-squared between both predictors.
x <- collinear::cor_numeric_vs_numeric(
df = vi,
predictors = c(
"temperature_mean", #numeric
"temperature_max" #numeric
)
)
- Numeric vs categorical (cor_numeric_vs_categorical()): The categorical predictor is target-encoded against the numeric, and then their R-squared is computed.
x <- collinear::cor_numeric_vs_categorical(
df = vi,
predictors = c(
"temperature_mean", #numeric
"soil_type" #categorical
)
)
- Categorical vs categorical (cor_categorical_vs_categorical()): Computes the Cramer’s V between both predictors. Please keep in mind that comparing Cramer’s V and R-squared is a suboptimal solution, and it is always preferable to target-encode categorical predictors before the pairwise correlation analysis.
x <- collinear::cor_categorical_vs_categorical(
df = vi,
predictors = c(
"koppen_zone", #categorical
"soil_type" #categorical
)
)
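The numeric vs categorical case listed above can be sketched in base R. The example assumes a plain group-mean encoding, which may not be the encoding method the package applies internally, and uses made-up values; the names toy and soil_encoded are hypothetical.

```r
# Toy data: hypothetical values
toy <- data.frame(
  temperature = c(25, 27, 26, 10, 12, 11, 18, 17),
  soil = c("a", "a", "a", "b", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# target-encode the categorical predictor against the numeric one:
# each level is replaced by the mean temperature of its cases
soil_encoded <- ave(toy$temperature, toy$soil, FUN = mean)

# Pearson correlation between the numeric predictor
# and the encoded categorical predictor
r <- cor(toy$temperature, soil_encoded)

# squared correlation (R-squared)
r_squared <- r^2
r_squared
```

With group-mean encoding, this R-squared equals the proportion of variance in the numeric predictor explained by the levels of the categorical one, which is why it can act as a measure of association between the two.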
The function cor_matrix() removes the correlation sign and rearranges the pairwise correlations data frame into a correlation matrix.
m <- collinear::cor_matrix(
df = df_cor
)
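The rearrangement from a long pairwise table to a square matrix can be illustrated with a few lines of base R. The sketch below is not the code of cor_matrix(); the column names "x" and "y" and all values are assumptions made for the example.

```r
# Toy pairwise correlation table.
# Column names "x" and "y" are assumed for illustration.
pairs_toy <- data.frame(
  x = c("a", "a", "b"),
  y = c("b", "c", "c"),
  correlation = c(-0.9, 0.2, -0.4),
  stringsAsFactors = FALSE
)

# square matrix with ones in the diagonal
vars <- sort(unique(c(pairs_toy$x, pairs_toy$y)))
m_toy <- matrix(
  data = 1,
  nrow = length(vars),
  ncol = length(vars),
  dimnames = list(vars, vars)
)

# fill both triangles with absolute correlations
m_toy[cbind(pairs_toy$x, pairs_toy$y)] <- abs(pairs_toy$correlation)
m_toy[cbind(pairs_toy$y, pairs_toy$x)] <- abs(pairs_toy$correlation)

m_toy
```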
The first 10 rows and columns of the correlation matrix are shown below.
Multicollinearity Filtering
The forward stepwise multicollinearity filtering implemented in cor_select() works as follows:
- Order the pairwise correlation matrix to match preference_order.
- Add the first predictor in preference_order to the vector selected.
- For every other predictor: get its maximum correlation with the predictors in selected. If it is not higher than max_cor, add it to selected, and ignore it otherwise. Move to the next predictor until all of them have been tested.
#preference order from a previous example
preference_order <- preference_list$vi_numeric$predictor
#correlation threshold
max_cor <- 0.5
#reorder pairwise correlation matrix
m <- m[
preference_order,
preference_order
]
#set diagonals to zero
diag(m) <- 0
#initialize required vectors
selected <- preference_order[1]
candidates <- preference_order[-1]
#iterate over candidates
for(candidate in candidates){
#apply selection criteria
if(max(m[selected, candidate]) <= max_cor){
selected <- c(
selected,
candidate
)
}
}
selected
#> [1] "rainfall_mean" "swi_min" "country_gdp" "swi_range"
#> [5] "temperature_min" "humidity_range" "topo_diversity" "soil_clay"
#> [9] "topo_elevation" "country_income" "soil_silt"
None of the predictors in selected has an absolute pairwise correlation with others higher than the defined threshold.
df_cor <- collinear::cor_df(
df = df,
predictors = selected
)
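The same check can be reproduced without the package. The sketch below wraps the forward stepwise filter in a hypothetical function named cor_filter_sketch(), runs it on numeric columns of the built-in mtcars data with an arbitrary preference order, and then confirms that no pair of selected predictors exceeds the threshold. It handles numeric predictors only and is not a substitute for cor_select().

```r
# Hypothetical helper, not part of collinear
cor_filter_sketch <- function(df, preference_order, max_cor){
  # absolute Pearson correlation matrix in preference order
  m <- abs(stats::cor(df[, preference_order, drop = FALSE]))
  diag(m) <- 0
  # the first predictor is always selected
  selected <- preference_order[1]
  for(candidate in preference_order[-1]){
    # keep the candidate if its maximum correlation
    # with the selected predictors does not exceed max_cor
    if(max(m[selected, candidate]) <= max_cor){
      selected <- c(selected, candidate)
    }
  }
  selected
}

selected_toy <- cor_filter_sketch(
  df = mtcars,
  preference_order = c("wt", "hp", "qsec", "drat", "disp"),
  max_cor = 0.5
)
selected_toy

# verification: maximum absolute correlation among selected predictors
m_check <- abs(stats::cor(mtcars[, selected_toy, drop = FALSE]))
max_selected_cor <- max(m_check[upper.tri(m_check)])
```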
VIF Filtering
In a linear model, the confidence interval of a predictor’s estimate is widened by a factor equal to the square root of its Variance Inflation Factor (VIF). Such VIF score is computed as 1/(1 - R2), where R2 is the R-squared of the linear model of the predictor against all other predictors.
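The formula can be verified with base R. The sketch below computes the VIF of three arbitrarily chosen columns of the built-in mtcars data in two equivalent ways: from the R-squared of each predictor regressed against the others, and from the diagonal of the inverse correlation matrix. It illustrates the definition only and is not the code of vif_df().

```r
# arbitrary numeric columns from mtcars, used as predictors
predictors <- c("wt", "hp", "disp")

# VIF from the definition: 1 / (1 - R2)
vif_from_lm <- vapply(
  X = predictors,
  FUN = function(p){
    others <- setdiff(predictors, p)
    # linear model of one predictor against all the others
    model <- lm(
      formula = reformulate(others, response = p),
      data = mtcars
    )
    1 / (1 - summary(model)$r.squared)
  },
  FUN.VALUE = numeric(1)
)

# equivalent shortcut: diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(mtcars[, predictors])))

# widening factor of the confidence intervals
ci_widening <- sqrt(vif_from_lm)

round(vif_from_lm, 2)
```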
This article goes deep into this topic, but the key detail here is that VIF is at the same time a metric of the uncertainty induced by multicollinearity and a tool to manage it.
The function vif_select() incorporates this idea into an automated selection algorithm that takes preference order into account to preserve important predictors.
The actual VIF computation is implemented in the function vif_df(). Unlike cor_df(), vif_df() ignores categorical predictors, unless these are target-encoded.
df_vif <- collinear::vif_df(
df = df,
predictors = selected #output of pairwise correlation selection
)
#>
#> collinear::vif_df(): these predictors are not numeric and will be ignored:
#> - country_income.
In general, VIF scores higher than 2.5 are indicative of multicollinearity, but recommended thresholds may vary between 2.5 and 10 depending on the model type.
The function vif_select() calls vif_df() iteratively to remove predictors with a VIF above a defined threshold, much like cor_select() does.
#VIF threshold
max_vif <- 2.5
#filter out categorical predictors
selected <- collinear::identify_predictors_numeric(
df = df,
predictors = selected
)
#initialize required vectors
#example starts with the selection made
#by `cor_select()` in the previous section
candidates <- selected[-1]
selected <- selected[1]
#iterate over candidate variables
for(candidate in candidates){
vif.df <- vif_df(
df = df,
predictors = c(
selected,
candidate
),
quiet = TRUE
)
#if candidate keeps vif below the threshold
if(max(vif.df$vif) <= max_vif){
#add candidate to selected
selected <- c(
selected,
candidate
)
}
}
selected
#> [1] "rainfall_mean" "swi_min" "country_gdp" "swi_range"
#> [5] "temperature_min"
None of the predictors in selected has a VIF higher than the defined threshold.
df_vif <- collinear::vif_df(
df = df,
predictors = selected
)
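For readers who want to experiment without the package, the forward VIF filtering can be sketched in base R. The helpers vif_sketch() and vif_filter_sketch() are hypothetical; the first relies on the fact that VIF values equal the diagonal of the inverse correlation matrix, and the second mirrors the loop shown above. The built-in mtcars data and the preference order are arbitrary choices.

```r
# Hypothetical helpers, not part of collinear
vif_sketch <- function(df, predictors){
  # a single predictor has no collinearity to measure
  if(length(predictors) < 2){
    return(stats::setNames(rep(1, length(predictors)), predictors))
  }
  # VIF as the diagonal of the inverse correlation matrix
  diag(solve(stats::cor(df[, predictors, drop = FALSE])))
}

vif_filter_sketch <- function(df, preference_order, max_vif){
  # the first predictor is always selected
  selected <- preference_order[1]
  for(candidate in preference_order[-1]){
    vif <- vif_sketch(df, c(selected, candidate))
    # keep the candidate if no VIF exceeds the threshold
    if(max(vif) <= max_vif){
      selected <- c(selected, candidate)
    }
  }
  selected
}

selected_vif <- vif_filter_sketch(
  df = mtcars,
  preference_order = c("wt", "hp", "qsec", "drat", "disp"),
  max_vif = 2.5
)
selected_vif

# verification: maximum VIF among the selected predictors
max_selected_vif <- max(vif_sketch(mtcars, selected_vif))
```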
And that’s all! If you got here, thank you for your interest in collinear. I hope you can find it useful!
Blas M. Benito, PhD