Skip to contents

Summary

This article explains how the package collinear handles responses and predictors of different types to facilitate multicollinearity filtering. The explanation is centered on the inputs, logic, and output of the function collinear(), and the main functions it calls: target_encoding_lab(), preference_order(), cor_select(), and vif_select().

Required Packages

The packages required for this tutorial are:

Parallelization Setup and Progress Bars

Most functions in the package now accept a parallelization setup via future::plan() and progress bars via progressr::handlers(). However, progress bars are ignored in this tutorial because they don’t work in Rmarkdown.

future::plan(
  future::multisession,
  workers = parallelly::availableCores() - 1
  )

#progress bar (does not work in Rmarkdown)
#progressr::handlers(global = TRUE)

Example Data

The package collinear includes the example data frame collinear::vi, with 30000 rows, 68 columns, and 20108 NA values. It contains several numeric and categorical responses and predictors.

The response columns, all derived from the same data, have descriptive names: vi_numeric, vi_counts (integers), vi_binomial (1s and 0s), vi_categorical (five categories), and vi_factor (factor version of the previous one).

Predictor names are grouped in character vectors: collinear::vi_predictors_numeric (49 numeric and integer predictors), collinear::vi_predictors_categorical (12 character and factor predictors), and collinear::vi_predictors containing them all.

The code below makes collinear::vi a bit smaller to accelerate the examples below.

df <- collinear::vi[1:5000, ]

The Function collinear()

This function serves as single entry point to the full functionality of the package. It aims to facilitate multicollinearity filtering for any combination of categorical and/or numeric responses and predictors.

The code below runs a full multicollinearity filtering for a numeric and a categorical response, and a set of predictors with mixed types (numeric, integer, character and factor).

The meaning of the function arguments is explained in the next sections.

selection <- collinear::collinear(
  df = df,
  response = c(              
    "vi_numeric",             #numeric response
    "vi_categorical"          #categorical response
    ),            
  predictors = vi_predictors, #numeric and categorical predictors
  encoding_method = "loo",    #leave-one-out target encoding
  preference_order = "auto",  #automatic ranking of predictors
  f = "auto",                 #automatic selection of ranking function
  quiet = FALSE,              #enable messages 
  max_cor = 0.75,             #maximum correlation threshold
  max_vif = 5                 #maximum VIF threshold
)
#> 
#> collinear::collinear(): processing response 'vi_numeric'.
#> ---------------------------------------------------------------
#> 
#> collinear::target_encoding_lab(): using response 'vi_numeric' to encode categorical predictors:
#>  - koppen_zone
#>  - koppen_group
#>  - koppen_description
#>  - soil_type
#>  - biogeo_ecoregion
#>  - biogeo_biome
#>  - biogeo_realm
#>  - country_name
#>  - country_income
#>  - continent
#>  - region
#>  - subregion
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#> 
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - growing_season_length
#>  - soil_temperature_max
#>  - soil_temperature_range
#>  - solar_rad_max
#>  - rainfall_max
#>  - subregion
#>  - biogeo_realm
#>  - swi_range
#>  - rainfall_min
#>  - solar_rad_mean
#>  - soil_nitrogen
#>  - continent
#>  - soil_soc
#>  - solar_rad_range
#>  - cloud_cover_range
#>  - topo_diversity
#>  - soil_clay
#>  - humidity_range
#>  - country_income
#>  - topo_elevation
#>  - soil_sand
#>  - topo_slope
#>  - temperature_mean
#>  - country_population
#>  - country_gdp
#> 
#> collinear::vif_select(): selected predictors: 
#>  - growing_season_length
#>  - soil_temperature_max
#>  - soil_temperature_range
#>  - solar_rad_max
#>  - rainfall_max
#>  - subregion
#>  - biogeo_realm
#>  - swi_range
#>  - rainfall_min
#>  - soil_nitrogen
#>  - continent
#>  - cloud_cover_range
#>  - topo_diversity
#> 
#> collinear::collinear(): processing response 'vi_categorical'.
#> ---------------------------------------------------------------
#> 
#> collinear::target_encoding_lab(): argument 'response' is not numeric, skipping target-encoding.
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_categorical'.
#> 
#> collinear::f_auto(): selected function: 'f_v_rf_categorical()'.
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - rainfall_mean
#>  - swi_mean
#>  - soil_temperature_max
#>  - soil_type
#>  - humidity_max
#>  - solar_rad_max
#>  - country_gdp
#>  - swi_range
#>  - rainfall_range
#>  - country_population
#>  - soil_soc
#>  - rainfall_min
#>  - temperature_range
#>  - evapotranspiration_mean
#>  - soil_nitrogen
#>  - region
#>  - growing_season_temperature
#>  - country_income
#>  - cloud_cover_range
#>  - humidity_range
#>  - soil_sand
#>  - soil_clay
#>  - topo_elevation
#>  - topo_diversity
#>  - topo_slope
#> 
#> collinear::vif_select(): selected predictors: 
#>  - rainfall_mean
#>  - swi_mean
#>  - soil_temperature_max
#>  - humidity_max
#>  - solar_rad_max
#>  - country_gdp
#>  - swi_range
#>  - rainfall_range
#>  - country_population
#>  - soil_soc
#>  - topo_diversity
#>  - topo_slope
#> 
#> collinear::collinear(): selected predictors: 
#>  - rainfall_mean
#>  - swi_mean
#>  - soil_temperature_max
#>  - soil_type
#>  - humidity_max
#>  - solar_rad_max
#>  - country_gdp
#>  - swi_range
#>  - rainfall_range
#>  - country_population
#>  - soil_soc
#>  - region
#>  - country_income
#>  - topo_diversity
#>  - topo_slope

The output is a named list of vectors with predictor names when more than one response is provided, and a character vector otherwise.

selection
#> $vi_numeric
#>  [1] "growing_season_length"  "soil_temperature_max"   "soil_temperature_range"
#>  [4] "solar_rad_max"          "rainfall_max"           "subregion"             
#>  [7] "biogeo_realm"           "swi_range"              "rainfall_min"          
#> [10] "soil_nitrogen"          "continent"              "cloud_cover_range"     
#> [13] "topo_diversity"        
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"
#> 
#> $vi_categorical
#>  [1] "rainfall_mean"        "swi_mean"             "soil_temperature_max"
#>  [4] "soil_type"            "humidity_max"         "solar_rad_max"       
#>  [7] "country_gdp"          "swi_range"            "rainfall_range"      
#> [10] "country_population"   "soil_soc"             "region"              
#> [13] "country_income"       "topo_diversity"       "topo_slope"          
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_categorical"

The variable selections are different because collinear() follows different paths for numeric and categorical responses. The table below summarizes the differences between these alternate paths.

Functionality numeric response categorical response
Target-encoding executed: 12 categorical predictors transformed to numeric skipped: categorical predictors continue to next steps
Preference order f_r2_pearson(): R-squared between response and predictors f_v_rf_categorical(): Cramer’s V of response against univariate random forest predictions
Pairwise correlation filtering - numeric vs numeric: Pearson correlation - numeric vs numeric: Pearson correlation
- numeric vs categorical: target-encoding + R-squared
- categorical vs categorical: Cramer’s V
VIF filtering Applied to all remaining predictors Applied to numeric remaining predictors

The following sections explain key differences between these paths, and provide hints on the logic behind target encoding, perference order, and multicollinearity filtering

How It Works

The functionalities of collinear() (target encoding, preference order, pairwise correlation filtering, and VIF filtering) are provided by other functions that have specific data requirements. Additionally, aiming to cover most use cases, collinear() allows disabling each functionality separately. The table below summarizes these details.

Function Functionality Requirements Disabled
target_encoding_lab() transform categorical predictors
to numeric
- numeric response
- categorical predictors
- response = NULL
- encoding_method = NULL
preference_order() rank and preserve
important predictors
any response - response = NULL
- preference_order = NULL
cor_select() reduce
pairwise correlation
any predictors max_cor = NULL
vif_select() reduce
variance inflation
numeric predictors max_vif = NULL

The following sections focus on these functions and explain how their respective functionalities are implemented.

Target Encoding

Target-encoding transforms categorical predictors to numeric by using the values of a numeric response across groups as reference. This transformation enables the application of the same multicollinearity filtering (and modelling) methods to categorical and numeric predictors at once. This section explains the method in brief, but there is a lengthier article about target-encoding here.

In collinear(), this functionality is controlled by the function target_encoding_lab(). Its argument encoding_method defines how categorical predictors are transformed to numeric, or disables the functionality entirely when NULL.

The example data frame below, used to explain how target encoding works, has two levels of the categorical predictor “koppen_zone” and the response “vi_numeric”.

When introducing this data frame into target_encoding_lab() with the method “loo” (from leave-one-out), it is first grouped by the levels of “koppen_zone”, and then each case is encoded as the average of response across all other cases within the same level.

The result shows “koppen_zone” encoded as numeric.

Due to the requirement for a numeric response, in the example call to collinear() target encoding is only applied for the response “vi_numeric” as follows:

df_vi_numeric <- collinear::target_encoding_lab(
  df = df,
  response = "vi_numeric",
  predictors = vi_predictors,
  method = "loo",
  overwrite = TRUE,
  quiet = TRUE
)

This operation results in zero categorical predictors in the data frame df_vi_numeric:

collinear::identify_predictors_categorical(
  df = df_vi_numeric,
  predictors = vi_predictors
)
#> character(0)

On the other hand, target encoding is skipped for the categorical response “vi_categorical”, resulting in 12 categorical predictors.

df_vi_categorical <- df

collinear::identify_predictors_categorical(
  df = df_vi_categorical,
  predictors = vi_predictors
)
#>  [1] "koppen_zone"        "koppen_group"       "koppen_description"
#>  [4] "soil_type"          "biogeo_ecoregion"   "biogeo_biome"      
#>  [7] "biogeo_realm"       "country_name"       "country_income"    
#> [10] "continent"          "region"             "subregion"

If your data comprises numeric responses and a mixture of numeric and categorical predictors, it is preferable to target-encode your data frame before multicollinearity filtering with target_encoding_lab() or any other similar function, so the encoded predictors are also available for data exploration and modelling purposes.

Preference Order

The multicollinearity filtering method implemented in collinear() is devised to preserve as many relevant predictors as possible. This principle ensures a good balance between multicollinearity and predictive power in the resulting selection of predictors.

This functionality is implemented as follows.

The functions cor_select() and vif_select() have the argument preference_order (also in collinear()), which accepts a ranking of predictors. This ranking is then considered by the multicollinearity filtering methods implemented in these functions to preserve important predictors.

The argument preference_order accepts different inputs.

Custom Preference Vector

Valid input in collinear(), cor_select(), and vif_select().

A custom preference vector has predictors names ordered by the user’s criteria. This option allows targeting specific predictors for particular purposes. For example, the code below shows a hypothetical case focused on preserving soil temperature variables over all others.

selection_from_vector <- collinear::collinear(
  df = df,
  response = "vi_numeric",
  predictors = vi_predictors_numeric,
  preference_order = c(
    "soil_temperature_mean",
    "soil_temperature_range",
    "soil_temperature_min",
    "soil_temperature_max"
  ),
  quiet = TRUE
)

selection_from_vector
#>  [1] "soil_temperature_mean"  "soil_temperature_range" "growing_season_length" 
#>  [4] "rainfall_max"           "swi_range"              "rainfall_min"          
#>  [7] "soil_nitrogen"          "soil_soc"               "cloud_cover_range"     
#> [10] "topo_diversity"         "soil_clay"              "topo_elevation"        
#> [13] "topo_slope"            
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"

Notice that the first predictor in preference_order should always appear first in the variable selection output. The appearence of all other targeted predictors depends on their correlation with the first one.

Preference Data frame

Valid input in collinear(), cor_select(), and vif_select().

A preference data frame has a column named “predictor”, and it is arranged from higher to lower values of a quantitative criterion.

The function preference_order() generates this data frame by computing the association between each predictor and the response using a given f function. The names and features of these functions can be found in the data frame returned by f_functions().

collinear::f_functions()

These functions take a data frame named df with the columns “x” (predictor) and “y” (response) as input, so preparing a custom one for your own purposes is simple enough. But take in mind that preference_order() arranges the resulting data frame from higher to lower preference values.

#custom f function
f_lm <- function(df){
  summary(lm(y ~ x, data = df))$r.squared
}

#using it in preference_order()
preference_df <- collinear::preference_order(
  df = vi,
  response = "vi_numeric",
  predictors = vi_predictors_numeric,
  f = f_lm,
  quiet = TRUE
)

The output data frame contains the names of the response, the predictors, the f function, and the column “preference” with the output of the f function.

The resulting data frame can be plugged into the preference_order argument of collinear() (and also cor_select() and vif_select()):

selection_from_df <- collinear(
  df = vi,
  response = "vi_numeric",
  predictors = vi_predictors_numeric,
  preference_order = preference_df,
  quiet = TRUE
)

selection_from_df
#>  [1] "growing_season_length"  "soil_temperature_max"   "soil_temperature_range"
#>  [4] "solar_rad_max"          "rainfall_max"           "aridity_index"         
#>  [7] "swi_range"              "soil_nitrogen"          "topo_diversity"        
#> [10] "soil_clay"              "soil_sand"              "topo_elevation"        
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"

But collinear() can also compute preference order by itself.

selection_auto <- collinear::collinear(
  df = vi,
  response = "vi_numeric",
  predictors = vi_predictors_numeric,
  preference_order = "auto",
  f = "auto",
  quiet = TRUE
)

selection_auto
#>  [1] "growing_season_length"  "soil_temperature_max"   "soil_temperature_range"
#>  [4] "solar_rad_max"          "rainfall_max"           "aridity_index"         
#>  [7] "swi_range"              "soil_nitrogen"          "topo_diversity"        
#> [10] "soil_clay"              "soil_sand"              "topo_elevation"        
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"

When f = "auto", preference_order() calls f_auto() and f_auto_rules() to select a function appropriate for the data. The example below, with categorical response and predictors, shows how the function choice changes when the response and the predictors are categorical.

preference_auto <- collinear::preference_order(
  df = vi,
  response = "vi_categorical",
  predictors = vi_predictors_categorical,
  f = "auto",
  quiet = TRUE
)

Here f_auto() selects f_v(), which computes Cramer’s V between categorical responses and predictors.

Preference List

Valid input in collinear() only.

The function preference_order(), like collinear(), accepts more than one response.

preference_list <- collinear::preference_order(
  df = vi,
  response = c(
    "vi_categorical",
    "vi_numeric"
    ),
  predictors = vi_predictors,
  f = "auto",
  quiet = TRUE
)

The output is a named list.

names(preference_list)
#> [1] "vi_numeric"     "vi_categorical"

This list can be plugged into the preference_order argument of collinear(). If a response is not in the preference order list, then its preference order computed automatically. This action is described in the function messages when quiet = FALSE.

selection_list <- collinear::collinear(
  df = vi,
  response = c(
    "vi_categorical",
    "vi_numeric",
    "vi_binomial" #not in preference_list
    ),
  predictors = vi_predictors,
  preference_order = preference_list,
  quiet = FALSE
)
#> 
#> collinear::collinear(): processing response 'vi_categorical'.
#> ---------------------------------------------------------------
#> 
#> collinear::target_encoding_lab(): argument 'response' is not numeric, skipping target-encoding.
#> 
#> collinear::collinear(): selecting data frame 'vi_categorical' fron preference order list.
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - rainfall_mean
#>  - koppen_group
#>  - soil_type
#>  - humidity_max
#>  - humidity_min
#>  - evapotranspiration_max
#>  - solar_rad_max
#>  - rainfall_range
#>  - swi_range
#>  - subregion
#>  - rainfall_min
#>  - soil_soc
#>  - biogeo_biome
#>  - soil_nitrogen
#>  - cloud_cover_range
#>  - humidity_range
#>  - soil_sand
#>  - topo_diversity
#>  - soil_clay
#>  - topo_slope
#>  - topo_elevation
#> 
#> collinear::vif_select(): selected predictors: 
#>  - rainfall_mean
#>  - humidity_max
#>  - humidity_min
#>  - evapotranspiration_max
#>  - solar_rad_max
#>  - swi_range
#>  - rainfall_min
#>  - soil_soc
#>  - soil_nitrogen
#>  - soil_sand
#>  - topo_diversity
#>  - soil_clay
#>  - topo_slope
#>  - topo_elevation
#> 
#> collinear::collinear(): selected predictors: 
#>  - rainfall_mean
#>  - koppen_group
#>  - soil_type
#>  - humidity_max
#>  - humidity_min
#>  - evapotranspiration_max
#>  - solar_rad_max
#>  - swi_range
#>  - subregion
#>  - rainfall_min
#>  - soil_soc
#>  - biogeo_biome
#>  - soil_nitrogen
#>  - soil_sand
#>  - topo_diversity
#>  - soil_clay
#>  - topo_slope
#>  - topo_elevation
#> 
#> collinear::collinear(): processing response 'vi_numeric'.
#> ---------------------------------------------------------------
#> 
#> collinear::target_encoding_lab(): using response 'vi_numeric' to encode categorical predictors:
#>  - koppen_zone
#>  - koppen_group
#>  - koppen_description
#>  - soil_type
#>  - biogeo_ecoregion
#>  - biogeo_biome
#>  - biogeo_realm
#>  - country_name
#>  - country_income
#>  - continent
#>  - region
#>  - subregion
#> 
#> collinear::collinear(): selecting data frame 'vi_numeric' fron preference order list.
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - rainfall_mean
#>  - swi_mean
#>  - soil_temperature_max
#>  - humidity_max
#>  - soil_type
#>  - soil_temperature_range
#>  - solar_rad_max
#>  - country_name
#>  - rainfall_range
#>  - country_gdp
#>  - swi_range
#>  - country_population
#>  - rainfall_min
#>  - soil_soc
#>  - biogeo_biome
#>  - soil_nitrogen
#>  - evapotranspiration_mean
#>  - growing_season_temperature
#>  - solar_rad_range
#>  - continent
#>  - cloud_cover_range
#>  - humidity_range
#>  - topo_diversity
#>  - soil_clay
#>  - topo_elevation
#>  - soil_sand
#>  - country_income
#>  - topo_slope
#> 
#> collinear::vif_select(): selected predictors: 
#>  - rainfall_mean
#>  - swi_mean
#>  - soil_temperature_max
#>  - humidity_max
#>  - soil_type
#>  - soil_temperature_range
#>  - solar_rad_max
#>  - country_name
#>  - rainfall_range
#>  - country_gdp
#>  - country_population
#>  - biogeo_biome
#>  - soil_nitrogen
#>  - continent
#>  - topo_diversity
#>  - topo_elevation
#>  - country_income
#>  - topo_slope
#> 
#> collinear::collinear(): processing response 'vi_binomial'.
#> ---------------------------------------------------------------
#> 
#> collinear::target_encoding_lab(): using response 'vi_binomial' to encode categorical predictors:
#>  - koppen_zone
#>  - koppen_group
#>  - koppen_description
#>  - soil_type
#>  - biogeo_ecoregion
#>  - biogeo_biome
#>  - biogeo_realm
#>  - country_name
#>  - country_income
#>  - continent
#>  - region
#>  - subregion
#> 
#> collinear::collinear(): input preference order list does not have data for the response 'vi_binomial'.
#> 
#> collinear::preference_order(): ranking predictors for response 'vi_binomial'.
#> 
#> collinear::f_auto(): selected function: 'f_auc_rf()'.
#> 
#> collinear::cor_select(): computing pairwise correlation matrix.
#> 
#> collinear::cor_select(): selected predictors: 
#>  - biogeo_realm
#>  - region
#>  - koppen_group
#>  - country_income
#>  - biogeo_biome
#>  - subregion
#>  - soil_type
#>  - rainfall_mean
#>  - soil_temperature_range
#>  - soil_temperature_max
#>  - solar_rad_max
#>  - humidity_max
#>  - country_gdp
#>  - country_population
#>  - rainfall_min
#>  - rainfall_range
#>  - soil_soc
#>  - soil_nitrogen
#>  - evapotranspiration_mean
#>  - swi_range
#>  - growing_season_temperature
#>  - solar_rad_range
#>  - humidity_range
#>  - topo_diversity
#>  - soil_clay
#>  - cloud_cover_range
#>  - topo_elevation
#>  - soil_sand
#>  - topo_slope
#> 
#> collinear::vif_select(): selected predictors: 
#>  - biogeo_realm
#>  - region
#>  - koppen_group
#>  - country_income
#>  - biogeo_biome
#>  - subregion
#>  - soil_type
#>  - rainfall_mean
#>  - soil_temperature_range
#>  - soil_temperature_max
#>  - solar_rad_max
#>  - humidity_max
#>  - country_gdp
#>  - country_population
#>  - soil_soc
#>  - soil_nitrogen
#>  - swi_range
#>  - humidity_range
#>  - topo_diversity
#>  - soil_clay

selection_list
#> $vi_categorical
#>  [1] "rainfall_mean"          "koppen_group"           "soil_type"             
#>  [4] "humidity_max"           "humidity_min"           "evapotranspiration_max"
#>  [7] "solar_rad_max"          "swi_range"              "subregion"             
#> [10] "rainfall_min"           "soil_soc"               "biogeo_biome"          
#> [13] "soil_nitrogen"          "soil_sand"              "topo_diversity"        
#> [16] "soil_clay"              "topo_slope"             "topo_elevation"        
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_categorical"
#> 
#> $vi_numeric
#>  [1] "rainfall_mean"          "swi_mean"               "soil_temperature_max"  
#>  [4] "humidity_max"           "soil_type"              "soil_temperature_range"
#>  [7] "solar_rad_max"          "country_name"           "rainfall_range"        
#> [10] "country_gdp"            "country_population"     "biogeo_biome"          
#> [13] "soil_nitrogen"          "continent"              "topo_diversity"        
#> [16] "topo_elevation"         "country_income"         "topo_slope"            
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"
#> 
#> $vi_binomial
#>  [1] "biogeo_realm"           "region"                 "koppen_group"          
#>  [4] "country_income"         "biogeo_biome"           "subregion"             
#>  [7] "soil_type"              "rainfall_mean"          "soil_temperature_range"
#> [10] "soil_temperature_max"   "solar_rad_max"          "humidity_max"          
#> [13] "country_gdp"            "country_population"     "soil_soc"              
#> [16] "soil_nitrogen"          "swi_range"              "humidity_range"        
#> [19] "topo_diversity"         "soil_clay"             
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_binomial"

Pairwise Correlation Filtering

Pairwise correlation is commonly used to detect and reduce multicollinearity by identifying pairs of predictors that are highly correlated with each other.

This function cor_select() builds upon this concept and improves it by integrating categorical predictors into the correlation analysis, and implementing an automated selection algorithm designed to preserve important predictors.

This function performs the following steps:

  • Computes a pairwise correlation matrix using methods able to integrate categorical predictors.
  • Applies a forward stepwise multicollinearity filtering to select predictors below a certain correlation threshold.

Pairwise Correlation Matrix

Pairwise correlations are computed with cor_df().

df_cor <- collinear::cor_df(
  df = vi,
  predictors = vi_predictors
)

Notice that the original sign of the correlation is kept in the output, but the correlation column is arranged using absolute values instead.

There are three possible cases to handle when building the correlation matrix:

x <- collinear::cor_numeric_vs_numeric(
  df = vi,
  predictors = c(
    "temperature_mean", #numeric
    "temperature_max"   #numeric
  )
)
  • Numeric vs categorical (cor_numeric_vs_categorical()): The categorical predictor is target-encoded against the numeric, and then their R-squared is computed.
x <- collinear::cor_numeric_vs_categorical(
  df = vi,
  predictors = c(
    "temperature_mean", #numeric
    "soil_type"         #categorical
  )
)
  • Categorical vs categorical (cor_categorical_vs_categorical()): Computes the Cramer’s V between both predictors. Please, taken in mind that comparing Cramer’s V and R-squared is a suboptimal solution, and it is always preferable to target-encode categorical predictors before the pairwise correlation analysis.
x <- collinear::cor_categorical_vs_categorical(
  df = vi,
  predictors = c(
    "koppen_zone",      #categorical
    "soil_type"         #categorical
  )
)

The function cor_matrix() removes the correlation sign and rearranges the pairwise correlations data frame into a correlation matrix.

m <- collinear::cor_matrix(
  df = df_cor
)

The first 10 rows and columns of the correlation matrix are shown below.

Multicollinearity Filtering

The forward stepwise multicollinearity filtering implemented in cor_select() works as follows:

  • Order the pairwise correlation matrix to match preference_order.
  • Add the first first predictor in preference_order to the vector selected.
  • For every other predictor: get its maximum correlation with the predictors selected. If lower than max_cor, add it to selected, and ignore it otherwise. Move to the next predictor until all them have been tested.
#preference order from a previous example
preference_order <- preference_list$vi_numeric$predictor

#correlation threshold
max_cor <- 0.5

#reorder pairwise correlation matrix
m <- m[
  preference_order,
  preference_order
]

#set diagonals to zero
diag(m) <- 0

#initialize required vectors
selected <- preference_order[1]
candidates <- preference_order[-1]

#iterate over candidates
for(candidate in candidates){
  
  #apply selection criteria
  if(max(m[selected, candidate]) <= max_cor){
    
    selected <- c(
      selected,
      candidate
    )
    
  }
  
}

selected
#>  [1] "rainfall_mean"   "swi_min"         "country_gdp"     "swi_range"      
#>  [5] "temperature_min" "humidity_range"  "topo_diversity"  "soil_clay"      
#>  [9] "topo_elevation"  "country_income"  "soil_silt"

None of the predictors in selected has an absolute pairwise correlation with others higher than the defined threshold.

df_cor <- collinear::cor_df(
  df = df,
  predictors = selected
)

VIF Filtering

In a linear model, the confidence interval of a predictor’s estimate is widened by a factor equal to the square root of its Variance Inflation Factor (VIF). Such VIF score is computed as 1/(1 - R2), where R2 is the R-squared of the linear model of the predictor against all other predictors.

This article goes deep into this topic, but the key detail here is that VIF is at the same time a metric of the uncertainty induced by multicollinearity and a tool to manage it

The function vif_select() incorporates this idea into an automated selection algorithm that takes preference order into account to preserve important predictors.

The actual VIF computation is implemented in the function vif_df(). Unlike cor_df(), vif_df() ignores categorical predictors, unless these are target-encoded.

df_vif <- collinear::vif_df(
  df = df,
  predictors = selected #output of pairwise correlation selection
)
#> 
#> collinear::vif_df(): these predictors are not numeric and will be ignored: 
#>  - country_income.

In general, VIF scores higher than 2.5 are indicative of multicollinearity, but recommended thresholds may vary between 2.5 and 10 depending on the model type.

The function vif_select() calls vif_df() iteratively to remove predictors with a VIF above a defined threshold threshold, much like cor_select() does.

#VIF threshold
max_vif <- 2.5

#filter out categorical predictors
selected <- collinear::identify_predictors_numeric(
  df = df,
  predictors = selected
)

#initialize required vectors
#example starts with the selection made 
#by `cor_select()` in the previous section
candidates <- selected[-1]
selected <- selected[1]


#iterate over candidate variables
for(candidate in candidates){
  
  vif.df <- vif_df(
    df = df,
    predictors = c(
      selected,
      candidate
    ),
    quiet = TRUE
  )
  
  #if candidate keeps vif below the threshold
  if(max(vif.df$vif) <= max_vif){
    
    #add candidate to selected
    selected <- c(
      selected,
      candidate
    )
    
  }
  
}

selected
#> [1] "rainfall_mean"   "swi_min"         "country_gdp"     "swi_range"      
#> [5] "temperature_min"

None of the predictors in selected has a VIF higher than the defined threshold.

df_vif <- collinear::vif_df(
  df = df,
  predictors = selected
)

And that’s all! If you got here, thank you for your interest in collinear. I hope you can find it useful!

Blas M. Benito, PhD