
Dual Filtering Strategy
Source:vignettes/articles/dual_filtering_strategy.Rmd
dual_filtering_strategy.RmdSummary
The function collinear_select() applies two
complementary methods to assess different aspects of
multicollinearity:
Pairwise correlation (Pearson and/or Cramer’s V): identifies pairs of highly redundant variables.
Variance Inflation Factors: measures how much the variance of a regression coefficient is inflated due to multicollinearity with all other predictors.
By combining a pairwise method (correlation) and a multivariate one
(VIF), collinear_select() is able to capture
multicollinearity that either method alone might miss. At the same time,
it respects predictor rankings to protect relevant variables during
filtering.
This article explains how this dual filtering algorithm works, demonstrates its behavior with examples, and provides guidance on when to use each filtering method.
The Multicollinearity Filtering Algorithm
The filtering algorithm in collinear_select() works as
follows:
Predictors are ranked according to the argument
preference_order, or from lower to higher overall multicollinearity otherwise.The correlation matrix between all predictors is computed with
cor_matrix().The first predictor in the ranking is selected, while all others are candidates.
The candidates are evaluated sequentially:
4a. If max_cor is set, and the maximum correlation
between the candidate and the selected is higher than
max_cor, the candidate is discarded and the algorithm moves
to the next one. Otherwise, the candidate goes to the next step.
4b. If max_vif is set, and the maximum VIF of the
candidate against all selected predictors is higher than
max_vif, the candidate is discarded. Otherwise, the
candidate is selected, and the algorithm moves into the next
candidate.
To finalize, the selected predictors are returned.
Let’s take a look at how that works in code with the multicollinearity thresholds and the predictors and below.
max_vif = 5
max_cor = 0.7
predictors <- c(
"swi_mean",
"soil_temperature_mean",
"rainfall_mean",
"growing_season_length"
)To simplify the code below a bit, the predictors are already
ordered by preference, and we omit the cases
max_vif = NULL or max_cor = NULL.
The correlation matrix is computed with cor_matrix() and
ordered after the argument preference_order.
m <- collinear::cor_matrix(
df = vi_smol,
predictors = predictors,
quiet = TRUE
)
m <- m[
predictors,
predictors
]
round(m, 2)
#> swi_mean soil_temperature_mean rainfall_mean
#> swi_mean 1.00 -0.19 0.66
#> soil_temperature_mean -0.19 1.00 0.05
#> rainfall_mean 0.66 0.05 1.00
#> growing_season_length 0.88 -0.07 0.78
#> growing_season_length
#> swi_mean 0.88
#> soil_temperature_mean -0.07
#> rainfall_mean 0.78
#> growing_season_length 1.00The first predictor in preference_order is added to
selected while the remaining ones go to
candidates.
selected <- predictors[1]
candidates <- predictors[-1]The first candidate gets ready for evaluation.
candidate <- candidates[1]The maximum correlation between selected and
candidate is assessed by extracting the
selected rows and the candidate column of the
absolute correlation matrix, computing its maximum, and comparing it
with max_cor.
If the expression above evaluates to TRUE, the candidate
is removed from candidates, as in
candidates <- candidates[-1], and a new candidate is
selected for evaluation.
Otherwise the algorithm moves to the VIF evaluation.
The subset of m with all rows and columns of
selected and candidate is used as input for
vif().
max(
collinear::vif(
m = m[
c(selected, candidate),
c(selected, candidate)
]
)
) > max_vif
#> [1] FALSEIf it evaluates to TRUE, candidate is
rejected and the algorithm goes to the next candidate.
Otherwise, candidate is added to selected
and removed from candidates, and a new
candidate is selected for evaluation.
selected <- c(selected, candidate)
candidates <- candidates[-1]
candidate <- candidates[1]Let’s take a look at the whole loop:
#compute and reorder correlation matrix
m <- collinear::cor_matrix(
df = vi_smol,
predictors = predictors,
quiet = TRUE
)
m <- m[
predictors,
predictors
]
#generate selected and candidates
selected <- predictors[1]
candidates <- predictors[-1]
#iterate over candidates
for (candidate in candidates) {
#correlation
if (
max(
abs(
m[
selected,
candidate
]
)
) > max_cor
) {
#if TRUE move to next candidate
next
}
#vif evaluation
if (
max(
vif(
m[
c(selected, candidate),
c(selected, candidate)
]
)
) > max_vif
) {
#if TRUE, move to next candidate
next
}
selected <- c(selected, candidate)
}
selected
#> [1] "swi_mean" "soil_temperature_mean" "rainfall_mean"Let’s see if collinear_select() returns the same
selection.
collinear::collinear_select(
df = vi_smol,
predictors = predictors,
preference_order = predictors,
max_cor = 0.7,
max_vif = 5
)
#> [1] "swi_mean" "soil_temperature_mean" "rainfall_mean"
#> attr(,"validated")
#> [1] TRUEAnd that is all about how multicollinearity filtering works within
collinear!