
Intelligent Predictor Ranking
Source: vignettes/articles/intelligent_predictor_ranking.Rmd
Summary
The package collinear implements an automated
multicollinearity filtering method devised to preserve as many
relevant predictors as possible. This principle helps balance
multicollinearity reduction with predictive power retention.
This feature is implemented in collinear(),
collinear_select(), vif_select() and
cor_select() via the argument
preference_order. This argument allows representing
predictor relevance in three ways:
- Expert Mode: A vector of predictor names ordered from left to right according to the user’s preference. This option lets collinear() take domain expertise into account, and lets the user focus on specific predictors.
- Intelligent Predictor Ranking: This functionality, implemented in preference_order(), prioritizes predictors by their univariate association with the response to ensure that the most relevant ones are retained during multicollinearity filtering. This option aims to maximize the predictive power of the filtered predictors.
- Naive Option: If none of the options above are used, collinear() ranks predictors from lower to higher collinearity with all other predictors. This option preserves the less redundant predictors, but it might not lead to robust models.
These options are explained in detail in the following sections.
Setup
This article requires the following setup.
library(collinear)
library(future)
library(DT)
#parallelization setup
#only useful for categorical predictors
future::plan(
future::multisession,
workers = future::availableCores() - 1
)
#progress bar (does not work in Rmarkdown)
#progressr::handlers(global = TRUE)
#example data
data(
vi_smol,
vi_predictors,
package = "spatialData"
)
#separate vi_predictors in numeric and categorical
vi_predictors_numeric <- collinear::identify_numeric_variables(
df = vi_smol,
predictors = vi_predictors
)$valid
vi_predictors_categorical <- collinear::identify_categorical_variables(
df = vi_smol,
predictors = vi_predictors
)$valid
Expert Mode
Let’s consider a hypothetical: The user has a dataframe x
with three variables a, b and c,
and domain knowledge indicating that a and b
are key and should be preserved when possible. Then, the user calls
collinear() as follows:
y <- collinear::collinear(
df = x,
predictors = c("a", "b", "c"),
preference_order = c("a", "b"),
max_cor = 0.5
)
Notice that the argument responses is missing: this
option ignores it, making a response variable entirely optional.
What happens from here?
- a: Selected.
- b: Selected if its correlation with a is <= 0.5, and filtered away otherwise.
- c: Selected if its maximum correlation with a and b is <= 0.5, and filtered away otherwise.
In summary, the first predictor in preference_order is
always selected, and the other ones are selected or rejected
conditionally on their collinearity with the already selected ones.
In case you wonder: predictors not in
preference_order are ranked from lower to higher
collinearity among themselves, and appended in that order to the
preference vector.
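The selection logic described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code collinear() runs: the helper greedy_cor_filter() and the toy dataframe are hypothetical, the filter uses absolute Pearson correlations only, and it ignores VIF.

```r
# Hypothetical helper, NOT part of collinear: greedy pairwise-correlation filter.
greedy_cor_filter <- function(df, preference_order, max_cor = 0.5) {
  # absolute Pearson correlation between all pairs of candidate predictors
  cor_abs <- abs(stats::cor(df[, preference_order, drop = FALSE]))
  # the first predictor in the preference order is always kept
  selected <- preference_order[1]
  for (candidate in preference_order[-1]) {
    # keep the candidate only if its maximum correlation with the
    # predictors already selected does not exceed max_cor
    if (max(cor_abs[candidate, selected]) <= max_cor) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

# toy data (assumption): b is almost a copy of a, c is independent of both
set.seed(1)
n <- 200
a <- stats::rnorm(n)
x <- data.frame(
  a = a,
  b = a + stats::rnorm(n, sd = 0.1),
  c = stats::rnorm(n)
)

selected <- greedy_cor_filter(
  df = x,
  preference_order = c("a", "b", "c"),
  max_cor = 0.5
)
selected
```

In this toy case a is kept because it comes first, b is filtered away because it is almost a copy of a, and c is kept because it is unrelated to a.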
Let’s use some real data now. The code below calls
collinear() on the dataset spatialData::vi_smol,
which contains a numeric response vi_numeric (values of a
vegetation index) and a bunch of numeric predictors named in the vector
vi_predictors_numeric (generated from spatialData::vi_predictors
in the setup).
Let’s say we’d like to focus our analysis on the limiting role of the
soil water content (variables swi_xxx, from soil water
index) in controlling vi_numeric. In such case, we can call
collinear() as follows:
y <- collinear::collinear(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_numeric,
preference_order = c(
"swi_min",
"swi_max",
"swi_mean",
"swi_range"
),
max_cor = 0.5,
max_vif = 2.5,
quiet = TRUE
)
y$vi_numeric$selection
#> [1] "swi_min" "swi_range"
#> [3] "topo_elevation" "humidity_range"
#> [5] "topo_diversity" "soil_clay"
#> [7] "soil_silt" "rainfall_min"
#> [9] "growing_season_temperature"
#> attr(,"validated")
#> [1] TRUE
Notice how swi_min and swi_range are
selected, but swi_max and swi_mean are removed
because they are collinear with swi_min. All predictors not
in the argument preference_order were ranked from lower to
higher mutual collinearity.
Notice that there’s a linear formula in the formulas
slot of the output.
y$vi_numeric$formulas$linear
#> vi_numeric ~ swi_min + swi_range + topo_elevation + humidity_range +
#> topo_diversity + soil_clay + soil_silt + rainfall_min + growing_season_temperature
#> <environment: 0x5628335dd838>
We can use it to fit a quick exploratory model, and save it for later.
m1 <- stats::lm(
formula = y$vi_numeric$formulas$linear,
data = y$vi_numeric$df
) |>
summary()
m1
#>
#> Call:
#> stats::lm(formula = y$vi_numeric$formulas$linear, data = y$vi_numeric$df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.42580 -0.06746 -0.00238 0.06613 0.29015
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.252e-01 3.476e-02 -3.602 0.000343 ***
#> swi_min 5.574e-03 4.813e-04 11.581 < 2e-16 ***
#> swi_range 7.688e-03 3.108e-04 24.736 < 2e-16 ***
#> topo_elevation -3.068e-05 8.289e-06 -3.702 0.000235 ***
#> humidity_range -7.842e-03 7.541e-04 -10.398 < 2e-16 ***
#> topo_diversity 7.221e-03 1.002e-03 7.209 1.80e-12 ***
#> soil_clay 9.669e-04 6.300e-04 1.535 0.125386
#> soil_silt -1.136e-03 4.941e-04 -2.300 0.021832 *
#> rainfall_min 6.533e-04 1.011e-04 6.462 2.21e-10 ***
#> growing_season_temperature 2.986e-03 8.769e-04 3.405 0.000709 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1056 on 570 degrees of freedom
#> Multiple R-squared: 0.7532, Adjusted R-squared: 0.7494
#> F-statistic: 193.3 on 9 and 570 DF, p-value: < 2.2e-16
Intelligent Predictor Ranking
Let’s go back to our little hypothetical with the dataframe
x, and the three variables a, b
and c. But this time we also have a response
y, and a user with not as much domain knowledge.
In this case, collinear() first fits the univariate
models y ~ a, y ~ b, and y ~ c,
computes the R-squared between observations and model predictions, and
ranks the predictors from best to worst according to this metric.
This functionality is implemented in the function
preference_order(), which can take advantage of a
future parallelization backend to speed up operations.
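To make the ranking idea concrete, here is a minimal base R sketch for the hypothetical dataframe x. It is an illustration only, not the implementation in preference_order(): the helper rank_by_univariate_r2() and the simulated data are assumptions, and it fits plain linear models with stats::lm() rather than the f_...() functions of the package.

```r
# Hypothetical helper, NOT part of collinear: rank predictors by the
# R-squared of univariate linear models against the response.
rank_by_univariate_r2 <- function(df, response, predictors) {
  scores <- vapply(
    predictors,
    function(p) {
      # univariate model: response ~ predictor
      m <- stats::lm(stats::reformulate(p, response = response), data = df)
      # R-squared between observations and model predictions
      stats::cor(df[[response]], stats::fitted(m))^2
    },
    numeric(1)
  )
  ranking <- data.frame(predictor = predictors, r_squared = unname(scores))
  # highest R-squared first
  ranking[order(ranking$r_squared, decreasing = TRUE), ]
}

# toy data (assumption): y depends strongly on a, less on b, not on c
set.seed(1)
n <- 200
x <- data.frame(
  a = stats::rnorm(n),
  b = stats::rnorm(n),
  c = stats::rnorm(n)
)
x$y <- 2 * x$a + x$b + stats::rnorm(n)

ranking <- rank_by_univariate_r2(
  df = x,
  response = "y",
  predictors = c("a", "b", "c")
)
ranking
```

The predictor column of the resulting dataframe, read from top to bottom, plays the role of the preference vector in the Expert Mode example.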
Let’s take a look at how this option works with real data. Let me start with the simplest approach.
x <- collinear::preference_order(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_numeric,
quiet = TRUE
)
The function returns a dataframe with the predictors ordered from
best to worst modelling performance against the response. The column
f indicates the name of the function used to fit the
univariate models, f_numeric_glm() in this case. This
function has been selected automatically because the argument
f of preference_order() is set to
f_auto by default (f functions must be written without
parentheses when passed via the f argument). In turn,
f_auto() looks at the types of the response and predictors, and selects
one of the functions returned by f_functions() to
perform the operation.
Let’s talk more about that later, but for now, we can plug the
preference order dataframe directly into collinear().
y <- collinear::collinear(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_numeric,
preference_order = x,
max_cor = 0.5,
max_vif = 2.5,
quiet = TRUE
)
y$vi_numeric$selection
#> [1] "growing_season_length" "swi_min" "rainfall_min"
#> [4] "solar_rad_range" "cloud_cover_range" "soil_clay"
#> [7] "topo_slope" "soil_silt"
#> attr(,"validated")
#> [1] TRUE
Again, we can use the collinear() output to fit a little
model.
m2 <- stats::lm(
formula = y$vi_numeric$formulas$linear,
data = y$vi_numeric$df
) |>
summary()
m2
#>
#> Call:
#> stats::lm(formula = y$vi_numeric$formulas$linear, data = y$vi_numeric$df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.39435 -0.04232 -0.00060 0.04276 0.27836
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.681e-01 1.810e-02 9.283 < 2e-16 ***
#> growing_season_length 1.267e-03 4.238e-05 29.894 < 2e-16 ***
#> swi_min 4.039e-03 4.020e-04 10.047 < 2e-16 ***
#> rainfall_min 1.143e-05 8.168e-05 0.140 0.8888
#> solar_rad_range -3.816e-03 7.157e-04 -5.333 1.40e-07 ***
#> cloud_cover_range 6.370e-04 2.917e-04 2.184 0.0294 *
#> soil_clay -3.457e-04 4.928e-04 -0.702 0.4833
#> topo_slope -6.437e-04 1.101e-03 -0.585 0.5589
#> soil_silt -1.765e-03 3.911e-04 -4.513 7.76e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.08228 on 571 degrees of freedom
#> Multiple R-squared: 0.8499, Adjusted R-squared: 0.8478
#> F-statistic: 404.2 on 8 and 571 DF, p-value: < 2.2e-16
If we compare the R-squared of the two models we have created so far,
we can see that using preference_order() has improved model
fit.
m1$r.squared
#> [1] 0.7532484
m2$r.squared
#> [1] 0.8499317
Let’s go back to what f_auto() does for a moment. This
function looks at the input data to assess the type of the response and
the predictors, and then looks at the dataframe below to choose a
function.
collinear::f_auto_rules()
You can see it in action across different settings below.
collinear::f_auto(
df = vi_smol,
response = "vi_categorical",
predictors = vi_predictors_categorical,
quiet = TRUE
)
#> [1] "f_categorical_rf"
collinear::f_auto(
df = vi_smol,
response = "vi_binomial", #ones and zeros
predictors = vi_predictors_numeric,
quiet = TRUE
)
#> [1] "f_binomial_glm"
collinear::f_auto(
df = vi_smol,
response = "vi_counts", #integer counts
predictors = vi_predictors_numeric,
quiet = TRUE
)
#> [1] "f_count_glm"
collinear::f_auto(
df = vi_smol,
response = "vi_counts",
predictors = vi_predictors, #numeric and categorical
quiet = TRUE
)
#> [1] "f_count_rf"
All f_...() functions available for usage in
preference_order() are listed in the dataframe returned by
f_functions().
collinear::f_functions()
Once you know your way around these functions, you can choose the one
you prefer for your case. For example, below we replace f_auto with f_numeric_gam to fit
univariate GAM models.
x <- collinear::preference_order(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_numeric,
f = f_numeric_gam,
quiet = TRUE
)
A gentle reminder to finish this section: collinear()
runs preference_order() internally when
preference_order = NULL and the argument f
receives a valid function. And like preference_order(), it
can use cross-validation to assess the association between response and
predictor in a more robust manner.
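As a rough illustration of what a cross-validated association score can look like, the base R sketch below repeats a random train/test split and averages the R-squared computed on the held-out rows. It is a sketch under stated assumptions, not the package's implementation: the helper cv_r2() and the toy data are hypothetical, and its iterations and training_fraction arguments only mirror the intent of cv_iterations and cv_training_fraction.

```r
# Hypothetical helper, NOT part of collinear: repeated hold-out R-squared
# of a univariate linear model.
cv_r2 <- function(
    df,
    response,
    predictor,
    iterations = 100,
    training_fraction = 0.5
) {
  n_train <- floor(nrow(df) * training_fraction)
  f <- stats::reformulate(predictor, response = response)
  scores <- vapply(
    seq_len(iterations),
    function(i) {
      # random subset of rows used to fit the model
      train_rows <- sample(nrow(df), size = n_train)
      m <- stats::lm(f, data = df[train_rows, , drop = FALSE])
      # remaining rows used to evaluate the predictions
      test <- df[-train_rows, , drop = FALSE]
      stats::cor(test[[response]], stats::predict(m, newdata = test))^2
    },
    numeric(1)
  )
  # average score across repetitions
  mean(scores)
}

# toy data (assumption): y depends on a but not on c
set.seed(1)
n <- 200
x <- data.frame(a = stats::rnorm(n), c = stats::rnorm(n))
x$y <- 2 * x$a + stats::rnorm(n)

r2_a <- cv_r2(x, response = "y", predictor = "a")
r2_c <- cv_r2(x, response = "y", predictor = "c")
c(a = r2_a, c = r2_c)
```

Averaging over many splits makes the score less dependent on any particular subset of rows, which is the motivation for the cross-validation arguments used in the call that follows.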
y <- collinear::collinear(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_numeric,
preference_order = NULL,
f = f_numeric_glm,
quiet = FALSE,
cv_iterations = 100, #number of repetitions
cv_training_fraction = 0.5 #50% rows of vi_smol
)
#>
#> collinear::collinear(): setting 'max_cor' to 0.6161.
#>
#> collinear::collinear(): setting 'max_vif' to 4.9877.
#>
#> collinear::collinear(): selected predictors:
#> - growing_season_length
#> - cloud_cover_min
#> - temperature_seasonality
#> - cloud_cover_range
#> - evapotranspiration_mean
#> - soil_clay
#> - topo_diversity
#> - humidity_range
#> - topo_elevation
#> - topo_slope
The output of preference_order() is returned by
collinear().
y$vi_numeric$preference_order
Naive Option
For this final option, our hypothetical user does not care about what
I have written above, and sets f = NULL in
preference_order().
In this scenario, preference_order() computes the
pairwise correlation between all pairs of predictors a,
b, and c with cor_matrix(), and
sums the correlations of each predictor with all others. Finally, it
ranks the predictors from lowest to highest sum of correlations.
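The ranking described above can be approximated with base R as follows. This is a sketch, not the package's code: the helper rank_by_cor_sum() and the toy data are hypothetical, stats::cor() stands in for cor_matrix(), and the use of absolute Pearson correlations is an assumption.

```r
# Hypothetical helper, NOT part of collinear: rank predictors from lowest
# to highest sum of absolute correlations with all other predictors.
rank_by_cor_sum <- function(df, predictors) {
  cor_abs <- abs(stats::cor(df[, predictors, drop = FALSE]))
  # ignore the correlation of each predictor with itself
  diag(cor_abs) <- 0
  cor_sum <- rowSums(cor_abs)
  ranking <- data.frame(predictor = predictors, cor_sum = unname(cor_sum))
  # least redundant predictors first
  ranking[order(ranking$cor_sum), ]
}

# toy data (assumption): b is almost a copy of a, c is independent of both
set.seed(1)
n <- 200
a <- stats::rnorm(n)
x <- data.frame(
  a = a,
  b = a + stats::rnorm(n, sd = 0.1),
  c = stats::rnorm(n)
)

ranking <- rank_by_cor_sum(df = x, predictors = c("a", "b", "c"))
ranking
```

Here c ranks first because it shares almost no information with a and b, regardless of whether it is related to any response.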
This option gives preference to those predictors that contain more exclusive information, but in exchange, might not lead to robust models.
x <- collinear::preference_order(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_numeric,
f = NULL
)
#>
#> collinear::preference_order(): ranking 47 'predictors' from lower to higher multicollinearity.
The output shows a column score computed as 1 minus the
sum of correlations, as indicated in the column metric.
Let’s use this ranking in collinear() to then fit a
linear model.
y <- collinear::collinear(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_numeric,
preference_order = x,
max_cor = 0.5,
max_vif = 2.5,
quiet = TRUE
)
m3 <- stats::lm(
formula = y$vi_numeric$formulas$linear,
data = y$vi_numeric$df
) |>
summary()
And finally, an informal comparison between the three preference order methods described in this article.
#expert mode: focused on specific variables (swi_...)
m1$r.squared
#> [1] 0.7532484
#intelligent predictor ranking: optimized for prediction
m2$r.squared
#> [1] 0.8499317
#naive option: minimizes redundancy, not optimized for prediction
m3$r.squared
#> [1] 0.7615246
Please keep in mind that these R-squared values are just coarse indicators of model robustness, and should not be interpreted as proof that one method is better than any other.
Now you know the different ways you can take advantage of
collinear() depending on your goals!