
Intelligent Predictor Ranking
Source: vignettes/articles/intelligent_predictor_ranking.Rmd

Summary
The package collinear implements an automated
multicollinearity filtering method devised to preserve as many
relevant predictors as possible. This principle helps balance
multicollinearity reduction with predictive power retention.
This feature is implemented in collinear(),
collinear_select(), vif_select() and
cor_select() via the argument
preference_order. This argument can express
predictor relevance in three ways:
- Expert Mode: a vector of predictor names ordered from left to right according to the user's preference. This option lets collinear() take domain expertise into account, and lets the user focus on specific predictors.
- Intelligent Predictor Ranking: this functionality, implemented in preference_order(), prioritizes predictors by their univariate association with the response to ensure that the most relevant ones are retained during multicollinearity filtering. This option maximizes the predictive power of the filtered predictors.
- Naive Option: if none of the options above are used, collinear() ranks predictors from lower to higher collinearity with all other predictors. This option preserves the least redundant predictors, but it might not lead to robust models.
These options are explained in detail in the following sections.
Setup
This article requires the following setup.
library(collinear)
library(future)
library(DT)
#parallelization setup
#only useful for categorical predictors
future::plan(
future::multisession,
workers = future::availableCores() - 1
)
#progress bar (does not work in Rmarkdown)
#progressr::handlers(global = TRUE)
#example data
data(vi_smol, vi_predictors_numeric)

Expert Mode
Let’s consider a little hypothetical.
The user has a dataframe x with three variables
a, b and c, and domain knowledge
indicating that a and b are key and should be
preserved when possible. Then, the user calls collinear()
as follows:
y <- collinear::collinear(
df = x,
predictors = c("a", "b", "c"),
preference_order = c("a", "b"),
max_cor = 0.5
)

Notice that the argument responses is missing: this
option ignores it, making a response variable entirely optional.
What happens from here?

- a: selected.
- b: selected if its correlation with a is <= 0.5, and filtered away otherwise.
- c: selected if its maximum correlation with a and b is <= 0.5, and filtered away otherwise.
In summary, the first predictor in preference_order is
always selected, and the others are selected or rejected
depending on their collinearity with the predictors already selected.
In case you are wondering: predictors not in
preference_order are ranked from lower to higher
collinearity among themselves, and appended in that order to the preference
vector.
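For illustration only, here is a minimal base R sketch of this greedy logic. The helper greedy_cor_filter() is hypothetical (it is not the actual implementation used by collinear), and it assumes that all columns of df are numeric.

#hypothetical helper sketching the greedy selection logic described above
greedy_cor_filter <- function(df, preference, max_cor = 0.5) {
  selected <- preference[1]
  for (candidate in preference[-1]) {
    #keep the candidate only if its absolute Pearson correlation with
    #every already selected predictor stays at or below max_cor
    if (all(abs(stats::cor(df[, candidate], df[, selected])) <= max_cor)) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

greedy_cor_filter(df = x, preference = c("a", "b", "c"), max_cor = 0.5)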
Let’s take a look at a more tangible case now. The code below calls
collinear() on the dataset vi_smol, which contains a
numeric response vi_numeric (values of
a vegetation index) and a bunch of numeric predictors named in the
vector vi_predictors_numeric.
Let’s say we’d like to focus our analysis on the limiting role of
soil water content (variables swi_xxx, from soil water
index) in controlling vi_numeric. In that case, we can call
collinear() as follows:
y <- collinear::collinear(
df = vi_smol,
response = "vi_numeric",
predictors = vi_predictors_numeric,
preference_order = c(
"swi_min",
"swi_max",
"swi_mean",
"swi_range"
),
max_cor = 0.5,
max_vif = 2.5,
quiet = TRUE
)
y$vi_numeric$selection
#> [1] "swi_min" "swi_range"
#> [3] "topo_elevation" "topo_slope"
#> [5] "humidity_range" "soil_clay"
#> [7] "soil_silt" "rainfall_min"
#> [9] "growing_season_temperature"
#> attr(,"validated")
#> [1] TRUE

Notice how swi_min and swi_range are
selected, but swi_max and swi_mean are removed
because they are collinear with swi_min. As said above, all
predictors not in the argument preference_order were ranked
from lower to higher mutual collinearity.
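To see why, we can check the pairwise correlations among the soil water index predictors directly with base R (output omitted here):

#pairwise Pearson correlations between the swi predictors in vi_smol
stats::cor(
  vi_smol[, c("swi_min", "swi_max", "swi_mean", "swi_range")],
  use = "pairwise.complete.obs"
)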
Now we can fit a quick exploratory model and save its R-squared for later.
m1 <- stats::lm(
formula = y$vi_numeric$formulas$linear,
data = y$vi_numeric$df
) |>
summary()
m1
#>
#> Call:
#> stats::lm(formula = y$vi_numeric$formulas$linear, data = y$vi_numeric$df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.42802 -0.06894 -0.00307 0.07331 0.32260
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.261e-02 3.004e-02 0.420 0.67490
#> swi_min 5.591e-03 4.769e-04 11.723 < 2e-16 ***
#> swi_range 7.739e-03 3.072e-04 25.191 < 2e-16 ***
#> topo_elevation -2.709e-05 8.601e-06 -3.149 0.00172 **
#> topo_slope 7.592e-03 1.551e-03 4.895 1.26e-06 ***
#> humidity_range -8.134e-03 7.503e-04 -10.840 < 2e-16 ***
#> soil_clay 1.120e-03 6.266e-04 1.788 0.07433 .
#> soil_silt -1.257e-03 4.987e-04 -2.520 0.01200 *
#> rainfall_min 7.244e-04 1.008e-04 7.190 1.94e-12 ***
#> growing_season_temperature 3.418e-03 8.626e-04 3.963 8.29e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1079 on 600 degrees of freedom
#> Multiple R-squared: 0.7418, Adjusted R-squared: 0.7379
#> F-statistic: 191.5 on 9 and 600 DF, p-value: < 2.2e-16

Intelligent Predictor Ranking
Let’s go back to our little hypothetical with the dataframe
x, and the three variables a, b
and c. But this time we also have a response
y, and a user with not as much domain knowledge as they’d
like (it happens, I’ve seen it).
In this case, collinear first fits the univariate models
y ~ a, y ~ b, and y ~ c, computes
the R-squared between observations and model predictions, and ranks the
predictors from best to worst according to this metric.
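Just to make the idea concrete, here is a rough base R sketch of that ranking. It assumes that x also contains the response column y, and uses plain lm() fits rather than the GLMs fitted by the package:

#fit one univariate model per predictor and store its R-squared
r_squared <- sapply(
  c("a", "b", "c"),
  function(predictor) {
    model <- stats::lm(
      formula = stats::reformulate(predictor, response = "y"),
      data = x
    )
    summary(model)$r.squared
  }
)

#predictors ranked from best to worst univariate fit
names(sort(r_squared, decreasing = TRUE))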
This functionality is implemented in the function
preference_order(), which can take advantage of a
parallelization backend to speed up operations.
Let’s take a look at how it works, step by step, starting with the simplest approach.
x <- collinear::preference_order(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_numeric,
quiet = TRUE
)

The function returns a dataframe with the predictors ordered from
best to worst modelling performance against the response. The column
f indicates the name of the function used to fit the
univariate models, f_numeric_glm() in this case. This
function was selected automatically because the argument
f of preference_order() is set to f_auto by default
(f functions must not have parentheses when passed via
the f argument). f_auto looks at the types of
the responses and predictors, and selects one of the functions
returned by f_functions() to perform the operation.
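If you want to inspect the ranking yourself, a quick preview of the first rows is enough (plain base R, output omitted here):

#first rows of the preference order dataframe
head(x)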
We will talk more about f_auto later, but for now, we can plug the
preference order dataframe directly into collinear().
y <- collinear::collinear(
df = vi_smol,
response = "vi_numeric",
predictors = vi_predictors_numeric,
preference_order = x,
max_cor = 0.5,
max_vif = 2.5,
quiet = TRUE
)
y$vi_numeric$selection
#> [1] "growing_season_length" "swi_min" "rainfall_min"
#> [4] "solar_rad_range" "cloud_cover_range" "soil_clay"
#> [7] "topo_diversity" "soil_silt"
#> attr(,"validated")
#> [1] TRUE

Again, we can use the collinear() output to fit a little
model.
m2 <- stats::lm(
formula = y$vi_numeric$formulas$linear,
data = y$vi_numeric$df
) |>
summary()
m2
#>
#> Call:
#> stats::lm(formula = y$vi_numeric$formulas$linear, data = y$vi_numeric$df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.39369 -0.04100 -0.00053 0.04417 0.32204
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.507e-01 2.396e-02 6.291 6.08e-10 ***
#> growing_season_length 1.271e-03 4.163e-05 30.523 < 2e-16 ***
#> swi_min 3.844e-03 3.916e-04 9.815 < 2e-16 ***
#> rainfall_min 9.932e-06 8.123e-05 0.122 0.9027
#> solar_rad_range -3.609e-03 6.969e-04 -5.179 3.06e-07 ***
#> cloud_cover_range 5.419e-04 2.880e-04 1.882 0.0603 .
#> soil_clay -2.776e-04 4.865e-04 -0.571 0.5685
#> topo_diversity 7.559e-04 7.380e-04 1.024 0.3062
#> soil_silt -1.749e-03 3.864e-04 -4.527 7.21e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.08315 on 601 degrees of freedom
#> Multiple R-squared: 0.8464, Adjusted R-squared: 0.8443
#> F-statistic: 413.9 on 8 and 601 DF, p-value: < 2.2e-16

If we compare the R-squared of the two models we have created so far,
we can see that preference_order() helps balance aggressive
multicollinearity filtering and robust modelling outcomes.
m1$r.squared
#> [1] 0.7417704
m2$r.squared
#> [1] 0.8463813

Let’s go back to what f_auto() does for a moment. This
function looks at the input data to assess the type of the response and
the predictors, and then looks at the dataframe below to choose a
function.
collinear::f_auto_rules()

You can see it in action across different settings below.
collinear::f_auto(
df = vi_smol,
response = "vi_categorical",
predictors = vi_predictors_categorical,
quiet = TRUE
)
#> [1] "f_categorical_rf"
collinear::f_auto(
df = vi_smol,
response = "vi_binomial", #ones and zeros
predictors = vi_predictors_numeric,
quiet = TRUE
)
#> [1] "f_binomial_glm"
collinear::f_auto(
df = vi_smol,
response = "vi_counts", #integer counts
predictors = vi_predictors_numeric,
quiet = TRUE
)
#> [1] "f_count_glm"
collinear::f_auto(
df = vi_smol,
response = "vi_counts",
predictors = vi_predictors, #numeric and categorical
quiet = TRUE
)
#> [1] "f_count_rf"All f_...() functions available for usage in
preference_order() are listed in the dataframe returned by
f_functions().
collinear::f_functions()

Once you know your way around these functions, you can choose the one
you prefer for your case. For example, below we replace f_auto with f_numeric_gam to fit
univariate GAM models.
x <- collinear::preference_order(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_numeric,
f = f_numeric_gam,
quiet = TRUE
)

A gentle reminder to finish this section: collinear()
runs preference_order() internally when
preference_order = NULL and the argument f
receives a valid function. And like preference_order(), it
can use cross-validation to assess the association between response and
predictor in a more robust manner.
y <- collinear::collinear(
df = vi_smol,
response = "vi_numeric",
predictors = vi_predictors_numeric,
preference_order = NULL,
f = f_numeric_glm,
quiet = FALSE,
cv_iterations = 100, #number of repetitions
cv_training_fraction = 0.5 #50% rows of vi_smol
)
#>
#> collinear::collinear(): setting 'max_cor' to 0.618.
#>
#> collinear::collinear(): setting 'max_vif' to 5.0318.
#>
#> collinear::collinear(): selected predictors:
#> - growing_season_length
#> - cloud_cover_min
#> - temperature_seasonality
#> - cloud_cover_range
#> - evapotranspiration_mean
#> - soil_clay
#> - topo_diversity
#> - humidity_range
#> - topo_elevation
#> - topo_slope
#> - soil_silt

The output of preference_order() is returned by
collinear().
y$vi_numeric$preference_order

Naive Option
For this final option, our hypothetical user does not care about what
I have written above, and sets f = NULL in
preference_order().
In this scenario, preference_order() computes the
pairwise correlations among the predictors a,
b, and c with cor_matrix(), and
sums the correlations of each predictor with all the others. Finally, it
ranks the predictors from lowest to highest sum of correlations.
This option gives preference to those predictors that contain more exclusive information, but in exchange, might not lead to robust models.
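For illustration only, here is a rough base R approximation of this ranking for the hypothetical dataframe x. It uses absolute Pearson correlations via stats::cor() instead of cor_matrix(), so the details differ from the package:

#correlation matrix of the predictors
cor_mat <- abs(stats::cor(x[, c("a", "b", "c")]))

#remove the diagonal so a predictor does not count against itself
diag(cor_mat) <- 0

#sum of correlations of each predictor with all the others
cor_sums <- rowSums(cor_mat)

#predictors ranked from lowest to highest sum of correlations
names(sort(cor_sums))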
x <- collinear::preference_order(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_numeric,
f = NULL
)
#>
#> collinear::preference_order(): ranking 47 'predictors' from lower to higher multicollinearity.

The output shows a column score computed as 1 minus the
sum of correlations, as indicated in the column metric.
Let’s use this ranking in collinear(), and then fit a
linear model.
y <- collinear::collinear(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_numeric,
preference_order = x,
max_cor = 0.5,
max_vif = 2.5,
quiet = TRUE
)
m3 <- stats::lm(
formula = y$vi_numeric$formulas$linear,
data = y$vi_numeric$df
) |>
summary()

And finally, an informal comparison between the three preference order methods described in this article.
#expert mode: focused on specific variables (swi_...)
m1$r.squared
#> [1] 0.7417704
#intelligent predictor ranking: optimized for prediction
m2$r.squared
#> [1] 0.8463813
#naive option: minimizes redundancy, not optimized for prediction
m3$r.squared
#> [1] 0.7494892

Please keep in mind that these R-squared values are just coarse indicators of model robustness, and should not be interpreted as proof that one method is better than any other.