Automates multicollinearity management in datasets with mixed variable types (numeric, categorical, and logical) through an integrated system of five key components:
- Target Encoding Integration (opt-in)
When responses is numeric, categorical predictors can be converted to numeric using response values as reference. This enables VIF and correlation analysis across mixed types. See target_encoding_lab.
- Intelligent Predictor Ranking (active by default)
Three prioritization strategies ensure the most relevant predictors are retained during filtering:
  - User-defined ranking (argument preference_order): Accepts a character vector of predictor names or a dataframe from preference_order. Lower-ranked collinear predictors are removed.
  - Response-based ranking (f): Uses f_auto to auto-select a function (see f_numeric_glm) to rank predictors by their univariate association with the response. Supports cross-validation via preference_order.
  - Multicollinearity-based ranking (default): When both preference_order and f are NULL, predictors are ranked from lower to higher multicollinearity.
- Unified Correlation Framework (active by default)
Computes pairwise correlations between variable types using Pearson (numeric–numeric), target encoding (numeric–categorical), and Cramer's V (categorical–categorical). See cor_df, cor_matrix, and cor_cramer.
- Adaptive Filtering Thresholds (active by default)
When max_cor and max_vif are both NULL, thresholds are derived from the distribution of pairwise correlations among the predictors (see Adaptive Multicollinearity Thresholds).
- Dual Filtering Strategy (active by default)
Combines two complementary methods while respecting predictor rankings:
  - Pairwise Correlation Filtering: Removes predictors with Pearson correlation or Cramer's V above max_cor. See cor_select.
  - VIF-based Filtering: Removes numeric predictors with VIF above max_vif. See vif_select, vif_df, and vif.
This function accepts parallelization via future::plan() and progress bars via progressr::handlers(). Parallelization benefits target_encoding_lab, preference_order, and cor_select.
Usage
collinear(
  df = NULL,
  responses = NULL,
  predictors = NULL,
  encoding_method = NULL,
  preference_order = NULL,
  f = f_auto,
  max_cor = NULL,
  max_vif = NULL,
  quiet = FALSE,
  ...
)
Arguments
- df
(required; dataframe, tibble, or sf) A dataframe with responses (optional) and predictors. Must have at least 10 rows for pairwise correlation analysis, and 10 * (length(predictors) - 1) for VIF. Default: NULL.
- responses
(optional; character, character vector, or NULL) Name of one or several response variables in df. Default: NULL.
- predictors
(optional; character vector or NULL) Names of the predictors in df. If NULL, all columns except responses and constant/near-zero-variance columns are used. Default: NULL.
- encoding_method
(optional; character or NULL) One of "loo", "mean", or "rank". If NULL, target encoding is disabled. Default: NULL.
- preference_order
(optional; character vector, dataframe from preference_order, or NULL) Prioritizes predictors to preserve.
- f
(optional; unquoted function name or NULL) Function to rank predictors by relationship with responses. See f_functions. Default: f_auto.
- max_cor
(optional; numeric or NULL) Maximum allowed pairwise correlation (0.01–0.99). Recommended between 0.5 and 0.9. If NULL and max_vif is NULL, it is selected automatically. Default: NULL.
- max_vif
(optional; numeric or NULL) Maximum allowed VIF. Recommended between 2.5 and 10. If NULL and max_cor is NULL, configured automatically. Default: NULL.
- quiet
(optional; logical) If FALSE, messages are printed. Default: FALSE.
- ...
(optional) Internal args (e.g. function_name for validate_arg_function_name, a precomputed correlation matrix m, or cross-validation args for preference_order).
Value
A list of class collinear_output with sublists of class collinear_selection. If responses = NULL, a single sublist named "result" is returned; otherwise, one sublist per response is returned.
Adaptive Multicollinearity Thresholds
When both max_cor and max_vif are NULL, the function determines thresholds as follows:
1. Compute the 75th percentile of pairwise correlations via cor_stats.
2. Map that value through a sigmoid between 0.545 (VIF~2.5) and 0.785 (VIF~7.5), centered at 0.665, to get max_cor.
3. Compute max_vif from max_cor using gam_cor_to_vif.
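The first two steps can be illustrated with a minimal base R sketch. It is not the package's implementation: the function name sigmoid_max_cor, the logistic form, the steepness value, and the use of absolute correlations are assumptions made for illustration. Only the bounds (0.545, 0.785) and the center (0.665) come from the description above. The third step (gam_cor_to_vif) is not reproduced here.

```r
# Hypothetical logistic mapping from the 75th percentile of pairwise
# correlations to max_cor. `steepness` is an assumed parameter, not a
# value taken from the package.
sigmoid_max_cor <- function(
    cor_q75,
    lower = 0.545,
    upper = 0.785,
    center = 0.665,
    steepness = 10
) {
  lower + (upper - lower) / (1 + exp(-steepness * (cor_q75 - center)))
}

# 75th percentile of absolute pairwise correlations in a toy dataset
# (using absolute values is an assumption of this sketch)
toy_cor <- abs(stats::cor(mtcars[, c("disp", "hp", "wt", "drat", "qsec")]))
cor_q75 <- stats::quantile(
  toy_cor[upper.tri(toy_cor)],
  probs = 0.75,
  names = FALSE
)

# Resulting threshold, always bounded between `lower` and `upper`
max_cor_sketch <- sigmoid_max_cor(cor_q75)
```

Under these assumptions, highly correlated data pushes the threshold toward 0.785, weakly correlated data pulls it toward 0.545, and an input equal to the center returns the midpoint of the two bounds.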
Variance Inflation Factors
VIF for predictor \(a\) is computed as \(1/(1-R^2)\), where \(R^2\) is the multiple R-squared from regressing \(a\) on the other predictors. Commonly recommended maximum values are 2.5, 5, and 10.
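As a sketch of this definition, the base R code below computes VIF for each predictor from an explicit regression and compares it with the diagonal of the inverse correlation matrix, which is a mathematically equivalent shortcut. The helper vif_one and the choice of mtcars columns are illustrative assumptions, and the code is independent of the package's own vif and vif_df functions.

```r
# Numeric predictors from the built-in mtcars dataset (illustrative only)
toy_df <- mtcars[, c("disp", "hp", "wt", "drat")]

# VIF of one predictor: 1 / (1 - R^2), with R^2 taken from the
# regression of that predictor on all the others
vif_one <- function(df, target) {
  others <- setdiff(colnames(df), target)
  model <- stats::lm(
    formula = stats::reformulate(others, response = target),
    data = df
  )
  1 / (1 - summary(model)$r.squared)
}

vif_lm <- vapply(
  colnames(toy_df),
  function(v) vif_one(toy_df, v),
  numeric(1)
)

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_inv <- diag(solve(stats::cor(toy_df)))
```

Both vectors should agree up to numerical precision. The matrix shortcut avoids fitting one model per predictor, but it fails when the correlation matrix is singular.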
VIF-based Filtering
vif_select ranks numeric predictors (user preference_order if provided, otherwise from lower to higher VIF) and sequentially adds predictors whose VIF against the current selection is below max_vif.
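The sequential logic can be sketched in base R as follows. This is a simplified illustration, not the code of vif_select: the function vif_filter_sketch is hypothetical, the ranking is supplied by hand, and only the candidate's own VIF against the current selection is checked.

```r
# Hypothetical sketch of sequential VIF filtering.
# `ranked` holds predictor names ordered from most to least preferred.
vif_filter_sketch <- function(df, ranked, max_vif = 5) {
  selected <- ranked[1]
  for (candidate in ranked[-1]) {
    trial <- c(selected, candidate)
    # VIF of every variable in the trial set, from the inverse
    # correlation matrix; the candidate is the last element
    vif_trial <- diag(solve(stats::cor(df[, trial, drop = FALSE])))
    if (vif_trial[length(trial)] < max_vif) {
      selected <- trial
    }
  }
  selected
}

# Toy run on built-in data with a hand-made preference order
selected_vif <- vif_filter_sketch(
  df = mtcars,
  ranked = c("wt", "disp", "hp", "drat"),
  max_vif = 2.5
)
```

Because predictors are visited in ranking order, a highly ranked predictor is never displaced by a lower-ranked one that is collinear with it.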
Pairwise Correlation Filtering
cor_select computes the global correlation matrix, orders predictors by preference_order or by lower-to-higher summed correlations, and sequentially selects predictors with pairwise correlations below max_cor.
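A minimal base R sketch of this procedure, restricted to numeric predictors and Pearson correlation, is shown below. The function cor_filter_sketch is hypothetical and does not reproduce cor_select. When no ranking is supplied, it falls back to ordering predictors by their summed absolute correlations, from lower to higher.

```r
# Hypothetical sketch of sequential pairwise correlation filtering
cor_filter_sketch <- function(df, ranked = NULL, max_cor = 0.7) {
  # Global matrix of absolute pairwise correlations
  cor_abs <- abs(stats::cor(df))
  if (is.null(ranked)) {
    # Default ranking: lower to higher summed correlations
    ranked <- names(sort(colSums(cor_abs)))
  }
  selected <- ranked[1]
  for (candidate in ranked[-1]) {
    # Keep the candidate only if it is weakly correlated with
    # every predictor already selected
    if (all(cor_abs[candidate, selected] <= max_cor)) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

# Toy run on built-in data with a hand-made preference order
toy_predictors <- c("wt", "disp", "hp", "drat")
selected_cor <- cor_filter_sketch(
  df = mtcars[, toy_predictors],
  ranked = toy_predictors,
  max_cor = 0.7
)
```

Every pair of predictors in the returned selection has an absolute correlation at or below max_cor, since each candidate is checked against all previously accepted predictors.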
References
Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons. DOI: 10.1002/0471725153.
Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32. DOI: 10.1145/507533.507538.
See also
Other multicollinearity_filtering:
collinear_select(),
cor_select(),
step_collinear(),
vif_select()
Examples
#load example data
data(
  vi_smol,
  vi_predictors,
  package = "collinear"
)
#select numeric predictors only
vi_predictors_numeric <- identify_numeric_variables(
  df = vi_smol,
  predictors = vi_predictors
)$valid
#OPTIONAL: parallelization setup
#worth it for large data
# future::plan(
#   strategy = future::multisession,
#   workers = future::availableCores() - 2
# )
#OPTIONAL
#progress bar (doesn't work when running an R example)
#progressr::handlers(global = TRUE)
#filter numeric predictors automatically
x <- collinear(
  df = vi_smol,
  responses = "vi_numeric",
  predictors = vi_predictors_numeric
)
#>
#> collinear::collinear()
#> └── collinear::validate_arg_df()
#> └── collinear::drop_geometry_column(): dropping geometry column from 'df'.
#>
#> collinear::collinear(): setting 'max_cor' to 0.5711.
#>
#> collinear::collinear(): setting 'max_vif' to 3.9685.
#>
#> collinear::collinear()
#> └── collinear::preference_order()
#> └── collinear::f_auto(): selected function 'f_numeric_glm()' to compute preference order.
#>
#> collinear::collinear(): selected predictors:
#> - growing_season_length
#> - soil_temperature_max
#> - evapotranspiration_range
#> - rainfall_min
#> - swi_range
#> - rainfall_range
#> - topo_diversity
#> - humidity_range
#> - soil_clay
#> - topo_elevation
#> - soil_silt
#selected variable names
x$vi_numeric$selection
#> [1] "growing_season_length" "soil_temperature_max"
#> [3] "evapotranspiration_range" "rainfall_min"
#> [5] "swi_range" "rainfall_range"
#> [7] "topo_diversity" "humidity_range"
#> [9] "soil_clay" "topo_elevation"
#> [11] "soil_silt"
#> attr(,"validated")
#> [1] TRUE
#preference order dataframe
x$vi_numeric$preference_order
#> response predictor f metric score rank
#> 1 vi_numeric growing_season_length f_numeric_glm R-squared 0.7041 1
#> 2 vi_numeric soil_ph f_numeric_glm R-squared 0.6751 2
#> 3 vi_numeric humidity_mean f_numeric_glm R-squared 0.6165 3
#> 4 vi_numeric swi_mean f_numeric_glm R-squared 0.6072 4
#> 5 vi_numeric cloud_cover_mean f_numeric_glm R-squared 0.5690 5
#> 6 vi_numeric rainfall_mean f_numeric_glm R-squared 0.5194 6
#> 7 vi_numeric growing_season_rainfall f_numeric_glm R-squared 0.5078 7
#> 8 vi_numeric cloud_cover_max f_numeric_glm R-squared 0.4991 8
#> 9 vi_numeric humidity_max f_numeric_glm R-squared 0.4877 9
#> 10 vi_numeric humidity_min f_numeric_glm R-squared 0.4630 10
#> 11 vi_numeric soil_temperature_max f_numeric_glm R-squared 0.4455 11
#> 12 vi_numeric soil_temperature_range f_numeric_glm R-squared 0.4423 12
#> 13 vi_numeric aridity_index f_numeric_glm R-squared 0.4421 13
#> 14 vi_numeric swi_max f_numeric_glm R-squared 0.4351 14
#> 15 vi_numeric evapotranspiration_max f_numeric_glm R-squared 0.3940 15
#> 16 vi_numeric solar_rad_max f_numeric_glm R-squared 0.3717 16
#> 17 vi_numeric rainfall_max f_numeric_glm R-squared 0.3394 17
#> 18 vi_numeric cloud_cover_min f_numeric_glm R-squared 0.3193 18
#> 19 vi_numeric evapotranspiration_range f_numeric_glm R-squared 0.2339 19
#> 20 vi_numeric rainfall_min f_numeric_glm R-squared 0.2314 20
#> 21 vi_numeric temperature_range f_numeric_glm R-squared 0.2272 21
#> 22 vi_numeric swi_range f_numeric_glm R-squared 0.2270 22
#> 23 vi_numeric rainfall_range f_numeric_glm R-squared 0.2087 23
#> 24 vi_numeric solar_rad_mean f_numeric_glm R-squared 0.1981 24
#> 25 vi_numeric swi_min f_numeric_glm R-squared 0.1596 25
#> 26 vi_numeric temperature_seasonality f_numeric_glm R-squared 0.1350 26
#> 27 vi_numeric soil_soc f_numeric_glm R-squared 0.1349 27
#> 28 vi_numeric soil_nitrogen f_numeric_glm R-squared 0.1287 28
#> 29 vi_numeric evapotranspiration_mean f_numeric_glm R-squared 0.1123 29
#> 30 vi_numeric topo_diversity f_numeric_glm R-squared 0.0992 30
#> 31 vi_numeric temperature_max f_numeric_glm R-squared 0.0865 31
#> 32 vi_numeric cloud_cover_range f_numeric_glm R-squared 0.0818 32
#> 33 vi_numeric humidity_range f_numeric_glm R-squared 0.0731 33
#> 34 vi_numeric temperature_min f_numeric_glm R-squared 0.0686 34
#> 35 vi_numeric solar_rad_range f_numeric_glm R-squared 0.0674 35
#> 36 vi_numeric soil_temperature_min f_numeric_glm R-squared 0.0647 36
#> 37 vi_numeric soil_clay f_numeric_glm R-squared 0.0499 37
#> 38 vi_numeric topo_elevation f_numeric_glm R-squared 0.0412 38
#> 39 vi_numeric soil_sand f_numeric_glm R-squared 0.0377 39
#> 40 vi_numeric topo_slope f_numeric_glm R-squared 0.0145 40
#> 41 vi_numeric soil_silt f_numeric_glm R-squared 0.0106 41
#> 42 vi_numeric soil_temperature_mean f_numeric_glm R-squared 0.0099 42
#> 43 vi_numeric solar_rad_min f_numeric_glm R-squared 0.0067 43
#> 44 vi_numeric temperature_mean f_numeric_glm R-squared 0.0052 44
#> 45 vi_numeric growing_degree_days f_numeric_glm R-squared 0.0031 45
#> 46 vi_numeric evapotranspiration_min f_numeric_glm R-squared 0.0016 46
#> 47 vi_numeric growing_season_temperature f_numeric_glm R-squared 0.0007 47
#model formulas
x$vi_numeric$formulas
#> $linear
#> vi_numeric ~ growing_season_length + soil_temperature_max + evapotranspiration_range +
#> rainfall_min + swi_range + rainfall_range + topo_diversity +
#> humidity_range + soil_clay + topo_elevation + soil_silt
#> <environment: 0x559e81c85120>
#>
#> $smooth
#> vi_numeric ~ s(growing_season_length) + s(soil_temperature_max) +
#> s(evapotranspiration_range) + s(rainfall_min) + s(swi_range) +
#> s(rainfall_range) + s(topo_diversity) + s(humidity_range) +
#> s(soil_clay) + s(topo_elevation) + s(soil_silt)
#> <environment: 0x559e818c3550>
#>
#using a formula
m <- lm(
  formula = x$vi_numeric$formulas$linear,
  data = x$vi_numeric$df
)
summary(m)
#>
#> Call:
#> lm(formula = x$vi_numeric$formulas$linear, data = x$vi_numeric$df)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -0.34463 -0.03830  0.00033  0.04792  0.25815
#>
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)
#> (Intercept)               4.839e-01  3.766e-02  12.848  < 2e-16 ***
#> growing_season_length     6.126e-04  5.294e-05  11.573  < 2e-16 ***
#> soil_temperature_max     -8.416e-03  6.299e-04 -13.360  < 2e-16 ***
#> evapotranspiration_range -2.432e-04  1.036e-04  -2.348 0.019234 *
#> rainfall_min              3.302e-04  9.841e-05   3.355 0.000847 ***
#> swi_range                 3.160e-03  3.464e-04   9.123  < 2e-16 ***
#> rainfall_range            6.266e-05  3.289e-05   1.905 0.057275 .
#> topo_diversity            1.950e-03  7.989e-04   2.441 0.014944 *
#> humidity_range           -1.851e-03  5.993e-04  -3.088 0.002111 **
#> soil_clay                 3.137e-04  4.541e-04   0.691 0.490012
#> topo_elevation           -3.146e-05  5.653e-06  -5.564 4.07e-08 ***
#> soil_silt                -1.186e-03  3.212e-04  -3.694 0.000243 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.07553 on 568 degrees of freedom
#> Multiple R-squared: 0.8274, Adjusted R-squared: 0.824
#> F-statistic: 247.5 on 11 and 568 DF, p-value: < 2.2e-16
#>
#disable parallelization
# future::plan(
#   strategy = future::sequential
# )
