Automates multicollinearity management in datasets with mixed variable types (numeric, categorical, and logical) through an integrated system of five key components:

Target Encoding Integration (opt-in)

When responses is numeric, categorical predictors can be converted to numeric using response values as reference. This enables VIF and correlation analysis across mixed types. See target_encoding_lab.
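As a rough illustration of the idea (not the implementation in target_encoding_lab), the base R sketch below replaces each category with the mean response of its group. The data frame `toy` and its columns are hypothetical.

```r
# Hypothetical toy data: numeric response y, categorical predictor soil
toy <- data.frame(
  y = c(1, 3, 2, 4, 10, 12),
  soil = c("a", "a", "b", "b", "c", "c")
)

# Group-mean encoding: each category becomes the mean of y in its group
toy$soil_encoded <- ave(toy$y, toy$soil, FUN = mean)

# The encoded column is numeric, so Pearson correlation (and VIF) apply
cor(toy$y, toy$soil_encoded)
```

Once encoded, the categorical predictor can be treated like any other numeric column in the correlation and VIF steps.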

Intelligent Predictor Ranking (active by default)

Three prioritization strategies ensure the most relevant predictors are retained during filtering:

  • User-defined ranking (argument preference_order): Accepts a character vector of predictor names or a dataframe from preference_order. Lower-ranked collinear predictors are removed.

  • Response-based ranking (f): Uses f_auto to auto-select a function (see f_numeric_glm) to rank predictors by their univariate association with the response. Supports cross-validation via preference_order.

  • Multicollinearity-based ranking (default): When both preference_order and f are NULL, predictors are ranked from lower to higher multicollinearity.

Unified Correlation Framework (active by default)

Computes pairwise correlations between variable types using Pearson (numeric–numeric), target encoding (numeric–categorical), and Cramer's V (categorical–categorical). See cor_df, cor_matrix, and cor_cramer.
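For readers unfamiliar with Cramer's V, a minimal base R sketch follows. The helper `cramers_v` is hypothetical and uses the textbook formula; cor_cramer may differ in details such as bias correction.

```r
# Hypothetical helper: textbook Cramer's V from a contingency table
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # chi-squared statistic without continuity correction
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  sqrt(as.numeric(chi2) / (n * k))
}

# Identical categorical vectors are perfectly associated
a <- c("a", "a", "b", "b", "c", "c")
cramers_v(a, a)
```

Values range from 0 (no association) to 1 (perfect association), which makes them comparable to absolute Pearson correlations when applying max_cor.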

Adaptive Filtering Thresholds (active by default)

When max_cor and max_vif are both NULL, thresholds are derived from the correlation structure of the predictors (the 75th percentile of their pairwise correlations; see Adaptive Multicollinearity Thresholds).

Dual Filtering Strategy (active by default)

Combines two complementary methods while respecting predictor rankings:

  • Pairwise Correlation Filtering: Removes predictors with Pearson correlation or Cramer's V above max_cor. See cor_select.

  • VIF-based Filtering: Removes numeric predictors with VIF above max_vif. See vif_select, vif_df, and vif.

This function accepts parallelization via future::plan() and progress bars via progressr::handlers(). Parallelization benefits target_encoding_lab, preference_order, and cor_select.

Usage

collinear(
  df = NULL,
  responses = NULL,
  predictors = NULL,
  encoding_method = NULL,
  preference_order = NULL,
  f = f_auto,
  max_cor = NULL,
  max_vif = NULL,
  quiet = FALSE,
  ...
)

Arguments

df

(required; dataframe, tibble, or sf) A dataframe with responses (optional) and predictors. Must have at least 10 rows for pairwise correlation analysis, and 10 * (length(predictors) - 1) for VIF. Default: NULL.

responses

(optional; character, character vector, or NULL) Name of one or several response variables in df. Default: NULL.

predictors

(optional; character vector or NULL) Names of the predictors in df. If NULL, all columns except responses and constant/near-zero-variance columns are used. Default: NULL.

encoding_method

(optional; character or NULL) One of "loo", "mean", or "rank". If NULL, target encoding is disabled. Default: NULL.

preference_order

(optional; character vector, dataframe from preference_order, or NULL) Prioritizes predictors to preserve. Default: NULL.

f

(optional; unquoted function name or NULL) Function to rank predictors by relationship with responses. See f_functions. Default: f_auto.

max_cor

(optional; numeric or NULL) Maximum allowed pairwise correlation (0.01–0.99). Recommended between 0.5 and 0.9. If NULL and max_vif is NULL, it is selected automatically. Default: NULL.

max_vif

(optional; numeric or NULL) Maximum allowed VIF. Recommended between 2.5 and 10. If NULL and max_cor is NULL, it is selected automatically. Default: NULL.

quiet

(optional; logical) If FALSE, messages are printed. Default: FALSE.

...

(optional) Internal args (e.g. function_name for validate_arg_function_name, a precomputed correlation matrix m, or cross-validation args for preference_order).

Value

A list of class collinear_output with sublists of class collinear_selection. If responses = NULL a single sublist named "result" is returned; otherwise a sublist per response is returned.

Adaptive Multicollinearity Thresholds

When both max_cor and max_vif are NULL, the function determines thresholds as follows:

  1. Compute the 75th percentile of pairwise correlations via cor_stats.

  2. Map that value through a sigmoid between 0.545 (VIF~2.5) and 0.785 (VIF~7.5), centered at 0.665, to get max_cor.

  3. Compute max_vif from max_cor using gam_cor_to_vif.
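Steps 1 and 2 can be sketched in base R as follows. This is an illustration under stated assumptions only: the helper name `adaptive_max_cor`, the use of absolute correlations, the steepness `k`, and the increasing direction of the curve are all hypothetical, since the description above fixes only the bounds and the center. Step 3 depends on gam_cor_to_vif and is not reproduced here.

```r
# Hypothetical sketch of steps 1 and 2; not the package's implementation
adaptive_max_cor <- function(df, lower = 0.545, upper = 0.785,
                             center = 0.665, k = 10) {
  # Step 1: 75th percentile of absolute pairwise correlations (assumption:
  # absolute values, upper triangle only)
  m <- abs(cor(df))
  q75 <- unname(quantile(m[upper.tri(m)], probs = 0.75))
  # Step 2: logistic curve bounded by `lower` and `upper` that returns
  # `center` when q75 equals `center` (steepness and direction are assumed)
  lower + (upper - lower) / (1 + exp(-k * (q75 - center)))
}

# Toy call on a built-in dataset
adaptive_max_cor(mtcars[, c("disp", "hp", "wt", "qsec")])
```

Whatever the exact curve, the result stays between the two bounds, so the automatic max_cor never becomes extremely permissive or extremely strict.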

Variance Inflation Factors

VIF for predictor \(a\) is computed as \(1/(1-R^2)\), where \(R^2\) is the multiple R-squared from regressing \(a\) on the other predictors. Commonly recommended maximums are 2.5, 5, and 10.
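The formula can be checked by hand in base R. The sketch below is illustrative and does not use the package's vif or vif_df functions; the helper `vif_one` and the choice of mtcars columns are assumptions made for the example.

```r
# Toy numeric predictors from a built-in dataset (illustration only)
predictors <- mtcars[, c("disp", "hp", "wt")]

# VIF of one predictor: regress it on the others, then apply 1 / (1 - R^2)
vif_one <- function(df, target) {
  others <- setdiff(colnames(df), target)
  model <- lm(reformulate(others, response = target), data = df)
  1 / (1 - summary(model)$r.squared)
}

vif_lm <- sapply(colnames(predictors), vif_one, df = predictors)

# Shortcut giving the same values: diagonal of the inverse correlation matrix
vif_inv <- diag(solve(cor(predictors)))

# Compare both approaches
all.equal(unname(vif_lm), unname(vif_inv))
```

The matrix shortcut avoids fitting one regression per predictor, which matters when the number of predictors is large.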

VIF-based Filtering

vif_select ranks numeric predictors (user preference_order if provided, otherwise from lower to higher VIF) and sequentially adds predictors whose VIF against the current selection is below max_vif.
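The sequential logic can be sketched as below. The helper `select_by_vif` is hypothetical and only mirrors the description above; it is not the code of vif_select.

```r
# Hypothetical helper illustrating the greedy logic; not vif_select() itself
select_by_vif <- function(df, ranked, max_vif = 5) {
  selected <- ranked[1]
  for (candidate in ranked[-1]) {
    trial <- c(selected, candidate)
    # VIFs of the trial set: diagonal of the inverse correlation matrix
    # (solve() fails if the trial set is perfectly collinear)
    vifs <- diag(solve(cor(df[, trial, drop = FALSE])))
    # the candidate is the last element of the trial set; keep it only if
    # its VIF against the current selection does not exceed the threshold
    if (vifs[length(trial)] <= max_vif) {
      selected <- trial
    }
  }
  selected
}

# Toy data: x2 nearly duplicates x1, x3 is independent
set.seed(1)
toy <- data.frame(x1 = rnorm(100))
toy$x2 <- toy$x1 + rnorm(100, sd = 0.01)
toy$x3 <- rnorm(100)

select_by_vif(toy, ranked = c("x1", "x2", "x3"), max_vif = 5)
```

Because predictors are visited in ranking order, the higher-ranked member of a collinear group is the one that survives.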

Pairwise Correlation Filtering

cor_select computes the global correlation matrix, orders predictors by preference_order or by lower-to-higher summed correlations, and sequentially selects predictors with pairwise correlations below max_cor.
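A minimal sketch of this procedure, assuming absolute Pearson correlations on numeric columns only. The helper `select_by_cor` is hypothetical and is not the code of cor_select.

```r
# Hypothetical helper illustrating the greedy logic; not cor_select() itself
select_by_cor <- function(df, ranked = NULL, max_cor = 0.7) {
  # global matrix of absolute pairwise correlations
  m <- abs(cor(df))
  # default order: from lower to higher summed correlations
  if (is.null(ranked)) {
    ranked <- names(sort(colSums(m)))
  }
  selected <- ranked[1]
  for (candidate in ranked[-1]) {
    # keep the candidate only if no correlation with the already selected
    # predictors exceeds the threshold
    if (all(m[candidate, selected] <= max_cor)) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

# Toy data: x2 nearly duplicates x1, x3 is independent
set.seed(1)
toy <- data.frame(x1 = rnorm(100))
toy$x2 <- toy$x1 + rnorm(100, sd = 0.01)
toy$x3 <- rnorm(100)

select_by_cor(toy, ranked = c("x1", "x2", "x3"), max_cor = 0.7)
```

With `ranked = NULL` the least correlated predictor is visited first, so the most redundant ones are the first candidates for removal.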

References

  • Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons. DOI: 10.1002/0471725153.

  • Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32. DOI: 10.1145/507533.507538.

See also

Other multicollinearity_filtering: collinear_select(), cor_select(), step_collinear(), vif_select()

Examples


#load example data
data(
  vi_smol,
  vi_predictors,
  package = "collinear"
  )

#select numeric predictors only
vi_predictors_numeric <- identify_numeric_variables(
  df = vi_smol,
  predictors = vi_predictors
)$valid

#OPTIONAL: parallelization setup
#worth it for large data
# future::plan(
#   strategy = future::multisession,
#   workers = future::availableCores() - 2
# )

#OPTIONAL
#progress bar (doesn't work when running an R example)
#progressr::handlers(global = TRUE)

#filter numeric predictors automatically
x <- collinear(
  df = vi_smol,
  responses = "vi_numeric",
  predictors = vi_predictors_numeric
  )
#> 
#> collinear::collinear()
#> └── collinear::validate_arg_df()
#>     └── collinear::drop_geometry_column(): dropping geometry column from 'df'.
#> 
#> collinear::collinear(): setting 'max_cor' to 0.5711.
#> 
#> collinear::collinear(): setting 'max_vif' to 3.9685.
#> 
#> collinear::collinear()
#> └── collinear::preference_order()
#>     └── collinear::f_auto(): selected function 'f_numeric_glm()' to compute preference order.
#> 
#> collinear::collinear(): selected predictors: 
#>  - growing_season_length
#>  - soil_temperature_max
#>  - evapotranspiration_range
#>  - rainfall_min
#>  - swi_range
#>  - rainfall_range
#>  - topo_diversity
#>  - humidity_range
#>  - soil_clay
#>  - topo_elevation
#>  - soil_silt

#selected variable names
x$vi_numeric$selection
#>  [1] "growing_season_length"    "soil_temperature_max"    
#>  [3] "evapotranspiration_range" "rainfall_min"            
#>  [5] "swi_range"                "rainfall_range"          
#>  [7] "topo_diversity"           "humidity_range"          
#>  [9] "soil_clay"                "topo_elevation"          
#> [11] "soil_silt"               
#> attr(,"validated")
#> [1] TRUE

#preference order dataframe
x$vi_numeric$preference_order
#>      response                  predictor             f    metric  score rank
#> 1  vi_numeric      growing_season_length f_numeric_glm R-squared 0.7041    1
#> 2  vi_numeric                    soil_ph f_numeric_glm R-squared 0.6751    2
#> 3  vi_numeric              humidity_mean f_numeric_glm R-squared 0.6165    3
#> 4  vi_numeric                   swi_mean f_numeric_glm R-squared 0.6072    4
#> 5  vi_numeric           cloud_cover_mean f_numeric_glm R-squared 0.5690    5
#> 6  vi_numeric              rainfall_mean f_numeric_glm R-squared 0.5194    6
#> 7  vi_numeric    growing_season_rainfall f_numeric_glm R-squared 0.5078    7
#> 8  vi_numeric            cloud_cover_max f_numeric_glm R-squared 0.4991    8
#> 9  vi_numeric               humidity_max f_numeric_glm R-squared 0.4877    9
#> 10 vi_numeric               humidity_min f_numeric_glm R-squared 0.4630   10
#> 11 vi_numeric       soil_temperature_max f_numeric_glm R-squared 0.4455   11
#> 12 vi_numeric     soil_temperature_range f_numeric_glm R-squared 0.4423   12
#> 13 vi_numeric              aridity_index f_numeric_glm R-squared 0.4421   13
#> 14 vi_numeric                    swi_max f_numeric_glm R-squared 0.4351   14
#> 15 vi_numeric     evapotranspiration_max f_numeric_glm R-squared 0.3940   15
#> 16 vi_numeric              solar_rad_max f_numeric_glm R-squared 0.3717   16
#> 17 vi_numeric               rainfall_max f_numeric_glm R-squared 0.3394   17
#> 18 vi_numeric            cloud_cover_min f_numeric_glm R-squared 0.3193   18
#> 19 vi_numeric   evapotranspiration_range f_numeric_glm R-squared 0.2339   19
#> 20 vi_numeric               rainfall_min f_numeric_glm R-squared 0.2314   20
#> 21 vi_numeric          temperature_range f_numeric_glm R-squared 0.2272   21
#> 22 vi_numeric                  swi_range f_numeric_glm R-squared 0.2270   22
#> 23 vi_numeric             rainfall_range f_numeric_glm R-squared 0.2087   23
#> 24 vi_numeric             solar_rad_mean f_numeric_glm R-squared 0.1981   24
#> 25 vi_numeric                    swi_min f_numeric_glm R-squared 0.1596   25
#> 26 vi_numeric    temperature_seasonality f_numeric_glm R-squared 0.1350   26
#> 27 vi_numeric                   soil_soc f_numeric_glm R-squared 0.1349   27
#> 28 vi_numeric              soil_nitrogen f_numeric_glm R-squared 0.1287   28
#> 29 vi_numeric    evapotranspiration_mean f_numeric_glm R-squared 0.1123   29
#> 30 vi_numeric             topo_diversity f_numeric_glm R-squared 0.0992   30
#> 31 vi_numeric            temperature_max f_numeric_glm R-squared 0.0865   31
#> 32 vi_numeric          cloud_cover_range f_numeric_glm R-squared 0.0818   32
#> 33 vi_numeric             humidity_range f_numeric_glm R-squared 0.0731   33
#> 34 vi_numeric            temperature_min f_numeric_glm R-squared 0.0686   34
#> 35 vi_numeric            solar_rad_range f_numeric_glm R-squared 0.0674   35
#> 36 vi_numeric       soil_temperature_min f_numeric_glm R-squared 0.0647   36
#> 37 vi_numeric                  soil_clay f_numeric_glm R-squared 0.0499   37
#> 38 vi_numeric             topo_elevation f_numeric_glm R-squared 0.0412   38
#> 39 vi_numeric                  soil_sand f_numeric_glm R-squared 0.0377   39
#> 40 vi_numeric                 topo_slope f_numeric_glm R-squared 0.0145   40
#> 41 vi_numeric                  soil_silt f_numeric_glm R-squared 0.0106   41
#> 42 vi_numeric      soil_temperature_mean f_numeric_glm R-squared 0.0099   42
#> 43 vi_numeric              solar_rad_min f_numeric_glm R-squared 0.0067   43
#> 44 vi_numeric           temperature_mean f_numeric_glm R-squared 0.0052   44
#> 45 vi_numeric        growing_degree_days f_numeric_glm R-squared 0.0031   45
#> 46 vi_numeric     evapotranspiration_min f_numeric_glm R-squared 0.0016   46
#> 47 vi_numeric growing_season_temperature f_numeric_glm R-squared 0.0007   47

#model formulas
x$vi_numeric$formulas
#> $linear
#> vi_numeric ~ growing_season_length + soil_temperature_max + evapotranspiration_range + 
#>     rainfall_min + swi_range + rainfall_range + topo_diversity + 
#>     humidity_range + soil_clay + topo_elevation + soil_silt
#> <environment: 0x559e81c85120>
#> 
#> $smooth
#> vi_numeric ~ s(growing_season_length) + s(soil_temperature_max) + 
#>     s(evapotranspiration_range) + s(rainfall_min) + s(swi_range) + 
#>     s(rainfall_range) + s(topo_diversity) + s(humidity_range) + 
#>     s(soil_clay) + s(topo_elevation) + s(soil_silt)
#> <environment: 0x559e818c3550>
#> 

#using a formula
m <- lm(
  formula = x$vi_numeric$formulas$linear,
  data = x$vi_numeric$df
)

summary(m)
#> 
#> Call:
#> lm(formula = x$vi_numeric$formulas$linear, data = x$vi_numeric$df)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.34463 -0.03830  0.00033  0.04792  0.25815 
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               4.839e-01  3.766e-02  12.848  < 2e-16 ***
#> growing_season_length     6.126e-04  5.294e-05  11.573  < 2e-16 ***
#> soil_temperature_max     -8.416e-03  6.299e-04 -13.360  < 2e-16 ***
#> evapotranspiration_range -2.432e-04  1.036e-04  -2.348 0.019234 *  
#> rainfall_min              3.302e-04  9.841e-05   3.355 0.000847 ***
#> swi_range                 3.160e-03  3.464e-04   9.123  < 2e-16 ***
#> rainfall_range            6.266e-05  3.289e-05   1.905 0.057275 .  
#> topo_diversity            1.950e-03  7.989e-04   2.441 0.014944 *  
#> humidity_range           -1.851e-03  5.993e-04  -3.088 0.002111 ** 
#> soil_clay                 3.137e-04  4.541e-04   0.691 0.490012    
#> topo_elevation           -3.146e-05  5.653e-06  -5.564 4.07e-08 ***
#> soil_silt                -1.186e-03  3.212e-04  -3.694 0.000243 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 0.07553 on 568 degrees of freedom
#> Multiple R-squared:  0.8274,	Adjusted R-squared:  0.824 
#> F-statistic: 247.5 on 11 and 568 DF,  p-value: < 2.2e-16
#> 

#disable parallelization
# future::plan(
#   strategy = future::sequential
# )