collinear 3.0.0
Breaking Changes
API Changes
- Argument response renamed to responses: Now accepts multiple response variables. Functions affected: collinear(), preference_order(), and related validation functions.
- Argument encoding_method defaults to NULL in collinear(): Target encoding is now opt-in rather than automatic. Previously defaulted to "mean".
- Default values changed for max_cor and max_vif: Both now default to NULL, triggering adaptive threshold computation based on the correlation structure of the data.
- Output structure changed for collinear(): Now returns a list of class collinear_output containing sub-lists of class collinear_selection, each with response, df, preference_order, selection, and formulas slots. Previously returned a character vector or named list of character vectors.
Renamed Functions
| Old Name (v2.0) | New Name (v3.0) |
|---|---|
| identify_predictors() | Split into identify_valid_variables(), identify_numeric_variables(), identify_categorical_variables(), identify_logical_variables() |
| identify_predictors_categorical() | identify_categorical_variables() |
| identify_predictors_numeric() | identify_numeric_variables() |
| identify_predictors_zero_variance() | identify_zero_variance_variables() |
| identify_predictors_type() | Removed (merged into identify_valid_variables()) |
Renamed f_ Functions for Preference Order
| Old Name (v2.0) | New Name (v3.0) |
|---|---|
| f_r2_glm_gaussian() | f_numeric_glm() |
| f_r2_gam_gaussian() | f_numeric_gam() |
| f_r2_rf() | f_numeric_rf() |
| f_r2_glm_poisson() | f_count_glm() |
| f_r2_gam_poisson() | f_count_gam() |
| f_auc_glm_binomial() | f_binomial_glm() |
| f_auc_gam_binomial() | f_binomial_gam() |
| f_auc_rf_binomial() | f_binomial_rf() |
| f_v_rf() | f_categorical_rf() |
| (none) | f_count_rf() (new) |
Major New Features
Adaptive Multicollinearity Thresholds
When both max_cor = NULL and max_vif = NULL, collinear() now determines the filtering thresholds automatically using:
- The 75th percentile of pairwise correlations as input.
- A sigmoid transformation that smoothly transitions between conservative (VIF ≈ 2.5) and permissive (VIF ≈ 7.5) thresholds.
- A GAM model (gam_cor_to_vif) mapping correlation thresholds to equivalent VIF values.
This data-driven approach adapts to each dataset’s correlation structure, preventing over-filtering while maintaining statistically meaningful bounds.
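The idea can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper name adaptive_max_vif_sketch(), the use of absolute correlations, the sigmoid midpoint and steepness, and the assumption that stronger correlation structure moves the threshold toward the permissive end are all hypothetical choices made for illustration. The GAM-based mapping through gam_cor_to_vif is not reproduced here.

```r
# Hypothetical helper, NOT part of the collinear API.
# Maps the 75th percentile of absolute pairwise correlations to a VIF
# threshold between 2.5 (conservative) and 7.5 (permissive) via a sigmoid.
# midpoint and steepness are illustrative values, not the package's.
adaptive_max_vif_sketch <- function(df, midpoint = 0.5, steepness = 10) {
  cors <- stats::cor(df)
  abs_cors <- abs(cors[upper.tri(cors)])
  q75 <- stats::quantile(abs_cors, probs = 0.75, names = FALSE)
  # sigmoid weight in (0, 1): near 0 for weak correlation, near 1 for strong
  weight <- 1 / (1 + exp(-steepness * (q75 - midpoint)))
  2.5 + (7.5 - 2.5) * weight
}

set.seed(1)
n <- 200
z <- rnorm(n)

# weakly correlated predictors
weak <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# strongly correlated predictors sharing the latent variable z
strong <- data.frame(
  x1 = z + rnorm(n, sd = 0.3),
  x2 = z + rnorm(n, sd = 0.3),
  x3 = z + rnorm(n, sd = 0.3)
)

weak_threshold <- adaptive_max_vif_sketch(weak)
strong_threshold <- adaptive_max_vif_sketch(strong)
c(weak = weak_threshold, strong = strong_threshold)
```

Under these assumptions the weakly correlated data lands near the conservative bound and the strongly correlated data near the permissive bound, while both stay inside the 2.5 to 7.5 range.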
Tidymodels Integration
- New step_collinear(): Recipe step for multicollinearity filtering in tidymodels workflows.
- Implements proper prep() and bake() methods following the recipes architecture.
Cross-Validation Support in Preference Order
- New arguments cv_training_fraction and cv_iterations in preference_order() and passed through collinear().
- Enables robust predictor ranking through repeated train/test splits.
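Ranking by repeated train/test splits can be sketched in base R as follows. The helper cv_rank_sketch() is hypothetical and does not reproduce the internals of preference_order(); it only borrows the argument names cv_training_fraction and cv_iterations, and it assumes a univariate linear model scored by out-of-sample R-squared.

```r
# Hypothetical helper, NOT part of the collinear API.
# Scores each predictor by the mean out-of-sample R-squared of a
# univariate linear model across repeated random train/test splits.
cv_rank_sketch <- function(df, response, predictors,
                           cv_training_fraction = 0.75,
                           cv_iterations = 30) {
  scores <- vapply(predictors, function(p) {
    r2 <- replicate(cv_iterations, {
      train_idx <- sample(nrow(df), size = floor(cv_training_fraction * nrow(df)))
      fit <- stats::lm(stats::reformulate(p, response = response), data = df[train_idx, ])
      test <- df[-train_idx, ]
      # squared correlation between observed and predicted test values
      stats::cor(test[[response]], stats::predict(fit, newdata = test))^2
    })
    mean(r2)
  }, numeric(1))
  sort(scores, decreasing = TRUE)
}

# simulated data: one strong, one moderate, and one irrelevant predictor
set.seed(42)
n <- 300
sim <- data.frame(strong = rnorm(n), moderate = rnorm(n), noise = rnorm(n))
sim$y <- 2 * sim$strong + sim$moderate + rnorm(n)

ranking <- cv_rank_sketch(sim, "y", c("noise", "moderate", "strong"))
names(ranking)
```

Averaging over several splits makes the ranking less dependent on any single partition of the data than a score computed once on the full sample.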
Rich Output Structure
collinear() now returns comprehensive results including:
- Filtered dataframe with response and selected predictors.
- Preference order dataframe with rankings.
- Ready-to-use model formulas (linear, smooth/GAM, classification).
S3 methods print() and summary() for collinear_output and collinear_selection classes provide clean output formatting.
Correlation Matrix Improvements
- cor_matrix() now returns signed correlations, preserving the positive semi-definite property required for VIF calculations.
- Absolute values are applied only when comparing against max_cor thresholds.
- Fixes numerical instability that could produce negative VIF scores.
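A small base R example shows why the sign matters. It uses the standard identity that VIF values are the diagonal of the inverse correlation matrix; the helper vif_from_cor() and the hand-built 4 x 4 matrix are illustrative assumptions, not package code.

```r
# A valid (positive definite) signed correlation matrix for four predictors.
a <- 0.6
vars <- paste0("x", 1:4)
signed <- matrix(
  c(1,  0, a,  a,
    0,  1, a, -a,
    a,  a, 1,  0,
    a, -a, 0,  1),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Hypothetical helper: VIF as the diagonal of the inverse correlation matrix.
vif_from_cor <- function(m) diag(solve(m))

# Signed matrix: all eigenvalues positive, all VIF values roughly 3.57.
min_eigen_signed <- min(eigen(signed, symmetric = TRUE, only.values = TRUE)$values)
vif_signed <- vif_from_cor(signed)

# Taking abs() first flips one correlation and breaks positive
# semi-definiteness: one eigenvalue becomes negative and the resulting
# "VIF" values are negative (roughly -0.64).
absolute <- abs(signed)
min_eigen_abs <- min(eigen(absolute, symmetric = TRUE, only.values = TRUE)$values)
vif_abs <- vif_from_cor(absolute)

rbind(signed = vif_signed, absolute = vif_abs)
```

Because a VIF is never below 1 for a valid correlation matrix, the negative values in the second row are purely an artifact of discarding the signs before inversion.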
New Functions
Multicollinearity Assessment
- collinear_stats(): Compute summary statistics for both correlation and VIF.
- cor_stats(): Summary statistics for pairwise correlations.
- vif_stats(): Summary statistics for variance inflation factors.
Preference Order
- f_count_rf(): Score integer count predictors with random forest.
New Datasets and Models
| Name | Description |
|---|---|
| experiment_adaptive_thresholds | Validation experiment results (10,000 iterations) |
| experiment_cor_vs_vif | Correlation vs VIF equivalence experiment results |
| gam_cor_to_vif | Fitted GAM for mapping max_cor to max_vif |
| prediction_cor_to_vif | Look-up table for threshold equivalence |
| toy | Simple dataset illustrating multicollinearity concepts |
| vi_smol | Smaller version of vi dataset (610 rows) for faster examples |
| vi_responses | Character vector of response variable names |
Improvements
VIF Computation
- Ridge regularization fallback for near-singular matrices.
- Improved tolerance calculation for solve() to prevent false singularity detection.
- VIF values exceeding 1M are now capped to Inf.
Bug Fixes
- Fixed correlation matrix handling that destroyed the positive semi-definite property when applying abs() before VIF computation.
- Fixed edge cases in VIF computation for ill-conditioned matrices.
- Proper handling of single-predictor cases across all functions.
collinear 2.0.0
CRAN release: 2024-11-08
Main Improvements
- Expanded Functionality: Functions collinear() and preference_order() support both categorical and numeric responses and predictors, and can handle several responses at once.
- Robust Selection Algorithms: Enhanced selection in vif_select() and cor_select().
- Enhanced Functionality to Rank Predictors: New functions to compute association between response and predictors covering most use-cases, and automated function selection depending on data features.
- Simplified Target Encoding: Streamlined and parallelized for better efficiency, and new default is "loo" (leave-one-out).
- Parallelization and Progress Bars: Utilizes future and progressr for enhanced performance and user experience.
collinear 1.1.1
CRAN release: 2023-12-08
- Initial CRAN release
- Basic multicollinearity filtering with collinear(), cor_select(), and vif_select()
- Target encoding methods: mean, rank, leave-one-out
- Preference order functionality
- Support for mixed numeric and categorical predictors
