
Quantitative Variable Prioritization for Multicollinearity Filtering
Source: R/preference_order.R
Ranks a set of predictors by the strength of their association with a response. Aims to minimize the loss of important predictors during multicollinearity filtering.
The strength of association between the response and each predictor is computed by the function f. The f functions available are:
Numeric response vs. numeric predictor:
- f_r2_pearson(): Pearson's R-squared.
- f_r2_spearman(): Spearman's R-squared.
- f_r2_glm_gaussian(): Pearson's R-squared of response versus the predictions of a Gaussian GLM.
- f_r2_glm_gaussian_poly2(): Gaussian GLM with second degree polynomial.
- f_r2_gam_gaussian(): GAM model fitted with mgcv::gam().
- f_r2_rpart(): Recursive Partition Tree fitted with rpart::rpart().
- f_r2_rf(): Random Forest model fitted with ranger::ranger().
Integer counts response vs. numeric predictor:
- f_r2_glm_poisson(): Pearson's R-squared of a Poisson GLM.
- f_r2_glm_poisson_poly2(): Poisson GLM with second degree polynomial.
- f_r2_gam_poisson(): Poisson GAM.
Binomial response (1 and 0) vs. numeric predictor:
- f_auc_glm_binomial(): AUC of quasibinomial GLM with weighted cases.
- f_auc_glm_binomial_poly2(): As above with second degree polynomial.
- f_auc_gam_binomial(): Quasibinomial GAM with weighted cases.
- f_auc_rpart(): Recursive Partition Tree with weighted cases.
- f_auc_rf(): Random Forest model with weighted cases.
Categorical response (character or factor) vs. categorical predictor:
- f_v(): Cramer's V between two categorical variables.
Categorical response vs. categorical or numerical predictor:
- f_v_rf_categorical(): Cramer's V of a Random Forest model.
The name of the function used is stored in the attribute "f_name" of the output data frame. It can be retrieved via attributes(df)$f_name.
Additionally, any custom function accepting a data frame with the columns "x" (predictor) and "y" (response) and returning a numeric indicator of association where higher numbers indicate higher association will work.
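As a minimal sketch of such a custom function, the block below defines a hypothetical f_tau2_kendall() that returns the squared Kendall rank correlation between the columns "x" and "y". The function name, the argument name df, and the choice of Kendall's tau are assumptions made for illustration; they are not part of the package. The function is checked on toy data only, independently of preference_order().

```r
# Hypothetical custom association function (not part of the package).
# Assumes `df` is a data frame with numeric columns "x" (predictor)
# and "y" (response). Returns a single number in [0, 1], where higher
# values indicate a stronger monotonic association.
f_tau2_kendall <- function(df) {
  stats::cor(
    x = df$x,
    y = df$y,
    method = "kendall",
    use = "complete.obs"
  )^2
}

# Toy data to check the function on its own
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 4, 6)
)

toy_association <- f_tau2_kendall(df = toy)
toy_association
```

A function like this could then be supplied through the argument f, in the same way f_auc_glm_binomial is supplied in the examples.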
This function returns a data frame in which the column "predictor" holds the predictor names, ordered by the column "preference", which holds the result of f. This data frame, or the column "predictor" alone, can be used as input for the argument preference_order in collinear(), cor_select(), and vif_select().
Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).
Accepts a character vector of response variables as input for the argument response. When more than one response is provided, the output is a named list of preference data frames.
Usage
preference_order(
  df = NULL,
  response = NULL,
  predictors = NULL,
  f = "auto",
  warn_limit = NULL,
  quiet = FALSE
)
Arguments
- df
(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.
- response
(optional; character string or vector) Name/s of response variable/s in df. Used in target encoding when it names a numeric variable and there are categorical predictors, and to compute preference order. Default: NULL.
- predictors
(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. If argument response is not provided, non-numeric variables are ignored. Default: NULL.
- f
(optional; function) Function to compute preference order. If "auto" (default) or NULL, the output of f_auto() for the given data is used:
  - f_auc_rf(): if response is binomial.
  - f_r2_pearson(): if response and predictors are numeric.
  - f_v(): if response and predictors are categorical.
  - f_v_rf_categorical(): if response is categorical and predictors are numeric or mixed.
  - f_r2_rf(): in all other cases.
Default: "auto".
- warn_limit
(optional; numeric) Preference value (R-squared, AUC, or Cramer's V) over which a warning flagging suspicious predictors is issued. Disabled if NULL. Default: NULL.
- quiet
(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE.
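To illustrate what a warn_limit threshold flags, the sketch below applies a cutoff to a mock preference data frame in base R. The three rows are copied from the example output on this page. The cutoff value of 0.7, the strict ">" comparison, and the object names are assumptions made for illustration, not the package's internal logic.

```r
# Mock preference data frame; values copied from the example output
preference <- data.frame(
  predictor = c("swi_mean", "soil_temperature_max", "swi_max"),
  preference = c(0.74333418, 0.61129999, 0.59335158)
)

# Hypothetical threshold, assumed here to be applied with a strict ">"
warn_limit <- 0.7

# Predictors whose association with the response exceeds the threshold
suspicious <- preference$predictor[preference$preference > warn_limit]
suspicious
```

A predictor flagged this way may be too closely related to the response to be a plausible independent predictor, so it is worth inspecting before it is given top priority.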
Examples
#load the package (provides vi, vi_predictors, and vi_predictors_numeric)
library(collinear)
#subsets to limit example run time
df <- vi[1:1000, ]
predictors <- vi_predictors[1:10]
predictors_numeric <- vi_predictors_numeric[1:10]
#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)
#progress bar
# progressr::handlers(global = TRUE)
#numeric response and predictors
#------------------------------------------------
#selects f automatically depending on data features
#applies f_r2_pearson() to compute correlation between response and predictors
df_preference <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors_numeric,
  f = NULL
)
#>
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#>
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.
#returns data frame ordered by preference
df_preference
#>      response             predictor            f preference
#> 1  vi_numeric              swi_mean f_r2_pearson 0.74333418
#> 2  vi_numeric  soil_temperature_max f_r2_pearson 0.61129999
#> 3  vi_numeric               swi_max f_r2_pearson 0.59335158
#> 4  vi_numeric             swi_range f_r2_pearson 0.41849723
#> 5  vi_numeric               swi_min f_r2_pearson 0.26066257
#> 6  vi_numeric        topo_diversity f_r2_pearson 0.11688815
#> 7  vi_numeric  soil_temperature_min f_r2_pearson 0.08964740
#> 8  vi_numeric            topo_slope f_r2_pearson 0.04638905
#> 9  vi_numeric        topo_elevation f_r2_pearson 0.02936205
#> 10 vi_numeric soil_temperature_mean f_r2_pearson 0.02388867
#several responses
#------------------------------------------------
responses <- c(
  "vi_categorical",
  "vi_counts"
)
preference_list <- preference_order(
  df = df,
  response = responses,
  predictors = predictors
)
#>
#> collinear::preference_order(): ranking predictors for response 'vi_counts'.
#>
#> collinear::f_auto(): selected function: 'f_r2_rf()'.
#>
#> collinear::preference_order(): ranking predictors for response 'vi_categorical'.
#>
#> collinear::f_auto(): selected function: 'f_v_rf_categorical()'.
#returns a named list
names(preference_list)
#> [1] "vi_counts" "vi_categorical"
preference_list[[1]]
#>     response          predictor       f preference
#> 1  vi_counts           swi_mean f_r2_rf 0.84397411
#> 2  vi_counts        koppen_zone f_r2_rf 0.81258327
#> 3  vi_counts koppen_description f_r2_rf 0.79691235
#> 4  vi_counts            swi_max f_r2_rf 0.72791918
#> 5  vi_counts       koppen_group f_r2_rf 0.71508036
#> 6  vi_counts            swi_min f_r2_rf 0.64426396
#> 7  vi_counts          soil_type f_r2_rf 0.63949456
#> 8  vi_counts     topo_elevation f_r2_rf 0.27154187
#> 9  vi_counts     topo_diversity f_r2_rf 0.13091316
#> 10 vi_counts         topo_slope f_r2_rf 0.07353939
preference_list[[2]]
#>          response          predictor                  f preference
#> 1  vi_categorical           swi_mean f_v_rf_categorical  0.5771390
#> 2  vi_categorical        koppen_zone f_v_rf_categorical  0.5568756
#> 3  vi_categorical koppen_description f_v_rf_categorical  0.5499681
#> 4  vi_categorical       koppen_group f_v_rf_categorical  0.5385112
#> 5  vi_categorical            swi_max f_v_rf_categorical  0.5257116
#> 6  vi_categorical          soil_type f_v_rf_categorical  0.4521790
#> 7  vi_categorical            swi_min f_v_rf_categorical  0.4515519
#> 8  vi_categorical     topo_elevation f_v_rf_categorical  0.2705244
#> 9  vi_categorical     topo_diversity f_v_rf_categorical  0.1734628
#> 10 vi_categorical         topo_slope f_v_rf_categorical  0.1578257
#can be used in collinear()
# x <- collinear(
#   df = df,
#   response = responses,
#   predictors = predictors,
#   preference_order = preference_list
# )
#f function selected by user
#for binomial response and numeric predictors
# preference_order(
#   df = vi,
#   response = "vi_binomial",
#   predictors = predictors_numeric,
#   f = f_auc_glm_binomial
# )
#disable parallelization
future::plan(future::sequential)