Quantitative Variable Prioritization for Multicollinearity Filtering
Source: R/preference_order.R (preference_order.Rd)
Ranks a set of predictors by the strength of their association with a response. Aims to minimize the loss of important predictors during multicollinearity filtering.
The strength of association between the response and each predictor is computed by the function f. The available f functions are:
Numeric response vs. numeric predictor:
- f_r2_pearson(): Pearson's R-squared.
- f_r2_spearman(): Spearman's R-squared.
- f_r2_glm_gaussian(): Pearson's R-squared of response versus the predictions of a Gaussian GLM.
- f_r2_glm_gaussian_poly2(): Gaussian GLM with second degree polynomial.
- f_r2_gam_gaussian(): GAM model fitted with mgcv::gam().
- f_r2_rpart(): Recursive Partition Tree fitted with rpart::rpart().
- f_r2_rf(): Random Forest model fitted with ranger::ranger().
Integer counts response vs. numeric predictor:
- f_r2_glm_poisson(): Pearson's R-squared of a Poisson GLM.
- f_r2_glm_poisson_poly2(): Poisson GLM with second degree polynomial.
- f_r2_gam_poisson(): Poisson GAM.
Binomial response (1 and 0) vs. numeric predictor:
- f_auc_glm_binomial(): AUC of quasibinomial GLM with weighted cases.
- f_auc_glm_binomial_poly2(): As above with second degree polynomial.
- f_auc_gam_binomial(): Quasibinomial GAM with weighted cases.
- f_auc_rpart(): Recursive Partition Tree with weighted cases.
- f_auc_rf(): Random Forest model with weighted cases.
Categorical response (character or factor) vs. categorical predictor:
- f_v(): Cramer's V between two categorical variables.
Categorical response vs. categorical or numerical predictor:
- f_v_rf_categorical(): Cramer's V of a Random Forest model.
The name of the function used is stored in the attribute "f_name" of the output data frame. It can be retrieved via attributes(df)$f_name.
Additionally, any custom function accepting a data frame with the columns "x" (predictor) and "y" (response) and returning a numeric indicator of association where higher numbers indicate higher association will work.
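A custom function of this kind might look like the minimal sketch below. The name f_abs_kendall() is hypothetical and not part of the package, and the argument name df is an assumption: the text above only specifies the input columns "x" and "y" and a numeric return value where higher means stronger association.

```r
# Hypothetical custom association function (not part of the package).
# Assumes a data frame with numeric columns "x" (predictor) and "y" (response).
f_abs_kendall <- function(df) {
  # absolute Kendall rank correlation: higher values mean stronger association
  abs(
    stats::cor(
      x = df$x,
      y = df$y,
      method = "kendall",
      use = "complete.obs"
    )
  )
}

# toy data to check the function in isolation
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 4, 10)
)

score <- f_abs_kendall(df = toy)
```

If it behaves as described above, such a function could then be passed to preference_order() through the argument f.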
This function returns a data frame in which the column "predictor" holds the predictor names and the column "preference" holds the result of f. Rows are ordered by "preference". This data frame, or the column "predictor" alone, can be used as input for the argument preference_order in collinear(), cor_select(), and vif_select().
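The sketch below illustrates how the "predictor" column can be pulled out as a plain character vector. It uses a mock data frame rather than a call to preference_order(); the values are copied from the numeric example on this page, and the decreasing order is an assumption based on that example output.

```r
# Mock preference data frame with the columns shown in the examples;
# values copied from the numeric example on this page.
df_preference <- data.frame(
  response = "vi_numeric",
  predictor = c("swi_max", "swi_mean", "swi_range", "soil_temperature_max"),
  f = "f_r2_pearson",
  preference = c(0.59335158, 0.74333418, 0.41849723, 0.61129999),
  stringsAsFactors = FALSE
)

# order rows by decreasing preference (assumed from the example output)
df_preference <- df_preference[
  order(df_preference$preference, decreasing = TRUE),
]

# character vector of predictor names, highest preference first
preference_vector <- df_preference$predictor
```

Either df_preference or preference_vector could then be supplied to the argument preference_order.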
Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).
Accepts a character vector of response variables as input for the argument response. When more than one response is provided, the output is a named list of preference data frames.
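As a rough sketch of working with such a named list, the mock object below stands in for the output of a multi-response call. It keeps only the "predictor" and "preference" columns, with values copied from the examples on this page. In that example output the list names appear in a different order than the input vector, so indexing by name is likely safer than indexing by position.

```r
# Mock named list of preference data frames (reduced to two columns);
# values copied from the multi-response example on this page.
preference_list <- list(
  vi_counts = data.frame(
    predictor = c("swi_mean", "koppen_zone", "koppen_description"),
    preference = c(0.84397411, 0.81258327, 0.79691235),
    stringsAsFactors = FALSE
  ),
  vi_categorical = data.frame(
    predictor = c("swi_mean", "koppen_zone", "koppen_description"),
    preference = c(0.5771390, 0.5568756, 0.5499681),
    stringsAsFactors = FALSE
  )
)

# access one response by name rather than by position
preference_list[["vi_categorical"]]

# top two predictors for each response
top_predictors <- lapply(
  X = preference_list,
  FUN = function(x) head(x$predictor, n = 2)
)
```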
Usage
preference_order(
  df = NULL,
  response = NULL,
  predictors = NULL,
  f = "auto",
  warn_limit = NULL,
  quiet = FALSE
)
Arguments
- df
(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.
- response
(optional; character string or vector) Name/s of response variable/s in df. Used in target encoding when it names a numeric variable and there are categorical predictors, and to compute preference order. Default: NULL.
- predictors
(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. If argument response is not provided, non-numeric variables are ignored. Default: NULL.
- f
(optional; function) Function to compute preference order. If "auto" (default) or NULL, the output of f_auto() for the given data is used:
  - f_auc_rf(): if response is binomial.
  - f_r2_pearson(): if response and predictors are numeric.
  - f_v(): if response and predictors are categorical.
  - f_v_rf_categorical(): if response is categorical and predictors are numeric or mixed.
  - f_r2_rf(): in all other cases.
Default: "auto".
- warn_limit
(optional; numeric) Preference value (R-squared, AUC, or Cramer's V) over which a warning flagging suspicious predictors is issued. Disabled if NULL. Default: NULL.
- quiet
(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE.
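The automatic selection rules listed for the argument f can be read as a short decision chain. The sketch below is only an illustration of those documented rules, not the package's f_auto(): the helper pick_f_name() is hypothetical, and the type checks (for instance, treating a numeric response made only of 0 and 1 as binomial) are simplifying assumptions.

```r
# Hypothetical helper mirroring the documented selection rules.
# Not the package's f_auto(); the type checks are simplifying assumptions.
pick_f_name <- function(y, x) {
  # y: response vector; x: data frame of predictors
  is_categorical <- function(v) is.character(v) || is.factor(v)

  x_numeric <- all(vapply(x, is.numeric, logical(1)))
  x_categorical <- all(vapply(x, is_categorical, logical(1)))

  # binomial: numeric response with values 0 and 1 only
  if (is.numeric(y) && all(stats::na.omit(y) %in% c(0, 1))) {
    return("f_auc_rf")
  }
  # numeric response and numeric predictors
  if (is.numeric(y) && x_numeric) {
    return("f_r2_pearson")
  }
  # categorical response and categorical predictors
  if (is_categorical(y) && x_categorical) {
    return("f_v")
  }
  # categorical response with numeric or mixed predictors
  if (is_categorical(y)) {
    return("f_v_rf_categorical")
  }
  # all other cases
  "f_r2_rf"
}

# numeric response, numeric predictor
pick_f_name(y = c(0.1, 0.5, 0.9), x = data.frame(a = c(1, 2, 3)))

# numeric response, categorical predictor
pick_f_name(y = c(3, 7, 12), x = data.frame(b = c("u", "v", "u")))
```

The binomial check comes first in this sketch so that a 0/1 response is not caught by the general numeric rule.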
Examples
#subsets to limit example run time
df <- vi[1:1000, ]
predictors <- vi_predictors[1:10]
predictors_numeric <- vi_predictors_numeric[1:10]
#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)
#progress bar
# progressr::handlers(global = TRUE)
#numeric response and predictors
#------------------------------------------------
#selects f automatically depending on data features
#applies f_r2_pearson() to compute correlation between response and predictors
df_preference <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors_numeric,
  f = NULL
)
#>
#> collinear::preference_order(): ranking predictors for response 'vi_numeric'.
#>
#> collinear::f_auto(): selected function: 'f_r2_pearson()'.
#returns data frame ordered by preference
df_preference
#> response predictor f preference
#> 1 vi_numeric swi_mean f_r2_pearson 0.74333418
#> 2 vi_numeric soil_temperature_max f_r2_pearson 0.61129999
#> 3 vi_numeric swi_max f_r2_pearson 0.59335158
#> 4 vi_numeric swi_range f_r2_pearson 0.41849723
#> 5 vi_numeric swi_min f_r2_pearson 0.26066257
#> 6 vi_numeric topo_diversity f_r2_pearson 0.11688815
#> 7 vi_numeric soil_temperature_min f_r2_pearson 0.08964740
#> 8 vi_numeric topo_slope f_r2_pearson 0.04638905
#> 9 vi_numeric topo_elevation f_r2_pearson 0.02936205
#> 10 vi_numeric soil_temperature_mean f_r2_pearson 0.02388867
#several responses
#------------------------------------------------
responses <- c(
  "vi_categorical",
  "vi_counts"
)
preference_list <- preference_order(
  df = df,
  response = responses,
  predictors = predictors
)
#>
#> collinear::preference_order(): ranking predictors for response 'vi_counts'.
#>
#> collinear::f_auto(): selected function: 'f_r2_rf()'.
#>
#> collinear::preference_order(): ranking predictors for response 'vi_categorical'.
#>
#> collinear::f_auto(): selected function: 'f_v_rf_categorical()'.
#returns a named list
names(preference_list)
#> [1] "vi_counts" "vi_categorical"
preference_list[[1]]
#> response predictor f preference
#> 1 vi_counts swi_mean f_r2_rf 0.84397411
#> 2 vi_counts koppen_zone f_r2_rf 0.81258327
#> 3 vi_counts koppen_description f_r2_rf 0.79691235
#> 4 vi_counts swi_max f_r2_rf 0.72791918
#> 5 vi_counts koppen_group f_r2_rf 0.71508036
#> 6 vi_counts swi_min f_r2_rf 0.64426396
#> 7 vi_counts soil_type f_r2_rf 0.63949456
#> 8 vi_counts topo_elevation f_r2_rf 0.27154187
#> 9 vi_counts topo_diversity f_r2_rf 0.13091316
#> 10 vi_counts topo_slope f_r2_rf 0.07353939
preference_list[[2]]
#> response predictor f preference
#> 1 vi_categorical swi_mean f_v_rf_categorical 0.5771390
#> 2 vi_categorical koppen_zone f_v_rf_categorical 0.5568756
#> 3 vi_categorical koppen_description f_v_rf_categorical 0.5499681
#> 4 vi_categorical koppen_group f_v_rf_categorical 0.5385112
#> 5 vi_categorical swi_max f_v_rf_categorical 0.5257116
#> 6 vi_categorical soil_type f_v_rf_categorical 0.4521790
#> 7 vi_categorical swi_min f_v_rf_categorical 0.4515519
#> 8 vi_categorical topo_elevation f_v_rf_categorical 0.2705244
#> 9 vi_categorical topo_diversity f_v_rf_categorical 0.1734628
#> 10 vi_categorical topo_slope f_v_rf_categorical 0.1578257
#can be used in collinear()
# x <- collinear(
#   df = df,
#   response = responses,
#   predictors = predictors,
#   preference_order = preference_list
# )
#f function selected by user
#for binomial response and numeric predictors
# preference_order(
#   df = vi,
#   response = "vi_binomial",
#   predictors = predictors_numeric,
#   f = f_auc_glm_binomial
# )
#disable parallelization
future::plan(future::sequential)