
Applies a recursive forward selection algorithm to select predictors with a bivariate correlation with any other predictor lower than a threshold defined by the argument max_cor.

If the argument response is provided, all non-numeric variables in predictors are transformed into numeric using target encoding (see target_encoding_lab()). Otherwise, non-numeric variables are ignored.
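To make the idea of target encoding concrete, here is a minimal base R sketch of what a "mean" encoding generally does: each level of a categorical predictor is replaced with the mean of the response within that level. The helper encode_mean() and the toy data are hypothetical and only illustrative; this is not the implementation in target_encoding_lab().

```r
# Hypothetical helper, NOT part of the package:
# replaces each level of a categorical predictor
# with the mean of the response within that level
encode_mean <- function(df, response, predictor) {
  group_means <- tapply(df[[response]], df[[predictor]], mean)
  as.numeric(group_means[as.character(df[[predictor]])])
}

# toy data (made up for this illustration)
toy <- data.frame(
  y = c(1, 3, 10, 20),
  soil = c("a", "a", "b", "b")
)

toy$soil_encoded <- encode_mean(
  df = toy,
  response = "y",
  predictor = "soil"
)

toy$soil_encoded
#> [1]  2  2 15 15
```

Once encoded this way, a categorical predictor can enter a numeric correlation matrix like any other column.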

The argument preference_order allows defining a preference selection order to preserve (when possible) variables that might be interesting or even required for a given analysis. If NULL, predictors are ordered from lower to higher sum of their absolute pairwise correlation with the other predictors.

For example, if predictors is c("a", "b", "c") and preference_order is c("a", "b"), there are two possibilities:

  • If the correlation between "a" and "b" is below max_cor, both variables are selected.

  • If their correlation is equal to or above max_cor, then "a" is selected, no matter its correlation with "c".

If preference_order is not provided, the predictors are ranked from lower to higher sum of their absolute pairwise correlations with the other predictors, as described above.
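The selection logic can be sketched in a few lines of base R. The function cor_select_sketch() below is a hypothetical illustration written for this page, not the package source. It assumes numeric predictors only, Pearson correlation, and that predictors missing from preference_order are appended after the preferred ones, ranked from lower to higher sum of absolute pairwise correlations.

```r
# Hypothetical sketch, NOT the package implementation
cor_select_sketch <- function(df, preference_order = NULL, max_cor = 0.75) {

  # absolute pairwise correlations, self-correlations set to zero
  cor_abs <- abs(stats::cor(df, method = "pearson"))
  diag(cor_abs) <- 0

  # automatic ranking: lower to higher sum of absolute correlations
  auto_order <- names(sort(colSums(cor_abs)))

  # preferred predictors first, the remaining ones in automatic order
  candidates <- unique(c(preference_order, auto_order))

  # forward selection: a candidate is kept only if its correlation
  # with every already selected predictor is below max_cor
  selected <- candidates[1]
  for (candidate in candidates[-1]) {
    if (max(cor_abs[candidate, selected]) < max_cor) {
      selected <- c(selected, candidate)
    }
  }

  selected
}

# toy data (made up): "a" and "b" are almost perfectly correlated
toy_df <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 4, 6, 8, 10, 13),
  c = c(3, 1, 4, 1, 5, 2)
)

#"a" is preferred over "b", so "b" is dropped
cor_select_sketch(
  df = toy_df,
  preference_order = c("a", "b")
)
#> [1] "a" "c"

#without preference_order the automatic ranking decides
cor_select_sketch(df = toy_df)
#> [1] "c" "b"
```

In the first call "a" is kept because it comes first in preference_order, even though "b" would be an equally valid choice. In the second call the automatic ranking places "b" ahead of "a", so "a" is dropped instead.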

Usage

cor_select(
  df = NULL,
  response = NULL,
  predictors = NULL,
  preference_order = NULL,
  cor_method = "pearson",
  max_cor = 0.75,
  encoding_method = "mean"
)

Arguments

df

(required; data frame) A data frame with numeric and/or character predictors, and optionally, a response variable. Default: NULL.

response

(recommended; character string) Name of a numeric response variable. Character response variables are ignored. Please see 'Details' to better understand how providing this argument or not leads to different results when there are character variables in 'predictors'. Default: NULL.

predictors

(optional; character vector) Character vector with predictor names in 'df'. If omitted, all columns of 'df' are used as predictors. Default: NULL.

preference_order

(optional; character vector) Vector with column names in 'predictors' in the desired preference order, or result of the function preference_order(). Allows defining a priority order for selecting predictors, which can be particularly useful when some predictors are more critical for the analysis than others. Default: NULL (predictors ordered from lower to higher sum of absolute correlation with the other predictors).

cor_method

(optional; character string) Method used to compute pairwise correlations. Accepted methods are "pearson" (with a recommended minimum of 30 rows in 'df') or "spearman" (with a recommended minimum of 10 rows in 'df'). Default: "pearson".
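As a short illustration of how the two methods differ, the base R snippet below compares them on a monotonic but non-linear relationship using stats::cor(). Whether cor_select() passes cor_method to stats::cor() internally is an assumption here; the toy vectors are made up.

```r
# toy vectors (made up): y grows monotonically but not linearly with x
x <- 1:10
y <- x^3

#spearman works on ranks, so a monotonic relationship yields 1
stats::cor(x, y, method = "spearman")
#> [1] 1

#pearson measures linear association, so it stays below 1 here
stats::cor(x, y, method = "pearson") < 1
#> [1] TRUE
```

With the same max_cor, the two methods can therefore flag different pairs of predictors as too correlated.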

max_cor

(optional; numeric) Maximum correlation allowed between any pair of predictors. Higher values return a larger number of predictors with higher multicollinearity. Default: 0.75.

encoding_method

(optional; character string) Name of the target encoding method to convert character and factor predictors to numeric. One of "mean", "rank", "loo", "rnorm" (see target_encoding_lab() for further details). Default: "mean".

Value

Character vector with the names of the selected predictors.

Author

Blas M. Benito

Examples


data(
  vi,
  vi_predictors
)

#subset to limit example run time
vi <- vi[1:1000, ]
vi_predictors <- vi_predictors[1:10]

#without response
#without preference_order
#permissive max_cor
selected.predictors <- cor_select(
  df = vi,
  predictors = vi_predictors,
  max_cor = 0.8
)

selected.predictors
#> [1] "topo_elevation" "topo_slope"     "topo_diversity" "soil_type"     
#> [5] "swi_min"        "swi_max"       

#without response
#without preference_order
#restrictive max_cor
selected.predictors <- cor_select(
  df = vi,
  predictors = vi_predictors,
  max_cor = 0.5
)

selected.predictors
#> [1] "topo_elevation" "topo_slope"     "soil_type"     

#with response
#without preference_order
#restrictive max_cor
#slightly different solution than the previous one
#because here target encoding is done against the response,
#while before it was done pairwise against each numeric predictor
selected.predictors <- cor_select(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  max_cor = 0.5
)

selected.predictors
#> [1] "topo_elevation" "topo_slope"     "swi_min"        "soil_type"     

#with response
#with user-defined preference_order
#restrictive max_cor
#numerics and categorical variables in output
selected.predictors <- cor_select(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  preference_order = c(
    "soil_type", #categorical variable
    "soil_temperature_mean",
    "swi_mean",
    "rainfall_mean",
    "evapotranspiration_mean"
  ),
  max_cor = 0.5
)

selected.predictors
#> [1] "soil_type"      "topo_elevation" "topo_slope"     "swi_min"       


#with response
#with automated preference_order
#restrictive max_cor
#numerics and categorical variables in output
preference.order <- preference_order(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  f = f_rsquared #cor(response, predictor)
)

head(preference.order)
#>            predictor preference
#> 1        koppen_zone  0.8137065
#> 2 koppen_description  0.7984011
#> 3           swi_mean  0.7433342
#> 4       koppen_group  0.7150807
#> 5          soil_type  0.6465576
#> 6            swi_max  0.5933516

selected.predictors <- cor_select(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  preference_order = preference.order,
  max_cor = 0.5
)

selected.predictors
#> [1] "koppen_zone"    "topo_diversity" "topo_elevation"