
This function ranks predictors by the strength of their relationship with a response, as measured by a user-provided function that takes a predictor name, a response name, and a data frame as arguments.

Usage

preference_order(
  df = NULL,
  response = NULL,
  predictors = NULL,
  f = f_rsquared,
  encoding_method = "mean",
  workers = 1
)

Arguments

df

(required; data frame) A data frame with numeric and/or character predictors and, optionally, a response variable. Default: NULL.

response

(required; character string) Name of a numeric response variable. Character response variables are ignored. See 'Details' for how providing or omitting this argument changes the results when 'predictors' contains character variables. Default: NULL.

predictors

(optional; character vector) Names of the predictors in 'df'. If omitted, all columns of 'df' are used as predictors. Default: NULL.

f

(optional; function) A function that returns a single value representing the strength of the relationship between a given predictor and the response. Predictors with higher values are ranked higher. The available options are:

  • f_rsquared() (default option): returns the R-squared of the correlation between a numeric response and a numeric predictor.

  • f_gam_deviance(): fits a univariate GAM between a numeric response and a numeric predictor to return the explained deviance. Requires the package mgcv.

  • f_rf_rsquared() also named f_rf_deviance(): fits a univariate random forest model with ranger::ranger() between a numeric response and a numeric predictor to return the R-squared of the observations against the out-of-bag predictions. Requires the package ranger.

  • f_logistic_auc_balanced(): fits a logistic univariate GLM of a balanced binary response (0s and 1s) against a numeric predictor to return the Area Under the Curve of the observations against the model predictions.

  • f_logistic_auc_unbalanced(): fits a quasibinomial univariate GLM with weighted cases of an unbalanced binary response (0s and 1s) against a numeric predictor to return the Area Under the Curve of the observations against the model predictions.

  • f_gam_auc_balanced(): fits a logistic univariate GAM of a balanced binary response (0s and 1s) against a numeric predictor to return the Area Under the Curve of the observations against the model predictions.

  • f_gam_auc_unbalanced(): fits a quasibinomial univariate GAM with weighted cases of an unbalanced binary response (0s and 1s) against a numeric predictor to return the Area Under the Curve of the observations against the model predictions.

  • f_rf_auc_balanced(): fits a random forest model of a balanced binary response (0s and 1s) against a numeric predictor to return the Area Under the Curve of the observations against the model predictions.

  • f_rf_auc_unbalanced(): fits a random forest model with weighted cases of an unbalanced binary response (0s and 1s) against a numeric predictor to return the Area Under the Curve of the observations against the model predictions.
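The binary-response options in the list above follow a common pattern: fit a univariate model, then score it with the Area Under the Curve. The sketch below illustrates that pattern in base R. It is not the package's implementation; the names auc_sketch() and f_logistic_auc_sketch() are hypothetical, and the AUC is computed from the Mann-Whitney rank statistic as an assumed stand-in for whatever the package uses internally.

```r
# Hypothetical helper: AUC from the Mann-Whitney rank statistic.
# observed: vector of 0s and 1s; predicted: numeric scores.
auc_sketch <- function(observed, predicted) {
  r <- rank(predicted)
  n1 <- sum(observed == 1)
  n0 <- sum(observed == 0)
  (sum(r[observed == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical scoring function with the (x, y, df) signature:
# x and y are column names, df is a data frame.
f_logistic_auc_sketch <- function(x, y, df) {
  xy <- stats::na.omit(df[, c(x, y)])
  names(xy) <- c("x", "y")
  # univariate logistic GLM of the binary response against the predictor
  m <- stats::glm(y ~ x, data = xy, family = stats::binomial(link = "logit"))
  auc_sketch(
    observed = xy$y,
    predicted = stats::predict(m, type = "response")
  )
}

# toy data (simulated for illustration only)
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(2 * toy$x))

auc_value <- f_logistic_auc_sketch(x = "x", y = "y", df = toy)
```

A function shaped like this returns a single number where higher means a stronger relationship, which is what the argument f expects.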

encoding_method

(optional; character string) Name of the target encoding method used to convert character and factor predictors to numeric. One of "mean", "rank", "loo", "rnorm" (see target_encoding_lab() for further details). Default: "mean".
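To give an idea of what mean target encoding does, the base R sketch below replaces each category with the mean of the response within that category. It is a simplified illustration on made-up data, not the implementation in target_encoding_lab(), which may handle details such as missing values or column naming differently.

```r
# toy data (made up for illustration)
toy <- data.frame(
  y = c(1, 3, 10, 20, 30),
  group = c("a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# mean encoding: each category becomes the mean of the response
# computed over the rows that share that category
toy$group_encoded <- stats::ave(toy$y, toy$group, FUN = mean)

toy
```

After encoding, the categorical predictor is numeric and can be scored against the response by the same function f as any other predictor.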

workers

(optional; integer) Number of workers for parallel execution. Default: 1.

Value

A data frame with the columns "predictor" and "preference". The former contains the predictor names ordered from highest to lowest preference, ready for the argument preference_order in cor_select(), vif_select() and collinear(). The latter contains the result of the function f for each combination of predictor and response.
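The shape of this output can be reproduced with a few lines of base R. The sketch below is only an approximation of the idea (apply f to every predictor, then sort in decreasing order); the function rank_predictors_sketch() and the scoring function f_cor2() are hypothetical, the column names follow the printed example output, and the sketch omits target encoding and parallel execution.

```r
# Hypothetical sketch: score each predictor with f, then sort.
rank_predictors_sketch <- function(df, response, predictors, f) {
  values <- vapply(
    X = predictors,
    FUN = function(p) f(x = p, y = response, df = df),
    FUN.VALUE = numeric(1)
  )
  out <- data.frame(
    predictor = predictors,
    preference = unname(values),
    stringsAsFactors = FALSE
  )
  # higher values first
  out <- out[order(out$preference, decreasing = TRUE), ]
  rownames(out) <- NULL
  out
}

# hypothetical scoring function: squared correlation
f_cor2 <- function(x, y, df) {
  stats::cor(df[[x]], df[[y]])^2
}

# toy data (made up for illustration)
toy <- data.frame(
  y = c(1, 2, 3, 4, 5),
  a = c(1, 2, 3, 4, 6),
  b = c(5, 1, 4, 2, 3)
)

ranking <- rank_predictors_sketch(
  df = toy,
  response = "y",
  predictors = c("b", "a"),
  f = f_cor2
)
```

Here "a" is ranked above "b" because its squared correlation with "y" is higher, regardless of the order in which the predictors were supplied.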

Author

Blas M. Benito

Examples


data(
  vi,
  vi_predictors
)

#subset to limit example run time
vi <- vi[1:1000, ]

#computing preference order
#with response
#numeric and categorical predictors in the output
#as the R-squared between each predictor and the response
preference.order <- preference_order(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  f = f_rsquared,
  workers = 1
  )

preference.order
#>                     predictor   preference
#> 1            biogeo_ecoregion 9.269691e-01
#> 2                 koppen_zone 8.137065e-01
#> 3       growing_season_length 8.112105e-01
#> 4          koppen_description 7.984011e-01
#> 5                     soil_ph 7.531807e-01
#> 6                    swi_mean 7.433342e-01
#> 7               humidity_mean 7.367305e-01
#> 8                koppen_group 7.150807e-01
#> 9                country_name 6.852857e-01
#> 10               biogeo_biome 6.770510e-01
#> 11           cloud_cover_mean 6.479913e-01
#> 12                  soil_type 6.465576e-01
#> 13               humidity_max 6.237387e-01
#> 14              rainfall_mean 6.157360e-01
#> 15       soil_temperature_max 6.113000e-01
#> 16            cloud_cover_max 6.007134e-01
#> 17                    swi_max 5.933516e-01
#> 18               humidity_min 5.906650e-01
#> 19    growing_season_rainfall 5.891538e-01
#> 20     soil_temperature_range 5.411904e-01
#> 21               biogeo_realm 5.240342e-01
#> 22     evapotranspiration_max 5.165803e-01
#> 23               rainfall_max 5.088829e-01
#> 24              aridity_index 5.083582e-01
#> 25              solar_rad_max 5.068879e-01
#> 26                  subregion 4.548328e-01
#> 27            cloud_cover_min 4.188023e-01
#> 28                  swi_range 4.184972e-01
#> 29   evapotranspiration_range 4.034224e-01
#> 30             rainfall_range 3.843933e-01
#> 31          temperature_range 3.796879e-01
#> 32                    swi_min 2.606626e-01
#> 33               rainfall_min 2.579713e-01
#> 34    temperature_seasonality 2.491227e-01
#> 35             solar_rad_mean 2.432788e-01
#> 36              soil_nitrogen 2.315537e-01
#> 37            temperature_max 2.102187e-01
#> 38                  continent 1.799175e-01
#> 39    evapotranspiration_mean 1.693217e-01
#> 40                   soil_soc 1.601721e-01
#> 41                     region 1.488327e-01
#> 42          cloud_cover_range 1.434921e-01
#> 43            solar_rad_range 1.232068e-01
#> 44             topo_diversity 1.168882e-01
#> 45            temperature_min 1.106311e-01
#> 46       soil_temperature_min 8.964740e-02
#> 47                  soil_clay 7.751427e-02
#> 48             humidity_range 6.741354e-02
#> 49             country_income 5.641097e-02
#> 50                 topo_slope 4.638905e-02
#> 51                  soil_sand 4.324903e-02
#> 52             topo_elevation 2.936205e-02
#> 53      soil_temperature_mean 2.388867e-02
#> 54 growing_season_temperature 1.279096e-02
#> 55                  soil_silt 7.505508e-03
#> 56         country_population 4.045504e-03
#> 57           temperature_mean 3.061162e-03
#> 58                country_gdp 2.931651e-03
#> 59              solar_rad_min 1.711591e-03
#> 60        growing_degree_days 1.480086e-03
#> 61     evapotranspiration_min 2.170934e-05

#using it in variable selection with cor_select()
selected.predictors <- cor_select(
  df = vi,
  response = "vi_mean", #don't forget the response!
  predictors = vi_predictors,
  preference_order = preference.order,
  max_cor = 0.75
  )

selected.predictors
#>  [1] "biogeo_ecoregion"       "soil_temperature_range" "evapotranspiration_max"
#>  [4] "rainfall_max"           "solar_rad_max"          "subregion"             
#>  [7] "cloud_cover_min"        "swi_range"              "swi_min"               
#> [10] "rainfall_min"           "solar_rad_mean"         "soil_nitrogen"         
#> [13] "continent"              "soil_soc"               "cloud_cover_range"     
#> [16] "solar_rad_range"        "topo_diversity"         "soil_clay"             
#> [19] "humidity_range"         "country_income"         "topo_slope"            
#> [22] "soil_sand"              "topo_elevation"         "country_population"    
#> [25] "temperature_mean"       "country_gdp"           

#check their correlations
selected.predictors.cor <- cor_df(
  df = vi,
  response = "vi_mean",
  predictors = selected.predictors
)

#all absolute correlations are at or below max_cor
selected.predictors.cor
#> # A tibble: 325 × 3
#>    x                      y                      correlation
#>    <chr>                  <chr>                        <dbl>
#>  1 temperature_mean       solar_rad_range             -0.75 
#>  2 solar_rad_range        soil_temperature_range       0.742
#>  3 soil_temperature_range biogeo_ecoregion            -0.739
#>  4 solar_rad_mean         swi_min                     -0.737
#>  5 soil_nitrogen          solar_rad_mean              -0.736
#>  6 cloud_cover_min        solar_rad_max               -0.736
#>  7 cloud_cover_min        evapotranspiration_max      -0.726
#>  8 soil_soc               soil_nitrogen                0.722
#>  9 rainfall_max           biogeo_ecoregion             0.718
#> 10 solar_rad_max          biogeo_ecoregion            -0.717
#> # ℹ 315 more rows

#USING A CUSTOM FUNCTION
#custom function to compute RMSE between a predictor and a response
#x is a predictor name
#y is a response name
#df is a data frame with multiple predictors and one response
#must return a single number, with higher number indicating higher preference
#notice we use "one minus RMSE" to give higher rank to variables with lower RMSE
f_rmse <- function(x, y, df){

  xy <- df[, c(x, y)] |>
    na.omit() |>
    scale()

  1 - sqrt(mean((xy[, 1] - xy[, 2])^2))

}

preference.order <- preference_order(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors,
  f = f_rmse,
  workers = 1
)

preference.order
#>                     predictor   preference
#> 1            biogeo_ecoregion  0.727344686
#> 2                 koppen_zone  0.557630869
#> 3       growing_season_length  0.554515080
#> 4          koppen_description  0.538782893
#> 5                    swi_mean  0.475226020
#> 6               humidity_mean  0.467969373
#> 7                koppen_group  0.444624151
#> 8                country_name  0.413471311
#> 9                biogeo_biome  0.405034875
#> 10           cloud_cover_mean  0.375779034
#> 11                  soil_type  0.374354695
#> 12               humidity_max  0.351897438
#> 13              rainfall_mean  0.344109445
#> 14            cloud_cover_max  0.329600086
#> 15                    swi_max  0.322538389
#> 16               humidity_min  0.319968847
#> 17    growing_season_rainfall  0.318525089
#> 18               biogeo_realm  0.257272814
#> 19               rainfall_max  0.243226452
#> 20              aridity_index  0.242741004
#> 21                  subregion  0.193450001
#> 22            cloud_cover_min  0.160359850
#> 23                  swi_range  0.160079415
#> 24             rainfall_range  0.128650029
#> 25                    swi_min  0.011102318
#> 26               rainfall_min  0.008436478
#> 27              soil_nitrogen -0.018116834
#> 28                  continent -0.072620497
#> 29                   soil_soc -0.094701026
#> 30                     region -0.107788271
#> 31          cloud_cover_range -0.114069373
#> 32             topo_diversity -0.146693329
#> 33            temperature_min -0.154746778
#> 34       soil_temperature_min -0.183121000
#> 35                  soil_clay -0.200720251
#> 36             country_income -0.234283259
#> 37                 topo_slope -0.252065667
#> 38 growing_season_temperature -0.331176895
#> 39                  soil_silt -0.350890299
#> 40         country_population -0.367813225
#> 41           temperature_mean -0.373846834
#> 42                country_gdp -0.374704749
#> 43        growing_degree_days -0.386049487
#> 44     evapotranspiration_min -0.410209439
#> 45              solar_rad_min -0.442449313
#> 46      soil_temperature_mean -0.518818650
#> 47             topo_elevation -0.529824980
#> 48                  soil_sand -0.553548230
#> 49             humidity_range -0.586430925
#> 50            solar_rad_range -0.642959090
#> 51    evapotranspiration_mean -0.679330603
#> 52            temperature_max -0.707066272
#> 53             solar_rad_mean -0.727275193
#> 54    temperature_seasonality -0.730677787
#> 55          temperature_range -0.796981943
#> 56   evapotranspiration_range -0.807495658
#> 57              solar_rad_max -0.849458633
#> 58     evapotranspiration_max -0.853114399
#> 59     soil_temperature_range -0.862214159
#> 60       soil_temperature_max -0.886835925
#> 61                    soil_ph -0.931834336