Target encoding replaces the values of a categorical variable with numeric values derived from a "target variable", usually a model's response. Target encoding can improve the performance of machine learning models.

This function identifies the categorical variables in the input data frame, transforms them using the target-encoding methods selected by the user, and returns the input data frame with the newly encoded variables.

The target encoding methods implemented in this function are:

  • "rank": Returns the rank of the group as an integer, starting with 1 as the rank of the group with the lowest mean of the response variable. The variables returned by this method are named with the suffix "__encoded_rank". This method is implemented in the function target_encoding_rank().

  • "mean": Replaces each value of the categorical variable with the mean of the response across the category the given value belongs to. This option accepts the argument "white_noise" to limit potential overfitting. The variables returned by this method are named with the suffix "__encoded_mean". This method is implemented in the function target_encoding_mean().

  • "rnorm": Computes the mean and standard deviation of the response for each group of the categorical variable, and uses rnorm() to generate random values from a normal distribution with these parameters. The argument rnorm_sd_multiplier is used as a multiplier of the standard deviation to control the range of values produced by rnorm() for each group of the categorical predictor. The variables returned by this method are named with the suffix "__encoded_rnorm". This method is implemented in the function target_encoding_rnorm().

  • "loo": The leave-one-out method, which replaces each categorical value with the mean of the response variable across the other cases within the same group. This method supports the white_noise argument to limit potential overfitting. The variables returned by this method are named with the suffix "__encoded_loo". This method is implemented in the function target_encoding_loo().
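The "mean", "rank", and "loo" encodings described above can be sketched in base R. This is only an illustration on a hypothetical toy data frame (the names toy, group, and y are assumptions), not the code used by target_encoding_mean(), target_encoding_rank(), or target_encoding_loo().

```r
# Hypothetical toy data: one categorical predictor and a numeric response
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "c", "c", "c"),
  y = c(1, 2, 3, 10, 12, 4, 5, 6),
  stringsAsFactors = FALSE
)

# "mean": every case receives the mean of the response within its group
toy$group__encoded_mean <- stats::ave(toy$y, toy$group, FUN = mean)

# "rank": groups are ordered by their mean response; lowest mean gets rank 1
group_means <- tapply(toy$y, toy$group, mean)
group_ranks <- rank(group_means, ties.method = "first")
toy$group__encoded_rank <- as.integer(group_ranks[toy$group])

# "loo": mean of the response across the OTHER cases of the same group
# (groups with a single case would yield NaN in this simple sketch)
group_sum <- stats::ave(toy$y, toy$group, FUN = sum)
group_n <- stats::ave(toy$y, toy$group, FUN = length)
toy$group__encoded_loo <- (group_sum - toy$y) / (group_n - 1)

print(toy)
```

Note that the leave-one-out values differ between cases of the same group, because each case is excluded from its own group mean.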

The methods "mean", "rank", and "loo" support the white_noise argument, a fraction of the range of the response variable that sets the maximum amount of white noise to be added. For example, if the response ranges between 0 and 1, a white_noise of 0.25 adds to every value of the encoded variable a random number selected from a normal distribution between -0.25 and 0.25. This argument helps control potential overfitting induced by the encoded variable.
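The effect of white_noise can be illustrated with a rough base R sketch. The vectors y and encoded are hypothetical, and the sketch draws the noise with stats::runif() purely as an assumption, because a uniform draw respects the stated bounds; the sampling scheme used inside the function may differ.

```r
set.seed(1)

# Hypothetical response in the range 0-1 and a hypothetical encoded variable
y <- c(0.1, 0.4, 0.9, 0.2, 0.7, 1.0, 0.0, 0.5)
encoded <- c(0.3, 0.3, 0.8, 0.3, 0.8, 0.8, 0.3, 0.8)

# white_noise is a fraction of the range of the response
white_noise <- 0.25
max_noise <- white_noise * diff(range(y))

# Assumption: noise drawn uniformly between -max_noise and max_noise
noise <- stats::runif(
  n = length(encoded),
  min = -max_noise,
  max = max_noise
)

encoded_noisy <- encoded + noise
```

Cases that shared the same encoded value now receive slightly different values, which makes it harder for a model to memorize the groups.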

The method "rnorm" has the argument rnorm_sd_multiplier, which multiplies the standard deviation passed to stats::rnorm() to control the spread of the encoded values within each group. Values smaller than 1 reduce the spread in the results, while values larger than 1 have the opposite effect.
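The role of rnorm_sd_multiplier can be sketched in base R as follows. The toy data and the column names are hypothetical, and the code is a minimal illustration of the idea rather than the implementation of target_encoding_rnorm().

```r
set.seed(1)

# Hypothetical toy data: two groups with different mean responses
toy <- data.frame(
  group = rep(c("a", "b"), each = 50),
  y = c(stats::rnorm(50, mean = 2, sd = 1), stats::rnorm(50, mean = 10, sd = 1)),
  stringsAsFactors = FALSE
)

# Mean and standard deviation of the response per group, one value per case
group_mean <- stats::ave(toy$y, toy$group, FUN = mean)
group_sd <- stats::ave(toy$y, toy$group, FUN = stats::sd)

# A multiplier below 1 shrinks the spread of the encoded values in each group
rnorm_sd_multiplier <- 0.1
toy$group__encoded_rnorm <- stats::rnorm(
  n = nrow(toy),
  mean = group_mean,
  sd = group_sd * rnorm_sd_multiplier
)

# With a multiplier of 0 the draw collapses to the group mean
encoded_zero <- stats::rnorm(n = nrow(toy), mean = group_mean, sd = group_sd * 0)
```

In this sketch a multiplier of 0 is equivalent to the "mean" encoding, while larger multipliers add more within-group variability.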

Usage

target_encoding_lab(
  df = NULL,
  response = NULL,
  predictors = NULL,
  encoding_methods = c("mean", "rank", "loo", "rnorm"),
  smoothing = 0,
  rnorm_sd_multiplier = 0,
  seed = 1,
  white_noise = 0,
  replace = FALSE,
  verbose = TRUE
)

Arguments

df

(required; data frame, tibble, or sf) A training data frame. Default: NULL

response

(required; character string) Name of the response. Must be a column name of df. Default: NULL

predictors

(required; character vector) Names of all the predictors in df. Only character and factor predictors are processed, but all are returned in the output data frame. Default: NULL

encoding_methods

(optional; character string or vector) Names of the target encoding methods to apply. Default: c("mean", "rank", "loo", "rnorm")

smoothing

(optional; numeric) Argument of target_encoding_mean() (method "mean_smoothing"). Minimum group size that keeps the mean of the group. Groups smaller than this have their means pulled towards the global mean of the response. Default: 0

rnorm_sd_multiplier

(optional; numeric) Multiplier of the standard deviation of each group in the categorical variable, in the range 0-1. Controls the variability in the encoded variables to mitigate potential overfitting. Default: 0

seed

(optional; integer) Random seed to facilitate reproducibility when white_noise is not 0. Default: 1

white_noise

(optional; numeric) Numeric with white noise values in the range 0-1, representing a fraction of the range of the response to be added as noise to the encoded variable. Controls the variability in the encoded variables to mitigate potential overfitting. Default: 0.

replace

(optional; logical) If TRUE, the function replaces each categorical variable with its encoded version, and returns the input data frame with the encoded variables instead of the original ones. Default: FALSE

verbose

(optional; logical) If TRUE, messages generated during the execution of the function are printed to the console. Default: TRUE

Value

If replace is FALSE, the input data frame with the newly encoded columns added. If replace is TRUE, the input data frame with each categorical variable replaced by its encoded version.

References

  • Micci-Barreca, D. (2001) A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32 doi:10.1145/507533.507538

Author

Blas M. Benito

Examples


data(
  vi,
  vi_predictors
  )

#subset to limit example run time
vi <- vi[1:1000, ]

#applying all methods for a continuous response
df <- target_encoding_lab(
  df = vi,
  response = "vi_mean",
  predictors = "koppen_zone",
  encoding_methods = c(
    "mean",
    "rank",
    "rnorm",
    "loo"
  ),
  rnorm_sd_multiplier = c(0, 0.1, 0.2),
  white_noise = c(0, 0.1, 0.2)
)
#> 
#> Encoding the predictor: koppen_zone
#> New encoded predictor: 'koppen_zone__encoded_rank'
#> New encoded predictor: 'koppen_zone__encoded_mean'
#> New encoded predictor: 'koppen_zone__encoded_loo'
#> New encoded predictor: 'koppen_zone__encoded_rank__noise_0.1'
#> New encoded predictor: 'koppen_zone__encoded_mean__noise_0.1'
#> New encoded predictor: 'koppen_zone__encoded_loo__noise_0.1'
#> New encoded predictor: 'koppen_zone__encoded_rank__noise_0.2'
#> New encoded predictor: 'koppen_zone__encoded_mean__noise_0.2'
#> New encoded predictor: 'koppen_zone__encoded_loo__noise_0.2'
#> New encoded predictor: 'koppen_zone__encoded_rnorm'
#> New encoded predictor: 'koppen_zone__encoded_rnorm__sd_multiplier_0.1'
#> New encoded predictor: 'koppen_zone__encoded_rnorm__sd_multiplier_0.2'

#identify encoded predictors
#fixed = TRUE matches the literal suffix marker rather than a regular expression
predictors.encoded <- grep(
  pattern = "__encoded_",
  x = colnames(df),
  value = TRUE,
  fixed = TRUE
)

#correlation between encoded predictors and the response
stats::cor(
  x = df[["vi_mean"]],
  y = df[, predictors.encoded],
  use = "pairwise.complete.obs"
)
#>      koppen_zone__encoded_rank koppen_zone__encoded_mean
#> [1,]                 0.8859924                 0.9020568
#>      koppen_zone__encoded_loo koppen_zone__encoded_rank__noise_0.1
#> [1,]                0.8964646                            0.8860717
#>      koppen_zone__encoded_mean__noise_0.1 koppen_zone__encoded_loo__noise_0.1
#> [1,]                            0.8828402                           0.8774811
#>      koppen_zone__encoded_rank__noise_0.2 koppen_zone__encoded_mean__noise_0.2
#> [1,]                            0.8861208                            0.8310055
#>      koppen_zone__encoded_loo__noise_0.2 koppen_zone__encoded_rnorm
#> [1,]                           0.8260699                  0.9019647
#>      koppen_zone__encoded_rnorm__sd_multiplier_0.1
#> [1,]                                     0.9002496
#>      koppen_zone__encoded_rnorm__sd_multiplier_0.2
#> [1,]                                     0.8962355