
Methods to apply target-encoding to individual categorical variables. The functions implemented are:

  • target_encoding_mean(): Each group is identified by the mean of the response over the group cases. The argument smoothing pushes the mean of small groups towards the global mean to avoid overfitting. White noise can be added via the white_noise argument. Columns encoded with this function are identified by the suffix "__encoded_mean". If white_noise is used, then the amount of white noise is also added to the suffix.

  • target_encoding_rank(): Each group is identified by the rank of the mean of the response variable over the group cases. The group with the lowest mean receives the rank 1. White noise can be added via the white_noise argument. Columns encoded with this function are identified by the suffix "__encoded_rank". If white_noise is used, then the amount of noise is also added to the suffix.

  • target_encoding_rnorm(): Each case in a group receives a value coming from a normal distribution with the mean and the standard deviation of the response over the cases of the group. The argument rnorm_sd_multiplier multiplies the standard deviation to reduce the spread of the obtained values. Columns encoded with this function are identified by the suffix "__encoded_rnorm_rnorm_sd_multiplier_X", where X is the amount of rnorm_sd_multiplier used.

  • target_encoding_loo(): The suffix "loo" stands for "leave-one-out". Each case in a group is encoded as the average of the response over the other cases of the group. Columns encoded with this function are identified by the suffix "__encoded_loo".
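The mean, rank, and leave-one-out encodings described above can be sketched in base R. This is only an illustration on a hypothetical toy data frame (toy, y, and g are made-up names), not the package's implementation. In particular, returning NA for groups with a single case in the leave-one-out encoding is an assumption of this sketch.

```r
# Hypothetical toy data, not part of the package
toy <- data.frame(
  y = c(1, 2, 3, 10, 20, 5),
  g = c("a", "a", "a", "b", "b", "c")
)

# Mean encoding: each case receives the mean of the response over its group
toy$g_mean <- ave(toy$y, toy$g, FUN = mean)

# Rank encoding: groups are ranked by their mean, lowest mean gets rank 1
group_means <- tapply(toy$y, toy$g, mean)
toy$g_rank <- as.numeric(rank(group_means)[as.character(toy$g)])

# Leave-one-out: mean of the response over the other cases of the group
group_n <- ave(toy$y, toy$g, FUN = length)
group_sum <- ave(toy$y, toy$g, FUN = sum)
toy$g_loo <- ifelse(
  group_n > 1,
  (group_sum - toy$y) / (group_n - 1),
  NA_real_ # assumption: singleton groups have no "other cases"
)
```

Note that the leave-one-out values differ between cases of the same group, because each case is excluded from its own average.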

Usage

target_encoding_mean(
  df,
  response,
  predictor,
  smoothing = 0,
  white_noise = 0,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

target_encoding_rnorm(
  df,
  response,
  predictor,
  rnorm_sd_multiplier = 1,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

target_encoding_rank(
  df,
  response,
  predictor,
  white_noise = 0,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

target_encoding_loo(
  df,
  response,
  predictor,
  white_noise = 0,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

add_white_noise(df, response, predictor, white_noise = 0.1, seed = 1)

Arguments

df

(required; data frame, tibble, or sf) A training data frame. Default: NULL

response

(required; character string) Name of the response. Must be a column name of df. Default: NULL

predictor

(required; character) Name of the categorical variable to encode. Default: NULL

smoothing

(optional; numeric) Argument of target_encoding_mean(). Minimum group size that keeps the mean of the group. Groups smaller than this have their means pulled towards the global mean of the response. Default: 0.
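As a rough illustration of the smoothing idea, the sketch below shrinks the means of small groups towards the global mean. The weighting rule (group size divided by smoothing, capped at 1) is an assumption chosen to match the description above, and the package may use a different formula. The data frame toy and its columns are hypothetical.

```r
# Hypothetical toy data, not part of the package
toy <- data.frame(
  y = c(1, 2, 3, 10, 20, 5),
  g = c("a", "a", "a", "b", "b", "c")
)

smoothing <- 3

global_mean <- mean(toy$y)
group_mean <- ave(toy$y, toy$g, FUN = mean)
group_size <- ave(toy$y, toy$g, FUN = length)

# Assumed rule: groups with at least `smoothing` cases keep their mean,
# smaller groups are pulled towards the global mean in proportion to size.
# With smoothing = 0 every weight is 1, so no shrinkage is applied.
weight <- pmin(group_size / smoothing, 1)

toy$g_smooth <- weight * group_mean + (1 - weight) * global_mean
```

Here group "a" (three cases) keeps its mean, while groups "b" and "c" move towards the global mean, with the singleton group "c" moving the most in relative terms.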

white_noise

(optional; numeric) Amount of white noise, in the range 0-1, expressed as a fraction of the range of the response and added to the encoded variable. Controls the variability in the encoded variables to mitigate potential overfitting. Default: 0.
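A minimal sketch of how noise scaled to the range of the response could be added to an encoded variable is shown below. The uniform distribution and the variable names (y, encoded, max_noise) are assumptions of this sketch. The package's add_white_noise() may generate the noise differently.

```r
set.seed(1) # fixed seed for reproducibility

# Hypothetical response and mean-encoded values for six cases
y <- c(1, 2, 3, 10, 20, 5)
encoded <- c(2, 2, 2, 15, 15, 5)

white_noise <- 0.1

# Maximum noise amplitude as a fraction of the range of the response
max_noise <- white_noise * diff(range(y))

# Assumption: noise drawn from a uniform distribution centered on zero
noisy <- encoded + runif(length(encoded), min = -max_noise, max = max_noise)
```

With white_noise = 0 the amplitude is zero and the encoded values are unchanged.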

seed

(optional; integer) Random seed to facilitate reproducibility. Default: 1

replace

(optional; logical) Advanced option that changes the behavior of the function; use it with care. If TRUE, the function replaces each categorical variable with its encoded version, and returns the input data frame with the replaced variables. Default: FALSE

verbose

(optional; logical) If TRUE, messages and plots generated during the execution of the function are displayed. Default: TRUE

rnorm_sd_multiplier

(optional; numeric) Argument of target_encoding_rnorm(). Multiplier, in the range 0-1, applied to the standard deviation of the response within each group of the categorical variable. Controls the variability in the encoded variables to mitigate potential overfitting. Default: 1
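The effect of rnorm_sd_multiplier can be sketched in base R as follows. This is an illustration on hypothetical toy data (toy, y, and g are made-up names), not the package's code, and it assumes every group has at least two cases so that the standard deviation is defined.

```r
set.seed(1) # fixed seed for reproducibility

# Hypothetical toy data with two groups of three cases each
toy <- data.frame(
  y = c(1, 2, 3, 10, 20, 30),
  g = c("a", "a", "a", "b", "b", "b")
)

rnorm_sd_multiplier <- 0.1

# Each case receives a draw from a normal distribution with the group mean
# and the group standard deviation scaled by the multiplier
toy$g_rnorm <- ave(toy$y, toy$g, FUN = function(v) {
  rnorm(
    n = length(v),
    mean = mean(v),
    sd = sd(v) * rnorm_sd_multiplier
  )
})
```

Smaller multipliers keep the encoded values closer to the group mean, and a multiplier of 0 reduces this sketch to plain mean encoding.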

Value

The input data frame with a target-encoded variable.

References

  • Micci-Barreca, D. (2001) A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32 doi:10.1145/507533.507538

Author

Blas M. Benito

Examples


data(vi)

#subset to limit example run time
vi <- vi[1:1000, ]

#mean encoding
#-------------

#without noise
df <- target_encoding_mean(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)


#with noise
df <- target_encoding_mean(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  white_noise = 0.1,
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)



#group rank
#----------

df <- target_encoding_rank(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)



#leave-one-out
#-------------

#without noise
df <- target_encoding_loo(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)


#with noise
df <- target_encoding_loo(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  white_noise = 0.1,
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)



#rnorm
#-----

#without sd multiplier
df <- target_encoding_rnorm(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)


#with sd multiplier
df <- target_encoding_rnorm(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  rnorm_sd_multiplier = 0.1,
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)