Target Encoding Lab: Transform Categorical Variables to Numeric
Source: R/target_encoding_lab.R
Target encoding involves replacing the values of categorical variables with numeric ones derived from a "target variable", usually a model's response.
In essence, target encoding works as follows:
1. Group all cases that share the same value of the categorical variable.
2. Compute a statistic of the target variable across the cases in each group.
3. Assign the value of the statistic to every case in the group.
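The three steps above can be sketched in base R as follows. This is only an illustration on made-up data, using the group mean as the statistic; the data frame `toy` and the column name `zone__encoded` are hypothetical and are not part of the package.

```r
# Made-up toy data: a categorical predictor (zone) and a numeric response (y).
toy <- data.frame(
  zone = c("a", "a", "b", "b", "b", "c"),
  y = c(1, 3, 10, 20, 30, 7),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: group the cases by category and compute the group statistic
# (here, the mean of the response).
group_means <- vapply(split(toy$y, toy$zone), mean, numeric(1))

# Step 3: assign each group's statistic back to the cases of that group.
toy$zone__encoded <- unname(group_means[toy$zone])

print(toy)
```

The encoded column is numeric and can replace or complement the original categorical column in a model.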
The methods to compute the group statistic implemented here are:
- "mean" (implemented in target_encoding_mean()): Encodes categorical values with the group means of the response. Variables encoded with this method are identified with the suffix "__encoded_mean". Overfitting is controlled via the argument smoothing. The integer value of this argument indicates a threshold in number of rows. Groups above this threshold are encoded with the group mean, while groups below it are encoded with a weighted mean of the group's mean and the global mean. This method is named "mean smoothing" in the relevant literature.
- "rank" (implemented in target_encoding_rank()): Returns the rank of the group as an integer, with 1 assigned to the group with the lowest mean of the response variable. Variables encoded with this method are identified with the suffix "__encoded_rank".
- "loo" (implemented in target_encoding_loo()): Known as the "leave-one-out method" in the literature, it encodes each categorical value with the mean of the response variable across all other group cases. This method controls overfitting better than "mean". Variables encoded with this method are identified with the suffix "__encoded_loo".
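The three statistics described above can be sketched in base R. The helpers `encode_mean()`, `encode_rank()`, and `encode_loo()` below are hypothetical and are not the package's functions. Two choices are assumptions of this sketch rather than documented behavior: the smoothing weight (group size divided by the threshold, capped at 1) and the fallback to the global mean for single-case groups in the leave-one-out method.

```r
# Made-up toy data: a categorical predictor x and a numeric response y.
x <- c("a", "a", "b", "b", "b", "c")
y <- c(1, 3, 10, 20, 30, 7)

# "mean" with smoothing.
# Assumed weighting: group size / threshold, capped at 1, so groups at or
# above the threshold keep their own mean and smaller groups are pulled
# towards the global mean.
encode_mean <- function(x, y, smoothing = 0) {
  groups <- split(y, x)
  group_mean <- vapply(groups, mean, numeric(1))
  group_n <- vapply(groups, length, integer(1))
  w <- if (smoothing > 0) pmin(group_n / smoothing, 1) else 1
  smoothed <- w * group_mean + (1 - w) * mean(y)
  unname(smoothed[as.character(x)])
}

# "rank": integer rank of each group's mean, 1 for the lowest mean.
encode_rank <- function(x, y) {
  group_mean <- vapply(split(y, x), mean, numeric(1))
  ranks <- rank(group_mean, ties.method = "first")
  unname(ranks[as.character(x)])
}

# "loo": mean of the response across all OTHER cases of the same group.
encode_loo <- function(x, y) {
  group_sum <- ave(y, x, FUN = sum)
  group_n <- ave(y, x, FUN = length)
  loo <- (group_sum - y) / (group_n - 1)
  # Assumption: a group with a single case has no other cases to average,
  # so this sketch falls back to the global mean.
  loo[group_n == 1] <- mean(y)
  loo
}

enc_mean <- encode_mean(x, y, smoothing = 3)
enc_rank <- encode_rank(x, y)
enc_loo <- encode_loo(x, y)

print(data.frame(x, y, enc_mean, enc_rank, enc_loo))
```

Note that the leave-one-out values differ between cases of the same group, because each case's own response is excluded from its encoding.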
Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).
Usage
target_encoding_lab(
  df = NULL,
  response = NULL,
  predictors = NULL,
  methods = c("loo", "mean", "rank"),
  smoothing = 0,
  white_noise = 0,
  seed = 0,
  overwrite = FALSE,
  quiet = FALSE
)
Arguments
- df
(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.
- response
(optional; character string) Name of a numeric response variable in df. Default: NULL.
- predictors
(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. If argument response is not provided, non-numeric variables are ignored. Default: NULL.
- methods
(optional; character vector or NULL) Names of the target encoding methods. If NULL, target encoding is ignored, and df is returned with no modification. Default: c("loo", "mean", "rank").
- smoothing
(optional; integer vector) Argument of the method "mean". Groups smaller than this number have their means pulled towards the mean of the response across all cases. Default: 0.
- white_noise
(optional; numeric vector) Argument of the methods "mean", "rank", and "loo". Maximum white noise to add, expressed as a fraction of the range of the response variable. Valid range: 0 to 1. Default: 0.
- seed
(optional; integer vector) Random seed to facilitate reproducibility when white_noise is not 0. If NULL, the function selects one at random, and the selected seed does not appear in the encoded variable names. Default: 0.
- overwrite
(optional; logical) If TRUE, the original predictors in df are overwritten with their encoded versions, but only one encoding method, smoothing, white noise, and seed are allowed. Otherwise, encoded predictors with their descriptive names are added to df. Default: FALSE.
- quiet
(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE.
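To make the interaction between white_noise and seed concrete, here is a minimal base R sketch. The helper `add_white_noise()` is hypothetical, and the noise model is an assumption of this sketch (uniform noise bounded by white_noise times the range of the response); the package may generate its noise differently.

```r
# Hypothetical helper: adds bounded random noise to an encoded predictor.
# Assumption: noise is uniform in [-max_noise, max_noise], where max_noise
# is white_noise times the range of the response.
add_white_noise <- function(encoded, response, white_noise = 0, seed = 0) {
  if (white_noise == 0) {
    return(encoded)
  }
  # Fixing the seed makes the noise reproducible across calls.
  # Note that set.seed() changes the global random number generator state.
  set.seed(seed)
  max_noise <- white_noise * diff(range(response, na.rm = TRUE))
  encoded + stats::runif(
    n = length(encoded),
    min = -max_noise,
    max = max_noise
  )
}

# Made-up response and its group-mean encoding.
y <- c(1, 3, 10, 20, 30, 7)
encoded <- c(2, 2, 20, 20, 20, 7)

noisy_a <- add_white_noise(encoded, y, white_noise = 0.1, seed = 1)
noisy_b <- add_white_noise(encoded, y, white_noise = 0.1, seed = 1)

# Same seed, same result.
identical(noisy_a, noisy_b)
```

Noise of this kind breaks the exact repetition of the same encoded value within a group, which is one way to limit overfitting to the group statistic.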
References
Micci-Barreca, D. (2001) A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32. doi: 10.1145/507533.507538
See also
Other target_encoding: target_encoding_mean()
Examples
#load the package that provides the function and the example data
library(collinear)

data(
  vi,
  vi_predictors
)

#subset to limit example run time
vi <- vi[1:1000, ]

#applying all methods for a continuous response
df <- target_encoding_lab(
  df = vi,
  response = "vi_numeric",
  predictors = "koppen_zone",
  methods = c(
    "mean",
    "loo",
    "rank"
  ),
  white_noise = c(0, 0.1, 0.2)
)
#>
#> collinear::target_encoding_lab(): using response 'vi_numeric' to encode categorical predictors:
#> - koppen_zone
#identify encoded predictors
#the suffix is matched as a literal string rather than as a regular expression
predictors.encoded <- grep(
  pattern = "__encoded_",
  x = colnames(df),
  value = TRUE,
  fixed = TRUE
)