Summary
Target encoding maps categorical predictors to a statistic of a reference numeric variable across categories.
This functionality, implemented in
target_encoding_lab(), is used in two different ways:
- The function
cor_df()applies it to compute pairwise correlations between numeric and categorical predictors, if any. - The function
collinear()applies it optionally if the user requests it and the response is numeric to transform all categorical predictors to numeric and speed-up the multicollinearity analysis.
How It Works
To explain how target encoding works, let’s first create a little silly dataframe:
df <- data.frame(
num = 1:6,
cat = c(rep("a", 3), rep("b", 3))
)
df
#> num cat
#> 1 1 a
#> 2 2 a
#> 3 3 a
#> 4 4 b
#> 5 5 b
#> 6 6 bMean Target Encoding
To transform cat to numeric we can remap it to the
average of num across each category in cat, as
shown in the code below using the dplyr syntax.
df <- df |>
dplyr::group_by(cat) |>
dplyr::mutate(cat_mean = mean(num)) |>
dplyr::ungroup()
df
#> # A tibble: 6 × 3
#> num cat cat_mean
#> <int> <chr> <dbl>
#> 1 1 a 2
#> 2 2 a 2
#> 3 3 a 2
#> 4 4 b 5
#> 5 5 b 5
#> 6 6 b 5This is equivalent to the "mean" encoding method
implemented in target_encoding_lab().
df <- collinear::target_encoding_lab(
df = df,
response = "num",
predictors = "cat",
methods = "mean",
quiet = TRUE
)
df
#> num cat cat__encoded_loo
#> 1 1 a 2.5
#> 2 2 a 2.0
#> 3 3 a 1.5
#> 4 4 b 5.5
#> 5 5 b 5.0
#> 6 6 b 4.5Leave-One-Out Target Encoding
Another encoding method is the leave-one-out ("loo"),
which computes the group mean while excluding the current
observation.
For each given case, it sums the numeric values in the same category, subtracts the current row’s value, and divides by the count of observations in that category minus one. The code below replicates this behavior.
df <- df |>
dplyr::group_by(cat) |>
dplyr::mutate(
cat_loo = (sum(num) - num) / (dplyr::n() - 1)
) |>
dplyr::ungroup()
df
#> # A tibble: 6 × 4
#> num cat cat__encoded_loo cat_loo
#> <int> <fct> <dbl> <dbl>
#> 1 1 a 2.5 2.5
#> 2 2 a 2 2
#> 3 3 a 1.5 1.5
#> 4 4 b 5.5 5.5
#> 5 5 b 5 5
#> 6 6 b 4.5 4.5Notice how the encoded values now vary within each category. For category “a”: rows get values based on the other two observations in “a”, and the same happens with “b”.
The same method can be applied in target_encoding_lab()
when setting method to "loo".
df <- collinear::target_encoding_lab(
df = df,
response = "num",
predictors = "cat",
methods = "loo",
quiet = TRUE
)
df
#> num cat cat__encoded_loo
#> 1 1 a 2.5
#> 2 2 a 2.0
#> 3 3 a 1.5
#> 4 4 b 5.5
#> 5 5 b 5.0
#> 6 6 b 4.5Rank Target Encoding
The rank method orders categories by their mean values and assigns them integer ranks. It is useful to capture the ordinal relationship between categories without being sensitive to outliers or extreme values.
The dplyr version of this algorithm is a bit more
verbose, but I hope it does the trick:
df <- df |>
dplyr::group_by(cat) |>
dplyr::mutate(
cat_mean = mean(num)
) |>
dplyr::ungroup() |>
dplyr::mutate(
cat_rank = dplyr::dense_rank(cat_mean)
)
df
#> # A tibble: 6 × 4
#> num cat cat_mean cat_rank
#> <int> <fct> <dbl> <int>
#> 1 1 a 2 1
#> 2 2 a 2 1
#> 3 3 a 2 1
#> 4 4 b 5 2
#> 5 5 b 5 2
#> 6 6 b 5 2The same method can be applied via target_encoding_lab()
with encoding_method = "rank".
df <- collinear::target_encoding_lab(
df = df,
response = "num",
predictors = "cat",
methods = "rank",
quiet = TRUE
)
df
#> num cat cat__encoded_loo
#> 1 1 a 2.5
#> 2 2 a 2.0
#> 3 3 a 1.5
#> 4 4 b 5.5
#> 5 5 b 5.0
#> 6 6 b 4.5Target Encoding in Practice
Computing Correlations with Mixed Variable Types
When computing pairwise correlations between numeric and categorical
predictors, cor_df() automatically applies leave-one-out
target encoding behind the scenes.
x <- collinear::cor_df(
df = collinear::vi_smol,
predictors = c(
"soil_temperature_mean", # numeric
"koppen_zone" # categorical
),
quiet = TRUE
)
x
#> x y correlation metric
#> 1 soil_temperature_mean koppen_zone 0.9195774 PearsonThis is the same as applying target_encoding_lab() with
method "loo" and then compute the pairwise correlation with
stats::cor().
df <- collinear::target_encoding_lab(
df = vi_smol,
response = "soil_temperature_mean",
predictor = "koppen_zone",
encoding_method = "loo",
overwrite = TRUE,
quiet = TRUE
)
stats::cor(
x = df$soil_temperature_mean,
y = df$koppen_zone
)
#> [1] 0.9195774Speeding Up Multicollinearity Analysis
Enabling target encoding when working with numeric responses and
categorical predictors in collinear() provides two key
advantages:
- Speeds up computation considerably, and more importantly.
- Replaces Cramer’s V associations between pairs of categorical variables with Pearson correlations between pairs of numeric variables, providing a more statistically sound multicollinearity assessment.
# without target encoding
time_without <- system.time({
result_without_encoding <- collinear::collinear(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_categorical,
encoding_method = NULL,
quiet = TRUE
)
})
# with target encoding
time_with <- system.time({
result_with_encoding <- collinear::collinear(
df = vi_smol,
responses = "vi_numeric",
predictors = vi_predictors_categorical,
encoding_method = "loo",
quiet = TRUE
)
})
data.frame(
encoding = c("No", "Yes"),
seconds = c(time_without["elapsed"], time_with["elapsed"])
)
#> encoding seconds
#> 1 No 10.272
#> 2 Yes 2.509The speed-up is considerable!
But that’s not the only consequence of applying target encoding. Let’s take a look at the variable selections on each run.
result_without_encoding$vi_numeric$selection
#> [1] "koppen_zone" "soil_type" "biogeo_ecoregion" "country_name"
#> attr(,"validated")
#> [1] TRUE
result_with_encoding$vi_numeric$selection
#> [1] "koppen_zone" "subregion" "biogeo_realm" "continent"
#> attr(,"validated")
#> [1] TRUEThe selection resulting from the run with target encoding is a bit shorter. Why? Because without target encoding, pairwise associations between categorical predictors are computed with Cramer’s V, which tends to underestimate association under high cardinality (number of categories).
Hence, when target encoding is applied, collinear()
captures true multicollinearity and ensures a consistent filtering
across all predictors, no matter their type.
Important Considerations
Overfitting Risk: Mean encoding can leak information from the response into the predictor. The leave-one-out method mitigates this risk and is generally recommended for modeling workflows.
Category Frequency: Target encoding works best with categories that have sufficient observations. Categories with very few observations may produce unstable encoded values.
When to Use Target Encoding: Enable
encoding_method = "loo" in collinear() when
you have a numeric response and categorical predictors, and you want the
most accurate multicollinearity assessment. The improved statistical
soundness and computational speed make it the preferred approach for
mixed data types.
