Hierarchical clustering of predictors from their correlation matrix. Computes the correlation matrix with cor_df() and cor_matrix(), transforms it to a distance matrix using stats::dist(), computes a clustering solution with stats::hclust(), and applies stats::cutree() to separate groups based on the value of the argument max_cor.
Returns a dataframe with predictor names and their clusters and, optionally, prints a dendrogram of the clustering solution.
Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).
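The pipeline described above can be sketched in base R. This is a minimal illustration, not the package's internal code: it assumes the distance is 1 - |correlation| built with stats::as.dist() (an assumption consistent with the `h = 1 - 0.75` line in the examples), it uses the built-in mtcars data, and it handles numeric predictors only.

```r
# Minimal base-R sketch of correlation-based clustering.
# Assumption: distance = 1 - |Pearson correlation|; numeric columns only.
predictors <- c("mpg", "disp", "hp", "wt", "qsec")
max_cor <- 0.7

# absolute pairwise correlations between predictors
m <- abs(stats::cor(mtcars[, predictors]))

# convert correlations to distances: highly correlated pairs are "close"
d <- stats::as.dist(1 - m)

# agglomerative clustering of the predictors
hc <- stats::hclust(d, method = "complete")

# cut the tree at the height equivalent to max_cor
groups <- stats::cutree(hc, h = 1 - max_cor)

# one row per predictor with its cluster id
clusters_df <- data.frame(
  predictor = names(groups),
  cluster = unname(groups),
  stringsAsFactors = FALSE
)
clusters_df
```

The resulting data frame has the same two columns (predictor, cluster) as the output shown in the examples, but the cluster assignments of cor_clusters() itself may differ, since the package also handles categorical predictors.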
Usage
cor_clusters(
df = NULL,
predictors = NULL,
max_cor = 0.7,
method = "complete",
quiet = FALSE,
...
)
Arguments
- df
(required; dataframe, tibble, or sf) A dataframe with predictors or the output of cor_df(). Default: NULL.
- predictors
(optional; character vector or NULL) Names of the predictors in df. If NULL, all columns except responses and constant/near-zero-variance columns are used. Default: NULL.
- max_cor
(optional; numeric or NULL) Correlation value used to separate clustering groups. Valid values are between 0.01 and 0.99. Default: 0.7
- method
(optional; character string) Argument of stats::hclust() defining the agglomerative method. One of: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). Unambiguous abbreviations are accepted as well. Default: "complete".
- quiet
(optional; logical) If FALSE, messages are printed. Default: FALSE.
- ...
(optional) Internal args (e.g. function_name for validate_arg_function_name, a precomputed correlation matrix m, or cross-validation args for preference_order).
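To make the roles of max_cor and method more concrete, the sketch below uses base R only (no collinear functions) and assumes a 1 - |correlation| distance. With method = "complete", cutting the tree at h = 1 - max_cor means every pair of predictors within a cluster has an absolute correlation of at least max_cor; other linkage methods do not give that guarantee. The helper cluster_at() is hypothetical and written only for this illustration.

```r
# Hypothetical helper for illustration; not part of collinear.
# Assumption: distance = 1 - |Pearson correlation|; numeric columns only.
cluster_at <- function(df, max_cor, method = "complete") {
  m <- abs(stats::cor(df))
  hc <- stats::hclust(stats::as.dist(1 - m), method = method)
  stats::cutree(hc, h = 1 - max_cor)
}

df <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec", "drat")]

# a higher max_cor cuts the tree lower, so clusters can only split further
n_clusters <- sapply(
  c(0.5, 0.7, 0.9),
  function(x) length(unique(cluster_at(df, max_cor = x)))
)

# with complete linkage, the lowest within-cluster |correlation|
# should not fall below max_cor
groups <- cluster_at(df, max_cor = 0.7)
m <- abs(stats::cor(df))
same_cluster <- outer(groups, groups, "==")
min_within <- min(m[same_cluster])
```

Here n_clusters never decreases as max_cor grows, and min_within stays at or above 0.7 (up to floating-point tolerance).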
See also
Other multicollinearity_assessment:
collinear_stats(),
cor_cramer(),
cor_df(),
cor_matrix(),
cor_stats(),
vif(),
vif_df(),
vif_stats()
Examples
data(vi_smol)
## OPTIONAL: parallelization setup
## irrelevant when all predictors are numeric
## only worth it for large data with many categoricals
# future::plan(
# future::multisession,
# workers = future::availableCores() - 1
# )
## OPTIONAL: progress bar
# progressr::handlers(global = TRUE)
#group predictors using max_cor as clustering threshold
clusters <- cor_clusters(
  df = vi_smol,
  predictors = c(
    "koppen_zone", #character
    "soil_type", #factor
    "topo_elevation", #numeric
    "soil_temperature_mean" #numeric
  ),
  max_cor = 0.75
)
#>
#> collinear::cor_clusters()
#> └── collinear::cor_matrix()
#> └── collinear::cor_df()
#> └── collinear::validate_arg_df(): converted the following character columns to factor:
#> - koppen_zone
#>
#> collinear::cor_clusters()
#> └── collinear::cor_matrix()
#> └── collinear::cor_df(): 2 categorical predictors have cardinality > 2 and may bias the multicollinearity analysis. Applying target encoding to convert them to numeric will solve this issue.
#clusters dataframe
clusters$df
#> predictor cluster
#> 1 koppen_zone 1
#> 2 soil_temperature_mean 1
#> 3 soil_type 2
#> 4 topo_elevation 3
##plot hclust object
# graphics::plot(clusters$hclust)
##plot max_cor threshold
# graphics::abline(
# h = 1 - 0.75,
# col = "red4",
# lty = 3,
# lwd = 2
# )
## OPTIONAL: disable parallelization
#future::plan(future::sequential)
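One possible follow-up, sketched here as an assumption about how the output might be used rather than as a package feature, is to keep a single representative predictor per cluster. The data frame below is typed in by hand to mirror the clusters$df output printed above, so the snippet runs without the package; keeping the first predictor of each cluster is an arbitrary rule chosen for illustration.

```r
# Hand-typed copy of the clusters$df output shown above,
# so the snippet runs without the collinear package.
clusters_df <- data.frame(
  predictor = c(
    "koppen_zone",
    "soil_temperature_mean",
    "soil_type",
    "topo_elevation"
  ),
  cluster = c(1, 1, 2, 3),
  stringsAsFactors = FALSE
)

# group predictor names by cluster id
by_cluster <- split(clusters_df$predictor, clusters_df$cluster)

# keep the first predictor of each cluster (arbitrary rule, for illustration)
selected <- unname(vapply(by_cluster, function(x) x[1], character(1)))
selected
```

In practice the choice within a cluster would more likely follow domain knowledge or a preference ranking than row order.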
