Hierarchical clustering of predictors from their correlation matrix. Computes the correlation matrix with cor_df() and cor_matrix(), transforms it to a distance matrix using stats::dist(), computes a clustering solution with stats::hclust(), and applies stats::cutree() to separate groups based on the value of the argument max_cor.

Returns a list with a dataframe of predictor names and their cluster IDs, and the clustering object, which can be plotted as a dendrogram (see examples).

Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).
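The clustering steps described above can be sketched with base R alone. This is a minimal illustration, not the package's implementation: it assumes numeric predictors only, Pearson correlation, and a distance defined as 1 - |correlation| (consistent with the `h = 1 - 0.75` threshold line drawn in the examples). The dataset (`mtcars`) and the object names are placeholders chosen for the example.

```r
# Minimal base-R sketch of correlation-based predictor clustering.
# Assumption: distance = 1 - |correlation|; the package internals may differ.
df <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
max_cor <- 0.7

# pairwise absolute Pearson correlations between predictors
m <- abs(stats::cor(df))

# convert the correlation matrix to a distance object
d <- stats::as.dist(1 - m)

# agglomerative clustering of predictors
hc <- stats::hclust(d, method = "complete")

# cut the tree at the height equivalent to max_cor
cluster <- stats::cutree(hc, h = 1 - max_cor)

# dataframe with predictor names and cluster IDs
clusters_df <- data.frame(
  predictor = names(cluster),
  cluster = unname(cluster)
)

clusters_df
```

With complete linkage, every pair of predictors within a cluster has an absolute correlation of at least `max_cor` under this distance definition; other linkage methods do not give that guarantee.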

Usage

cor_clusters(
  df = NULL,
  predictors = NULL,
  max_cor = 0.7,
  method = "complete",
  quiet = FALSE,
  ...
)

Arguments

df

(required; dataframe, tibble, or sf) A dataframe with predictors or the output of cor_df(). Default: NULL.

predictors

(optional; character vector or NULL) Names of the predictors in df. If NULL, all columns except responses and constant/near-zero-variance columns are used. Default: NULL.

max_cor

(optional; numeric or NULL) Correlation value used to separate clustering groups. Valid values are between 0.01 and 0.99. Default: 0.7.

method

(optional; character string) Argument of stats::hclust() defining the agglomerative method. One of: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). Unambiguous abbreviations are accepted as well. Default: "complete".

quiet

(optional; logical) If FALSE, messages are printed. Default: FALSE.

...

(optional) Internal args (e.g. function_name for validate_arg_function_name, a precomputed correlation matrix m, or cross-validation args for preference_order).
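To illustrate how max_cor might shape the result, the sketch below cuts a single clustering tree at several thresholds and counts the resulting groups. It uses base R only and assumes a distance of 1 - |Pearson correlation| cut at `h = 1 - max_cor`; the package internals may differ, and the dataset and object names are placeholders.

```r
# Sketch: effect of max_cor on the number of clusters (base R only).
# Assumption: distance = 1 - |Pearson correlation|, cut at h = 1 - max_cor.
df <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec", "drat")]

# one clustering tree, reused for all thresholds
hc <- stats::hclust(
  stats::as.dist(1 - abs(stats::cor(df))),
  method = "complete"
)

# candidate thresholds within the valid range of max_cor
max_cor_values <- c(0.5, 0.7, 0.9)

# number of clusters obtained at each threshold
n_clusters <- vapply(
  max_cor_values,
  function(x) length(unique(stats::cutree(hc, h = 1 - x))),
  integer(1)
)
names(n_clusters) <- max_cor_values

n_clusters
```

A higher max_cor lowers the cut height, so the number of clusters can only stay the same or grow as max_cor increases.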

Value

list:

  • df: dataframe with predictor names and their cluster IDs.

  • hclust: clustering object returned by stats::hclust().
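One possible use of the returned dataframe is to keep a single representative predictor per cluster. The sketch below does not call the package: it rebuilds the `df` element by hand from the example output shown in the Examples section, and keeping the first predictor of each cluster is an arbitrary choice made only for illustration.

```r
# Hand-built copy of clusters$df, taken from the example output on this page.
clusters_df <- data.frame(
  predictor = c(
    "koppen_zone",
    "soil_temperature_mean",
    "soil_type",
    "topo_elevation"
  ),
  cluster = c(1, 1, 2, 3)
)

# group predictor names by cluster ID
groups <- split(clusters_df$predictor, clusters_df$cluster)

# keep the first predictor of each cluster (arbitrary rule)
selected <- vapply(groups, function(x) x[1], character(1))

selected
```

Any other selection rule (for example, a preference ranking of predictors) can replace the `x[1]` step.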

Examples

data(vi_smol)

## OPTIONAL: parallelization setup
## irrelevant when all predictors are numeric
## only worth it for large data with many categoricals
# future::plan(
#   future::multisession,
#   workers = future::availableCores() - 1
# )

## OPTIONAL: progress bar
# progressr::handlers(global = TRUE)

#group predictors using max_cor as clustering threshold
clusters <- cor_clusters(
  df = vi_smol,
  predictors = c(
    "koppen_zone", #character
    "soil_type", #factor
    "topo_elevation", #numeric
    "soil_temperature_mean" #numeric
  ),
  max_cor = 0.75
)
#> 
#> collinear::cor_clusters()
#> └── collinear::cor_matrix()
#>     └── collinear::cor_df()
#>         └── collinear::validate_arg_df(): converted the following character columns to factor:
#>  - koppen_zone
#> 
#> collinear::cor_clusters()
#> └── collinear::cor_matrix()
#>     └── collinear::cor_df(): 2 categorical predictors have cardinality > 2 and may bias the multicollinearity analysis. Applying target encoding to convert them to numeric will solve this issue.

#clusters dataframe
clusters$df
#>               predictor cluster
#> 1           koppen_zone       1
#> 2 soil_temperature_mean       1
#> 3             soil_type       2
#> 4        topo_elevation       3

##plot hclust object
# graphics::plot(clusters$hclust)

##plot max_cor threshold
# graphics::abline(
#   h = 1 - 0.75,
#   col = "red4",
#   lty = 3,
#   lwd = 2
# )

## OPTIONAL: disable parallelization
# future::plan(future::sequential)