
Hierarchical clustering of predictors from their pairwise correlation matrix. Computes the correlation matrix with cor_df() and cor_matrix(), transforms it to a dist object, computes a clustering solution with stats::hclust(), and applies stats::cutree() to separate groups based on the value of the argument max_cor.

Returns a data frame with predictor names and their clusters, and optionally, prints a dendrogram of the clustering solution.

Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).
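The steps described above can be sketched with base R alone. This is a minimal illustration, not the package's implementation: it assumes the distance is computed as 1 minus the absolute correlation and that the tree is cut at height 1 - max_cor, neither of which is stated on this page. The dataset (mtcars) and the output column names are also chosen only for illustration.

```r
# Sketch of the clustering steps using base R only.
# Assumptions (not confirmed by this page):
#   - distance = 1 - |correlation|
#   - the tree is cut at height 1 - max_cor
max_cor <- 0.75

# numeric predictors from a built-in dataset
df <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# pairwise correlation matrix, in absolute values
m <- abs(stats::cor(df, method = "pearson"))

# convert to a dist object: highly correlated pairs are "close"
d <- stats::as.dist(1 - m)

# agglomerative clustering
hc <- stats::hclust(d, method = "complete")

# cut the tree to obtain cluster memberships
groups <- stats::cutree(hc, h = 1 - max_cor)

# data frame of predictor names and their clusters
# (column names are hypothetical)
clusters <- data.frame(
  predictor = names(groups),
  cluster = unname(groups)
)

clusters
```

Calling plot(hc) on the fitted tree draws the corresponding dendrogram.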

Usage

cor_clusters(
  df = NULL,
  predictors = NULL,
  max_cor = 0.75,
  method = "complete",
  plot = FALSE
)

Arguments

df

(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.

predictors

(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. Default: NULL.

max_cor

(optional; numeric) Maximum correlation allowed between any pair of variables in predictors. Recommended values are between 0.5 and 0.9. Higher values return a larger number of predictors with higher multicollinearity. If NULL, the pairwise correlation analysis is disabled. Default: 0.75.

method

(optional; character string) Argument of stats::hclust() defining the agglomerative method. One of: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). Unambiguous abbreviations are accepted as well. Default: "complete".

plot

(optional; logical) If TRUE, the clustering is plotted. Default: FALSE.
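The choice of method can change how many clusters a given threshold produces. The sketch below compares three linkage methods with base R, under the same assumptions as before (distance = 1 - absolute correlation, cut height = 1 - max_cor), which are illustrative and not confirmed by this page. With "complete" linkage, every pair of variables inside a cluster is within the cut height of each other, whereas "single" linkage can chain loosely related variables into one cluster.

```r
# Hypothetical comparison of linkage methods on a built-in dataset.
# Assumptions: distance = 1 - |correlation|, cut height = 1 - max_cor.
max_cor <- 0.75

m <- abs(stats::cor(mtcars))
d <- stats::as.dist(1 - m)

methods <- c("single", "average", "complete")

# number of clusters produced by each method at the same cut height
n_clusters <- sapply(
  X = methods,
  FUN = function(x) {
    hc <- stats::hclust(d, method = x)
    length(unique(stats::cutree(hc, h = 1 - max_cor)))
  }
)

n_clusters
```

At a fixed cut height, "single" never yields more clusters than "complete", so "complete" is the more conservative choice when clusters are meant to contain only mutually correlated predictors.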

Value

data frame: predictor names and their clusters
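A possible way to work with such a result is sketched below. The data frame is a hand-made stand-in for the function's output, and the column names predictor and cluster are assumptions for illustration; check names(df_clusters) on a real result before relying on them.

```r
# Hypothetical stand-in for the output of cor_clusters().
# The column names "predictor" and "cluster" are assumptions.
df_clusters <- data.frame(
  predictor = c("a", "b", "c", "d", "e"),
  cluster = c(1L, 1L, 2L, 3L, 3L)
)

# list the predictors that belong to each cluster
split(df_clusters$predictor, df_clusters$cluster)

# keep the first predictor listed in each cluster
selected <- df_clusters$predictor[!duplicated(df_clusters$cluster)]

selected
```

Keeping the first member of each cluster is an arbitrary rule used only to show the mechanics; in practice the representative would be chosen by a criterion relevant to the analysis.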

See also

Other pairwise_correlation: cor_cramer_v(), cor_df(), cor_matrix(), cor_select()

Examples


#load the package, which provides the example data vi and vi_predictors
library(collinear)

#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)

#progress bar
# progressr::handlers(global = TRUE)

df_clusters <- cor_clusters(
  df = vi[1:1000, ],
  predictors = vi_predictors[1:15]
)

#disable parallelization
future::plan(future::sequential)