Returns a correlation matrix between all pairs of predictors in a training dataset. Non-numeric predictors are transformed into numeric via target encoding, using the 'response' variable as reference.

Usage

cor_matrix(
  df = NULL,
  response = NULL,
  predictors = NULL,
  cor_method = "pearson",
  encoding_method = "mean"
)

Arguments

df

(required; data frame) A data frame with numeric and/or character predictors, and optionally, a response variable. Default: NULL.

response

(recommended; character string) Name of a numeric response variable. Character response variables are ignored. See 'Details' for how providing or omitting this argument changes the results when 'predictors' contains character variables. Default: NULL.

predictors

(optional; character vector) Names of the predictors in 'df'. If omitted, all columns of 'df' are used as predictors. Default: NULL.

cor_method

(optional; character string) Method used to compute pairwise correlations. Accepted methods are "pearson" (with a recommended minimum of 30 rows in 'df') or "spearman" (with a recommended minimum of 10 rows in 'df'). Default: "pearson".

encoding_method

(optional; character string) Name of the target encoding method used to convert character and factor predictors to numeric. One of "mean", "rank", "loo", or "rnorm" (see target_encoding_lab() for further details). Default: "mean".
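
As a rough illustration of what "mean" target encoding does before a correlation is computed, the sketch below uses base R only. The toy data frame and its column names are made up for this example, and target_encoding_lab() may differ in its details; this is not the package's implementation.

```r
# Toy data; all names below are illustrative and not part of the package.
toy <- data.frame(
  y = c(1, 2, 3, 10, 11, 12),
  x = c(0.5, 1.0, 1.4, 2.5, 2.0, 3.1),
  group = c("a", "a", "a", "b", "b", "b")
)

# "mean"-style target encoding: each category is replaced
# by the mean of the response within that category.
toy$group_encoded <- stats::ave(toy$y, toy$group, FUN = mean)

# The encoded column is numeric, so stats::cor() can be applied
# with either of the methods accepted by 'cor_method'.
cor_pearson <- stats::cor(toy$x, toy$group_encoded, method = "pearson")
cor_spearman <- stats::cor(toy$x, toy$group_encoded, method = "spearman")
```

Here category "a" is encoded as 2 and category "b" as 11, the group means of 'y'.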

Value

Correlation matrix with one row and one column per predictor.

Details

This function attempts to handle correlations between pairs of variables that can be of different types:

  • numeric vs. numeric: computed with stats::cor() with the methods "pearson" or "spearman".

  • numeric vs. character, two alternatives leading to different results:

    • 'response' is provided: the character variable is target-encoded as numeric using the values of the response as reference, and then its correlation with the numeric variable is computed with stats::cor(). This option generates a response-specific result suitable for training statistical and machine-learning models.

    • 'response' is NULL (or the name of a non-numeric column): the character variable is target-encoded as numeric using the values of the numeric predictor (instead of the response) as reference, and then their correlation is computed with stats::cor(). This option leads to a response-agnostic result suitable for clustering problems.

  • character vs. character, two alternatives leading to different results:

    • 'response' is provided: the character variables are target-encoded as numeric using the values of the response as reference, and then their correlation is computed with stats::cor().

    • 'response' is NULL (or the name of a non-numeric column): the association between the character variables is computed using Cramer's V. This option might be problematic, because R-squared values and Cramer's V, even when having the same range between 0 and 1, are not fully comparable.
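
The alternatives above can be sketched with base R. This is a minimal illustration under stated assumptions, not the package's code: the toy data, the helper names encode_mean() and cramers_v(), and the use of the classic uncorrected Cramer's V formula are all assumptions made for this example.

```r
# Toy data; all names below are illustrative and not part of the package.
toy <- data.frame(
  y  = c(1, 2, 3, 10, 11, 12, 4, 5),
  x  = c(5, 6, 7, 1, 2, 3, 9, 8),
  f1 = c("a", "a", "a", "b", "b", "b", "c", "c"),
  f2 = c("u", "u", "v", "v", "v", "u", "u", "v")
)

# Hypothetical helper: mean target encoding of a character vector
# against a numeric reference vector.
encode_mean <- function(categories, reference) {
  stats::ave(reference, categories, FUN = mean)
}

# numeric vs. character, 'response' provided:
# f1 is encoded against the response y, then correlated with x.
cor_with_response <- stats::cor(toy$x, encode_mean(toy$f1, toy$y))

# numeric vs. character, 'response' is NULL:
# f1 is encoded against the numeric predictor x itself.
cor_without_response <- stats::cor(toy$x, encode_mean(toy$f1, toy$x))

# character vs. character, 'response' is NULL:
# classic (uncorrected) Cramer's V from a chi-squared statistic.
cramers_v <- function(a, b) {
  tab <- table(a, b)
  # small expected counts trigger a warning that is not relevant here
  chi2 <- suppressWarnings(
    stats::chisq.test(tab, correct = FALSE)$statistic
  )
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

v <- cramers_v(toy$f1, toy$f2)
```

In this toy case the two encodings of 'f1' give correlations with 'x' of opposite sign, which shows why the response-specific and response-agnostic results are not interchangeable.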

Author

Blas M. Benito

Examples


data(
  vi,
  vi_predictors
)

#subset to limit example run time
vi <- vi[1:1000, ]
vi_predictors <- vi_predictors[1:5]

#convert correlation data frame to matrix
df <- cor_df(
  df = vi,
  predictors = vi_predictors
)

m <- cor_matrix(
  df = df
)

#show first five rows and columns
m[1:5, 1:5]
#>                    koppen_zone koppen_description koppen_group soil_type
#> koppen_zone              1.000              0.997        0.991     0.335
#> koppen_description       0.997              1.000        0.871     0.365
#> koppen_group             0.991              0.871        1.000     0.577
#> soil_type                0.335              0.365        0.577     1.000
#> topo_slope               0.397              0.365        0.290     0.359
#>                    topo_slope
#> koppen_zone             0.397
#> koppen_description      0.365
#> koppen_group            0.290
#> soil_type               0.359
#> topo_slope              1.000

#generate correlation matrix directly
m <- cor_matrix(
  df = vi,
  predictors = vi_predictors
)

m[1:5, 1:5]
#>                    koppen_zone koppen_description koppen_group soil_type
#> koppen_zone              1.000              0.997        0.991     0.335
#> koppen_description       0.997              1.000        0.871     0.365
#> koppen_group             0.991              0.871        1.000     0.577
#> soil_type                0.335              0.365        0.577     1.000
#> topo_slope               0.397              0.365        0.290     0.359
#>                    topo_slope
#> koppen_zone             0.397
#> koppen_description      0.365
#> koppen_group            0.290
#> soil_type               0.359
#> topo_slope              1.000

#with response (much faster)
#different solution than previous one
#because target encoding is done against the response
#rather than against the other numeric in the pair
m <- cor_matrix(
  df = vi,
  response = "vi_mean",
  predictors = vi_predictors
)

m[1:5, 1:5]
#>                    koppen_zone koppen_group koppen_description soil_type
#> koppen_zone              1.000        0.937              0.991     0.775
#> koppen_group             0.937        1.000              0.933     0.747
#> koppen_description       0.991        0.933              1.000     0.771
#> soil_type                0.775        0.747              0.771     1.000
#> topo_slope               0.164        0.175              0.148     0.139
#>                    topo_slope
#> koppen_zone             0.164
#> koppen_group            0.175
#> koppen_description      0.148
#> soil_type               0.139
#> topo_slope              1.000