This function generates a model frame for statistical or machine learning analysis from these objects:
- response_df: Dissimilarity data frame generated by distantia(), distantia_ls(), distantia_dtw(), or distantia_time_delay(). The output model frame will have as many rows as this data frame.
- predictors_df: Data frame with static descriptors of the time series. These descriptors are converted to distances between pairs of time series via distance_matrix().
- composite_predictors: List defining composite predictors. This feature groups predictors that share a common meaning. For example, composite_predictors = list(temperature = c("temperature_mean", "temperature_min", "temperature_max")) generates a new predictor named "temperature", computed as the multivariate distance between the vectors of temperature variables of each pair of time series. Predictors in such a group are scaled before distance computation if their maximum standard deviations differ by a factor of 10 or more.
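The composite predictor idea can be illustrated with base R alone. This is a minimal sketch, not the package's implementation: the toy values are made up, stats::dist() stands in for distance_matrix(), and the scaling rule (largest standard deviation at least ten times the smallest) is one assumed reading of the condition described above.

```r
# Toy static descriptors for three hypothetical time series (made-up values).
predictors <- data.frame(
  name = c("a", "b", "c"),
  temperature_mean = c(10, 12, 20),
  temperature_min = c(2, 3, 11),
  temperature_max = c(18, 22, 30),
  stringsAsFactors = FALSE
)

temperature_columns <- c(
  "temperature_mean",
  "temperature_min",
  "temperature_max"
)

# Numeric matrix with one row per time series.
m <- as.matrix(predictors[, temperature_columns])
rownames(m) <- predictors$name

# Assumed scaling rule: scale the group only when the largest standard
# deviation is at least ten times the smallest.
sds <- apply(m, 2, stats::sd)
if (max(sds) / min(sds) >= 10) {
  m <- scale(m)
}

# Multivariate Euclidean distance between every pair of rows
# (stand-in for the package's distance_matrix()).
d <- as.matrix(stats::dist(m, method = "euclidean"))

# One row per pair of time series, as in a model frame.
pairs <- t(utils::combn(predictors$name, 2))
composite <- data.frame(
  x = pairs[, 1],
  y = pairs[, 2],
  temperature = d[pairs],
  stringsAsFactors = FALSE
)

composite
```

Each row of composite holds a single "temperature" value that summarizes the three original columns for that pair of time series.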
The resulting data frame contains the following columns:
- x and y: names of the pair of time series represented in the row.
- response columns in response_df.
- predictor columns: each represents the distance between x and y for the given static predictor.
- (optional) geographic_distance: if predictors_df is an sf data frame, geographic distances are computed via sf::st_distance().
This function supports a parallelization setup via future::plan().
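A parallelization setup might look like the sketch below. It assumes the future package is installed; the multisession backend and the worker count of 2 are arbitrary choices for illustration, and the page does not state which steps of the function run in parallel.

```r
# Assumption: the future package is installed.
# Start two background R sessions (arbitrary worker count).
future::plan(future::multisession, workers = 2)

# ... call distantia_model_frame() here ...

# Return to sequential processing when done.
future::plan(future::sequential)
```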
Usage
distantia_model_frame(
  response_df = NULL,
  predictors_df = NULL,
  composite_predictors = NULL,
  scale = TRUE,
  distance = "euclidean"
)
Arguments
- response_df
- (required, data frame) output of distantia(), distantia_ls(), distantia_dtw(), or distantia_time_delay(). Default: NULL
- predictors_df
- (required, data frame or sf data frame) data frame with numeric predictors for the model frame. Must have a column with the time series names in response_df$x and response_df$y. If it is an sf data frame, the column "geographic_distance" with distances between pairs of time series is added to the model frame. Default: NULL
- composite_predictors
- (optional, list) list defining composite predictors. For example, composite_predictors = list(a = c("b", "c")) uses the columns "b" and "c" from predictors_df to generate the predictor a as the multivariate distance between "b" and "c" for each pair of time series in response_df. Default: NULL
- scale
- (optional, logical) if TRUE, all predictors are scaled and centered with scale(). Default: TRUE
- distance
- (optional, string) method to compute the distance between predictor values for all pairs of time series in response_df. Default: "euclidean".
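To make the scale and distance arguments concrete, here is a small base R sketch of how a single predictor could be turned into a centered and scaled distance column. The population values are invented, and stats::dist() is used as a stand-in for the package's own distance computation, so this shows the idea rather than the actual internals.

```r
# Made-up values of one static predictor for three hypothetical time series.
population <- c(a = 100, b = 250, c = 900)

# Pairwise Euclidean distances between predictor values
# (for a single variable this is the absolute difference).
d <- as.matrix(stats::dist(population, method = "euclidean"))

# One distance per pair of time series.
pairs <- t(utils::combn(names(population), 2))
raw <- d[pairs]

# Center and scale the distance column, as scale = TRUE would.
scaled <- as.numeric(scale(raw))

data.frame(
  x = pairs[, 1],
  y = pairs[, 2],
  population_raw = raw,
  population_scaled = scaled
)
```

After scaling, the column has mean zero and unit standard deviation, which is consistent with the negative predictor values shown in the example output further down.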
See also
Other distantia_support:
distantia_aggregate(),
distantia_boxplot(),
distantia_cluster_hclust(),
distantia_cluster_kmeans(),
distantia_matrix(),
distantia_spatial(),
distantia_stats(),
distantia_time_delay(),
utils_block_size(),
utils_cluster_hclust_optimizer(),
utils_cluster_kmeans_optimizer(),
utils_cluster_silhouette()
Examples
#covid prevalence in California counties
tsl <- tsl_initialize(
  x = covid_prevalence,
  name_column = "name",
  time_column = "time"
) |>
  #subset to shorten example runtime
  tsl_subset(
    names = 1:5
  )
#dissimilarity analysis
df <- distantia_ls(tsl = tsl)
#combine several predictors
#into a new one
composite_predictors <- list(
  economy = c(
    "poverty_percentage",
    "median_income",
    "domestic_product"
    )
)
#generate model frame
model_frame <- distantia_model_frame(
  response_df = df,
  predictors_df = covid_counties,
  composite_predictors = composite_predictors,
  scale = TRUE
)
head(model_frame)
#>              x            y      psi   economy area_hectares population
#> 1      Alameda Contra_Costa 1.162055 0.3641648    -1.0643659 -0.5863664
#> 2        Butte    El_Dorado 2.327273 2.1431460    -1.0153959 -1.5791633
#> 3      Alameda    El_Dorado 2.483755 1.2752482    -0.5878037  1.4302278
#> 4        Butte Contra_Costa 2.733068 2.8431733    -0.6367737  0.3676509
#> 5 Contra_Costa    El_Dorado 2.767442 1.1381072    -0.5870788  0.4025409
#> 6      Alameda        Butte 2.962963 2.7802798    -0.6374986  1.3953378
#>   poverty_percentage median_income domestic_product daily_miles_traveled
#> 1         -1.1950603    -1.4638414       -0.3502863            1.1013250
#> 2          0.4596386    -0.1929442       -1.3164373           -0.5591417
#> 3         -1.1950603    -0.3815169        1.2280181           -0.7659104
#> 4          0.4596386     1.0682077        0.2193528            0.8945563
#> 5         -1.3329518    -0.2921032        0.2406099            1.3550699
#> 6          0.3217470     0.9787940        1.2067609           -0.8128866
#>   employed_percentage geographic_distance
#> 1          -0.5448841          -1.5815924
#> 2           0.8453162          -0.7903724
#> 3          -1.2259232          -0.3342092
#> 4           0.1642772           0.2536703
#> 5          -0.9142071          -0.6687226
#> 6           1.2146392           0.6234367
#names of response and predictors
#and an additive formula
#are stored as attributes
attributes(model_frame)$predictors
#> [1] "area_hectares"        "population"           "poverty_percentage"  
#> [4] "median_income"        "domestic_product"     "daily_miles_traveled"
#> [7] "employed_percentage"  "geographic_distance" 
#if response_df is output of distantia():
attributes(model_frame)$response
#> [1] "psi"
attributes(model_frame)$formula
#> psi ~ area_hectares + population + poverty_percentage + median_income + 
#>     domestic_product + daily_miles_traveled + employed_percentage + 
#>     geographic_distance
#> <environment: 0x555b1ed623f8>
#example of linear model
# model <- lm(
#   formula = attributes(model_frame)$formula,
#   data = model_frame
# )
#
# summary(model)
