
This function generates a model frame for statistical or machine learning analysis from these objects:

  • df: Dissimilarity data frame generated by distantia(). The output model frame will have as many rows as this data frame.

  • predictors_df: Data frame with static descriptors of the time series. These descriptors are converted to distances between pairs of time series via distance_matrix().

  • predictors_list: List defining new predictors as combinations of existing predictors. This feature allows grouping predictors that share a common meaning. For example, predictors_list = list(temperature = c("temperature_mean", "temperature_min", "temperature_max")) generates a new predictor named "temperature", computed as the distance between the vectors of temperature variables for each pair of time series. Predictors in such a group are scaled before distance computation if their maximum standard deviations differ by a factor of 10 or more.
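
The conversion of static descriptors into pairwise distances can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy values and the object names (group, m, d, pairs) are made up for illustration, stats::dist() stands in for distance_matrix(), and scaling is applied unconditionally here rather than only when the standard deviations differ by a factor of 10 or more.

```r
# Toy static descriptors for three hypothetical time series (made-up values).
predictors_df <- data.frame(
  name = c("a", "b", "c"),
  temperature_mean = c(10, 12, 20),
  temperature_min = c(2, 3, 9),
  temperature_max = c(18, 25, 31),
  stringsAsFactors = FALSE
)

# Columns grouped into a single predictor, mirroring an entry of predictors_list.
group <- c("temperature_mean", "temperature_min", "temperature_max")

# Scale the grouped columns so that none dominates the distance
# (assumption: always scaled in this sketch).
m <- scale(as.matrix(predictors_df[, group]))
rownames(m) <- predictors_df$name

# Euclidean distance between rows, one row per time series.
d <- as.matrix(stats::dist(m, method = "euclidean"))

# Long format: one row per pair of time series, as in the model frame.
pairs <- t(utils::combn(predictors_df$name, 2))
temperature <- data.frame(
  x = pairs[, 1],
  y = pairs[, 2],
  temperature = d[pairs],
  stringsAsFactors = FALSE
)

temperature
```

Each row of the resulting data frame holds a pair of time series and the distance between their grouped temperature descriptors, which is the shape of a predictor column in the model frame.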

The resulting data frame contains the following columns:

  • x and y: names of the pair of time series represented in the row.

  • psi: dissimilarity between x and y.

  • predictor columns: distance between x and y for each static predictor.

  • (optional) distance: if the static predictors data frame is an sf object, this predictor is created via sf::st_distance().

Statistical or machine learning analyses based on this data frame may help uncover drivers of dissimilarity. Model coefficients or importance scores generated from this model frame represent the effect of the distance between predictors on the dissimilarity between time series.

This function supports a parallelization setup via future::plan(), and progress bars provided by the package progressr.
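
A typical setup might look like the sketch below. It assumes the packages future and progressr are installed; the number of workers is an arbitrary choice for illustration, and the call to distantia_model_frame() is left as a placeholder.

```r
# Assumes the packages future and progressr are installed.
# Run computations in parallel R sessions (two workers, chosen arbitrarily).
future::plan(future::multisession, workers = 2)

# Enable progress bars globally for the current session.
progressr::handlers(global = TRUE)

# ... call distantia_model_frame() here ...

# Return to sequential processing when done.
future::plan(future::sequential)
```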

Usage

distantia_model_frame(
  df = NULL,
  predictors_df = NULL,
  predictors_list = NULL,
  predictors_scaled = FALSE,
  distance = "euclidean"
)

Arguments

df

(required, data frame) Output of distantia() or distantia_aggregate(). Default: NULL

predictors_df

(required, data frame) data frame with numeric predictors to be added to the model frame. Must have a column with the names in df$x and df$y. If it is an sf data frame, the predictor "distance" is added to the model frame. Default: NULL

predictors_list

(optional, list) list defining new predictors as combinations of other predictors in predictors_df. For example, predictors_list = list(a = c("b", "c")) uses the columns "b" and "c" from predictors_df to generate the predictor a in the model frame. Default: NULL

predictors_scaled

(optional, logical) if TRUE, all predictors are scaled and centered with scale(). Default: FALSE

distance

(optional, character vector) name or abbreviation of the distance method. Valid values are in the columns "names" and "abbreviation" of the dataset distances. Default: "euclidean".

Value

data frame with the attributes "response", "predictors", and "formula".

See also

Other dissimilarity_analysis_main: distantia(), momentum()

Examples


#covid prevalence in California counties
tsl <- tsl_initialize(
  x = covid_prevalence,
  name_column = "name",
  time_column = "time"
)

#dissimilarity analysis
df <- distantia(
  tsl = tsl,
  lock_step = TRUE
)

#combine several predictors
#into a new one
predictors_list <- list(
  economy = c(
    "poverty_percentage",
    "median_income",
    "domestic_product"
    )
)

#generate model frame
model_frame <- distantia_model_frame(
  df = df,
  predictors_df = covid_counties,
  predictors_list = predictors_list,
  predictors_scaled = TRUE
)

head(model_frame)
#>        psi   distance     economy area_hectares  population poverty_percentage
#> 1 2.962963 -0.3873446  0.54789048    -0.5099580  0.06080193          0.6712709
#> 2 1.162055 -1.1308600 -1.19993287    -0.7391627 -0.40057819         -1.2480968
#> 3 2.733068 -0.5120264  0.59338777    -0.5095688 -0.17846400          0.8457589
#> 4 2.483755 -0.7102539 -0.54085294    -0.4832746  0.06892502         -1.2480968
#> 5 2.327273 -0.8640679  0.08698636    -0.7128685 -0.63172103          0.8457589
#> 6 2.767442 -0.8230488 -0.64006109    -0.4828853 -0.17034091         -1.4225848
#>   median_income domestic_product daily_miles_traveled employed_percentage
#> 1    1.05695331     -0.008809640           -1.1239494          -0.1049333
#> 2   -1.24937126     -0.466809871           -0.6256303          -0.7943577
#> 3    1.14137732     -0.299252434           -0.6794576          -0.5164909
#> 4   -0.22744568     -0.002556924           -1.1117203          -1.0612055
#> 5   -0.04939628     -0.750999949           -1.0578930          -0.2496431
#> 6   -0.14302167     -0.292999718           -0.5595739          -0.9390675

#names of response and predictors
#and an additive formula
#are stored as attributes
attributes(model_frame)$response
#> [1] "psi"
attributes(model_frame)$predictors
#> [1] "distance"             "economy"              "area_hectares"       
#> [4] "population"           "poverty_percentage"   "median_income"       
#> [7] "domestic_product"     "daily_miles_traveled" "employed_percentage" 
attributes(model_frame)$formula
#> psi ~ distance + economy + area_hectares + population + poverty_percentage + 
#>     median_income + domestic_product + daily_miles_traveled + 
#>     employed_percentage
#> <environment: 0x5580991a45e8>


#linear model
model <- lm(
  formula = attributes(model_frame)$formula,
  data = model_frame
)

summary(model)
#> 
#> Call:
#> lm(formula = attributes(model_frame)$formula, data = model_frame)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.3579 -0.4988  0.1004  0.5309  1.6577 
#> 
#> Coefficients:
#>                       Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)           3.497370   0.029861 117.120  < 2e-16 ***
#> distance              0.213906   0.032047   6.675  5.5e-11 ***
#> economy               0.392631   0.260504   1.507   0.1323    
#> area_hectares         0.095988   0.030623   3.134   0.0018 ** 
#> population            0.057896   0.031454   1.841   0.0661 .  
#> poverty_percentage   -0.195852   0.119502  -1.639   0.1017    
#> median_income         0.031322   0.095096   0.329   0.7420    
#> domestic_product     -0.161710   0.155524  -1.040   0.2989    
#> daily_miles_traveled -0.009979   0.039144  -0.255   0.7989    
#> employed_percentage   0.115063   0.051094   2.252   0.0247 *  
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 0.7495 on 620 degrees of freedom
#> Multiple R-squared:  0.1967,	Adjusted R-squared:  0.185 
#> F-statistic: 16.86 on 9 and 620 DF,  p-value: < 2.2e-16
#>