This function generates a model frame for statistical or machine learning analysis from these objects:
- Dissimilarity data frame generated by distantia(). The output model frame has as many rows as this data frame.
- Data frame with static descriptors of the time series. These descriptors are converted to distances between pairs of time series via distance_matrix().
- List defining new predictors as combinations of existing predictors. This feature groups predictors that share a common meaning. For example, predictors_list = list(temperature = c("temperature_mean", "temperature_min", "temperature_max")) generates a new predictor named "temperature", computed as the distance between the vectors of temperature variables for each pair of time series. Predictors in such a group are scaled before distance computation if their maximum standard deviations differ by a factor of 10 or more.
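To make the grouping idea concrete, here is a minimal base R sketch of how a grouped predictor such as "temperature" could be derived for each pair of time series. It does not reproduce the package internals: the data values are made up, stats::dist() stands in for distance_matrix(), and the scaling rule is read here as "the largest column standard deviation is at least 10 times the smallest", which is an assumption.

```r
# Hypothetical static descriptors for three time series (made-up values)
predictors_df <- data.frame(
  name = c("a", "b", "c"),
  temperature_mean = c(10, 15, 20),
  temperature_min = c(0, 5, 10),
  temperature_max = c(20, 25, 30)
)

# Columns to be combined into the single predictor "temperature"
group <- c("temperature_mean", "temperature_min", "temperature_max")
m <- as.matrix(predictors_df[, group])
rownames(m) <- predictors_df$name

# Assumed reading of the scaling rule: scale and center the columns
# when their standard deviations differ by a factor of 10 or more
sds <- apply(m, 2, stats::sd)
if (max(sds) / min(sds) >= 10) {
  m <- scale(m)
}

# Euclidean distance between the grouped vectors of each pair
d <- as.matrix(stats::dist(m, method = "euclidean"))

# Long format with one row per pair, mirroring the model frame layout
pairs <- t(utils::combn(rownames(m), 2))
temperature <- data.frame(
  x = pairs[, 1],
  y = pairs[, 2],
  temperature = d[pairs]
)

temperature
```

In this toy data all three columns have the same standard deviation, so no scaling is applied and the distances are computed on the raw values.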
The resulting data frame contains the following columns:
- x and y: names of the pair of time series represented in the row.
- psi: dissimilarity between x and y.
- predictor columns: distance between the values of the given static predictor for x and y.
- (optional) distance: if the static predictors data frame is an sf object, this predictor is created via sf::st_distance().
Statistical or machine learning analyses based on this data frame may help uncover drivers of dissimilarity. Model coefficients or importance scores generated from this model frame represent the effect of the distance between predictors on the dissimilarity between time series.
This function supports a parallelization setup via future::plan(), and progress bars provided by the package progressr.
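A typical session setup might look like the sketch below. It assumes the packages future and progressr are installed. The number of workers is an arbitrary choice, and for small datasets the overhead of parallelization may outweigh its benefit.

```r
# Run computations in two background R sessions
# (workers = 2 is an arbitrary, illustrative value)
future::plan(
  future::multisession,
  workers = 2
)

# Enable progress bars globally for the interactive session
# (may not work in non-interactive contexts)
progressr::handlers(global = TRUE)

# ... call distantia_model_frame() here ...

# Return to sequential processing when done
future::plan(future::sequential)
```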
Usage
distantia_model_frame(
df = NULL,
predictors_df = NULL,
predictors_list = NULL,
predictors_scaled = FALSE,
distance = "euclidean"
)
Arguments
- df
(required, data frame) Output of distantia() or distantia_aggregate(). Default: NULL
- predictors_df
(required, data frame) Data frame with numeric predictors to be added to the model frame. Must have a column with the names in df$x and df$y. If it is an sf data frame, the predictor "distance" is added to the model frame. Default: NULL
- predictors_list
(optional, list) List defining new predictors as combinations of other predictors in predictors_df. For example, predictors_list = list(a = c("b", "c")) uses the columns "b" and "c" from predictors_df to generate the predictor a in the model frame. Default: NULL
- predictors_scaled
(optional, logical) If TRUE, all predictors are scaled and centered with scale(). Default: FALSE
- distance
(optional, character vector) Name or abbreviation of the distance method. Valid values are in the columns "names" and "abbreviation" of the dataset distances. Default: "euclidean".
Examples
#covid prevalence in California counties
tsl <- tsl_initialize(
  x = covid_prevalence,
  name_column = "name",
  time_column = "time"
)
#dissimilarity analysis
df <- distantia(
  tsl = tsl,
  lock_step = TRUE
)
#combine several predictors
#into a new one
predictors_list <- list(
  economy = c(
    "poverty_percentage",
    "median_income",
    "domestic_product"
  )
)
#generate model frame
model_frame <- distantia_model_frame(
  df = df,
  predictors_df = covid_counties,
  predictors_list = predictors_list,
  predictors_scaled = TRUE
)
head(model_frame)
#> psi distance economy area_hectares population poverty_percentage
#> 1 2.962963 -0.3873446 0.54789048 -0.5099580 0.06080193 0.6712709
#> 2 1.162055 -1.1308600 -1.19993287 -0.7391627 -0.40057819 -1.2480968
#> 3 2.733068 -0.5120264 0.59338777 -0.5095688 -0.17846400 0.8457589
#> 4 2.483755 -0.7102539 -0.54085294 -0.4832746 0.06892502 -1.2480968
#> 5 2.327273 -0.8640679 0.08698636 -0.7128685 -0.63172103 0.8457589
#> 6 2.767442 -0.8230488 -0.64006109 -0.4828853 -0.17034091 -1.4225848
#> median_income domestic_product daily_miles_traveled employed_percentage
#> 1 1.05695331 -0.008809640 -1.1239494 -0.1049333
#> 2 -1.24937126 -0.466809871 -0.6256303 -0.7943577
#> 3 1.14137732 -0.299252434 -0.6794576 -0.5164909
#> 4 -0.22744568 -0.002556924 -1.1117203 -1.0612055
#> 5 -0.04939628 -0.750999949 -1.0578930 -0.2496431
#> 6 -0.14302167 -0.292999718 -0.5595739 -0.9390675
#names of response and predictors
#and an additive formula
#are stored as attributes
attributes(model_frame)$response
#> [1] "psi"
attributes(model_frame)$predictors
#> [1] "distance" "economy" "area_hectares"
#> [4] "population" "poverty_percentage" "median_income"
#> [7] "domestic_product" "daily_miles_traveled" "employed_percentage"
attributes(model_frame)$formula
#> psi ~ distance + economy + area_hectares + population + poverty_percentage +
#> median_income + domestic_product + daily_miles_traveled +
#> employed_percentage
#> <environment: 0x5580991a45e8>
#linear model
model <- lm(
  formula = attributes(model_frame)$formula,
  data = model_frame
)
summary(model)
#>
#> Call:
#> lm(formula = attributes(model_frame)$formula, data = model_frame)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.3579 -0.4988 0.1004 0.5309 1.6577
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.497370 0.029861 117.120 < 2e-16 ***
#> distance 0.213906 0.032047 6.675 5.5e-11 ***
#> economy 0.392631 0.260504 1.507 0.1323
#> area_hectares 0.095988 0.030623 3.134 0.0018 **
#> population 0.057896 0.031454 1.841 0.0661 .
#> poverty_percentage -0.195852 0.119502 -1.639 0.1017
#> median_income 0.031322 0.095096 0.329 0.7420
#> domestic_product -0.161710 0.155524 -1.040 0.2989
#> daily_miles_traveled -0.009979 0.039144 -0.255 0.7989
#> employed_percentage 0.115063 0.051094 2.252 0.0247 *
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.7495 on 620 degrees of freedom
#> Multiple R-squared: 0.1967, Adjusted R-squared: 0.185
#> F-statistic: 16.86 on 9 and 620 DF, p-value: < 2.2e-16
#>
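
The formula stored in the attributes includes every predictor. To fit a model on a subset of them, an additive formula can be rebuilt from the response and predictor names with stats::reformulate(). The sketch below is self-contained: the two character vectors are hypothetical stand-ins for attributes(model_frame)$response and a selection from attributes(model_frame)$predictors.

```r
# Stand-ins for the names stored in the model frame attributes
response <- "psi"
predictors <- c("distance", "economy", "area_hectares")

# Additive formula restricted to the selected predictors
f <- stats::reformulate(
  termlabels = predictors,
  response = response
)

f
#> psi ~ distance + economy + area_hectares
```

The resulting formula can be passed to lm() or any other modelling function that accepts a formula, together with the model frame as data.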