Prepare sequences for a comparison analysis. — prepareSequences • distantia

This function prepares two or more multivariate time-series for comparison. It handles two different scenarios:

Two dataframes: The user provides two separate dataframes, each containing a multivariate time-series. These time-series can be regular or irregular, aligned or unaligned, but must share at least a few columns with the same names and units (pay attention to differences in case between column names representing the same entity). This mode uses exclusively the following arguments: sequence.A, sequence.A.name (optional), sequence.B, sequence.B.name (optional), and merge.mode.
One long dataframe: The user provides a single dataframe, through the sequences argument, with two or more multivariate time-series identified by a grouping.column.
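
As a rough illustration of the two input layouts, the sketch below builds toy data in base R. The object names (seq.a, seq.b, long.df), the column names (time, pinus, quercus, betula, site) and all values are hypothetical; they are not taken from the package datasets.

```r
# Scenario 1: two separate dataframes (hypothetical toy data).
# Shared column names must match exactly, including case.
seq.a <- data.frame(
  time = c(1, 2, 3),
  pinus = c(10, 12, 9),
  quercus = c(5, 7, 8)
)
seq.b <- data.frame(
  time = c(1.5, 2.5),
  pinus = c(3, 4),
  betula = c(1, 0)
)

# Columns present in both dataframes
common.columns <- intersect(colnames(seq.a), colnames(seq.b))

# Scenario 2: one long dataframe, with sequences identified by a grouping column
long.df <- data.frame(
  site = c("A", "A", "A", "B", "B"),
  time = c(1, 2, 3, 1.5, 2.5),
  pinus = c(10, 12, 9, 3, 4)
)

# Number of samples per sequence
samples.per.sequence <- table(long.df$site)
```

With data shaped like this, the first scenario would correspond to sequence.A = seq.a and sequence.B = seq.b, and the second to sequences = long.df with grouping.column = "site".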

prepareSequences(
  sequence.A = NULL,
  sequence.A.name = "A",
  sequence.B = NULL,
  sequence.B.name = "B",
  merge.mode = "complete",
  sequences = NULL,
  grouping.column = NULL,
  time.column = NULL,
  exclude.columns = NULL,
  if.empty.cases = "zero",
  transformation = "none",
  paired.samples = FALSE,
  same.time = FALSE
  )

Arguments

sequence.A: dataframe containing a multivariate time-series.
sequence.A.name: character string with the name of sequence.A. Will be used as identifier in the id column of the output dataframe.
sequence.B: dataframe containing a multivariate time-series. Must have overlapping columns with sequence.A with same column names and units.
sequence.B.name: character string with the name of sequence.B. Will be used as identifier in the id column of the output dataframe.
merge.mode: character string, one of: "overlap", "complete" (default option). If "overlap", sequence.A and sequence.B are merged by their common columns, and non-common columns are dropped. If "complete", columns absent in one dataset but present in the other are added, with values equal to 0. This argument is ignored if sequences is provided instead of sequence.A and sequence.B.
sequences: dataframe with multiple sequences identified by a grouping column.
grouping.column: character string, name of the column in sequences used to identify the separate sequences within the dataframe. If two sequences are provided through the arguments sequence.A and sequence.B, this argument defines the name of the grouping column in the output dataframe. If two or more sequences are provided as a single dataframe through the argument sequences, then grouping.column must be a column in this dataframe.
time.column: character string, name of the column with time/depth/rank data. If sequence.A and sequence.B are provided, time.column must have the same name and units in both dataframes.
exclude.columns: character string or character vector with column names in sequences, or sequence.A and sequence.B, to be excluded from the transformation.
if.empty.cases: character string with two possible values: "omit", or "zero". If "zero" (default), NA values are replaced by zeroes. If "omit", rows with NA data are removed.
transformation: character string. Defines what data transformation is to be applied to the sequences. One of: "none" (default), "percentage", "proportion", "hellinger", and "scale" (the latter centers and scales the data using the scale function).
paired.samples: boolean. If TRUE, the function will test if the datasets have paired samples. This means that each dataset must have the same number of rows/samples, and that, if available, the time.column must have the same values in every dataset. This is only mandatory when using the functions distancePairedSamples or workflowPsi with paired.samples = TRUE after preparing the sequences. The default setting is FALSE.
same.time: boolean. If TRUE, samples in the sequences to compare will be tested to check if they have the same time/age/depth according to time.column. This argument is only useful when the user needs to compare two sequences taken at different sites but over the same time frame.
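
The behaviour described for merge.mode, if.empty.cases and transformation can be sketched in base R as follows. This is only an illustration of the documented semantics on hypothetical toy data, not the package's internal code; the helper fill.missing is invented for this example, and the transformations assume the usual row-wise (per-sample) definitions.

```r
# Hypothetical toy data; not the package's internal code.
a <- data.frame(pinus = c(10, 12), quercus = c(5, 7))
b <- data.frame(pinus = c(3, NA), betula = c(1, 2))

# merge.mode = "overlap": keep only the columns shared by both dataframes
common <- intersect(colnames(a), colnames(b))
overlap <- rbind(
  cbind(id = "A", a[, common, drop = FALSE]),
  cbind(id = "B", b[, common, drop = FALSE])
)

# merge.mode = "complete": add the columns missing from each dataframe, filled with 0
all.columns <- union(colnames(a), colnames(b))
fill.missing <- function(df, columns) {
  # hypothetical helper: adds absent columns as zeroes and reorders the columns
  for (column in setdiff(columns, colnames(df))) {
    df[[column]] <- 0
  }
  df[, columns, drop = FALSE]
}
complete <- rbind(
  cbind(id = "A", fill.missing(a, all.columns)),
  cbind(id = "B", fill.missing(b, all.columns))
)

# if.empty.cases = "zero": replace NA values with 0
complete.zero <- complete
complete.zero[is.na(complete.zero)] <- 0

# if.empty.cases = "omit": drop rows containing NA values
complete.omit <- complete[stats::complete.cases(complete), ]

# transformation: assumed row-wise definitions, applied to the numeric columns only
counts <- as.matrix(complete.zero[, all.columns])
proportion <- counts / rowSums(counts)  # each row sums to 1
percentage <- proportion * 100          # each row sums to 100
hellinger <- sqrt(proportion)           # square root of the proportions
```

Note that "overlap" discards quercus and betula entirely, while "complete" keeps them and treats their absence as zero abundance, which can affect any dissimilarity computed afterwards.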

Value

A dataframe with the multivariate time-series. If sequence.A and sequence.B are provided, the column identifying the sequences is named "id". If sequences is provided, the time-series are identified by grouping.column.
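
Because the result is a single long dataframe, individual sequences can be recovered by splitting on the identifier column. The sketch below uses a hypothetical stand-in for the output (the object prepared and its values are invented for illustration) and assumes the identifier column is named "id".

```r
# Hypothetical stand-in for the output of prepareSequences()
prepared <- data.frame(
  id = c("A", "A", "B", "B", "B"),
  pinus = c(0.8, 0.7, 0.5, 0.4, 0.6),
  quercus = c(0.2, 0.3, 0.5, 0.6, 0.4)
)

# Split the long dataframe into one dataframe per sequence,
# dropping the identifier column from each piece
by.sequence <- split(prepared[, colnames(prepared) != "id"], prepared$id)

# Number of samples in each sequence
samples <- sapply(by.sequence, nrow)
```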

Author

Blas Benito <blasbenito@gmail.com>

Examples


library(distantia)

#two sequences as inputs
data(sequenceA)
data(sequenceB)

AB.sequences <- prepareSequences(
 sequence.A = sequenceA,
 sequence.A.name = "A",
 sequence.B = sequenceB,
 sequence.B.name = "B",
 merge.mode = "complete",
 if.empty.cases = "zero",
 transformation = "hellinger"
 )


#several sequences in a single dataframe
data(sequencesMIS)
MIS.sequences <- prepareSequences(
 sequences = sequencesMIS,
 grouping.column = "MIS",
 if.empty.cases = "zero",
 transformation = "hellinger"
 )