This workflow covers the following scenario: the user has a short sequence and a long sequence, and wants to find the segment of the long sequence that best matches the short sequence. The function automatically identifies the short and the long sequence, but throws an error if more than two sequences are provided. The lengths of the segments of the long sequence to be compared with the short sequence are defined through the arguments min.length and max.length. If left empty (NULL, the default), the segments to be searched will have the same number of cases as the short sequence. Note that this is a brute-force algorithm, and it can have a large memory footprint if the interval between min.length and max.length is too wide. It might be convenient to pre-check the number of iterations to be performed by computing sum(nrow(long.sequence) - min.length:max.length) + 1. The algorithm is parallelized and optimized as far as possible, so large searches are still feasible.
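The pre-check mentioned in the description can be wrapped in a small helper. This is a minimal sketch in base R: the function name estimateIterations and its argument n.long are hypothetical and not part of the package, and the formula is taken verbatim from the description, so the result should be read as an estimate of the search size rather than an exact row count.

```r
# Hypothetical helper (not part of the package): estimates the number of
# iterations with the formula given in the description.
# n.long is the number of rows of the long sequence.
estimateIterations <- function(n.long, min.length, max.length) {
  stopifnot(
    min.length >= 1,
    min.length <= max.length,
    max.length <= n.long
  )
  # ":" binds tighter than "-", so this subtracts every
  # candidate length from n.long before summing
  sum(n.long - min.length:max.length) + 1
}

# values used in the Examples section: 30 rows, lengths 9 to 11
estimateIterations(n.long = 30, min.length = 9, max.length = 11)
#> [1] 61
```

Running the helper on the intended min.length and max.length before launching the workflow gives a quick sense of whether the interval should be narrowed.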

workflowPartialMatch(
  sequences = NULL,
  grouping.column = NULL,
  time.column = NULL,
  exclude.columns = NULL,
  method = "manhattan",
  diagonal = FALSE,
  paired.samples = FALSE,
  min.length = NULL,
  max.length = NULL,
  ignore.blocks = FALSE,
  parallel.execution = TRUE
  )

Arguments

sequences

dataframe with multiple sequences identified by a grouping column generated by prepareSequences.

grouping.column

character string, name of the column in sequences to be used to identify separate sequences within the file.

time.column

character string, name of the column with time/depth/rank data.

exclude.columns

character string or character vector with column names in sequences to be excluded from the analysis.

method

character string naming a distance metric. Valid entries are: "manhattan", "euclidean", "chi", and "hellinger". Invalid entries will throw an error.

diagonal

boolean, if TRUE, diagonals are included in the computation of the least cost path. Default value is FALSE. Setting it to TRUE is the best option if the user suspects that a given segment in the long sequence might be identical to the short sequence.

paired.samples

boolean, if TRUE, the sequences are assumed to be aligned, and distances are computed for paired-samples only (no distance matrix required). Default value is FALSE.

min.length

integer, minimum length (in rows) of the subsets of the long sequence to be matched against the short sequence. If NULL (default), the subset of the long sequence to be matched will have the same number of samples as the short sequence.

max.length

integer, maximum length (in rows) of the subsets of the long sequence to be matched against the short sequence. If NULL (default), the subset of the long sequence to be matched will have the same number of samples as the short sequence.

ignore.blocks

boolean. If TRUE, the function leastCostPathNoBlocks analyzes the least-cost path of the best solution and removes blocks (straight, orthogonal sections of the least-cost path), which occur in highly dissimilar sections of the sequences and inflate output psi values.

parallel.execution

boolean, if TRUE (default), execution is parallelized; if FALSE, it is serialized.
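To illustrate the difference between the default behavior and paired.samples = TRUE, the sketch below computes Manhattan distances between two toy sequences in base R. The matrices a and b and the helper manhattan are hypothetical, and the code uses the standard definition of the Manhattan distance; it is not the package's internal implementation.

```r
# Hypothetical toy sequences: 3 samples (rows) by 2 variables (columns)
a <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)
b <- matrix(c(1, 3, 5, 4, 4, 8), nrow = 3)

# Manhattan distance between two samples (standard definition)
manhattan <- function(x, y) sum(abs(x - y))

# full distance matrix: every sample of a against every sample of b
distance.matrix <- outer(
  seq_len(nrow(a)),
  seq_len(nrow(b)),
  Vectorize(function(i, j) manhattan(a[i, ], b[j, ]))
)

# paired samples: only row i of a against row i of b,
# so no distance matrix is needed
paired.distances <- vapply(
  seq_len(nrow(a)),
  function(i) manhattan(a[i, ], b[i, ]),
  numeric(1)
)

paired.distances
#> [1] 0 2 4
```

The paired distances are the diagonal of the full distance matrix, which is why the paired approach only makes sense when both sequences are aligned and have the same number of samples.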

Value

A dataframe with three columns:

  • first.row first row of the segment in the long sequence matched against the short one.

  • last.row last row of the segment in the long sequence matched against the short one.

  • psi psi values, ordered from lower (maximum similarity / minimum dissimilarity) to higher.
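Since the output is ordered by psi, the best match is the row with the lowest value. The sketch below shows one way to extract the matching segment in base R. The dataframe matches copies the first four rows of the output shown in the Examples section, and long.sequence is a hypothetical stand-in for the user's long sequence.

```r
# first rows of the output shown in the Examples section
matches <- data.frame(
  first.row = c(1, 1, 2, 1),
  last.row = c(10, 9, 10, 11),
  psi = c(0.00000000, 0.04186785, 0.04186785, 0.04897612)
)

# hypothetical long sequence with 30 samples
long.sequence <- data.frame(
  id = "long",
  sample = 1:30
)

# row with the lowest psi (maximum similarity)
best.match <- matches[which.min(matches$psi), ]

# rows of the long sequence covered by the best match
best.segment <- long.sequence[best.match$first.row:best.match$last.row, ]
nrow(best.segment)
#> [1] 10
```

Using which.min instead of taking the first row keeps the selection correct even if the dataframe has been reordered after the workflow returned it.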

Examples

#loading the data
data(sequencesMIS)

#removing grouping column
sequencesMIS$MIS <- NULL

#mock-up short sequence
MIS.short <- sequencesMIS[1:10, ]

#mock-up long sequence
MIS.long <- sequencesMIS[1:30, ]

#preparing sequences
MIS.sequences <- prepareSequences(
  sequence.A = MIS.short,
  sequence.A.name = "short",
  sequence.B = MIS.long,
  sequence.B.name = "long",
  grouping.column = "id",
  transformation = "hellinger"
  )

#matching sequences
#min.length and max.length are
#minimal to speed up execution
MIS.psi <- workflowPartialMatch(
  sequences = MIS.sequences,
  grouping.column = "id",
  time.column = NULL,
  exclude.columns = NULL,
  method = "manhattan",
  diagonal = FALSE,
  min.length = nrow(MIS.short) - 1,
  max.length = nrow(MIS.short) + 1,
  parallel.execution = FALSE
  )

#output dataframe
MIS.psi
#>    first.row last.row        psi
#> 1          1       10 0.00000000
#> 2          1        9 0.04186785
#> 3          2       10 0.04186785
#> 4          1       11 0.04897612
#> 5          4       12 0.09703770
#> 6          2       12 0.10109105
#> 7          2       11 0.10299662
#> 8          4       13 0.10988795
#> 9          3       11 0.13761871
#> 10         3       13 0.14468749
#> 11         3       12 0.14670183
#> 12         4       14 0.18852418
#> 13         5       13 0.21432458
#> 14         5       14 0.30536969
#> 15         5       15 0.33488624
#> 16         6       14 0.57821139
#> 17         6       15 0.61305130
#> 18         6       16 0.63634091
#> 19         7       15 0.74761030
#> 20         7       16 0.78494032
#> 21         8       16 0.78770217
#> 22         8       18 0.79427869
#> 23         7       17 0.82110450
#> 24         8       17 0.84581066
#> 25        10       18 0.94973066
#> 26        10       19 0.94976310
#> 27         9       19 1.00295875
#> 28        10       20 1.01443940
#> 29        11       19 1.01904509
#> 30         9       18 1.02772464
#> 31         9       17 1.07003962
#> 32        11       20 1.10632553
#> 33        11       21 1.13146256
#> 34        12       20 1.16040938
#> 35        13       21 1.17892515
#> 36        13       22 1.19351857
#> 37        12       22 1.19748392
#> 38        12       21 1.20849627
#> 39        13       23 1.22813709
#> 40        18       26 1.23645088
#> 41        17       25 1.26786779
#> 42        16       24 1.28207220
#> 43        17       26 1.29484089
#> 44        18       27 1.31848523
#> 45        15       24 1.34176398
#> 46        17       27 1.34593612
#> 47        15       23 1.37536323
#> 48        14       24 1.37784193
#> 49        16       26 1.37813034
#> 50        16       25 1.38249181
#> 51        14       22 1.38491609
#> 52        15       25 1.40978944
#> 53        14       23 1.43844115
#> 54        18       28 1.47191850
#> 55        19       27 1.52764541
#> 56        19       28 1.70680392
#> 57        20       28 1.80959686
#> 58        19       29 2.06457705
#> 59        20       29 2.21228582
#> 60        21       29 2.25759183