workflowShortInLong.Rd
This workflow applies to the following scenario: the user has a short sequence and a long sequence, and wants to find the segment of the long sequence that best matches the short sequence. The size of the candidate segments in the long sequence is defined by the user through the arguments min.length and max.length. If left empty, min.length equals 75 percent of the length of the short sequence, and max.length equals 125 percent of the length of the short sequence. Note that this is a brute-force algorithm, and a large difference between both arguments may generate a very large number of subsets of the long sequence. The algorithm is parallelized and optimized as far as possible, so large searches are still feasible.
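To get a feel for how quickly the search space grows, the following base R sketch enumerates candidate segments for a given pair of sequence lengths. It is only an illustration: the lengths are hypothetical, the use of floor() and ceiling() to round the default limits is an assumption, and the enumeration is one plausible reading of "brute force", not necessarily how the package builds its subsets.

```r
# Hypothetical lengths, chosen only for illustration
short.length <- 10  # rows in the short sequence
long.length <- 30   # rows in the long sequence

# Default limits described above: 75% and 125% of the short sequence length.
# Rounding with floor() and ceiling() is an assumption of this sketch.
min.length <- floor(0.75 * short.length)
max.length <- ceiling(1.25 * short.length)

# Enumerate every contiguous segment of the long sequence
# whose length lies between min.length and max.length
candidates <- do.call(
  rbind,
  lapply(
    min.length:max.length,
    function(segment.length) {
      first.row <- seq_len(long.length - segment.length + 1)
      data.frame(
        first.row = first.row,
        last.row = first.row + segment.length - 1
      )
    }
  )
)

# Number of segments that would have to be compared with the short sequence
nrow(candidates)
```

Widening the gap between min.length and max.length, or using a longer long sequence, increases this count rapidly, which is why narrow limits are advisable for quick tests.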
workflowShortInLong(
  sequences = NULL,
  grouping.column = NULL,
  time.column = NULL,
  exclude.columns = NULL,
  method = "manhattan",
  diagonal = FALSE,
  paired.samples = FALSE,
  min.length = NULL,
  max.length = NULL,
  parallel.execution = TRUE
)
Argument | Description |
---|---|
sequences | dataframe with multiple sequences identified by a grouping column generated by |
grouping.column | character string, name of the column in |
time.column | character string, name of the column with time/depth/rank data. |
exclude.columns | character string or character vector with column names in |
method | character string naming a distance metric. Valid entries are: "manhattan", "euclidean", "chi", and "hellinger". Invalid entries will throw an error. |
diagonal | boolean, if |
paired.samples | boolean, if |
min.length | integer, minimum length (in rows) of the subsets of the long sequence to be matched against the short sequence. If |
max.length | integer, maximum length (in rows) of the subsets of the long sequence to be matched against the short sequence. If |
parallel.execution | boolean, if |
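As a rough illustration of what the method argument selects, the sketch below gives textbook definitions of three of the listed metrics for two numeric samples. These definitions and the helper sample.distance() are assumptions made for this example: the package's own formulas may differ in detail, and "chi" is left out because its exact form is not given here.

```r
# Textbook definitions for two numeric vectors of equal length.
# These are assumptions of this sketch, not the package's source code.
manhattan <- function(x, y) sum(abs(x - y))
euclidean <- function(x, y) sqrt(sum((x - y)^2))
hellinger <- function(x, y) sqrt(0.5 * sum((sqrt(x) - sqrt(y))^2))

# Hypothetical helper that rejects unknown method names,
# mimicking the "invalid entries will throw an error" behaviour
sample.distance <- function(x, y, method = "manhattan") {
  method <- match.arg(
    method,
    choices = c("manhattan", "euclidean", "hellinger")
  )
  switch(
    method,
    manhattan = manhattan(x, y),
    euclidean = euclidean(x, y),
    hellinger = hellinger(x, y)
  )
}

# Two made-up samples
x <- c(1, 2, 3)
y <- c(2, 2, 5)

sample.distance(x, y, method = "manhattan")
sample.distance(x, y, method = "euclidean")
```

Calling sample.distance() with a name outside the listed choices stops with an error raised by match.arg().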
A dataframe with three columns:
first.row first row of the segment in the long sequence matched against the short one.
last.row last row of the segment in the long sequence matched against the short one.
psi psi values, ordered from lower (maximum similarity / minimum dissimilarity) to higher.
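The returned columns can be used to pull the best-matching segment out of the long sequence. The sketch below uses a made-up result and a made-up long sequence purely to show the indexing; none of the values come from a real run of workflowShortInLong().

```r
# Made-up output with the three documented columns (illustrative values only)
result <- data.frame(
  first.row = c(12, 1, 5),
  last.row = c(21, 10, 15),
  psi = c(1.8, 0.0, 0.9)
)

# The output is documented as ordered by psi; sorting here is a safeguard
result <- result[order(result$psi), ]

# The first row holds the segment with the lowest psi (highest similarity)
best <- result[1, ]

# Hypothetical long sequence with 30 rows
long.sequence <- data.frame(
  id = "long",
  value = seq_len(30)
)

# Rows of the long sequence that make up the best-matching segment
best.segment <- long.sequence[best$first.row:best$last.row, ]
best.segment
```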
#loading the data
data(sequencesMIS)

#removing grouping column
sequencesMIS$MIS <- NULL

#mock-up short sequence
MIS.short <- sequencesMIS[1:10, ]

#mock-up long sequence
MIS.long <- sequencesMIS[1:30, ]

#preparing sequences
MIS.sequences <- prepareSequences(
  sequence.A = MIS.short,
  sequence.A.name = "short",
  sequence.B = MIS.long,
  sequence.B.name = "long",
  grouping.column = "id",
  time.column = "age",
  transformation = "hellinger"
)
#> Warning: The argument 'time.column' has the value age but I couldn't find that column name in the input datasets. I will ignore this column.
#> Error in data.frame(time = time.column.data, sequences, stringsAsFactors = FALSE): object 'time.column.data' not found

#matching sequences
#min.length and max.length are
#minimal to speed up execution
MIS.psi <- workflowShortInLong(
  sequences = MIS.sequences,
  grouping.column = "id",
  time.column = NULL,
  exclude.columns = NULL,
  method = "manhattan",
  diagonal = FALSE,
  min.length = nrow(MIS.short) - 1,
  max.length = nrow(MIS.short) + 1,
  parallel.execution = FALSE
)
#> Error in is.data.frame(x): object 'MIS.sequences' not found

#output dataframe
MIS.psi
#> Error in eval(expr, envir, enclos): object 'MIS.psi' not found