---
title: "distantia: an R package to compute the dissimilarity between multivariate time-series"
output:
  github_document:
    toc: true
    toc_depth: 2
    pandoc_args: --webtex
always_allow_html: yes
---
The package distantia allows measuring the dissimilarity between multivariate time-series (hereafter, sequences). The package assumes that the target sequences are ordered along a given dimension, depth and time being the most common ones, though others such as latitude or elevation are also suitable. The target sequences can be regular or irregular, and can have their samples aligned (same age/time/depth) or unaligned (different age/time/depth). The only requirement is that the sequences share at least two (but ideally more) columns with the same name and units.
In this document I briefly explain the logic behind the method, show how to use it, and demonstrate how the distantia package introduces useful tools to compare multivariate time-series. The topics covered in this document are:
You can install the released version of distantia from CRAN with:
And the development version from GitHub with:
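Both commands are sketched below; the GitHub repository path is an assumption and may need adjusting:

```r
# released version from CRAN
install.packages("distantia")

# development version from GitHub (repository path assumed)
remotes::install_github("BlasBenito/distantia")
```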
Loading the library, plus other helper libraries:
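A plausible set of calls, inferred from the functions used later in this document (kable, viridis scales, ggplot2, tidyr):

```r
library(distantia)
library(knitr)    # provides kable
library(ggplot2)
library(viridis)
library(tidyr)
```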
This section assumes that the user wants to compare two sequences. The package provides two example datasets based on the Abernethy pollen core (Birks and Mathewes, 1978):
data(sequenceA)
data(sequenceB)
str(sequenceA)
#> 'data.frame': 49 obs. of 9 variables:
#> $ betula: int 79 113 51 130 31 59 78 71 140 150 ...
#> $ pinus : int 271 320 420 470 450 425 386 397 310 323 ...
#> $ corylu: int 36 42 39 6 6 12 29 52 50 34 ...
#> $ junipe: int 0 0 0 0 0 0 2 2 2 2 ...
#> $ empetr: int 4 4 2 0 3 0 0 0 0 0 ...
#> $ gramin: int 7 3 1 2 2 2 0 6 4 11 ...
#> $ cypera: int 25 11 12 4 3 3 2 3 3 2 ...
#> $ artemi: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ rumex : int 0 0 0 0 0 0 0 0 0 0 ...
str(sequenceB)
#> 'data.frame': 41 obs. of 8 variables:
#> $ betula: int 19 18 30 26 31 24 23 48 29 23 ...
#> $ pinus : int 175 119 99 101 99 97 105 112 108 110 ...
#> $ corylu: int NA 28 37 29 30 28 34 46 16 21 ...
#> $ junipe: int 2 1 0 0 0 0 0 0 0 0 ...
#> $ gramin: int 34 36 2 0 1 2 1 0 6 2 ...
#> $ cypera: int 39 44 20 18 10 9 6 12 3 11 ...
#> $ artemi: int 1 0 0 0 0 0 0 0 0 0 ...
#> $ rumex : int 0 4 1 0 0 0 0 0 0 1 ...
kable(sequenceA[1:15, ], caption = "Sequence A")
betula | pinus | corylu | junipe | empetr | gramin | cypera | artemi | rumex |
---|---|---|---|---|---|---|---|---|
79 | 271 | 36 | 0 | 4 | 7 | 25 | 0 | 0 |
113 | 320 | 42 | 0 | 4 | 3 | 11 | 0 | 0 |
51 | 420 | 39 | 0 | 2 | 1 | 12 | 0 | 0 |
130 | 470 | 6 | 0 | 0 | 2 | 4 | 0 | 0 |
31 | 450 | 6 | 0 | 3 | 2 | 3 | 0 | 0 |
59 | 425 | 12 | 0 | 0 | 2 | 3 | 0 | 0 |
78 | 386 | 29 | 2 | 0 | 0 | 2 | 0 | 0 |
71 | 397 | 52 | 2 | 0 | 6 | 3 | 0 | 0 |
140 | 310 | 50 | 2 | 0 | 4 | 3 | 0 | 0 |
150 | 323 | 34 | 2 | 0 | 11 | 2 | 0 | 0 |
175 | 317 | 37 | 2 | 0 | 11 | 3 | 0 | 0 |
181 | 345 | 28 | 3 | 0 | 7 | 3 | 0 | 0 |
153 | 285 | 36 | 2 | 0 | 8 | 3 | 0 | 1 |
214 | 315 | 54 | 2 | 1 | 13 | 5 | 0 | 0 |
200 | 210 | 41 | 6 | 0 | 10 | 4 | 0 | 0 |
betula | pinus | corylu | junipe | gramin | cypera | artemi | rumex |
---|---|---|---|---|---|---|---|
19 | 175 | NA | 2 | 34 | 39 | 1 | 0 |
18 | 119 | 28 | 1 | 36 | 44 | 0 | 4 |
30 | 99 | 37 | 0 | 2 | 20 | 0 | 1 |
26 | 101 | 29 | 0 | 0 | 18 | 0 | 0 |
31 | 99 | 30 | 0 | 1 | 10 | 0 | 0 |
24 | 97 | 28 | 0 | 2 | 9 | 0 | 0 |
23 | 105 | 34 | 0 | 1 | 6 | 0 | 0 |
48 | 112 | 46 | 0 | 0 | 12 | 0 | 0 |
29 | 108 | 16 | 0 | 6 | 3 | 0 | 0 |
23 | 110 | 21 | 0 | 2 | 11 | 0 | 1 |
5 | 119 | 19 | 0 | 1 | 1 | 0 | 0 |
30 | 105 | NA | 0 | 9 | 7 | 0 | 0 |
22 | 116 | 17 | 0 | 1 | 7 | 0 | 0 |
24 | 115 | 20 | 0 | 2 | 4 | 0 | 0 |
26 | 119 | 23 | 0 | 4 | 0 | 0 | 0 |
Notice that sequenceB has a few NA values (introduced to serve as an example). The function prepareSequences gets the sequences ready for analysis by matching column names and handling empty data, and can merge two (or more) multivariate time-series into a single table ready for further analyses. Note that, since the data represents pollen abundances, a Hellinger transformation (square root of the relative proportions of each taxon) is applied.
help(prepareSequences)
AB.sequences <- prepareSequences(
sequence.A = sequenceA,
sequence.A.name = "A",
sequence.B = sequenceB,
sequence.B.name = "B",
merge.mode = "complete",
if.empty.cases = "zero",
transformation = "hellinger"
)
kable(AB.sequences[1:15, ], digits = 4)
id | betula | pinus | corylu | junipe | empetr | gramin | cypera | artemi | rumex |
---|---|---|---|---|---|---|---|---|---|
A | 0.4327 | 0.8014 | 0.2921 | 0.0002 | 0.0974 | 0.1288 | 0.2434 | 2e-04 | 0.0002 |
A | 0.4788 | 0.8057 | 0.2919 | 0.0001 | 0.0901 | 0.0780 | 0.1494 | 1e-04 | 0.0001 |
A | 0.3117 | 0.8944 | 0.2726 | 0.0001 | 0.0617 | 0.0436 | 0.1512 | 1e-04 | 0.0001 |
A | 0.4609 | 0.8763 | 0.0990 | 0.0001 | 0.0001 | 0.0572 | 0.0808 | 1e-04 | 0.0001 |
A | 0.2503 | 0.9535 | 0.1101 | 0.0001 | 0.0778 | 0.0636 | 0.0778 | 1e-04 | 0.0001 |
A | 0.3432 | 0.9210 | 0.1548 | 0.0001 | 0.0001 | 0.0632 | 0.0774 | 1e-04 | 0.0001 |
A | 0.3962 | 0.8813 | 0.2416 | 0.0634 | 0.0001 | 0.0001 | 0.0634 | 1e-04 | 0.0001 |
A | 0.3657 | 0.8647 | 0.3129 | 0.0614 | 0.0001 | 0.1063 | 0.0752 | 1e-04 | 0.0001 |
A | 0.5245 | 0.7804 | 0.3134 | 0.0627 | 0.0001 | 0.0886 | 0.0768 | 1e-04 | 0.0001 |
A | 0.5361 | 0.7866 | 0.2552 | 0.0619 | 0.0001 | 0.1452 | 0.0619 | 1e-04 | 0.0001 |
A | 0.5667 | 0.7627 | 0.2606 | 0.0606 | 0.0001 | 0.1421 | 0.0742 | 1e-04 | 0.0001 |
A | 0.5650 | 0.7800 | 0.2222 | 0.0727 | 0.0001 | 0.1111 | 0.0727 | 1e-04 | 0.0001 |
A | 0.5599 | 0.7642 | 0.2716 | 0.0640 | 0.0001 | 0.1280 | 0.0784 | 1e-04 | 0.0453 |
A | 0.5952 | 0.7222 | 0.2990 | 0.0575 | 0.0407 | 0.1467 | 0.0910 | 1e-04 | 0.0001 |
A | 0.6516 | 0.6677 | 0.2950 | 0.1129 | 0.0001 | 0.1457 | 0.0922 | 1e-04 | 0.0001 |
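As a sanity check, the Hellinger transformation described above can be sketched in base R and applied to the first sample of sequenceA (values copied from the raw table at the top of this document; prepareSequences additionally replaces zeros with a small pseudo-value, which is why the table shows values like 0.0002 instead of 0):

```r
# Hellinger transformation: square root of the relative proportion of each taxon
hellinger <- function(x) sqrt(x / sum(x))

# first sample of sequenceA
sample.1 <- c(betula = 79, pinus = 271, corylu = 36, junipe = 0, empetr = 4,
              gramin = 7, cypera = 25, artemi = 0, rumex = 0)
round(hellinger(sample.1), 4)
# betula and pinus match the 0.4327 and 0.8014 shown in the first row above
```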
The computation of dissimilarity between two sequences requires several steps.
1. Computation of a distance matrix among the samples of both sequences. It is computed by the distanceMatrix function, which allows the user to select a distance metric (the ones implemented so far are manhattan, euclidean, chi, and hellinger). The function plotMatrix allows an easy visualization of the distance matrix.
#computing distance matrix
AB.distance.matrix <- distanceMatrix(
sequences = AB.sequences,
method = "manhattan"
)
#plotting distance matrix
plotMatrix(
distance.matrix = AB.distance.matrix,
color.palette = "viridis")
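For intuition, a simplified base-R sketch of what distanceMatrix computes, assuming the manhattan metric and plain numeric matrices (the real function additionally handles data frames, grouping columns, and the other metrics):

```r
# manhattan distance between every pair of samples (rows) of two sequences
distance.matrix.sketch <- function(a, b) {
  m <- matrix(0, nrow = nrow(a), ncol = nrow(b))
  for (i in seq_len(nrow(a))) {
    for (j in seq_len(nrow(b))) {
      m[i, j] <- sum(abs(a[i, ] - b[j, ]))
    }
  }
  m
}
```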
2. Computation of the least-cost path within the distance matrix. This step uses a dynamic programming algorithm to find the least-cost route between the cell 1,1 of the matrix (lower left in the image above) and the last cell of the matrix (opposite corner). It does so by first solving every partial solution (in the neighborhood of every cell in the matrix), and then propagating the sums towards the opposite extreme of the matrix.
The value of the upper-right cell of the plotted matrix (actually the lower-right cell of the underlying data matrix; the matrix is rotated in the plot) is the sum of the minimum distances across all samples of both time-series.
AB.least.cost.matrix <- leastCostMatrix(
distance.matrix = AB.distance.matrix
)
plotMatrix(
distance.matrix = AB.least.cost.matrix,
color.palette = "viridis")
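For reference, the recursion behind this step can be sketched in base R (a simplified version with orthogonal moves only, assuming a matrix with at least two rows and columns):

```r
least.cost.sketch <- function(d) {
  m <- matrix(NA, nrow(d), ncol(d))
  m[1, 1] <- d[1, 1]
  # first column and first row can only be reached from one direction
  for (i in 2:nrow(d)) m[i, 1] <- m[i - 1, 1] + d[i, 1]
  for (j in 2:ncol(d)) m[1, j] <- m[1, j - 1] + d[1, j]
  # every other cell: its own distance plus the cheapest of the two neighbors
  for (i in 2:nrow(d)) {
    for (j in 2:ncol(d)) {
      m[i, j] <- d[i, j] + min(m[i - 1, j], m[i, j - 1])
    }
  }
  m
}
```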
Optional In this optional step, the function leastCostPath finds the least-cost path through the matrix above. It returns a dataframe with the coordinates of the cells in the path, the distance between consecutive cells, and the cumulative distance at any given step of the path. It can be used to plot the path, which helps to better understand the alignment between both sequences.
This dataframe will be used later to generate other products, such as the slotting of both sequences (a single sequence containing the samples of both sequences in the order that minimizes the distance between consecutive samples), or to transfer attributes such as age/time/depth from one sequence to another that lacks them. Note: these applications are not yet implemented.
AB.least.cost.path <- leastCostPath(
distance.matrix = AB.distance.matrix,
least.cost.matrix = AB.least.cost.matrix
)
kable(AB.least.cost.path[[1]][1:15, ], digits = 4)
A | B | distance | cumulative.distance |
---|---|---|---|
49 | 41 | 2.0278 | 77.5398 |
49 | 40 | 1.7083 | 74.4425 |
48 | 40 | 1.0461 | 72.7342 |
48 | 39 | 0.9972 | 71.6881 |
47 | 39 | 1.0389 | 70.6909 |
46 | 39 | 0.9777 | 69.6520 |
45 | 39 | 1.3507 | 68.6743 |
44 | 39 | 1.3165 | 67.3236 |
43 | 39 | 1.4085 | 66.0071 |
42 | 39 | 1.3132 | 64.5986 |
42 | 38 | 1.5524 | 63.2853 |
42 | 37 | 1.2139 | 61.7329 |
42 | 36 | 1.6284 | 60.5191 |
42 | 35 | 1.7988 | 58.8907 |
41 | 35 | 1.2870 | 57.0919 |
par(mfrow=c(2,1))
plotMatrix(
distance.matrix = AB.distance.matrix,
least.cost.path = AB.least.cost.path,
color.palette = viridis(100, alpha = 0.7)
)
plotMatrix(
distance.matrix = AB.least.cost.matrix,
least.cost.path = AB.least.cost.path,
color.palette = viridis(100, alpha = 0.7)
)
Optional: least cost path using diagonals.
The least-cost path can also be computed using diagonals, which allows pairing together the most similar samples, those that contribute most to minimizing the overall cost of the least-cost path.
#computing the least cost using diagonals
diagonal.least.cost <- leastCostMatrix(
distance.matrix = AB.distance.matrix,
diagonal = TRUE
)
#computing least cost path using diagonals
diagonal.least.cost.path <- leastCostPath(
distance.matrix = AB.distance.matrix,
least.cost.matrix = diagonal.least.cost,
diagonal = TRUE
)
plotMatrix(
distance.matrix = diagonal.least.cost,
least.cost.path = diagonal.least.cost.path,
color.palette = viridis(100, alpha = 0.7)
)
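Conceptually, the diagonal option just adds the diagonal neighbor as a third candidate in the recursion, as in this simplified base-R sketch (assuming a matrix with at least two rows and columns):

```r
least.cost.diagonal.sketch <- function(d) {
  m <- matrix(NA, nrow(d), ncol(d))
  m[1, 1] <- d[1, 1]
  for (i in 2:nrow(d)) m[i, 1] <- m[i - 1, 1] + d[i, 1]
  for (j in 2:ncol(d)) m[1, j] <- m[1, j - 1] + d[1, j]
  for (i in 2:nrow(d)) {
    for (j in 2:ncol(d)) {
      # the diagonal neighbor m[i - 1, j - 1] is the extra candidate
      m[i, j] <- d[i, j] + min(m[i - 1, j], m[i, j - 1], m[i - 1, j - 1])
    }
  }
  m
}
```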
Using diagonals is useful when the goal is to transfer attributes (time, age, depth, etc.) from one sequence to another that lacks them. NOTE: this application is not implemented yet.
3. Getting the least-cost value. The upper-right cell of the plotted least-cost matrix (lower-right cell in the underlying data matrix) holds the cumulative sum of the least-cost path. This value will be used later to compute dissimilarity.
AB.least.cost <- leastCost(
least.cost.matrix = AB.least.cost.matrix
)
AB.least.cost
#> $`A|B`
#> [1] 77.53981
4. Autosum, or sum of the distances among adjacent samples of each sequence. This is another requirement for computing a normalized measure of dissimilarity. It simply computes the distance between adjacent samples of each sequence and sums them up. Notice that the output is a list with named slots. Lists are the main output format of this package; the reason why is explained later.
AB.autosum <- autoSum(
sequences = AB.sequences,
method = "manhattan"
)
AB.autosum
#> $A
#> [1] 22.6535
#>
#> $B
#> [1] 24.20855
5. Compute dissimilarity. The dissimilarity measure used in this package is named psi, and was first described in the book “Numerical methods in Quaternary pollen analysis” (Birks and Gordon, 1985). Psi is computed as follows:
$$\psi = \frac{2LC - (\sum{A_{i-j}} + \sum{B_{i-j}})}{\sum{A_{i-j}} + \sum{B_{i-j}}} $$
where:

- $LC$ is the least-cost value obtained in the previous step.
- $\sum{A_{i-j}}$ and $\sum{B_{i-j}}$ are the autosums of sequences A and B.

Which is basically the least cost normalized by the autosum of both sequences. The psi function only requires the least cost and the autosum of both sequences, as follows. Notice that in the output list, the slot with the psi value is named after the two sequences separated by a vertical line (“A|B”). This convention is followed by every function in the package that returns objects resulting from comparing two sequences.
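The AB.psi object used in the next chunks can be produced with the psi function (the call below is sketched from the usage of the same function later in this document):

```r
AB.psi <- psi(
  least.cost = AB.least.cost,
  autosum = AB.autosum
)
```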
The output of psi is a list that can be transformed into a dataframe or a matrix by using the formatPsi function.
#to dataframe
AB.psi.dataframe <- formatPsi(
psi.values = AB.psi,
to = "dataframe")
kable(AB.psi.dataframe, digits = 4)
A | B | psi |
---|---|---|
A | B | 2.3093 |
#to matrix
AB.psi.matrix <- formatPsi(
psi.values = AB.psi,
to = "matrix")
kable(AB.psi.matrix, digits = 4)
A | B | |
---|---|---|
A | 0.0000 | 2.3093 |
B | 2.3093 | 0.0000 |
Or can also be transformed from matrix to dataframe, or from dataframe to matrix, as convenient.
#from matrix to dataframe again
AB.psi.dataframe <- formatPsi(
psi.values = AB.psi.matrix,
to = "dataframe")
kable(AB.psi.dataframe, digits = 4)
A | B | psi |
---|---|---|
A | B | 2.3093 |
#or from dataframe to matrix
AB.psi.matrix <- formatPsi(
psi.values = AB.psi.dataframe,
to = "matrix"
)
kable(AB.psi.matrix, digits = 4)
A | B | |
---|---|---|
A | 0.0000 | 2.3093 |
B | 2.3093 | 0.0000 |
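As a sanity check, the psi value above can be reproduced by hand in base R from the least-cost and autosum values printed earlier:

```r
# values printed earlier in this document
least.cost <- 77.53981
autosum.A <- 22.6535
autosum.B <- 24.20855

# psi: twice the least cost, minus the autosums, normalized by the autosums
psi.check <- (2 * least.cost - (autosum.A + autosum.B)) / (autosum.A + autosum.B)
round(psi.check, 4)
# 2.3093, matching the value in the tables above
```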
All the steps required to compute psi, including the format options provided by formatPsi, are wrapped together in the function workflowPsi, which works as follows:
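A sketch of that call for the two sequences prepared above (argument names mirrored from the multi-sequence example further below; the grouping column created by prepareSequences is named id):

```r
AB.psi <- workflowPsi(
  sequences = AB.sequences,
  grouping.column = "id",
  method = "manhattan",
  format = "dataframe"
)
```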
The package can work seamlessly with any number of sequences, as long as there is enough memory available. To do so, almost every function uses the package “foreach”, which parallelizes the execution of the functions by using all the processors in your machine but one. This speeds up operations considerably.
The example dataset sequencesMIS contains 12 sections of the same sequence belonging to different marine isotopic stages, identified by a column named “MIS”. MIS stages with odd numbers are generally interpreted as warm periods (interglacials), while the even ones are interpreted as cold periods (glacials). In any case, this interpretation is not important for illustrating this capability of the library.
MIS | Quercus | Betula | Pinus | Alnus | Tilia | Carpinus |
---|---|---|---|---|---|---|
MIS-1 | 55 | 1 | 5 | 3 | 4 | 5 |
MIS-1 | 86 | 21 | 35 | 8 | 0 | 10 |
MIS-1 | 120 | 15 | 8 | 1 | 0 | 1 |
MIS-1 | 138 | 16 | 12 | 6 | 1 | 3 |
MIS-1 | 130 | 12 | 17 | 2 | 1 | 1 |
MIS-1 | 128 | 0 | 6 | 4 | 2 | 2 |
MIS-1 | 140 | 0 | 19 | 9 | 4 | 0 |
MIS-1 | 113 | 0 | 15 | 12 | 2 | 5 |
MIS-1 | 98 | 0 | 27 | 2 | 2 | 0 |
MIS-1 | 92 | 1 | 16 | 7 | 3 | 0 |
MIS-1 | 73 | 3 | 22 | 3 | 0 | 0 |
MIS-1 | 91 | 1 | 21 | 3 | 7 | 0 |
MIS-1 | 148 | 1 | 22 | 1 | 4 | 0 |
MIS-1 | 148 | 0 | 1 | 7 | 13 | 0 |
MIS-1 | 149 | 1 | 2 | 5 | 4 | 0 |
unique(sequencesMIS$MIS)
#> [1] "MIS-1" "MIS-2" "MIS-3" "MIS-4" "MIS-5" "MIS-6" "MIS-7"
#> [8] "MIS-8" "MIS-9" "MIS-10" "MIS-11" "MIS-12"
The dataset is checked and prepared with prepareSequences (note the change in the argument grouping.column).
MIS.sequences <- prepareSequences(
sequences = sequencesMIS,
grouping.column = "MIS",
if.empty.cases = "zero",
transformation = "hellinger"
)
The dissimilarity measure psi can be computed for every combination of sequences through the function workflowPsi, shown below. Note the argument format, which allows selecting either “dataframe” (ordered from lower to higher dissimilarity) or “matrix” to obtain an output with the given structure (if empty, the function returns a list).
MIS.psi <- workflowPsi(
sequences = MIS.sequences,
grouping.column = "MIS",
time.column = NULL,
exclude.columns = NULL,
method = "manhattan",
diagonal = FALSE,
format = "dataframe"
)
#ordered with lower psi on top
kable(MIS.psi[order(MIS.psi$psi), ], digits = 4)
A | B | psi | |
---|---|---|---|
65 | MIS-10 | MIS-12 | 0.4656 |
24 | MIS-3 | MIS-6 | 0.5412 |
13 | MIS-2 | MIS-4 | 0.5560 |
62 | MIS-9 | MIS-11 | 0.5639 |
30 | MIS-3 | MIS-12 | 0.5686 |
61 | MIS-9 | MIS-10 | 0.5772 |
59 | MIS-8 | MIS-11 | 0.5841 |
12 | MIS-2 | MIS-3 | 0.5843 |
66 | MIS-11 | MIS-12 | 0.5949 |
60 | MIS-8 | MIS-12 | 0.6004 |
63 | MIS-9 | MIS-12 | 0.6266 |
51 | MIS-6 | MIS-12 | 0.6487 |
22 | MIS-3 | MIS-4 | 0.6830 |
53 | MIS-7 | MIS-9 | 0.6877 |
32 | MIS-4 | MIS-6 | 0.6964 |
28 | MIS-3 | MIS-10 | 0.7231 |
45 | MIS-5 | MIS-12 | 0.7274 |
64 | MIS-10 | MIS-11 | 0.7361 |
56 | MIS-7 | MIS-12 | 0.7378 |
21 | MIS-2 | MIS-12 | 0.7444 |
26 | MIS-3 | MIS-8 | 0.7621 |
47 | MIS-6 | MIS-8 | 0.7630 |
57 | MIS-8 | MIS-9 | 0.7648 |
19 | MIS-2 | MIS-10 | 0.7737 |
27 | MIS-3 | MIS-9 | 0.7884 |
40 | MIS-5 | MIS-7 | 0.7900 |
15 | MIS-2 | MIS-6 | 0.7912 |
54 | MIS-7 | MIS-10 | 0.8147 |
29 | MIS-3 | MIS-11 | 0.8182 |
49 | MIS-6 | MIS-10 | 0.8244 |
58 | MIS-8 | MIS-10 | 0.8452 |
43 | MIS-5 | MIS-10 | 0.8882 |
42 | MIS-5 | MIS-9 | 0.9065 |
48 | MIS-6 | MIS-9 | 0.9500 |
44 | MIS-5 | MIS-11 | 0.9556 |
55 | MIS-7 | MIS-11 | 0.9618 |
38 | MIS-4 | MIS-12 | 0.9874 |
52 | MIS-7 | MIS-8 | 1.0042 |
25 | MIS-3 | MIS-7 | 1.0658 |
23 | MIS-3 | MIS-5 | 1.0807 |
34 | MIS-4 | MIS-8 | 1.1241 |
50 | MIS-6 | MIS-11 | 1.1681 |
46 | MIS-6 | MIS-7 | 1.2160 |
17 | MIS-2 | MIS-8 | 1.2658 |
41 | MIS-5 | MIS-8 | 1.2964 |
18 | MIS-2 | MIS-9 | 1.4517 |
36 | MIS-4 | MIS-10 | 1.4584 |
39 | MIS-5 | MIS-6 | 1.4667 |
20 | MIS-2 | MIS-11 | 1.4832 |
6 | MIS-1 | MIS-7 | 1.7293 |
35 | MIS-4 | MIS-9 | 1.8136 |
10 | MIS-1 | MIS-11 | 1.8177 |
16 | MIS-2 | MIS-7 | 1.8869 |
8 | MIS-1 | MIS-9 | 1.8992 |
33 | MIS-4 | MIS-7 | 1.9877 |
37 | MIS-4 | MIS-11 | 1.9949 |
7 | MIS-1 | MIS-8 | 2.0264 |
5 | MIS-1 | MIS-6 | 2.0409 |
4 | MIS-1 | MIS-5 | 2.1046 |
14 | MIS-2 | MIS-5 | 2.1585 |
11 | MIS-1 | MIS-12 | 2.1738 |
2 | MIS-1 | MIS-3 | 2.4295 |
9 | MIS-1 | MIS-10 | 2.7392 |
31 | MIS-4 | MIS-5 | 2.7452 |
3 | MIS-1 | MIS-4 | 2.7457 |
1 | MIS-1 | MIS-2 | 2.7670 |
#example: distance matrix among all MIS sequences
x <- distanceMatrix(
sequences = MIS.sequences,
grouping.column = "MIS",
time.column = NULL,
exclude.columns = NULL,
method = "manhattan"
)
A dataframe like this can be plotted as an adjacency network with the qgraph package as follows:
#psi values to matrix
MIS.psi.matrix <- formatPsi(
psi.values = MIS.psi,
to = "matrix"
)
#dissimilariy to distance
MIS.distance <- 1/MIS.psi.matrix^4
#plotting network
qgraph::qgraph(MIS.distance,
layout='spring',
vsize=5,
labels = colnames(MIS.distance),
colors = viridis::viridis(2, begin = 0.3, end = 0.8, alpha = 0.5, direction = -1)
)
Or as a matrix with ggplot2.
#ordering factors to get a triangular matrix
MIS.psi$A <- factor(MIS.psi$A, levels=unique(sequencesMIS$MIS))
MIS.psi$B <- factor(MIS.psi$B, levels=unique(sequencesMIS$MIS))
#plotting matrix
ggplot(data=na.omit(MIS.psi), aes(x=A, y=B, size=psi, color=psi)) +
geom_point() +
viridis::scale_color_viridis(direction = -1) +
guides(size = FALSE)
The dataframe of dissimilarities between pairs of sequences can also be used to analyze the drivers of dissimilarity. To do so, attributes such as differences in time (when sequences represent different times) or distance (when sequences represent different sites), or differences in physical/climatic attributes such as topography or climate, can be added to the table, so that models such as $\psi = A + B + C$ (where A, B, and C are those attributes) can be fitted.
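A hypothetical sketch of such a model; the psi values and the attribute columns time.difference and distance.km below are simulated for illustration only, and in a real analysis would be measured for each pair of sequences:

```r
# simulated table of pairwise psi values and pair attributes (illustration only)
set.seed(1)
pairs.df <- data.frame(
  psi = runif(20, 0.5, 3),
  time.difference = runif(20, 0, 100),
  distance.km = runif(20, 0, 500)
)

# linear model of dissimilarity as a function of the pair attributes
psi.model <- lm(psi ~ time.difference + distance.km, data = pairs.df)
summary(psi.model)
```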
This section assumes that the target multivariate time-series have the same number of samples/rows, and samples have the same time/depth/order if there is a time/age/depth column available. In this particular case, distances are computed only between samples with the same time/depth/order, and no distance matrix (nor least cost analysis) is required. When the argument paired.samples in prepareSequences is set to TRUE, the function checks if the sequences have the same number of rows, and, if time.column is provided, it selects the samples that have valid time/depth columns for every sequence in the dataset.
Here we test these ideas with the climate dataset included in the library. It represents simulated palaeoclimate over 200 ky at four sites identified by the column sequenceId. Note that this time the transformation applied is “scaled”, which uses the scale function of base R to center and scale the data.
#loading sample data
data(climate)
#preparing sequences
climate <- prepareSequences(
sequences = climate,
grouping.column = "sequenceId",
time.column = "time",
paired.samples = TRUE,
transformation = "scale"
)
The function distancePairedSamples computes the distance between paired samples (same order/time/age/depth), and therefore does not use the least-cost algorithm shown above to minimize the distances between sequences. If the argument time.column is not provided, or if it is provided but same.time is FALSE, the samples are paired by their order (first sample of one sequence vs. the first sample of the other, the second against the second, and so on), and samples that have an age in only one of the sequences are removed from the comparison. This is useful to compare subsets of the same sequence, or subsets of different sequences belonging to different times.
If there is a time.column and same.time is set to TRUE, only samples with the same time/age/depth are compared. This is useful to compare sequences taken at the same time in different sites.
The result is a list, with each slot named after the id of the given sequence according to grouping.column, and containing a vector of pairwise distances. Each vector position is named after time.column, if available.
climate.distances <- distancePairedSamples(
sequences = climate,
grouping.column = "sequenceId",
time.column = "time",
same.time = TRUE,
exclude.columns = NULL,
method = "manhattan",
sum.distances = FALSE
)
str(climate.distances)
#> List of 6
#> $ 1|2: Named num [1:200] 6.13 6.15 5.76 4.97 4.04 ...
#> ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
#> $ 1|3: Named num [1:200] 2.354 1.681 1.404 0.906 1.043 ...
#> ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
#> $ 1|4: Named num [1:200] 6.56 6.82 6.79 6.07 5.04 ...
#> ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
#> $ 2|3: Named num [1:200] 5.76 5.75 5.21 4.52 4.19 ...
#> ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
#> $ 2|4: Named num [1:200] 0.432 0.667 1.025 1.161 1.449 ...
#> ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
#> $ 3|4: Named num [1:200] 5.92 6.39 6.23 5.63 4.88 ...
#> ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
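The core idea, one distance per pair of same-position samples and no least-cost analysis, can be sketched in base R (assuming the manhattan metric and two aligned numeric matrices):

```r
paired.distances.sketch <- function(a, b) {
  stopifnot(nrow(a) == nrow(b))
  # manhattan distance between the samples occupying the same position
  sapply(seq_len(nrow(a)), function(i) sum(abs(a[i, ] - b[i, ])))
}
```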
If the argument sum.distances is set to TRUE, the function sums the distances up, in effect doing the same work as distanceMatrix, leastCostMatrix and leastCost combined.
climate.distances <- distancePairedSamples(
sequences = climate,
grouping.column = "sequenceId",
time.column = "time",
exclude.columns = NULL,
same.time = TRUE,
method = "manhattan",
sum.distances = TRUE
)
climate.distances
#> $`1|2`
#> [1] 859.1105
#>
#> $`1|3`
#> [1] 712.0338
#>
#> $`1|4`
#> [1] 1034.932
#>
#> $`2|3`
#> [1] 819.8022
#>
#> $`2|4`
#> [1] 669.1838
#>
#> $`3|4`
#> [1] 755.5203
Computing psi from there only requires the autosum of each sequence, and the application of the psi function.
#computing autosum
climate.autosum <- autoSum(
sequences = climate,
grouping.column = "sequenceId",
time.column = "time",
method = "manhattan"
)
#computing psi
climate.psi <- psi(
least.cost = climate.distances,
autosum = climate.autosum
)
climate.psi
#> $`1|2`
#> [1] 3.378791
#>
#> $`1|3`
#> [1] 2.767909
#>
#> $`1|4`
#> [1] 4.043498
#>
#> $`2|3`
#> [1] 3.761143
#>
#> $`2|4`
#> [1] 2.551667
#>
#> $`3|4`
#> [1] 3.169788
The function workflowPsi executes this sequence of steps when the argument paired.samples is set to TRUE, as follows.
climate.psi <- workflowPsi(
sequences = climate,
grouping.column = "sequenceId",
time.column = "time",
method = "manhattan",
paired.samples = TRUE, #this bit is important
format = "dataframe"
)
kable(climate.psi, digits = 4)
A | B | psi |
---|---|---|
1 | 2 | 3.3788 |
1 | 3 | 2.7679 |
1 | 4 | 4.0435 |
2 | 3 | 3.7611 |
2 | 4 | 2.5517 |
3 | 4 | 3.1698 |
One question that may arise when comparing time series is “to what extent are dissimilarity values random?”. Answering this question requires comparing a given dissimilarity value (psi) with a distribution of dissimilarity values resulting from chance. But how do we simulate chance in a multivariate time-series? The natural answer is “permutation”. Since samples in a multivariate time-series are ordered, re-shuffling whole samples is out of the question, but it is possible to re-shuffle the values of the variables within each sample. This kind of permutation is named “restricted permutation”.
A restricted permutation test on psi values requires the following steps:

1. Compute the real psi value of the two sequences.
2. Generate many permuted versions of the sequences through restricted permutation, and compute the psi value of each permuted pair.
3. Compute the proportion of permuted psi values that are lower than the real one.

Such a proportion represents the probability of obtaining a psi value lower than the real one by chance.
Since the restricted permutation only happens at a local scale within each column of each sequence, the probability values returned are very conservative and shouldn’t be interpreted in the same way p-values are interpreted.
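One way to sketch such a restricted permutation in base R, assuming the shuffling happens within each column and is restricted to small windows of consecutive samples (the actual implementation in workflowNullPsi may differ in its details):

```r
restricted.permutation.sketch <- function(x, window.size = 3) {
  for (j in seq_len(ncol(x))) {
    for (start in seq(1, nrow(x), by = window.size)) {
      rows <- start:min(start + window.size - 1, nrow(x))
      # shuffle the values of column j only within this window of consecutive samples
      x[rows, j] <- x[rows[sample.int(length(rows))], j]
    }
  }
  x
}
```

Each column keeps exactly the same values, but their order is scrambled locally, preserving the broad structure of the sequence while breaking fine-scale patterns.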
The process described above has been implemented in the workflowNullPsi function. We will apply it to three groups of the sequencesMIS dataset.
#getting example data
data(sequencesMIS)
#working with 3 groups (to make this fast)
sequencesMIS <- sequencesMIS[sequencesMIS$MIS %in% c("MIS-4", "MIS-5", "MIS-6"),]
#preparing sequences
sequencesMIS <- prepareSequences(
sequences = sequencesMIS,
grouping.column = "MIS",
transformation = "hellinger"
)
The computation of the null psi values goes as follows:
random.psi <- workflowNullPsi(
sequences = sequencesMIS,
grouping.column = "MIS",
method = "manhattan",
paired.samples = FALSE,
repetitions = 9 #recommended values: 99, 999
)
Note that the number of repetitions has been set to 9 in order to speed-up execution. The actual number should be 99, and ideally, 999.
The output is a list with two dataframes, psi and p.
The dataframe psi contains the real and random psi values. The column psi contains the dissimilarity between the sequences in the columns A and B. The columns r1 to r9 contain the psi values obtained from permutations of the sequences.
A | B | psi | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 |
---|---|---|---|---|---|---|---|---|---|---|---|
MIS-4 | MIS-5 | 2.7452 | 3.5801 | 1.0108 | 2.0398 | 3.5858 | 1.2728 | 2.0929 | 3.9632 | 1.0159 | 2.1560 |
MIS-4 | MIS-6 | 0.6964 | 3.4371 | 1.0574 | 1.9265 | 3.6948 | 1.1597 | 1.9834 | 3.4991 | 1.0731 | 1.9729 |
MIS-5 | MIS-6 | 1.4667 | 3.5061 | 1.1663 | 2.1745 | 3.8673 | 1.1517 | 2.4507 | 3.7685 | 1.0278 | 2.0605 |
The information in this dataframe can be easily plotted as follows:
#extracting the dataframe
psi <- random.psi[[1]]
#colors
cols <- viridis::viridis(2, begin = 0.2, end = 0.8)
#multipanel plot
par(mfrow=c(1,3))
for(i in 1:nrow(psi)){
plot(density(x = t(psi[i, 3:ncol(psi)]), from = 0),
main = paste(psi[i, 1], " vs. ", psi[i, 2], sep=""),
xlab = "Psi",
col = cols[1],
lwd = 3
)
abline(v = psi[i, 3],
col = cols[2],
lwd=3
)
}
According to this data, the psi value obtained for MIS-4 and MIS-6 is more robust (less likely to have been obtained by chance) than the other two. The dataframe p contains the probability of obtaining the real psi value by chance for each combination of sequences.
A | B | p |
---|---|---|
MIS-4 | MIS-5 | 0.7 |
MIS-4 | MIS-6 | 0.1 |
MIS-5 | MIS-6 | 0.4 |
What variables are more important in explaining the dissimilarity between two sequences? Or, in other words, what variables contribute the most to the dissimilarity between two sequences? One reasonable answer is: the one that reduces dissimilarity the most when removed from the data.
This section explains how to use the function workflowImportance to evaluate the importance of given variables in explaining differences between sequences.
First, we prepare the data. It is again sequencesMIS, but with only three groups selected (MIS 4 to 6) to simplify the analysis.
#getting example data
data(sequencesMIS)
#getting three groups only to simplify
sequencesMIS <- sequencesMIS[sequencesMIS$MIS %in% c("MIS-4", "MIS-5", "MIS-6"),]
#preparing sequences
sequences <- prepareSequences(
sequences = sequencesMIS,
grouping.column = "MIS",
time.column = NULL,
merge.mode = "complete",
exclude.columns = NULL
)
The workflow function is pretty similar to the ones explained above. It can work with paired samples (aligned time-series) or with unaligned ones by switching between workflowPsi and workflowPsiPairedSamples, as requested by the user through the argument paired.samples. Unlike the other functions in the package, which parallelize the execution over combinations of sequences, this one parallelizes the computation of psi over combinations of columns, removing one column each time.
WARNING: the argument ‘exclude.columns’ of ‘workflowImportance’ does not work in version 1.0.0 (available in CRAN), but the bug is fixed in version 1.0.1 (available in GitHub). If you are using 1.0.0, I recommend subsetting ‘sequences’ so that only the grouping column and the numeric columns to be compared are available to the function.
psi.importance <- workflowImportance(
sequences = sequencesMIS,
grouping.column = "MIS",
time.column = NULL,
exclude.columns = NULL,
method = "manhattan",
diagonal = FALSE,
parallel.execution = TRUE,
paired.samples = FALSE
)
The output is a list with two slots named psi and psi.drop.
The dataframe psi contains psi values for each combination of sequences (named in the columns A and B): the column All variables holds the psi value computed on all columns, and each column named Without variable_name holds the psi value obtained when that variable is removed from the compared sequences.
A | B | All variables | Without Carpinus | Without Tilia | Without Alnus | Without Pinus | Without Betula | Without Quercus |
---|---|---|---|---|---|---|---|---|
MIS-4 | MIS-5 | 2.2093 | 2.2259 | 2.2257 | 2.2786 | 2.8885 | 2.2505 | 0.9186 |
MIS-4 | MIS-6 | 0.7145 | 0.7070 | 0.7139 | 0.6955 | 0.7377 | 0.7045 | 0.6033 |
MIS-5 | MIS-6 | 1.4622 | 1.4546 | 1.4714 | 1.4858 | 1.7097 | 1.5020 | 0.7573 |
This table can be plotted as a bar plot as follows:
#extracting object
psi.df <- psi.importance$psi
#to long format
psi.df.long <- tidyr::gather(psi.df, variable, psi, 3:ncol(psi.df))
#creating column with names of the sequences
psi.df.long$name <- paste(psi.df.long$A, psi.df.long$B, sep=" - ")
#plot
ggplot(data=psi.df.long, aes(x=variable, y=psi, fill=psi)) +
geom_bar(stat = "identity") +
coord_flip() +
facet_wrap("name") +
scale_fill_viridis() +
ggtitle("Contribution of separated variables to dissimilarity.") +
labs(fill = "Psi")
The second table, named psi.drop, contains the drop in psi values, as a percentage, when the given variable is removed from the analysis. Large positive numbers indicate that dissimilarity drops (similarity increases) when the given variable is removed, confirming that the variable is important in explaining the dissimilarity between both sequences. Negative values indicate an increase in dissimilarity between the sequences when the variable is dropped.
In summary:
A | B | Carpinus | Tilia | Alnus | Pinus | Betula | Quercus |
---|---|---|---|---|---|---|---|
MIS-4 | MIS-5 | -0.75 | -0.74 | -3.13 | -30.74 | -1.86 | 58.42 |
MIS-4 | MIS-6 | 1.04 | 0.07 | 2.66 | -3.25 | 1.40 | 15.56 |
MIS-5 | MIS-6 | 0.52 | -0.63 | -1.61 | -16.93 | -2.72 | 48.21 |
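The percentages in this table can be reproduced by hand from the psi table shown above; for example, for MIS-4 vs. MIS-5 and the variable Quercus:

```r
# values copied from the psi table above
psi.all.variables <- 2.2093
psi.without.quercus <- 0.9186

# drop in dissimilarity as a percentage of the full psi value
psi.drop <- 100 * (psi.all.variables - psi.without.quercus) / psi.all.variables
round(psi.drop, 2)
# 58.42, matching the Quercus column of the first row
```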
#extracting object
psi.drop.df <- psi.importance$psi.drop
#to long format
psi.drop.df.long <- tidyr::gather(psi.drop.df, variable, psi, 3:ncol(psi.drop.df))
#creating column with names of the sequences
psi.drop.df.long$name <- paste(psi.drop.df.long$A, psi.drop.df.long$B, sep=" - ")
#plot
ggplot(data=psi.drop.df.long, aes(x=variable, y=psi, fill=psi)) +
geom_bar(stat = "identity") +
coord_flip() +
facet_wrap("name") +
scale_fill_viridis(direction = -1) +
ggtitle("Drop in dissimilarity when variables are removed.") +
ylab("Drop in dissimilarity (%)") +
labs(fill = "Psi drop (%)")
In this scenario the user has one short and one long sequence, and the goal is to find the section of the long sequence that best matches the short one. To recreate this scenario we will use the dataset sequencesMIS. We will extract the first 10 samples as the short sequence, and the first 40 samples as the long sequence. These small subsets are selected to speed up the execution of this example.
A more realistic example would be to select, for example, the group “MIS-5” as the short sequence, and the complete sequence as the long sequence, in order to find which segment of sequencesMIS is most similar to the “MIS-5” group.
#loading the data
data(sequencesMIS)
#removing grouping column
sequencesMIS$MIS <- NULL
#subsetting to get the short sequence
MIS.short <- sequencesMIS[1:10, ]
#subsetting to get the long sequence
MIS.long <- sequencesMIS[1:40, ]
The sequences have to be prepared and transformed. For simplicity, the sequences are named short and long, and the grouping column is named id, but the user can name them at will. Since the data represents community composition, a Hellinger transformation is applied.
MIS.short.long <- prepareSequences(
sequence.A = MIS.short,
sequence.A.name = "short",
sequence.B = MIS.long,
sequence.B.name = "long",
grouping.column = "id",
transformation = "hellinger"
)
kable(MIS.short.long, digits = 4)
id | Quercus | Betula | Pinus | Alnus | Tilia | Carpinus |
---|---|---|---|---|---|---|
short | 0.8680 | 0.1170 | 0.2617 | 0.2027 | 0.2341 | 0.2617 |
short | 0.7331 | 0.3623 | 0.4677 | 0.2236 | 0.0002 | 0.2500 |
short | 0.9097 | 0.3216 | 0.2349 | 0.0830 | 0.0003 | 0.0830 |
short | 0.8855 | 0.3015 | 0.2611 | 0.1846 | 0.0754 | 0.1306 |
short | 0.8931 | 0.2713 | 0.3229 | 0.1108 | 0.0783 | 0.0783 |
short | 0.9494 | 0.0003 | 0.2056 | 0.1678 | 0.1187 | 0.1187 |
short | 0.9022 | 0.0002 | 0.3324 | 0.2287 | 0.1525 | 0.0002 |
short | 0.8768 | 0.0003 | 0.3194 | 0.2857 | 0.1166 | 0.1844 |
short | 0.8716 | 0.0003 | 0.4575 | 0.1245 | 0.1245 | 0.0003 |
short | 0.8793 | 0.0917 | 0.3667 | 0.2425 | 0.1588 | 0.0003 |
long | 0.8680 | 0.1170 | 0.2617 | 0.2027 | 0.2341 | 0.2617 |
long | 0.7331 | 0.3623 | 0.4677 | 0.2236 | 0.0002 | 0.2500 |
long | 0.9097 | 0.3216 | 0.2349 | 0.0830 | 0.0003 | 0.0830 |
long | 0.8855 | 0.3015 | 0.2611 | 0.1846 | 0.0754 | 0.1306 |
long | 0.8931 | 0.2713 | 0.3229 | 0.1108 | 0.0783 | 0.0783 |
long | 0.9494 | 0.0003 | 0.2056 | 0.1678 | 0.1187 | 0.1187 |
long | 0.9022 | 0.0002 | 0.3324 | 0.2287 | 0.1525 | 0.0002 |
long | 0.8768 | 0.0003 | 0.3194 | 0.2857 | 0.1166 | 0.1844 |
long | 0.8716 | 0.0003 | 0.4575 | 0.1245 | 0.1245 | 0.0003 |
long | 0.8793 | 0.0917 | 0.3667 | 0.2425 | 0.1588 | 0.0003 |
long | 0.8502 | 0.1723 | 0.4667 | 0.1723 | 0.0003 | 0.0003 |
long | 0.8601 | 0.0902 | 0.4132 | 0.1562 | 0.2386 | 0.0003 |
long | 0.9170 | 0.0754 | 0.3536 | 0.0754 | 0.1508 | 0.0002 |
long | 0.9358 | 0.0002 | 0.0769 | 0.2035 | 0.2774 | 0.0002 |
long | 0.9620 | 0.0788 | 0.1115 | 0.1762 | 0.1576 | 0.0002 |
long | 0.8692 | 0.0005 | 0.4714 | 0.1491 | 0.0005 | 0.0005 |
long | 0.9315 | 0.0003 | 0.2227 | 0.0909 | 0.2727 | 0.0003 |
long | 0.8919 | 0.1508 | 0.3482 | 0.1741 | 0.1741 | 0.0003 |
long | 0.8722 | 0.2265 | 0.2615 | 0.2615 | 0.2265 | 0.0003 |
long | 0.8882 | 0.3496 | 0.2108 | 0.0003 | 0.2108 | 0.0003 |
long | 0.7977 | 0.0010 | 0.4264 | 0.4264 | 0.0010 | 0.0010 |
long | 0.9295 | 0.1549 | 0.1549 | 0.2828 | 0.0894 | 0.0003 |
long | 0.9225 | 0.1786 | 0.1031 | 0.3262 | 0.0003 | 0.0003 |
long | 0.9010 | 0.1407 | 0.2814 | 0.1990 | 0.2225 | 0.0003 |
long | 0.8898 | 0.3819 | 0.2500 | 0.0005 | 0.0005 | 0.0005 |
long | 0.8935 | 0.2593 | 0.3550 | 0.0003 | 0.0917 | 0.0003 |
long | 0.8452 | 0.3780 | 0.3780 | 0.0008 | 0.0008 | 0.0008 |
long | 0.6070 | 0.3244 | 0.6489 | 0.0007 | 0.3244 | 0.0007 |
long | 0.0010 | 0.5222 | 0.8528 | 0.0010 | 0.0010 | 0.0010 |
long | 0.0011 | 0.4714 | 0.8819 | 0.0011 | 0.0011 | 0.0011 |
long | 0.5000 | 0.6124 | 0.6124 | 0.0011 | 0.0011 | 0.0011 |
long | 0.0010 | 0.4472 | 0.8944 | 0.0010 | 0.0010 | 0.0010 |
long | 0.0006 | 0.3922 | 0.8987 | 0.1961 | 0.0006 | 0.0006 |
long | 0.0007 | 0.3086 | 0.9512 | 0.0007 | 0.0007 | 0.0007 |
long | 0.2041 | 0.2887 | 0.9129 | 0.2041 | 0.0005 | 0.0005 |
long | 0.0007 | 0.3162 | 0.9220 | 0.2236 | 0.0007 | 0.0007 |
long | 0.2611 | 0.3371 | 0.8919 | 0.1508 | 0.0005 | 0.0005 |
long | 0.0006 | 0.3536 | 0.9186 | 0.1768 | 0.0006 | 0.0006 |
long | 0.1104 | 0.2469 | 0.9627 | 0.0003 | 0.0003 | 0.0003 |
long | 0.0004 | 0.1796 | 0.9837 | 0.0004 | 0.0004 | 0.0004 |
The function workflowPartialMatch shown below subsets the long sequence into segments with lengths between min.length and max.length. In the example below this search space is reduced to the minimum (the number of rows of MIS.short plus and minus two) to speed up the execution of this example. If these arguments are left empty, the segments of the long sequence to be matched will have the same number of samples as the short sequence. Here we look for segments of the same length, and up to two samples shorter or longer than the short sequence.
MIS.psi <- workflowPartialMatch(
sequences = MIS.short.long,
grouping.column = "id",
method = "manhattan",
paired.samples = FALSE,
min.length = nrow(MIS.short) - 2,
max.length = nrow(MIS.short) + 2
)
The function returns a dataframe with three columns: first.row (first row of the matched segment of the long sequence), last.row (last row of the matched segment of the long sequence), and psi (ordered from lowest to highest). In this case, since the long sequence contains the short sequence, the first row shows a perfect match.
first.row | last.row | psi |
---|---|---|
1 | 10 | 0.0000 |
1 | 9 | 0.0419 |
1 | 11 | 0.0490 |
1 | 12 | 0.0540 |
1 | 8 | 0.1168 |
2 | 10 | 0.3338 |
4 | 12 | 0.3408 |
4 | 11 | 0.3507 |
4 | 13 | 0.3514 |
2 | 12 | 0.3603 |
Subsetting the long sequence with the first.row and last.row values of the best match yields the segment most similar to the short sequence.
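The subsetting step can be sketched as follows, with toy data standing in for MIS.long and for the output of workflowPartialMatch (column names of the toy long sequence are illustrative):

```r
# Toy stand-ins for MIS.long and the output of workflowPartialMatch.
MIS.long.toy <- data.frame(Quercus = 1:40, Pinus = 40:1)
matches.toy <- data.frame(first.row = c(1, 1),
                          last.row = c(10, 9),
                          psi = c(0.0000, 0.0419))
# taking the match with the lowest psi
best <- matches.toy[which.min(matches.toy$psi), ]
# subsetting the long sequence between first.row and last.row
best.segment <- MIS.long.toy[best$first.row:best$last.row, ]
nrow(best.segment) # 10
```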
When regular time-series are provided, and the analysis must compare only aligned samples, the argument paired.samples can be set to TRUE. Instead of using the least-cost algorithm to minimize the distance between sequences, the function distancePairedSamples (shown above) computes the distance between sequences of the same length (the short sequence versus same-length subsets of the long sequence). In that case the arguments min.length and max.length are ignored, and the sample size of the matching segments in the long sequence equals the sample size of the short sequence. This application is ideal when the user has regular time-series.
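The idea behind paired samples can be sketched in base R; this illustrates the concept with toy aligned sequences, not the actual code of distancePairedSamples:

```r
# Two aligned toy sequences: distances are computed only between
# samples in the same position (manhattan distance per row).
a <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
b <- data.frame(x = c(1, 2, 2), y = c(4, 7, 6))
paired.distances <- rowSums(abs(a - b))
paired.distances # 0 2 1
```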
The example below shows how it is done with the climateShort and climateLong datasets. The goal is to find the segment of climateLong, with lengths between min.length and max.length, that best matches climateShort (which has 11 rows).
#loading sample data
data(climateLong) #has "age" column
data(climateShort) #no "age" column
#merging and scaling both datasets
climate.short.long <- prepareSequences(
sequence.A = climateShort,
sequence.A.name = "short",
sequence.B = climateLong,
sequence.B.name = "long",
time.column = "age",
grouping.column = "id",
transformation = "scale"
)
#> Warning in prepareSequences(sequence.A = climateShort, sequence.A.name
#> = "short", : I couldn't find 'time.column' in 'sequenceA'. Added one and
#> filled it with NA.
#computing psi
climate.psi <- workflowPartialMatch(
sequences = climate.short.long,
grouping.column = "id",
time.column = "age",
method = "manhattan",
paired.samples = FALSE,
min.length = 10,
max.length = 12
)
kable(climate.psi[1:10, ], digits = 4)
first.row | last.row | psi |
---|---|---|
17 | 27 | 0.0000 |
17 | 28 | 0.0083 |
17 | 26 | 0.0367 |
16 | 27 | 0.0880 |
16 | 26 | 0.1268 |
16 | 25 | 0.1756 |
18 | 27 | 0.3294 |
18 | 28 | 0.3355 |
18 | 29 | 0.3747 |
15 | 26 | 0.3906 |
Since climateShort is a subset of climateLong, there is a perfect match from the rows 17 to 27.
The output table can be used to plot dissimilarity over the rows of the long sequence.
#reordering table by index of climateLong
climate.psi <- climate.psi[with(climate.psi, order(first.row, last.row)), ]
#plot
plot(climate.psi$first.row,
climate.psi$psi,
type = "l",
xlab = "Rows of the long sequence",
ylab = "Dissimilarity (psi)"
)
Under this scenario, the objective is to combine two sequences in order to obtain a single composite sequence. The basic assumption followed by the algorithm building the composite sequence is that the most similar samples should go together, while respecting the original ordering of each sequence. Therefore, the output will contain the samples of both sequences ordered in a way that minimizes the multivariate distance between consecutive samples. This scenario assumes that at least one of the sequences does not have a time/age/depth column, or that the values in such a column are uncertain. In any case, time/age/depth is not considered as a factor in the generation of the composite sequence.
The example below uses the pollenGP dataset, which contains 200 samples with 40 pollen types each. To create a smaller case study, the code below splits the first 20 samples of the sequence into two sequences of 10 randomly selected samples each. Even though this scenario assumes that the sequences have no depth or age, these columns are kept so the result can be assessed; that is why they are added to the exclude.columns argument. Also note that the argument transformation is set to “none”, so the output is not transformed and the outcome can be easily interpreted. This gives more weight to the most abundant taxa, which will in fact guide the slotting.
#loading the data
data(pollenGP)
#getting first 20 samples
pollenGP <- pollenGP[1:20, ]
#sampling indices
set.seed(10) #to get same result every time
sampling.indices <- sort(sample(1:20, 10))
#subsetting the sequence
A <- pollenGP[sampling.indices, ]
B <- pollenGP[-sampling.indices, ]
#preparing the sequences
AB <- prepareSequences(
sequence.A = A,
sequence.A.name = "A",
sequence.B = B,
sequence.B.name = "B",
grouping.column = "id",
exclude.columns = c("depth", "age"),
transformation = "none"
)
kable(AB[1:15, ], digits = 4)
id | depth | age | Abies | Juniperus | Hedera | Plantago | Boraginaceae | Crassulaceae | Pinus | Ranunculaceae | Rhamnus | Caryophyllaceae | Dipsacaceae | Betula | Acer | Armeria | Tilia | Hippophae | Salix | Thalictrum | Labiatae | Valeriana | Nymphaea | Umbelliferae | Sanguisorba_minor | Plantago.lanceolata | Campanulaceae | Asteroideae | Gentiana | Fraxinus | Cichorioideae | Taxus | Rumex | Cedrus | Ranunculus.subgen..Batrachium | Cyperaceae | Corylus | Myriophyllum | Filipendula | Vitis | Rubiaceae | Polypodium |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 3 | 3.97 | 11243 | 0 | 5 | 12 | 21 | 95 | 0 | 20 | 0 | 0 | 6 | 1 | 0 | 7 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 0 | 0 | 0 |
A | 6 | 4.05 | 11459 | 0 | 20 | 73 | 94 | 1 | 0 | 0 | 0 | 0 | 1 | 10 | 0 | 2 | 0 | 0 | 3 | 0 | 1 | 3 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
A | 7 | 4.07 | 11514 | 0 | 20 | 80 | 100 | 0 | 0 | 0 | 0 | 0 | 10 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 1 | 0 |
A | 8 | 4.10 | 11595 | 0 | 34 | 80 | 155 | 0 | 0 | 0 | 2 | 0 | 2 | 13 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
A | 9 | 4.17 | 11784 | 0 | 22 | 44 | 131 | 0 | 0 | 0 | 0 | 0 | 1 | 13 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 5 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
A | 10 | 4.20 | 11865 | 0 | 35 | 30 | 112 | 0 | 0 | 0 | 0 | 0 | 4 | 8 | 0 | 0 | 0 | 2 | 2 | 0 | 1 | 2 | 0 | 0 | 7 | 0 | 1 | 0 | 0 | 10 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 2 | 1 | 0 |
A | 11 | 4.22 | 11919 | 0 | 30 | 45 | 150 | 0 | 0 | 0 | 0 | 0 | 4 | 11 | 0 | 0 | 0 | 0 | 2 | 0 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 13 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 |
A | 12 | 4.25 | 12000 | 0 | 44 | 35 | 150 | 0 | 0 | 0 | 0 | 0 | 2 | 8 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 |
A | 15 | 4.32 | 12189 | 0 | 43 | 17 | 120 | 0 | 0 | 0 | 0 | 0 | 2 | 15 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 2 | 0 | 0 | 5 | 1 | 0 | 0 | 0 | 6 | 2 | 0 | 2 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 |
A | 16 | 4.35 | 12270 | 0 | 36 | 25 | 91 | 0 | 0 | 2 | 0 | 0 | 6 | 17 | 0 | 0 | 1 | 0 | 5 | 0 | 4 | 3 | 0 | 1 | 6 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
B | 1 | 3.92 | 11108 | 0 | 7 | 0 | 5 | 20 | 0 | 13 | 0 | 2 | 1 | 0 | 2 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60 | 2 | 0 | 0 |
B | 2 | 3.95 | 11189 | 0 | 3 | 3 | 15 | 47 | 0 | 15 | 0 | 3 | 1 | 0 | 3 | 28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 87 | 0 | 0 | 0 |
B | 4 | 4.00 | 11324 | 0 | 8 | 43 | 60 | 65 | 0 | 10 | 0 | 0 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 11 | 0 | 2 | 0 |
B | 5 | 4.02 | 11378 | 0 | 44 | 76 | 110 | 0 | 0 | 0 | 0 | 0 | 2 | 11 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 3 | 0 | 0 | 5 | 0 | 0 | 1 | 0 | 6 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
B | 13 | 4.27 | 12054 | 0 | 37 | 30 | 150 | 0 | 0 | 1 | 0 | 0 | 6 | 10 | 0 | 0 | 0 | 0 | 7 | 0 | 2 | 2 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 4 | 0 | 0 |
Once the sequences are prepared, the function workflowSlotting combines (slots) them. The function computes a distance matrix between the samples of both sequences according to the method argument, computes the least cost matrix, and generates the least cost path.
AB.combined <- workflowSlotting(
sequences = AB,
grouping.column = "id",
time.column = "age",
exclude.columns = "depth",
method = "manhattan",
plot = TRUE
)
The function reads the least cost path in order to find the combination of samples of both sequences that minimizes dissimilarity, constrained by the order of the samples in each sequence. The output dataframe has a column named original.index, with the index of each sample in its original dataset.
kable(AB.combined[1:15,1:10], format = "html", digits = 4) %>%
row_spec(c(4, 6, 11, 12, 13), bold = T)
id | original.index | depth | age | Abies | Juniperus | Hedera | Plantago | Boraginaceae | Crassulaceae | |
---|---|---|---|---|---|---|---|---|---|---|
11 | B | 1 | 1 | 3.92 | 11108 | 0 | 7 | 0 | 5 | 20 |
12 | B | 2 | 2 | 3.95 | 11189 | 0 | 3 | 3 | 15 | 47 |
13 | B | 3 | 4 | 4.00 | 11324 | 0 | 8 | 43 | 60 | 65 |
1 | A | 1 | 3 | 3.97 | 11243 | 0 | 5 | 12 | 21 | 95 |
2 | A | 2 | 6 | 4.05 | 11459 | 0 | 20 | 73 | 94 | 1 |
14 | B | 4 | 5 | 4.02 | 11378 | 0 | 44 | 76 | 110 | 0 |
3 | A | 3 | 7 | 4.07 | 11514 | 0 | 20 | 80 | 100 | 0 |
4 | A | 4 | 8 | 4.10 | 11595 | 0 | 34 | 80 | 155 | 0 |
5 | A | 5 | 9 | 4.17 | 11784 | 0 | 22 | 44 | 131 | 0 |
15 | B | 5 | 13 | 4.27 | 12054 | 0 | 37 | 30 | 150 | 0 |
6 | A | 6 | 10 | 4.20 | 11865 | 0 | 35 | 30 | 112 | 0 |
7 | A | 7 | 11 | 4.22 | 11919 | 0 | 30 | 45 | 150 | 0 |
8 | A | 8 | 12 | 4.25 | 12000 | 0 | 44 | 35 | 150 | 0 |
16 | B | 6 | 14 | 4.30 | 12135 | 0 | 50 | 10 | 120 | 0 |
9 | A | 9 | 15 | 4.32 | 12189 | 0 | 43 | 17 | 120 | 0 |
Note that several samples (highlighted in bold) show inverted ages with respect to the previous samples. This is expected, since the slotting algorithm only takes into account distance/dissimilarity between adjacent samples to generate the ordering.
This scenario assumes that the user has two multivariate sequences, one of them with a given attribute (age/time) that needs to be transferred to the other sequence by using similarity/dissimilarity (constrained by sample order) as a transfer criterion.
The code below prepares the data for the example. The sequence pollenGP is the reference sequence, and contains the column age. The sequence pollenX is the target sequence, without an age column. We generate it by taking 40 random samples from the first 50 samples of pollenGP. The sequences are prepared with prepareSequences, as usual, with the identifiers “GP” and “X”.
#loading sample dataset
data(pollenGP)
#subset pollenGP to make a shorter dataset
pollenGP <- pollenGP[1:50, ]
#generating a subset of pollenGP
set.seed(10)
pollenX <- pollenGP[sort(sample(1:50, 40)), ]
#we separate the age column
pollenX.age <- pollenX$age
#and remove the age values from pollenX
pollenX$age <- NULL
pollenX$depth <- NULL
#removing some samples from pollenGP
#so pollenX is not a perfect subset of pollenGP
pollenGP <- pollenGP[-sample(1:50, 10), ]
#prepare sequences
GP.X <- prepareSequences(
sequence.A = pollenGP,
sequence.A.name = "GP",
sequence.B = pollenX,
sequence.B.name = "X",
grouping.column = "id",
time.column = "age",
exclude.columns = "depth",
transformation = "none"
)
The transfer of “age” values from GP to X can be done in two ways, both constrained by sample order: a direct mode, in which each sample of X receives the age of its most similar sample in GP, and an interpolated mode, in which each sample of X receives an age interpolated from the ages of its two most similar samples in GP. Both are shown below.
A direct transfer of an attribute from the samples of one sequence to the samples of another requires computing a distance matrix between samples, the least cost matrix and its least cost path (both with the option diagonal activated), and parsing the least cost path to assign attribute values. This is done by the function workflowTransfer with the option mode = “direct”.
#parameters
X.new <- workflowTransfer(
sequences = GP.X,
grouping.column = "id",
time.column = "age",
method = "manhattan",
transfer.what = "age",
transfer.from = "GP",
transfer.to = "X",
mode = "direct"
)
kable(X.new[1:15, ], digits = 4) %>%
row_spec(c(5, 6, 10, 11, 12, 13, 14, 15), bold = T)
id | depth | age | Abies | Juniperus | Hedera | Plantago | Boraginaceae | Crassulaceae | Pinus | Ranunculaceae | Rhamnus | Caryophyllaceae | Dipsacaceae | Betula | Acer | Armeria | Tilia | Hippophae | Salix | Thalictrum | Labiatae | Valeriana | Nymphaea | Umbelliferae | Sanguisorba_minor | Plantago.lanceolata | Campanulaceae | Asteroideae | Gentiana | Fraxinus | Cichorioideae | Taxus | Rumex | Cedrus | Ranunculus.subgen..Batrachium | Cyperaceae | Corylus | Myriophyllum | Filipendula | Vitis | Rubiaceae | Polypodium | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | X | 0 | 3.92 | 11108 | 0 | 7 | 0 | 5 | 20 | 0 | 13 | 0 | 2 | 1 | 0 | 2 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60 | 2 | 0 | 0 |
42 | X | 0 | 4.00 | 11324 | 0 | 8 | 43 | 60 | 65 | 0 | 10 | 0 | 0 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 11 | 0 | 2 | 0 |
43 | X | 0 | 4.02 | 11378 | 0 | 44 | 76 | 110 | 0 | 0 | 0 | 0 | 0 | 2 | 11 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 3 | 0 | 0 | 5 | 0 | 0 | 1 | 0 | 6 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
44 | X | 0 | 4.05 | 11459 | 0 | 20 | 73 | 94 | 1 | 0 | 0 | 0 | 0 | 1 | 10 | 0 | 2 | 0 | 0 | 3 | 0 | 1 | 3 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
45 | X | 0 | 4.07 | 11514 | 0 | 20 | 80 | 100 | 0 | 0 | 0 | 0 | 0 | 10 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 1 | 0 |
46 | X | 0 | 4.07 | 11595 | 0 | 34 | 80 | 155 | 0 | 0 | 0 | 2 | 0 | 2 | 13 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
47 | X | 0 | 4.17 | 11784 | 0 | 22 | 44 | 131 | 0 | 0 | 0 | 0 | 0 | 1 | 13 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 5 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
48 | X | 0 | 4.20 | 11865 | 0 | 35 | 30 | 112 | 0 | 0 | 0 | 0 | 0 | 4 | 8 | 0 | 0 | 0 | 2 | 2 | 0 | 1 | 2 | 0 | 0 | 7 | 0 | 1 | 0 | 0 | 10 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 2 | 1 | 0 |
49 | X | 0 | 4.22 | 11919 | 0 | 30 | 45 | 150 | 0 | 0 | 0 | 0 | 0 | 4 | 11 | 0 | 0 | 0 | 0 | 2 | 0 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 13 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 |
50 | X | 0 | 4.25 | 12000 | 0 | 44 | 35 | 150 | 0 | 0 | 0 | 0 | 0 | 2 | 8 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 |
51 | X | 0 | 4.25 | 12054 | 0 | 37 | 30 | 150 | 0 | 0 | 1 | 0 | 0 | 6 | 10 | 0 | 0 | 0 | 0 | 7 | 0 | 2 | 2 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 4 | 0 | 0 |
52 | X | 0 | 4.32 | 12135 | 0 | 50 | 10 | 120 | 0 | 0 | 0 | 0 | 0 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 8 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
53 | X | 0 | 4.32 | 12189 | 0 | 43 | 17 | 120 | 0 | 0 | 0 | 0 | 0 | 2 | 15 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 2 | 0 | 0 | 5 | 1 | 0 | 0 | 0 | 6 | 2 | 0 | 2 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 |
54 | X | 0 | 4.40 | 12324 | 0 | 50 | 11 | 86 | 0 | 0 | 0 | 0 | 0 | 2 | 15 | 0 | 0 | 2 | 1 | 4 | 0 | 5 | 3 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 5 | 1 | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 1 | 0 |
55 | X | 0 | 4.40 | 12405 | 0 | 51 | 6 | 70 | 0 | 0 | 0 | 0 | 0 | 1 | 16 | 0 | 0 | 4 | 1 | 2 | 0 | 4 | 2 | 0 | 0 | 5 | 2 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
The algorithm finds the most similar samples and transfers attribute values directly between them. This can result in duplicated attribute values, as highlighted in the table above. The Pearson correlation between the original ages (stored in pollenX.age) and the assigned ones is 0.9996, so it can be concluded that, in spite of its simplicity, this algorithm yields accurate results.
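Why direct transfer can duplicate values is easy to see with a one-dimensional toy example (made-up numbers standing in for the multivariate pollen samples): two target samples can both be closest to the same reference sample, so its age is transferred twice.

```r
# Reference samples with known ages, and targets without ages.
ref.ages <- c(3.92, 4.00, 4.10)
ref.values <- c(10, 50, 90)     # 1-D stand-in for multivariate samples
target.values <- c(48, 52, 91)
# Each target takes the age of its nearest reference sample.
nearest <- sapply(target.values,
                  function(v) which.min(abs(v - ref.values)))
transferred.ages <- ref.ages[nearest]
transferred.ages # 4.00 4.00 4.10 (4.00 is duplicated)
```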
If we consider two samples $Ai$ and $Aj$ of the sequence with known ages, a sample $Bk$ of the other sequence matched between them by the least cost path, the ages $At_{i}$ and $At_{j}$ of $Ai$ and $Aj$, and the weights $w_{i}$ and $w_{j}$ derived from the distances between $Bk$ and each of them:
The unknown value $Bt_{k}$ is computed as:
$$Bt_{k} = w_{i} \times At_{i} + w_{j} \times At_{j}$$
The code below exemplifies the operation, using the samples 1 and 4 of the dataset pollenGP as $Ai$ and $Aj$, and the sample 3 as $Bk$.
#loading data
data(pollenGP)
#samples in A
Ai <- pollenGP[1, 3:ncol(pollenGP)]
Aj <- pollenGP[4, 3:ncol(pollenGP)]
#ages of the samples in A
Ati <- pollenGP[1, "age"]
Atj <- pollenGP[4, "age"]
#sample in B
Bk <- pollenGP[3, 3:ncol(pollenGP)]
#computing distances between Bk, Ai, and Aj
DBkAi <- distance(Bk, Ai)
DBkAj <- distance(Bk, Aj)
#computing weights: distances normalized to sum to one
wi <- DBkAi / (DBkAi + DBkAj)
wj <- DBkAj / (DBkAi + DBkAj)
#computing Btk
Btk <- wi * Ati + wj * Atj
The table below shows the observed versus predicted values of $Btk$.
 | $Btk$ |
---|---|
Observed | 3.9700 |
Predicted | 3.9735 |
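The same computation with plain numbers makes the weighting explicit (the distances below are toy values standing in for DBkAi and DBkAj):

```r
# Ages of the two reference samples.
Ati <- 3.92
Atj <- 4.05
# Toy distances from Bk to Ai and Aj.
DBkAi <- 30
DBkAj <- 70
# Weights: the two distances normalized to sum to one.
wi <- DBkAi / (DBkAi + DBkAj) # 0.3
wj <- DBkAj / (DBkAi + DBkAj) # 0.7
# Weighted age.
Btk <- wi * Ati + wj * Atj    # 4.011
```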
Below we create some example data: one subset of pollenGP will be the donor of age values, and another subset of it, named pollenX, will be the receiver.
#loading sample dataset
data(pollenGP)
#subset pollenGP to make a shorter dataset
pollenGP <- pollenGP[1:50, ]
#generating a subset of pollenGP
set.seed(10)
pollenX <- pollenGP[sort(sample(1:50, 40)), ]
#we separate the age column
pollenX.age <- pollenX$age
#and remove the age values from pollenX
pollenX$age <- NULL
pollenX$depth <- NULL
#removing some samples from pollenGP
#so pollenX is not a perfect subset of pollenGP
pollenGP <- pollenGP[-sample(1:50, 10), ]
#prepare sequences
GP.X <- prepareSequences(
sequence.A = pollenGP,
sequence.A.name = "GP",
sequence.B = pollenX,
sequence.B.name = "X",
grouping.column = "id",
time.column = "age",
exclude.columns = "depth",
transformation = "none"
)
To transfer attributes from GP to X via interpolation we use the workflowTransfer function with the option mode = “interpolated”.
#parameters
X.new <- workflowTransfer(
sequences = GP.X,
grouping.column = "id",
time.column = "age",
method = "manhattan",
transfer.what = "age",
transfer.from = "GP",
transfer.to = "X",
mode = "interpolated"
)
kable(X.new[1:15, ], digits = 4) %>%
row_spec(c(8, 13), bold = T)
id | depth | age | Abies | Juniperus | Hedera | Plantago | Boraginaceae | Crassulaceae | Pinus | Ranunculaceae | Rhamnus | Caryophyllaceae | Dipsacaceae | Betula | Acer | Armeria | Tilia | Hippophae | Salix | Thalictrum | Labiatae | Valeriana | Nymphaea | Umbelliferae | Sanguisorba_minor | Plantago.lanceolata | Campanulaceae | Asteroideae | Gentiana | Fraxinus | Cichorioideae | Taxus | Rumex | Cedrus | Ranunculus.subgen..Batrachium | Cyperaceae | Corylus | Myriophyllum | Filipendula | Vitis | Rubiaceae | Polypodium | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | X | 0 | 3.9498 | 11108 | 0 | 7 | 0 | 5 | 20 | 0 | 13 | 0 | 2 | 1 | 0 | 2 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60 | 2 | 0 | 0 |
42 | X | 0 | 3.9705 | 11324 | 0 | 8 | 43 | 60 | 65 | 0 | 10 | 0 | 0 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 11 | 0 | 2 | 0 |
43 | X | 0 | 4.0003 | 11378 | 0 | 44 | 76 | 110 | 0 | 0 | 0 | 0 | 0 | 2 | 11 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 3 | 0 | 0 | 5 | 0 | 0 | 1 | 0 | 6 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
44 | X | 0 | 4.0212 | 11459 | 0 | 20 | 73 | 94 | 1 | 0 | 0 | 0 | 0 | 1 | 10 | 0 | 2 | 0 | 0 | 3 | 0 | 1 | 3 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
45 | X | 0 | 4.0512 | 11514 | 0 | 20 | 80 | 100 | 0 | 0 | 0 | 0 | 0 | 10 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 1 | 0 |
46 | X | 0 | 4.1308 | 11595 | 0 | 34 | 80 | 155 | 0 | 0 | 0 | 2 | 0 | 2 | 13 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
47 | X | 0 | 4.1985 | 11784 | 0 | 22 | 44 | 131 | 0 | 0 | 0 | 0 | 0 | 1 | 13 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 5 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
48 | X | 0 | NA | 11865 | 0 | 35 | 30 | 112 | 0 | 0 | 0 | 0 | 0 | 4 | 8 | 0 | 0 | 0 | 2 | 2 | 0 | 1 | 2 | 0 | 0 | 7 | 0 | 1 | 0 | 0 | 10 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 2 | 1 | 0 |
49 | X | 0 | 4.2014 | 11919 | 0 | 30 | 45 | 150 | 0 | 0 | 0 | 0 | 0 | 4 | 11 | 0 | 0 | 0 | 0 | 2 | 0 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 13 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 |
50 | X | 0 | 4.2223 | 12000 | 0 | 44 | 35 | 150 | 0 | 0 | 0 | 0 | 0 | 2 | 8 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 |
51 | X | 0 | 4.2301 | 12054 | 0 | 37 | 30 | 150 | 0 | 0 | 1 | 0 | 0 | 6 | 10 | 0 | 0 | 0 | 0 | 7 | 0 | 2 | 2 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 4 | 0 | 0 |
52 | X | 0 | 4.2726 | 12135 | 0 | 50 | 10 | 120 | 0 | 0 | 0 | 0 | 0 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 8 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
53 | X | 0 | NA | 12189 | 0 | 43 | 17 | 120 | 0 | 0 | 0 | 0 | 0 | 2 | 15 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 2 | 0 | 0 | 5 | 1 | 0 | 0 | 0 | 6 | 2 | 0 | 2 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 |
54 | X | 0 | 4.3521 | 12324 | 0 | 50 | 11 | 86 | 0 | 0 | 0 | 0 | 0 | 2 | 15 | 0 | 0 | 2 | 1 | 4 | 0 | 5 | 3 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 5 | 1 | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 1 | 0 |
55 | X | 0 | 4.4177 | 12405 | 0 | 51 | 6 | 70 | 0 | 0 | 0 | 0 | 0 | 1 | 16 | 0 | 0 | 4 | 1 | 2 | 0 | 4 | 2 | 0 | 0 | 5 | 2 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
When the age column (the attribute transferred via interpolation) shows the value NA, it means that the interpolation yielded an age lower than the previous one. This happens when the same $Ai$ and $Aj$ are used to evaluate two or more different samples $Bk$, and the second $Bk$ is more similar to $Ai$ than the first one. These NA values can be removed with na.omit(), or interpolated with functions such as imputeTS::na.interpolation or zoo::na.approx.
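Filling those NA values can also be sketched with base R alone, via stats::approx (toy ages below; zoo::na.approx and imputeTS achieve the same with more options):

```r
# Ages with NA gaps, as produced by the "interpolated" mode (toy values).
ages <- c(3.9498, 3.9705, 4.0003, NA, 4.2014, NA, 4.3521)
idx <- seq_along(ages)
# Linear interpolation over the non-NA ages.
ages.filled <- approx(x = idx[!is.na(ages)],
                      y = ages[!is.na(ages)],
                      xout = idx)$y
ages.filled[4] # midpoint of 4.0003 and 4.2014
```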
Without taking into account these NA values, the Pearson correlation of the interpolated ages with the real ones is 0.9984.
IMPORTANT: the interpretation of the interpolated ages requires careful consideration. Please do not use them blindly, because the algorithm has limitations. For example, significant hiatuses in the data can introduce wild variations in the interpolated ages.