MAT‐Index: An index for fast multiple aspect trajectory similarity measuring

The semantic enrichment of mobility data with several information sources has led to a new type of movement data, the so‐called multiple aspect trajectories. Comparing multiple aspect trajectories is crucial for several analysis tasks such as querying, clustering, similarity, and classification. Multiple aspect trajectory similarity measurement is more complex and computationally expensive, because of the large number and heterogeneous aspects of space, time, and semantics that require a different treatment. Only a few works in the literature focus on optimizing all these dimensions in a single solution, and, to the best of our knowledge, none of them proposes a fast point‐to‐point comparison. In this article we propose the Multiple Aspect Trajectory Index, an index data structure for optimizing the point‐to‐point comparison of multiple aspect trajectories, considering its three basic dimensions of space, time, and semantics. Quantitative and qualitative evaluations show a processing time reduction of up to 98.1%.

Figure 1 shows an example of a multiple aspect trajectory (MAT) where each trajectory point possesses several semantic attributes considering multiple points of view (personal, environmental, transportation means, social media posts, etc.). The example shows the movement of an individual who wears a smartwatch and works at a smart office equipped with numerous sensors and microphones.
We observe from the figure that the MAT is a very complex data type. The richer a trajectory is in terms of semantic aspects, the more knowledge about the moving object can be extracted, but also the more costly it becomes to process the data. Trajectory data are complex by nature, being composed of a sequence of spatiotemporal points, each one having the dimensions of space, time, and a set of semantics. By semantic aspect or dimension, we mean any type of information that can be added to trajectories that is neither spatial nor temporal.
In trajectory data analysis, comparing trajectories is crucial for several analysis tasks such as querying, clustering, similarity, and classification, all research topics that have attracted great interest in recent years. Trajectory similarity measurement is fundamental to querying and clustering moving objects with similar characteristics.
There are several similarity/distance measures in the literature such as dynamic time warping (DTW), modified dynamic time warping, longest common subsequence (LCSS), edit distance for real sequences (EDR), and Fréchet distance, but most of them were either developed for time series or do not support all three dimensions of mobility data, that is, space, time, and semantics.
Similarity measures that were specifically developed for trajectories include uncertain movement similarity (UMS; Furtado, Alvares, Pelekis, Theodoridis, & Bogorny, 2018), multidimensional similarity measure (MSM; Furtado, Kopanaki, Alvares, & Bogorny, 2016) and multiple aspect trajectory similarity (MUITAS; Petry, Ferrero, Alvares, Renso, & Bogorny, 2019). UMS is very robust for spatial similarity, but it does not consider time and semantic dimensions, which are fundamental to mobility data. On the other hand, MSM and MUITAS have outperformed the well-known older measures DTW (Berndt & Clifford, 1994), LCSS (Vlachos, Kollios, & Gunopulos, 2002), and EDR (Chen, Özsu, & Oria, 2005). EDR and LCSS require a matching in all dimensions of two points to consider them as similar, while MSM and MUITAS do not. MSM and MUITAS are flexible measures that consider similar two trajectories that do similar things but not necessarily in the same order, thus not forcing a match in all dimensions. This flexibility is reasonable since it is rare that two moving objects do precisely the same things, at the same place and time, in the same sequence. MUITAS is also flexible in considering the dimensions as independent or dependent in the matching process, covering MSM and part of LCSS and EDR. Indeed, MSM and MUITAS allow a different distance function to be used for measuring the similarity of each dimension, apart from defining weights that give more or less importance to each dimension.
To better understand the similarity problem addressed in this work, let us consider the simple example of trajectories A and B in Figure 2. Trajectory A has three points ($a_1, a_2, a_3$) and trajectory B has five points ($b_1, b_2, b_3, b_4, b_5$). Both trajectories visit the same places (same semantics) but in a different order. A and B visit hotel, bank, and mall, but not necessarily in this order. As MUITAS and MSM consider any type of trajectory dimension, they have hitherto been the most robust methods of measuring trajectory similarity, independently of the dimensions present in the dataset.
FIGURE 1 Example of a multiple aspect trajectory (Mello et al., 2019)
In the example of Figure 2, MSM and MUITAS need to compare each point of trajectory A to all points of trajectory B, and for all dimensions, to discover that semantically trajectory A is totally contained in B, that is, they share the semantic similarity of three points. In other words, MSM compares each dimension of point $a_1$ to all points of B, in order to discover that A and B visit hotel, bank, and mall and that they move at similar times. Because of the point-to-point comparison, MSM and MUITAS have a high complexity, and require a smart indexing data structure that supports all three dimensions for fast similarity search in real datasets.
In 2018, Furtado proposed fast trajectory similarity measuring (FTSM), an index for fast similarity measurement of UMS and MSM (Furtado et al., 2018). However, it indexes only the spatial dimension, which is not sufficient to support large trajectory datasets that have many semantic aspects. A survey of general indexing data structures is presented in Mahmood, Punni, and Aref (2018) and shows that only a limited number of works propose indexes considering all three dimensions (space, time, and semantics). To the best of our knowledge, none of these indexes was developed for trajectory similarity purposes. In general, they focus on indexing only space, or space and time for range and top-K queries. Another common limitation of the index data structures is the considerable storage cost due to the redundant data structures when indexing the three dimensions.
In this work we aim to answer the following question: can we build an efficient index for point-to-point multiple aspect trajectory similarity measuring that takes into account all dimensions of space, time, and semantics? In this article we propose an index data structure for historical data called the MAT-Index that significantly reduces the processing time for measuring the multidimensional similarity of trajectories with MSM and MUITAS. The MAT-Index construction avoids the need for the point-to-point comparison of MSM and MUITAS, and its main advantage and difference from the state of the art is that, apart from being able to consider all different dimensions in a single data structure, the final index contains the matching scores. Our proposal is an index that supports similarity measures for multiple aspect trajectories; how similarity among the different aspects is computed depends on the definition of the measure itself. A qualitative evaluation shows how the MAT-Index can drastically reduce the number of comparisons, and a quantitative evaluation shows a gain for all tested scenarios of up to 98.1% processing time reduction and up to 87.2% in scalability evaluations.
The remainder of the article is structured as follows. Section 2 describes the concepts required to understand this work. Section 3 discusses related works, highlighting their limitations. Section 4 introduces the proposed index for multiple aspect trajectories. Section 5 discusses the evaluation results that assess work efficiency.
Finally, Section 6 concludes the article and suggests directions for future work.
FIGURE 2 Two semantic trajectories (Furtado et al., 2016)

| BASIC CONCEPTS
This section presents the main concepts needed to understand this work. Section 2.1 defines multiple aspect trajectories, Section 2.2 presents the basics about the similarity measures developed for this type of data, focusing on MSM (Furtado et al., 2016) and MUITAS (Petry et al., 2019), and Section 2.3 presents some basic definitions.

| Multiple aspect trajectories
A multiple aspect trajectory is a sequence of points $T = \langle p_1, p_2, \ldots, p_n \rangle$, such that $p_i = (x, y, t, A)$ is its $i$th point composed of a location $(x, y)$, also called the spatial dimension, a time-stamp $t$, also called the temporal dimension, and a nonempty set of aspects $A = \{a_1, a_2, \ldots, a_m\}$, representing the semantic dimension.
An aspect $a = (desc, ATV)$ is a relevant real-world fact for mobility data analysis. It is composed of a description ($desc$) and an instance of its corresponding aspect type ($a_{type}$), which is represented as a set of attribute-value pairs $ATV = \{att_1 : v_1, att_2 : v_2, \ldots, att_z : v_z\}$.
An aspect type $a_{type} = \{att_1, att_2, \ldots, att_z\}$ is a categorization of a real-world fact composed of a set of attributes ($att$). In other words, an aspect type and its attributes act as a metadata definition for an aspect.
As an aid to understanding, consider the following example adapted from Mello et al. (2019) where an aspect type hotel is defined by the following attributes: geographic coordinates, address, and stars. A possible aspect related to this type could be Il Campanario Resort with the following attribute values: geographic coordinates, −27.439771, −48.500802; address, Buzios Ave., Florianopolis; stars, 5.
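The definitions above can be mirrored in a small data model. The following is only an illustrative sketch: the class names `Aspect` and `Point`, the field names, and the timestamp format are assumptions made for this example and are not part of the original definitions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Aspect:
    desc: str             # description, e.g. the name of the place
    aspect_type: str      # name of the aspect type (the metadata definition)
    atv: Dict[str, Any]   # attribute-value pairs of this aspect instance


@dataclass
class Point:
    x: float              # spatial dimension
    y: float
    t: str                # temporal dimension (string format is an assumption)
    aspects: List[Aspect] # semantic dimension

    def __post_init__(self):
        # The definition requires a nonempty set of aspects.
        if not self.aspects:
            raise ValueError("a trajectory point needs at least one aspect")


# The hotel aspect from the example in the text.
hotel = Aspect(
    desc="Il Campanario Resort",
    aspect_type="hotel",
    atv={
        "geographic coordinates": (-27.439771, -48.500802),
        "address": "Buzios Ave., Florianopolis",
        "stars": 5,
    },
)

# The timestamp is invented for illustration only.
point = Point(x=-27.439771, y=-48.500802, t="2019-01-01T10:00", aspects=[hotel])
trajectory = [point]  # a trajectory is a sequence of such points
```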

| Similarity measures
Similarity measures express on a numerical scale how similar two points are. Several similarity measures have been developed for sequential data, such as LCSS (Vlachos et al., 2002), EDR (Chen et al., 2005), and w-constrained Discrete Fréchet distance (Ding, Trajcevski, & Scheuermann, 2008), or for trajectories, such as common visit time interval (Kang, Kim, & Li, 2009), maximal semantic trajectory pattern similarity (Ying, Lu, Lee, Weng, & Tseng, 2010), maximum travel match (Xiao, Zheng, Luo, & Xie, 2012), MSM (Furtado et al., 2016), UMS (Furtado et al., 2018), stops and moves similarity measure (SMSM) (Lehmann, Alvares, & Bogorny, 2019), and multiple aspect trajectory similarity measure (MUITAS) (Petry et al., 2019). To the best of our knowledge, only MSM, SMSM, and MUITAS were specifically developed for semantic or multiple aspect trajectories, supporting all their three dimensions: space, time, and semantics. Only MSM and MUITAS deal with independent attributes, and MUITAS is the only one that also considers semantically related attributes. MSM and MUITAS compute the similarity of two trajectories considering a point-to-point analysis, and are currently the most robust for similarity, therefore we will focus on these measures in the indexing proposal.

| Multidimensional similarity measure
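MSM (Furtado et al., 2016) computes the similarity between two trajectories $P = (p_1, \ldots, p_n)$ and $Q = (q_1, \ldots, q_m)$, comparing each point $p \in P$ to all points $q \in Q$. It computes the parity of two trajectories, (P, Q) and (Q, P), using the maximum matching score given by:

$$\mathrm{parity}(P, Q) = \sum_{p \in P} \max_{q \in Q} \mathrm{score}(p, q)$$

The score is computed as:

$$\mathrm{score}(p, q) = \sum_{k=1}^{|A|} \mathrm{match}_k(p, q) \cdot \omega_k$$

and provides a matching score of two points p and q for each dimension. It involves pairwise comparison of the attributes A of p with q, considering the user-defined thresholds $maxDist_k$, and scores the matches according to their respective weights ($\omega$):

$$\mathrm{match}_k(p, q) = \begin{cases} 1, & \text{if } \mathrm{dist}_k(p, q) \le maxDist_k \\ 0, & \text{otherwise} \end{cases}$$

The total similarity of two trajectories is given by summing both parities and dividing the result by the sum of both trajectory lengths (P and Q):

$$\mathrm{MSM}(P, Q) = \frac{\mathrm{parity}(P, Q) + \mathrm{parity}(Q, P)}{|P| + |Q|}$$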
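The description above can be sketched in a few lines of code. This is a minimal illustration written from the prose, not the reference implementation of MSM: the function names, the tuple representation of points, the treatment of empty trajectories, and the example places in `B` are assumptions.

```python
def equal_dist(a, b):
    """Distance for categorical values: 0 when equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


def match(p, q, k, dists, max_dists):
    """1 if p and q are within the threshold in dimension k, else 0."""
    return 1 if dists[k](p[k], q[k]) <= max_dists[k] else 0


def score(p, q, dists, max_dists, weights):
    """Weighted sum of the per-dimension matches of two points."""
    return sum(match(p, q, k, dists, max_dists) * weights[k]
               for k in range(len(weights)))


def parity(P, Q, dists, max_dists, weights):
    """Sum, over the points of P, of the best score against any point of Q."""
    return sum(max(score(p, q, dists, max_dists, weights) for q in Q)
               for p in P)


def msm(P, Q, dists, max_dists, weights):
    """Both parities divided by the sum of the trajectory lengths."""
    if not P or not Q:  # assumption: empty trajectories have similarity 0
        return 0.0
    total = (parity(P, Q, dists, max_dists, weights)
             + parity(Q, P, dists, max_dists, weights))
    return total / (len(P) + len(Q))


# One semantic dimension (the visited place); the places in B are invented.
A = [("hotel",), ("bank",), ("mall",)]
B = [("bank",), ("park",), ("mall",), ("home",), ("hotel",)]
similarity = msm(A, B, dists=[equal_dist], max_dists=[0.0], weights=[1.0])
print(similarity)  # 0.75: all 3 points of A and 3 of the 5 points of B match
```

Every call to `parity` touches all |P| × |Q| point pairs, which is the quadratic cost that motivates an index.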

| Multiple aspect trajectory similarity
MUITAS (Petry et al., 2019) is the first similarity measure natively developed to work with multiple aspect trajectories. MUITAS introduced an essential concept of relationship between attributes, being the first to consider trajectory dimensions as totally independent, partially independent, or dependent. When a set of attributes is defined as dependent or partially independent, this set is called a feature. Such features make MUITAS a flexible measure that supports MSM when attributes are independent and is similar to LCSS when the attributes are defined as dependent, forcing a match of all attributes in the feature. It shares with MSM and SMSM the support of different data types and the ability to assign different distance functions to each attribute.
A feature $f = \{a_1, a_2, \ldots, a_z\}$ is a non-empty set of attributes of a multiple aspect trajectory. It is possible to aggregate attributes to work as independent and dependent by using this concept. For instance, a feature $f_i$ = {place category, price tier, rating} represents information about places of interest (POIs) visited. There are three associated attributes, and this analysis unit is dependent. However, the feature $f_j$ = {weather condition} represents an independent analysis unit.
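Suppose P and Q are trajectories and p and q are trajectory points, such that $p \in P$ and $q \in Q$. Then the matching score between p and q is computed as:

$$\mathrm{match}_{\mathcal{F}}(p, q) = \begin{cases} 1, & \text{if } \forall A \in \mathcal{F}: \mathrm{dist}_A(p, q) \le \tau_A \\ 0, & \text{otherwise} \end{cases}$$

For each attribute A of a feature $\mathcal{F}$, the points match if the distance between them does not exceed a given threshold ($\tau_A$).
For each feature, the score is computed as the weighted sum ($\omega$) of matching points:

$$\mathrm{score}(p, q) = \sum_{\mathcal{F}} \mathrm{match}_{\mathcal{F}}(p, q) \cdot \omega_{\mathcal{F}}$$

After comparing all points, MUITAS calculates the parity(P, Q) as the sum of the best scores of each attribute of each point $p \in P$ compared to Q:

$$\mathrm{parity}(P, Q) = \sum_{p \in P} \max_{q \in Q} \mathrm{score}(p, q)$$

and similarly parity(Q, P).
The final similarity score between two trajectories is the sum of parities, divided by the sum of the trajectory lengths:

$$\mathrm{similarity}(P, Q) = \frac{\mathrm{parity}(P, Q) + \mathrm{parity}(Q, P)}{|P| + |Q|}$$

Note that the flexibility of both point-to-point approaches outperformed the other previously mentioned methods. However, it also made them very expensive in terms of processing time.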
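To make the role of features concrete, the sketch below groups attributes into a dependent feature and an independent one. It is written from the prose description only: the function names, the tuple layout of a point, the use of "less than or equal" for the threshold test, and all data values are assumptions.

```python
def equal_dist(a, b):
    return 0.0 if a == b else 1.0


def abs_dist(a, b):
    return abs(a - b)


def feature_match(p, q, feature, dists, thresholds):
    """A feature matches only if every attribute in it is within its threshold."""
    return 1 if all(dists[a](p[a], q[a]) <= thresholds[a] for a in feature) else 0


def score(p, q, features, weights, dists, thresholds):
    """Weighted sum of the matching features of two points."""
    return sum(feature_match(p, q, f, dists, thresholds) * w
               for f, w in zip(features, weights))


def parity(P, Q, features, weights, dists, thresholds):
    return sum(max(score(p, q, features, weights, dists, thresholds) for q in Q)
               for p in P)


def muitas(P, Q, features, weights, dists, thresholds):
    if not P or not Q:  # assumption: empty trajectories have similarity 0
        return 0.0
    total = (parity(P, Q, features, weights, dists, thresholds)
             + parity(Q, P, features, weights, dists, thresholds))
    return total / (len(P) + len(Q))


# Point layout: (place category, price tier, rating, weather condition).
dists = [equal_dist, abs_dist, abs_dist, equal_dist]
thresholds = [0.0, 0.0, 0.25, 0.0]
features = [(0, 1, 2), (3,)]  # a dependent POI feature and an independent weather feature
weights = [0.5, 0.5]

P = [("restaurant", 2, 4.5, "clear")]
Q = [("restaurant", 2, 4.0, "clear"),  # rating too far apart: POI feature fails
     ("restaurant", 2, 4.5, "rain")]   # POI feature matches, weather does not
result = muitas(P, Q, features, weights, dists, thresholds)
print(result)  # 0.5
```

Declaring every attribute as its own single-attribute feature makes the sketch behave like the independent, MSM-style matching described above.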

| Indexes
There are several indexes developed specifically for spatial data. Examples are the Quadtree (Finkel & Bentley, 1974), Z-Ordering Tree (Morton, 1966), and OCTree (Meagher, 1980). In fact, trees are one of the most commonly adopted data structures for indexing purposes, but the previously mentioned indexes do not support multidimensional spatiotemporal similarity.
For semantic contents, an inverted index (also known as an inverted list) is frequently used. It stores the data in a more compact and organized way, like a dictionary composed of two main parts: the search structure (aka keyword or key), and a list of references (aka value) (Büttcher, Clarke, & Cormack, 2016). Figure 3 illustrates an example of an inverted index built from seven documents whose keyword occurrences were previously processed (left). The corresponding inverted index (right) stores the distinct keywords Hotel, Cinema, Park, Home, and Work as keys, each one holding the references to the documents of the table on the left that contain it.
A naive strategy for text retrieval compares a list of queried words with all keyword documents. The inverted list provides single access to obtain all references that contain the same keyword, which limits the universe to be compared in a single access.
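A minimal sketch of an inverted index follows. The function name and the document contents are invented for illustration and do not reproduce the table of Figure 3.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each distinct keyword to the list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id, keywords in documents.items():
        for keyword in set(keywords):  # count each keyword once per document
            index[keyword].append(doc_id)
    return dict(index)


documents = {
    1: ["Hotel", "Park"],
    2: ["Home", "Work"],
    3: ["Hotel", "Cinema"],
    4: ["Park", "Home"],
}
index = build_inverted_index(documents)

# A single dictionary access returns every document containing the keyword,
# instead of scanning all documents for each queried word.
print(index["Hotel"])  # [1, 3]
```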
Despite the several solutions for different data, indexing space, time, and semantic data together is far more complex, especially for point-to-point comparison purposes. It is necessary to keep fast access to the entire dataset content for a problem that requires quadratic computation and does not allow pruning strategies.

| RELATED WORKS
In the state of the art there is an extensive list of access methods for trajectories, organized in Mahmood et al. (2018) according to the temporal context of the data (past, present, and future; we focus on the past), with different indexing arrangements. We observe that only a few of these works index textual data together with spatial and/or temporal data. Regarding these works, some papers do not deal with all three dimensions together; Zheng et al. (2013, 2015) aim to efficiently answer top-K searches. The approaches for solving range queries and top-K problems lead to intrinsic pruning strategies that, in general, restrict the indexed access to a small portion of the entire dataset, facilitating processing. Works limited to exact matches (Han et al., 2015; Skovsgaard et al., 2014) have the same problem, since each queried term can restrict the content where the next term must occur; otherwise, the candidate does not satisfy the search. However, before drawing conclusions, it is crucial to comprehend the whole picture by comparing trajectories, not just isolated points. In this way, we can more assertively extract information based on similar behaviors, not on circumstantial cases. Thus, these approaches are incompatible with our aim of comparing entire datasets, since it is not feasible to prune any of the multiple aspect trajectory dimensions and still establish a precise comparison.
All indexes mentioned in this section have hybrid data structures that are connected. Linking them and keeping redundant data to optimize the access demands a high storage cost. The high demand for memory allocation could force the operating system to excessively transfer data between memory levels, delaying access; this is a known problem referred to as thrashing. Thus, some solutions design hybrid (i.e., part memory, part disk; Han et al., 2015; Liu et al., 2017; Zheng et al., 2013, 2015) or disk (Issa & Damiani, 2016) allocation strategies to avoid system collapse. However, the disk and hybrid allocation solutions require transferring data, and thus multiple accesses.
Secondary memory is a very slow resource, and is therefore inefficient for processing large and complex trajectory datasets. Still, some works deal with performance by adopting approximate solutions (Skovsgaard et al., 2014; Zheng et al., 2015), which, for similarity search purposes, would propagate a possible error to all other related comparisons, affecting the reliability of the score.
To the best of our knowledge, there are no works in the literature that fully index the three multiple aspect trajectory dimensions for similarity measurement. Indeed, existing indexing works provide neither a data structure to avoid the point-to-point matching nor the number of matches between points for each dimension, including the partial matches. Therefore, a novel data structure is needed to accurately process an entire trajectory dataset and return the top matching scores, preventing redundant comparisons.

| MULTIPLE ASPECT TRAJECTORY INDEX
In this section we propose the MAT-Index, a novel access method designed to speed up the similarity analysis of multiple aspect trajectories. It focuses on indexing all three dimensions of space, time, and semantics in a single data structure, in order to facilitate the comparison between trajectories, eliminating redundant operations.
Calculation of the MAT-Index is divided into six steps, as presented in Figure 4. The load step stores each dimension in a separated data structure that is processed as follows: the spatial indexing and the temporal indexing treat the spatial and temporal data, respectively; the semantic combine and the semantic compress steps process the semantic content. The Dimensions Integration step integrates the three resulting data structures built in the previous steps. The following sections detail each step of the MAT-Index.

| Load
The load stage saves the spatial, temporal, and semantic dimensions into separate data structures to be merged in the last index step. Figure 5 shows the data structures saved in the load step. Figure 5a shows an example of the dataset, where each row contains the information associated with a trajectory point: the trajectory identifier (tid), followed by the spatial coordinates (x, y), the time (time), and the semantic attributes price, poi, and weather. The tid is repeated according to the number of points the trajectory has. In the example, 11 points belong to three trajectories: trajectory 126 has 4 points, trajectory 127 has 4 points, and trajectory 128 has 3 points.
Algorithm 1 presents the load pseudo-code. It consists of reading the dataset and saving each trajectory dimension in its proper data structure. The first row (rId = 0), the header of the dataset, is processed by the addAttributeToTwoLevelIndex function (line 1). It saves the names of the semantic attributes contained in the header, as they come right after the spatial coordinates and the temporal dimension. The aim is to group the trajectory points that have the same semantic attribute values while preserving their contexts. In the example, the attribute names price, poi, and weather are the first-level keys of the TwoLevelIndex data structure, as presented in Figure 5e in the delimited dashed area on the left, called the first-level key. From rId = 1 to rId = 11, that is, for all trajectory points, the process is repeated, sequentially reading the dataset.
For each row, it stores the trajectory dimensions as follows: the command addToSpatialIndex saves the SpatialIndex in the format shown in Figure 5b; the command addToTemporalIndex saves the TemporalIndex in the format shown in Figure 5c; and the functions addToSemDictionary and addValueToTwoLevelIndex save the semantic […]
The MAT-Index spatial logical grid does not demand allocating a matrix, which would give rise to the sparsity issue (note that only 9 of the 198 cells in the corresponding grid are non-empty) and to the difficulty of balancing the size of the interval against the memory required to process it. The cell size is calculated based on the threshold defined by the similarity measure, as MUITAS and MSM define a threshold τ that specifies the maximum spatial distance between two points to consider them as similar. Therefore, MAT-Index builds the SpatialIndex as square cells such that the maximum distance between two points in the same cell never exceeds this threshold. Since the maximum distance in a square is its diagonal, we assume this diagonal to be τ.

FIGURE 4 MAT-Index flow
Thus, all the points in the same cell of Figure 6 (left), such as the pairs of trajectory points [1, 2] and [8, 11], automatically match, since the distance between them is below the threshold; no spatial comparison is required in these cases. The SpatialIndex is then created as an inverted list whose keys are the positions/addresses of the cells and whose values are the lists of rIds that represent the points inside each cell.
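The logical grid can be sketched as a dictionary keyed by cell address, so that empty cells are never allocated. In the sketch below, the function names, the grid origin at (0, 0), and the coordinates are assumptions for illustration; only the relation "cell diagonal equals τ" comes from the text, which gives a cell side of τ/√2.

```python
import math
from collections import defaultdict


def cell_side(tau):
    """Side of a square cell whose diagonal equals the spatial threshold tau."""
    return tau / math.sqrt(2)


def build_spatial_index(points, tau):
    """Inverted list: cell address -> rIds of the points inside that cell.

    `points` maps rId -> (x, y). The grid is anchored at the origin, which is
    an assumption of this sketch.
    """
    side = cell_side(tau)
    index = defaultdict(list)
    for rid, (x, y) in points.items():
        address = (math.floor(x / side), math.floor(y / side))
        index[address].append(rid)
    return dict(index)


points = {1: (0.2, 0.3), 2: (0.8, 0.9), 3: (5.1, 2.4)}
spatial_index = build_spatial_index(points, tau=1.42)

print(round(cell_side(1.42), 2))  # 1.0
print(spatial_index)              # {(0, 0): [1, 2], (5, 2): [3]}
```

Points 1 and 2 share a cell, so they match without any distance computation; only two of the possible cells are ever stored.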
Regarding the temporal content, MAT-Index allocates only actual occurrences in an inverted list, similar to the spatial indexing process. It is worth observing that a day can be expressed as 24 h, 1440 min, or even 86,400 s, even if this last option is less likely to be used. This is a small number of possibilities compared with the considerable amount of data to be processed. Therefore, time is a unidimensional value that can be seg- […] Figure 5e.
It is worth mentioning that the two-level model preserves the context of the values, since each attribute name holds its own distinct values as they appear in the dataset of Figure 5. For instance, the value 1 could represent a price, a rating, or an age. As previously mentioned, line 1 of Algorithm 1 saves the first level of the two-level inverted index, while line 6 populates its second level. The combine step (Section 4.4) merges both data structures, and the algorithm finishes by returning all intermediate files.

| Spatial computation
For the spatial dimension, MAT-Index uses a logical grid, storing only the cell addresses that contain at least one trajectory point. The spatial dimension demands a pairwise comparison of two trajectory points to check if the distance between them does not exceed the similarity threshold (τ). Therefore, MAT-Index uses the auxiliary SpatialIndex data structure generated in the load step to avoid the comparison among all trajectory points. The SpatialIndex presented in Figure 5b is an inverted list where each entry is a pair ⟨key, value⟩, each key being a cell address and the value a list of rIds (trajectory points) belonging to the same key (cell address).
Algorithm 2 explains how the spatial computation step works. First, it sequentially reads each entry of the […]. After the cell address processing, the entry is removed from the SpatialIndex (line 11) to prevent the points from being double-checked, which is unnecessary due to the symmetry of the spatial distance.
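The idea of matching whole cells first and computing distances only for nearby cells can be sketched as follows. This is not Algorithm 2: the function name and data are assumptions, and so is the neighbourhood size. Because the cell side is τ/√2, two points within τ of each other can lie up to two cells apart along an axis, so this sketch scans a 5 × 5 neighbourhood. Instead of removing processed entries, it visits each unordered pair of cells once by comparing cell addresses, which serves the same purpose of exploiting the symmetry of the distance.

```python
import math
from collections import defaultdict


def spatial_matches(points, tau):
    """Return rId -> set of rIds whose Euclidean distance is at most tau."""
    side = tau / math.sqrt(2)
    grid = defaultdict(list)
    for rid, (x, y) in points.items():
        grid[(math.floor(x / side), math.floor(y / side))].append(rid)

    matches = {rid: set() for rid in points}
    for (cx, cy), rids in grid.items():
        # Points in the same cell match automatically (cell diagonal = tau).
        for rid in rids:
            matches[rid].update(rids)
        # Nearby cells need an explicit distance check.
        for dx in range(-2, 3):
            for dy in range(-2, 3):
                other = (cx + dx, cy + dy)
                # Visit each unordered pair of cells only once.
                if other <= (cx, cy) or other not in grid:
                    continue
                for a in rids:
                    for b in grid[other]:
                        if math.dist(points[a], points[b]) <= tau:
                            matches[a].add(b)
                            matches[b].add(a)
    return matches


points = {1: (0.2, 0.3), 2: (0.8, 0.9), 3: (1.5, 0.9), 4: (5.0, 5.0)}
matches = spatial_matches(points, tau=1.42)
print(matches[2])  # {1, 2, 3}
```

In this sketch every point also matches itself, since it shares a cell with itself.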
Going back to the running example, for the sake of understanding, suppose the spatial threshold is 1.42, resulting in a cell size equal to 1. We use the Euclidean distance to compute the spatial distance, but any other distance measure could be used. In Figure 5b,

| Temporal computation
For the time indexing, the strategy is to create, for each temporal index entry, as shown in the example of Figure 5c, a list with all the matching rIds, which are the trajectory points belonging to the cells in the interval admitted by the temporal threshold τ.
Algorithm 3 shows the pseudo-code that receives as input the TemporalIndex generated in the load step and provides updated TemporalMatches as output. It starts by sequentially reading the entries in the TemporalIndex […]
The temporal approach brings two benefits. First, the allocation by unit allows us to get the matches by aggregating the rIds belonging to the groups in the same interval; this way, we avoid comparing the temporal content among all trajectories. Second, the number of index entries is limited by the threshold unit (i.e., if expressed in minutes, 1440 possibilities), thus it tends to require fewer iterations to process the temporal dimension. It is worth noticing that real datasets are bigger than our running example. Thus, the list of rIds tends to contain, on average, more points, since we have a very limited number of possible cells allocated (1440 in this case). This saves memory and especially processing time, because the idea is to process the matches in groups (by cell), not by rId.
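A minimal sketch of the group-wise temporal matching follows. It is not Algorithm 3: the function names and the data are assumptions, time is taken as the minute of the day, and the sketch does not wrap around midnight, which is a further assumption.

```python
from collections import defaultdict


def build_temporal_index(minute_of):
    """Inverted list: minute of the day -> rIds observed at that minute."""
    index = defaultdict(list)
    for rid, minute in minute_of.items():
        index[minute].append(rid)
    return dict(index)


def temporal_matches(temporal_index, tau):
    """Return rId -> set of rIds whose time differs by at most tau minutes.

    Matches are computed once per occupied minute (group) and the resulting
    set is shared by all rIds of that group.
    """
    matches = {}
    for minute, rids in temporal_index.items():
        group = set()
        for m in range(minute - tau, minute + tau + 1):
            group.update(temporal_index.get(m, ()))
        for rid in rids:
            matches[rid] = group
    return matches


minute_of = {1: 540, 2: 545, 3: 600, 4: 552, 5: 540}  # e.g. 540 = 09:00
t_index = build_temporal_index(minute_of)
t_matches = temporal_matches(t_index, tau=10)
print(t_matches[2])  # {1, 2, 4, 5}
```

Points 1 and 5 fall into the same group, so their match set is computed only once.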

| Semantic combine
[…] is N. Thus, note that rIds 1, 4, and 9 are composed of the same set of attribute values, so for three attribute values the maximum score is 3. Figure 9 shows the entire composite index after the combine step execution, including the part previously shown in Figure 8. Here we can observe one of the main advantages of our proposal: once the composite index is ready, a single direct access may retrieve the number of semantic features that match between trajectory points, although the points were never pairwise compared.
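The match counter idea can be sketched as follows. This is written from the description rather than from Algorithm 4: the function names, the use of Python sets in place of BitSets, and the rows themselves are assumptions for illustration.

```python
from collections import defaultdict


def build_two_level_index(rows, attributes):
    """First level: attribute name; second level: value -> set of rIds."""
    index = {att: defaultdict(set) for att in attributes}
    for rid, row in rows.items():
        for att in attributes:
            index[att][row[att]].add(rid)
    return index


def combine(rows, attributes):
    """Composite key (tuple of values) -> {rId: number of matching values}."""
    two_level = build_two_level_index(rows, attributes)
    composite = {}
    for row in rows.values():
        key = tuple(row[att] for att in attributes)
        if key in composite:  # each distinct combination is processed once
            continue
        counter = dict.fromkeys(rows, 0)
        for att, value in zip(attributes, key):
            # Every rId sharing this attribute value gains one match.
            for rid in two_level[att][value]:
                counter[rid] += 1
        composite[key] = counter
    return composite


attributes = ["price", "poi", "weather"]
rows = {
    1: {"price": "$$", "poi": "Home", "weather": "Clear"},
    2: {"price": "$", "poi": "Cafe", "weather": "Clear"},
    3: {"price": "$$", "poi": "University", "weather": "Clouds"},
    4: {"price": "$$", "poi": "Home", "weather": "Clear"},
}
composite = combine(rows, attributes)
print(composite[("$$", "Home", "Clear")])  # {1: 3, 2: 1, 3: 1, 4: 3}
```

Rows 1 and 4 share the same composite key, so one counter serves both, and no pair of points is compared directly.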
The match counter of a composite key shows the number of matches for all trajectory points. For instance, every trajectory point that has the semantic combination ⟨$$, Home, Clear⟩ (see the first row of the tables in

| Semantic compress
The similarity algorithms for multiple aspect trajectories MSM (Furtado et al., 2016) and MUITAS (Petry et al., 2019) retrieve the best semantic match of a point when compared to a trajectory. In this case, we can further compress the index by keeping the maximum score. Therefore, the compress step stores only the top scores by trajectory, saving memory and avoiding redundant comparisons that would degrade the performance.
Algorithm 5 shows the pseudo-code for the compress step. For each semCompositeKey in SemCompositeIndex

FIGURE 9 Semantic index after match counter computation
Figure 10 presents the scores of the running example before and after the compression, by trajectory point (Figure 10a) and by trajectory (Figure 10b). We observe that the 11 trajectory points turned into only three points, corresponding to the number of trajectories, since we only keep the maximum score for each trajectory and not all the points. The first column of the MatchCounter in Figure 10a corresponds to the header position in the dataset (see Figure 5a at rId = 0), that is, where the names of attributes are placed. The header does not contain trajectory content, being maintained at first to avoid testing the position multiple times during the load and the combine processing. Therefore, the rId = 0 is discarded in this step.
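The compression can be sketched as a per-trajectory maximum over the match counter. The function name, the dictionary layout, and the scores below are assumptions for illustration and do not reproduce Figure 10.

```python
def compress(composite_index, tid_of):
    """Keep, for each composite key, only the top score per trajectory.

    `composite_index` maps key -> {rId: score}; `tid_of` maps rId -> tid.
    """
    compressed = {}
    for key, counter in composite_index.items():
        best = {}
        for rid, score in counter.items():
            tid = tid_of[rid]
            best[tid] = max(best.get(tid, 0), score)
        compressed[key] = best
    return compressed


tid_of = {1: 126, 2: 126, 3: 127, 4: 127, 5: 128}
composite_index = {
    ("$$", "Home", "Clear"): {1: 3, 2: 1, 3: 1, 4: 2, 5: 0},
}
compressed = compress(composite_index, tid_of)
print(compressed[("$$", "Home", "Clear")])  # {126: 3, 127: 2, 128: 0}
```

Five per-point scores shrink to three per-trajectory scores, one for each trajectory.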

| Dimensions integration
Indexing space, time, and several semantic dimensions in a single data structure avoids redundant comparisons, speeding up the trajectory similarity analysis. Therefore, the index integration phase consolidates the matching scores into a single data structure, finalizing the MAT-Index construction.
The pseudo-code for this step is presented in Algorithm 6. For each point p (line 1), the semantic composite key of p is used to retrieve its corresponding compressed match counter, which is stored in the auxiliary variable auxCMatchCounter (line 2). The compressed match counter holds the top scores by trajectory for a valid combination of semantic attribute values. In line 3, the function gets all the points that match p in both space and time, using an AND operator to obtain the list of rIds (bothMatch). Then, for each rId in bothMatch, the auxCMatchCounter of p is updated at the compressed id position. The compressed id (cId) identifies the trajectory to which the rId belongs. The counter is updated with the maximum between its current value at the cId position, as shown in Figure 11a, and the value of the MatchCounter at the rId position increased by 2, as shown in Figure 11b. By checking the maximum value, the method avoids repeating the compression step for each trajectory point.
After updating the cases that match in both space and time, the algorithm treats the cases that exclusively […] auxCMatchCounter is added to the MAT-Index. After all points p are processed, the compressed composite index in Figure 11a and the composite index in Figure 11b are no longer required, so they are discarded (lines 13 and 14). The MAT-Index (illustrated in Figure 11c) is ready to use.
Returning to our example in Figure […] Note that the final score in Figure 11c is the maximum of the result incremented in Figure 11b and the auxiliary top score preserved in Figure 11a, as explained in Algorithm 6. Take the example of updating the match between rIds 3 and 11, starting with rId = 11 (⟨$$, University, Clouds⟩). The trajectory point rId = 3 corresponds to the trajectory cId = 0. In Figure 11b, the composite key ⟨$$, University, Clouds⟩ at position rId = 3 has a score of 3. This is higher than the old top score of 2 in Figure 11a. So, the method updates the score of trajectory point rId = 11 at cId = 0 to 3 in the MAT-Index (Figure 11c). The method repeats this process for all matches. The updates are highlighted in Figure 11c.
The final MAT-Index data structure, depicted in Figure 11c, considers the spatial, temporal, and semantic dimensions for each trajectory point. Therefore, it is sufficient to access the trajectory point using the row id to directly access the score at the corresponding trajectory position (the compressed row id, cId).
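The integration step can be sketched as follows. This is an illustration under stated assumptions, not Algorithm 6: the function name and all data are invented, Python sets stand in for BitSets, every dimension counts as one unweighted match, and the treatment of points that match in only one of space or time (an increment of 1) is an assumption of this sketch.

```python
def integrate(key_of, composite, compressed, spatial, temporal, tid_of):
    """Build rId -> {tid: best total score over space, time, and semantics}."""
    mat_index = {}
    for p, key in key_of.items():
        aux = dict(compressed[key])           # top semantic score per trajectory
        both = spatial[p] & temporal[p]       # match in space AND time
        only_one = spatial[p] ^ temporal[p]   # match in exactly one of them
        for rid in both:
            cid = tid_of[rid]
            aux[cid] = max(aux[cid], composite[key][rid] + 2)
        for rid in only_one:                  # assumption: +1 for a single match
            cid = tid_of[rid]
            aux[cid] = max(aux[cid], composite[key][rid] + 1)
        mat_index[p] = aux
    return mat_index


tid_of = {1: 126, 2: 127, 3: 127}
key_of = {1: ("$$", "Home"), 2: ("$$", "Home"), 3: ("$", "Work")}
composite = {
    ("$$", "Home"): {1: 2, 2: 2, 3: 0},
    ("$", "Work"): {1: 0, 2: 0, 3: 2},
}
compressed = {
    ("$$", "Home"): {126: 2, 127: 2},
    ("$", "Work"): {126: 0, 127: 2},
}
spatial = {1: {1, 3}, 2: {2}, 3: {1, 3}}
temporal = {1: {1, 3}, 2: {2, 3}, 3: {1, 2, 3}}

mat_index = integrate(key_of, composite, compressed, spatial, temporal, tid_of)
print(mat_index[3])  # {126: 2, 127: 4}
```

Point 3 shares no semantic value with trajectory 126, yet it scores 2 against it because it matches rId 1 in both space and time. At query time, the best score of a point against a trajectory is a single lookup such as `mat_index[3][126]`.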

| Complexity analysis
The MAT-Index algorithm sequentially executes its six steps. All data are stored in hash maps using O(1) get/put operations.
The first step (load, Algorithm 1) performs a linear scan over the trajectory datasets and populates the data structures. It has complexity O(N), where N is the total number of trajectory points.
The second step (spatial computation, Algorithm 2) retrieves each rId distributed among the cell addresses.
The rIds belonging to the same cell automatically match; thus, only the rIds placed in adjacent cells require further computation. Here again we have O(N) complexity.
The third step (temporal computation, Algorithm 3) reads all the hash map entries sequentially. Let k be the number of distinct groups and let us assume we have N/k points in each group. Temporal computation performs 2τN/k operations for each group, resulting in 2τN operations plus N operations to assign the result to each trajectory point. We thus have O(2τN) complexity.
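The operation count of the temporal step can be written out explicitly. The numeric values in the last line are an invented example, not measurements from the evaluation.

```latex
\begin{align*}
\underbrace{k \cdot \frac{2\tau N}{k}}_{\text{all } k \text{ groups}}
  + \underbrace{N}_{\text{assignments}}
  &= 2\tau N + N = (2\tau + 1)\,N \\
\text{e.g. } \tau = 10,\; N = 10^{6}:\quad (2 \cdot 10 + 1) \cdot 10^{6}
  &= 2.1 \times 10^{7} \text{ operations}
\end{align*}
```

Since τ is fixed by the similarity measure and does not grow with the dataset, the cost is linear in N.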
The fourth step (semantic combine, Algorithm 4) processes each semantic composite key in the semantic dictionary. By taking advantage of the BitSet data structure used in the TwoLevelIndex, we directly obtain the rIds that must be updated and then sum the occurrences in the match counter data structure. After computing all match counters, we have added each attribute in A precisely N times (i.e., A × N times). Since N tends to be orders of magnitude larger than A, the process has linear complexity O(AN).
The final step (dimensions integration, Algorithm 6) updates only the cases that match. Since it maintains BitSets for spatial and temporal matches, the logical operations prevent double retrieval of the same content, and the step runs in linear time.
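The match-counter idea of the semantic combine step can be sketched in Python, using plain integers as bitsets in place of the BitSet class mentioned above. The function names, the input layout, and the point rId = 5 with its attribute values are hypothetical; only the composite key ⟨$$, University, Clouds⟩ for rIds 3 and 11 comes from the example in the text.

```python
def set_bits(bitset):
    """Yield the positions of the bits set in an integer used as a bitset."""
    position = 0
    while bitset:
        if bitset & 1:
            yield position
        bitset >>= 1
        position += 1


def semantic_match_counters(points):
    """points: dict rId -> tuple of attribute values (the composite key).

    Returns composite key -> {rId -> number of attributes that match}.
    """
    n_attributes = len(next(iter(points.values())))
    # One dictionary per attribute: value -> bitset of the rIds holding it.
    value_bits = [{} for _ in range(n_attributes)]
    for r_id, values in points.items():
        for i, value in enumerate(values):
            value_bits[i][value] = value_bits[i].get(value, 0) | (1 << r_id)

    counters = {}
    for key in set(points.values()):  # each distinct composite key once
        counter = {}
        for i, value in enumerate(key):
            for r_id in set_bits(value_bits[i][value]):
                counter[r_id] = counter.get(r_id, 0) + 1
        counters[key] = counter
    return counters


example_points = {
    3: ("$$", "University", "Clouds"),
    11: ("$$", "University", "Clouds"),
    5: ("$", "University", "Rain"),  # hypothetical point
}
example_counters = semantic_match_counters(example_points)
```

Because the counters are computed once per distinct composite key rather than once per point, repeated keys such as the one shared by rIds 3 and 11 are processed a single time.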

FIGURE 11 Dimensions Integration process: the semantic data structures (a) compressed; (b) expanded; and (c) the resulting MAT-Index after the step

After analyzing all MAT-Index processes, we can conclude that the MAT-Index bottleneck lies in the compression step, the complexity of which can vary from linear to quadratic depending on the number of semantic composite keys. Again, both situations are unlikely, as demonstrated in all the state-of-the-art assessed datasets.
Considering that MAT-Index processes the dimensions independently before integrating the data, in cases where the matching criteria are changed, only the affected dimension must be reprocessed and then reintegrated.
Considering the building costs for each dimension, the update requires minimal time compared to the original similarity measures.

| EVALUATING MAT-INDEX FOR TRAJECTORY SIMILARITY MEASUREMENT
In this section we evaluate the performance of MSM and MUITAS using MAT-Index. We also compare MAT-Index with the FTSM access method. We evaluate MAT-Index considering both qualitative and quantitative aspects. Section 5.1 concerns the qualitative perspective, where we show via an example how using our index simplifies the similarity computation, drastically reducing the number of comparisons needed to compute the similarity for both evaluated algorithms.
Section 5.2 presents a quantitative evaluation of MAT-Index, focusing on time and scalability metrics.

| Qualitative evaluation
We present a scenario in which both MUITAS and MSM employ MAT-Index to obtain the similarity score of two trajectories, 126 and 128. Trajectory 126 is composed of four points with rIds {1, 2, 3, 4}. To get their top scores against trajectory 128, it is necessary to retrieve, for each point, the value stored under key = rId at position cId = 2 (the position of trajectory 128). Since MAT-Index indexes five features (space, time, and the semantic attributes price, poi, and weather), the sum of the retrieved scores must be divided by 5. Therefore,

$$\mathrm{parity}(126, 128) = \frac{1}{5} \sum_{r \in \{1, 2, 3, 4\}} \text{MAT-Index}[r][2].$$

The process must be repeated by inverting the references for all points of 128 to get parity(128, 126), this time summing the scores retrieved at position cId = 0 (the position of trajectory 126) and again dividing by 5. After calculating both parities, the final similarity score, generically indicated by Sim(126, 128), is the result of both the MSM(126, 128) and MUITAS(126, 128) computations: the sum of the parities divided by the sum of the trajectory lengths (number of points), such that:

$$\mathrm{Sim}(126, 128) = \frac{\mathrm{parity}(126, 128) + \mathrm{parity}(128, 126)}{|126| + |128|}.$$

This process can be repeated for any pair of trajectories, using their rIds as keys to find the top scores of all trajectory points in the dataset.
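The lookup-based computation described above can be sketched in Python. The dictionary layout, the function names, the rIds assumed for trajectory 128 (9, 10, 11), and every score value are hypothetical and chosen only to make the arithmetic easy to follow; they are not the values of the original example.

```python
def parity(mat_index, r_ids, other_c_id, n_features):
    """Sum the stored top scores of the points against one trajectory."""
    return sum(mat_index[r_id][other_c_id] for r_id in r_ids) / n_features


def similarity(mat_index, traj_a, traj_b, n_features):
    """traj_a and traj_b are (cId, list of rIds) pairs."""
    c_a, r_ids_a = traj_a
    c_b, r_ids_b = traj_b
    total = (parity(mat_index, r_ids_a, c_b, n_features)
             + parity(mat_index, r_ids_b, c_a, n_features))
    return total / (len(r_ids_a) + len(r_ids_b))


# Hypothetical top scores: rId -> {cId -> score}.
mat_index = {
    1: {2: 5}, 2: {2: 3}, 3: {2: 4}, 4: {2: 3},   # points of trajectory 126
    9: {0: 4}, 10: {0: 3}, 11: {0: 3},            # assumed points of trajectory 128
}
traj_126 = (0, [1, 2, 3, 4])
traj_128 = (2, [9, 10, 11])

# Five features: space, time, price, poi, and weather.
sim = similarity(mat_index, traj_126, traj_128, n_features=5)
```

With these made-up scores the two parities are 15/5 = 3 and 10/5 = 2, so the similarity is 5/7. The computation touches seven index entries instead of comparing all 4 × 3 point pairs on every feature.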

| Quantitative evaluation
The quantitative evaluation is organized into two parts. In the first part we employ two publicly available datasets composed of spatial, temporal, and multiple semantic dimensions to compare the running times of the similarity measures MSM and MUITAS with and without MAT-Index support. In the second part we employ a synthetic dataset to evaluate the scalability of the index and to assess the impact of the trajectory size on processing times.
All quantitative experiments were performed on an Intel® Core™ i7-9750H Coffee Lake CPU @ 2.60 GHz (12 MB cache), 32 GB Crucial Dual-Channel @ 1.330 , 500 GB Samsung SSD 970 EVO Plus, and 4 GB NVIDIA GeForce GTX 1650 using Windows 10 Education 64-bit, booted to a command prompt without graphical and network support to avoid interference from background processes.

| Evaluating MAT-Index processing time for similarity measuring
We split this experiment into two parts, where we are interested in (i) how much faster both similarity measures process the entire dataset when employing the MAT-Index and FTSM access methods, and (ii) how the number of attributes and the distinct semantic combinations affect the performance. We used two publicly available datasets that contain spatial, temporal, and multiple semantic dimensions. One is the Foursquare NYC (Yang, Zhang, Zheng, & Yu, 2015) dataset containing real data, while the other is a benchmark dataset generated by BerlinMOD (Düntgen, Behr, & Güting, 2009).

We observe from Figure 12 that, in the original implementations, MUITAS performed worse than MSM on all datasets due to the computational overhead of checking the semantically related attributes. We also observe that, although FTSM efficiently indexes the spatial dimension in average cases, its integration into both the MSM and MUITAS similarity measures for space, time, and semantics increased the processing time on all datasets.
However, MAT-Index treats all dimensions in an integrated data structure; thus, the results are much better, with execution times reduced by between 93.6% and 98.1%. The semantic relationships among the attributes did not affect the performance, with very close results for both similarity measures. On the larger dataset, BerlinMOD, both MSM and MUITAS performed even better than on Foursquare, suggesting that the larger the dataset, the better the results with MAT-Index. Since grouping data is the basic idea of MAT-Index, BerlinMOD performs better due to the low variability of its values, including time and space, which reduces the computation cost. This indicates that the variability of values influences performance and that the larger a dataset is, the more its data tend to repeat.
It is worth recalling that each set of attributes leads to a different number of distinct semantic composite keys.
Therefore, we evaluate how the number of semantic composite keys and attributes impacts the processing times in each dataset. Considering that the semantically related attributes did not affect the processing time in the first evaluation (Figure 12), both datasets were executed in all possible ways, excluding the semantically related attributes: the five semantic attributes from the Foursquare dataset (Table 1)   increasing with the number of attributes. As expected, the results with MAT-Index tend to improve as the number of trajectory aspects/attributes increases. This characteristic is particularly important for multiple aspect trajectories since they tend to have a large number of attributes.
In order to better understand how the variability of values affects performance, Figure 14 presents the same elapsed times as Figure 13, now grouped by the number of semantic composite keys processed. For instance, in the Foursquare dataset, the scenario ℱ = {{price}} produces only five distinct price possibilities (-1, 1, 2, 3, and 4), and the scenario ℱ = {{weather}} has six distinct possibilities (clear, clouds, fog, unknown, rain, and snow).
However, ℱ = {{price}, {weather}} has 29 distinct possibilities in the dataset (one fewer than the 5 × 6 = 30 combinations obtained by combining all price and weather values). Figure 14 shows the number of composite keys sorted in ascending order, grouped by the number of processed attributes. We segregate both results by dataset (indicated at the top) to compare how the length influences performance.
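Counting the distinct composite keys that actually occur in the data, as opposed to the full Cartesian product of attribute values, can be sketched as follows. The rows below are made up for illustration and are not taken from the Foursquare dataset; the function name is likewise hypothetical.

```python
def distinct_composite_keys(rows, attributes):
    """Return the set of attribute-value combinations present in the rows."""
    return {tuple(row[name] for name in attributes) for row in rows}


# Hypothetical trajectory points.
rows = [
    {"price": -1, "weather": "clear"},
    {"price": 1, "weather": "clear"},
    {"price": 1, "weather": "rain"},
    {"price": 1, "weather": "clear"},  # repeated combination
]

price_keys = distinct_composite_keys(rows, ["price"])
weather_keys = distinct_composite_keys(rows, ["weather"])
combined_keys = distinct_composite_keys(rows, ["price", "weather"])

# Upper bound if every price were observed with every weather value.
cartesian_bound = len(price_keys) * len(weather_keys)
```

In this toy data two prices and two weather values give a bound of four combinations, but only three occur, mirroring the 29 observed keys against the 30 possible ones in the scenario above.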
We observe how the dataset size and the number of attributes generate an explosion in the elapsed times for the original implementations. However, we also observe that the MAT-Index performance is more associated with the number of processed semantic composite keys, and this is the reason why the BerlinMOD scenarios are more rapidly processed than the Foursquare dataset, even if they have the same number of attributes. In conclusion, MAT-Index successfully reduces the execution time by between 84.5 and 98.1%.

| Evaluating the MAT-Index scalability performance
In this experiment we evaluate the scalability of MAT-Index by studying the impact of the trajectory size on the processing times. For this purpose, we use a synthetic dataset designed by Ferrero et al. (2020) with 200,000 points, having as attributes latitude, longitude, time, weekday, price, weather, and poi-category. The points are distributed into six versions of the dataset with increasing trajectory lengths of 10, 20, 50, 100, 200, and 500 points. The computation time required by each similarity measure with and without MAT-Index support is reported in Figure 15.
It is worth observing that, as expected, the shorter a trajectory is, the more the problem tends to a linear comparison. The two public datasets (

Considering all the results presented in this evaluation, we can conclude that MAT-Index performs consistently faster. Its efficiency is more associated with the number of semantic composite keys than with the dataset size. For this reason, our algorithm performs much faster on the larger BerlinMOD dataset than on the Foursquare dataset. The larger a dataset is, the more it tends to exhibit spatial, temporal, and semantic data repetition. For instance, most attributes such as weekdays, price categories, ratings, and even POIs are finite sets. Besides, according to the Pareto principle (Pareto, 1906), 20% of the causes account for 80% of the problems, reinforcing the repetition tendency. Thus, MAT-Index fits the aims of this work, namely, to speed up similarity processing for larger datasets.

| CONCLUSIONS AND FUTURE WORK
Conducting similarity measurement of multidimensional data such as trajectories is a very costly task for clustering, nearest-neighbor queries, etc. Existing similarity measures cannot deal with the high volume of real datasets.
Current indexes cannot efficiently manage all trajectory dimensions for similarity analysis. The proposed MAT-Index aims to fill this gap by indexing semantic content with multiple attributes, together with the spatial and temporal dimensions, into a compact data structure, ensuring efficiency in similarity computation while reducing data redundancy. The index is a combination of a dictionary and inverted indexes. The MAT-Index evaluation used the state-of-the-art trajectory similarity algorithms MSM and MUITAS. We show qualitatively how MAT-Index support for MSM and MUITAS drastically reduces the required comparisons. We also measure the index performance in terms of running time using two public datasets and scalability using one synthetic dataset. Experiments show an improvement of up to 98.1% in running time and 87.2% in scalability.
In future work we aim to test MAT-Index with additional semantically enriched trajectory datasets and to develop an efficient way to update the index without the need for full reprocessing when new data are added to the dataset.

ACKNOWLEDGMENTS
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001 and through the research project Big Data Analytics: Lançando Luz dos Genes ao

FIGURE 15 Elapsed times with varying trajectory size for the same content