Odyssey: A Journey in the Land of Distributed Data Series Similarity Search

This paper presents Odyssey, a novel distributed data-series processing framework that efficiently addresses the critical challenges of achieving good speedup and ensuring high scalability in data series processing, by taking advantage of the full computational capacity of modern clusters comprised of multi-core servers. Odyssey addresses a number of challenges in designing an efficient and highly scalable distributed data series index, including efficient scheduling and load-balancing, without paying the prohibitive cost of moving data around. It also supports a flexible partial replication scheme, which enables Odyssey to navigate a fundamental trade-off between data scalability and good performance during query answering. Through a wide range of configurations and using several real and synthetic datasets, our experimental analysis demonstrates that Odyssey achieves its challenging goals.

As the size of data series collections grows larger Palpanas [2015, 2017], Palpanas and Beckmann [2019], recently proposed State-of-the-Art (SotA) data series indexes exploit parallelism through the use of multiple threads and the SIMD capabilities of modern hardware Peng et al. [2018, 2020a], Echihabi et al. [2022]. However, the unprecedented growth in size that data series collections experience nowadays renders even SotA parallel data series indexes inadequate Palpanas [2017], Echihabi et al. [2018, 2019], Palpanas and Beckmann [2019], Bagnall et al. [2019], Gogolou et al. [2019], Echihabi et al. [2023], mainly due to the large number of random disk page reads required for exact query answering Echihabi et al. [2018]. To address these issues, fast in-memory solutions have been proposed Peng et al. [2020b, 2021a,b]. However, these solutions do not take advantage of distributed systems, and hence, are limited by the amount of memory of a single machine. This is the limitation we address, thus allowing the above SotA solutions to handle datasets that far exceed the main memory capacity of any single node.
Challenges. In the context of data series similarity search, exact query answering is very demanding in terms of resources, even when using a data series index: we need to either prune, or visit every leaf of the index. Previous works Echihabi et al. [2018], Gogolou et al. [2019], though, have shown that pruning is not very effective, especially for some hard datasets.
The main goal we need to satisfy is (naturally) scalability. That is, increasing the available hardware resources (e.g., the number of nodes) should decrease the time cost, ideally by an equivalent amount, or should enable processing an equivalent amount of additional data (at about the same time cost). To meet this goal, we need to ensure that all nodes of the distributed system contribute equally to completing the work, during the entire duration of the execution. In turn, this translates to producing effective solutions to the following two problems: (i) query scheduling: given a query workload, decide which queries to assign to each system node; and (ii) load-balancing: devise mechanisms so that system nodes that have finished their work can help other system nodes finish theirs.
The challenges in this context are the following. First, to achieve effective query scheduling, we need to come up with mechanisms for estimating the execution cost of data series similarity search queries, which do not currently exist. Second, since such cost estimations cannot always be accurate, a load-balancing scheme becomes necessary; this, in turn, means that we need to replicate data to make such a mechanism viable, as moving big volumes of data series around would be prohibitively expensive. Data replication works against data scalability and increases index creation time, but results in better query answering times, thus leading to interesting trade-offs through which an effective solution should navigate. Third, along with all the above considerations, we also need to ensure that our solutions maintain their good parallelization properties for efficient execution on the multi-core CPUs inside each system node, and also achieve high pruning power during query answering.
Our Approach.We propose a novel distributed data-series (DS) indexing and processing framework, called Odyssey, that efficiently addresses the high scalability objective by taking advantage of the full computational capacity of the computing platform.
To come up with an appropriate scheduling scheme for Odyssey, we performed a query analysis, which shows a correlation between the total execution time of a query and a parameter of the category of single-node data series indexes we consider (namely, the initial best-so-far answer; see Section 3.1). This analysis drove the design of efficient scheduling schemes, which generate an execution time prediction for each query of the input query batch.
To achieve load balancing (LB) even in settings where predictions may not be accurate, Odyssey provides an LB mechanism, which ensures that nodes sitting idle can take away (or steal) work from other nodes which still have work to do (provided that these nodes store similar data). Combining Odyssey's scheduler with this LB technique results in very good performance and high scalability for all query batches we experiment with.
Ensuring data scalability and, at the same time, good performance for query answering are conflicting goals. A scheme where data are not replicated results in the lowest space overhead, but experiments show that this technique does not ensure the best performance during query answering, because without data replication Odyssey's LB mechanism cannot be used.
Odyssey manages to effectively reconcile these two conflicting goals by supporting a flexible partial replication scheme. This way, it navigates the fundamental trade-off between data scalability and good performance during query answering. The degree of replication is one of Odyssey's parameters; by specifying it appropriately, users can choose the time-space trade-off that best suits their application and setting. Experiments show that Odyssey achieves good performance even for small replication degrees.
Supporting the components for efficient distributed computation that Odyssey provides, on top of an index that exploits the computational power of a single node as efficiently as SotA parallel indexes Peng et al. [2020b, 2021a,b], was one more challenging task we undertook while designing Odyssey. A simple approach of using an instance of the SotA MESSI index Peng et al. [2021a] in each node did not result in good performance, mainly for two reasons. First, different data series queries may exhibit variable degrees of locality (revealed only at runtime), resulting in low pruning in some of the nodes, and thus, in severe load balancing problems and performance degradation. Second, supporting load-balancing on top of such a simple approach would require moving data around, which is often prohibitively expensive. Odyssey's single-node indexing scheme borrows some techniques from SotA indexes Peng et al. [2018, 2020a,b, 2021a,b], and couples those with new components and mechanisms, to achieve load balancing and come up with a scheme in which work from an overloaded node can be given away to idle nodes without paying the prohibitive cost of moving any data around.
Odyssey is innovative in several ways. First, it employs a pattern of parallelism different from all existing approaches for traversing the index tree to produce the set of data series that cannot be pruned. Second, it presents new implementations for populating and processing the data structures needed for efficient query answering. To achieve load balancing among the threads, it is critical to choose an appropriate threshold on the size of these data structures, and Odyssey proposes an effective mechanism for predicting a good threshold. Additionally, Odyssey provides efficient communication and book-keeping mechanisms that enable fast exchange of information among nodes, ensuring good pruning degrees in all of them.
Odyssey is up to 6.6x faster than its competitors, and more than 3.5x faster than its best competitor. Additionally, Odyssey's index creation scales perfectly with both the dataset size and the number of nodes. Moreover, Odyssey's best performing scheduling strategy is more than 2.5x faster than its initial one.
Contributions. The main contributions of the paper are as follows: • We describe Odyssey, a scalable framework for distributed data series similarity search in clusters with multi-core servers. This makes our approach the first customized data series solution that exploits parallelization both inside and across system nodes.
• We develop a scheduling algorithm for assigning queries to the nodes of the cluster, which tries to balance the workload across the nodes by computing a (good-enough) estimation of the execution time of each query.
• We present a novel exact search algorithm that supports work-stealing between nodes that share the same index (full replication).Thus, our approach leads to high performance, even when the work is not (or cannot be) equally distributed over the nodes of the cluster.We further extend our solution to work even when only a part of the index is shared among nodes (partial replication).
• Our approach supports different replication degrees among the nodes, allowing users to navigate the entire spectrum of solutions, trading space (replication degree) for speed (query answering time).
• We also present a density-aware data partitioning method that can efficiently partition data in a way that improves the work balancing capabilities of our approach.
• Finally, we conduct an experimental evaluation (code and data available online cod [2022]) with a wide range of configurations, using real and synthetic datasets. The evaluation demonstrates the efficiency of Odyssey, which exhibits an almost linear scale-up and up to 6.6x faster exact query answering times than the competitors.

Preliminaries and Related Work
Data Series. A data series, denoted as S = {p_1, ..., p_n}, is a sequence of points, where each point p_i = (u_i, t_i), 1 ≤ i ≤ n, is a pair of a real value u_i and the position t_i of p_i in the sequence; n is the size (or dimensionality) of the sequence. When t_i represents time, we talk about time series. In several cases, the t_i are omitted, e.g., when they are equally spaced, or only play the role of an index for the values u_i Echihabi et al. [2018]; for simplicity, we omit them as well.
iSAX Summary. The iSAX summary Shieh and Keogh [2008] of a data series splits the x-axis into equal segments and represents each segment with the mean value of the points of the data series that it contains (see Figure 1). It then partitions the y-axis into regions, of sizes determined by the normal distribution, and represents each region using a number of bits (cardinality). The number of bits can be different for each region, and this enables the creation of a hierarchical index tree (iSAX-based index tree Palpanas [2020]; see Figure 1).
Similarity Search. Given a collection of data series C and an input data series S, called the query, similarity search is the task of finding the data series in C which are most similar to S. We focus on finding a single best answer, known as the 1-NN problem, and on the Euclidean Distance (ED). The Euclidean distance (or real distance) between two series T = {t_1, ..., t_n} and S = {s_1, ..., s_n} is defined as $ED(T, S) = \sqrt{\sum_{i=1}^{n} (t_i - s_i)^2}$. We call the distance between the iSAX summaries of T and S the lower-bound distance; the lower-bound distance between any two data series is always smaller than or equal to the real distance between them.
Parallel Data Series Indexes. SotA parallel data series indexes Peng et al. [2020b, 2021a,b], Echihabi et al. [2022], Azizi et al. [2023] exploit multiple threads (and SIMD) to create an index tree and answer queries on top of this tree. They usually comprise two main phases: index tree construction and query answering. In the index tree construction phase, they first calculate, in parallel, the summarizations of all data series in the collection. If the summarizations are iSAX summaries, we talk about iSAX-based DS indexing. To achieve a good degree of locality and low synchronization overheads, they store these summaries into a set of summarization buffers: data series that have similar summarizations are placed into the same buffer. Subsequently, the data series of each of these buffers are stored into the corresponding subtree of the index tree they construct. These design decisions allow them to build the index tree in an almost embarrassingly parallel way (thus, without incurring synchronization overheads), and to achieve locality in accessing the data during tree construction. They thus respect principles that are crucial for achieving good performance when designing a parallel index.
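For illustration, the following C sketch shows how an iSAX-style summary can be computed (a minimal sketch, not the code of any of the indexes cited above; the segment count, the 3-bit cardinality, and all identifiers are our own assumptions): each series is split into equal-length segments, each segment is represented by its mean, and each mean is quantized using breakpoints that divide N(0,1) into equi-probable regions.

#include <stddef.h>

/* Minimal sketch of iSAX-style summarization (illustrative only).
 * A series of length n is split into SEGMENTS equal parts; each part is
 * represented by its mean (the PAA step), and each mean is quantized into
 * one of 2^BITS symbols using breakpoints of the standard normal N(0,1). */
#define SEGMENTS 16
#define BITS     3    /* 8 regions; a hypothetical cardinality */

/* Breakpoints dividing N(0,1) into 8 equi-probable regions. */
static const float breakpoints[7] = {
    -1.1503f, -0.6745f, -0.3186f, 0.0f, 0.3186f, 0.6745f, 1.1503f };

void isax_summary(const float *series, size_t n, unsigned char *word)
{
    size_t seg_len = n / SEGMENTS;      /* assumes n % SEGMENTS == 0 */
    for (size_t s = 0; s < SEGMENTS; s++) {
        float mean = 0.0f;
        for (size_t i = 0; i < seg_len; i++)
            mean += series[s * seg_len + i];
        mean /= (float)seg_len;
        unsigned char sym = 0;          /* region index of the mean */
        while (sym < (1 << BITS) - 1 && mean > breakpoints[sym])
            sym++;
        word[s] = sym;                  /* one iSAX symbol per segment */
    }
}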
To answer a query, these indexes first calculate the summarization of the query. Subsequently, they traverse the index tree to find the most promising data series based on the iSAX summary lower-bound distances. The distance of these data series from the query series is stored in a variable called best-so-far (BSF), and serves as an initial approximate answer to the active query. Then, the BSF is used to prune data series from the initial collection: a data series S is pruned when the lower-bound distance between S and the query is higher than the current value of the BSF. This process outputs a hopefully small subset of the initial DS collection, containing series that need to be further examined. These series are often stored in (one or more) priority queues Peng et al. [2020b, 2021a,b]. Multiple threads concurrently process the elements of the priority queues, calculating real distances (if needed) and updating the BSF each time a new minimum is met (see Figure 2). Once this process completes, the BSF contains the distance to the answer.
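To make the pruning step concrete, here is a minimal C sketch of processing one priority queue against the current BSF (illustrative; PQueue, pq_pop and the element layout are hypothetical, not the actual structures of the indexes cited above). Since elements are popped in increasing lower-bound order, the loop can stop at the first element whose lower bound exceeds the BSF.

#include <math.h>
#include <stdbool.h>
#include <stddef.h>

#define SERIES_LEN 256   /* series length; hypothetical */

typedef struct { float lower_bound; const float *series; } PQElem;
typedef struct PQueue PQueue;          /* opaque queue type (hypothetical) */
bool pq_pop(PQueue *pq, PQElem *out);  /* pops the min lower-bound element */

/* Euclidean (real) distance between two series of length n. */
static float real_dist(const float *t, const float *s, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = t[i] - s[i];
        sum += d * d;
    }
    return sqrtf(sum);
}

/* Process one priority queue; returns the (possibly improved) BSF. */
float process_priority_queue(PQueue *pq, const float *query, float bsf)
{
    PQElem e;
    while (pq_pop(pq, &e)) {
        if (e.lower_bound >= bsf)   /* all remaining elements are pruned */
            break;
        float d = real_dist(e.series, query, SERIES_LEN);
        if (d < bsf)
            bsf = d;                /* new best-so-far */
    }
    return bsf;
}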
Multi-node Systems and Query Processing. The system consists of a number of asynchronous nodes which communicate by exchanging messages. Each node is a multi-core machine, capable of supporting multiple threads (and possibly SIMD computation). Threads communicate by accessing shared variables. A shared variable can be atomically read and written. Stronger primitives, such as Fetch&Add, may also be provided. Fetch&Add(V, val) atomically adds the value val to the current value of variable V and returns the value that V had before this update.
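As an illustration, the Fetch&Add primitive maps directly onto the atomic built-ins available in C compilers; the sketch below (a minimal example using the GCC/Clang __atomic intrinsics, with hypothetical names) shows how worker threads can use it to claim distinct work items.

#include <stdint.h>

/* Fetch&Add(V, val): atomically add val to V and return V's previous
 * value. With this primitive, concurrent threads can claim distinct
 * work items without locks. */
static inline uint64_t fetch_and_add(uint64_t *v, uint64_t val)
{
    return __atomic_fetch_add(v, val, __ATOMIC_SEQ_CST);
}

/* Usage: each thread obtains a unique index into a shared work array. */
uint64_t claim_next(uint64_t *counter)
{
    return fetch_and_add(counter, 1);
}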
An arbitrarily large batch of queries is provided to the system as input. The goal is to utilize the system's computational power to execute these queries in a way that minimizes the makespan, i.e., the time that elapses from the moment any node starts processing a query of the batch until the first point at which all nodes have completed their computation.
Our techniques can easily be adjusted to work with queries that arrive in the system dynamically.
The data series in the initial collection can be stored in all nodes (full replication), or may be scattered to the different nodes so that nodes store disjoint subsets of the data (no replication).A partial replication scheme is also possible, where nodes store subsets of the data which are not necessarily pairwise disjoint (e.g., more than one node may store the same subset of data series).A data partitioning mechanism determines how to split and distribute the data of the initial data-series collection to nodes.
Query scheduling algorithms aim to schedule the input queries to nodes in a way that each node has approximately the same amount of work to do. Considering full replication, a Static Query Scheduler (SQS) partitions the sequence of queries into N subsequences, and each node gets one of these subsequences to answer. A Dynamic Query Scheduler (DQS) employs a coordinator node, and has the other nodes requesting queries to execute from the coordinator. The coordinator may serve requests by assigning the next unprocessed query to a worker when it receives its request, or it may preprocess the sequence of queries (e.g., by re-arranging them based on some property) before it starts assigning queries to nodes. To avoid losing computational power, the coordinator can answer queries itself between serving requests from other nodes.

Related Work
Data series similarity search queries require the use of specialized index structures in order to be executed fast on very large collections of data sequences.In general, data series indexes operate by pruning the search space based on the summarizations of the series and corresponding lower bounds, and only use the raw data of the series in order to filter out the false positives.
Data Series Indexes. Agrawal et al. [1993] presented the first work that argued for the use of a spatial indexing structure for indexing data sequences, based on the R-Tree Sellis et al. [1987]; it was later optimized Rafiei and Mendelzon [1998]. Various indexes, specific to data sequences, have been proposed in the literature Echihabi et al. [2021a]. DSTree Wang et al. [2013] is an index based on the APCA summarization Keogh et al. [2001]; it can adaptively perform split operations by increasing the detail of APCA as needed. The iSAX index Shieh and Keogh [2008] is based on the SAX summarization and its extension, iSAX. In this case, the data series summarization is bitwise, leading to a concise representation and overall index. Several other iSAX-based indexes have been proposed in the literature Camerra et al. [2014], Zoumpatianos et al. [2014, 2016], Linardi and Palpanas [2019, 2020], Palpanas [2020], Wang and Palpanas [2021], Wang et al. [2023b]. These indexes are among the SotA solutions in this area Echihabi et al. [2018, 2019].
Distributed Data Series Indexes. KV-Match Wu et al. [2019] and its improvement, L-Match Feng et al. [2020], are index structures that can support similarity search. These indexes can be implemented on top of Apache HBase, and operate in a distributed fashion within Apache Spark. We note that these solutions only support subsequence similarity search, and not whole-matching Echihabi et al. [2018], which is the focus of our paper. TARDIS Zhang et al. [2019] is an Apache Spark system for similarity search. It supports approximate queries, as well as exact match queries, where we want to know whether or not the query appears exactly the same within the dataset. This query type is much easier than the exact queries we consider in our work, and cannot be efficiently transformed to exact querying. Finally, DPiSAX Yagoubi et al. [2017, 2020] is a distributed solution for data series similarity search, developed for Apache Spark using Scala. It was designed for answering batches of approximate search queries, but also supports exact search. DPiSAX exploits the iSAX summaries of a small sample of the dataset, in order to distribute the data to the nodes equally. Then, an iSAX index is built in each node on the local data, and is used to perform query answering. In order to produce the exact search results, all nodes need to send their partial results to the coordinator, which merges them and produces the final, exact answer. Note that DPiSAX was not explicitly designed for intra-node parallelization, but it is the only distributed data series index in the literature that supports exact search.

The Odyssey Framework
We start with a high-level overview of the Odyssey flowchart, which comprises five stages (see Figure 3).
In the first stage, a coordinator node partitions the raw data-series collection into as many chunks as the number of system nodes, and assigns a chunk to each node (including itself). (Section 3.4 details Odyssey's partitioning schemes.) In the second stage, each node (i) loads its chunk of data in memory, (ii) computes the iSAX summaries of its series and stores them into a number of summarization buffers, for achieving locality, and (iii) builds its index tree. To enhance performance at query answering, Odyssey employs data replication. It forms groups of nodes (replication groups, described in Section 3.3), where all nodes of each group store the same chunk of data. Each replication group has a coordinator node, called the group coordinator, which schedules queries to the group's nodes. A batch of queries (e.g., originating from a k-NN classification task) is submitted to all group coordinators (as different groups store different data chunks). In the third stage, the group coordinators start by estimating the execution time of each query, then sort the queries in descending order of estimated execution time, and dynamically schedule them to the group's nodes (Section 3.1 describes query scheduling). In the fourth stage, each node processes the queries assigned to it. It first calculates an initial BSF, and then prunes the index tree using this BSF, populating the priority queues with leaves that cannot be pruned. Finally, it processes the elements of the priority queues to find the best local answer (corresponding to its data chunk). In this stage, Odyssey supports BSF-sharing and work-stealing (detailed in Section 3.2). In the last stage, the coordinator node collects the local answers from the group coordinators, and produces the final answers.

Query Scheduling
To be answered correctly, a query should be forwarded to at least one set of system nodes that collectively store all the data; we call such sets node clusters in Section 3.3. Thus, in the no-replication case, this set contains all system nodes, and a scheduling algorithm should forward every query to all nodes. Other replication settings (and especially full replication) are more interesting, as they enable the utilization of different scheduling techniques.
To come up with Odyssey's scheduler, we experimented with a collection of scheduling techniques, including the simple static and dynamic schemes (SQS and DQS) for full replication settings, discussed in Section 2. Unfortunately, these schemes suffer from severe load imbalance problems for many categories of query batches. For the static case, consider, for example, a query sequence which consists of progressively more difficult queries (i.e., each query requires less time to run than the next one). SQS will assign easy queries to the first system nodes, while the last nodes will get more work to do. The dynamic method (DQS) may also result in load imbalances: even in simple cases where, e.g., a query batch includes a single difficult query at the end, most nodes may be sitting idle while a single node is running the difficult query. This may significantly degrade performance.
Some of these load imbalances could be avoided if we knew the execution time of each query. Recent work Gogolou et al. [2020], Echihabi et al. [2023] illustrated that there exists a correlation between the initial BSF and the number of vertices visited in a single-node index tree. We performed a corresponding query analysis, which showed that similarity search queries for which the initial BSF is high tend to also have high execution times. In this work, we use a linear regression model (other prediction schemes can be used, as well) to produce execution time estimates for each query. An example of this outcome is shown in Figure 4 (for Seismic; we follow the same process for the other datasets).
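To illustrate the idea, the sketch below (a minimal example, not Odyssey's actual code; all identifiers are hypothetical) fits a least-squares line to (initial BSF, execution time) pairs collected from past queries, and uses it to estimate the cost of a new query from its initial BSF.

#include <stddef.h>

/* Fit time ~ a * bsf + b by ordinary least squares over past queries. */
typedef struct { double a, b; } CostModel;

CostModel fit_cost_model(const double *bsf, const double *time, size_t n)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        sx += bsf[i]; sy += time[i];
        sxx += bsf[i] * bsf[i]; sxy += bsf[i] * time[i];
    }
    double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double b = (sy - a * sx) / n;
    return (CostModel){ a, b };
}

static inline double predict_cost(CostModel m, double initial_bsf)
{
    return m.a * initial_bsf + m.b;   /* estimated execution time */
}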
These observations led us to design two scheduling algorithms. The first, static prediction-based scheduling, statically allocates the queries to nodes based on their estimations. Each node maintains a load variable, which stores the sum of the estimations of the queries that are assigned to it. The algorithm uses a greedy approach to assign queries to nodes so that load balancing is achieved. There are two variations of the algorithm: the first (unsorted) schedules the queries using their order in the sequence, and the second (sorted) first sorts the sequence in order of decreasing execution time estimations. The second scheduling algorithm, called dynamic prediction-based scheduling, is an enhanced version of DQS, where queries are assigned to nodes after sorting the entire query batch based on estimations (in decreasing order).
Consider a system of two nodes, sn_1 and sn_2, and let Q = {q_1, q_2, q_3, q_4, q_5} be a query batch to execute. Assume that ES = {100, 50, 200, 250, 80} is the set of the estimated execution times, where the i-th element of ES is the estimated execution time of q_i, 1 ≤ i ≤ 5. Unsorted static prediction-based scheduling, with load variables l_1 and l_2 (for sn_1 and sn_2, respectively), proceeds as follows: q_1 is assigned to sn_1 (so, l_1 = 100), and q_2 is assigned to sn_2 (so, l_2 = 50). Since l_2 < l_1, q_3 is assigned to sn_2 (thus, l_2 = 250). Following the same strategy, q_4 is assigned to sn_1, and q_5 is assigned to sn_2. So, sn_1 receives {q_1, q_4} and sn_2 receives {q_2, q_3, q_5}. In sorted static prediction-based scheduling, the queries of Q are first sorted in decreasing order of their estimated times, resulting in Q' = {q_4, q_3, q_1, q_5, q_2} (which corresponds to ES' = {250, 200, 100, 80, 50}). After applying the static prediction-based scheduling algorithm (as above) on these sets, {q_4, q_5} is assigned to sn_1 and {q_3, q_1, q_2} is assigned to sn_2. Finally, dynamic prediction-based scheduling also sorts the queries of Q. In this case, q_4 is assigned to sn_1 and q_3 to sn_2, while the rest of the queries are dynamically assigned to nodes (in order) upon request (thus, based on actual execution times).
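The greedy assignment underlying both static variants can be sketched in a few lines of C (illustrative; all identifiers are hypothetical). On the ES values of the example above, it reproduces exactly the assignments described; for the sorted variant, est[] is simply pre-sorted in decreasing order (together with the query ids).

/* Sketch of static prediction-based scheduling: each query is greedily
 * assigned to the node with the currently smallest accumulated load. */
void static_predict_schedule(const double *est, int n_queries,
                             double *load, int *owner, int n_nodes)
{
    for (int n = 0; n < n_nodes; n++)
        load[n] = 0.0;
    for (int q = 0; q < n_queries; q++) {
        int best = 0;                 /* node with the minimum load so far */
        for (int n = 1; n < n_nodes; n++)
            if (load[n] < load[best])
                best = n;
        owner[q] = best;              /* assign q to that node */
        load[best] += est[q];         /* account for its estimated cost */
    }
}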
The Odyssey framework supports all of the above scheduling algorithms. The Odyssey index utilizes dynamic prediction-based scheduling, which turned out to be the best approach in most cases.

Load Balancing
Odyssey provides a load balancing (LB) mechanism, which can be applied on top of any of the scheduling schemes described in Section 3.1. Specifically, idle nodes can steal work from other nodes which still have work to do (provided that they store similar data).
This is necessary because predictions may not always be accurate, or the query batch may be produced dynamically at run time, in which case sorting the entire query batch is not possible. It is also necessary for achieving high scalability.
As the number of utilized nodes increases, the number of batch queries that each node has to process becomes smaller and smaller. Thus, problematic scenarios such as those described in Section 3.1 may appear, where just one or a few nodes work on difficult queries, while others sit idle.
Overview of our approach. We performed a number of experiments to get a breakdown of the query answering time. This breakdown illustrated that the biggest part of the query answering time goes to processing the priority queues. We thus focus on designing a method that allows nodes to steal work during the execution of that phase. For simplicity, we first focus on the full-replication case, where the initial collection of data is available in every node; partial replication is discussed in Section 3.3.
A simple work-stealing scheme Blumofe and Leiserson [1999], Cil [1996] would not work, mainly because moving data (stored in priority queues) from one node to another is expensive and should be avoided. Thus, the main challenge in our setting is to take work away from one node and assign it to another without ever moving any data around.
Odyssey's load-balancing mechanism works as follows. An idle system node sn randomly chooses another node sn′ and sends it a steal request. If sn′ still has work to do, it chooses a number of priority queues to give away to sn. To avoid paying the cost of transferring data around, Odyssey employs a technique that informs sn on how to locally build the priority queues to work on, based on its own index. Node sn traverses the identified part of its index tree and re-constructs these priority queues. As the time to create the priority queues is relatively small in comparison to that for processing them, this scheme works quite well.
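The crucial point is that a steal response carries only identifiers and query state, never raw series. A minimal sketch of such a message (a hypothetical layout, not Odyssey's actual wire format; the actual protocol is given in Algorithms 3 and 4) could look as follows.

/* Sketch of a steal response (illustrative, hypothetical layout).
 * Instead of shipping priority-queue contents, the victim sends only
 * the ids of the stolen RS-batches plus the query state; the stealer
 * rebuilds the corresponding priority queues from its OWN index copy. */
#define MAX_STEAL 4             /* N_send in the paper */

typedef struct {
    int   query_id;             /* query being answered */
    float bsf;                  /* victim's current best-so-far */
    int   num_batches;          /* how many RS-batches are given away */
    int   batch_ids[MAX_STEAL]; /* ids of root-subtree batches to rebuild */
} StealResponse;                /* a few bytes vs. gigabytes of raw series */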
Note that the approaches followed by existing SotA indexes Peng et al. [2020b, 2021a,b] for creating and processing the priority queues are too naive to support work-stealing without moving any data around. In Odyssey, we propose (in Section 3.2.1) a new implementation of a single-node, multi-threaded index, which respects the good design principles described for parallel indexes in Section 2, while simultaneously coping with the problem mentioned above.

Single-Node Query Answering
Consider any system node sn, and assume that an iSAX-based index tree has been created and an initial value for the BSF has been computed in sn. An outline of the single-node query answering algorithm of Odyssey is depicted in Figure 5. The pseudocode is provided in Algorithms 1 and 2.
Description. Node sn executes the queries in the query batch assigned to it one by one (Algorithm 1). For each such query Q, it creates a number of search workers to execute it (line 8). As soon as all queries in sn's query batch have been processed, sn informs the other nodes that it has completed (line 12). Then, it tries to help other active nodes by executing PerformWorkStealing (line 13). Each node allocates a thread to play the role of the work-stealing manager (line 6); this thread simply processes all work-stealing requests that the node receives (Algorithm 3).
(Work-stealing is discussed in Section 3.2.2.) The query answering algorithm in sn splits the tree into root subtree (RS) batches, i.e., sets of consecutive root subtrees (see Figure 5), and allocates a number of threads to work on them. Each thread begins by getting an RS-batch to work on, using Fetch&Add (Algorithm 2). Then, the thread executes the ProcessBatch routine, which traverses the tree recursively and inserts the leaves that cannot be pruned into one of a set of priority queues that belong to the RS-batch.
For every RS-batch, there exists one active priority queue at each point in time. When the size of this priority queue surpasses a threshold, the queue is abandoned and another one is initialized for the RS-batch. As soon as an idle thread th discovers that all RS-batches have been assigned for processing, it tries to help some other, still active, thread th′ to complete processing its assigned RS-batch (lines 11-14, Algorithm 2). To reduce the synchronization cost, there is a threshold, HelpTH, on the number of threads that help on each RS-batch (line 12). This phase ends when the subtrees of all RS-batches have been traversed and all priority queues have been populated. Experiments showed that we get the best performance when the number of RS-batches, N_sb, equals the number of worker threads.
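A minimal sketch of how worker threads might claim RS-batches with Fetch&Add during the tree traversal phase (illustrative; the names are hypothetical, and the helping logic and the priority queue threshold are omitted):

#include <stdint.h>

#define N_SB 64                       /* number of RS-batches (= #threads) */

static uint64_t next_batch = 0;       /* shared batch counter */

void process_batch(int batch_id);     /* traverses the batch's subtrees,
                                         filling its priority queues */

/* Code executed by every worker thread during tree traversal. */
void tree_traversal_worker(void)
{
    for (;;) {
        uint64_t b = __atomic_fetch_add(&next_batch, 1, __ATOMIC_SEQ_CST);
        if (b >= N_SB)                /* all RS-batches claimed */
            break;                    /* (real code then helps active threads) */
        process_batch((int)b);
    }
}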
As soon as this tree traversal phase is over, we have a set of priority queues for each RS-batch, stored in an array. For performance reasons, this array is sorted by the priority of the top element of each priority queue; this comprises the priority queue preprocessing phase (lines 15-21). This way, the algorithm processes first the priority queues with the smallest lower-bound distances to the query. These queues contain data series that are more likely to be in closer real distance to the query, thus enabling further pruning.
Then, the priority queue processing phase starts (lines 23-29). Every thread gets a priority queue from the PQueues array to process (using Fetch&Add). Routine ProcessPriorityQueue processes those data series stored in the priority queue which cannot be pruned. Whenever a lower real distance between any of these series and the query series is calculated, the BSF is updated to contain this distance. This improved BSF is submitted to all nodes of the system. Finally, all answers are transmitted to the coordinator node, and the globally smallest value of the BSF is the response to the query. The threshold on the size of the active priority queue of each RS-batch is predicted per query using the following parameterized sigmoid function: $f(x) = d + \frac{M - m}{1 + e^{-b(x-c)}}$, where M ∈ [0, 1], m ≤ M, b, c ∈ R*, and d ∈ R are the parameters of the sigmoid function (Figure 6a). The final threshold value for each query is the median value estimation produced by the sigmoid function, divided by a factor (e.g., for Seismic this factor has to be 16, based on the diagram shown in Figure 6b).
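Assuming the fitted curve is a standard generalized sigmoid, as our reading of the parameters above suggests (the exact functional form and its parameter values are fitted per dataset; all identifiers here are hypothetical), the per-query threshold computation reduces to a few lines:

#include <math.h>

/* Generalized sigmoid used to estimate the priority-queue size
 * threshold TH for a query (a sketch; parameters M, m, b, c, d are
 * fitted per dataset, and 'factor' is a dataset-specific divisor,
 * e.g. 16 for Seismic). */
typedef struct { double M, m, b, c, d; } Sigmoid;

double predict_threshold(Sigmoid s, double x, double factor)
{
    double estimate = s.d + (s.M - s.m) / (1.0 + exp(-s.b * (x - s.c)));
    return estimate / factor;
}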
Experiments show that, after the tree traversal phase is completed, we end up with a set of RS-batches that have a number of priority queues, most of which have the same size. This results in load balancing among the threads when processing the priority queues.

Work-Stealing Algorithm
If a system node sn becomes idle, it initiates the work-stealing protocol (Algorithm 4, lines 15-17). It randomly chooses a system node sn′ from the set of those nodes that sn knows to be still active, and sends a steal request to it.
A thread in each node acts as the work-stealing manager (Algorithm 3). As soon as the work-stealing manager of sn′ receives the request, it tries to give away work to sn (lines 2-4 of Algorithm 3).
Earlier work has demonstrated that a large amount of the query answering execution time is devoted to verifying that there is no better answer, after the correct answer has already been processed Gogolou et al. [2019, 2020], Echihabi et al. [2023]. Based on these findings, Odyssey's work-stealing mechanism chooses to give away an RS-batch B which satisfies the Take-Away Property, namely, that B has not yet been stolen and its first priority queue is located at the rightmost possible index of the PQueue array; B is then marked as stolen. If more than one batch is to be given away, this process is applied repeatedly to choose additional RS-batches. Recall that the PQueue array is sorted by the priority of the top element of each priority queue. Thus, by giving away batches in this way, sn′ assigns to helpers priority queues that may still contain work. Additionally, it gives away the RS-batches that have the highest probability of being unprocessed. Throughout the process, the current BSF is shared among the nodes every time it is updated, as a helper may steal a priority queue that contains a better answer (or the owner may compute a better BSF later).
The number, N_send, of RS-batches that a node gives away during stealing affects performance. Ideally, we would like to give away a number of RS-batches that, on the one hand, enables the stealing node to do a noticeable amount of work, but on the other hand, does not result in higher query answering times. Experiments show that fixing N_send to 4 was the best choice (so N_send = 4 in Odyssey).

Data Replication
Odyssey aims at ensuring data scalability and, at the same time, good performance for query answering. Optimal data scalability requires following a no-replication approach, but experiments show that the best query answering performance is achieved in fully replicated settings. Odyssey manages to effectively navigate this trade-off between data scalability and good query answering performance by providing a flexible partial replication scheme.
The idea is to split the set of system nodes into clusters, where each cluster collectively stores the entire dataset (see Figure 7). Each cluster node stores (and indexes) a chunk of the dataset; the chunks stored in the nodes of a cluster are mutually disjoint. A replication group is a group of nodes such that each node stores the same dataset as every other node in the group. (We experimented with replication groups of the same size, but Odyssey can operate with replication groups of different sizes, as well.) The nodes of a replication group build their iSAX indexes from the same data chunk. Thus, inside every replication group, we can apply the scheduling and load-balancing schemes described in Sections 3.1 and 3.2, respectively. We call the number of clusters the replication degree of the system.
Consider a system with N_sn system nodes. We call PARTIAL-k, k ∈ {1, 2, 4, ..., N_sn}, a replication setting with k replication groups and N_sn/k clusters. Observe that PARTIAL-N_sn, or EQUALLY-SPLIT, corresponds to no replication (each node stores a disjoint chunk of the dataset), and PARTIAL-1, or FULL, corresponds to full replication (each node stores the full dataset). Note that Odyssey's data replication scheme supports 1 + log N_sn different replication degrees. Smaller replication degrees lead to smaller space overheads (and thus better data scalability); thus, Odyssey's data replication scheme allows us to tackle memory limitation problems. Moreover, more replication groups lead to scalability in index creation.
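To make the PARTIAL-k layout concrete, the following sketch (a hypothetical mapping; Odyssey may place nodes differently) assigns each node id to a replication group and a cluster, assuming N_sn and k are powers of two.

/* Sketch of the PARTIAL-k layout (hypothetical mapping). With N_sn nodes
 * and k replication groups, the dataset is split into k chunks; group g
 * stores chunk g, and each of the N_sn/k clusters takes one node from
 * every group, so a cluster collectively stores the whole dataset. */
typedef struct { int group; int cluster; } Placement;

Placement place_node(int node_id, int n_sn, int k)
{
    int group_size = n_sn / k;          /* nodes per replication group */
    Placement p;
    p.group   = node_id / group_size;   /* which chunk the node indexes */
    p.cluster = node_id % group_size;   /* which cluster it belongs to  */
    return p;
}
/* Example: n_sn = 8, k = 2 (PARTIAL-2): nodes 0-3 form group 0 (chunk 0),
 * nodes 4-7 form group 1 (chunk 1); cluster c = {c, c+4} covers all data. */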

Data Partitioning
The Odyssey framework supports more than one partitioning scheme. Under EQUALLY-SPLIT, each system node is assigned a distinct chunk and builds the corresponding index, resulting in a scheme where each node keeps a local index on its own part of the data. Queries are forwarded to all nodes; each node produces an answer based on its local index and data, and the minimum among these answers is the final answer. Before distributing the data, random shuffling (RS) can be applied to randomly rearrange the series of the initial collection.
To answer a query batch using partial data replication (or no replication), each query is sent to every replication group. Each node answers queries using its local data, and the partial answers for each query are gathered at the end to find the smallest answer. Very often, for real data, the close answers to a query are located in a small part of the dataset. The group that holds these data will compute a good initial answer, prune more, and answer the query really fast, while the other groups will not necessarily compute good initial BSF values; thus, they will have more work to do, leading to imbalances. For this reason, we enhance our distributed index with a book-keeping method that supports BSF sharing. When a node is processing a query and finds an improved value for the BSF, it shares this value through a common BSF-Sharing channel (as illustrated in Figure 7). Every node periodically checks this channel to see if an answer for a query has arrived. Because this process runs in parallel, a node may receive a better answer for a query that it will encounter later on. Odyssey's book-keeping method solves such synchronization problems: each node holds an array that stores the improvements received from the channel for the BSF of each query, and before answering a query it checks the data held in this array. Thus, each node has the best answers extracted from all nodes, and our experimental evaluation shows that the use of this method is critical for performance.
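A minimal sketch of this book-keeping array (illustrative; all names are hypothetical): improvements read from the BSF-sharing channel are merged into a per-query array, which is consulted before a query is processed locally, so that improvements arriving for queries not yet started are not lost.

#include <float.h>

#define MAX_QUERIES 1024

/* Best BSF received so far for each query in the batch. */
static float shared_bsf[MAX_QUERIES];

void init_shared_bsf(void)
{
    for (int i = 0; i < MAX_QUERIES; i++)
        shared_bsf[i] = FLT_MAX;
}

/* Called when a (query_id, bsf) pair arrives on the BSF-sharing channel. */
void on_bsf_received(int query_id, float bsf)
{
    if (bsf < shared_bsf[query_id])
        shared_bsf[query_id] = bsf;
}

/* Called when query_id is about to be processed locally: start from the
 * best of the locally computed BSF and the best value received so far. */
float initial_bsf_for(int query_id, float locally_computed_bsf)
{
    float s = shared_bsf[query_id];
    return (s < locally_computed_bsf) ? s : locally_computed_bsf;
}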
In addition to these simple techniques, Odyssey also provides a sophisticated data partitioning scheme, based on preprocessing of the initial data series collection, which provides a density-aware distribution of the data among the available nodes.The required preprocessing incurs some time overhead.However, it occurs only once for answering as many queries as needed, and thus, as the number of queries to process increases, this overhead is amortized.We describe this scheme in Section 3.4.1.

DENSITY-AWARE Data Partitioning
We observe that a good partitioning strategy should not assign all similar series to the same system node. In such a case, we risk creating work imbalance, for the following reason. Assume that we need to answer a similarity search query for which all candidate series from the dataset that are similar to the query are stored in one of the system nodes, while all other nodes store series that are not similar to the query. Then, during query answering, the node with the similar series will need to perform many (lower-bound and real distance) computations in order to determine which of the candidate series is the nearest neighbor to the query, with essentially little pruning (if at all). On the other hand, all the other nodes, which store dissimilar series, will be able to prune aggressively, and therefore, finish their part of the computations much faster.
The above observations led us to the design of the DENSITY-AWARE partitioning strategy, whose goal is to partition similar series across all system nodes, without incurring a high computational cost.This is achieved by exploiting Gray Code Gardner [1986] ordering for effectiveness (since it helps us split the similar series), and the summarization buffers of our index for efficiency (since we have to operate at the level of buffers, rather than individual series).
Figure 8 shows an example of partitioning the data series in the summarization buffers according to a simple strategy using binary code, and to a strategy based on Gray Code. In the former case, the buffers that end up in the same node contain similar series: their iSAX representations (the iSAX word of the buffer) are very close to one another; e.g., node 1 stores buffers "000" and "100", i.e., series whose iSAX summaries differ in only one bit. In the latter case, this problem is addressed. The Gray Code ordering places similar buffers close to one another (by definition, two neighboring buffers in this order differ in only one bit), so it is then easy to assign them to different system nodes in a round-robin fashion.
We depict the flowchart of the DENSITY-AWARE partitioning strategy in Figure 9. We start by computing the iSAX summaries of the data series collection, and assigning each summary to the corresponding summarization buffer. These buffers are ordered according to Gray Code, and then the actual data partitioning starts (using round-robin scheduling). We first partition the series inside the λ largest buffers; this is necessary, since oftentimes a small number of buffers contains an unusually large number of series (which we do not want to assign all to the same system node). Then, we partition the remaining buffers, and we check whether the partitioning is balanced. If it is not, we select the largest buffer of the largest node, and partition the series inside this buffer. Our experiments with several real datasets (omitted for brevity) showed that DENSITY-AWARE exhibits very stable behavior as we vary λ from a few hundred to several thousand. In this study, we use λ = 400.
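The core of this strategy, Gray-code ordering followed by a round-robin deal, can be sketched as follows (illustrative; the struct and function names are our own, and the handling of the λ largest buffers and of re-balancing is omitted).

#include <stdlib.h>

/* Rank of an iSAX word w in the Gray-code order: the binary value whose
 * Gray code is w, recovered by the standard prefix-XOR decoding. */
static unsigned gray_rank(unsigned w)
{
    unsigned b = 0;
    for (; w; w >>= 1)
        b ^= w;
    return b;
}

typedef struct { unsigned word; int buf_id; } Buf;

static int cmp_gray(const void *x, const void *y)
{
    unsigned rx = gray_rank(((const Buf *)x)->word);
    unsigned ry = gray_rank(((const Buf *)y)->word);
    return (rx > ry) - (rx < ry);
}

/* Sort buffers in Gray-code order (neighbors differ in one bit), then
 * deal them to nodes round-robin, so similar buffers land on different
 * nodes; node_of[i] receives the node assigned to buffer i. */
void partition_buffers(Buf *bufs, int *node_of, int n_buf, int n_nodes)
{
    qsort(bufs, n_buf, sizeof *bufs, cmp_gray);
    for (int i = 0; i < n_buf; i++)
        node_of[bufs[i].buf_id] = i % n_nodes;
}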

Extensions
We now discuss two extensions of Odyssey, in order to support k-NN search and the Dynamic Time Warping (DTW) distance.
k-NN Search. Extending Odyssey to support k-NN similarity search is straightforward: instead of computing a single BSF value, we simply need to keep track of the k smallest BSF values.
DTW Distance. We also extend Odyssey to perform similarity search using Dynamic Time Warping (DTW), which is an elastic distance measure Keogh and Ratanamahatana [2005]. Note that no changes are required in the index structure for this: the index we build can answer both Euclidean and DTW similarity search queries. Supporting DTW queries requires modifying only the query answering algorithm, using LB_Keogh Keogh and Ratanamahatana [2005], which is a tight lower bound of the DTW distance. We note that a lower bound for the DTW distance between the query and a candidate series can be computed by considering the distances between the corresponding points of the candidate series and the points of the LB_Keogh envelope of the query.
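For reference, the LB_Keogh lower bound has the standard formulation sketched below (a minimal C sketch; U and L denote the upper and lower envelopes of the query, precomputed for the chosen warping window). A candidate whose LB_Keogh distance to the query already exceeds the current BSF can be pruned without computing the much more expensive DTW.

#include <math.h>

/* Standard LB_Keogh lower bound for DTW: only the parts of the candidate
 * that fall outside the query's [L, U] envelope contribute to the sum. */
double lb_keogh(const double *cand, const double *U, const double *L, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (cand[i] > U[i]) {
            double d = cand[i] - U[i];
            sum += d * d;
        } else if (cand[i] < L[i]) {
            double d = L[i] - cand[i];
            sum += d * d;
        }
    }
    return sqrt(sum);
}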

Experimental Evaluation
Setup. Experiments were conducted on a cluster of 16 SR645 nodes, connected through an HDR 100 Infiniband network. Each node has 128 cores (with no hyper-threading) and 200GB of RAM (available to users, out of 256GB of physical memory). We note that Odyssey's query scheduling and work-stealing mechanisms can be used together only with the FULL or PARTIAL data distribution strategies, which provide some replication.
We compare Odyssey to: (i) DMESSI, where we run the MESSI index Peng et al. [2021a] independently in each system node; (ii) DMESSI-SW-BSF, where we extend the previous solution by enabling system-wide sharing of the BSF values; and (iii) DPISAX Yagoubi et al. [2020], where we implement (in C) the DPiSAX data partitioning strategy, and (for fair comparison) implement query answering in each node using MESSI.
Datasets. We evaluated Odyssey's strategies and algorithms using real and synthetic datasets of varying sizes (refer to Table 1). The synthetic data series, called Random, were generated as random-walks (i.e., cumulative sums) of steps that follow a Gaussian distribution (0,1). This type of data has been extensively used in the past Faloutsos et al. [1994], Camerra et al. [2014], Zoumpatianos et al. [2015, 2018], Echihabi et al. [2018, 2019], and models the distribution of stock market prices Faloutsos et al. [1994]. Our five real datasets come from the domains of seismology (Seismic), astronomy (Astro), deep learning (Deep), image processing (Sift), and information retrieval (Yan-TtI). Seismic contains seismic instrument recordings and consists of 100M data series of size 256 for Seismology with Artificial Intelligence [2018]. Astro represents celestial objects and consists of 100M data series of size 256 Soldi et al. [2014]. Deep Vision [2018] contains 1B vectors of size 96, extracted from the last layers of a convolutional neural network. Sift Jégou et al. [2011] is comprised of image descriptors, and Yandex Text-to-Image (Yan-TtI) Simhadri et al. [2022] contains 1B vectors that include image- and textual-embeddings in the same space; it represents typical cross-modal information retrieval tasks.
Evaluation Measures. During each experiment E, and for each node sn, we measure (i) the buffer time, required to calculate the iSAX summaries and fill in the receive buffers, (ii) the tree time, required to insert the items of the receive buffers into the index tree, and (iii) the query answering time, required to answer the queries assigned to sn. The sum of these times constitutes the total time that sn works during E; also, the buffer and tree times together constitute the time required to create the index, called index time. To compute all the above times for E, we take the maximum among the corresponding times of the nodes participating in E. We report the average times over 10 experiments.
Query scheduling. To compare Odyssey's query scheduling algorithms, we select the full replication strategy, to avoid measuring any overheads resulting from the partially replicated strategies. Recall that the scheduling algorithms cannot be used together with the no-replication strategies. We experimented with both Random (synthetic dataset) and Seismic (real dataset), and all of our algorithms positively affected performance in comparison with STATIC. Moreover, for the synthetic dataset, we have seen no remarkable differences between our scheduling algorithms, since the randomness in producing the data series of both the dataset and the query set results in queries that require almost the same effort to answer. We present the results for Seismic, where the effort for answering queries varies. Specifically, Figure 10 shows that, as the number of nodes increases, PREDICT-DN is the best scheduling policy in all cases, and it is up to 150% better than STATIC.
Work-stealing. Figure 10a shows that WORK-STEAL-PREDICT greatly outperforms (by up to almost 2x) PREDICT-DN for large numbers of nodes when using FULL replication, i.e., our work-stealing technique positively affects performance in these cases. The same is true for PARTIAL-2 replication, but to a lesser extent. Recall (from Section 3.2.2) that this happens because all the algorithms that do not use the work-stealing technique suffer from load-imbalance issues. Specifically, when a query set contains a few queries (significantly fewer than the number of nodes) that require significantly more effort to answer than the majority of queries, then, as the number of nodes increases, more nodes remain idle at the end of the corresponding query answering phase, since no such difficult query is assigned to them.
Query Scalability. To evaluate the scalability of Odyssey's algorithms with an increasing number of queries, we conducted experiments with WORK-STEAL using synthetic and real datasets. In Figure 11a, we present the results for the Random dataset (results with the other tested datasets are similar) with FULL replication, for a total of 100, 200, 400, and 800 queries. As we can see, WORK-STEAL scales almost perfectly with the increasing number of queries, since the time to execute 100 queries on 1 node is the same as the time to execute j * 100 queries on j nodes, j ∈ {2, 4, 8}. We have observed the same trend for the PARTIAL scheduling algorithms (Figure 11b). Note that PARTIAL replication can be applied only with two or more nodes. Additionally, we present in Figure 12 scalability experiments with increasing dataset size, for Random (between 100-1600GB) and Yan-TtI (between 100-800GB). We measure the total query answering time for 100 queries, when using 8 nodes. Note that we could not execute all replication strategies for all dataset sizes, due to the memory capacity of our nodes. The results show that query answering time scales gracefully as we increase the dataset size, while increasing the replication degree leads to better performance. Moreover, we observe that Odyssey's query answering algorithm achieves good scalability as the number of nodes increases. This is better illustrated in Figure 13, which presents the WORK-STEAL throughput on the Random dataset.

Replication.
We now study Odyssey's different replication strategies, using the Seismic dataset and WORK-STEAL-PREDICT, our best scheduling algorithm, to avoid any overhead incurred by load imbalances between nodes. Specifically, we test EQUALLY-SPLIT, PARTIAL-4, PARTIAL-2, and FULL, for varying numbers of queries. Figures 15a-15b present the query answering time, where we observe that the more a dataset is replicated, the less time is required to answer queries; this is consistent for all numbers of queries. So, the FULL replication strategy has the smallest query answering time. On the other hand, Figures 15c-15d present the total execution time, which also includes the time for index tree construction. Interestingly, for small query numbers (100), we observe exactly the opposite: a larger amount of data replication results in a bigger total time, with FULL now having the biggest index tree construction time. This happens because the increased index tree construction time dominates the total time. However, as the number of queries increases, the differences between the total execution times of the algorithms become smaller. Remarkably, for a large enough number of queries (e.g., 800), the increased index tree construction cost is amortized by the smaller query answering time, with the FULL replication strategy performing better than EQUALLY-SPLIT. This analysis reveals an interesting trade-off (regarding the level of replication) between the query answering cost and the index tree construction cost, where the latter can be amortized using a large enough set of queries. Figure 16 shows the results of the query answering experiment with 100 queries for the rest of the real datasets. We observe trends similar to those of Seismic (Figure 15a). Overall, when query answering needs to be optimized, we recommend that Odyssey be used with the highest possible replication degree (given the dataset size and compute-cluster characteristics).
Index Scalability. We present in Figure 14 the total index size in GBs, for every replication strategy when using 8 nodes, for all the real datasets we used and for Random 100GB (Ran.100). In all cases, the index size is very small compared to the size of the dataset. Figures 17a and 17b illustrate the index creation time of Odyssey for our 1B series Deep dataset using EQUALLY-SPLIT, as the dataset size increases on a system with 16 nodes, and as the number of nodes increases (while using the full-size datasets), respectively. In both cases, we observe optimal speedup for index creation. Additionally, Figure 17c presents the scalability of Odyssey on the Random dataset as both the dataset size and the number of nodes increase linearly, again using EQUALLY-SPLIT. As shown, Odyssey achieves perfect scalability, since the corresponding buffer times and index times remain almost constant.
Data partitioning and comparison to competitors. Figure 17d presents (i) a comparison of WORK-STEAL-PREDICT, Odyssey's best performing algorithm, against DMESSI, DMESSI-SW-BSF, and DPISAX; and (ii) the performance of Odyssey's different data partitioning schemes, i.e., EQUALLY-SPLIT and DENSITY-AWARE, as well as the FULL replication strategy, using Seismic. Interestingly, DMESSI performs significantly worse than all the other implementations, showing that simply executing multiple instances of a SotA single-node algorithm like MESSI on a multi-node system (in order to scale its applicability to larger dataset sizes) does not perform well on real datasets; thus, more sophisticated approaches are required. On the other hand, Odyssey's WORK-STEAL-PREDICT with the FULL replication strategy is significantly better than all its competitors. Specifically, it is up to 6.6x, 3.7x and 3.8x faster than DMESSI, DMESSI-SW-BSF, and DPISAX, respectively. Moreover, regarding Odyssey's data partitioning techniques, Figure 17d shows that WORK-STEAL-PREDICT with the DENSITY-AWARE partitioning performs better than with EQUALLY-SPLIT.
Extensions to k-NN and DTW. Finally, we present experiments with k-NN queries and the DTW distance, where we measure the query answering time for 100 queries as we increase the number of nodes, using different replication strategies. We evaluated all replication strategies when varying k between 1 and 20 for k-NN, and when varying the warping window size between 1%-15% of the series length for DTW. Figure 18 shows the k-NN results for k = 10, and Figure 19 shows the DTW results for 5% warping (results with the rest of the parameter values are similar). As expected, query answering times are in both cases higher than before, while using more nodes and higher replication degrees improves performance in the same way we observed in previous experiments. Results with Seismic exhibit similar trends and are omitted for brevity.

Conclusions
In this work, we presented Odyssey, a novel distributed data-series processing framework that takes advantage of the full computational capacity of modern clusters comprised of multi-core servers. Odyssey addresses a number of challenges in designing an efficient and highly-scalable distributed data series index, including efficient scheduling, load-balancing, and flexible partial replication, and successfully navigates the fundamental trade-off between data scalability and good performance during query answering. In future work, we plan to extend Odyssey to support subsequence similarity search Linardi and Palpanas [2020], as well as approximate similarity search.

Figure 1: From data series to iSAX index.

Figure 2: Algorithm outline of parallel DS indexes.

Figure 5: Outline of the Odyssey single-node query-answering process.

Algorithm 4: PerformWorkStealing (code for node sn).
Input: Index index, Function exact_search_workstealing_func, QuerySeries queries[], Integer total_nodes_per_nodegroup
Upon receiving a DONE message from node sn′:
    add sn′ to the set DoneNds
    if DoneNds contains all system nodes then terminate
Upon receiving a message msg = ⟨S, Qs, BSFs⟩ from node sn′:
    if |S| > 0 then
        create threads to traverse the RS-batches with ids in S
        populate and process the corresponding priority queues
        BSFArray[Qs] := BSFs    // computed by the threads above
        wait for all threads to complete
    ResponseFlag := 0
Always-enabled event (executed repeatedly), upon receiving no message:
    if !ResponseFlag then
        sn′ := a node chosen randomly from those not in DoneNds
        send(StealingRequest, sn) to sn′
        ResponseFlag := 1
Figure 6: (a) Sigmoid function fitting for determining TH; (b) performance for different threshold division factors.

Figure 8: Examples of partitioning the iSAX buffers' data to 4 system nodes, based on (a) simple iSAX and (b) Gray Code.

Figure 9: Flowchart of the DENSITY-AWARE data partitioning.

Table 1: Details of datasets used in experiments.

Algorithms. Our experimental analysis includes the entire range of Odyssey's data distribution strategies with k replication groups, PARTIAL-k, k ∈ {1, 2, 4, ..., N_sn}, as well as the density-aware data partitioning algorithm (DENSITY-AWARE). Recall that PARTIAL-N_sn, or EQUALLY-SPLIT, corresponds to no replication, and PARTIAL-1, or FULL, corresponds to full replication. Additionally, our analysis evaluates Odyssey's query scheduling algorithms: (i) static scheduling assigning equally sized query sets to nodes (STATIC); (ii) dynamic scheduling using a coordinator (DYNAMIC); and (iii) prediction-based scheduling, including static without ordering (PREDICT-ST-UNSORTED), static with ordering (PREDICT-ST), and dynamic (PREDICT-DN). Moreover, we evaluate Odyssey's work-stealing mechanism using both DYNAMIC and PREDICT-DN, resulting in the algorithms WORK-STEAL and WORK-STEAL-PREDICT, respectively; the latter is our best scheduling algorithm (cf. paragraph "Query scheduling").