Scaling deep learning data management with Cassandra DB

Deep learning (DL) algorithms require, to be fully effective, harvesting an increasingly large amount of data. These data, typically organized as millions of small files, stress filesystems and are difficult to manage. In fact, despite the huge development of DL tools and specialized hardware, the data loading pipeline for DL still lags behind in ease of use, standardization and scalability. In this work we try to rethink the data loading pipeline, by leveraging NoSQL DBs for storing both data and metadata, making them efficiently available through the network, and allowing easier data distribution for parallel DL training. We present our open-source, Apache Cassandra-based data loader and illustrate its use and performance, which enable easy and efficient data management and decentralized data distribution for parallel learning applications.


I. INTRODUCTION
Deep learning (DL) techniques are now ubiquitous and have been adopted in countless applications, and, thanks to ever more powerful GPUs and accelerators, they produce increasingly accurate predictions. In order to be fully effective, DL algorithms require processing an increasingly large amount of data that can easily comprise millions of different files and the associated metadata. However, while a lot of effort has been spent in optimizing the DL computational process, the data-loading pipeline still lags behind in ease of use, standardization and scalability [8,30,33].
As an example, let us consider the key problem of image classification, and tissue classification in particular, an important problem in Digital Pathology (e.g., the automated classification of breast or prostate tissue [20,27]). The input dataset consists of hundreds or thousands of gigapixel images (slides), from which smaller portions (patches) are extracted, together with their labels (e.g., normal/tumor), to make up the dataset for the DL training. Each patch has complex metadata that need to be tracked (coordinates within the slides, patient id, date, etc.) and which also need to be taken into account when creating train, test and validation sets. For example, patches must be divided into splits (e.g., train, validation, etc.) according to the patient id, and some target balance between classes in the training dataset (i.e., labels) is usually desired (e.g., 1:1 normal/tumor).
The typical workflow [16] requires some custom program to build the DL dataset, which will consist of millions of small files, saved in a filesystem, with their path encoding the split and class to which they belong (e.g., train/normal, validation/tumor, etc.). This process presents many drawbacks, affecting both usability and performance:
• The dataset is static and if some changes are required (e.g., a different train/validation ratio) this could imply recreating or moving millions of files.
• The user needs a custom way to keep track of the metadata (e.g., a database or a CSV file), since they might be needed at later stages of the processing.
• To allow parallel access to the dataset (e.g., for a distributed training or for different trainings working on the same dataset) one needs to either move the data to a network storage (which would dramatically decrease the data-loading performance) or set up a parallel filesystem, which is a complex task requiring dedicated hardware and careful configuration, while still often underperforming when accessing small files [5,13,19,21].
In this work we show how these problems can be addressed by adopting a data management strategy based on NoSQL databases, which can be leveraged to achieve horizontal scalability and low-latency access to the training data and metadata, via a user-friendly, flexible interface.
To illustrate our strategy we have developed an Apache Cassandra-based [17] data-loading module, which is focused on image classification and is being integrated with the DeepHealth Toolkit [4]. However, its design and architecture are of general interest and applicability.
The contributions are summarized as follows:
• We present a scalable strategy, based on Cassandra DB, to easily and efficiently manage data and metadata for DL.
• We describe a newly developed data loader, which implements our proposed management strategy, and analyze its performance and use.
• We extend the EDDL library [4], using Message Passing Interface (MPI), to support synchronized data parallelism.
• We show how our data loader can be used to easily distribute data, in a decentralized way, among the workers participating in a distributed training.
The rest of this manuscript is structured as follows. Section II provides some technology background. In Sec. III we describe the high-level design of our data-loading module, while its implementation is detailed in Sec. IV. Section V presents and discusses the empirical performance of the data loader, and Sec. VI focuses in particular on distributed training. Finally, Sec. VII points the reader to the software and Sec. VIII concludes the manuscript.

A. Related work
Within the TensorFlow framework, an approach to mitigate the small file problem is provided by TFRecord [28], which is a serialization format that allows for efficient loading and saving of datasets to/from disks, by grouping multiple records into bigger files, and also allows the saving of labels along with the features. However, it does not support random access to the saved items (and hence, e.g., global shuffling of the dataset), nor does it remove the disk and network bottlenecks when accessing the data.
An interesting approach for dealing with parallel filesystem bottlenecks in a distributed context has been adopted in [14]: instead of allowing full access to the dataset, each node is allowed to view only a subset of the data and the authors have developed a custom, multi-threaded data stager which reads partitioned data from GPFS and distributes them to the computing nodes via MPI calls. Yet, this procedure amounts to a static pre-distribution of data and hence it can neither handle image metadata nor support random access to the images.
FanStore [32] presents a more general approach to overcome filesystem bottlenecks, by developing a custom, MPI-based, parallel filesystem, which exposes a POSIX interface and is tailored for DL applications. It offers good scalability, but since it exposes a filesystem interface, it is not designed to easily handle image metadata.
A further step in overcoming parallel filesystem bottlenecks by completely avoiding communications can be seen in [15], where instead of transferring the data needed for the training, they are generated in real-time in each computing node. This technique, however, can be applied only in limited contexts, in which training is run on synthetic data.
There have also been approaches connecting DL frameworks to key-value DBs (e.g., the ml-pyxis plugin for PyTorch, which leverages the Lightning Memory-Mapped Database [22]), but they lack the scalability and flexibility of the data loader presented in this paper.

B. DeepHealth Toolkit
The DeepHealth Toolkit [4] is an open-source DL toolkit, particularly focused on enabling easy DL adoption in the medical field. It is written in C++, exposes C++ and Python APIs, and it natively supports cloud computing. Our data-loading module is written to interface with the DeepHealth Toolkit, but it may be adapted, without too much effort, to work with other popular DL frameworks, such as TensorFlow or PyTorch.

C. Apache Cassandra
NoSQL databases are data storage systems which support high availability and horizontal scalability, at the expense of lower consistency guarantees than standard SQL databases [7]. Apache Cassandra [17] is a distributed, decentralized and highly scalable NoSQL DB; it is a free and open-source project and is widely adopted both in industry (e.g., Netflix, Uber [6]) and in big data analytics contexts [25] (e.g., in the CERN ATLAS project [26]). As for the performance, Apache Cassandra offers low-latency (typically less than a millisecond), high-bandwidth, concurrent accesses to the stored data, while supporting easy scalability, high availability and tunable data redundancy. Also, it does not require special hardware (apart from fast disks and/or large memory) and thus it could even be installed on the very same computing nodes, if needed.

III. ARCHITECTURE
Our effort in rethinking the data loading pipeline is twofold, having both usability and performance objectives:
• we aim at storing and accessing features and metadata uniformly, i.e., using the same system (Cassandra DB in our case) for both;
• we want to allow scalable, flexible and fast network access to the data;
• we aim at offering dynamic, random access to the data, allowing full, unrestricted access to the dataset (since every DL training algorithm based on Stochastic Gradient Descent requires the data to be sufficiently, ideally uniformly, reshuffled after every epoch [10]);
• we want to decouple storage and splits management, so that different datasets (possibly drawing on the same set of images) may be used, even concurrently, whenever needed (e.g., initial tests can be run on a smaller subset of data, and subsets with different characteristics can easily be obtained and explored by filtering according to metadata);
• we want to simplify the data distribution in parallel DL training.

A. Workflow
The workflow that we have designed to achieve the previous objectives is the following:
• All the images that might be needed for DL are saved as BLOBs in the Cassandra DB (details in Sec. III-B), together with labels and metadata, and are identified by a UUID, thus allowing collision-free, distributed, uncoordinated data insertion [18].
• The DB is queried to get the full list of UUIDs and the metadata which are required for creating the splits.
• The splits (expressed as lists of UUIDs) are then created automatically, based on target values and constraints involving metadata (see Sec. IV).
• Finally, during the training and validation phases, when the data are needed they are efficiently pre-fetched by their UUID, and fed to the DL library.

B. Data model
When designing data models for NoSQL databases, particular attention must be devoted to choosing which keys to adopt and which tables to denormalize, since these initial choices determine which queries are allowed and how the system performs when answering them. To allow both fast retrieval of data and easy access to metadata, we have chosen to organize datasets in three tables:
metadata_by_nat: Each record of this table contains all the metadata, the label, and a randomly generated UUID. Its partition keys are the "natural" ones of the dataset, plus the label (see example below).
data_by_uuid: Records in this table contain only the minimum data needed by the training, i.e., the BLOB of the image file, the label, and finally the UUID as primary key.
metadata_by_uuid: This (optional) denormalized table contains all the fields of metadata_by_nat, but it has the UUID as primary key.
This data organization is extremely flexible and can easily be adopted in most image classification contexts.

C. CQL tables
The user has to identify the required metadata for each dataset they want to use, and create the appropriate tables to store data and metadata in the Cassandra DB. The list of columns is then passed to the data loader (see Sec. IV and the example in Listing 1), which will accordingly use these columns when creating the splits.
As an example, here is a minimal CQL description of the tables that might be needed for the automated tissue classification. The metadata_by_nat table is used when creating the splits, and makes it possible to efficiently retrieve the full list of patients, slides and labels and to fetch the UUIDs of the patches for any given patient/slide/label combination. Once we have these data it is relatively easy to ask the system, e.g., to create 5 splits, with different patients in each split, using a total of 1 million patches with size ratios [5, 2, 1, 1, 1] among the 5 splits, and keep balanced labels (1:1 normal/tumor ratio).
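As an illustration only, the three tables described in Sec. III-B could be declared along the following lines. This is a sketch under stated assumptions, not the authors' actual schema: the keyspace name `patches`, every column name (`patient_id`, `slide_num`, `x`, `y`, `patch_id`, `data`) and the helper `create_tables` are hypothetical. The CQL statements are kept in Python strings so they can be handed to any session object exposing an `execute(query)` method, such as the session of the DataStax Python driver.

```python
# Hypothetical schema: keyspace, column names and types are illustrative only.
KEYSPACE = "patches"

TABLES = {
    # Partitioned by the "natural" keys plus the label, as described in the text.
    "metadata_by_nat": """
        CREATE TABLE IF NOT EXISTS {ks}.metadata_by_nat (
            patient_id text,
            slide_num int,
            label int,
            x int,
            y int,
            patch_id uuid,
            PRIMARY KEY ((patient_id, slide_num, label), x, y)
        )""",
    # Minimal table read during training: one single-row partition per patch.
    "data_by_uuid": """
        CREATE TABLE IF NOT EXISTS {ks}.data_by_uuid (
            patch_id uuid PRIMARY KEY,
            label int,
            data blob
        )""",
    # Optional denormalized copy of the metadata, keyed by UUID.
    "metadata_by_uuid": """
        CREATE TABLE IF NOT EXISTS {ks}.metadata_by_uuid (
            patch_id uuid PRIMARY KEY,
            patient_id text,
            slide_num int,
            label int,
            x int,
            y int
        )""",
}


def create_tables(session, keyspace=KEYSPACE):
    """Run every CREATE TABLE statement through session.execute()."""
    for ddl in TABLES.values():
        session.execute(ddl.format(ks=keyspace))
```

With this layout, the UUIDs of all patches for a given patient/slide/label combination sit in one partition of `metadata_by_nat`, while training touches only `data_by_uuid`.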
The splits are expressed as lists of UUIDs and, once they have been created, they can easily be saved and loaded as needed. Subsequently, the training process will only need to access the data_by_uuid table, via efficient queries to single-row partitions. Note that more than one data table may be created, e.g., one might also want to save a color-normalized dataset, along with the original one.
Fig. 1. System architecture diagram of our data loader. Images are extracted from the raw dataset and pre-processed to be inserted, together with relevant metadata, in the Cassandra DB. The DB is subsequently queried to build the list of splits and to fetch images and labels whenever needed by the DL application.
Finally, the optional metadata_by_uuid table can be used at later stages of the DL workflow. E.g., when analyzing the training results one might want to trace misclassified patches back to the slides from which they have been extracted, to check for systematic errors in the original labeling.

IV. IMPLEMENTATION
The data loader module is written in C++ and Python, and it is made up of three main classes (as shown in the system architecture diagram of Fig. 1):
CassandraListManager: This high-level Python class takes care of creating the splits, given the desired target parameters. Details in Sec. IV-A.
BatchPatchHandler: This low-level C++ class (with Python bindings exposed via pybind11 [23]) takes care of efficiently retrieving a batch of features and labels. It accepts as input a list of UUIDs and applies data augmentation via the ECVL library [4], if needed. Details in Sec. IV-B.
CassandraDataset: This is the main interface for using the data loader. It is written in Python and offers simple methods to load split files and fetch batches of data (features and labels). Its use is straightforward, as can be seen in the minimal example in Listing 1.

A. CassandraListManager -Split creation
The automatic creation of splits works as follows:
• First, the list of Cassandra DB partitions is read.
• Then, the list of UUIDs contained in each partition is read and they are aggregated based on the chosen keys (e.g., patches are aggregated based on patient_id, so that patches of the same patient will all belong to the same split and thus will either be in the training or in the validation set).
• Each aggregated partition (group) is assigned to a split, so that the target values for each split and class are approximately met. (In more detail, the desired target values are computed for each split/class combination and groups are assigned in round robin to the splits, provided they do not make them overflow. Finally, remaining groups are assigned to the splits randomly.)
• Once a bag of groups for each split has been computed, the rows are extracted in round robin from each group (to maximize diversity), until the target values are reached (or no more rows are available).
Note that given the standardized way in which data are stored in the DB and splits are formed (as lists of UUIDs), it is relatively easy to extend the split creation process with custom code, using the full list of rows which can be retrieved from the DB.
Listing 1. Example of data-loader use (only the first lines survive: from cassandra_dataset import CassandraDataset; from cassandra.auth import PlainTe...).
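The assignment procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not the CassandraListManager code: it handles a single class rather than one target per split/class combination, and the function names (`assign_groups`, `draw_rows`, `make_splits`) are hypothetical.

```python
import random
from itertools import zip_longest

_MISSING = object()  # padding marker for groups of unequal length


def assign_groups(groups, targets, rng):
    """Assign each group (e.g., one patient) to exactly one split."""
    bags = [[] for _ in targets]
    filled = [0] * len(targets)
    leftovers, cursor = [], 0
    for key, rows in groups.items():
        # Round robin: try each split once, starting from the cursor.
        for step in range(len(targets)):
            s = (cursor + step) % len(targets)
            if filled[s] + len(rows) <= targets[s]:  # would not overflow
                bags[s].append(key)
                filled[s] += len(rows)
                cursor = (s + 1) % len(targets)
                break
        else:
            leftovers.append(key)
    # Groups that fit nowhere are assigned to a random split.
    for key in leftovers:
        bags[rng.randrange(len(targets))].append(key)
    return bags


def draw_rows(groups, bag, target):
    """Extract rows in round robin across the groups of one split."""
    rows = []
    for layer in zip_longest(*(groups[k] for k in bag), fillvalue=_MISSING):
        for row in layer:
            if row is _MISSING:
                continue
            if len(rows) == target:
                return rows
            rows.append(row)
    return rows


def make_splits(groups, ratios, total, seed=0):
    """groups maps a grouping key to its list of UUIDs."""
    rng = random.Random(seed)
    targets = [total * r // sum(ratios) for r in ratios]
    bags = assign_groups(groups, targets, rng)
    return [draw_rows(groups, bag, t) for bag, t in zip(bags, targets)]
```

Because every group lands in exactly one bag, no patient can appear in more than one split, which is the constraint the aggregation step is meant to enforce.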

B. BatchPatchHandler -Performance optimizations
In order to increase the loader throughput, we have adopted the following optimizations in our code:
• Data for each split are read in parallel by a thread pool (with 32 threads as default).
• Data are prefetched in background, while the GPU is processing the previous (mini-)batch.
• Data augmentations are applied in background as well.
• Double-buffering is used to reduce the DB+network latency: i.e., the download of a second batch starts while the first one is still in progress, thus halving the average batch latency.
• Expensive system resources, such as threads and Cassandra connections, are allocated lazily. This means that only splits actually in use consume resources and it is thus possible to have many unused splits without impacting the system performance (see application in Sec. VI).
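The combination of a thread pool, background prefetching, double-buffering and lazy allocation can be illustrated with a small Python sketch. The actual BatchPatchHandler is written in C++; the class below, its name `PrefetchingLoader` and the `fetch_one` callback (standing in for a single-row query by UUID) are assumptions made for this example.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class PrefetchingLoader:
    """Iterate over batches while keeping `depth` batches in flight."""

    def __init__(self, fetch_one, uuids, batch_size, threads=32, depth=2):
        self._fetch_one = fetch_one
        self._batches = [uuids[i:i + batch_size]
                         for i in range(0, len(uuids), batch_size)]
        self._threads = threads
        self._depth = depth      # depth=2 corresponds to double-buffering
        self._pool = None        # lazily allocated on first use
        self._pending = deque()
        self._next = 0

    def _fill(self):
        # Submit batches until `depth` of them are being downloaded.
        while len(self._pending) < self._depth and self._next < len(self._batches):
            if self._pool is None:
                self._pool = ThreadPoolExecutor(max_workers=self._threads)
            batch = self._batches[self._next]
            self._next += 1
            self._pending.append(
                [self._pool.submit(self._fetch_one, u) for u in batch])

    def __iter__(self):
        try:
            self._fill()
            while self._pending:
                futures = self._pending.popleft()
                self._fill()  # start the next download before waiting
                yield [f.result() for f in futures]
        finally:
            if self._pool is not None:
                self._pool.shutdown()
                self._pool = None
```

A loader that is created but never iterated holds no threads, which mirrors the lazy allocation that lets many unused splits coexist cheaply.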

V. EVALUATION AND DISCUSSION
In this section we analyze our data management strategy, with the objective of identifying possible performance bottlenecks. In particular we want to measure the communication performance both on the client (data loader) and server (Cassandra) side. For the client side, we will measure the maximum throughput achievable by a single data loader, when it does not have to wait for computations on the retrieved data (i.e., cutting the actual GPU work). For the server side we will verify that our Cassandra servers are able to saturate their outgoing bandwidth, and that their retrieval time does not grow too much under heavy traffic load.
To allow better reproducibility we have tested our data loader using the standard ImageNet-2017 dataset (166 GB, 1,281,167 images, 1000 classes) [24]. Our test system is a cluster of 18 nodes, up to 2 running Cassandra DB and up to 16 consuming the data. The nodes are equipped with Intel Xeon E5-2680 v3 CPUs (12 cores, 2 threads/core) and are connected via a 10 Gb/s Ethernet (used by Cassandra) and a 56 Gb/s InfiniBand (used by our MPI parallel DL trainer for exchanging data, see Sec. VI). The nodes do not have GPUs, but since in this work we are only interested in the data loading stage, we have simply simulated their presence (by means of appropriate time sleeps) whenever needed.
For portability reasons we chose not to run on bare metal, but we have instead adopted Docker containerization with Kubernetes orchestration, both for Cassandra servers and for our data loader. The use of containers can introduce some network overheads, but these are mostly negligible and more than compensated for by the ease of deploying and managing the system [1].

A. Populating the DB
We have resized and center-cropped all the images to the standard resolution 224x224x3 (RGB) and saved them as BLOBs in the Cassandra DB, both as JPEG (quality: 90, average size: 20 kB) and as uncompressed TIFF (size: 150 kB). This data preprocessing step is easily parallelizable and scalable (no synchronizations are needed) and we implemented it with PySpark [31].

B. Performance of the data loader (client)
We have first tested the raw performance of a "short-circuited" data loader, i.e., one which reads as many batches as possible, without actually consuming the data. Results are shown in Table I and Figure 2. For the smaller (JPEG) images the throughput is between 11,000 and 18,000 images per second, whereas for the uncompressed TIFFs it is between 3800 and 4900 images per second, peaking when the batch size is 128. Since the thread parallelism is always 32, the smallest batch size pays the maximum latency out of 32 images, whereas, when the batch size increases, the latencies are averaged between subsequent rounds of retrieval, and hence the throughput increases (for example, if the batch size is 128, the retrieval time of each thread is the sum of 4 sequential transfers). However, as the batch size continues to grow, so does the stress on the Cassandra server while serving a batch, increasing the retrieval latency, which in turn decreases the overall throughput. This behavior is more evident in the case of uncompressed images as shown in Figure 2. In fact, assuming full bandwidth and a conservative network latency of 30 µs, transferring 20 KB and 150 KB on a 10 Gb Ethernet takes, respectively, less than 50 µs and about 150 µs. Comparing these network transfer times with the DB retrieval times shown in Table II, we can see how the latter tend to dominate the overall communication time.
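The network transfer estimates quoted above can be reproduced with a short calculation. The sketch assumes decimal units (1 KB = 1000 bytes) and a single fixed latency term added to the serialization time; the function name is illustrative.

```python
def transfer_time_us(size_bytes, bandwidth_bps=10e9, latency_us=30.0):
    """Latency plus serialization time at full bandwidth, in microseconds."""
    return latency_us + size_bytes * 8 / bandwidth_bps * 1e6


jpeg_us = transfer_time_us(20_000)    # 30 us latency + 16 us on the wire
tiff_us = transfer_time_us(150_000)   # 30 us latency + 120 us on the wire
print(f"JPEG: {jpeg_us:.0f} us, TIFF: {tiff_us:.0f} us")
```

Under these assumptions the two transfers take roughly 46 µs and 150 µs, in line with the figures given in the text.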
As for the computational resources required in the data loading we note that they depend roughly on the transaction rate (assuming no data augmentation is performed). At maximum throughput, transferring 18k JPEG/s results in a CPU load of about 1900% (i.e., 19 threads at full speed, hence close to CPU saturation), whereas when moving 5k TIFF/s the load is about 400%. Considering that ResNet-50 [12], the standard network when testing the ImageNet dataset, consumes about 200 images/s on an NVIDIA TITAN RTX GPU, we can see that, depending on the chosen batch size, a single data loader can sustain about 50-90 GPUs when transferring compressed JPEG, and 19-24 when using uncompressed TIFF.
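The GPU counts follow directly from dividing the loader throughput by the per-GPU consumption rate. A minimal check, using only the numbers stated in the text (the variable names are illustrative):

```python
GPU_RATE = 200  # images/s consumed by ResNet-50 on one GPU, as stated above

# (min, max) loader throughput in images/s across batch sizes
throughput = {"JPEG": (11_000, 18_000), "TIFF": (3_800, 4_900)}

gpus = {fmt: (low / GPU_RATE, high / GPU_RATE)
        for fmt, (low, high) in throughput.items()}

for fmt, (low, high) in gpus.items():
    print(f"{fmt}: {low:.1f} to {high:.1f} GPUs per data loader")
```

The JPEG range works out to 55 to 90 GPUs and the TIFF range to 19 to 24.5, matching the rounded figures above.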

C. Performance of Cassandra DB (server)
In this section we investigate the behavior of Cassandra server nodes under heavy load. We are interested in particular in seeing whether they can saturate the outgoing bandwidth (10 Gb/s) when flooded by data requests, and if they can service these requests while keeping the DB latency stable. To this purpose we have measured the distribution of Cassandra retrieval latency (via the nodetool tablehistograms command) both with 1 and 16 active, short-circuited data loaders (with batch size = 256) which try to read as much data as possible from the servers. The results, in Table II, show that the network can be saturated both when retrieving compressed and uncompressed images and that read latencies up to the 95th percentile remain almost constant even when the network is saturated, whereas the 99th percentile grows approximately by a factor of 2.
As for the computational intensity, on a heavily loaded Cassandra node we measure 1000% CPU usage when retrieving JPEGs and 700% with TIFFs.
The image rates at saturation are about 50,000 images/s for 20 kB JPEGs and 7000 images/s for not compressed 150 kB TIFFs, which amounts to serving enough data to feed, respectively, 250 and 35 GPUs, per Cassandra node.
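As a sanity check on these rates, the payload bandwidth they imply can be computed and compared with the 10 Gb/s link. The sketch assumes decimal units (1 kB = 1000 bytes) and counts image payload only, ignoring any protocol overhead; the names are illustrative.

```python
GPU_RATE = 200  # images/s per GPU, as stated in Sec. V-B


def payload_gbit_per_s(images_per_s, size_bytes):
    """Image payload only, in Gb/s."""
    return images_per_s * size_bytes * 8 / 1e9


jpeg_gbps = payload_gbit_per_s(50_000, 20_000)
tiff_gbps = payload_gbit_per_s(7_000, 150_000)
jpeg_gpus = 50_000 / GPU_RATE
tiff_gpus = 7_000 / GPU_RATE
print(f"JPEG: {jpeg_gbps:.1f} Gb/s, {jpeg_gpus:.0f} GPUs per node")
print(f"TIFF: {tiff_gbps:.1f} Gb/s, {tiff_gpus:.0f} GPUs per node")
```

Under these assumptions the payload alone accounts for roughly 8.0 and 8.4 Gb/s of the 10 Gb/s link in the two cases.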

D. Scaling up/down Cassandra DB
Cassandra DB allows for nodes to be added/removed to/from an existing ring (i.e., a Cassandra cluster), without any service disruption. We have verified that, when activating a second Cassandra node under heavy load, the outgoing bandwidth on both nodes is still saturated, and that when the second node is subsequently deactivated the load on the first one remains stable. Note that the scaling up/down of the ring has been performed while 16 loaders were continuously retrieving data from the DB, without any service interruption.
E. Discussion
1) Performance comparison with parallel filesystems: Our data loader, compared to high-end parallel filesystems, has a major performance disadvantage: communications to Cassandra servers are TCP based, whereas in parallel filesystems there can be RDMA transfers directly from the storage to the consuming nodes (if the network supports them) and this impacts both network latency and CPU usage. However, since the retrieval latency dominates the network one, the performance gap is not too wide: for example, a parallel filesystem installation using BeeGFS has latency in the order of 100 µs and can support up to 250,000 operations/s per node [3]. In our case, we can fetch 150 KB images with a latency of about one millisecond and we can reach bandwidth saturation at a rate of 50,000 transfers/s per server node for 20 KB images, using a 10 GbE network and general purpose nodes. Parallel filesystems, on the other hand, do not simplify the management of splits and metadata, as our approach does. Overall, we think that our design can be of particular interest for small and medium size systems, showing a good trade-off among performance, cost and ease of deployment.
2) Floating point vs integer data: Our use case utilizes integer data (i.e., RGB images), but in many scientific applications this is not the case: for example one might apply DL techniques to general tensors, where each spatial point is associated with floating point data (e.g., 64-bit), and, if the data are not compressed, an 8x increase in the required bandwidth has to be taken into account (compared to our uncompressed TIFF case). This means that a 10 GbE link can sustain only up to 4 GPUs. One approach that could be explored to help scale the computation in this case (apart from using a faster network connection) would be using each computing node also as a Cassandra server, somewhat resembling the solution adopted in [15] for synthetic data.

VI. DISTRIBUTED DEEP LEARNING
An application that can benefit hugely from our data loading solution is distributed DL [2]. In this section, after a brief introduction to parallelization methods for DL, we present a simple distributed training that leverages our data management strategy and we analyze its implementation and performance.

A. Data and model parallelism
The two main approaches to parallelize DL algorithms are data and model parallelization [2].
A general workflow for data parallelism is the following:
• the neural network (NN) to be trained is copied to all the computational devices (e.g., GPUs);
• at every iteration, the current (global) mini-batch of samples is divided into chunks (local mini-batches), that are mapped to the local copies of the NN;
• at the end of each iteration the local gradients, computed after a back-propagation pass, are aggregated among all (or almost all) the workers, to compute the global gradient and update the network parameters.
In a distributed system the average of the gradients is typically implemented with an All-Reduce operation, followed by a local update of the parameters. This synchronization step affects the parallel efficiency of the distributed training, thus limiting the scalability of the computation. Another, subtler, scalability issue is related to the mini-batch size: as the parallelism increases, so does the global batch size and this can affect the generalization capability of the model [2].
In model parallelism the computations of different parts of a neural network are performed by different devices (e.g., GPUs). In this case the mini-batch size is independent of the parallelism, and hence there is no reduction of generalization capabilities as in data parallelism. On the other hand, the main drawback of this approach is given by the high communication costs due to the dependencies among different parts of the NN. Some enhanced architectures have been proposed in the literature to mitigate this overhead by using redundant computations; however, model parallelism is typically used when the NN cannot fit on a single computational device or if the particular NN architecture (e.g., LSTM models) can be efficiently split across different devices.
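The data-parallel workflow can be illustrated with a toy, single-process simulation. It is a sketch only: the "network" is a one-parameter linear model y = w * x trained with a squared-error loss, the workers are list entries rather than MPI ranks, and `all_reduce_mean` merely stands in for a real All-Reduce.

```python
def local_gradient(w, chunk):
    """Gradient of the mean squared error of y = w * x on a local mini-batch."""
    return sum(2 * (w * x - y) * x for x, y in chunk) / len(chunk)


def all_reduce_mean(values):
    """Stand-in for an All-Reduce: every worker receives the global mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(weights, global_batch, lr=0.01):
    """One synchronized iteration; weights[i] is the copy held by worker i."""
    n = len(weights)
    chunks = [global_batch[i::n] for i in range(n)]   # local mini-batches
    grads = [local_gradient(w, c) for w, c in zip(weights, chunks)]
    synced = all_reduce_mean(grads)                   # aggregate gradients
    return [w - lr * g for w, g in zip(weights, synced)]  # local updates


batch = [(float(x), 3.0 * x) for x in range(1, 9)]    # samples of y = 3x
weights = [0.0] * 4                                    # 4 identical replicas
for _ in range(50):
    weights = data_parallel_step(weights, batch)
```

Since every replica starts from the same value and applies the same averaged gradient, the copies stay identical after each step, which is the invariant a synchronized scheme relies on. With equal-sized chunks, the averaged gradient also equals the gradient of the full global mini-batch.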

B. Decentralized data distribution
In order to stress our data loading pipeline, we have implemented a basic version of synchronized data parallelism for the EDDL library [4], extending the SGD optimizer by using Open MPI to compute the average of gradients, the losses and the performance metrics among all the parallel ranks. Note that this approach applies transparently to both inter- and intra-node communications.
We chose to keep a copy of synchronized parameters on each worker (an approach also called mirrored strategy [9]), instead of using a centralized parameter server, as we are not interested in implementing more complex distributed schemes like asynchronous updates. We also chose to update NN parameters at each iteration to closely mimic the behavior of the original SGD algorithm.
Our data management strategy makes it easy to distribute (and uniformly, globally permute) data among the MPI ranks, without the need for a centralized process, as exemplified by the following procedure for a parallel system of size n:
• At startup, each rank reads the full list of the images hosted by the Cassandra servers. This can be done either by querying the DB directly or by reading a pre-shared file (of size about 60 MB for ImageNet).
• The data loader and the network on each rank are initialized with the same seed (e.g., broadcast by rank 0).
• Each data loader creates 2n splits: n for training and n for validation (as described in Sec. IV-A).
• Rank i will read training data from split i and validation data from split n + i.
• At the end of an epoch the UUIDs in the training splits are shuffled (again, using the same seed on each rank) and the next epoch can start.
Some observations:
• In EDDL it is currently impossible to set the seed for the network initialization, hence at startup we broadcast the network parameters from rank 0, to be able to start the training everywhere in the same state.
• Since resources consumed by splits are lazily allocated, the load on each rank remains constant when the parallelism grows.
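The key idea of the procedure, namely that identical seeds let every rank derive the same global permutation without any coordination, can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: the class `RankView`, the `val_fraction` parameter and the strided partitioning are hypothetical, and the grouping constraints of Sec. IV-A are ignored.

```python
import random
import uuid


class RankView:
    """What a single rank computes locally from the shared UUID list."""

    def __init__(self, uuids, rank, n_ranks, seed, val_fraction=0.1):
        self.rank, self.n = rank, n_ranks
        self.rng = random.Random(seed)      # same seed on every rank
        ids = sorted(uuids)                 # same starting order everywhere
        self.rng.shuffle(ids)
        n_val = int(len(ids) * val_fraction)
        self.val, self.train = ids[:n_val], ids[n_val:]

    def splits(self):
        """2n splits: n for training followed by n for validation."""
        return ([self.train[i::self.n] for i in range(self.n)]
                + [self.val[i::self.n] for i in range(self.n)])

    def my_train(self):
        return self.splits()[self.rank]            # split i

    def my_val(self):
        return self.splits()[self.n + self.rank]   # split n + i

    def end_epoch(self):
        # Every rank advances its generator identically, so the new
        # global permutation is again the same everywhere.
        self.rng.shuffle(self.train)


# Simulate 4 ranks that never communicate with each other.
all_ids = [str(uuid.UUID(int=i)) for i in range(100)]
ranks = [RankView(all_ids, r, 4, seed=42) for r in range(4)]
for view in ranks:
    view.end_epoch()
epoch_parts = [view.my_train() for view in ranks]
```

Because each rank only ever reads its own two splits, the remaining 2n - 2 splits exist as lists of UUIDs but, thanks to lazy allocation, consume no threads or connections.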

C. Simulation of multi-GPU training
We have adapted our MPI distributed learner to simulate the load on the Cassandra servers induced by different training configurations (up to 16 nodes, up to 4 GPUs per node) in the following way:
• The GPU computations have been replaced by appropriate time sleeps, obtained by actual performance measures on an NVIDIA TITAN RTX GPU.
• The communications are normally carried out, using the UCX module [29] (in our system: inter-node via InfiniBand, intra-node via shared memory).
We have chosen to assign 6 threads per MPI rank (i.e., options --map-by node:pe=6 --bind-to core of mpirun) and we have specified in the hostfile a number of slots for each node equal to the number of simulated GPUs.
After the initial setup phase, in which the splits are created, each worker starts the loop across the epochs. During each epoch two inner loops across the local batches are performed, respectively to train and validate the model. From the point of view of the data loader the operations performed are identical, because the only difference is the split index used to get the local batch. However, the operations simulated on the retrieved data are different. The training loop involves a local forward and backward propagation (simulated by time sleeps), followed by the gradient average operation (which is instead performed in full, exactly as in the case where GPUs are available), run to keep the network copies synchronized at the end of every iteration. The validation loop, on the other hand, computes only a local forward pass (simulated) followed by a global average for losses and metrics (fully performed). Since these average operations involve only floating point communications, their overhead is negligible and the validation task behaves as an embarrassingly parallel algorithm and scales linearly (up to saturation of the outgoing bandwidth of the Cassandra servers). In accordance with the difference in the operations, the sleeps for training (forward and backward) and validation (only forward) are different, so as to match the actual values we have measured on GPUs.
The throughputs obtained by simulating distributed trainings with compressed and uncompressed images are shown in Tables III and IV, and Figures 3a and 3b. As can be seen from the data, the validation measurements show a linear relationship between the number of running workers and the throughput. There is an exception for the validation of uncompressed images with 64 workers, where the throughput saturates the bandwidth of the 2 Cassandra servers being used.