Published January 13, 2022 | Version v1
Conference paper Open

Scaling deep learning data management with Cassandra DB

  • 1. CRS4

Description

To be fully effective, deep learning (DL) algorithms require increasingly large amounts of data. These data, typically organized as millions of small files, stress filesystems and are difficult to manage. In fact, despite the rapid development of DL tools and specialized hardware, the data loading pipeline for DL still lags behind in ease of use, standardization and scalability. In this work we rethink the data loading pipeline by leveraging NoSQL DBs to store both data and metadata, making them efficiently available over the network and allowing easier data distribution for parallel DL training. We present our open-source, Apache Cassandra-based data loader and illustrate its use and performance, which enable easy and efficient data management and decentralized data distribution for parallel learning applications.
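The idea of keeping data and metadata under a shared key and splitting the keys among parallel workers can be pictured with a minimal sketch. This is not the authors' loader or its API: plain in-memory dictionaries stand in for database tables, and every name here (`build_store`, `shard_keys`, `batches`) is hypothetical and chosen only for illustration.

```python
import random
import uuid
from typing import Dict, Iterable, Iterator, List, Tuple


def build_store(n_items: int) -> Tuple[Dict[uuid.UUID, bytes], Dict[uuid.UUID, int]]:
    """Create two toy 'tables' keyed by the same UUID.

    Assumption: one table holds raw sample bytes, the other holds
    metadata (here just an integer label). A real deployment would
    keep these in database tables rather than Python dicts.
    """
    data: Dict[uuid.UUID, bytes] = {}
    meta: Dict[uuid.UUID, int] = {}
    for i in range(n_items):
        key = uuid.UUID(int=i)  # deterministic keys, for the example only
        data[key] = f"sample-{i}".encode()
        meta[key] = i % 2
    return data, meta


def shard_keys(keys: Iterable[uuid.UUID], rank: int, world_size: int, seed: int) -> List[uuid.UUID]:
    """Return the keys assigned to one worker.

    Every worker sorts the keys, shuffles them with the same seed, and
    keeps every world_size-th key starting at its own rank, so the
    shards are disjoint and together cover the whole key set.
    """
    ordered = sorted(keys)
    random.Random(seed).shuffle(ordered)
    return ordered[rank::world_size]


def batches(
    keys: List[uuid.UUID],
    data: Dict[uuid.UUID, bytes],
    meta: Dict[uuid.UUID, int],
    batch_size: int,
) -> Iterator[List[Tuple[bytes, int]]]:
    """Yield lists of (sample bytes, label) pairs looked up by key."""
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        yield [(data[k], meta[k]) for k in chunk]


if __name__ == "__main__":
    data, meta = build_store(10)
    for rank in range(3):
        shard = shard_keys(meta.keys(), rank, world_size=3, seed=0)
        print(rank, [len(b) for b in batches(shard, data, meta, batch_size=2)])
```

Because only the small key list is shuffled and split, each worker can fetch its own samples over the network independently, without a central process handing out files.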

Files

cassandradl.pdf (935.9 kB)
md5:0222041fd6bab1eff4c448f91c3f3557

Additional details

Funding

European Commission: DeepHealth – Deep-Learning and HPC to Boost Biomedical Applications for Health (grant 825111)