Published January 13, 2022 | Version v1
Conference paper Open

Scaling deep learning data management with Cassandra DB

  • 1. CRS4


To be fully effective, deep learning (DL) algorithms require harvesting increasingly large amounts of data. These data, typically organized as millions of small files, stress filesystems and are difficult to manage. Despite the rapid development of DL tools and specialized hardware, the data loading pipeline for DL still lags behind in ease of use, standardization, and scalability. In this work we rethink the data loading pipeline by leveraging NoSQL DBs to store both data and metadata, making them efficiently available through the network and allowing easier data distribution for parallel DL training. We present our open-source, Apache Cassandra-based data loader and illustrate its use and performance, which enable easy and efficient data management and decentralized data distribution for parallel learning applications.
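The idea of keyed storage enabling decentralized data distribution can be illustrated with a minimal sketch. This is not the authors' data loader or its API: two in-memory dictionaries stand in for the metadata and data tables that would live in the database, and the names `build_store`, `shard_ids`, and `batches` are hypothetical. It assumes every sample is addressed by a UUID key, so each parallel worker can derive its own share of the dataset from the key list alone, without a coordinator.

```python
import random
import uuid


def build_store(n_samples, seed=0):
    """Create toy stand-ins for a metadata table and a data table.

    Assumption: both tables are keyed by the same per-sample UUID.
    """
    rng = random.Random(seed)
    metadata, data = {}, {}
    for i in range(n_samples):
        sample_id = uuid.UUID(int=rng.getrandbits(128))
        metadata[sample_id] = i % 2              # toy binary label
        data[sample_id] = bytes([i % 256]) * 16  # toy payload
    return metadata, data


def shard_ids(sample_ids, rank, world_size, seed=0):
    """Return the sample ids assigned to one worker for one epoch.

    Every worker sorts and shuffles the full id list with the same seed,
    then takes a strided slice, so the shards are disjoint and together
    cover the whole dataset.
    """
    ids = sorted(sample_ids)
    random.Random(seed).shuffle(ids)
    return ids[rank::world_size]


def batches(ids, data, metadata, batch_size):
    """Yield lists of (payload, label) pairs fetched by key."""
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        yield [(data[k], metadata[k]) for k in chunk]


if __name__ == "__main__":
    metadata, data = build_store(10)
    for rank in range(3):
        my_ids = shard_ids(metadata.keys(), rank, world_size=3)
        n_batches = sum(1 for _ in batches(my_ids, data, metadata, 2))
        print(f"rank {rank}: {len(my_ids)} samples, {n_batches} batches")
```

In a real deployment the dictionary lookups would be replaced by network queries against the database, and changing the shuffle seed per epoch would give each worker a different shard every epoch.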




Additional details


Funding: European Commission, DeepHealth – Deep-Learning and HPC to Boost Biomedical Applications for Health (grant 825111)