Scaling deep learning data management with Apache Cassandra
To be fully effective, deep learning (DL) algorithms require harvesting increasingly large amounts of data. These data, typically organized as millions of small files, stress filesystems and are difficult to manage. In fact, despite the rapid development of DL tools and specialized hardware, data loading pipelines for DL still lag behind in ease of use, standardization, and scalability. In this work we rethink the data loading pipeline by leveraging NoSQL databases to store both data and metadata, making them efficiently available over the network and enabling easier data distribution for parallel DL training. We present our open-source, Apache Cassandra-based data loader and illustrate its use and performance, which enable easy and efficient data management and decentralized data distribution for parallel learning applications.