Published November 27, 2024 | Version v1
Conference paper Open

Pravega: A Tiered Storage System for Data Streams

  • 1. Dell Technologies

Description

The growing popularity of the data stream abstraction brings challenging new requirements for data ingestion and storage. Many organizations expect to retain data streams for extended periods of time and to store such stream data cost-effectively. It is also crucial to reconcile seemingly conflicting properties, such as data durability and consistency on the one hand and high performance on the other. Furthermore, data streams should not only support a high degree of parallelism, but also adapt to fluctuating workloads with little or no administrator intervention. To our knowledge, no storage system for data streams fully meets all of these requirements.

In this paper, we present Pravega: a distributed, tiered storage system for data streams. Pravega streams are unbounded by design and cost-effective, as the system automatically moves data to a long-term storage tier (e.g., S3, NFS) and transparently manages it for the user. Pravega guarantees no duplicate or missing events, as well as per-routing-key event ordering, while providing high-performance streaming I/O and historical reads. As a unique feature, Pravega streams are elastic: they can automatically change their degree of parallelism based on the ingestion workload. We compared the performance of Pravega with that of Apache Kafka and Apache Pulsar on AWS. Our results show that Pravega can deliver performance improvements over both in many scenarios.

Files

pravega-camera-ready-final.pdf

Files (949.8 kB)

md5:ff5d456f9bacdce3f0fe6157d21e3f86

Additional details

Funding

European Commission
CloudSkin - Adaptive virtualization for AI-enabled Cloud-edge Continuum 101092646
European Commission
NEARDATA - Extreme Near-Data Processing Platform 101092644

Software