Published December 11, 2023 | Version v1
Conference paper Open

Practical Storage-Compute Elasticity for Stream Data Processing

  • 1. Dell Technologies

Description

Stream processing pipelines need to handle workload fluctuations (e.g., daily patterns, popularity spikes) by scaling up or down the resources allocated to running jobs. While prior efforts have proposed auto-scaling mechanisms for stream processing engines, they have overlooked the role of the storage system in ingesting and serving stream data. The absence of effective scaling for data streams is problematic because the number of parallel partitions of a data stream limits both the ingestion throughput and the read parallelism available to downstream streaming jobs. In this paper, we propose to augment the auto-scaling notion of stream processing engines with information about the source data stream. The key novelty of our approach lies in exploiting elastic data streams to ingest data, a unique feature of Pravega, a storage system for data streams that is part of Dell's Streaming Data Platform. Pravega streams can dynamically change their parallelism based on the ingestion workload, and this information can in turn be exploited to auto-scale the downstream streaming job. To this end, we have developed an Apache Flink connector for Pravega, as well as an auto-scaling orchestrator that feeds on data stream metrics. Our experiments show how a stream processing pipeline auto-scales by coordinating data stream and processing parallelism under workload fluctuations, with low operational cost.
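
As a rough illustration of the coordination idea described above, the sketch below derives a target job parallelism from the current number of parallel partitions (segments) of the source stream. It is not the orchestrator from the paper: the names `ScalingDecision` and `decide_parallelism`, the one-task-per-segment policy, and the default bounds are all assumptions made for this example, and no Pravega or Flink API is called.

```python
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    """Hypothetical result type: the parallelism to use and whether to rescale."""
    target_parallelism: int
    rescale: bool


def decide_parallelism(segment_count: int,
                       current_parallelism: int,
                       min_parallelism: int = 1,
                       max_parallelism: int = 32) -> ScalingDecision:
    """Match job parallelism to the stream's segment count, clamped to bounds.

    Assumed policy (not taken from the paper): one reader task per segment.
    """
    if segment_count < 0:
        raise ValueError("segment_count must be non-negative")
    # Clamp the segment count into [min_parallelism, max_parallelism].
    target = max(min_parallelism, min(segment_count, max_parallelism))
    # Rescale only when the target differs from what the job currently runs with.
    return ScalingDecision(target, target != current_parallelism)
```

For instance, under these assumptions a stream that has scaled from 4 to 8 segments would yield a decision to rescale the job to parallelism 8, while an unchanged segment count yields no rescale.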

Files

pravega-industry-final.pdf (848.5 kB)
md5:f97691da8daef8bc02057ed53b279ed8

Additional details

Funding

European Commission
NEARDATA - Extreme Near-Data Processing Platform (grant 101092644)
European Commission
CloudSkin - Adaptive virtualization for AI-enabled Cloud-edge Continuum (grant 101092646)