Optimizing Big Data Pipelines for Scale

Kumar, Bikesh; Kodali, Ravi Kiran; Saha, Sumit

doi:10.5281/zenodo.20503431

Published December 2025 | Version v1

Journal article | Open access

Big data pipelines now combine batch analytics, stream ingestion, machine-learning feature preparation, model-serving telemetry, and governed enterprise data movement. Scale is no longer only a question of adding workers: pipeline operators must select execution engines, partitioning strategies, storage layouts, accelerator routes, and recovery policies while preserving freshness, cost discipline, and compliance. This paper presents Scale-Aware Pipeline Optimization (SAPO), a practical control framework that decomposes a pipeline into typed stages, estimates each stage's bottleneck from telemetry and metadata, and applies bounded optimization actions for placement, partitioning, storage layout, and retry scope. SAPO is designed to operate with established distributed data systems rather than replacing them. A prototype study over batch, streaming, and ML-preparation workloads shows a 31.6% reduction in median completion time, a 27.9% improvement in sustained throughput, and a 22.4% reduction in unit processing cost compared with static configuration, while keeping freshness violations and rollback events within operator-specified limits. The results indicate that scalable pipeline optimization requires coordinated decisions across compute, storage, data quality, and governance boundaries.
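The abstract describes SAPO only at a high level, so the following is a minimal sketch of what the "typed stages, bottleneck estimate, bounded action" loop might look like, not the authors' implementation. Every name here (`StageType`, `StageTelemetry`, `estimate_bottleneck`, `propose_partitions`) and every threshold is a hypothetical choice made for illustration; none of them comes from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class StageType(Enum):
    # Stage types loosely mirror the workload classes named in the abstract.
    BATCH = "batch"
    STREAM = "stream"
    ML_PREP = "ml_prep"


@dataclass(frozen=True)
class StageTelemetry:
    cpu_util: float      # fraction of time CPU is busy, 0..1
    io_wait: float       # fraction of time blocked on storage, 0..1
    shuffle_skew: float  # largest partition size / mean partition size
    retry_rate: float    # fraction of tasks that were retried, 0..1


@dataclass(frozen=True)
class Stage:
    name: str
    kind: StageType
    partitions: int
    telemetry: StageTelemetry


def estimate_bottleneck(t: StageTelemetry) -> str:
    """Return the dominant bottleneck label. Thresholds are assumed values."""
    if t.shuffle_skew > 2.0:
        return "partitioning"
    if t.io_wait > 0.5:
        return "storage_layout"
    if t.cpu_util > 0.85:
        return "placement"
    if t.retry_rate > 0.05:
        return "retry_scope"
    return "none"


def propose_partitions(stage: Stage, max_change: float = 0.5, cap: int = 4096) -> int:
    """Bounded action: grow the partition count in proportion to skew,
    but never by more than max_change per step and never above cap."""
    if estimate_bottleneck(stage.telemetry) != "partitioning":
        return stage.partitions
    desired = round(stage.partitions * stage.telemetry.shuffle_skew)
    upper = int(stage.partitions * (1 + max_change))
    return max(1, min(desired, upper, cap))


skewed = Stage("join_orders", StageType.BATCH, 200,
               StageTelemetry(cpu_util=0.6, io_wait=0.2, shuffle_skew=3.0, retry_rate=0.01))
io_bound = Stage("scan_events", StageType.STREAM, 64,
                 StageTelemetry(cpu_util=0.3, io_wait=0.7, shuffle_skew=1.2, retry_rate=0.0))

print(estimate_bottleneck(skewed.telemetry), propose_partitions(skewed))
print(estimate_bottleneck(io_bound.telemetry), propose_partitions(io_bound))
```

The point of the bound (`max_change`, `cap`) is that a single noisy telemetry sample cannot trigger an arbitrarily large reconfiguration, which is one plausible reading of "bounded optimization actions"; actions for placement, storage layout, and retry scope would need their own bounded proposals.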

Files

paper.pdf (124.0 kB), md5:d58b37652c4f4500d83110fb42e4aa0c
