Published December 2025 | Version v1
Journal article Open

Optimizing Big Data Pipelines for Scale

Description

Big data pipelines now combine batch analytics, stream ingestion, machine-learning feature preparation, model-serving telemetry, and governed enterprise data movement. Scale is no longer only a question of adding workers: pipeline operators must select execution engines, partitioning strategies, storage layouts, accelerator routes, and recovery policies while preserving freshness, cost discipline, and compliance. This paper presents Scale-Aware Pipeline Optimization (SAPO), a practical control framework that decomposes a pipeline into typed stages, estimates each stage's bottleneck from telemetry and metadata, and applies bounded optimization actions for placement, partitioning, storage layout, and retry scope. SAPO is designed to operate with established distributed data systems rather than replacing them. A prototype study over batch, streaming, and ML-preparation workloads shows a 31.6% reduction in median completion time, a 27.9% improvement in sustained throughput, and a 22.4% reduction in unit processing cost compared with static configuration, while keeping freshness violations and rollback events within operator-specified limits. The results indicate that scalable pipeline optimization requires coordinated decisions across compute, storage, data quality, and governance boundaries.
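The control loop described above (typed stages, a bottleneck estimate from telemetry, and a bounded action) can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the abstract does not specify SAPO's signals, thresholds, or action vocabulary, so every name here (`StageType`, `Telemetry`, `estimate_bottleneck`, `propose_action`, the threshold values, and the 50% partition-change bound) is a hypothetical assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StageType(Enum):
    """Hypothetical stage types, mirroring the workload classes in the abstract."""
    BATCH = "batch"
    STREAM = "stream"
    ML_PREP = "ml_prep"


@dataclass(frozen=True)
class Telemetry:
    """Assumed per-stage signals; the real framework may use different ones."""
    cpu_util: float        # fraction of CPU in use, 0..1
    io_wait: float         # fraction of time spent waiting on storage, 0..1
    partition_skew: float  # largest partition size / mean partition size
    retry_rate: float      # fraction of tasks that were retried, 0..1


@dataclass(frozen=True)
class Stage:
    name: str
    kind: StageType
    partitions: int
    telemetry: Telemetry


@dataclass(frozen=True)
class Action:
    kind: str
    target_partitions: Optional[int] = None


def estimate_bottleneck(t: Telemetry) -> str:
    """Classify the dominant bottleneck using illustrative, made-up thresholds."""
    if t.partition_skew > 2.0:
        return "skew"
    if t.io_wait > 0.4:
        return "storage"
    if t.cpu_util > 0.85:
        return "compute"
    if t.retry_rate > 0.05:
        return "retries"
    return "none"


def propose_action(stage: Stage, max_partition_change: float = 0.5) -> Action:
    """Map a bottleneck to one bounded action.

    Repartitioning is capped at +max_partition_change of the current
    partition count, so a single control step cannot make a drastic change.
    """
    bottleneck = estimate_bottleneck(stage.telemetry)
    if bottleneck == "skew":
        wanted = int(stage.partitions * stage.telemetry.partition_skew)
        ceiling = int(stage.partitions * (1 + max_partition_change))
        return Action("repartition", target_partitions=min(wanted, ceiling))
    if bottleneck == "storage":
        return Action("compact_layout")
    if bottleneck == "compute":
        return Action("scale_out")
    if bottleneck == "retries":
        return Action("narrow_retry_scope")
    return Action("no_op")


if __name__ == "__main__":
    skewed = Stage("join_orders", StageType.BATCH, 100,
                   Telemetry(cpu_util=0.5, io_wait=0.1,
                             partition_skew=3.0, retry_rate=0.0))
    print(propose_action(skewed))
```

Bounding each action is the design choice worth noting: a capped step keeps any single decision small and reversible, which is one plausible way to keep rollback events within operator-specified limits.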

Files

paper.pdf (124.0 kB)
md5:d58b37652c4f4500d83110fb42e4aa0c