Published May 21, 2024 | Version v1
Journal article Open

DAFT AND IBIS: THE EMERGING PYTHON-NATIVE DISTRIBUTED DATAFRAME ECOSYSTEM - EVALUATING LAZY EVALUATION, QUERY PUSHDOWN, AND MULTI-ENGINE EXECUTION FOR CLOUD-SCALE DATA ENGINEERING

Description

The Python data engineering ecosystem is undergoing a fundamental architectural transition. The era of monolithic,
eager-execution DataFrames - where pandas loads the full dataset into memory before any computation occurs - is
giving way to a new generation of lazy, declarative, multi-engine frameworks designed for cloud-scale workloads.
Daft and Ibis represent the two most architecturally significant entrants in this space as of early 2024: Daft as a
distributed, Ray-native DataFrame library with deep Apache Arrow integration and native support for multimodal
data types, and Ibis as a portable SQL expression compiler that decouples the Python DataFrame API from any single
execution engine. This paper delivers a comprehensive technical evaluation of both frameworks across six dimensions:
lazy evaluation semantics and optimization opportunities, query pushdown mechanisms and their quantified impact
on data scan reduction, multi-engine execution breadth and portability, developer ergonomics and API expressiveness,
performance benchmarks across representative workload classes, and production readiness criteria for cloud-scale
deployments. We demonstrate that Daft's distributed execution model achieves 3.1x the throughput of PySpark for
Parquet-intensive workloads on 8 nodes while keeping memory overhead below 2x that of single-node pandas, and
that Ibis's query pushdown reduces the number of rows scanned by 80-99% for filtered queries against partitioned
columnar stores. Together, these frameworks represent a coherent vision for a Python-native data engineering stack
that removes the need to migrate to JVM-based tools once data volumes exceed single-machine capacity.
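
The two mechanisms named in the description, lazy evaluation and filter pushdown against a partitioned store, can be illustrated with a minimal standard-library sketch. It does not use the Daft or Ibis APIs: the `LazyScan` class, the `PARTITIONS` store, and the ten-day layout are hypothetical names and data invented for illustration, and the scan reduction it prints is a property of this toy setup, not a reproduction of the paper's measurements.

```python
from dataclasses import dataclass

# Hypothetical partitioned store: partition key (a date string) -> rows.
# Ten partitions of 100 rows each, purely illustrative.
PARTITIONS = {
    f"2024-01-{day:02d}": [
        {"day": f"2024-01-{day:02d}", "value": i} for i in range(100)
    ]
    for day in range(1, 11)
}


@dataclass(frozen=True)
class LazyScan:
    """Toy lazy frame: filters are recorded, not executed, until collect()."""

    partition_filters: tuple = ()  # predicates over the partition key
    row_filters: tuple = ()        # predicates over individual rows

    def where_partition(self, pred):
        return LazyScan(self.partition_filters + (pred,), self.row_filters)

    def where_row(self, pred):
        return LazyScan(self.partition_filters, self.row_filters + (pred,))

    def collect(self, pushdown=True):
        """Run the recorded plan and return (result_rows, rows_scanned)."""
        scanned = 0
        result = []
        for key, rows in PARTITIONS.items():
            # With pushdown, a partition whose key fails the filter is
            # skipped entirely, so none of its rows are read.
            if pushdown and not all(p(key) for p in self.partition_filters):
                continue
            for row in rows:
                scanned += 1
                keep = all(p(row["day"]) for p in self.partition_filters)
                if keep and all(p(row) for p in self.row_filters):
                    result.append(row)
        return result, scanned


# Building the plan reads no data; work happens only in collect().
plan = (
    LazyScan()
    .where_partition(lambda day: day == "2024-01-03")
    .where_row(lambda row: row["value"] >= 90)
)

rows_full, scanned_full = plan.collect(pushdown=False)
rows_push, scanned_push = plan.collect(pushdown=True)
reduction = 1 - scanned_push / scanned_full

print(f"rows scanned without pushdown: {scanned_full}")
print(f"rows scanned with pushdown:    {scanned_push}")
print(f"scan reduction in this toy setup: {reduction:.0%}")
```

Both calls return the same rows; only the amount of data read differs. The size of the reduction depends on how selective the filter is relative to the partition layout, which is one plausible reason a reported figure would be a range rather than a single number.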

Files

DAFT-AND-MAY2024-109.pdf

Files (503.3 kB)

md5:809d5934b56bb660f3c5aaa21f6f736f

Additional details