DAFT AND IBIS: THE EMERGING PYTHON-NATIVE DISTRIBUTED DATAFRAME ECOSYSTEM
EVALUATING LAZY EVALUATION, QUERY PUSHDOWN, AND MULTI-ENGINE EXECUTION FOR CLOUD-SCALE DATA ENGINEERING
Authors/Creators
Description
The Python data engineering ecosystem is undergoing a fundamental architectural transition. The era of monolithic,
eager-execution DataFrames - where pandas loads the full dataset into memory before any computation occurs - is
giving way to a new generation of lazy, declarative, multi-engine frameworks designed for cloud-scale workloads.
Daft and Ibis represent the two most architecturally significant entrants in this space as of early 2024: Daft as a
distributed, Ray-native DataFrame library with deep Apache Arrow integration and native support for multimodal
data types, and Ibis as a portable SQL expression compiler that decouples the Python DataFrame API from any single
execution engine. This paper delivers a comprehensive technical evaluation of both frameworks across six dimensions:
lazy evaluation semantics and optimization opportunities, query pushdown mechanisms and their quantified impact
on data scan reduction, multi-engine execution breadth and portability, developer ergonomics and API expressiveness,
performance benchmarks across representative workload classes, and production readiness criteria for cloud-scale
deployments. We demonstrate that Daft's distributed execution model achieves 3.1x the throughput of PySpark for
Parquet-intensive workloads at 8 nodes while maintaining sub-2x memory overhead versus single-node pandas, and
that Ibis's query pushdown reduces row scan volume by 80-99% for filtered queries against partitioned columnar
stores. Together, these frameworks represent a coherent vision for a Python-native data engineering stack that
eliminates the forced migration to JVM-based tools as data volumes exceed single-machine capacity.
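To make the two central mechanisms concrete, the following is a minimal, library-free sketch of lazy evaluation combined with partition-level filter pushdown. It does not use the Daft or Ibis APIs: `PartitionedStore`, `LazyQuery`, `make_store`, and `scan_reduction` are hypothetical names introduced only for illustration, and the toy data (10 daily partitions of 100 rows each) is an assumption, not a workload from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Row = Dict[str, int]


@dataclass
class PartitionedStore:
    """Toy stand-in for a partitioned columnar store (partition key -> rows)."""
    partitions: Dict[int, List[Row]]
    rows_scanned: int = 0  # counter used to observe scan volume

    def scan(self, partition_key: Optional[int] = None) -> List[Row]:
        # With a key, only that partition is read; without one, every partition is read.
        keys = [partition_key] if partition_key is not None else list(self.partitions)
        out: List[Row] = []
        for key in keys:
            rows = self.partitions.get(key, [])
            self.rows_scanned += len(rows)
            out.extend(rows)
        return out


@dataclass(frozen=True)
class LazyQuery:
    """Records the requested filter; the store is not touched until collect()."""
    store: PartitionedStore
    partition_filter: Optional[int] = None

    def filter_partition(self, key: int) -> "LazyQuery":
        # Building the query only returns a new plan object (lazy evaluation).
        return LazyQuery(self.store, key)

    def collect(self, pushdown: bool = True) -> List[Row]:
        if pushdown:
            # Pushdown: hand the filter to the storage layer so it can skip partitions.
            return self.store.scan(self.partition_filter)
        # No pushdown: read everything, then filter in the client.
        rows = self.store.scan()
        if self.partition_filter is None:
            return rows
        return [r for r in rows if r["day"] == self.partition_filter]


def make_store(days: int = 10, rows_per_day: int = 100) -> PartitionedStore:
    """Build a toy store partitioned by an integer 'day' column."""
    return PartitionedStore(
        {d: [{"day": d, "value": d * 1000 + i} for i in range(rows_per_day)]
         for d in range(days)}
    )


def scan_reduction(days: int = 10, rows_per_day: int = 100, target_day: int = 3) -> float:
    """Fraction of row scans avoided by pushdown for a single-partition filter."""
    full = make_store(days, rows_per_day)
    LazyQuery(full).filter_partition(target_day).collect(pushdown=False)
    pruned = make_store(days, rows_per_day)
    LazyQuery(pruned).filter_partition(target_day).collect(pushdown=True)
    return 1 - pruned.rows_scanned / full.rows_scanned
```

In this toy setup the reduction depends only on how many partitions the filter touches: selecting one of `days` equally sized partitions avoids a fraction `1 - 1/days` of the row scans. Real engines decide what to skip from file statistics and partition metadata, so measured reductions depend on data layout and predicate selectivity.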
Files
DAFT-AND-MAY2024-109.pdf (503.3 kB)
md5:809d5934b56bb660f3c5aaa21f6f736f