10.5281/zenodo.3550756
https://zenodo.org/records/3550756
oai:zenodo.org:3550756
Javier García Rubio
PERFORMANCE STUDY OF PARQUET CODECs
Zenodo
2019
CERN openlab
summer-student programme
2019-11-22
10.5281/zenodo.3550755
https://zenodo.org/communities/cernopenlab
Creative Commons Attribution 4.0 International
Apache Parquet is a columnar storage format for the Hadoop ecosystem. The technology has become
almost a de facto standard thanks to important advantages such as its seamless integration
with a wide choice of data processing frameworks, data models and programming languages.
Another important aspect of this technology is its capacity to drastically reduce the amount of storage
required to persist data. That reduction is achieved thanks to the columnar format itself and also to the
compression codecs supported and applied to the data once it is transformed into Parquet format.
This project studies the performance implications of the different available codecs for data
compression and data access.
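
As a rough illustration of the two effects described above (data layout and codec choice), the following minimal sketch compares the compressed size of the same synthetic records serialized row by row and column by column. It is not the method used in the study: it does not read or write Parquet files, the data and the helper names (make_rows, row_layout, column_layout, compressed_sizes) are hypothetical, and the Python standard-library codecs zlib, bz2 and lzma merely stand in for the codecs a Parquet writer may offer.

```python
import bz2
import json
import lzma
import random
import zlib


def make_rows(n, seed=0):
    """Build n synthetic records (hypothetical data, for illustration only)."""
    rng = random.Random(seed)
    return [
        {
            "id": i,
            "country": rng.choice(["CH", "FR", "ES", "IT"]),
            "value": rng.randint(0, 99),
        }
        for i in range(n)
    ]


def row_layout(rows):
    """Serialize record by record: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in rows).encode("utf-8")


def column_layout(rows):
    """Serialize column by column: all values of one field stored together."""
    columns = {key: [r[key] for r in rows] for key in rows[0]}
    return json.dumps(columns).encode("utf-8")


# Standard-library codecs used as stand-ins (an assumption of this sketch).
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def compressed_sizes(payload):
    """Return the compressed size in bytes of payload for each codec."""
    return {name: len(compress(payload)) for name, compress in CODECS.items()}


rows = make_rows(10_000)
for label, payload in (("row", row_layout(rows)), ("column", column_layout(rows))):
    # Print the raw size followed by the per-codec compressed sizes.
    print(label, len(payload), compressed_sizes(payload))
```

The printed sizes depend on the synthetic data and on the codec defaults, so they should be read only as a way to explore the trade-off, not as a benchmark of Parquet itself.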