Report Open Access
Javier García Rubio
Apache Parquet is a columnar storage format for the Hadoop ecosystem. The technology has become
almost a de facto standard thanks to important advantages such as its seamless integration
with a wide choice of data processing frameworks, data models and programming languages.
Another important aspect of this technology is its capacity to drastically reduce the amount of storage
required to persist data. That reduction is achieved through the columnar format itself and also through the
compression codecs that are supported and applied to the data once it is transformed into Parquet format.
This project studies the performance implications of the different available codecs, both for
data compression and for data access.
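
As a rough illustration of the kind of comparison involved, the sketch below measures compressed size and compression/decompression time for a few general-purpose codecs. It does not use Parquet or the report's own methodology: it relies only on the Python standard library (zlib, which implements the DEFLATE algorithm also used by gzip, plus bz2 and lzma), and the synthetic table, the two text layouts and all function names (make_columns, row_layout, column_layout, benchmark) are assumptions made up for the example.

```python
import bz2
import lzma
import time
import zlib


def make_columns(n_rows=10_000):
    """Build a small synthetic table as a dict of column name -> list of strings."""
    countries = ["ES", "FR", "DE", "IT"]
    return {
        "id": [str(i) for i in range(n_rows)],
        "country": [countries[i % len(countries)] for i in range(n_rows)],
        "status": ["active" if i % 10 else "inactive" for i in range(n_rows)],
    }


def row_layout(columns):
    """Serialize record by record: one line per row, values separated by commas."""
    rows = zip(*columns.values())
    return "\n".join(",".join(row) for row in rows).encode("utf-8")


def column_layout(columns):
    """Serialize column by column: one line per column, values separated by commas."""
    return "\n".join(",".join(values) for values in columns.values()).encode("utf-8")


# Standard-library codecs only; these stand in for the codecs a Parquet writer offers.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(payload):
    """Return size and timing figures for each codec applied to the given bytes."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(payload)
        compress_seconds = time.perf_counter() - start

        start = time.perf_counter()
        restored = decompress(packed)
        decompress_seconds = time.perf_counter() - start

        if restored != payload:
            raise ValueError(f"{name} round trip did not restore the payload")

        results[name] = {
            "raw_bytes": len(payload),
            "compressed_bytes": len(packed),
            "compress_seconds": compress_seconds,
            "decompress_seconds": decompress_seconds,
        }
    return results


if __name__ == "__main__":
    table = make_columns()
    for label, payload in (("row", row_layout(table)), ("column", column_layout(table))):
        for codec, stats in benchmark(payload).items():
            print(label, codec, stats)
```

Both layouts contain exactly the same values and the same number of separator bytes, so any difference in compressed size comes from how the values are ordered. Timings from a toy run like this are noisy and depend on the machine, so they should be read as indicative only.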