Published November 30, 2020 | Version v1
Journal article Open

A Benchmark for Suitability of Alluxio over Spark

  • 1. Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, India.
  • 1. Publisher

Description

Big data applications play an important role in real-time data processing. Apache Spark is a data processing framework with an in-memory data engine that quickly processes large data sets. It can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. However, Spark's in-memory processing cannot share data between applications, and RAM is insufficient for storing petabytes of data. Alluxio is a virtual distributed storage system that leverages memory for data storage and provides faster access to data held in different storage systems. Alluxio helps to speed up data-intensive Spark applications across various storage systems. In this work, the performance of applications on Spark alone and on Spark running over Alluxio is studied with respect to several storage formats, such as Parquet, ORC, CSV, and JSON, and four types of queries from the Star Schema Benchmark (SSB). A benchmark is developed to indicate the suitability of the Spark and Alluxio combination for big data applications. It is found that Alluxio is suitable for applications that use databases larger than 2.6 GB storing data in JSON and CSV formats. Spark is found suitable for applications that use storage formats such as Parquet and ORC with database sizes below 2.6 GB.
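
The suitability finding stated above can be read as a simple decision rule. The following is a minimal sketch of that rule only, not the benchmark itself: the function name `suggest_engine`, the return labels, and the format groupings are hypothetical names introduced here for illustration, and only the 2.6 GB threshold and the four formats come from the description. Cases the description does not address (for example, Parquet above 2.6 GB, or a size of exactly 2.6 GB) are reported as not covered rather than guessed.

```python
# Threshold and formats are taken from the description; everything else
# (names, labels, structure) is an illustrative assumption.
THRESHOLD_GB = 2.6

TEXT_FORMATS = {"json", "csv"}        # formats reported to favour Alluxio
COLUMNAR_FORMATS = {"parquet", "orc"}  # formats reported to favour plain Spark


def suggest_engine(storage_format: str, size_gb: float) -> str:
    """Return a suggestion based only on the two findings in the description."""
    fmt = storage_format.strip().lower()
    if fmt in TEXT_FORMATS and size_gb > THRESHOLD_GB:
        return "spark+alluxio"
    if fmt in COLUMNAR_FORMATS and size_gb < THRESHOLD_GB:
        return "spark"
    # The description makes no claim about any other combination.
    return "not covered"


if __name__ == "__main__":
    for fmt, size in [("CSV", 10.0), ("parquet", 1.0), ("orc", 5.0)]:
        print(fmt, size, suggest_engine(fmt, size))
```

Returning an explicit "not covered" label keeps the sketch from extending the stated results beyond the combinations the description actually reports.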

Files

A81901110120.pdf

Files (446.2 kB)

Name: A81901110120.pdf
Size: 446.2 kB
md5:6b933f41b859941b4a2165b3cf8b6cdc

Additional details

Related works

Is cited by
Journal article: 2278-3075 (ISSN)

Subjects

ISSN
2278-3075
Retrieval Number
100.1/ijitee.A81901110120