Report Open Access

Apache Spark on Hadoop YARN & Kubernetes for Scalable Physics Analysis

Dimakopoulos, Vasileios

The popularity of Big Data technologies continues to grow each year. The vast amount of data produced by the LHC experiments, which will grow further after the upgrade to the HL-LHC, makes exploring new ways to perform physics data analysis an important challenge. As part of the openlab project on Big Data Analytics, in collaboration with the CMS Big Data Project and Intel, this project explores the use of Apache Spark for data analysis and reduction in High Energy Physics, focusing on Spark's scalability. We scaled the data input between 20 TB and 110 TB and the available resources between 150 and 4,000 virtual cores, and monitored Spark's performance. Our ultimate goal is to reduce 1 PB of data in less than 5 hours. The datasets used in these scalability tests are stored in ROOT format in the EOS storage service at CERN. We used two different resource managers for Apache Spark: Hadoop/YARN and Kubernetes. Furthermore, we leveraged three important libraries: the Hadoop-XRootD connector to fetch the files from EOS, spark-root to import the ROOT files into Spark DataFrames, and sparkMeasure to monitor the performance of our workload in terms of execution time. Detailed results and conclusions are presented in this report.
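The workflow described above can be sketched in PySpark: read ROOT files from EOS over XRootD via the spark-root data source, apply a reduction, and time the job with sparkMeasure. This is a minimal illustration, not the report's actual code; the input/output paths and the `nMuon` selection column are hypothetical, and running it requires a Spark deployment with the spark-root and Hadoop-XRootD connector jars plus the `sparkmeasure` package installed.

```python
def eos_url(path):
    """Build an XRootD URL for a file on CERN EOS (illustrative helper;
    eospublic.cern.ch is an assumed endpoint)."""
    return "root://eospublic.cern.ch/" + path

def reduce_dataset(input_path, output_path):
    """Sketch of the read -> filter -> write reduction, timed with sparkMeasure.

    Assumes a cluster (YARN or Kubernetes) with the spark-root and
    Hadoop-XRootD connector jars on the classpath; column names are
    hypothetical placeholders.
    """
    from pyspark.sql import SparkSession
    from sparkmeasure import StageMetrics

    spark = SparkSession.builder.appName("root-reduction-sketch").getOrCreate()

    metrics = StageMetrics(spark)
    metrics.begin()

    # spark-root registers this data source for reading ROOT files
    # into Spark DataFrames.
    df = spark.read.format("org.dianahep.sparkroot").load(eos_url(input_path))

    # Hypothetical event selection standing in for the reduction step.
    reduced = df.filter(df["nMuon"] >= 2)
    reduced.write.mode("overwrite").parquet(output_path)

    metrics.end()
    metrics.print_report()  # execution time and per-stage metrics
```

The same script can be submitted to either resource manager; only the `--master` setting of `spark-submit` (e.g. `yarn` vs. a `k8s://...` API-server URL) changes.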

Files (2.1 MB)
Report_Vasilios Dimakopoulos.pdf (2.1 MB)
