Report Open Access
The popularity of Big Data technologies continues to grow each year. The vast amount of data produced by the LHC experiments, which will increase further after the upgrade to the HL-LHC, makes the exploration of new ways to perform physics data analysis an important challenge. As part of the CERN openlab project on Big Data Analytics, in collaboration with the CMS Big Data Project and Intel, this project explores the use of Apache Spark for data analysis and reduction in High Energy Physics. We focused on the scalability of Apache Spark: we scaled the input data between 20 TB and 110 TB and the available resources between 150 and 4000 virtual cores, and monitored Spark's performance. Our ultimate goal is to reduce 1 PB of data in less than 5 hours. The datasets used in these scalability tests are stored in ROOT format in the EOS storage service at CERN. We ran Apache Spark with two different resource managers: Hadoop/YARN and Kubernetes. Furthermore, we leveraged three key libraries: the Hadoop-XRootD connector to fetch the files from EOS, spark-root to read the ROOT files into Spark DataFrames, and sparkMeasure to monitor the execution time of our workloads. The detailed results and conclusions are presented in this report.
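
The workflow described above can be sketched as follows. This is a minimal illustration, not code from the report: the EOS path, dataset name, and branch names are placeholders, and it assumes the spark-root, Hadoop-XRootD connector, and sparkMeasure packages are on the Spark classpath.

```scala
import org.apache.spark.sql.SparkSession
import ch.cern.sparkmeasure.StageMetrics

object ReductionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HEP data reduction")
      .getOrCreate()

    // spark-root provides a DataSource that reads ROOT TTrees into
    // DataFrames; the root:// URL is resolved through the
    // Hadoop-XRootD connector, which fetches the files from EOS.
    val events = spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eospublic.cern.ch//eos/path/to/dataset.root") // placeholder path

    // sparkMeasure collects stage-level metrics (execution time,
    // I/O, shuffle) for the enclosed action.
    val stageMetrics = StageMetrics(spark)
    stageMetrics.runAndMeasure {
      // Example reduction step: keep a subset of branches and apply
      // an event selection, then write the reduced data out.
      // "Muon_pt" and "Muon_eta" are hypothetical branch names.
      events.select("Muon_pt", "Muon_eta")
        .filter("Muon_pt > 20")
        .write.parquet("reduced_dataset.parquet")
    }
  }
}
```

Measuring around the write action captures the end-to-end reduction time, which is the quantity scaled against input size and core count in the tests described above.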