Report Open Access
Ganju, Siddha; Kuznetsov, Valentin; Wildish, Tony; Martin Marquez, Manuel; Romero Marin, Antonio
The goal of this openlab summer student project is to evaluate Apache Spark as a framework for big data analytics at CERN. The project uses MLlib and Spark Streaming along with Python and its libraries such as scikit-learn and SciPy. The objective was to determine whether the metadata of CMS can be analysed efficiently, in terms of CPU usage and job completion time, using Apache Spark. Apache Spark is a framework providing fast, parallel processing of distributed data in real time; it also provides powerful cache and persistence capabilities. These features make Apache Spark well suited to data analytics problems, and its components Spark Streaming and MLlib (Spark's native machine learning library) make the analysis possible.
Apache Spark Release 1.4.0
scikit-learn package:
 i. ensemble
 ii. cross_validation
 iii. neighbors
Pandas package:
 i. The Pandas acronym comes from a combination of panel data and Python data analysis. It targets five typical steps in the processing and analysis of data, regardless of the data origin: load, prepare, manipulate, model, and analyze.
 ii. Pandas places much emphasis on flexibility, for example in handling disparate cell separators. Moreover, it reads directly from the cache or loads Python objects serialized in files by the Python pickle module.
NumPy package: manipulating arrays
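The Pandas flexibility described above can be sketched in a few lines; the column names and sample values below are invented for this illustration:

```python
import io
import pandas as pd

# pandas handles disparate cell separators via the sep parameter
csv_text = "dataset;naccesses;nusers\n/A/B/C;120;7\n/X/Y/Z;3;1\n"
df = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df.shape)  # (2, 3)

# Python objects serialized with the pickle module can be reloaded directly
df.to_pickle("accesses.pkl")
cached = pd.read_pickle("accesses.pkl")
print(cached.equals(df))  # True
```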
I present an evaluation of Apache Spark for streamlining predictive models that use information from CMS data services. The models are used to predict which datasets will become popular over time. This will help to replicate the datasets that are most heavily accessed, which will improve the efficiency of physics analysis in CMS. The evaluation covers an implementation on the Apache Spark framework to assess the quality of individual models, build ensembles, and choose the best predictive model(s) for new data.
The task in this project is to predict popular datasets. Finding the popular datasets is helpful in two ways: it provides expeditious access to datasets that are likely to be required, and it helps identify what may become the 'hot topics' in high energy physics. It is also necessary to define what a popular dataset is. Based on the data collected, parameters such as nusers, naccesses and tot_cpu can be used to define popularity, because popularity depends most strongly on these parameters. To find the numerical threshold beyond which a dataset is termed popular, a graph is plotted on a log scale, so that the wide range of values fits within a single plot.
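A minimal sketch of the labelling step, using synthetic access records and an illustrative cut-off of 100 accesses in place of the threshold actually read off the log-scale plot:

```python
import numpy as np
import pandas as pd

# Synthetic weekly access records; the column names (naccesses, nusers,
# tot_cpu) follow the popularity parameters named in the text, but the
# values are random stand-ins drawn from heavy-tailed distributions.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "naccesses": rng.lognormal(mean=3.0, sigma=2.0, size=1000).astype(int),
    "nusers": rng.lognormal(mean=1.0, sigma=1.0, size=1000).astype(int),
    "tot_cpu": rng.lognormal(mean=5.0, sigma=2.0, size=1000),
})

# Illustrative threshold: a dataset is labelled popular when naccesses
# exceeds the cut-off, turning the task into a binary classification problem.
threshold = 100
data["popular"] = (data["naccesses"] > threshold).astype(int)
print(data["popular"].value_counts())
```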
After calculating the threshold values, the task is transformed into a binary classification problem and a rolling forecast is performed, predicting a binary popularity value for each week. Each week's data is added to the existing data and a new model is trained, following the notion that more data leads to better analysis. Prediction can be done with several machine learning algorithms, mainly Naive Bayes, Stochastic Gradient Descent and Random Forest. Their models are then combined into an ensemble to check which algorithm offers the best true positive, true negative, false positive and false negative rates.
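The weekly rolling forecast can be sketched with scikit-learn. The feature matrices below are random stand-ins for the real CMS metadata, and VotingClassifier is one possible way to form the ensemble of the three algorithms named above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

# Hypothetical weekly data: one (features, popularity label) pair per week.
weeks = [(rng.normal(size=(200, 4)), rng.integers(0, 2, size=200))
         for _ in range(5)]

# Ensemble of the three classifiers mentioned in the text (hard voting).
clf = VotingClassifier([
    ("nb", GaussianNB()),
    ("sgd", SGDClassifier()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Rolling forecast: train on all weeks seen so far, predict the next week,
# then fold the new week into the training set.
X_seen, y_seen = weeks[0]
for X_next, y_next in weeks[1:]:
    clf.fit(X_seen, y_seen)
    pred = clf.predict(X_next)
    tn, fp, fn, tp = confusion_matrix(y_next, pred).ravel()
    print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
    X_seen = np.vstack([X_seen, X_next])
    y_seen = np.concatenate([y_seen, y_next])
```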
This project also includes plotting the results against the time scale to see how the accuracy scores change from week to week. The scores plotted include sensitivity (recall), specificity, precision, and fallout rate.
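For reference, these scores follow directly from the confusion counts of each weekly model; the counts in the example call are arbitrary:

```python
# Score definitions used for the weekly plots, computed from the
# confusion counts (tp, tn, fp, fn).
def scores(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)  # recall / true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    fallout = fp / (fp + tn)      # false positive rate
    return sensitivity, specificity, precision, fallout

# e.g. sensitivity ~0.667, specificity 0.75, precision 0.8, fallout 0.25
print(scores(tp=40, tn=30, fp=10, fn=20))
```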