Project deliverable Open Access
Franco Maria Nardini; Raffaele Perego; Vinicius Monteiro de Lira; Ida Mele; Cristina Muntean; Nicola Tonellotto
This document accompanies deliverable D4.1, Methods and Tools for Scalable Distributed Processing. It describes the main mechanisms and tools used in the BigDataGrapes (BDG) platform to support efficient processing of large datasets in the context of grapevine-related assets. The BDG software stack provides efficient, fault-tolerant tools for distributed processing, with the aim of making applications scalable and reliable.
The document first gives an overview of the architecture of the BDG platform and of the main technologies currently used in its Persistence and Processing Layers to process extremely large datasets efficiently.
Second, the requirements for running the BigDataGrapes platform are introduced and discussed, together with instructions for setting up and launching the platform. The platform has been built by reusing and customizing the software stack of Big Data Europe (BDE, https://www.big-data-europe.eu/). Besides customizing some existing components, the BigDataGrapes software stack extends BDE to better support efficient processing and distributed predictive analytics of geospatial raster data in the context of precision agriculture and Farm Management Systems. Furthermore, all platform components have been designed and built as Docker containers. They therefore include everything needed to deploy the BDG platform with consistent behavior on any suitable system that can run a Docker engine.
Third, to give the reader practical examples of how the current release of the BDG platform is used, we report on two demos that the project partners have already developed on top of it. Specifically, the two demonstrators perform scalable operations on geospatial raster data using the Spark-based GeoTrellis geographic data processing engine provided by the BDG platform. The first demo concerns the tiling of large raster satellite images. Tiling is a mandatory step that splits large raster datasets into manageable pieces that can be processed on parallel and distributed resources. In the second demonstrator, the previously computed tiles are processed to extract two relevant vegetation indexes from each tile image. The first index is the Normalized Difference Vegetation Index (NDVI), an indicator of the degree to which the observed target contains live green vegetation. The second index is a version of the Normalized Difference Water Index (NDWI) that is most appropriate for mapping vegetation water content.
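The two demonstrator steps can be illustrated with a minimal single-machine sketch in NumPy. It is not the GeoTrellis/Spark implementation used by the platform. The band order (red, near-infrared, short-wave infrared), the tile size, and all function names are assumptions made for illustration, and the NDWI variant is assumed to be the NIR/SWIR formulation commonly used for vegetation water content.

```python
import numpy as np

# Hypothetical band layout of the input raster: (bands, height, width).
RED, NIR, SWIR = 0, 1, 2


def split_into_tiles(raster, tile_size):
    """Yield ((tile_row, tile_col), tile) pairs that cover the whole raster.

    Tiles on the right and bottom edges may be smaller than tile_size.
    """
    _, height, width = raster.shape
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            key = (top // tile_size, left // tile_size)
            yield key, raster[:, top:top + tile_size, left:left + tile_size]


def normalized_difference(a, b):
    """Return (a - b) / (a + b) per pixel, with NaN where a + b is zero."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    total = a + b
    result = np.full(total.shape, np.nan)
    np.divide(a - b, total, out=result, where=total != 0)
    return result


def ndvi(tile):
    """NDVI = (NIR - Red) / (NIR + Red)."""
    return normalized_difference(tile[NIR], tile[RED])


def ndwi(tile):
    """NDWI, assumed here to be (NIR - SWIR) / (NIR + SWIR)."""
    return normalized_difference(tile[NIR], tile[SWIR])


# Synthetic raster standing in for a satellite image.
raster = np.random.default_rng(0).integers(0, 10_000, size=(3, 500, 700))
tiles = dict(split_into_tiles(raster, tile_size=256))
indexes = {key: (ndvi(tile), ndwi(tile)) for key, tile in tiles.items()}
```

Because each tile is processed independently of the others, the final dictionary comprehension is the part that a distributed engine can spread across worker nodes.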
Fourth, we present the BDG Cluster Performance Tool (CPT), an application developed to monitor the status of the cluster, i.e., the Spark nodes dedicated to processing satellite images. On the front end, the tool is a Web application that runs a dashboard built on Kibana. On the back end, the tool uses: i) an API that collects performance metrics from the nodes composing the cluster and ii) an Elasticsearch database that stores the performance statistics of the nodes.
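To make the back-end data flow concrete, the sketch below shows one way that collected node statistics could be serialized for Elasticsearch's bulk indexing endpoint, which accepts newline-delimited JSON made of an action line followed by a document line. The index name, the field names, and both helper functions are hypothetical, since the abstract does not describe the schema used by the CPT. The code only builds the request body and does not contact a cluster.

```python
import json
from datetime import datetime, timezone


def node_metrics_document(node, cpu_percent, memory_percent, timestamp=None):
    """Build one metrics document for a node (field names are hypothetical)."""
    ts = timestamp or datetime.now(timezone.utc)
    return {
        "@timestamp": ts.isoformat(),
        "node": node,
        "cpu_percent": cpu_percent,
        "memory_percent": memory_percent,
    }


def bulk_payload(index, documents):
    """Serialize documents as a newline-delimited JSON body for a bulk request.

    Each document is preceded by an "index" action line, and the body ends
    with a newline, as the bulk format requires.
    """
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


# Hypothetical readings from two Spark worker nodes.
sampled_at = datetime(2020, 1, 1, tzinfo=timezone.utc)
docs = [
    node_metrics_document("spark-worker-1", 72.5, 61.0, sampled_at),
    node_metrics_document("spark-worker-2", 35.0, 48.2, sampled_at),
]
payload = bulk_payload("bdg-cpt-metrics", docs)
```

A dashboard such as Kibana can then chart these documents over time by using the timestamp field as the time axis.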
Finally, we report the lessons learned from using the BDG data processing tools in a real-world satellite processing architecture. We describe how GEOCLEDIAN moved from a monolithic architecture to a modular one, following an analysis aimed at identifying bottlenecks and congestion. The new modular architecture allows GEOCLEDIAN to process satellite images three times faster. Moreover, the new architecture will be used to deploy the BDG distributed data processing components presented in this deliverable for fast tiling and index computation on satellite images.