Taming Big Maritime Data to Support Analytics

This article presents important challenges and progress toward the management of data regarding the maritime domain for supporting analysis tasks. The article introduces our objectives for big-data analysis tasks, thus motivating our efforts toward advanced data-management solutions for mobility data in the maritime domain. The article introduces data sources to support specific maritime situation-awareness scenarios that are addressed in the datAcron project [The datAcron project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 687591 (http://datacron-project.eu).], presents the overall infrastructure designed for managing and exploiting data for analysis tasks, and presents a representation framework for integrating data from different sources revolving around the notion of semantic trajectories: the datAcron ontology.

Challenges emerge as the number of moving entities and related operations increases at an unprecedented scale. This, in conjunction with the demand for increasingly frequent data from many different sources and for each of these entities, generates vast data volumes of a heterogeneous nature, at extremely high rates, whose intertwined exploitation calls for novel big-data techniques and algorithms that lead to advanced data analytics. This is a core research issue that we address in the datAcron project. More concretely, core research challenges in datAcron include the following:
• distributed management and querying of spatiotemporal RDF data-at-rest (archival) and data-in-motion (streaming) following an integrated approach;
• reconstruction and forecasting of moving entities' trajectories in the challenging maritime (2D space with temporal dimension) and aviation (3D space with temporal dimension) domains;
• recognition and forecasting of complex events due to the movement of entities (e.g., the prediction of potential collision, capacity demand, hot spots/paths); and
• interactive visual analytics for supporting human exploration and interpretation of the above-mentioned challenges.
Technological developments are validated and evaluated in user-defined challenges that aim at increasing the safety, efficiency, and economy of operations concerning moving entities in the aviation and maritime domains. The main benefit arising from improved trajectory prediction in the aviation use-case lies in the accurate prediction of complex events, or hot spots, leading to benefits to the overall efficiency of an air traffic-management (ATM) system. Similarly, discovering and characterizing the activities of vessels at sea are key tasks to Maritime Situational Awareness (MSA) indicators and constitute the basis for detecting/predicting vessel activities toward enhancing safety, detecting anomalous behaviors, and enabling an effective and quick response to maritime threats and risks.
In both domains, semantic trajectories are turned into "first-class citizens." In practice, this forms a paradigm shift toward operations that are built and revolve around the notion of trajectory. For instance, in the MSA world, trajectories are essential for tracking vessels' routes, detecting and analyzing anomalous behavior, and supporting critical decision-making. datAcron considers trajectories as first-class citizens and aims to build solutions toward managing data that are connected by way of, and contribute to, enriched views of trajectories. In doing so, datAcron revisits the notion of semantic trajectory and builds on it. Specifically, it is expected that meaningful moving patterns will be computed and exploited to recognize and predict the behavior and states of moving objects, taking advantage of the wealth of information available in disparate and heterogeneous data sources and integrated in a representation in which trajectories are the main entities.
The objective of this section is to review the challenges of and recent progress toward managing big data for supporting analysis tasks regarding moving objects at sea (e.g., for predicting vessels' trajectories and events and/or supporting visual-analytics tasks). Such data may be surveillance data but also data regarding vessels' characteristics, past events, areas of interest, patterns of movement, etc. These are data from disparate and heterogeneous sources that should be integrated, together with the automatic computation of indicators that contribute to supporting maritime experts' awareness of situations.
The article first presents the datAcron maritime-use case. It then presents the overall datAcron infrastructure for managing big mobility data, focusing on data-management issues. Finally, it presents the datAcron ontology for the representation of maritime data, which provides integrated views of data from disparate sources and focuses on the notion of semantic trajectory.

Taming Big Data in the Maritime-Use Case: Motivation and Challenges
The maritime environment has a huge impact on the global economy and our everyday lives. Specifically, systems for the surveillance of moving entities at sea have been attracting increasing attention due to their importance for the safety and efficiency of maritime operations. For instance, preventing ship accidents by monitoring vessel activity represents substantial savings in financial cost for shipping companies (e.g., oil-spill cleanup) and averts irrevocable damages to maritime ecosystems (e.g., fishery closure). The past few years have seen a rapid increase in the research and development of information-oriented infrastructures and systems addressing many aspects of data management and data analytics related to movement at sea (e.g., maritime navigation, marine life). In fact, the correlated exploitation of heterogeneous and large data sources offering voluminous historical and streaming data is considered an emergent necessity given (a) the wealth of existing data, (b) the opportunity to exploit such data toward building models of entities' movement patterns, and (c) the goal of understanding the occurrence of important maritime events. Indeed, reaching appropriate MSA for the decision-maker requires real-time processing of a high volume of information of differing nature, originating from a variety of sources (sensors and humans), that lacks veracity and arrives at high velocity. Different types of data are available, which can provide useful knowledge only if properly combined and integrated. However, the correlated exploitation of data from disparate and heterogeneous data sources is a crucial computational issue.
The growing number of sensors (in coastal and satellite networks) makes the sea one of the most challenging environments to be effectively monitored; the need for methods for processing vessel-motion data, which are scalable in time and space, is highly critical for maritime security and safety. For instance, approximately 12,000 ships/day are tracked in EU waters, and approximately 100,000,000 AIS positions/month are recorded in EU waters (EMSA 2012). Beyond the volume of data concerning ships' positions obtained from AIS, such tracking data might not always be sufficient for the purposes of detection and prediction algorithms. Only if properly combined and integrated with data acquired from other data/information sources (not only AIS) can they provide useful information and knowledge for achieving the maritime situational awareness required by the datAcron maritime-use case.
The Maritime-Use Case for datAcron focuses on the control of fishing activities because it fulfills many of the requirements for validating the technology to be developed in datAcron: It addresses challenging problems deemed of interest for the maritime operational community in general; it is aligned with the European Union maritime policy and needs in particular; and it relies on available datasets (unclassified, shareable) among the teams and others of interest in the research community (e.g., AIS data, radar datasets, databases of past events, intelligence reports, etc.). Moreover, it is of considerable complexity because it encompasses several maritime risks and environmental issues such as environmental destruction and degradation; maritime accidents; illegal, unreported, and unregulated (IUU) fishing; and trafficking problems.
The support for processing, analyzing, and visualizing fishing vessels at the European scale, although not worldwide, along with the capability of predicting the movement of maritime objects and identifying patterns of movement and navigational events, shall improve existing solutions to monitor compliance with the European common fisheries policy. In addition to the control of fishing activities, another core issue is safety. Fishing, even under peace conditions, is known as one of the most dangerous activities and is regularly ranked among the top five dangerous activities, depending on the years being considered. Safety does not concern only fishing vessels themselves but also the surrounding traffic and, more generally, all other human activities at sea.
The data to be used in datAcron comprise real and quasi-real data streams as well as archival (or historical) European datasets supporting the fishing scenarios specified. These usually need to be cleaned of inconsistencies, converted into standard formats, harmonized, and summarized.
The following list briefly summarizes typical datasets that are relevant to the datAcron scenarios:
• Automatic Identification System (AIS) messages broadcast by ships for collision avoidance;
• marine protected/closed areas where fishing and sea traffic may be (temporarily) forbidden;
• traffic-separation schemes and nautical charts useful to define vessel routes;
• vessel routes and fishing areas estimated from historical traffic data;
• registry data on vessels and ports;
• records of past events such as incidents and illegal-activities reports; and
• meteorological and oceanographic (METOC) data on atmospheric and sea-state conditions and currents.
Despite the urgent need for the development of maritime data infrastructures, current information and database systems are not completely appropriate to manage this wealth of data nor, therefore, to support the analytics tasks targeted in datAcron. To address these limitations, we at datAcron put forward two major requirements. First, the very large data volumes generated require the development of a pre-filtering data-integration process that should deliver data synopses in real-time while maintaining the main spatio-temporal and semantic properties. Second, additional ocean and atmospheric data, in conjunction with other data sources at the global and local scales, are often necessary to evaluate events and patterns at sea in the most appropriate way, thus leading to additional data-integration issues [15].
In addition to the above-mentioned points, data measurements have an intrinsic uncertainty, which may be addressed by proper data-fusion algorithms and clustering in the preparation/preprocessing phase (by assessing the quality of data themselves) and by combining measurements from complementary sources [15].

Big-Data Management Challenges in datAcron
As already said, we at datAcron aim at recognizing and forecasting complex events and trajectories from a wealth of input data, both data-at-rest and data-in-motion, by applying appropriate techniques for Big-Data analysis. The technical challenges associated with Big-Data analysis are manifold and are perhaps better illustrated in [1,2] where the Big Data-Analysis Pipeline is presented. As depicted in Fig. 1, five major phases (or steps) are identified in the processing pipeline:
• data acquisition and recording;
• information extraction and cleaning;
• data integration, aggregation, and representation;
• query processing, data modeling, and analysis; and
• data interpretation.

Data Acquisition
As already said, large volumes of high-velocity data are created in a streaming fashion, including surveillance data and weather forecasts that must be consumed in datAcron. One major challenge is to perform online filtering of this data in order to keep only the necessary data that contain the useful information. To this end, we apply data-summarization techniques on surveillance data, thus keeping only the "critical points" of a moving object's trajectory, which signify changes in the mobility of the moving object. Such a summarized trajectory is shown in Fig. 2 comprising the low-level events detected as critical trajectory points. A research challenge for datAcron is to achieve a data-reduction rate greater than 90% without compromising the quality of the compressed trajectories and, of course, the quality of trajectories' and events' analysis tasks [14].
Fig. 1 Major steps in the analysis of big data (from [1,2])
Another challenge in the data-acquisition phase is to push computation to the edges of the Big Data-management system. To achieve this, we perform online data summarization of surveillance data directly on the input stream as soon as it enters the system. Moreover, we employ in-situ processing techniques, near to the streaming data sources, in order to identify additional low-level events of interest such as the entry/exit of moving objects into/from specific areas of interest (such as protected marine areas) and events requiring cross-streaming processing.
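The idea of keeping only "critical points" can be illustrated with a minimal sketch. It is not the datAcron summarization method: it assumes fixes given as (timestamp, latitude, longitude) tuples, detects only turning points, and uses a hypothetical heading-change threshold of 15 degrees. The names bearing_deg, critical_points, and reduction_rate are introduced here for illustration only.

```python
import math

def bearing_deg(p, q):
    """Initial great-circle bearing in degrees from fix p to fix q.
    Fixes are (timestamp, lat, lon) tuples (an assumed layout)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[1], p[2], q[1], q[2]))
    dlon = lon2 - lon1
    y = math.sin(dlon) * math.cos(lat2)
    x = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0

def critical_points(track, turn_threshold_deg=15.0):
    """Keep the first and last fixes plus every fix at which the heading
    changes by more than the (hypothetical) threshold."""
    if len(track) <= 2:
        return list(track)
    kept = [track[0]]
    for prev, cur, nxt in zip(track, track[1:], track[2:]):
        change = abs(bearing_deg(cur, nxt) - bearing_deg(prev, cur))
        change = min(change, 360.0 - change)  # wrap around north
        if change > turn_threshold_deg:
            kept.append(cur)
    kept.append(track[-1])
    return kept

def reduction_rate(track, summary):
    """Fraction of fixes discarded by the summarization."""
    return 1.0 - len(summary) / len(track)
```

Because each fix is compared only with its immediate neighbors, a slow, gradual turn would go unnoticed; a realistic summarizer would also track stops, speed changes, and communication gaps.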

Information Extraction and Cleaning
Given the disparity of data sources exploited in datAcron, with miscellaneous data in various formats for processing and analysis, a basic prerequisite for the subsequent analysis tasks is to extract the useful data and transform it into a form that is suitable for processing. As a concrete example, weather forecasts are provided as large binary files (GRIB format), which cannot be effectively analyzed in that form. Therefore, we extract the useful meteorological variables from these files, together with their spatio-temporal information, so that they can be later associated with mobility data. This should be done in operational time (i.e., in milliseconds), enriching the stream(s) of surveillance data.
Fig. 2 Summarized trajectory: Critical points with specific meaning indicate important low-level events at specific points
In addition, surveillance data are typically noisy, contain errors, and are associated with uncertainty. Data-cleaning techniques are applied to the streams of surveillance data in order to reconstruct trajectories with minimum errors, which makes accurate analysis results more likely. Indicative examples of challenges addressed in this respect include handling delayed surveillance data and dealing with intentionally erroneous data (spoofing) or hardware/equipment errors.
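As a rough illustration of such cleaning, the sketch below reorders delayed fixes by timestamp and discards fixes that would imply an implausible speed. It is a simplification under stated assumptions, not the datAcron cleaning procedure: fixes are (timestamp in seconds, latitude, longitude) tuples, the 30 m/s speed bound is a hypothetical value, and the first fix is trusted.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two positions."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def clean_track(fixes, max_speed_mps=30.0):
    """Return time-ordered fixes without duplicate timestamps and
    without jumps faster than max_speed_mps (a hypothetical bound)."""
    cleaned = []
    for t, lat, lon in sorted(fixes):  # delayed messages are reordered
        if not cleaned:
            cleaned.append((t, lat, lon))
            continue
        t0, lat0, lon0 = cleaned[-1]
        dt = t - t0
        if dt <= 0:
            continue  # duplicate timestamp
        if haversine_m(lat0, lon0, lat, lon) / dt > max_speed_mps:
            continue  # implausible jump, e.g. a spoofed or corrupt fix
        cleaned.append((t, lat, lon))
    return cleaned
```

Sorting the whole batch is only possible offline; on a stream, a bounded reordering buffer would be needed instead, and an erroneous first fix would have to be handled separately.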

Data Integration, Aggregation, and Representation
Having addressed data cleaning, the next challenge is to integrate the heterogeneous data coming from various data sources in order to provide a unified and combined view. Our approach is to transform and represent all input data in RDF following a common representation (i.e., the datAcron ontology), which was designed purposefully to accommodate the different data sources. However, data transformation alone does not suffice. To achieve data integration, we apply online link-discovery techniques in order to interlink streaming data from different sources, a task of major significance in datAcron.
In particular, the types of discovered links belong to different categories with the most representative ones being (a) moving object with static spatial area, (b) moving object with spatio-temporal variables, and (c) moving object with moving objects. In the first case, we monitor different relations (enter, exit, nearby) between a moving object and areas of interest such as protected natural areas or fishing zones. In the second case, we enrich the points of a trajectory with weather information coming from weather forecasts. Finally, in the last case, we identify relations between moving objects, e.g., two vessels approaching each other or staying in the same place for an unusually long period. By means of link discovery, we derive enriched-data representations across different data sources, thereby providing richer information to the higher-level analysis tasks in datAcron.
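The first link category (moving object with static spatial area) can be sketched as follows. This is an illustrative simplification, not the datAcron link-discovery component: it assumes an area given as a simple polygon of (lon, lat) vertices, treats coordinates as planar, and covers only the "enter" and "exit" relations, leaving "nearby" aside. The function names are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the horizontal ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def area_links(track, area_id, polygon):
    """Return (timestamp, relation, area_id) links for one moving object,
    where relation is "enter" or "exit"; fixes are (t, lat, lon)."""
    links = []
    was_inside = False
    for t, lat, lon in track:
        now_inside = point_in_polygon(lon, lat, polygon)
        if now_inside and not was_inside:
            links.append((t, "enter", area_id))
        elif was_inside and not now_inside:
            links.append((t, "exit", area_id))
        was_inside = now_inside
    return links
```

With many areas and many vessels, testing every fix against every polygon does not scale; a spatial index or grid over the areas would be the usual way to limit the candidate pairs.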

Query Processing, Data Modeling, and Analysis
Another Big-Data challenge addressed in datAcron relates to the scalable processing of vast-sized RDF graphs that encompass spatio-temporal information. Toward this goal, we designed and developed a parallel spatio-temporal RDF processing engine on top of Apache Spark. Individual challenges that need to be solved in this context include RDF-graph partitioning, implementing parallel query operators that shall be used by the processing engine, and exploiting the capabilities of Spark in the context of trajectory data.
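To make the partitioning challenge more concrete, the sketch below shows one common way of grouping spatio-temporal records: by a grid cell and a time bucket. It is written in plain Python rather than Spark, and it is only an assumption about what such a scheme could look like, not the partitioning used by the datAcron engine. The cell size, bucket length, and function names are hypothetical.

```python
import math
from collections import defaultdict

def partition_key(t_s, lat, lon, cell_deg=1.0, bucket_s=3600):
    """Map a record to a (lat cell, lon cell, time bucket) key.
    cell_deg and bucket_s are hypothetical tuning parameters."""
    return (math.floor(lat / cell_deg),
            math.floor(lon / cell_deg),
            int(t_s // bucket_s))

def partition(records, cell_deg=1.0, bucket_s=3600):
    """Group records whose first three fields are (t_s, lat, lon)."""
    parts = defaultdict(list)
    for rec in records:
        t_s, lat, lon = rec[:3]
        parts[partition_key(t_s, lat, lon, cell_deg, bucket_s)].append(rec)
    return dict(parts)
```

A query restricted to a region and a time window then needs to touch only the partitions whose keys overlap that window; the trade-off is that trajectories crossing cell borders are split across partitions.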
Complex event detection is also performed in datAcron where the objective is to detect events related to the movement of objects in real-time.
Last, but not least, particular attention is paid to predictive analytics, namely, trajectory prediction and event forecasting. Both short- and long-term predictions are useful, depending on the domain; in the maritime domain in particular, long-term prediction is a difficult problem. For instance, as far as trajectory prediction is concerned, we may distinguish location prediction (where a moving object will be after X hours) and trajectory prediction (what path a moving object will follow in order to reach position P).
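For location prediction, a simple baseline is dead reckoning: extrapolating the last known position at constant speed and course. The sketch below is such a baseline, assuming a spherical Earth; it is not a datAcron prediction method and is reasonable only for short horizons. The function name and parameter layout are introduced here for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius

def predict_location(lat, lon, speed_mps, course_deg, horizon_s):
    """Extrapolate a position along a great circle at constant speed
    and course; returns (lat, lon) in degrees."""
    delta = speed_mps * horizon_s / EARTH_RADIUS_M  # angular distance
    theta = math.radians(course_deg)
    phi1, lmb1 = math.radians(lat), math.radians(lon)
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(theta))
    lmb2 = lmb1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2))
    # Normalize longitude to the [-180, 180) range.
    return math.degrees(phi2), (math.degrees(lmb2) + 540.0) % 360.0 - 180.0
```

Such a baseline ignores routes, coastlines, and weather, which is precisely why long-term prediction calls for models learned from historical trajectories and contextual data.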

Interpretation
To assist the task of human-based interpretation of analysis results, as well as the detection of patterns that may further guide the detection of interesting events (tasks that are fundamental for any Big Data-analysis platform), datAcron relies on visual analytics. By means of visual-analysis tools, it is possible to perform visual and interactive exploration of moving objects and their trajectories, visualize aggregates or data summaries, and ultimately identify trends or validate analysis results that would be hard to find automatically.

Semantic Trajectories Revisited: An Ontology for Maritime Data to Support Movement Analysis
Given the significance of trajectories, analysis methods (e.g., for the detection and prediction of trajectories and events), in combination with visual analytics methods, require trajectories to be (a) available at multiple levels of spatio-temporal analysis, (b) easily transformed into spatio-temporal constructs/forms suitable for analysis tasks, and (c) able to provide anchors for linking contextual information and events related to the movement of any object. In doing so, the representation of trajectories at the semantic level aims to provide semantically meaningful integrated views of data regarding the mobility of vessels at different levels of analysis. The term "contextual information" denotes any type of information about entities that affect the behavior of an object (e.g., weather conditions or events of special interest) as well as information about entities that are being affected by the behavior of an object (e.g., a fishing or protected area). Moreover, the context of an object's trajectory may include the trajectories of other objects in its vicinity. As already said, surrounding traffic may entail safety concerns. The association of trajectories with contextual information and events results in enhanced semantic trajectories of moving objects.
Existing approaches for the representation of semantic trajectories suffer from at least one of the following limitations: (a) there is use of plain textual annotations instead of semantic links to other entities [3-5, 8, 10-12]; (b) only limited types of events can be represented as resources [3-7]; (c) assumptions are made about the structure of trajectories, thus restricting the levels of analysis and representations supported [6,9]; and (d) semantic links between entities are mostly application-specific rather than generic [6,7].
Motivated by real-life emerging needs in MSA, we aim at providing a coherent and generic scheme supporting the representation of semantic trajectories at different levels of spatial and temporal analysis: Trajectories may be seen as single geometries, as arbitrary sequences of moving objects' positions over time, as sequences of events, or as sequences of trajectories' segments each linked with important semantic information.These different levels of representation must be supported by the datAcron ontology.
The datAcron ontology is expressed in RDFS. The main concepts and properties in this ontology are depicted in Fig. 3 and are presented in the next paragraphs.
Places: The concept of "place" is instantiated by the static spatial resources representing places and regions of special interest. Places are related to any type of trajectory (or segment), weather conditions, and events (by relations "within" or "nearby"). "Place" is a generalization of Places of Interest (POIs) related to trajectories and Regions of Interest (ROIs) [13] associated with a stop event of a moving object. A place is always related to a geometry with the property "hasGeometry."
Semantic Nodes: For the representation of moving objects' behavior at varying levels of analysis in space and time, and in order to associate trajectories with contextual information, we use the concept of "SemanticNode." A semantic node specifies either the position of a moving object in a time period or instant, or a specific set of spatio-temporal positions of a single moving object. In the latter case, it specifies an abstraction/aggregation of movement track positions (e.g., the centroid for a set of positions) and can be associated with a place and a temporal interval when this movement occurred. In both cases, a semantic node may represent the occurrence of an event type.
More importantly, any instance of SemanticNode can be associated with contextual information known for the specific spatio-temporal position or ROI.
In addition, the semantic node may be associated with weather information regarding the node's spatio-temporal extent.
Trajectories: A trajectory is a temporal sequence of semantic nodes or trajectory segments. The main properties relating trajectories to semantic nodes are "hasInitNode," "hasLastNode" (representing the trajectory initial and last semantic nodes, respectively) and "consistsOf" (for relating trajectories to intermediate semantic nodes). The property "hasNext" relates consecutive semantic nodes in a trajectory to maintain the temporal sequence of nodes (Fig. 4).
Fig. 3 Core concepts and properties
Various types of trajectories are supported such as "OpenTrajectory," in which the last (terminal) semantic node is not yet reached, and "ClosedTrajectory," in which the last node is specified. Having said this, it should be noted that criteria for determining terminal positions are application and domain specific. A trajectory can also be classified as "intendedTrajectory," specifying a planned or predicted trajectory. Thus, each moving object, at a specific time, may be related to multiple trajectories, actual or intended/predicted ones, and semantic nodes can be reused in different types of trajectories, w.r.t. spatial and temporal granularity.
The properties "hasParent" and "hasSuccessive" relate a trajectory with other trajectories, thus forming a structured trajectory. Specifically, the first property relates a trajectory to its parent (the whole), and the second one relates the successive trajectories (the parts).
For instance, Fig. 5 illustrates the trajectory of a vessel through ports Porto Di Corsini, Durres, and Bari, and Fig. 6 demonstrates the corresponding structured trajectory with its trajectory segments.
Trajectory segments have a starting and an ending semantic node and are associated with a time interval and geometry.
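The node-level properties named above ("hasInitNode," "hasLastNode," "consistsOf," and "hasNext") can be illustrated by generating the corresponding triples for a small trajectory. This is a minimal sketch using plain Python tuples instead of an RDF library; the "dat:" prefix, the "Trajectory" class name, and the identifiers are assumptions made for illustration and are not taken from the published ontology.

```python
def trajectory_triples(traj_id, node_ids, prefix="dat:"):
    """Build (subject, predicate, object) triples linking a trajectory
    to its temporally ordered semantic nodes. The prefix and the class
    name "Trajectory" are hypothetical."""
    if not node_ids:
        raise ValueError("a trajectory needs at least one semantic node")
    traj = prefix + traj_id
    nodes = [prefix + n for n in node_ids]
    triples = [(traj, "rdf:type", prefix + "Trajectory")]
    triples += [(n, "rdf:type", prefix + "SemanticNode") for n in nodes]
    triples.append((traj, prefix + "hasInitNode", nodes[0]))
    triples.append((traj, prefix + "hasLastNode", nodes[-1]))
    # Intermediate nodes are attached with consistsOf.
    triples += [(traj, prefix + "consistsOf", n) for n in nodes[1:-1]]
    # hasNext preserves the temporal order of consecutive nodes.
    triples += [(a, prefix + "hasNext", b) for a, b in zip(nodes, nodes[1:])]
    return triples

# Example: a trajectory with three semantic nodes.
for triple in trajectory_triples("t1", ["n1", "n2", "n3"]):
    print(triple)
```

Structured trajectories could be added in the same style by emitting "hasParent" and "hasSuccessive" triples between trajectory identifiers.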
Event: The "Event" concept is instantiated by spatio-temporal entities representing specific, aggregated, or abstracted positions instantiating a specific event pattern. The instantiation of any such event pattern can be part of a preprocessing task on raw data, or it can be done by a function applied to the RDF data, thus resulting in the generation of new triples representing the recognized events. Thus, an event is associated with a set of semantic nodes, which may be in temporal sequence. Each event may be associated with one or more moving objects, and it has spatial, temporal, and domain-specific properties according to the properties of semantic nodes and trajectories. It must be pointed out that a semantic node or a trajectory can be associated with more than one event type (e.g., "Rendezvous" and "PackagePicking").
Fig. 4 A semantic-node instance linked to weather data
Events are distinguished into low-level and high-level events: The former are those detected from raw trajectory data or from time-variant properties of a single moving object disregarding contextual data. For instance, a "Turning" or an "Accelerating" event is a low-level event because it concerns a specific object and can be detected directly from raw trajectory data. Figures 2 and 5 depict such low-level events as trajectory "critical" points. High-level (or complex) events are detected or predicted by means of features specifying movement and/or time-variant properties, in addition to contextual ones, of moving objects. For example, the detection of a "Fishing" event needs consideration of the type of the vessel and the known fishing regions in addition to the vessel's raw trajectory.

Concluding Remarks
The datAcron project aims to advance the management and integrated exploitation of voluminous and heterogeneous data-at-rest (archival data) and data-in-motion (streaming data) sources so as to address important challenges in time-critical domains, such as the maritime domain, for supporting analysis tasks. Indeed, vast data volumes of a heterogeneous nature, flowing at extremely high rates, whose intertwined exploitation for supporting analysis tasks in the maritime domain is an emergent necessity, call for novel big-data techniques and algorithms that lead to advanced data analytics.
Toward achieving its objectives, datAcron considers semantic trajectories to be "first-class citizens" following the paradigm shift toward operations that are built and revolve around the notion of trajectory. Thus, datAcron revisits the notion of semantic trajectory and builds on it. Specifically, it is expected that meaningful moving patterns will be computed and exploited to recognize and predict the behavior and states of moving objects, taking advantage of the wealth of information available in disparate and heterogeneous data sources.
Given the significance of trajectories, analysis methods (e.g., for the detection and prediction of trajectories and events), in combination with visual analytics methods, require trajectories to be (a) available at multiple levels of spatio-temporal analysis, (b) easily transformed to spatio-temporal constructs/forms suitable for analysis tasks, and (c) able to provide anchors for linking contextual information and events related to the movement of any object. Toward these objectives, datAcron has devised a representation for trajectories at the semantic level, providing semantically meaningful integrated views of data regarding the mobility of vessels at different levels of analysis.
Finally, as already mentioned, we address issues of all five major phases (or steps) identified in the processing pipeline of a big-data architecture:
• data acquisition and recording;
• information extraction and cleaning;
• data integration, aggregation, and representation;
• query processing, data modeling, and analysis; and
• interpretation.
The design of the datAcron overall architecture reflects these issues and is in close connection with requirements of the maritime domain.
Our current work focuses on data integration, aggregation, and representation as well as on query processing, data modeling, and analysis. Methods aim toward providing big-data solutions to these processing phases even during operational times.
http://www.springer.com/978-3-319-59538-2

Fig. 5 An example trajectory: The figure shows the activity of a vessel near Durres port. The multiple stop positions with intermediate gap-start and gap-end points could indicate suspicious behavior