Project deliverable Open Access
Todor Primov; Andrey Avramov; Nikola Rusinov; Vladimir Alexiev
This deliverable is the third report on the progress of T3.4 Semantic Enrichment. It describes the practical application of advanced text analytics pipelines used to extract and semantically annotate information from unstructured textual data sources in the BigDataGrapes (BDG) data pool. The report covers the practical approach to designing a source knowledge graph for wine and wine-review information; semantic data fusion with basic ontologies and thesauri of relevant terminologies from the BDG data pool; the design of named entity recognition pipelines for data extraction from public wine reviews; and the configuration of semantic search on top of the annotated content. The demonstrated approach is generic and can be applied to any type of unstructured content (research publications, news articles, patent data, trial reports, food quality reports, etc.) using any of the terminologies available in the BDG data pool (sensor data, wine varieties, etc.) or any other data set available in the linked open data (LOD) cloud.
The work reported in the first version of the deliverable (Version 1 of D3.4 - Linguistic Pipelines for Semantic Enrichment, reported in M12 of the BDG project) focused mostly on setting up the overall semantic enrichment workflow to be followed. This covered domain modeling; building a core knowledge graph to support the semantic enrichment; development and customization of NLP pipeline components; and post-processing of the annotation schema into a corresponding RDF representation.
The second reporting period (Version 2 of D3.4 - Linguistic Pipelines for Semantic Enrichment, in M24 of the BDG project) was intended to apply the generic semantic enrichment approach to a concrete use case and to demonstrate how end users can benefit from semantic enrichment when navigating and browsing a large sample linked data set (described in Version 2 of D4.3 - Models and Tools for Predictive Analytics over Extremely Large Datasets, reported in M15 of the BDG project).
The current work describes improvements to the semantic enrichment of the data set used in Version 2 of D3.4 - Linguistic Pipelines for Semantic Enrichment, including 1) extraction and filtering of grape, wine and food concepts from the data set; 2) semantic enrichment of the textual fields of wine reviews with these concepts; and 3) improvement of the semantic search by building new search indices over the semantically enriched wine reviews.
In addition to the work related to the Wine Search demonstrator, a PubMed Central web crawler was developed that can be configured to download fresh content relevant to research on wine, antioxidants and other bioactive compounds. The content is then processed by a text analysis pipeline which identifies instances of organic compounds of interest for the project and classifies them into functional groups of compounds (e.g. flavonoids, glycosides, etc.).
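As a rough illustration of the last step, the sketch below shows one simple way compound mentions could be found in text and mapped to functional groups using a dictionary (gazetteer) lookup. It is not the pipeline described in the deliverable: the `COMPOUND_GROUPS` table, its three entries and the `annotate_compounds` function are hypothetical names chosen for this example, and a real pipeline would draw its terminology from a curated knowledge graph rather than a hard-coded dictionary.

```python
import re

# Hypothetical gazetteer: compound name -> functional group.
# The entries are illustrative only, not the project's terminology.
COMPOUND_GROUPS = {
    "quercetin": "flavonoids",
    "catechin": "flavonoids",
    "amygdalin": "glycosides",
}

# One case-insensitive pattern that matches any gazetteer term as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in COMPOUND_GROUPS) + r")\b",
    re.IGNORECASE,
)


def annotate_compounds(text):
    """Return (start, end, matched text, functional group) for each mention."""
    return [
        (m.start(), m.end(), m.group(0), COMPOUND_GROUPS[m.group(0).lower()])
        for m in _PATTERN.finditer(text)
    ]
```

Calling `annotate_compounds` on a sentence returns character offsets together with the assigned group, which is the kind of stand-off annotation that could later be converted to RDF.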
D3.4 Linguistic Pipelines for Semantic Enrichment v.3 (Submitted to EC).pdf