PROPheT – Ontology Population and Semantic Enrichment from Linked Data Sources

Ontologies are a rapidly emerging paradigm for knowledge representation, with a growing number of applications in various domains. However, populating ontologies with massive volumes of data is an extremely challenging task. The field of ontology population offers a wide array of approaches for populating ontologies in an automated or semi-automated way. Nevertheless, most of the related tools typically analyse natural language text, while sources of more structured information like Linked Open Data would arguably be more appropriate. The paper presents PROPheT, a novel software tool for ontology population and enrichment. PROPheT can populate a local ontology model with instances retrieved from diverse Linked Data sources served by SPARQL endpoints. To the best of our knowledge, no existing tool can offer PROPheT’s diverse extent of functionality.


Introduction
Ontologies constitute a knowledge representation paradigm for modelling domains, concepts and interrelations in a structured, uniform and effective way, enabling the sharing of information between different systems [1]. The rapidly emerging popularity of ontologies has led to their deployment in various domains, like bioinformatics [2], e-commerce [3] and digital libraries [4]. Nevertheless, in order for ontologies to be more efficiently used at an enterprise level, massive volumes of data are required for populating the underlying models. If performed manually, this task is extremely time-consuming and potentially error-prone. Ontology population attempts to alleviate this problem, by introducing methods and tools for automatically augmenting an ontology with instances of concepts and properties that represent real data/objects [5]. The schema of the ontology itself is not altered but only the realisation of its set of concepts and the asserted relations on the newly introduced instances. This process is part of ontology learning, which refers to the automatic (or semi-automatic) construction, enrichment and adaptation of ontologies [6].
The vast majority of ontology population tools and methodologies are aimed at textual input, typically extracting knowledge from natural language text [7]. Nevertheless, the unstructured nature of free text drastically increases the effort required to use its content in already structured frameworks. More structured sources of information could be used instead; one such example is Linked Open Data (LOD, often referred to simply as Linked Data) [8], which builds upon established Web technologies and is a standard for publishing interlinked structured data that can also respond to semantic queries. Linked Data are formalised using controlled vocabulary terms based on ontologies and can be made publicly accessible via a SPARQL endpoint [9]. A popular Linked Dataset is DBpedia, the Linked Data version of Wikipedia.
This paper argues that the rapidly increasing array of published Linked Datasets [10] can serve as the input for scalable ontology population and presents PROPheT (PERICLES Ontology Population Tool), a novel software tool for user-driven ontology population from Linked Data sources. The tool is domain-agnostic and can efficiently handle vast volumes of input data. Upon user request, PROPheT locates realisations of concepts in Linked Data sources and appropriately inserts them into a local schema, preserving its initial structure and semantic representations. To the best of our knowledge, no existing tool can offer PROPheT's extent of functionality.
The work presented here extends our previous work [11] with the following new material: (a) an extended related work section, now featuring a number of approaches for ontology population from DBpedia (Section 2); (b) a more thorough account of PROPheT's technical implementation details (Section 3.2); (c) two revised use cases, demonstrating the tool's functionality in diverse scenarios (Section 4).
The rest of the paper is structured as follows: Section 2 gives an overview of related work. Section 3 presents PROPheT's functionalities and operational workflow, followed by two illustrative use cases that demonstrate the tool's versatility and scalability in Section 4. Section 5 reports on evaluating PROPheT, and the paper is concluded with final remarks and directions for future work.

Related Work
Ontology population has already been deployed in various domains, such as e-tourism [12], web services [13] and clinical data [14], amongst others. Another recent work deploys ontology population in a Big Data setting [15], indicating a potentially emerging interest in the area. Overall, state-of-the-art ontology population approaches mainly focus on extracting and retrieving candidate instances from natural language text (e.g. product catalogues, university homepages, corpora and other Web sources), and typically involve machine learning, text mining and natural language processing techniques. Further representative approaches are presented in [5] and [7]. Another, albeit less popular, direction of ontology population research is aimed at retrieving instances from other types of input, e.g. CAD files [16], or more structured content, e.g. spreadsheets [17,18] and XML files [19].
Regarding ontology population from DBpedia, a recent attempt is presented in [20], where the authors manually map a local ontology to DBpedia classes and run a series of SPARQL queries that retrieve the respective instances; no details are given regarding the specifics of the population process. A similar approach for semantic annotation of news items is presented in [21], while the authors in [22] present a methodology for ontology enrichment based on input from DBpedia and (the now obsolete) Schema.org.
PROPheT's similarity to these approaches lies in the use of a LOD source as input for ontology population and enrichment. In this sense, PROPheT could easily be used as the underlying ontology population tool in [20][21][22]. Nevertheless, no other ontology population tool can currently instantiate new concepts from a LOD source so flexibly, regardless of the domain of interest or the content of the source. PROPheT can handle any kind of LOD as an external knowledge source for extracting concepts of interest and for populating them under the corresponding resources in the domain ontology.

The PROPheT Ontology Population Tool
PROPheT is a novel software tool for ontology population and semantic enrichment that can retrieve instantiations of concepts from Linked Data sources. The tool is fully domain-independent and capable of operating with any OWL ontology and any RDF LOD dataset served via a SPARQL endpoint. The retrieved instances are filtered by the user and are then inserted, together with their accompanying (selected) properties and values, into a target ontology. As described in the following subsections, PROPheT provides various modes of instance retrieval, and allows establishing user-defined mappings of the respective properties. Through its step-by-step, wizard-based interaction mode, the tool is extremely easy to use even for inexperienced users.

Technical Infrastructure
PROPheT's front-end (see Fig. 1) is implemented in Python along with PyQt, while specialised Python APIs (RDFLib, SPARQLWrapper) are deployed for handling local and remote ontologies. An SQLite database was also set up in the back-end for storing dynamic data (e.g. settings, user preferences) that are created during the tool's operation.

Ontology Population
PROPheT offers the capability of class-based and instance-based ontology population. Class-based population retrieves instances from an external source, based on a given class name, and inserts them into a local ontology. PROPheT submits appropriate SPARQL queries to the remote endpoint in order to first retrieve a result set of instances belonging to the specified class (declared with a unique classURI value in Table 1), and then to derive additional information for each instance, such as its label(s) (rdfs:label), data properties and related values defined in the remote ontology. The total number of fetched instances can be bounded by a maximum number of results (query_limit) specified by the user. The user may then select the instances to populate under an existing class in the local ontology.
The second method, instance-based population, has two different modes:
1. Retrieval of instances based on their rdfs:label property value, where the match of the retrieved instances is based on specific parameters. More specifically, the user may input a label to search for among the instances defined in the external source, together with additional search options, such as: (i) exact or partial match (i.e. the label contains the term) of the typed text with the label of the retrieved instance(s), (ii) exact match of the language code selected by the user with that specified in the label(s) of the retrieved instance(s), and (iii) whether the search is case-sensitive or case-insensitive. Detailed examples of the corresponding SPARQL queries are presented in Table 2. Retrieved results can be of any class (rdf:type); thus, the user may select any of the derived instances to be populated under a specific class in the local ontology.

2. Retrieval of instances similar to an existing instance. More specifically, PROPheT detects the classes in the remote ontology that include an instance with the same rdfs:label property value (exact match) as the input instance. The user may then select specific classes and choose which instances to import into the local ontology, following a similar approach to class-based population, but performed for multiple classes simultaneously.
In all the above cases, after the set of preferred instances has been selected by the user to be populated into the ontology, PROPheT launches the ontology mapping process described in subsection 3.4.
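The retrieval modes above can be sketched as simple query-string builders. This is a minimal illustration only, not PROPheT's actual queries: the function names (class_query, label_query, escape_literal), the default limit and the exact query shapes are assumptions, and the queries listed in Tables 1 and 2 may differ. The sketch uses only the Python standard library and builds query text without contacting any endpoint.

```python
RDFS_PREFIX = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"


def escape_literal(text):
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def class_query(class_uri, query_limit):
    """Class-based mode: instances of class_uri with their labels, capped at query_limit.

    class_uri is assumed to be a trusted absolute IRI (hypothetical helper).
    """
    if query_limit <= 0:
        raise ValueError("query_limit must be a positive integer")
    return (
        RDFS_PREFIX
        + "SELECT DISTINCT ?instance ?label WHERE {\n"
        + f"  ?instance a <{class_uri}> .\n"
        + "  OPTIONAL { ?instance rdfs:label ?label }\n"
        + "}\n"
        + f"LIMIT {query_limit}"
    )


def label_query(text, exact=True, lang=None, case_sensitive=True, query_limit=100):
    """Label-based mode: exact or partial match, optional language tag, optional case folding."""
    label = "STR(?label)"
    needle = f'"{escape_literal(text)}"'
    if not case_sensitive:
        # Fold both sides inside the query so the endpoint applies the same rule to each.
        label, needle = f"LCASE({label})", f"LCASE({needle})"
    match = f"{label} = {needle}" if exact else f"CONTAINS({label}, {needle})"
    conditions = [match]
    if lang:
        conditions.append(f'LANG(?label) = "{lang}"')
    return (
        RDFS_PREFIX
        + "SELECT DISTINCT ?instance ?type ?label WHERE {\n"
        + "  ?instance rdfs:label ?label .\n"
        + "  OPTIONAL { ?instance a ?type }\n"
        + f"  FILTER ({' && '.join(conditions)})\n"
        + "}\n"
        + f"LIMIT {query_limit}"
    )
```

For example, class_query("http://dbpedia.org/ontology/Film", 50) yields a query bounded by LIMIT 50, while label_query("godfather", exact=False, case_sensitive=False) yields a partial, case-insensitive label match. The resulting strings could then be submitted through a SPARQL client library of choice.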

Instance Enrichment
PROPheT also offers the option of semantically enriching instances that already exist in the local ontology with properties and values from similar instances in remote ontologies, i.e. instances with similar labels. The similar instances may belong to one or more different classes in the remote ontology, so the tool presents the user with the rdf:type of each instance. Based on the content and semantics of the derived instances, the user may then decide which property-value pairs to insert from the remote into the local ontology.

Ontology Mapping
In order for PROPheT to proceed with populating the ontology with the selected instances, the properties of the retrieved instances have to be mapped to properties defined in the local model. PROPheT displays a list of all datatype properties for the selected instances, so that the user can define suitable mappings to datatype properties already existing in the local ontology; for example, mapping the retrieved property dbo:birthDate to the local property ex:dateOfBirth. PROPheT stores the mappings in a linked SQLite database and suggests them when the same mappings come up again on future occasions. When ontology mapping is finalised, the instances and their related properties and values can be populated directly as new triples in the local ontology.
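The idea of remembering mappings and suggesting them later can be illustrated with a small SQLite-backed store. This is a sketch under assumptions, not PROPheT's actual schema: the table name property_mapping, its columns, and the helper functions open_store, remember and suggest are all hypothetical, and ranking suggestions by how often a mapping was used is our own simplification.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the mapping store; an in-memory database is used by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS property_mapping ("
        " remote_property TEXT NOT NULL,"
        " local_property TEXT NOT NULL,"
        " uses INTEGER NOT NULL DEFAULT 1,"
        " PRIMARY KEY (remote_property, local_property))"
    )
    return conn


def remember(conn, remote_property, local_property):
    """Record a user-defined mapping, counting how often it has been chosen."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE property_mapping SET uses = uses + 1"
            " WHERE remote_property = ? AND local_property = ?",
            (remote_property, local_property),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO property_mapping (remote_property, local_property)"
                " VALUES (?, ?)",
                (remote_property, local_property),
            )


def suggest(conn, remote_property):
    """Return the most frequently used local property for remote_property, or None."""
    row = conn.execute(
        "SELECT local_property FROM property_mapping"
        " WHERE remote_property = ?"
        " ORDER BY uses DESC, local_property LIMIT 1",
        (remote_property,),
    ).fetchone()
    return row[0] if row else None
```

With this sketch, after remember(conn, "dbo:birthDate", "ex:dateOfBirth") has been called once, suggest(conn, "dbo:birthDate") proposes ex:dateOfBirth the next time that remote property appears, while unknown properties return None and are left for the user to map manually.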

Semantic Enrichment
The local model may also be semantically enriched by establishing links between properties in the local and the remote ontologies via owl:equivalentProperty declarations added into the local model. Similar links between classes are represented via owl:sameAs and rdfs:seeAlso declarations added to the local ontology.
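As an illustration of what such declarations look like, the sketch below emits them as N-Triples lines using only the standard library. The helper names (triple, property_link, class_links) and the local namespace http://example.org/onto# are hypothetical; how PROPheT serialises these statements internally is not specified here.

```python
OWL_EQUIVALENT_PROPERTY = "http://www.w3.org/2002/07/owl#equivalentProperty"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def triple(subject, predicate, obj):
    """Format one statement between IRIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def property_link(local_property, remote_property):
    """Link a local property to a remote one via owl:equivalentProperty."""
    return [triple(local_property, OWL_EQUIVALENT_PROPERTY, remote_property)]


def class_links(local_class, remote_class):
    """Link a local class to a remote one via owl:sameAs and rdfs:seeAlso."""
    return [
        triple(local_class, OWL_SAME_AS, remote_class),
        triple(local_class, RDFS_SEE_ALSO, remote_class),
    ]


if __name__ == "__main__":
    # Hypothetical local namespace; the DBpedia IRIs are the usual dbo: terms.
    local = "http://example.org/onto#"
    lines = property_link(local + "dateOfBirth", "http://dbpedia.org/ontology/birthDate")
    lines += class_links(local + "Film", "http://dbpedia.org/ontology/Film")
    print("\n".join(lines))
```

The emitted lines can be appended to the local model by any RDF library that parses N-Triples.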

Use Cases
This section presents two use case scenarios: the former demonstrates PROPheT's versatility in performing ontology population and semantic enrichment from diverse sources, while the latter illustrates the tool's scalability in data-intensive domains.

Use Case 1: Ontology Population from Different LOD Sources
Suppose that Alice, an avid movie enthusiast, has developed an ontology of actors and films and wishes to initially populate it with an instance of the movie "The Godfather" retrieved from LinkedMDB. She loads her model in PROPheT and registers LinkedMDB as the current source. Since the name of the movie is known a priori, she searches for existing instances through the "Search by Instance Label" method. One result is retrieved and Alice adds this instance to her local ontology. She then wishes to retrieve additional information on the specific movie from another LOD source, DBpedia. Through PROPheT's "Enrich existing instance" function, Alice retrieves a set of instances that may belong to different classes, but all share the same rdfs:label as the newly populated instance. At this point, she may select any pairs of datatype properties/values she wants to add to her local instance of "The Godfather" movie. After she manually maps the relevant pairs of properties, the data is inserted into the corresponding fields in Alice's ontology.
If Alice wishes to further populate her model with similar resources, she can employ PROPheT's methods "Search by Class" or "Search by Existing Instance" for any LOD endpoint. If the former method is selected, Alice types, for example, dbo:Film for DBpedia, or movie:film for LinkedMDB. A set of instances will be retrieved, and Alice may then proceed with the selection and mapping process as described previously. If, on the other hand, "Search by Existing Instance" is selected, PROPheT will search for alternative classes that contain instances with the same label. Alice can now select one or more classes from which instances will be retrieved and proceed with the selection of instances to be populated (see Fig. 2).

Use Case 2: Ontology Population in a Data-intensive Domain
Bob, an employee at a government institution monitoring pollution in rural environments, wishes to create a directory of cities and towns worldwide, including related information, such as population, postal codes, etc., along with the respective pollution levels. Bob deploys a local ontology schema incorporating the necessary classes (e.g. Town, City, etc.) and properties (e.g. hasPopulation, hasPostalCode, etc.) and loads it into PROPheT. This ontology now needs to be populated with instances of cities and towns.
Bob then registers the sources that serve the desired data. Two suitable candidates are ENVO and LinkedGeoData. Specifically, ENVO's class City (ENVO_00000856) and LinkedGeoData's classes City and Town contain relevant instances. Using PROPheT's class-based instance extraction wizard, Bob populates his ontology with 10K instances from ENVO's City and 10K instances from LinkedGeoData's City, along with an additional 10K instances from LinkedGeoData's class Town. Data property values are also mapped and added. Table 3 displays the population times (in seconds) for the 30K instances.

Table 3. Instance retrieval and population times.

Ontology        No of instances   Population time (sec)
LinkedGeoData   10,000            120
ENVO            10,000            204
LinkedGeoData   10,000            158

Finally, through PROPheT's "Enrich Instance" function, Bob can semantically enrich the major cities' instances (e.g. London, Paris, Amsterdam) with data from different endpoints regarding air pollution levels.
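To put the figures of Table 3 into perspective, they can be converted into throughput (instances per second). The short sketch below does only this arithmetic; the variable and function names are our own, and the overall figure assumes the three runs were executed one after the other, which is not stated explicitly.

```python
# (source, instances, seconds) rows copied from Table 3, in table order.
TABLE_3 = [
    ("LinkedGeoData", 10_000, 120),
    ("ENVO", 10_000, 204),
    ("LinkedGeoData", 10_000, 158),
]


def throughput(instances, seconds):
    """Instances populated per second."""
    return instances / seconds


# Per-row throughput, rounded to one decimal place.
rates = [round(throughput(n, s), 1) for _, n, s in TABLE_3]

# Overall throughput, assuming the runs were sequential (an assumption).
total_instances = sum(n for _, n, _ in TABLE_3)
total_seconds = sum(s for _, _, s in TABLE_3)
overall = round(throughput(total_instances, total_seconds), 1)

if __name__ == "__main__":
    for (source, _, _), rate in zip(TABLE_3, rates):
        print(f"{source}: {rate} instances/sec")
    print(f"overall: {overall} instances/sec")
```

Under these assumptions the three runs correspond to roughly 83.3, 49.0 and 63.3 instances per second respectively, or about 62.2 instances per second over all 30K instances.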

PROPheT Evaluation
We conducted a user evaluation of the tool, with very encouraging results: the participants singled out the following aspects of the tool as the most positive ones: attractiveness (93.5%), user-friendliness (93.5%), ease of use (100%), innovativeness (87.5%), and efficiency (93.5%); the numbers in parentheses are the respective percentages of users indicating acceptance. More information on the user evaluation is presented in [23]. Furthermore, we also conducted a qualitative evaluation of PROPheT, based on the criteria for ontology population tools proposed in [7]. The key findings of the evaluation are presented in Table 4, while more information is given in [11].
Finally, considering that the availability and scalability of SPARQL endpoints serving Linked Data are not always guaranteed [9], and in order to demonstrate PROPheT's scalability, we experimented with timing the retrieval and population of instances from several well-known SPARQL endpoints into a local custom ontology model. Our findings are presented in more detail in [11].

Table 4. PROPheT's qualitative evaluation.

Criterion                 PROPheT's evaluation
Elements extracted        Objects and relations.
Initial requirements      Availability of a local OWL ontology; no domain-dependent resources or specialised software are required.
Learning approach         Step-by-step ontology population and enrichment; SPARQL querying of Linked Data endpoints.
Degree of automation      Retrieval is automated; selection is user-driven, but highly user-friendly.
Consistency maintenance   Integrated specialised APIs ensure consistency.
Redundancy elimination    The same instance, i.e. one carrying the same URI, cannot be populated multiple times.
Domain portability        Totally domain-agnostic.
Corpora modality          Limited to LOD sources with a SPARQL endpoint.

Conclusions and Future Work
The paper argued that, with the rapidly growing use of ontologies in various domains, the process of ontology population becomes increasingly relevant. Most proposed solutions are aimed at analysing natural language text, often overlooking sources of more structured information, such as Linked Data. In this context, we presented PROPheT, a domain-independent software tool for ontology population and enrichment from Linked Data sources. Through wizard-based, user-driven processes, the tool facilitates the automatic retrieval of instances and their insertion into a local OWL ontology, without requiring knowledge of the queries submitted to the Linked Data endpoints or of the SPARQL query language's syntax. An advanced mapping process enables the dynamic definition of matching classes and properties between source and target models. To the best of our knowledge, the tool's functionality and versatility exceed those of any other ontology population tool found in the literature, making PROPheT an innovative system for populating and enriching ontologies in domains where populating ontologies from diverse sources poses a formidable challenge. Indicative examples include cultural heritage [24], telecommunications and news [25], and health and biomedicine [26,27]. This was our main motivation for making PROPheT a truly domain-agnostic tool, capable of performing ontology population and enrichment from Linked Data sources in virtually any domain, data-intensive or not.
Nevertheless, there are still a few areas of improvement for the tool. In its current implementation, PROPheT handles only datatype properties and not object properties; the latter are significantly more complex to tackle. In this context, we are planning to adopt the approach presented in [28]. Additionally, the ability to query multiple selected endpoints simultaneously, or to handle direct/indirect imports of ontologies, would correspondingly enrich the size and content of the results retrieved in a single query. A further improvement could be to consider additional semantic enrichment associations, e.g. skos:narrower and skos:broader from SKOS [29]. Finally, similar instances or classes could be suggested to the user by the tool itself during the population and enrichment steps, according to appropriate similarity metrics.