Linked Vocabularies for Mobility and Transport Research

This paper describes the creation of a vocabulary for a domain-specific information service platform (SIS move) by vocabulary re-use and linking. Source vocabularies differ with respect to several factors (domain-specificity, accessibility, data model). We address why existing vocabularies should be considered as sources for a domain-specific vocabulary and how they are brought under a common modelling paradigm using standards for knowledge organization systems and alignment of schemata. We also discuss the creation and validation of alignments. Finally, we give an outlook on the vocabulary’s further evolution and application.


Introduction
The Specialized Information Service Mobility and Transport Research (SIS move) [1] is a platform providing researchers from academia and industry with international (open access) literature, information about the European landscape of researchers, and research data. Since it covers a field with several disciplines and terminologies, SIS move needs a controlled vocabulary for information retrieval (cf. Sect. 2). Instead of building one from scratch, we design a system of linked vocabularies from existing sources. New concepts, definitions, labels, or relations can expand these linked vocabularies. We evaluate several candidates against the requirements to be met in SIS move (cf. Sect. 4). With reference to related work and best practices (cf. Sect. 3), we demonstrate how to prepare candidates for integration (cf. Sect. 5), and close by discussing the application and further development of the Linked Vocabularies for SIS move (cf. Sect. 6).
We use the terms concept, term and terminology consistent with [2]: A concept is a mental unit, a term a linguistic expression referring to a concept, and a terminology the set of all concepts of an area of expertise, their relations and terms. Terminologies are an object of linguistics as well as terminology science, information science and computer science. Having unique theories and methods for terminology modelling, these disciplines create terminology resources differing in content scope and data modelling approach, e.g. (technical) dictionaries, vocabularies, terminologies, terminological resources, thesauri, controlled vocabularies, ontologies, glossaries. We do not want to exclude any type of terminology documentation as a potential source. Throughout the paper, we will hence use the term vocabulary to refer to terminology documentation regardless of format or degree of formalization. Specific vocabularies will be referred to by the proper term, e.g. thesaurus for a "controlled and structured vocabulary in which concepts are represented by terms, organized so that relationships between concepts are made explicit, and preferred terms are accompanied by lead-in entries for synonyms or quasi-synonyms" used for subject indexing and information retrieval [3].

SIS Move's Thesaurus Use Case
The main use case for the vocabulary is linguistic assistance in a library information discovery system that indexes textual publications, research data, and audiovisual media. Features include word completion while typing a query in the search bar, discovering new information indexed according to a controlled vocabulary, similarity search, disambiguation, or query expansion by synonym or broader/narrower terms.
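Query expansion by synonyms and narrower terms can be illustrated with a minimal sketch. It assumes a toy in-memory concept store; the class `Concept`, the function `expand_query`, the identifiers `c1` and `c2`, and the narrower concept "Bus transit" are hypothetical illustrations and not part of the SIS move implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A concept-oriented entry: one concept, several terms."""
    pref_label: str
    alt_labels: list = field(default_factory=list)
    narrower: list = field(default_factory=list)  # ids of narrower concepts


def expand_query(term, concepts, include_narrower=True):
    """Return the query term plus synonyms and (optionally) narrower labels."""
    needle = term.casefold()
    expanded = [term]
    for concept in concepts.values():
        labels = [concept.pref_label, *concept.alt_labels]
        if needle not in (label.casefold() for label in labels):
            continue
        expanded.extend(labels)
        if include_narrower:
            expanded.extend(concepts[cid].pref_label for cid in concept.narrower)
    # Remove case-insensitive duplicates while keeping the original order.
    seen, result = set(), []
    for label in expanded:
        key = label.casefold()
        if key not in seen:
            seen.add(key)
            result.append(label)
    return result


# Toy data: "Bus transit" is an invented narrower concept for illustration.
concepts = {
    "c1": Concept("Public transit",
                  ["Local transit", "Mass transit", "Transit"],
                  narrower=["c2"]),
    "c2": Concept("Bus transit"),
}

expanded_terms = expand_query("mass transit", concepts)
print(expanded_terms)
```

A discovery system could then search for any of the returned strings instead of the literal user input; whether narrower terms should be included by default is a design decision this sketch leaves open.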
Especially in mobility and transport research, the differing technical languages of the social sciences and engineering may cause misunderstandings. Discovering scientific literature from another field within the knowledge space of transport science can be facilitated by (visual) exploration of the vocabulary's content (both terms and relations). Other relevant target groups in this context are policymakers and the general public. A transformation of today's mobility towards more socially and economically sustainable behavior is prominent on most governments' agendas. A crucial aspect of scientific counselling is finding a widely understood language. A vocabulary listing colloquial terms may play the role of a "negotiator" between researchers and laypeople.

Related Work
Section 3.1 introduces standards for vocabulary development; Sect. 3.2 discusses best practices and tools. More than 100 domain-relevant vocabularies of different types are available for mobility and transport research. We discuss these types in Sect. 4, where we assess their suitability for re-use in SIS move using prototypical examples. Table 1 gives an overview of standards and data models from different disciplines. Two approaches to vocabulary development can be distinguished: concept-orientation and term-orientation. In term-orientation, an entry represents a single term referring to one or more concepts, all of which need to be described by the entry. Different term-oriented entries may describe the same concept. In concept-orientation, an entry represents a single concept which is referred to by one or more terms, all of which need to be described by the entry. Different entries may contain the same term. In terminology science, concept-orientation is complemented by term autonomy, the option to describe terms with term-specific data types [4].
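The difference between the two approaches can be made concrete with a small sketch. It stores concept-oriented entries and derives a term-oriented view from them, showing that one term may point to several concepts. The names `ConceptEntry` and `term_view`, the concept identifiers, and the second sense of "Transit" are invented for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptEntry:
    """Concept-oriented entry: one concept and all terms referring to it."""
    concept_id: str
    definition: str
    terms: tuple


def term_view(concept_entries):
    """Derive a term-oriented view: each term lists the concepts it may denote."""
    index = defaultdict(list)
    for entry in concept_entries:
        for term in entry.terms:
            index[term].append(entry.concept_id)
    return dict(index)


# Toy entries; the second concept is a hypothetical homonym of "Transit".
entries = [
    ConceptEntry("c1", "Transportation service [...]",
                 ("Public transit", "Mass transit", "Transit")),
    ConceptEntry("c2", "Passage of traffic through an area (invented sense)",
                 ("Through traffic", "Transit")),
]

by_term = term_view(entries)
print(by_term["Transit"])
```

In the concept-oriented list, "Transit" appears in two entries; in the derived term-oriented view, the single entry for "Transit" has to account for both concepts.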

Standards
The Linked Vocabularies for SIS move adhere to RDF, RDFS and OWL. To establish a controlled vocabulary, we follow concept-orientation as supported by TBX or SKOS. Many thesauri come as RDF-based SKOS or SKOS-like representations, so SKOS [17] was our first choice. We want to implement term autonomy, but modelling terms as entities requires an extension of SKOS, e.g. SKOS-XL. Some more sophisticated linguistic models are available as well, e.g. lemon or LexInfo, which provide means to describe terms as lexical entries. To describe terms adequately, both need to be accompanied by ontologies for lexical characteristics (e.g. part of speech, case, gender). Appropriate classes and properties can be found in ISOCat or the GOLD ontology.

Best Practices and Tools
Showcases that define best practices are another guidepost for vocabulary re-use and transformation. Practical considerations for transforming legacy vocabularies into semantic web resources often revolve around thesauri. For example, [18,19] discuss the modelling of an excerpt of the thesaurus Technology and Management (TEMA) [20]. SIS move follows some modelling decisions of this project, e.g. using SKOS and treating labels as entities, but not others, e.g. treating vocabulary concepts as instances of owl:Class. Similarly, [21] describes a semantic web version of the AGROVOC thesaurus and discusses the differences between an application of OWL and an application of SKOS. Here, the alignment of concepts to other vocabularies also plays a crucial role. Another project that focused on making several domain-specific legacy thesauri fit for the semantic web was FinnONTO. This project attempted not only to align a vocabulary with others but also to integrate thesauri describing different domains into a single resource (KOKO) [22,23]. The creation of KOKO is comparable to that of the Linked Vocabularies for SIS move: source vocabularies had to be transformed into a semantic web format, they had to be harmonized according to a common schema, and their shared contents had to be identified and mapped (cf. Sect. 5). Other aspects of FinnONTO are not transferable to SIS move, e.g. directly involving source vocabulary developers in the development of the Linked Vocabularies for SIS move. In contrast to the aforementioned projects, SIS move also tries to consider less structured sources that show lexical orientation rather than conceptual orientation. We are not aware of showcases for this kind of integration. The group working on KOKO also developed the Skosmos software stack, a tool for publishing interlinked SKOS vocabularies [24][25][26].

Finding the Right Vocabulary
We analyzed existing vocabularies covering mobility and transport research for qualitative criteria that can be quantitatively expressed. The first five apply to vocabularies in general, the latter two are subject-specific. None of our source vocabularies covers all criteria. With regard to exhaustivity and multidisciplinarity, traditional thesauri and authority files are a good source, e.g. the approximately 250,000 subject headings of the Integrated Authority File (GND) [32]. Due to cooperative maintenance by German-speaking libraries under the German National Library's editorial sovereignty, it covers a great number of concepts from several subjects. With regard to multilinguality, specificity and prominence, there are some limitations: The GND is mainly developed in German, but mapped automatically to the US-American and French national authority files, thereby gaining some degree of multilinguality (cf. MACS project [33]). Its subject headings are applied in subject indexing of literature. Historically and logistically, the use case scenario of in-depth domain-specific indexing is not covered by the GND. Since the vocabulary gradually opens to other resources and use cases, its role as a vocabulary hub for interlinked vocabularies is now discussed. The GND is distributed as RDF under a CC0 license. Mobility and transport experts, SIS move's primary target group, may not be aware of it, though.
Specialized thesauri for mobility and transport research include the Transportation Research Thesaurus (TRT) [34]. It is used for subject indexing of titles in the Transportation Research Board Publications Index [35] and is well established in the international research community. The subject-specific TRT has approx. 9,500 concepts. Nevertheless, the TRT is diverse regarding its topics. In addition to transportation and transportation operations, it includes topics like environment, economic and social factors, or materials. The same limitations regarding specificity and multilinguality that apply to the GND are visible here, though. Furthermore, the TRT comes in custom XML, and even though its creators encourage reuse, it is not under an explicit open license.
Ontologies are also candidates for the Linked Vocabularies for SIS move. They are knowledge representations that make use of formalized languages like RDF [8], RDFS [9] and OWL [10]. They are not topically exhaustive since they define small sets of concepts with explicit semantics based on description logic. Their main focus is on conceptual, not on linguistic description. Since ontologies are often developed in research projects, they are close to current research questions. Their degree of multidisciplinarity depends on their context of origin. Successful re-utilization depends on their developers' reputation and communication towards the community. We identified approximately 30 mobility- and transport-related ontologies in the scope of SIS move. For surveys of recent domain ontologies cf. [36][37][38]. Unfortunately, quite a number of ontologies from research are short-lived and never re-used [37]. Examples of active domain ontologies are the Transmodel Ontology [39], the Transport Disruption Ontology [40] and the extension of schema.org by the Automotive Ontology Community Group [41].
Terminologies, compared to ontologies, are almost always multilingual since their main area of application is the translation of technical documents: they are close to the text and therefore include many very specific concepts from their given domain. Since terminology databases are often created for corporate communication, especially product documentation, they may not include concepts relevant in research and development. As corporate knowledge, they are often not publicly available for reuse (even though their exchange is possible with TBX [6], a standardized XML dialect).
Another source is vocabularies from the scientific community that are unstructured with respect to the standards listed in Sect. 3.1. These are rather rare since scientific considerations about a field's terminology are usually semantically implicit parts of research papers and are not provided in dedicated digital records like online glossaries, e.g. [42]. Re-using semi- and unstructured vocabularies requires adaptation (cf. Sect. 5.1). Coming from the active research community, such vocabularies are very specific but may become out of date once their initial context of origin ceases to exist. Multilinguality is important for research-related vocabulary development because international communication of research results often requires researchers to use English and its terminology instead of their native language. Multidisciplinarity can also be a goal, but managing an exhaustive number of concepts for several domains is out of scope. Community vocabularies might not be openly available, for example because they are part of proprietary projects, or simply because open licenses are not considered.
In conclusion, none of the resources discussed here sufficiently support SIS move's services (cf. Sect. 2) on their own. We therefore re-use selected vocabularies and propose a multi-modular structure for SIS move's Thesaurus.

A Multi-modular Thesaurus for SIS Move
To meet the criteria (cf. Section 4) we build a system of linked vocabularies realized as an ontology importing several other ontologies. We started with [34] since it is a comparatively large thesaurus recommended by our community, and [42] since it is a glossary compiled by an active researcher. The overall structure of the Linked Vocabularies for SIS move is illustrated in Fig. 1. Two steps of preprocessing were necessary: 1) Choosing modelling paradigm and formats. Making vocabularies comparable is not just a file format conversion but requires harmonization of modelling approaches. Terminologies, thesauri and authority files are typically concept-oriented; glossaries and researcher resources term-oriented, an approach well known from dictionaries. RDF, RDFS and OWL provide means to formalize either. We use concept-orientation (cf. Sect. 5.1), semantic web formats and RDF serializations since ontologies and thesauri adhere to these, while terminologies are mostly offered in TBX and researcher vocabularies in unstructured formats (CSV, DOCX).
2) Mapping of data models and vocabulary content: Vocabularies may be described by individual, non-standardized schemata. The GND, for example, allows only elements defined by the GND Ontology [43]. We consider these schemata as structural modules in the Linked Vocabularies for SIS move whose function is the description of content modules. Schemata are often compatible, so we can identify equivalent classes and properties by mappings. Further, a mapping of content modules is needed: different vocabularies make statements about the same concept, and to get a full account of it, equivalent concepts in different vocabularies need to be mapped (cf. Sect. 5.2).

Skosification of Vocabularies
We demonstrate the transformation of a concept-oriented XML-based vocabulary and of a text-based term-oriented vocabulary into SKOS data.
Skosifying a Concept-Oriented Vocabulary. The aforementioned TRT is concept-oriented and each entry comprises several elements: obligatorily, a notation, a preferred term, and a superordinate concept; optionally, a definition and its reference, a scope note, alternative terms and subordinate or related concepts. In RDF, we would like to depict this information as a triple structure consisting of subject, predicate and object. Each TRT entry has at least one subject: the concept the entry is about. Modelling the TRT according to RDF requires aligning the TRT XML elements with RDF classes and properties. Since SKOS is an RDF-based standard for modelling controlled vocabularies, this is a straightforward task: nearly all TRT elements map to SKOS as shown in Table 2. TermInfo is modelled as a subclass of skos:Concept, which is an easy way to distinguish TRT concepts from other concepts.
Only TRT's <Date Added> element does not mirror a SKOS term and thus needs to be mapped to an element from a different schema, e.g. dcterms:date. All in all, with some slight adjustments the SKOS standard fits the TRT rather well. To semanticize the TRT in practice, we used the tool jxml2owl [45,46], a Java application which provides a GUI. We obtain the following triple structure for the TRT entry in Turtle (the definition is abbreviated and no namespace prefix is given for TRT content):

    :Aet a :TermInfo , skos:Concept ;
        dcterms:date "01/01/1999"^^xsd:string ;
        skos:prefLabel "Public transit"@en-us ;
        skos:altLabel "Local transit"@en-us , "Mass transit"@en-us , "Transit"@en-us ;
        skos:notation "Aet"^^xsd:string ;
        skos:definition "Transportation service [...]"^^xsd:string ;
        ftrts:source "AASHTO Glossary"^^xsd:string ;
        skos:related :Afma , :Afna .
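The element-to-property alignment can be sketched in a few lines of Python. This is not the jxml2owl workflow used above but a simplified stand-in: the XML element names (`Entry`, `Notation`, `PreferredTerm`, and so on) and the mapping table `ELEMENT_TO_PROPERTY` are assumptions made for illustration, since the actual TRT schema and Table 2 are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical alignment of XML elements to RDF properties.
ELEMENT_TO_PROPERTY = {
    "PreferredTerm": "skos:prefLabel",
    "AlternativeTerm": "skos:altLabel",
    "Definition": "skos:definition",
    "RelatedConcept": "skos:related",
    "DateAdded": "dcterms:date",  # no SKOS counterpart, so Dublin Core is used
}
LANGUAGE_TAGGED = {"skos:prefLabel", "skos:altLabel"}
OBJECT_PROPERTIES = {"skos:related"}  # values are concept identifiers

SAMPLE_ENTRY = """
<Entry>
  <Notation>Aet</Notation>
  <PreferredTerm>Public transit</PreferredTerm>
  <AlternativeTerm>Local transit</AlternativeTerm>
  <AlternativeTerm>Mass transit</AlternativeTerm>
  <RelatedConcept>Afma</RelatedConcept>
  <DateAdded>01/01/1999</DateAdded>
</Entry>
"""


def literal(value, lang=None):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


def entry_to_triples(xml_text, lang="en-us"):
    """Turn one XML entry into (subject, predicate, object) triples."""
    entry = ET.fromstring(xml_text.strip())
    notation = entry.findtext("Notation").strip()
    subject = f":{notation}"
    triples = [
        (subject, "rdf:type", "skos:Concept"),
        (subject, "skos:notation", literal(notation)),
    ]
    for child in entry:
        prop = ELEMENT_TO_PROPERTY.get(child.tag)
        if prop is None:
            continue  # element has no alignment (or was handled above)
        value = (child.text or "").strip()
        if prop in OBJECT_PROPERTIES:
            obj = f":{value}"
        elif prop in LANGUAGE_TAGGED:
            obj = literal(value, lang)
        else:
            obj = literal(value)
        triples.append((subject, prop, obj))
    return triples


triples = entry_to_triples(SAMPLE_ENTRY)
for s, p, o in triples:
    print(s, p, o, ".")
```

The sketch only covers the explicit elements of an entry; a production conversion would more likely rely on an RDF library for serialization and datatype handling.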
Missing both in the XML entry and in the Turtle entry are the hierarchical relations of the thesaurus. These are implicitly expressed via the notation but not via an XML attribute, therefore these relations could not be obtained with jxml2owl. To make all information in the TRT explicit, we had to do some post-processing with Protégé [47]: With a SPARQL query [48] that compares the notation strings, the implicit hierarchical relations can be added as triples with the respective concept as their subject.
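The idea of recovering hierarchy from notation strings can be sketched outside of SPARQL as well. The following Python fragment assumes that a concept's broader concept is the one whose notation is the longest proper prefix of its own notation; this prefix rule is our reading of "implicitly expressed via the notation" and should be checked against the TRT documentation. The function name `broader_from_notations` and the notations `A`, `Ae` and `Af` are illustrative.

```python
def broader_from_notations(notations):
    """Return (narrower, broader) pairs inferred from notation prefixes."""
    known = set(notations)
    pairs = []
    for notation in sorted(known):
        # Try ever shorter prefixes until one matches an existing notation.
        for cut in range(len(notation) - 1, 0, -1):
            prefix = notation[:cut]
            if prefix in known:
                pairs.append((notation, prefix))
                break
    return pairs


# "Aet" and "Afma" appear in the example entry; the others are invented.
hierarchy = broader_from_notations(["A", "Ae", "Aet", "Af", "Afma"])
for narrower, broader in hierarchy:
    print(f":{narrower} skos:broader :{broader} .")
```

Taking the longest matching prefix rather than simply dropping the last character keeps the sketch robust when intermediate levels are missing, as with the invented `Afma`, which attaches to `Af` here.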
Skosifying a Term-Oriented Vocabulary. Skosifying a term-oriented vocabulary, e.g. the Glossary of Railway Operation and Control [42], is not as straightforward as skosifying a thesaurus: it needs remodeling as a concept-oriented resource. In term-oriented resources, conceptual information is "spread" over several entries and usually not explicitly identified or related. Making the informal connections between entries explicit requires intellectual processing, e.g. comparing concept descriptions, consulting external resources, or interpreting informal cues, such as:
- The phrase "another term for" indicates synonymy: different terms refer to the same concept and can be subsumed in the same conceptual entry.
- The use of defined terms in term descriptions indicates conceptual relations. For example, cyclic timetable in [42] is defined as "a timetable in which trains that belong to the same route are scheduled with fixed time intervals between their train paths". The use of the term timetable in the definition indicates a relation of super-/sub-ordination between cyclic timetable and timetable.
- The phrase "see also" indicates several conceptual relation types: super-/sub-ordination, coordination, association. It is not used consistently in [42], e.g. the entries for gravity yard and hump yard should make mutual reference to each other since they are coordinate concepts.
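Some of these cues can be pre-screened automatically before intellectual review. The sketch below is a hedged illustration, not the procedure applied to [42]: it groups entries marked with "another term for" under their target entry and lists defined terms that occur in other definitions as candidates for a conceptual relation. The function names, the entry for "regular-interval timetable" and the description of "timetable" are invented placeholders.

```python
import re

SYNONYM_CUE = re.compile(r"another term for\s+(?P<target>[^.;,]+)", re.IGNORECASE)


def merge_synonyms(glossary):
    """Group term-oriented entries into concepts: preferred term -> synonyms."""
    concepts, pending = {}, []
    for term, description in glossary.items():
        match = SYNONYM_CUE.search(description)
        if match:
            pending.append((term, match.group("target").strip()))
        else:
            concepts[term] = []
    for term, target in pending:
        if target in concepts:
            concepts[target].append(term)
        else:
            concepts[term] = []  # target unknown: keep for manual review
    return concepts


def relation_candidates(glossary):
    """List defined terms that occur in another entry's description."""
    candidates = {}
    for term, description in glossary.items():
        if SYNONYM_CUE.search(description):
            continue  # synonym entries carry no definition of their own
        hits = [
            other for other in glossary
            if other != term
            and re.search(r"\b" + re.escape(other) + r"\b", description, re.IGNORECASE)
        ]
        if hits:
            candidates[term] = hits
    return candidates


glossary = {
    "timetable": "a plan of scheduled train movements",  # invented placeholder
    "cyclic timetable": "a timetable in which trains that belong to the same "
                        "route are scheduled with fixed time intervals between "
                        "their train paths",
    "regular-interval timetable": "Another term for cyclic timetable.",  # invented
}

concepts = merge_synonyms(glossary)
candidates = relation_candidates(glossary)
print(concepts)
print(candidates)
```

The output only proposes relations. Whether a term occurring in a definition really signals super-/sub-ordination, as with timetable and cyclic timetable, still has to be decided by a human editor.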
Once the conceptual structure had been implemented, the glossary was skosified in the same way as the concept-oriented TRT.

Mapping of Vocabularies
The mapping of vocabularies is necessary from two practical viewpoints: First, vocabularies can be structured by non-standardized schemata with comparable, but differently named classes. Here, a mapping of structural modules (see the beginning of this section) is needed; e.g. to retrieve all labels from the Linked Vocabularies for SIS move with a single, simple query, proprietary label properties should be mapped to standardized label properties (e.g. gndo:preferredNameForTheSubjectHeading to skos:prefLabel). Data models of legacy thesauri are, however, often more expressive than SKOS, so that the issue is not just a naming difference but a matter of (semantic) incompatibility requiring more elaborate standards, as also observed by [49]. Second, vocabularies may have "conceptual overlaps", i.e. any two vocabularies can describe the same concept. Here, a mapping of content modules is needed. We decided to map domain-specific vocabularies (starting with the TRT [34] and the Glossary of Railway Operation and Control [42]) to GND subject headings, since the GND is the standard vocabulary for subject indexing in German-speaking libraries (cf. Sect. 4). The design of the sources influenced our choice of the mapping mechanism. Since there is no third-party vocabulary with existing mapping information to both the TRT and the Glossary of Railway Operation and Control, transitivity-based inference was not applicable as a mapping mechanism. An approach based on (semantic) word similarity (e.g. embeddings, cf. [50]) is also not promising, since the vocabulary entries often do not contain definitions or other information that would allow us to create a sense-specific semantic representation of the entry. This sparsity forced us to use a naïve label-based mapping approach. For this, we first had to make the labels comparable: the TRT is in American English and the GND is in German, so a translation was required. We used DeepL [51] to generate German labels for the smaller TRT.
Second, we needed to compare the labels and set a skos:mappingRelation between all concepts with string-identical labels. This simple mapping technique had the advantage that it could be done with a SPARQL CONSTRUCT query in Protégé. With this query, around 4,400 mappings between TRT and GND were created. Due to the ambiguity of terms, these relations are, however, only proposals. Each match has to be validated for correctness within the domain and confirmed as one of the sub-properties of skos:mappingRelation (e.g. skos:exactMatch or skos:closeMatch). We are currently conducting an intellectual mapping review process in which mapping proposals are evaluated against unstructured semantic information from the mapped vocabularies. Since the semantic information of vocabulary items is sometimes sparse, validation also needs to rely on external sources defining domain-specific concepts. In cases where mappings cannot be confirmed, validation is supported by subject specialists at TIB. This process is slow but will result in high-quality mappings. Furthermore, it helps identify conceptual gaps within the GND that could be addressed by the creation of new domain-specific concepts in this central resource for subject indexing in the library landscape in Germany.
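The label-based proposal step can be approximated in plain Python. This is a minimal sketch of matching string-identical labels, not the SPARQL CONSTRUCT query run in Protégé; the function `propose_mappings`, the German labels, and the identifiers `gnd:X1` to `gnd:X3` are hypothetical placeholders rather than real translations or GND records.

```python
from collections import defaultdict


def propose_mappings(source_labels, target_labels):
    """Propose a mapping for every pair of concepts sharing an identical label."""
    by_label = defaultdict(set)
    for target_id, labels in target_labels.items():
        for label in labels:
            by_label[label].add(target_id)
    proposals = set()
    for source_id, labels in source_labels.items():
        for label in labels:
            for target_id in by_label.get(label, ()):
                proposals.add((source_id, "skos:mappingRelation", target_id))
    return sorted(proposals)


# Invented data: translated labels for one TRT concept and three target headings.
trt_german = {"trt:Aet": ["Öffentlicher Verkehr", "Transit"]}
gnd = {
    "gnd:X1": ["Öffentlicher Verkehr"],
    "gnd:X2": ["Transit"],
    "gnd:X3": ["Verkehrsplanung"],
}

proposals = propose_mappings(trt_german, gnd)
for subject, predicate, obj in proposals:
    print(subject, predicate, obj, ".")
```

In this toy case one source concept receives two proposals because its labels match different target headings, which illustrates why each proposal needs review before it is narrowed to a specific mapping property.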

Summary and Outlook
This paper introduces SIS move's strategy for the creation of a domain-specific network of linked vocabularies to be used in an information discovery system for transport and mobility research. Instead of re-inventing the wheel, we build on existing vocabularies established in the mobility and transport research community. The network currently comprises three sources: the Integrated Authority File [32], the Transportation Research Thesaurus [34] and the Glossary of Railway Operation and Control [42]. The ongoing mapping process has resulted in about 2,800 semantically confirmed mappings so far. The mappings between single vocabularies resulting from the SIS move project will be made publicly available for re-use in other projects [52]. Eventually, the semantic integration of sources will result in an aggregated resource that addresses the criteria introduced in Sect. 4 better than the single vocabularies do on their own. Single vocabularies can be updated regularly and new community-driven sources (e.g. MobiVoc [44]) can be integrated on demand. In the future, we also want to enable domain experts to participate in thesaurus development. However, we do not expect them to learn semantic web languages, or to work with ontology editors or SPARQL endpoints. The threshold for participation should be low and making a contribution straightforward. The strategy for vocabulary re-use of SIS move will also be applied in future projects, e.g. the newly established SIS for Civil Engineering, Architecture and Urban Studies [53].
On a technical level, this paper addressed the challenges of creating a thesaurus from different sources. It gave an overview of the range of standards related to distinct types of resources like thesauri, ontologies and controlled vocabularies. It focused on the topic of bringing unstructured vocabularies to machine-readable formats: The effort of intellectual processing is higher when sources are term-oriented, when they have unstructured formats and when their data model is not standardized.
Last, we would like to give an outlook on the thesaurus' future applications in SIS move's service portfolio [1]. Its main strength lies in supporting exploratory search utilizing thesaurus relations. Here, exploration has two dimensions. First, researchers may want to explore the concept system itself to find information on a concept or on a term (e.g. a definition or an equivalent expression). Second, exploring the thesaurus structure helps researchers to discover information on publications, research data and researchers that are linked with the thesaurus concepts. Not all SIS move resources are linked in this way yet, but automatic indexing is planned with Annif [54]. Thesaurus data will be employed in functionalities of SIS move's discovery system, for example search term expansion (based on synonyms, equivalent terms, or super-/sub-ordinate terms) and autocomplete (based on thesaurus terms). An evaluation of the Linked Vocabularies for SIS move in information retrieval is still pending since the integration is ongoing. In all implementations, thesaurus/user interactions need to be as effortless as possible, e.g. by offering user-friendly visualizations of thesaurus information.