11.3 TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing
Description
The German National Library of Science and Technology (TIB) aims to promote the use and distribution of its collections. To this end, the TIB consistently foregrounds semantic web technologies, which ensure interoperability of metadata, allow for advanced methods of information retrieval (e.g. semantic and cross-lingual search), and improve ease of access to library holdings. Accordingly, the TIB publishes its extensive metadata on the scientific videos provided by the TIB AV-Portal as linked open data. This data is expressed in the standardised Resource Description Framework (RDF) model and comprises both authoritative and automatically extracted metadata.
The latter is generated through different algorithms analysing (1) superimposed text, (2) speech, and (3) visual content of the portal’s videos. In addition, the analytical results are mapped against common authority files and knowledge bases via a process of automated named entity linking (NEL) to facilitate reuse as well as interlinking of information.
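To make the shape of such automatically extracted metadata more concrete, the following minimal sketch represents one linked annotation as RDF-style triples and serialises it as N-Triples. It is an illustration only: the `http://example.org/` namespace, the predicate names, and the helpers `link_annotation` and `to_ntriples` are hypothetical and are not the vocabulary or code actually used by the TIB AV-Portal.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    obj_is_literal: bool = False


# Hypothetical namespace; the portal's real vocabulary is not given in the text.
EX = "http://example.org/"


def link_annotation(video_id: str, segment: int, source: str,
                    surface_form: str, entity_iri: str) -> list[Triple]:
    """Describe one extracted term and the authority entity it was linked to."""
    annotation = f"{EX}video/{video_id}/annotation/{segment}"
    return [
        Triple(annotation, f"{EX}vocab/partOf", f"{EX}video/{video_id}"),
        # Assumed source labels: "text", "speech" or "visual".
        Triple(annotation, f"{EX}vocab/extractedFrom", source, True),
        Triple(annotation, f"{EX}vocab/surfaceForm", surface_form, True),
        Triple(annotation, f"{EX}vocab/linkedEntity", entity_iri),
    ]


def to_ntriples(triples: list[Triple]) -> str:
    """Serialise triples as N-Triples lines (basic literal escaping only)."""
    lines = []
    for t in triples:
        if t.obj_is_literal:
            escaped = (t.obj.replace("\\", "\\\\")
                            .replace('"', '\\"')
                            .replace("\n", "\\n"))
            obj = f'"{escaped}"'
        else:
            obj = f"<{t.obj}>"
        lines.append(f"<{t.subject}> <{t.predicate}> {obj} .")
    return "\n".join(lines)


if __name__ == "__main__":
    triples = link_annotation("123", 1, "speech", "relativity",
                              f"{EX}authority/relativity")
    print(to_ntriples(triples))
```

Keeping the surface form and the linked entity as separate statements means that a wrong link can later be corrected without discarding what the algorithm actually detected.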
However, publishing data in ways that ensure interoperability and machine readability is only one aspect of the problem. Despite rapid advances in the field of machine learning, content mining and automated metadata extraction still pose a significant challenge to libraries. This is mainly due to the varying quality of primary materials and ontologies as well as inherent ambiguities, which may impede the correct detection of named entities and their linking. For example, the results of automated speech recognition strongly depend on sound quality and pronunciation. Accordingly, metadata extracted by algorithms still exhibits varying degrees of accuracy.
Therefore, the TIB is exploring novel ways to improve metadata that is both derived from content mining and published as linked open data. To achieve this, a service combining direct user interaction with RDF data and semi-automatic NEL is being developed and will be implemented in the TIB AV-Portal. On the one hand, this will allow for interactive editing, extension, and correction of the analytical metadata and enable staff and users to manually improve overall data quality. On the other hand, users will be provided with suggestions to support manual correction and expansion of the NEL results.
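The suggestion step of such a semi-automatic workflow could look roughly like the sketch below, which ranks candidate entities for a (possibly misrecognised) term by string similarity so that a user can confirm or reject them. This is an assumption-laden illustration, not the planned service: the in-memory `AUTHORITY` dictionary stands in for a real authority file, and the function `suggest_entities`, its parameters, and the similarity threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for an authority file: entity IRI -> preferred label.
AUTHORITY = {
    "http://example.org/authority/1": "Machine learning",
    "http://example.org/authority/2": "Machine translation",
    "http://example.org/authority/3": "Speech recognition",
}


def suggest_entities(surface_form, authority=AUTHORITY, limit=3, min_score=0.5):
    """Return up to `limit` (iri, label, score) candidates, best match first."""
    scored = []
    for iri, label in authority.items():
        # Case-insensitive similarity ratio between 0.0 and 1.0.
        score = SequenceMatcher(None, surface_form.lower(), label.lower()).ratio()
        if score >= min_score:
            scored.append((score, iri, label))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(iri, label, round(score, 2)) for score, iri, label in scored[:limit]]


if __name__ == "__main__":
    # A misspelt term, as speech recognition might produce it.
    for iri, label, score in suggest_entities("machine lerning"):
        print(f"{score:.2f}  {label}  <{iri}>")
```

In this sketch the algorithm only proposes candidates; the decision to accept a link stays with the user, which mirrors the division of labour described above.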
In our presentation, we would like to introduce a possible approach for such a service, which ensures the publication of high-quality, interoperable metadata. In the first part, we will briefly discuss our experiences with data derived from content mining and NEL and stored as RDF triples. Building on this, the second part will highlight in greater detail the common challenges of semi-automatic procedures for improving data quality and of user involvement. In the concluding part, we will present possible scenarios for implementing and integrating this kind of service into the TIB AV-Portal.
Files
| Name | Size | Checksum |
|---|---|---|
| 11.3.pdf | 1.4 MB | md5:9a7a788bdfd826a6288f03ddb2379232 |