Published October 18, 2024 | Version v1
Presentation · Open Access

Closing the loop: integrating enriched metadata into collections platforms

  • British Library

Description

The AI4LAM community calls and annual Fantastic Futures conferences present a rich picture of the many ways that people are using AI and machine learning to enhance collections records. But what happens to the data generated by these projects? Ingesting enhanced data appropriately and responsibly into collections management and discovery platforms (hereafter 'CMS') is a key factor in operationalising AI to improve collections documentation and discovery - but it is often difficult in practice. We present original research into the extent to which projects can integrate enriched data into collections platforms, reporting on the factors behind the success or failure of data integration shared by over 60 projects.

 

Our research is inspired by the struggle many crowdsourcing and participatory projects have in integrating data created or enriched by online volunteers into CMS (Ridge, Ferriter, and Blickhan 2023). Early signs indicate that many projects using machine learning / AI tools to enhance or create data will face the same issues. For example, Anna Kasprzik (2023) described issues transferring applied research in automated subject indexing to productive services, and Matthew Krc and Anna Oates Schlaack (2023) shared the complexities of building and sustaining an automated workflow within a library. 

 

Enriched metadata is 'data about a cultural heritage object ... that augment, contextualise or rectify the authoritative data made available by cultural heritage institutions' (Europeana Initiative 2023). Enrichments include identifying and linking entities to vocabularies, creating or correcting transcriptions or metadata, and annotating items with additional information. Ideally, the origin of these enrichments would be visible to internal and external users of CMS (and even more ideally, crowdsourcing volunteers would be credited for their contributions). 
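To make this concrete, the sketch below shows one way an enrichment might carry its own provenance so that its origin stays visible to users of a CMS. This is a hypothetical illustration; the field names are invented for this example and are not a format prescribed by the Europeana policy or the projects surveyed.

```python
# Hypothetical sketch of an enrichment record that keeps its provenance
# visible alongside the enriched value. Field names are illustrative,
# not a standard.

def make_enrichment(record_id, field, value, source, agent):
    """Bundle an enriched value with data about who or what produced it."""
    return {
        "record_id": record_id,  # identifier of the catalogue record
        "field": field,          # which field is augmented, contextualised or corrected
        "value": value,          # the enriched value itself (e.g. a vocabulary URI)
        "source": source,        # e.g. "crowdsourcing" or "ml-model"
        "agent": agent,          # volunteer credit or model name/version
    }

enrichment = make_enrichment(
    record_id="example-ms-12345",
    field="subject",
    value="https://id.loc.gov/authorities/subjects/sh85002782",
    source="crowdsourcing",
    agent="volunteer:anon-042",
)
```

Keeping `source` and `agent` on each enrichment is what would let a CMS display how a record was enhanced, and credit volunteers for their contributions.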

 

In early 2024 we designed a survey to understand the barriers and successes for projects incorporating enriched data into CMS, and the types of data, tools and processes used. Our survey was available at https://forms.gle/JgArpbL6VNM6W3Vk9 from March 20 to April 30 (3 responses came in after the official cut-off date but are included in our analysis). We designed the survey to be as inclusive as possible, encouraging past project members and those with long-completed or in-progress projects to respond, and allowing more than one person to respond per project (to allow for the partial nature of information sharing in large or complex projects). 

 

After piloting with previous collaborators and in response to early feedback, our survey invitation said 'We're interested in the experiences of anyone who's worked on crowdsourcing or machine learning projects to enrich collections data. … The survey is particularly designed for people working in collecting institutions (libraries, archives, museums, etc) with their own catalogues, but we also welcome responses from projects that create or enrich data through e.g. research or community projects working with data from GLAMs, or 'roundtripping' records to return enhanced data to a catalogue'.

 

As the survey closed not long before this Call for Proposals, we are still analysing the results, but we are able to share some early data below. Our analysis will be complete before the FF24 conference.

 

Overall, we received 63 responses. 20 responses were from the UK, 13 from the USA, 4 from multiple European countries, 3 each from China, Sweden and the Netherlands. 2 or fewer responses were received from Australia, Belgium, Denmark, Finland, France, Hong Kong, Ireland, Italy, Spain and Switzerland. 

 

Mindful of FF24's multicultural communities lens, we devoted much of our outreach effort to reaching potential participants outside the usual Anglo-American and English-language circles. However, we are conscious that we did not receive any responses from Latin America, South Asia or Africa. We hope that a future, funded project would enable collaboration with groups in those areas, with a survey translated and localised for each region to encourage wider participation (and we would be particularly interested in discussing this with conference participants). Approximately half the projects were in languages other than English, but all responses were provided in English.

 

43% of responses were about crowdsourcing projects, 24% about machine learning / AI projects, and 22% about projects that combined crowdsourcing and machine learning. 

 

The largest share of responses came from libraries (35%), followed by museums and archives (both 9.5%); other projects were based in universities, non-profit organisations and combined services. More large organisations responded than small ones: 33% of responses came from organisations with more than 500 paid employees, 30% with 100–499 employees, 18% with 5–19 employees and 11% with 1–4 employees.

 

Many respondents (42% in total) were able to ingest enriched data to some extent: 20% could ingest both new and updated records, 8% could only ingest new records, another 8% were partially able to ingest records (for a range of reasons) and 6% could only ingest updated records. 22% of respondents were not able to ingest enriched data, and 21% were still planning or hoping to complete ingest. 

 

Barriers to ingesting enriched data included lack of technical skills, the restrictions of formats such as MARC, and the inability to ingest 'third party' metadata or transcriptions. Issues reported included gatekeeping and institutional politics, lack of staff time (e.g. for technical processes, quality control and data cleaning), erroneous or incomplete data not meeting required standards, and data replication problems.

 

Responses about factors important for successful ingest often began with the word 'agreement', including collaboration between departments, organisations, and with volunteers, on topics such as conditions for data re-use, specifications and standards, and the distribution of work between teams. Initial analysis suggests that the use of APIs, standard formats, data standards and controlled vocabularies contribute to success by reducing the overhead of creating a pipeline of import/exports across platforms and tools. 
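One pipeline stage suggested by these findings can be sketched as code: checking exported enrichments against a controlled vocabulary before they are imported back into the CMS, so that only conformant values flow through automatically. The vocabulary, record shapes and function below are invented for illustration, not taken from any surveyed project.

```python
# Minimal sketch of a pre-import validation stage. A real pipeline would
# fetch the vocabulary from an authority service via an API; here it is a
# hard-coded example set.

CONTROLLED_VOCAB = {"manuscripts", "maps", "photographs"}

def partition_for_import(enriched_records, vocab=CONTROLLED_VOCAB):
    """Split records into importable and needs-review, based on whether
    the enriched subject term is in the controlled vocabulary."""
    importable, needs_review = [], []
    for rec in enriched_records:
        if rec.get("subject") in vocab:
            importable.append(rec)
        else:
            needs_review.append(rec)  # e.g. free-text terms routed to manual checking
    return importable, needs_review

records = [
    {"id": "r1", "subject": "maps"},
    {"id": "r2", "subject": "old pictures"},  # not in the vocabulary
]
ok, review = partition_for_import(records)
```

Agreeing on the vocabulary and record format up front is precisely the kind of cross-team 'agreement' respondents described: it is what makes a stage like this cheap to build and maintain.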

 

At least 63% of projects had some manual or automated quality assurance processes for enriched data. 13% had no process, and some projects were still deciding on their processes. 87% of respondents provided more information on what 'data quality' meant for their project; analysis is ongoing. 29% of projects using machine learning (17 respondents) reported that corrections to the data help improve the model. The ability of systems and workflows to display information on crowdsourcing- or machine learning-enhanced records to staff or the public was very mixed; analysis is ongoing.
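As a simple illustration of the kind of automated quality assurance pass that many projects reported having in some form, the sketch below flags transcriptions that fail basic checks. The specific rules are placeholders invented for this example, not the respondents' actual criteria.

```python
# Hypothetical automated QA pass over crowdsourced or machine-generated
# transcriptions. Each check is an illustrative placeholder; real projects
# would tune rules to their material and standards.

def qa_flags(transcription):
    """Return a list of quality-assurance flags for one transcription."""
    flags = []
    if not transcription.strip():
        flags.append("empty")
    elif len(transcription) < 5:
        flags.append("suspiciously-short")
    if transcription.isupper():
        flags.append("all-caps")  # often an OCR or data-entry artefact
    return flags
```

Records with flags could be routed back to volunteers or staff for review; for machine learning projects, the resulting corrections are the kind of feedback that some respondents said helps improve the model.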

 

When we began this research, we thought that the affordances of the CMS used by GLAMs would be a significant factor in the successful integration of enriched data into these CMS. However, our results so far indicate that skills, resources, and inter-personal and institutional relationships are also significant.

 

Bibliography

Europeana Initiative. 2023. ‘Enrichments Policy for the Common European Data Space for Cultural Heritage’. Europeana Initiative. https://pro.europeana.eu/post/enrichments-policy-for-the-common-european-data-space-for-cultural-heritage.

Kasprzik, Anna. 2023. ‘Automating Subject Indexing at ZBW: Making Research Results Stick in Practice’. LIBER Quarterly: The Journal of the Association of European Research Libraries 33 (1). https://doi.org/10.53377/lq.13579.

Krc, Matthew, and Anna Oates Schlaack. 2023. ‘Pipeline or Pipe Dream: Building a Scaled Automated Metadata Creation and Ingest Workflow Using Web Scraping Tools’. The Code4Lib Journal, no. 58 (December). https://journal.code4lib.org/articles/17932.

Ridge, Mia, Meghan Ferriter, and Samantha Blickhan. 2023. ‘Recommendations, Challenges and Opportunities for the Future of Crowdsourcing in Cultural Heritage: A White Paper’. https://doi.org/10.21428/a5d7554f.2a84f94b.

 
