Published February 26, 2025 | Version v1
Poster Open

From Idea to Prototype: Using BITS and LLMs to automate the annotation process for SGN Collection Data

  • 1. Senckenberg Research Institute and Natural History Museum Frankfurt/M
  • 2. Deutsches Klimarechenzentrum

Description

We present a workflow at SGN that combines the outcome of BITS (https://projects.tib.eu/bits/home), i.e. the ESS collection of the TIB TS, with GPT4all. The aim is twofold: to identify gaps in terminologies, and to assist scientists who are working on new collections. SGN faces two major data management challenges: Legacy Data Digitisation (historically grown data require systematic transformation into machine-readable formats) and Data Proliferation Management (continuous input of data generated by ongoing collection efforts and research activities). Based on these challenges, our prototyping process can be divided into several areas:

  • Identifying nominal phrases (NPs) in the collection data and annotating them using BITS TS. Our primary goal was to achieve reliable detection, with a focus on minimising false negatives, while accepting some false positives during annotation.

  • During the prototyping phase, we encountered several obstacles: poor NP detection quality in scientific texts, and unreliable conjunction splitting and singularization with common tools. It is also not always possible to determine the correct language of a text, especially with mixed-language content.

  • Revising our requirements led us to choose GPT4all as our preferred solution, specifically with the Meta-Llama-3-8B-Instruct.Q4_0.gguf model.

  • This allows us to perform high-quality NP detection and transformation, but at a very high cost in computation and time. To optimise resource utilisation, GPT4all is employed only for high-level operations; other operations are performed by tools with lower hardware requirements.

  • Statistical logging allows us to collect significant information about NP detection and usage, which we can reuse in later development steps.
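
The division of labour described in the list above can be sketched as follows. This is a minimal illustration and not the SGN implementation: the names `annotate`, `normalise`, `toy_extractor`, and the `TERMINOLOGY` dictionary with its `ex:` identifiers are hypothetical, and the real BITS TS lookup is not shown. The NP extractor is passed in as a callable, so an LLM-backed function (for example one built on GPT4all) could take the place of the toy splitter, while normalisation, lookup, and logging stay on cheap standard-library code.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical stand-in for a terminology lookup; identifiers are invented.
TERMINOLOGY: Dict[str, str] = {
    "sediment core": "ex:0001",
    "foraminifera": "ex:0002",
}


def normalise(np_text: str) -> str:
    """Cheap step without an LLM: lower-case and collapse whitespace."""
    return " ".join(np_text.lower().split())


def toy_extractor(text: str) -> List[str]:
    """Placeholder for the expensive NP detection step; splits on commas only."""
    return [part for part in text.split(",") if part.strip()]


def annotate(
    records: Iterable[str],
    extract_nps: Callable[[str], List[str]],
    terminology: Dict[str, str],
    stats: Counter,
    gaps: Counter,
) -> List[Tuple[str, Optional[str]]]:
    """Annotate every detected NP and log simple statistics.

    Unmatched NPs are kept (with None as identifier) and counted in `gaps`,
    so that candidate terminology gaps are recorded rather than dropped.
    """
    annotations: List[Tuple[str, Optional[str]]] = []
    for record in records:
        for raw in extract_nps(record):
            stats["np_detected"] += 1
            term = normalise(raw)
            term_id = terminology.get(term)
            if term_id is None:
                stats["np_unmatched"] += 1
                gaps[term] += 1
            else:
                stats["np_matched"] += 1
            annotations.append((term, term_id))
    return annotations
```

Calling `annotate(records, toy_extractor, TERMINOLOGY, Counter(), Counter())` returns a list of `(term, identifier)` pairs. The `gaps` counter then lists the phrases that had no match, which is one simple way to surface candidate gaps in a terminology.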

By leveraging the strengths of BITS and GPT4all, SGN is paving the way for more accurate processing of complex scientific data to improve research outcomes.

Files

BITS_deRSE25_public.pdf (1.9 MB)
md5:4c590a1beb03e9228578cbcb47539273

Additional details

Funding

Deutsche Forschungsgemeinschaft
BITS - BluePrints for the Integration of Terminology Services in Earth System Sciences (project no. 508107981)