Published June 12, 2025 | Version v1

Digitizing and Structuring Early Marine Biodiversity Records: A GraphRAG-Based Methodology

  • 1. Center for the Study of Marine Systems (CESIMAR-CONICET), Puerto Madryn, Chubut, Argentina
  • 2. Computer Science Research Laboratory (LINVI-UNPSJB), Puerto Madryn, Chubut, Argentina
  • 3. Faculty of Information Technology (FTI-UAI), Buenos Aires, Argentina
  • 4. Research and Development Laboratory in Software Engineering and Information Systems (LISSI-DCIC-UNS), Bahía Blanca, Buenos Aires, Argentina

Description

This work presents a method for generating and refining a knowledge graph (KG) from a historically significant 20th-century marine biology text. The source, a foundational ecological survey, was digitized using OCR and processed for semantic consistency. Knowledge extraction was performed with GraphRAG and the GPT-4o-mini model, producing an initial KG with verbose and inconsistent relationships. To improve clarity and alignment with semantic web standards, a two-step refinement process was applied, combining automated tuning and prompt engineering. The result was a set of concise, RDF-style predicates suitable for querying and integration with ontological frameworks. The refined KG is accessible via a public platform supporting multilingual natural language queries, enabling broader use of historical ecological data. This approach highlights the potential of AI-assisted pipelines to transform legacy scientific texts into semantically structured, interoperable resources for biodiversity research.
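
As an illustration of what "concise, RDF-style predicates" can look like, the following is a minimal sketch, not the authors' pipeline. The description above says the refinement relied on automated tuning and prompt engineering with GPT-4o-mini; the rule-based normalizer here is a swapped-in stand-in used only to show the shape of the transformation. The names `Triple`, `STOP_WORDS`, `to_predicate`, and `to_ntriple`, the `http://example.org/` base IRI, and the placeholder entities are all assumptions made for this example.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    """A subject-predicate-object statement (hypothetical container)."""
    subject: str
    predicate: str
    obj: str


# Assumed filler words to drop from a verbose relationship description.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were",
              "of", "that", "which", "been", "has", "have"}


def to_predicate(description: str, max_words: int = 3) -> str:
    """Condense a verbose relationship description into a camelCase predicate."""
    words = re.findall(r"[A-Za-z]+", description.lower())
    kept = [w for w in words if w not in STOP_WORDS][:max_words]
    if not kept:
        return "relatedTo"  # assumed fallback when nothing informative remains
    return kept[0] + "".join(w.capitalize() for w in kept[1:])


def to_ntriple(triple: Triple, base: str = "http://example.org/") -> str:
    """Serialize a triple as one N-Triples line under an assumed base IRI."""
    def iri(local: str) -> str:
        # Replace runs of non-alphanumeric characters so the local name is IRI-safe.
        safe = re.sub(r"[^A-Za-z0-9]+", "_", local).strip("_")
        return f"<{base}{safe}>"
    return f"{iri(triple.subject)} {iri(triple.predicate)} {iri(triple.obj)} ."


# Placeholder entities, not taken from the digitized text.
verbose = Triple("Species A", "was observed feeding on", "Species B")
refined = verbose._replace(predicate=to_predicate(verbose.predicate))
print(to_ntriple(refined))
```

A real refinement step would also need to reconcile synonymous predicates and map them to terms from an existing ontology, which this sketch does not attempt.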

Files

GOBLIN25_Camera-Ready_21.pdf

Size: 383.3 kB
md5:a413be8238f79539c1d888ed789a6316

Additional details

Funding

European Cooperation in Science and Technology
GOBLIN: Global Network on Large-Scale, Cross-domain and Multilingual Open Knowledge Graphs (CA23147)