Digitizing and Structuring Early Marine Biodiversity Records: A GraphRAG-Based Methodology
Authors/Creators
- 1. Center for the Study of Marine Systems (CESIMAR-CONICET), Puerto Madryn, Chubut, Argentina
- 2. Computer Science Research Laboratory (LINVI-UNPSJB), Puerto Madryn, Chubut, Argentina
- 3. Faculty of Information Technology (FTI-UAI), Buenos Aires, Argentina
- 4. Research and Development Laboratory in Software Engineering and Information Systems (LISSI-DCIC-UNS), Bahía Blanca, Buenos Aires, Argentina
Description
This work presents a method for generating and refining a knowledge graph (KG) from a historically significant 20th-century marine biology text. The source, a foundational ecological survey, was digitized using OCR and processed for semantic consistency. Knowledge extraction was performed with GraphRAG and the GPT-4o-mini model, producing an initial KG with verbose and inconsistent relationships. To improve clarity and alignment with semantic web standards, a two-step refinement process was applied, combining automated tuning and prompt engineering. The result was a set of concise, RDF-style predicates suitable for querying and integration with ontological frameworks. The refined KG is accessible via a public platform supporting multilingual natural language queries, enabling broader use of historical ecological data. This approach highlights the potential of AI-assisted pipelines to transform legacy scientific texts into semantically structured, interoperable resources for biodiversity research.
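To make the refinement step concrete, the following minimal Python sketch shows how verbose relationship descriptions could be collapsed into concise, RDF-style predicates and then queried. It is an illustration under stated assumptions, not the authors' pipeline: the work described here performs the refinement with automated tuning and prompt engineering over GraphRAG output, whereas this sketch substitutes a small hand-written keyword table. The entity names, the predicate names (`inhabits`, `preysOn`, `relatedTo`), and the helper functions are all hypothetical.

```python
import re

# Hypothetical verbose edges, shaped like what an LLM-based extractor might emit.
# They are illustrative only and are not taken from the digitized source text.
VERBOSE_EDGES = [
    ("Species A", "is a species that is commonly found living in", "rocky intertidal zone"),
    ("Species B", "has been observed to feed upon", "Species A"),
]

# Assumed keyword-to-predicate rules. This table is a stand-in for the
# prompt-driven refinement step; the predicate names are made up.
PREDICATE_RULES = [
    (re.compile(r"\b(found|living|inhabits?)\b", re.IGNORECASE), "inhabits"),
    (re.compile(r"\b(feeds?|preys?|eats?)\b", re.IGNORECASE), "preysOn"),
]


def refine_predicate(verbose):
    """Map a verbose relationship description to a concise predicate."""
    for pattern, predicate in PREDICATE_RULES:
        if pattern.search(verbose):
            return predicate
    return "relatedTo"  # generic fallback when no rule matches


def to_triples(edges):
    """Rewrite (subject, verbose description, object) edges as concise triples."""
    return [(s, refine_predicate(p), o) for s, p, o in edges]


def query(triples, predicate):
    """Return every (subject, object) pair linked by the given predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]


triples = to_triples(VERBOSE_EDGES)
for subject, predicate, obj in triples:
    print(f"{subject} --{predicate}--> {obj}")
```

Short, uniform predicates make a query such as `query(triples, "preysOn")` trivial, whereas the verbose originals would require fuzzy text matching to answer the same question.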
Files
| Name | Size | Checksum |
|---|---|---|
| GOBLIN25_Camera-Ready_21.pdf | 383.3 kB | md5:a413be8238f79539c1d888ed789a6316 |