Published February 17, 2026 | Version v1
Presentation Open

Leveraging LLM for Semantic Search and Curation in a National Research Data Catalog

  • 1. Institut National de Recherche pour l'Agriculture, l'Alimentation et l'Environnement
  • 2. Université de Lorraine

Description

We present a suite of operational services (TRL 7-9) that use Artificial Intelligence to augment, not replace, human expertise. We have developed a prototype national catalog for French research data that integrates hybrid search with AI-driven tools for metadata enhancement and quality assessment. The catalog combines traditional faceted search with a multilingual semantic search engine, using bi-encoder models for efficient retrieval and cross-encoders for precise reranking. To tackle metadata inconsistency, we use right-sized, open-source LLMs such as Mistral Small to align entities to controlled vocabularies (e.g., ROR) and to generate standardized classifications (e.g., scientific disciplines). This approach minimizes computational cost and environmental impact, and it preserves transparency by always distinguishing original metadata from AI-generated metadata. Because source metadata can be of low quality, we have also built a novel curation analysis tool that uses a few-shot LLM to assess the semantic substance of descriptions. Our roadmap focuses on evolving these tools into a proactive "FAIR by Design" ecosystem.
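
The two-stage search described above (cheap retrieval over the whole catalog, then precise reranking of a short candidate list) can be sketched as follows. This is a minimal illustration of the control flow only, not the catalog's implementation: `embed` and `pair_score` are hypothetical stand-ins for a bi-encoder and a cross-encoder, built from word counts so the example runs without any model, and the sample records are invented.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a bi-encoder (assumption): one vector per text, computed
    independently of the query so document vectors can be precomputed."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pair_score(query, doc):
    """Stand-in for a cross-encoder (assumption): scores the query and the
    document jointly, here by rewarding shared terms and shared word pairs."""
    q, d = query.lower().split(), doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    shared_terms = len(set(q) & set(d))
    shared_bigrams = sum(1 for bg in zip(q, q[1:]) if bg in doc_bigrams)
    return shared_terms + 2 * shared_bigrams


def search(query, docs, doc_vectors, retrieve_k=3, final_k=2):
    """Retrieve retrieve_k candidates by vector similarity, then rerank them."""
    q_vec = embed(query)
    # Stage 1: rank every document with the cheap, precomputed vectors.
    ranked = sorted(
        range(len(docs)),
        key=lambda i: cosine(q_vec, doc_vectors[i]),
        reverse=True,
    )
    candidates = ranked[:retrieve_k]
    # Stage 2: apply the costlier pairwise scorer to the candidates only.
    reranked = sorted(
        candidates, key=lambda i: pair_score(query, docs[i]), reverse=True
    )
    return [docs[i] for i in reranked[:final_k]]


# Invented sample records, for illustration only.
docs = [
    "soil moisture measurements in vineyards",
    "moisture of soil samples from forest plots",
    "genome sequences of wheat varieties",
    "soil moisture time series for maize fields",
]
doc_vectors = [embed(d) for d in docs]  # computed once, at indexing time
results = search("soil moisture", docs, doc_vectors)
```

The split matters because the first stage touches every record but only compares precomputed vectors, while the second stage runs the expensive joint scoring on a handful of candidates. In a real deployment the two stand-ins would be replaced by trained multilingual models.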

Files

20260209_IDCC2026_-_Leveraging_LLM_for_semantic_search_and_curation_in_a_national_research_data_cata[35] - Read-Only.pdf

Additional details

Dates

Available
2026-02-17