Leveraging LLM for Semantic Search and Curation in a National Research Data Catalog
Authors/Creators
Description
We present a suite of operational services (TRL 7-9) that leverage Artificial Intelligence to augment, not replace, human expertise. We have developed a prototype national catalog for French research data that integrates hybrid search with AI-driven tools for metadata enhancement and quality assessment. The catalog combines traditional faceted search with a multilingual semantic search engine, using bi-encoder models for efficient retrieval and cross-encoders for precise reranking. To tackle metadata inconsistency, we use right-sized, open-source LLMs such as Mistral Small to align entities to controlled vocabularies (e.g., ROR) and generate standardized classifications (e.g., scientific disciplines). This approach minimizes computational costs and environmental impact while ensuring transparency by always distinguishing between original and AI-generated metadata. Because source metadata can be of low quality, we have also built a novel curation analysis tool that uses a few-shot LLM to assess the semantic substance of descriptions. Our roadmap focuses on evolving these tools into a proactive "FAIR by Design" ecosystem.
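The retrieve-then-rerank pattern mentioned in the description can be illustrated with a minimal sketch. This is not the catalog's implementation: `embed` is a toy bag-of-words stand-in for a bi-encoder, `pair_score` is a toy stand-in for a cross-encoder, and all function names and sample records are hypothetical. The point is only the two-stage shape: a cheap independent-vector comparison selects candidates, and a costlier joint query-document score reorders them.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def embed(text):
    # Toy stand-in for a bi-encoder: each text is encoded on its own,
    # so document vectors could be precomputed and indexed.
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, doc_vectors, k):
    # Stage 1: rank all documents by vector similarity, keep the top k.
    q = embed(query)
    scored = [(cosine(q, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def pair_score(query, doc):
    # Toy stand-in for a cross-encoder: looks at query and document together.
    # Score = share of query terms found, plus 1 if the exact phrase occurs.
    q_tokens, d_tokens = tokenize(query), tokenize(doc)
    if not q_tokens:
        return 0.0
    coverage = sum(1 for t in q_tokens if t in d_tokens) / len(q_tokens)
    n = len(q_tokens)
    has_phrase = any(
        d_tokens[i:i + n] == q_tokens for i in range(len(d_tokens) - n + 1)
    )
    return coverage + (1.0 if has_phrase else 0.0)


def search(query, docs, k=3, top_n=2):
    doc_vectors = {doc_id: embed(text) for doc_id, text in docs.items()}
    candidates = retrieve(query, doc_vectors, k)
    # Stage 2: rerank only the k candidates with the joint scorer.
    reranked = sorted(candidates, key=lambda d: (-pair_score(query, docs[d]), d))
    return reranked[:top_n]


# Hypothetical dataset descriptions, for illustration only.
DOCS = {
    "ds1": "sea surface temperature measurements in the atlantic ocean",
    "ds2": "surface temperature at sea",
    "ds3": "survey of household income in france",
    "ds4": "temperature of the surface of mars",
}

if __name__ == "__main__":
    print(search("sea surface temperature", DOCS))
```

In this toy data the first stage ranks `ds2` above `ds1` because the shorter text has a higher cosine similarity, while the second stage promotes `ds1` because it contains the exact query phrase. A production system would replace both scorers with trained models, at which point the cost difference between the two stages becomes the reason for the split.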
Files
| Name | Size | Checksum |
|---|---|---|
| 20260209_IDCC2026_-_Leveraging_LLM_for_semantic_search_and_curation_in_a_national_research_data_cata[35] - Read-Only.pdf | 2.4 MB | md5:93a83787ced63096c1cf2ea488232839 |
Additional details
Dates
- Available: 2026-02-17