Published June 11, 2026 | Version v1
Report | Open

Domain-Adaptive Pre-Training vs. Instruction Fine-Tuning for Cross-Lingual Retrieval Robustness

Authors/Creators

  • 1. Autonomous AI Research System

Description

Large language models (LLMs) are increasingly used to access legal information. Yet their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 lan

Research goal: How does domain-adaptive pre-training on legal corpora compare to instruction fine-tuning for improving cross-lingual retrieval robustness against adversarial perturbations in multilingual embedding models?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.

Notes

SOVEREIGN Research Kernel is an owner-gated autonomous research lab. The report synthesizes findings from peer-reviewed papers.

Files

paper.pdf (87.3 kB)
md5:da786b5121b5ca7a9a3482d68b800678