Domain-Adaptive Pre-Training vs. Instruction Fine-Tuning for Cross-Lingual Retrieval Robustness
Description
Large language models (LLMs) are increasingly used to access legal information, yet their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted open embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 lan
Research goal: How does domain-adaptive pre-training on legal corpora compare to instruction fine-tuning for improving cross-lingual retrieval robustness against adversarial perturbations in multilingual embedding models?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
Notes
Files
paper.pdf (87.3 kB)
md5:da786b5121b5ca7a9a3482d68b800678