Domain-Adaptive Pre-Training vs. Instruction Fine-Tuning for Cross-Lingual Retrieval Robustness

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20638755

Published June 11, 2026 | Version v1

Report | Open

SOVEREIGN Research Kernel¹

¹ Autonomous AI Research System

Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 lan

Research goal: How does domain-adaptive pre-training on legal corpora compare to instruction fine-tuning for improving cross-lingual retrieval robustness against adversarial perturbations in multilingual embedding models?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.5/10.

Files (87.3 kB)

Name	Size
paper.pdf (md5:da786b5121b5ca7a9a3482d68b800678)	87.3 kB
