Other Open Access

Sentence Boundary Detection for Multilingual Legal Text

Tobias Brugger

Thesis supervisor(s)

Joel Niklaus; Matthias Stürmer

Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP): incorrectly split sentences heavily degrade the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, where sentence structures are complex and varied. In this work, we curated a diverse multilingual legal dataset consisting of over 130’000 annotated sentences in six languages. Our experimental results show that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. For further use and research by the community, we publicly release our dataset, models, and code.
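To illustrate why legal text is hard for simple SBD approaches, the following minimal sketch applies a naive punctuation rule to a sentence containing a case citation. It is not one of the models or baselines evaluated in this work; the function name `naive_split` and the example text are hypothetical and chosen only for illustration.

```python
import re


def naive_split(text: str) -> list[str]:
    """Hypothetical rule-based splitter (illustration only).

    Splits wherever '.', '!' or '?' is followed by whitespace
    and an uppercase ASCII letter.
    """
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


# Two real sentences; the first contains the citation abbreviation "v."
text = "The court cited Smith v. Jones. The appeal was dismissed."
sentences = naive_split(text)

for sentence in sentences:
    print(repr(sentence))
```

The rule treats the period in "v." as a sentence end, so the two sentences come back as three fragments. Abbreviations, citations, and enumerations of this kind are common in legal documents, which is one reason a purely punctuation-based rule is a weak fit for the domain.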
