Sentence Boundary Detection for Multilingual Legal Text

Tobias Brugger

doi:10.5281/zenodo.7835211

Published April 17, 2023 | Version v1

Other Open



Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP): incorrectly split sentences heavily degrade the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, where sentence structures are complex and varied. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Our experimental results show that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. For further use and research by the community, we publicly release our dataset, models, and code.


Files (673.1 kB)

Name	Size
Bachelorthesis_Tobias_Brugger.pdf (md5:345b27259614029f20ba7a67e953dcf6)	673.1 kB

