UK-LEX Dataset - Part of Chalkidis and Søgaard (2022)
Description
The UK-LEX dataset is part of the work "Ilias Chalkidis and Anders Søgaard. Improved Multi-label Classification under Temporal Concept Drift: Rethinking Group-Robust Algorithms in a Label-Wise Setting. 2022. In the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland."
Details:
United Kingdom (UK) legislation is publicly available as part of the United Kingdom's National Archives (https://www.legislation.gov.uk). Most of the laws have been assigned thematic categories (e.g., health-care, finance, education, transportation, planning), which are presented in the document preamble and are used for archival indexing purposes.
We release a new dataset, which comprises 36.5k UK laws (documents). The dataset is chronologically split into training (20k, 1975–2002), development (8.5k, 2002–2008), and test (8.5k, 2008–2018) subsets. We manually extract and cluster the topics to support two different label granularities, comprising 18 and 69 topics (labels), respectively.
Data Files:
uk-lex18.jsonl: The dataset where documents are annotated with 18 different topics (labels).
uk-lex69.jsonl: The dataset where documents are annotated with 69 different topics (labels).
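Since both files use the JSON Lines format (one JSON object per line), they can be read incrementally without loading everything into memory. The following is a minimal sketch; the record schema is not documented here, so the field name passed to `count_by_field` (for example `"data_type"` for the split) is an assumption that should be checked against the actual keys in the files.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_by_field(records, field):
    """Count records per value of `field`.

    Records that lack the field are counted under the key None.
    The field name is supplied by the caller; this sketch does not
    assume any particular schema.
    """
    return Counter(record.get(field) for record in records)
```

A possible use, once the real key names are known, would be `count_by_field(read_jsonl("uk-lex18.jsonl"), "data_type")` to see how many documents fall into each subset; inspecting `next(read_jsonl("uk-lex18.jsonl")).keys()` first shows which fields are actually present.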
Files
(523.1 MB)
Checksum | Size |
---|---|
md5:adc67c56144a530f5e91b77ff29c933c | 261.5 MB |
md5:557ff85ba6297259d0a2de107f2f7640 | 261.6 MB |
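The MD5 checksums listed above can be used to confirm that a download completed without corruption. The sketch below uses Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are illustrative, and because the listing does not say which checksum belongs to which file, each downloaded file may need to be compared against both values.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks of `chunk_size` bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with `expected`, which may carry an 'md5:' prefix."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("uk-lex18.jsonl", "md5:adc67c56144a530f5e91b77ff29c933c")` returns `True` only if that file's digest equals the first listed value.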