Published September 1, 2021 | Version 1.0.0
Dataset Open

MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

  • 1. University of Copenhagen
  • 2. Athens University of Economics and Business

Description

The dataset is published with:

MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. Punta Cana, Dominican Republic.

Documents: MultiEURLEX comprises 65k EU laws in 23 official EU languages. Each EU law has been annotated with EUROVOC concepts (labels) by the Publications Office of the EU. Each EUROVOC label ID is associated with a label descriptor, e.g., [60, 'agri-foodstuffs'], [6006, 'plant product'], [1115, 'fruit']. The descriptors are also available in the 23 languages. Chalkidis et al. (2019) published a monolingual (English) version of this dataset, called EURLEX57K, comprising 57k EU laws with the originally assigned gold labels.

Languages: MultiEURLEX covers 23 languages from 7 families. EU laws are published in all official EU languages except Irish, which is excluded for resource-related reasons (read more: https://europa.eu/european-union/about-eu/eu-languages_en). This wide coverage makes the dataset a valuable testbed for cross-lingual transfer. All languages use the Latin script, except for Bulgarian (Cyrillic script) and Greek (Greek script).

Multi-granular Labeling: EUROVOC has eight levels of concepts. Each document is assigned one or more concepts (labels). If a document is assigned a concept, the ancestors and descendants of that concept are typically not assigned to the same document. The documents were originally annotated with concepts from levels 3 to 8. We created three alternative sets of labels per document by replacing each assigned concept with its ancestor from level 1, 2, or 3, respectively. Thus, we provide four sets of gold labels per document: one for each of the first three levels of the hierarchy, plus the original sparse label assignment.
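
The ancestor replacement described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the dataset: the concept IDs, the `PARENT` mapping, and both function names are hypothetical, and the real EUROVOC hierarchy would have to be loaded from the label descriptors shipped with the data.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy hierarchy (NOT real EUROVOC IDs): child -> parent.
# A parent of None marks a level-1 concept.
PARENT: Dict[str, Optional[str]] = {
    "c1": None,
    "c1.1": "c1",
    "c1.1.1": "c1.1",
    "c1.1.1.1": "c1.1.1",
    "c2": None,
    "c2.1": "c2",
    "c2.1.1": "c2.1",
}


def path_from_root(concept: str) -> List[str]:
    """Return the chain of concepts from the level-1 root down to `concept`."""
    path = [concept]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path[::-1]


def labels_at_level(assigned: List[str], level: int) -> Set[str]:
    """Replace each assigned concept with its ancestor at `level` (1-based).

    Concepts shallower than `level` are skipped; per the description this
    should not occur for levels 1 to 3, since annotations start at level 3.
    """
    result: Set[str] = set()
    for concept in assigned:
        path = path_from_root(concept)
        if len(path) >= level:
            result.add(path[level - 1])
    return result
```

Because the result is a set, several original labels that share an ancestor collapse into a single coarser label, which is why the level-1 to level-3 label sets are smaller than the original assignment.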

Supported Tasks: Similarly to EURLEX57K (Chalkidis et al., 2019), MultiEURLEX can be used for legal topic classification, a multi-label classification task where legal documents need to be assigned concepts (in our case, from EUROVOC) reflecting their topics. Unlike EURLEX57K, however, MultiEURLEX supports labels at three different granularities (EUROVOC levels). More importantly, apart from monolingual (one-to-one) experiments, it can be used to study cross-lingual transfer scenarios, including one-to-many (systems trained in one language and used in other languages with no training data), and many-to-one or many-to-many (systems jointly trained in multiple languages and used in one or more other languages).
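
In a multi-label setting, each document's label set is commonly turned into a multi-hot target vector before training a classifier. The sketch below shows one way to do this; the function names are hypothetical, and the label IDs are simply the three example IDs quoted in the description, used here as a toy label space rather than the full EUROVOC inventory.

```python
from typing import Dict, List, Sequence


def build_label_index(label_ids: Sequence[str]) -> Dict[str, int]:
    """Map each distinct label ID to a fixed position in the target vector."""
    return {label: i for i, label in enumerate(sorted(set(label_ids)))}


def multi_hot(doc_labels: Sequence[str], label_index: Dict[str, int]) -> List[int]:
    """Encode a document's labels as a 0/1 vector over the label space."""
    vector = [0] * len(label_index)
    for label in doc_labels:
        vector[label_index[label]] = 1
    return vector


# Toy label space built from the example IDs in the description.
label_index = build_label_index(["60", "6006", "1115"])
target = multi_hot(["60", "1115"], label_index)
```

Since the label IDs are language-independent, the same target vectors can be reused for every language version of a document, which is what makes the one-to-many and many-to-many transfer settings straightforward to set up.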

Data Split and Concept Drift: MultiEURLEX is chronologically split into training (55k, 1958-2010), development (5k, 2010-2012), and test (5k, 2012-2016) subsets, using the English documents. The test subset contains the same 5k documents in all 23 languages. The development subset also contains the same 5k documents in all languages except Croatian: Croatia is the most recent EU member (2013), and older laws are only gradually being translated. For the official languages of the seven oldest member countries, the same 55k training documents are available; for the other languages, only a subset of the 55k training documents is available. Compared to EURLEX57K (Chalkidis et al., 2019), MultiEURLEX is not only larger (8k more documents) and multilingual; it is also more challenging, as the chronological split leads to temporal real-world concept drift across the training, development, and test subsets, i.e., differences in label distribution and phrasing, representing a realistic temporal generalization problem (Huang and Paul, 2019; Lazaridou et al., 2021). Recently, Søgaard et al. (2021) showed that this setup is more realistic, as it does not overestimate real performance, contrary to random splits (Gorman and Bedrick, 2019).
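
A chronological split of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to build the dataset: it assumes the split is made by sorting documents by date and cutting by count (which would be consistent with the boundary years 2010 and 2012 appearing in two subsets each), and the `Doc` type and function name are hypothetical.

```python
from datetime import date
from typing import List, Tuple

# Hypothetical record type: (document id, publication date).
Doc = Tuple[str, date]


def chronological_split(
    docs: List[Doc], n_dev: int, n_test: int
) -> Tuple[List[Doc], List[Doc], List[Doc]]:
    """Sort by date; the oldest documents form the training set,
    the next `n_dev` the development set, and the newest `n_test` the test set."""
    n_train = len(docs) - n_dev - n_test
    if n_train < 0:
        raise ValueError("n_dev + n_test exceeds the number of documents")
    ordered = sorted(docs, key=lambda d: d[1])
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_dev],
        ordered[n_train + n_dev:],
    )


# Tiny usage example with made-up documents.
docs = [
    ("d4", date(2011, 3, 1)),
    ("d1", date(1960, 1, 1)),
    ("d6", date(2015, 7, 1)),
    ("d2", date(1995, 6, 1)),
    ("d5", date(2013, 2, 1)),
    ("d3", date(2009, 9, 1)),
]
train, dev, test = chronological_split(docs, n_dev=1, n_test=2)
```

Unlike a random split, every development and test document here is at least as recent as every training document, so a model is always evaluated on laws newer than the ones it was trained on.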

Files

Size: 2.8 GB
md5:9f7beeff307418146356cd259b622a2c