There is a newer version of the record available.

Published December 12, 2023 | Version v1
Dataset Restricted

NLUCat

  • 1. Barcelona Supercomputing Center

Description

NLUCat is a natural language understanding (NLU) dataset in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is also accompanied by the instructions given to the annotator who wrote it.

The intents covered are the usual ones for a virtual home assistant (activity calendar, IoT, list management, leisure, etc.), but specific ones have also been added to address the social and healthcare needs of vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

The spans have been annotated with a tag describing the type of information they contain. The tags are fine-grained, but they can easily be grouped into coarser categories for use in robust systems.
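As a rough illustration of how fine-grained span tags might be grouped, the sketch below maps tags to coarser categories. The tag names, group names, and the (text, tag) span layout are all hypothetical; the actual tag set is defined in the annotation guidelines and may look quite different.

```python
from collections import Counter

# Hypothetical fine-grained tags mapped to hypothetical coarse groups.
# These names are illustrative only, not the real NLUCat tag set.
COARSE_GROUP = {
    "date": "time",
    "hour": "time",
    "city": "location",
    "street": "location",
    "medication": "health",
}


def group_tag(tag: str, default: str = "other") -> str:
    """Return the coarse group for a fine-grained tag, or `default` if unmapped."""
    return COARSE_GROUP.get(tag, default)


def group_spans(spans):
    """Replace each span's fine-grained tag with its coarse group.

    `spans` is a list of (text, tag) pairs; this layout is an assumption.
    """
    return [(text, group_tag(tag)) for text, tag in spans]


# Made-up spans, used only to exercise the functions above.
example = [("demà", "date"), ("a les vuit", "hour"), ("Girona", "city")]
grouped = group_spans(example)
counts = Counter(tag for _, tag in grouped)
```

A plain dictionary keeps the mapping easy to audit and to swap out once the real tag inventory is known; unmapped tags fall back to a single catch-all group rather than raising an error.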

The examples are not only written in Catalan, but also reflect the geographical and cultural reality of the speakers of this language (geographic locations, cultural references, etc.).

This dataset can be used to train models for intent classification, span identification, and example generation.

This is the complete version of the dataset. A version prepared for training and evaluating intent classifiers has been published on HuggingFace.

This work is licensed under a CC0 International License.

In this repository you'll find the following items:

  • NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
  • NLUCat_dataset.json: the complete NLUCat dataset
  • NLUCat_stats.tsv: statistics about the NLUCat dataset
  • dataset: folder with the dataset as published on [HuggingFace](https://huggingface.co/datasets/projecte-aina/NLUCat), split and prepared for training and evaluating intent classifiers
  • reports: folder with the reports given as feedback to the annotators during the annotation process
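
For the intent-classification version published on HuggingFace, a minimal loading sketch could look like the following. It assumes the `datasets` library is installed and that the dataset identifier matches the URL given above; the helper name `load_nlucat` is hypothetical, and split and column names are not stated in this record, so they should be inspected after loading.

```python
DATASET_ID = "projecte-aina/NLUCat"  # taken from the HuggingFace URL in the list above


def load_nlucat(dataset_id: str = DATASET_ID):
    """Download the dataset from the Hugging Face Hub and return it.

    Requires `pip install datasets` and network access. Split and column
    names are not documented here, so inspect the returned object.
    """
    # Imported lazily so this module can be loaded without the library.
    from datasets import load_dataset

    return load_dataset(dataset_id)
```

Calling `load_nlucat()` and printing the result should list whichever splits and columns the published version provides.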

This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.