Published October 16, 2023 | Version 1.00
Conference paper Open

Automatic Generation of Common Procurement Vocabulary Codes

Description

This dataset contains 5M pairs, each consisting of an Italian tender description and its corresponding CPV (Common Procurement Vocabulary) code.

The data were downloaded from the ANAC website (https://dati.anticorruzione.it/opendata) and split into training (3.2M), development (800K), and test (1M) sets.
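The stated subset sizes correspond to roughly 64%, 16%, and 20% of the 5M pairs. The following is a minimal sketch of a split with those proportions. It assumes a seeded random shuffle, which is an assumption for illustration only: the record does not describe how the released subsets were actually produced, and `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(records, train_frac=0.64, dev_frac=0.16, seed=0):
    """Shuffle records and cut them into train/dev/test lists.

    The default fractions mirror the 3.2M / 800K / 1M sizes quoted above.
    Random shuffling is an assumption, not the documented procedure.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)

    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # everything left over
    return train, dev, test
```

Fixing the seed keeps the split reproducible across runs.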

The original dataset is in CSV format, while the three subsets are in JSON format, suitable for fine-tuning encoder-decoder models such as T5.
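As a rough illustration of how JSON records can be turned into source/target text pairs for an encoder-decoder model, consider the sketch below. The layout (one JSON object per line) and the field names `description` and `cpv` are hypothetical, since the record does not document the schema of the released files. The sample record is also made up.

```python
import json

# Hypothetical field names: the keys in the released files may differ.
TEXT_KEY = "description"
CODE_KEY = "cpv"


def to_seq2seq_pairs(json_lines):
    """Convert JSON Lines strings into (source_text, target_text) tuples."""
    pairs = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # The tender description is the model input, the CPV code the target.
        pairs.append((record[TEXT_KEY], str(record[CODE_KEY])))
    return pairs


# Made-up sample record, for illustration only (not taken from the dataset).
sample = ['{"description": "Fornitura di carta per stampanti", "cpv": "30197630"}']
print(to_seq2seq_pairs(sample))
```

Treating the code as a target string, rather than a class label, is what makes the task a text-generation problem that a model such as T5 can be fine-tuned on.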

This dataset is used in the following paper:

Lucia Siciliani, Emanuele Tanzi, Pierpaolo Basile and Pasquale Lops. Automatic Generation of Common Procurement Vocabulary Codes. Ninth Italian Conference on Computational Linguistics (CLiC-it 2023).

 

Files

Files (524.7 MB)

MD5 checksum                            Size
md5:dd68c3480819ff3065bb8dc49b053905    306.6 MB
md5:f96a442c38bff4cf8a481c958fddb84b    34.9 MB
md5:fc47ca1aa0e52bb26ac5d657a38f596b    43.6 MB
md5:517b7a4fb5dc2388391ed2c75ab25b0e    139.6 MB
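
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local file path is whatever name the file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum such as 'md5:dd68c348...'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use constant, which matters for files of several hundred megabytes.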

Additional details

Funding

Ministry of Education, Universities and Research
FAIR - Future AI Research PE00000013