Automatic Generation of Common Procurement Vocabulary Codes
Description
This dataset contains 5M pairs, each consisting of an Italian tender description and its corresponding Common Procurement Vocabulary (CPV) code.
The data were downloaded from the ANAC website (https://dati.anticorruzione.it/opendata) and split into training (3.2M), development (800K), and test (1M) sets.
The original dataset is in CSV format, while the three subsets are in JSON format, suitable for fine-tuning encoder-decoder models such as T5.
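The stated subset sizes correspond to a 64% / 16% / 20% partition of the 5M pairs. The record does not describe how the split was actually produced, so the following is only a minimal sketch of one way to obtain those proportions, assuming a seeded random shuffle. The function name `split_pairs`, the seed, and the toy data are hypothetical.

```python
import random


def split_pairs(pairs, train_pct=64, dev_pct=16, seed=42):
    """Shuffle pairs and split them into train/dev/test lists.

    Percentages are integers so the cut points are computed with
    exact integer arithmetic; the test set takes whatever remains.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * train_pct // 100
    n_dev = len(shuffled) * dev_pct // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Toy stand-in for (description, CPV code) pairs; not real data.
pairs = [(f"description {i}", f"code {i}") for i in range(100)]
train, dev, test = split_pairs(pairs)
print(len(train), len(dev), len(test))
```

Applied to 5M pairs, the same percentages give 3.2M, 800K, and 1M items.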
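As an illustration of how CSV rows could be turned into input/target records for an encoder-decoder model, here is a minimal sketch using only the Python standard library. The column names (`description`, `cpv`), the output keys (`input`, `target`), and the sample row are assumptions made for this example; the actual schema of the CSV and JSON files is not documented in this record.

```python
import csv
import io
import json


def rows_to_examples(csv_file, text_col="description", code_col="cpv"):
    """Yield one {"input", "target"} record per CSV row.

    text_col and code_col are hypothetical column names; adjust them
    to match the real header of the downloaded CSV file.
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        yield {
            "input": row[text_col].strip(),
            "target": row[code_col].strip(),
        }


# Illustrative in-memory CSV; the values are not taken from the dataset.
sample = io.StringIO(
    "description,cpv\n"
    "Fornitura di carta per stampanti,30197630-1\n"
)
examples = list(rows_to_examples(sample))

# One JSON object per line; ensure_ascii=False keeps accented
# Italian characters readable in the output.
lines = [json.dumps(example, ensure_ascii=False) for example in examples]
print(lines[0])
```

Framing the task as text-to-text, with the description as input and the code string as target, is what makes the data usable with sequence-to-sequence models.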
This dataset is used in the following paper:
Lucia Siciliani, Emanuele Tanzi, Pierpaolo Basile and Pasquale Lops. Automatic Generation of Common Procurement Vocabulary Codes. Ninth Italian Conference on Computational Linguistics (CLiC-it 2023).
Files
(524.7 MB)
| MD5 checksum | Size |
|---|---|
| dd68c3480819ff3065bb8dc49b053905 | 306.6 MB |
| f96a442c38bff4cf8a481c958fddb84b | 34.9 MB |
| fc47ca1aa0e52bb26ac5d657a38f596b | 43.6 MB |
| 517b7a4fb5dc2388391ed2c75ab25b0e | 139.6 MB |
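The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file path in the usage comment is a placeholder because the file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks
    so that large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value.

    The expected value may be given with or without the "md5:" prefix.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (the path is a placeholder for a downloaded file):
# matches_checksum("path/to/downloaded_file",
#                  "md5:dd68c3480819ff3065bb8dc49b053905")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not intended as protection against deliberate tampering.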