Published May 2, 2023 | Version v4
Conference paper Open

GLADIS: A General and Large Acronym Disambiguation Benchmark

  • 1. Institut Polytechnique de Paris
  • 2. Inria Saclay

Description

Acronym Disambiguation (AD) is crucial for natural language understanding across a variety of sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguation benchmarks and tools are limited to specific domains, and prior benchmarks are rather small. To accelerate research on acronym disambiguation, we construct a new benchmark named GLADIS with three components: (1) a much larger acronym dictionary with 1.5M acronyms and 6.4M long forms; (2) a pre-training corpus with 160 million sentences; and (3) three datasets that cover the general, scientific, and biomedical domains. We then pre-train a language model, AcroBERT, on our constructed corpus for general acronym disambiguation, and show the challenges and value of our new benchmark.
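
To make the task concrete, here is a minimal sketch of dictionary-based acronym disambiguation: look up the candidate long forms of an acronym, then pick the one whose words overlap most with the surrounding sentence. This is a naive word-overlap baseline for illustration only; it is not AcroBERT and not the method used in the paper. The dictionary entries, the function names (`tokenize`, `disambiguate`), and the example sentence are all assumptions made up for this sketch, not taken from the GLADIS files.

```python
from typing import Dict, List, Optional, Set

# Toy dictionary mapping an acronym to candidate long forms.
# These entries are illustrative and NOT taken from the GLADIS dictionary.
TOY_DICTIONARY: Dict[str, List[str]] = {
    "AD": ["acronym disambiguation", "Alzheimer's disease", "anno Domini"],
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text, split on whitespace, strip basic punctuation."""
    tokens = {tok.strip(".,;:()").lower() for tok in text.split()}
    tokens.discard("")
    return tokens


def disambiguate(
    acronym: str, context: str, dictionary: Dict[str, List[str]]
) -> Optional[str]:
    """Return the candidate long form sharing the most words with the context.

    Returns None when the acronym is not in the dictionary. Ties are
    resolved in favour of the candidate listed first.
    """
    candidates = dictionary.get(acronym, [])
    if not candidates:
        return None
    context_tokens = tokenize(context)
    return max(candidates, key=lambda lf: len(tokenize(lf) & context_tokens))


sentence = "The patient showed early symptoms of AD, a disease affecting memory."
print(disambiguate("AD", sentence, TOY_DICTIONARY))
```

A word-overlap score fails whenever the context does not literally repeat a word of the correct long form, which is the kind of gap a pre-trained model is meant to close.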

Files

dataset.zip

Files (17.1 GB)

Checksum                               Size
md5:9dc619967354c1bcc76917da52e8da58   9.6 MB
md5:9d768e598ecce36241015faa4140b318   496.6 MB
md5:5252479827a9fe6187c72c5e32b1be28   16.5 GB
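
The MD5 checksums listed above can be used to confirm that a download completed intact, which matters most for the 16.5 GB file. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and because the listing does not show which file name belongs to which checksum, the local path is left for the reader to supply.

```python
import hashlib
from pathlib import Path
from typing import Union


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks of chunk_size bytes.

    Chunked reading keeps memory use constant even for multi-gigabyte files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Union[str, Path], expected: str) -> bool:
    """Compare a file's MD5 with an expected value.

    The expected value may carry the 'md5:' prefix used in the file listing.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(local_path, "md5:5252479827a9fe6187c72c5e32b1be28")` should return `True` when `local_path` points to an uncorrupted copy of the file published with that checksum.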