GLADIS: A General and Large Acronym Disambiguation Benchmark

Lihu Chen; Gaël Varoquaux; Fabian Suchanek

doi:10.5281/zenodo.7568937

Published May 2, 2023 | Version v4

Conference paper Open

1. Institut Polytechnique de Paris
2. Inria Saclay

Acronym Disambiguation (AD) is crucial for natural language understanding on various sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguation benchmarks and tools are limited to specific domains, and prior benchmarks are rather small. To accelerate research on acronym disambiguation, we construct a new benchmark named GLADIS with three components: (1) a much larger acronym dictionary with 1.5M acronyms and 6.4M long forms; (2) a pre-training corpus with 160 million sentences; (3) three datasets that cover the general, scientific, and biomedical domains. We then pre-train a language model, AcroBERT, on our constructed corpus for general acronym disambiguation, and show the challenges and value of our new benchmark.

Files

Files (17.1 GB)

Name	MD5 checksum	Size
dataset.zip	9dc619967354c1bcc76917da52e8da58	9.6 MB
dictionary_model.zip	9d768e598ecce36241015faa4140b318	496.6 MB
pre_train.tgz	5252479827a9fe6187c72c5e32b1be28	16.5 GB
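
After downloading, each archive can be checked against the MD5 digests listed for the files on this record. The following is a minimal sketch, assuming the three files sit in one local directory under the names shown here. The helper names `md5_of` and `verify` are hypothetical and are not part of the GLADIS release.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing on this record.
EXPECTED_MD5 = {
    "dataset.zip": "9dc619967354c1bcc76917da52e8da58",
    "dictionary_model.zip": "9d768e598ecce36241015faa4140b318",
    "pre_train.tgz": "5252479827a9fe6187c72c5e32b1be28",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory=".", expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch), or None (not found)."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == want) if path.is_file() else None
    return results


if __name__ == "__main__":
    # Assumes the archives were downloaded into the current directory.
    for name, status in verify().items():
        label = {True: "OK", False: "MISMATCH", None: "missing"}[status]
        print(f"{name}: {label}")
```

Reading in fixed-size chunks matters mainly for pre_train.tgz, which at 16.5 GB is too large to load into memory on most machines.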

	All versions	This version
Views	1,255	965
Downloads	660	544
Data volume	4.3 TB	3.3 TB

Resource type: Conference paper
Publisher: Zenodo
Conference: The 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL)

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: January 25, 2023
Modified: February 1, 2023
