Published November 1, 2023 | Version v1
Conference paper (Open Access)

Assessment of Pre-Trained Models Across Languages and Grammars

Abstract (English)

We present an approach for assessing how multilingual large language models (LLMs) learn syntax across syntactic formalisms. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select several LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependency representations, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) a language's occurrence in the pre-training data matters more than the amount of task-specific data when recovering syntax from the word vectors.
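To illustrate how parsing can be cast as sequence labeling, the sketch below encodes a dependency tree as one label per token using a relative head-position scheme (an illustrative choice; the paper evaluates several encodings, and the function names here are hypothetical):

```python
def encode(heads):
    """Map 1-indexed head indices to per-token labels.

    heads[i] is the head of token i+1; 0 marks the root.
    The label is the signed offset from the token to its head,
    with 0 reserved for the root (a token is never its own head).
    """
    return [0 if h == 0 else h - i for i, h in enumerate(heads, start=1)]


def decode(labels):
    """Invert encode(): recover 1-indexed head indices from labels."""
    return [0 if lab == 0 else i + lab for i, lab in enumerate(labels, start=1)]


# Example: "She reads books", with 'reads' as the root.
heads = [2, 0, 2]        # She -> reads, reads -> ROOT, books -> reads
labels = encode(heads)   # [1, 0, -1]
assert decode(labels) == heads
```

With such an encoding, a standard token classifier over the LLM's word vectors predicts one label per token, and the tree is recovered by decoding; this is what makes the probe uniform across encodings and formalisms.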

Other (English)

We acknowledge the European Research Council (ERC), which has funded this research under the Horizon Europe research and innovation programme (SALSA, grant agreement No 101100615), ERDF/MICINN-AEI (SCANNER-UDC, PID2020-113230RB-C21), Xunta de Galicia (ED431C 2020/11), grant FPI 2021 (PID2020-113230RB-C21) funded by MCIN/AEI/10.13039/501100011033, and Centro de Investigación de Galicia “CITIC”, funded by the Xunta de Galicia through the collaboration agreement between the Consellería de Cultura, Educación, Formación Profesional e Universidades and the Galician universities for the reinforcement of the research centres of the Galician University System (CIGUS).

Files

Muñoz_Ortiz_2023_Assessment_pre-trained_models_across_lang_gram.pdf (2.0 MB)
