Published October 27, 2025 | Version v1
Conference paper Open

First steps towards the development of an LLM for the Cape Verdean Creole

  • Iscte – Instituto Universitário de Lisboa

Description

This study presents a contribution to the development of a Large Language Model (LLM) for Cape Verdean Creole (CCV), with a particular focus on the fill-mask task. Given the scarcity of linguistic resources and the dialectal diversity of CCV, the project presents considerable challenges, particularly in the creation and curation of a representative corpus. The adopted methodology includes the collection of corpora through crowdsourcing and web scraping, the development of tokenizers using Byte Pair Encoding, and the training and evaluation of deep neural network (DNN) models. In one approach, we trained an 83M-parameter RoBERTa-based model from scratch on the collected corpora; in another, we fine-tuned Albertina 100M, a model pre-trained on European and Brazilian Portuguese. Various training and fine-tuning experiments were conducted on the different computational infrastructures available at our university. The results show that the RoBERTa 83M-based model trained from scratch, with tokenization adapted to CCV, outperformed the fine-tuned Albertina 100M model (accuracy: 68.86%), owing to its superior morphological compatibility with the target language. The study concludes that developing an LLM for CCV is both feasible and promising, representing a significant contribution to the processing of low-resource languages. Future work will focus on expanding the parallel corpus and extending the model to other downstream tasks.
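
As a rough illustration of the workflow described in the abstract (BPE tokenizer training, from-scratch RoBERTa pre-training, and fill-mask evaluation), the sketch below uses the Hugging Face `tokenizers` and `transformers` libraries. The corpus path, vocabulary size, model size, and example sentence are placeholders, not the configuration reported in the paper.

```python
# Minimal sketch, assuming Hugging Face `tokenizers` and `transformers`;
# file paths, hyperparameters, and the example sentence are placeholders.
import os
from tokenizers import ByteLevelBPETokenizer
from transformers import (RobertaConfig, RobertaForMaskedLM,
                          RobertaTokenizerFast, pipeline)

# 1. Train a Byte Pair Encoding tokenizer on the collected CCV corpus.
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["ccv_corpus.txt"],          # placeholder corpus file
    vocab_size=32_000,                 # placeholder vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("ccv_tokenizer", exist_ok=True)
bpe.save_model("ccv_tokenizer")

# 2. Initialise a small RoBERTa model from scratch with that vocabulary.
config = RobertaConfig(vocab_size=32_000, max_position_embeddings=514)
model = RobertaForMaskedLM(config)
tokenizer = RobertaTokenizerFast.from_pretrained("ccv_tokenizer")
# ... masked-language-model pre-training on the CCV corpus would go here,
#     e.g. with transformers.Trainer and DataCollatorForLanguageModeling.

# 3. Fill-mask evaluation: ask the model to predict the masked token.
#    (An untrained model gives random predictions; shown only for the API.)
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("N sta <mask> na skola."))   # illustrative CCV-like sentence
```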

Files (548.8 kB)

SIL-ISCTE_2025_paper_17.pdf (548.8 kB)
md5:87fbc0a16040674262865ca804280d36