First steps towards the development of an LLM for the Cape Verdean Creole
Description
This study presents a contribution to the development of a Large Language Model (LLM) for Cape Verdean Creole (CCV), with a particular focus on the fill-mask task. Given the scarcity of linguistic resources and the dialectal diversity of CCV, the project presents considerable challenges, particularly in the creation and curation of a representative corpus. The adopted methodology includes the collection of corpora through crowdsourcing and web scraping, the development of tokenizers using Byte Pair Encoding (BPE), and the training and evaluation of deep neural network (DNN) models. In one approach, we trained a RoBERTa-based 83M-parameter model from scratch on the collected corpora; in another, we fine-tuned Albertina 100M, a model pre-trained on European and Brazilian Portuguese. Various training and fine-tuning experiments were conducted on the different computational infrastructures available at our university. The results show that the RoBERTa 83M-based model trained from scratch, with tokenization adapted to CCV, outperformed the fine-tuned Albertina 100M model (accuracy: 68.86%), owing to its superior morphological compatibility with the target language. The study concludes that developing an LLM for CCV is both feasible and promising, representing a significant contribution to the processing of low-resource languages. Future work will focus on expanding the parallel corpus and extending the model to other downstream tasks.
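The Byte Pair Encoding step mentioned above can be illustrated with a minimal sketch: BPE starts from character-level symbols and repeatedly merges the most frequent adjacent pair, so a tokenizer trained on CCV text learns subword units that match the language's morphology. The snippet below is a simplified, stdlib-only illustration of the merge-learning loop (the sample sentence is a hypothetical CCV-like fragment, not from the paper's corpus; the actual work uses a full BPE tokenizer implementation).

```python
from collections import Counter

def get_pair_counts(words):
    # words: dict mapping a tuple of symbols to its corpus frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of the chosen adjacent pair with one symbol.
    a, b = pair
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def learn_bpe(corpus, num_merges):
    # Start from character-level symbols plus an end-of-word marker.
    words = Counter(tuple(w) + ("</w>",) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges

# Hypothetical CCV-like sample text, for illustration only.
corpus = "nu ta papia kriolu nu ta skrebe kriolu"
merges = learn_bpe(corpus, 10)
print(merges[:3])
```

The learned merge list is the tokenizer's vocabulary-building recipe; applying the same merges in order to new text segments it into the learned subwords.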
Files
SIL-ISCTE_2025_paper_17.pdf
(548.8 kB)
md5:87fbc0a16040674262865ca804280d36