Analysis of the MMLU Benchmark for the evaluation of LLMs in Spanish
Description
This dataset contains the set of files that were used in an analysis of the MMLU benchmark for evaluating LLMs in Spanish. The aim of the study was to analyze the consequences of evaluating Large Language Models (LLMs) with benchmarks that were designed in English and translated into Spanish with automatic translation tools. The final objective was therefore to highlight the importance of accurate, language-specific multilingual evaluation benchmarks in order to promote the development of LLMs in languages other than English.
To achieve this goal, an evaluation method was designed to extract information about how LLMs respond to selected tests of the MMLU benchmark. This process generated the set of files included in this repository.
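As an illustration of this evaluation step, the sketch below shows how a multiple-choice MMLU question might be sent to the models through the OpenAI Python client. It is only a minimal sketch under assumed prompt wording, answer parsing, and model identifiers, not the actual script used to generate the files in this dataset.

from openai import OpenAI  # requires the openai Python package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model, question, options):
    # Present the question as a lettered multiple-choice prompt and ask
    # the model to reply with a single letter (assumed prompt format).
    letters = "ABCD"
    prompt = (
        question
        + "\n"
        + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
        + "\nAnswer with a single letter (A, B, C or D)."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()[:1]

# Example call with a placeholder question (hypothetical data).
print(ask("gpt-4", "¿Cuál es la capital de Australia?",
          ["Sídney", "Canberra", "Melbourne", "Perth"]))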
The files are divided into three folders, each named after the MMLU test category whose data it stores. The selected categories are Miscellaneous, Philosophy, and US_Foreign_Policy. Each folder contains ten files.
First, each folder includes the answers of GPT-3.5-Turbo and GPT-4 to the tests in three versions: the original English version, a Spanish version translated with Azure Translator, and a Spanish version translated with ChatGPT. These files, six in total per test, are named according to the structure category-translation-responses-model.xlsx (e.g., miscellaneous-azure-responses-ChatGPT4.xlsx).
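Under this naming convention, the response files can be enumerated and parsed along the following lines. This is a sketch that assumes the unzipped folder is called Dataset and that every response file follows the hyphen-separated pattern above; the exact folder name and the label used for the original (untranslated) version are not specified here.

from pathlib import Path

root = Path("Dataset")  # hypothetical name of the unzipped folder

for path in root.glob("*/*responses*.xlsx"):
    # Response files follow category-translation-responses-model.xlsx,
    # e.g. miscellaneous-azure-responses-ChatGPT4.xlsx.
    category, translation, _, model = path.stem.split("-", 3)
    print(f"{path.parent.name}: category={category}, "
          f"translation={translation}, model={model}")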
The next step of the method was to select the questions that the model failed in one of the translated versions but answered correctly in the original. For GPT-4, this selection was analyzed manually, searching for the relationship between each model failure and the errors introduced by the translation. The results are included in two files per test, one for Azure translation failures and one for ChatGPT translation failures, named category_translation_manual_analysis_model.xlsx; these files also contain a classification of the linguistic errors made by the automatic translation.
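A selection of this kind could be reproduced with pandas along the lines of the sketch below. The file name used for the original-English responses and the column names ("question", "correct") are assumptions for illustration only, since the exact layout of the spreadsheets is not described in this summary.

import pandas as pd  # reading .xlsx files also requires openpyxl

# Hypothetical file names and columns; "correct" is assumed to be a
# per-question boolean flag indicating whether the model's answer was right.
original = pd.read_excel("miscellaneous-original-responses-ChatGPT4.xlsx")
translated = pd.read_excel("miscellaneous-azure-responses-ChatGPT4.xlsx")

# Questions answered correctly in English but failed in the Azure translation.
mask = original["correct"] & ~translated["correct"]
candidates = original.loc[mask, "question"]
print(f"{mask.sum()} questions selected for manual analysis")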
Finally, after the analysis, the selected questions were manually retranslated and rerun on the model to check whether its responses changed. These results are shown in the two remaining files in each folder, named category_translation_corrected_questions.xlsx.
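A quick check of whether the corrected translations changed the model's behaviour could look like the following sketch; the file name matches the stated pattern, but the column names ("answer_before", "answer_after") are hypothetical and may differ from the actual spreadsheets.

import pandas as pd

# Assumed columns holding the model's choice before and after retranslation.
corrected = pd.read_excel("philosophy_azure_corrected_questions.xlsx")
changed = corrected["answer_before"] != corrected["answer_after"]
print(f"{changed.mean():.0%} of retranslated questions changed the model's answer")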
Files
Dataset.zip (1.8 MB)
md5:d70d0f1ce783838da42c0948a6d20f97