EULingDiv

Essfors, Hannes

doi:10.5281/zenodo.18836550

Published March 2, 2026 | Version 1.0

Dataset Open

Essfors, Hannes¹

1. TU Wien

Although the European Union is committed to linguistic diversity, recognizing 24 official EU languages and actively promoting multilingualism, little effort has been invested in quantitatively assessing the linguistic diversity of the union. However, two Eurobarometer surveys have been conducted that tangentially serve this purpose: one in 2012 and one in 2024. The surveys pose questions about respondents' native language and their first to third other languages, which we interpret as corresponding to L1 to L4, thus potentially allowing for more accurate models of linguistic diversity that account for multilingualism.

To allow for easier analysis, we have merged and structured the data pertaining to spoken language across the two surveys into the dataframe EU_country_speakers_2012_2024.csv. The dataframe is structured in a long format with country-language-year-group-speakers quintuplets. We have not converted the numbers into proportions or derived any formal measures, since many analysis-specific choices need to be accounted for, e.g. the weighting given to L1 versus L2, L3, etc. Furthermore, the surveys are sample-based: summing the L1 speakers of each country yields the sample size. As such, the uncertainty of any diversity indices derived from the data needs to be considered.
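
As a rough illustration of the point about sample sizes and proportions, the sketch below sums L1 speakers per country and year, converts L1 counts into shares, and computes Greenberg's diversity index (1 minus the sum of squared shares) from them. It is not part of the dataset and only one of many possible analyses: it ignores L2 to L4 entirely and does not quantify sampling uncertainty. It assumes rows are dictionaries keyed by the column names of the CSV, that the Speaker_type label for native speakers is exactly "L1", and that number_of_speakers is never missing. All function names are hypothetical.

```python
from collections import defaultdict


def l1_counts(rows):
    """Sum L1 speakers per (country_code, year, ISO6393)."""
    counts = defaultdict(float)
    for row in rows:
        # Assumption: native speakers are labelled exactly "L1".
        if row["Speaker_type"] == "L1":
            key = (row["country_code"], str(row["year"]), row["ISO6393"])
            counts[key] += float(row["number_of_speakers"])
    return dict(counts)


def sample_sizes(rows):
    """Sum L1 speakers per (country_code, year).

    According to the dataset description, this total is the sample size.
    """
    sizes = defaultdict(float)
    for (country, year, _language), n in l1_counts(rows).items():
        sizes[(country, year)] += n
    return dict(sizes)


def l1_shares(rows):
    """Share of each L1 language within its (country_code, year) sample."""
    rows = list(rows)  # the rows are traversed twice
    counts = l1_counts(rows)
    sizes = sample_sizes(rows)
    return {
        key: n / sizes[key[:2]]
        for key, n in counts.items()
        if sizes[key[:2]] > 0
    }


def greenberg_index(shares, country, year):
    """Greenberg's diversity index, 1 - sum(p_i ** 2), over L1 shares only."""
    selected = [
        p for (c, y, _language), p in shares.items()
        if (c, y) == (country, str(year))
    ]
    if not selected:
        raise KeyError(f"no L1 rows for {country!r} in {year!r}")
    return 1.0 - sum(p * p for p in selected)
```

Because the index is computed from a sample, any value obtained this way is an estimate; a resampling scheme such as a bootstrap over respondents would need the original survey microdata rather than the aggregated counts.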

While we have added ISO codes to make the dataset more interoperable, one should be wary of the presence of macrolanguages. For example, the surveys designate speakers of "Arabic" and "Albanian" but do not specify varieties such as Tosk or Gheg Albanian. We have followed the surveys here and assigned the macrolanguage identifiers "ara" and "sqi" in such cases. Furthermore, the languages included in the 2012 and 2024 surveys do not necessarily fully overlap, owing to how the surveys were designed.

The dataset contains the following columns:

country_code: 2-letter ISO-3166 code denoting the country that was surveyed. (character)
country_name: Commonly used country names corresponding to the country code - does not follow any particular standards. (character)
ISO6393: 3-letter ISO639-3 code denoting the language that was surveyed. (character)
Language name: Language name used by the original surveys. (character)
Speaker_type: Categorical variable denoting whether the language is the first (L1), second (L2), third (L3) or fourth (L4) language of the speakers.
number_of_speakers: Variable denoting the number of speakers of a variety in a given country according to a given type. (double)
year: Denotes which year and survey the data is sourced from.
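
The column list above can be turned into a small loading check. The following is a minimal sketch using only the Python standard library; it assumes the file is comma-delimited UTF-8 with a header row whose names match the list exactly, which the record does not state explicitly. The names EXPECTED_COLUMNS and load_rows are hypothetical.

```python
import csv

# Column names as given in the dataset description.
EXPECTED_COLUMNS = [
    "country_code",
    "country_name",
    "ISO6393",
    "Language name",
    "Speaker_type",
    "number_of_speakers",
    "year",
]


def load_rows(handle):
    """Read rows from an open text handle and check the expected columns."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        # number_of_speakers is described as a double; the other columns
        # are left as strings. Assumes the value is never empty.
        row["number_of_speakers"] = float(row["number_of_speakers"])
        rows.append(row)
    return rows


# Usage, once the file has been downloaded:
# with open("EU_country_speakers_2012_2024.csv", newline="", encoding="utf-8") as handle:
#     rows = load_rows(handle)
```

If the header differs from the description, for instance in capitalisation, the check fails early instead of producing silently wrong results.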

Files

Name	Size
EU_country_speakers_2012_2024.csv (md5:8ffb330b18e443c420ab0d57644f90f6)	133.3 kB
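
The md5 checksum listed for the file can be used to confirm that a download is intact. A minimal sketch with Python's hashlib follows; the local file path is an assumption, and the helper names file_md5 and matches_record are hypothetical.

```python
import hashlib

# Checksum as listed on the record for EU_country_speakers_2012_2024.csv.
EXPECTED_MD5 = "8ffb330b18e443c420ab0d57644f90f6"


def file_md5(path, chunk_size=65536):
    """Return the hexadecimal md5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so the whole file need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Check a downloaded file against the checksum from the record."""
    return file_md5(path) == expected


# Usage, assuming the file sits in the working directory:
# print(matches_record("EU_country_speakers_2012_2024.csv"))
```

A mismatch most likely indicates an incomplete download or a different version of the file.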

