Dataset for: "Big data suggest strong constraints of linguistic similarity on adult language learning"
- 1. Freie Universitaet Berlin
- 2. Radboud Universiteit
- 3. University of Rochester
This dataset is adapted from raw data containing fully anonymized results on the State Examination of Dutch as a Second Language. The exam is officially administered by the Board of Tests and Examinations (College voor Toetsen en Examens, or CvTE), which is mandated by the Dutch government.
The article accompanying the dataset:
Schepens, Job, Roeland van Hout, and T. Florian Jaeger. “Big Data Suggest Strong Constraints of Linguistic Similarity on Adult Language Learning.” Cognition 194 (January 1, 2020): 104056.
Every row in the dataset represents the first official testing score of a unique learner.
The columns contain the following information, based on questionnaires filled in at the time of the exam:
"L1" - The first language of the learner
"C" - The country of birth
"L1L2" - The combination of first and best additional language besides Dutch
"L2" - The best additional language besides Dutch
"AaA" - Age at Arrival in the Netherlands in years (starting date of residence)
"LoR" - Length of residence in the Netherlands in years
"" - Duration of daily education (1 low, 2 middle, 3 high, 4 very high). From 1992 until 2006, learners' education has been measured by means of a side-by-side matrix question in a learner's questionnaire. Learners were asked to mark which type of education they have had (elementary, secondary, or tertiary schooling) by means of filling in for how many years they have been enrolled, in which country, and whether or not they have graduated. Based on this information we were able to estimate how many years learners have had education on a daily basis from six years of age onwards. Since 2006, the question about learners' education has been altered and it is asked directly how many years learners have had formal education on a daily basis from six years of age onwards. Possible answering categories are: 1) 0 thru 5 years; 2) 6 thru 10 years; 3) 11 thru 15 years; 4) 16 years or more. The answers have been merged into the categorical answer.
"Sex" - Gender
"Family" - Language Family
"ISO639.3" - Language ID code according to Ethnologue
"Enroll" - Proportion of school-aged youth enrolled in secondary education according to the World Bank. The World Bank reports on education data in a wide number of countries around the world on a regular basis. We took the gross enrollment rate in secondary schooling per country in the year the learner has arrived in the Netherlands as an indicator for a country's educational accessibility at the time learners have left their country of origin.
"STEX_speaking_score" - The STEX test score for speaking proficiency.
"Dissimilarity_morphological" - Morphological similarity
"Dissimilarity_lexical" - Lexical similarity
"Dissimilarity_phonological_new_features" - Phonological similarity (in terms of new features)
"Dissimilarity_phonological_new_categories" - Phonological similarity (in terms of new sounds)
A few rows of the data:
"English","UnitedStates","EnglishMonolingual","Monolingual",34,0,4,"Female","Indo-European","eng ",94,541,0.0094,0.083191,11,19
"English","UnitedStates","EnglishGerman","German",25,16,3,"Female","Indo-European","eng ",94,603,0.0094,0.083191,11,19
"English","UnitedStates","EnglishFrench","French",32,3,4,"Male","Indo-European","eng ",94,562,0.0094,0.083191,11,19
"English","UnitedStates","EnglishSpanish","Spanish",27,8,4,"Male","Indo-European","eng ",94,537,0.0094,0.083191,11,19
"English","UnitedStates","EnglishMonolingual","Monolingual",47,5,3,"Male","Indo-European","eng ",94,505,0.0094,0.083191,11,19