This repository contains data that accompany the paper 'Syntax-semantics interactions – seeking evidence from a synchronic analysis of 38 languages'. We used the UD corpora v2.6 (https://universaldependencies.org/), accessed in June 2020, as the basis of our study. We only considered languages for which more than 100k tokens were available; grammar examples and learner corpora were excluded. For each language, we merged all of its corpora into one, using our scripts. The included languages are listed below, with the number of merged corpora, the token count, and the language family:

Language             Corpora    Tokens   Family
Arabic                  3      1,042K    Afro-Asiatic, Semitic
Basque                  1        121K    Basque
Bulgarian               1        156K    IE, Slavic
Catalan                 1        531K    IE, Romance
Chinese                 5        285K    Sino-Tibetan
Classical Chinese       1        130K    Sino-Tibetan
Croatian                1        199K    IE, Slavic
Czech                   5      2,226K    IE, Slavic
Danish                  2        100K    IE, Germanic
Dutch                   2        306K    IE, Germanic
English                 9        648K    IE, Germanic
Estonian                2        481K    Uralic, Finnic
Finnish                 3        377K    Uralic, Finnic
French                  8      1,157K    IE, Romance
Galician                2        164K    IE, Romance
German                  4      3,753K    IE, Germanic
Hebrew                  1        161K    Afro-Asiatic, Semitic
Hindi                   2        375K    IE, Indic
Icelandic               2      1,001K    IE, Germanic
Indonesian              2        141K    Austronesian, Malayo-Sumbawan
Italian                 6        811K    IE, Romance
Japanese                5      1,699K    Japanese
Korean                  5        446K    Korean
Latin                   4        824K    IE, Latin
Latvian                 1        220K    IE, Baltic
Naija                   1        126K    Creole
Norwegian               3        666K    IE, Germanic
Old French              1        170K    IE, Romance
Old Russian             2        180K    IE, Slavic
Persian                 1        152K    IE, Iranian
Polish                  3        499K    IE, Slavic
Portuguese              3        571K    IE, Romance
Romanian                3        683K    IE, Romance
Russian                 4      1,289K    IE, Slavic
Slovak                  1        106K    IE, Slavic
Slovenian               2        170K    IE, Slavic
Spanish                 3      1,004K    IE, Romance
Swedish                 3        206K    IE, Germanic
Ukrainian               1        122K    IE, Slavic
Urdu                    1        138K    IE, Indic

To proceed, unzip corpora.zip and scripts.zip, which you have to download from two related Zenodo repositories. (The repos are separate because the copyright licenses differ.) The corpora for these languages are in the /corpora subfolder. Please note the license agreements for those corpora, as laid out in the UD_LICENSE.txt file.

We then used a set of quickly developed Python scripts to merge the different subcorpora and to add further measures relevant to our analysis to the parses: the number of tokens per parse, the number of characters per parse, logged word frequencies, logged lemma frequencies, and the sum of dependency distances. All of these are calculated disregarding punctuation. This is done by running the scripts 0_....py to 7_....py in each language's folder; illustrative sketches of these computations are given at the end of this file. For the license governing the scripts, please see the SCRIPTS_LICENSE.txt file in the /scripts subfolder. Important: the output of the scripts, i.e. the processed/modified corpora, is still subject to the license agreements as per UD_LICENSE.txt.

8_get_len.py is used to get the median number of tokens per parse and the median number of characters per parse (second sketch below). 9_analysis.py is used to analyse the correlation between word/lemma frequencies and dependency lengths (third sketch below). Note: to use 9_analysis.py, you need to modify the script slightly so that the input/output functions point to your home directory. We executed the scripts with the bash code in the file 'bashcode.txt'. Make sure to set the right directories in the bash code before running it.
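For illustration, here is a minimal sketch of how the per-parse measures can be computed from a merged CoNLL-U file. It assumes the third-party conllu package (pip install conllu) and a hypothetical file name merged.conllu; our actual scripts 0_....py to 7_....py may differ in detail (e.g. in how frequencies are counted or how the results are written back into the parses).

    import math
    from collections import Counter
    from conllu import parse_incr

    def is_word(tok):
        # Keep only regular syntactic words: skip multiword-token ranges
        # and empty nodes (non-integer ids) as well as punctuation.
        return isinstance(tok["id"], int) and tok["upos"] != "PUNCT"

    # First pass: corpus-wide form and lemma counts, used for the
    # logged word/lemma frequencies.
    form_freq, lemma_freq = Counter(), Counter()
    with open("merged.conllu", encoding="utf-8") as f:
        for sent in parse_incr(f):
            for tok in filter(is_word, sent):
                form_freq[tok["form"]] += 1
                lemma_freq[tok["lemma"]] += 1

    # Second pass: per-parse measures, punctuation disregarded throughout.
    with open("merged.conllu", encoding="utf-8") as f:
        for sent in parse_incr(f):
            words = [t for t in sent if is_word(t)]
            n_tokens = len(words)
            n_chars = sum(len(t["form"]) for t in words)
            log_wfreq = sum(math.log(form_freq[t["form"]]) for t in words)
            log_lfreq = sum(math.log(lemma_freq[t["lemma"]]) for t in words)
            # Dependency distance of a word = |its position - its head's
            # position|; summed over words whose head is not the root (0).
            dep_dist = sum(abs(t["id"] - t["head"]) for t in words
                           if t["head"] not in (0, None))
            print(n_tokens, n_chars, log_wfreq, log_lfreq, dep_dist)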
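In the same spirit, a sketch of what 8_get_len.py reports: the median number of tokens and of characters per parse, with punctuation excluded. The file name is again illustrative, not the script's actual interface.

    from statistics import median
    from conllu import parse_incr

    tok_lens, char_lens = [], []
    with open("merged.conllu", encoding="utf-8") as f:
        for sent in parse_incr(f):
            words = [t for t in sent
                     if isinstance(t["id"], int) and t["upos"] != "PUNCT"]
            tok_lens.append(len(words))
            char_lens.append(sum(len(t["form"]) for t in words))

    print("median tokens per parse:    ", median(tok_lens))
    print("median characters per parse:", median(char_lens))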
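Finally, a sketch of the kind of test 9_analysis.py performs: correlating logged word/lemma frequencies with dependency lengths. The use of scipy's Spearman correlation and the input values here are assumptions for illustration only; the actual script and the statistics reported in the paper may differ.

    from scipy.stats import spearmanr

    # Paired per-token observations, e.g. as collected in the loop from
    # the first sketch above; the numbers here are purely illustrative.
    log_freqs = [8.1, 5.3, 9.7, 2.2, 6.0]
    dep_dists = [1, 4, 1, 7, 2]

    rho, p = spearmanr(log_freqs, dep_dists)
    print(f"Spearman rho = {rho:.3f}, p = {p:.3g}")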