This repository contains data that accompany the paper 'Syntax-semantics interactions – seeking evidence from a synchronic analysis of 38 languages'. We used the UD corpora v2.6 (https://universaldependencies.org/), accessed in June 2020, as the basis of our study. We only considered languages for which more than 100k tokens were available; grammar examples and learner corpora were excluded. For each language, we merged all of its corpora into one, using our scripts. The included languages are listed below, with the number of merged corpora, the token count, and the language family:

Language             Corpora    Tokens   Family
Arabic                  3      1,042K    Afro-Asiatic, Semitic
Basque                  1        121K    Basque
Bulgarian               1        156K    IE, Slavic
Catalan                 1        531K    IE, Romance
Chinese                 5        285K    Sino-Tibetan
Classical Chinese       1        130K    Sino-Tibetan
Croatian                1        199K    IE, Slavic
Czech                   5      2,226K    IE, Slavic
Danish                  2        100K    IE, Germanic
Dutch                   2        306K    IE, Germanic
English                 9        648K    IE, Germanic
Estonian                2        481K    Uralic, Finnic
Finnish                 3        377K    Uralic, Finnic
French                  8      1,157K    IE, Romance
Galician                2        164K    IE, Romance
German                  4      3,753K    IE, Germanic
Hebrew                  1        161K    Afro-Asiatic, Semitic
Hindi                   2        375K    IE, Indic
Icelandic               2      1,001K    IE, Germanic
Indonesian              2        141K    Austronesian, Malayo-Sumbawan
Italian                 6        811K    IE, Romance
Japanese                5      1,699K    Japanese
Korean                  5        446K    Korean
Latin                   4        824K    IE, Latin
Latvian                 1        220K    IE, Baltic
Naija                   1        126K    Creole
Norwegian               3        666K    IE, Germanic
Old French              1        170K    IE, Romance
Old Russian             2        180K    IE, Slavic
Persian                 1        152K    IE, Iranian
Polish                  3        499K    IE, Slavic
Portuguese              3        571K    IE, Romance
Romanian                3        683K    IE, Romance
Russian                 4      1,289K    IE, Slavic
Slovak                  1        106K    IE, Slavic
Slovenian               2        170K    IE, Slavic
Spanish                 3      1,004K    IE, Romance
Swedish                 3        206K    IE, Germanic
Ukrainian               1        122K    IE, Slavic
Urdu                    1        138K    IE, Indic

To proceed, unzip corpora.zip and scripts.zip, which you have to download from two related Zenodo repositories. (The repos are separate because the copyright licenses differ.) The corpora for these languages are in the /corpora subfolder. Please note the license agreements for those corpora, as laid out in the UD_LICENSE.txt file.

We then used a set of quickly developed Python scripts to merge the different subcorpora and to add further measures relevant to our analysis to the parses: the number of tokens per parse, the number of characters per parse, logged word frequencies, logged lemma frequencies, and the sum of dependency distances. All of these are calculated disregarding punctuation. This is done by running the scripts 0_....py to 7_....py in each language's folder; illustrative sketches of these computations are given at the end of this file. For the license governing the scripts, please see the SCRIPTS_LICENSE.txt file in the /scripts subfolder. Important: the output of the scripts, i.e. the processed/modified corpora, is still subject to the license agreements as per UD_LICENSE.txt.

8_get_len.py is used to get the median number of tokens per parse and the median number of characters per parse (second sketch below). 9_analysis.py is used to analyse the correlation between word/lemma frequencies and dependency lengths (third sketch below). Note: to use 9_analysis.py, you need to modify the script slightly so that the input/output functions point to your home directory. We executed the scripts with the bash code in the file 'bashcode.txt'. Make sure to set the right directories in the bash code before running it.
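For illustration, here is a minimal sketch of how the per-parse measures can be computed from a merged CoNLL-U file. It assumes the third-party conllu package (pip install conllu) and a hypothetical file name merged.conllu; our actual scripts 0_....py to 7_....py may differ in detail (e.g. in how frequencies are counted or how the results are written back into the parses).

    import math
    from collections import Counter
    from conllu import parse_incr

    def is_word(tok):
        # Keep only regular syntactic words: skip multiword-token ranges
        # and empty nodes (non-integer ids) as well as punctuation.
        return isinstance(tok["id"], int) and tok["upos"] != "PUNCT"

    # First pass: corpus-wide form and lemma counts, used for the
    # logged word/lemma frequencies.
    form_freq, lemma_freq = Counter(), Counter()
    with open("merged.conllu", encoding="utf-8") as f:
        for sent in parse_incr(f):
            for tok in filter(is_word, sent):
                form_freq[tok["form"]] += 1
                lemma_freq[tok["lemma"]] += 1

    # Second pass: per-parse measures, punctuation disregarded throughout.
    with open("merged.conllu", encoding="utf-8") as f:
        for sent in parse_incr(f):
            words = [t for t in sent if is_word(t)]
            n_tokens = len(words)
            n_chars = sum(len(t["form"]) for t in words)
            log_wfreq = sum(math.log(form_freq[t["form"]]) for t in words)
            log_lfreq = sum(math.log(lemma_freq[t["lemma"]]) for t in words)
            # Dependency distance of a word = |its position - its head's
            # position|; summed over words whose head is not the root (0).
            dep_dist = sum(abs(t["id"] - t["head"]) for t in words
                           if t["head"] not in (0, None))
            print(n_tokens, n_chars, log_wfreq, log_lfreq, dep_dist)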
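In the same spirit, a sketch of what 8_get_len.py reports: the median number of tokens and of characters per parse, with punctuation excluded. The file name is again illustrative, not the script's actual interface.

    from statistics import median
    from conllu import parse_incr

    tok_lens, char_lens = [], []
    with open("merged.conllu", encoding="utf-8") as f:
        for sent in parse_incr(f):
            words = [t for t in sent
                     if isinstance(t["id"], int) and t["upos"] != "PUNCT"]
            tok_lens.append(len(words))
            char_lens.append(sum(len(t["form"]) for t in words))

    print("median tokens per parse:    ", median(tok_lens))
    print("median characters per parse:", median(char_lens))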
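Finally, a sketch of the kind of test 9_analysis.py performs: correlating logged word/lemma frequencies with dependency lengths. The use of scipy's Spearman correlation and the input values here are assumptions for illustration only; the actual script and the statistics reported in the paper may differ.

    from scipy.stats import spearmanr

    # Paired per-token observations, e.g. as collected in the loop from
    # the first sketch above; the numbers here are purely illustrative.
    log_freqs = [8.1, 5.3, 9.7, 2.2, 6.0]
    dep_dists = [1, 4, 1, 7, 2]

    rho, p = spearmanr(log_freqs, dep_dists)
    print(f"Spearman rho = {rho:.3f}, p = {p:.3g}")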