Published August 2, 2021 | Version v3.0
Software Open

dsorato/MCSQ_compiling: Version v3.0 annotated (Rosalind Franklin)

  • 1. Universidad Pompeu Fabra

Description

Version 3.0 of the Multilingual Corpus of Survey Questionnaires (MCSQ), named after the scientist Rosalind Franklin, adds new annotations and datasets to the corpus.

The following datasets were included in Version 3.0:

  • Wage Indicator: Wage Indicator and COVID-19 questionnaires in English (source and Great Britain), Czech, French (France), German (Germany), Norwegian, Portuguese (Portugal), Russian (Russian Federation), and Spanish (Spain)*.
  • European Values Study: wave 2 questionnaires in English (Great Britain and Ireland), French (France), German (Germany), Portuguese (Portugal), and Spanish (Spain).
  • European Social Survey: rounds 8 and 9 questionnaires in Catalan, Czech, English (Ireland and Great Britain), Portuguese (Portugal), Russian (Estonia in both rounds, Israel in round 8, and Latvia in round 9), and Spanish (Spain).

Additionally, the following European Social Survey round 8 and 9 questionnaires, which were released with missing data in version 2.0, were completed in this release: French (Belgium, Switzerland, France), German (Austria, Switzerland), Norwegian, and Russian (Lithuania and Russian Federation in round 8, and Latvia in round 9).

Lastly, we added Named Entity Recognition (NER) annotations to the corpus. The annotations were produced with pre-trained models from different sources, namely FlairNLP (English, German, French, and Spanish), spaCy (Catalan, Norwegian, and Portuguese), and Slavic BERT from DeepPavlov (Czech and Russian). Note that, due to the domain specificity and nature of the texts, some of the models (e.g., Catalan) performed worse than others, especially on instruction segments.
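As an illustration of the language-to-model assignment described above, the following is a minimal sketch of a lookup table. The names `NER_BACKENDS`, `BACKEND_BY_LANGUAGE`, and `ner_backend_for` are hypothetical and are not taken from the MCSQ_compiling code; only the language groupings come from this description.

```python
# Hypothetical lookup table: the groupings follow the release description,
# but the structure and names are illustrative, not the repository's code.
NER_BACKENDS = {
    "FlairNLP": ["English", "German", "French", "Spanish"],
    "spaCy": ["Catalan", "Norwegian", "Portuguese"],
    "Slavic BERT (DeepPavlov)": ["Czech", "Russian"],
}

# Invert the table so each language maps to exactly one model source.
BACKEND_BY_LANGUAGE = {
    language: backend
    for backend, languages in NER_BACKENDS.items()
    for language in languages
}


def ner_backend_for(language: str) -> str:
    """Return the model source listed for a corpus language."""
    try:
        return BACKEND_BY_LANGUAGE[language]
    except KeyError:
        raise ValueError(f"No NER model listed for language: {language!r}") from None


if __name__ == "__main__":
    for language in sorted(BACKEND_BY_LANGUAGE):
        print(f"{language}: {ner_backend_for(language)}")
```

A table like this keeps the per-language choice in one place, which may help when comparing annotation quality across model sources.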

*We attributed the questionnaire languages to the aforementioned countries for metadata consistency. In reality, for a given language, the same questionnaire is administered in several other countries (e.g., the French questionnaire is administered in Belgium, Switzerland, Canada, etc.), the only difference being the salary range answer options. We included only one questionnaire for each of the aforementioned languages to avoid text repetition in the database.

 

This release concerns only the code used to compile the corpus. The MCSQ data is preserved permanently in the CLARINO repository, where it can be freely downloaded.

Files

dsorato/MCSQ_compiling-v3.0.zip

Files (925.4 kB)

md5:a297222dc34d0c1fae2d41ffef1e5db0
925.4 kB
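The MD5 checksum listed above can be used to check a downloaded copy of the archive. The following is a minimal sketch using Python's standard `hashlib` module; the local filename `MCSQ_compiling-v3.0.zip` and the helper names `md5_of_file` and `matches_checksum` are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Checksum as listed in this record.
EXPECTED_MD5 = "a297222dc34d0c1fae2d41ffef1e5db0"


def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local filename; adjust to wherever the archive was saved.
    archive = Path("MCSQ_compiling-v3.0.zip")
    if archive.exists():
        print("checksum OK" if matches_checksum(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.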

Additional details

Related works