AlpiLinK Corpus 1.0.0

Description

The AlpiLinK (Alpine Languages in Contact) Corpus is a corpus of spoken language based on crowdsourced linguistic data. It consists mainly of audio recordings and, to a lesser extent, of multiple-choice and written responses. The data is being collected as part of the AlpiLinK project (https://alpilink.it/). AlpiLinK aims to gather a quantitatively relevant amount of information about the Germanic and Romance dialects and minority languages spoken across the Alpine regions of Italy, with a particular focus on language contact between Germanic varieties (Cimbrian, Mòcheno, Saurano, Sappadino, Tyrolean, and Walser German) and Romance varieties (Francoprovençal, Friulian, Ladin, Lombard, Occitan, Piemontese, Trentino, and Veneto dialects). Data collection took place from June 2023 to September 2023.

URL: https://alpilink.it/
Contact: vinko@ateneo.univr.it

Authors

Readme structure

  1. General
  2. Abbreviations
  3. Data Structure
  4. Additional Information
  5. Error Reporting
  6. Updates

1. General

Acknowledgments

How to cite

2. Abbreviations

Languages

Tasks

3. Data Structure

The file structure is identical for each language variety and is organized as follows:

 AlpiLinK 
        |--- README.txt/.html
        +--- metadata.zip
        ¦     |--- users_results_v1.0.0.csv
        ¦     |--- questionnaire_v1.0.0.csv
        ¦     +--- I_images
        ¦          |--- I01.jpg
        ¦          |...
        +--- cim.zip
        ¦     |--- I01_cim_U0003.wav
        ¦     |--- I01_cim_U0004.wav
        ¦     |--- S02_cim_U0012.wav
        ¦     |...
        ¦       
        +--- frp.zip
        ¦     |--- ... equivalent to "cim"
        +--- lmo.zip
        ¦     |--- ... equivalent to "cim"
        +--- oci.zip
        ¦     |--- ... equivalent to "cim"
        +--- pms.zip
        ¦     |--- ... equivalent to "cim"
        +--- tir.zip
        ¦     |--- ... equivalent to "cim"
        +--- tre.zip
        ¦     |--- ... equivalent to "cim"
        +--- vec.zip
        ¦     |--- ... equivalent to "cim"
        +--- wae.zip
             |--- ... equivalent to "cim"

As can be seen, the AlpiLinK Corpus consists of:

There are audio recordings in 11 language varieties.

Each audio file name consists of the stimulus ID (e.g., S01), followed by the abbreviation of the language variety (e.g., cim) and the user ID (e.g., U0003). For example, audio file S01_cim_U0003 is a Cimbrian translation of stimulus S01 by speaker U0003. The first letter of the stimulus ID indicates the task design of the stimulus, as detailed under "Task descriptions" below.
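The naming scheme described above can be unpacked programmatically. The following is a minimal Python sketch, not part of the corpus distribution: the function name parse_recording_name, the Recording type, and the exact pattern (one task letter plus two digits, a three-letter variety code, "U" plus digits) are assumptions inferred from the example file names in this README.

```python
import re
from typing import NamedTuple

# Pattern assumed from the examples in this README (e.g. "I01_cim_U0003.wav"):
# task letter + two digits, three-letter variety code, "U" + digits.
FILENAME_RE = re.compile(
    r"^(?P<stimulus>(?P<task>[A-Z])\d{2})_"
    r"(?P<variety>[a-z]{3})_"
    r"(?P<user>U\d+)(?:\.wav)?$"
)


class Recording(NamedTuple):
    stimulus: str  # e.g. "S01"
    task: str      # first letter of the stimulus ID, e.g. "S"
    variety: str   # language variety abbreviation, e.g. "cim"
    user: str      # user ID, e.g. "U0003"


def parse_recording_name(name: str) -> Recording:
    """Split an audio file name into stimulus, task, variety, and user."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return Recording(
        match["stimulus"], match["task"], match["variety"], match["user"]
    )
```

For instance, parse_recording_name("S01_cim_U0003.wav") yields stimulus "S01", task "S", variety "cim", and user "U0003". File names that do not follow the assumed pattern raise a ValueError, which may help to spot irregular files.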

Task descriptions

M - Free production

Single item (M01) in which participants are asked to tell, in their own words, a bit about the languages present in their lives, e.g. which languages they spoke growing up, what their parents spoke, etc. Answers are recorded as audio (max. 5 min) in whatever language participants prefer, so a response may be given in the standard language as well as in a dialect or minority language.

S - Translation task

Translation task from either standard Italian or standard German, composed of 30 stimuli in total. Most stimuli are presented in both Italian and German, with the following exceptions: S09, S15, S25, S26, and S28 are presented only in Italian, and S16 and S29 are presented only in German.
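The language availability of the translation stimuli can be expressed as a small lookup. The sketch below is illustrative only: the function name questionnaire_languages and the labels "it" and "de" are hypothetical, and it assumes the 30 stimuli are numbered S01 to S30.

```python
import re
from typing import Tuple

# Exceptions as listed in this README; every other S stimulus is
# presented in both questionnaires.
ITALIAN_ONLY = frozenset({"S09", "S15", "S25", "S26", "S28"})
GERMAN_ONLY = frozenset({"S16", "S29"})


def questionnaire_languages(stimulus_id: str) -> Tuple[str, ...]:
    """Return the questionnaire languages presenting a translation stimulus.

    "it" and "de" are hypothetical labels for the Italian and German
    questionnaires; the S01-S30 numbering is an assumption.
    """
    if not re.fullmatch(r"S(0[1-9]|[12]\d|30)", stimulus_id):
        raise ValueError(f"not a translation stimulus ID: {stimulus_id!r}")
    if stimulus_id in ITALIAN_ONLY:
        return ("it",)
    if stimulus_id in GERMAN_ONLY:
        return ("de",)
    return ("it", "de")
```

Under these assumptions, 23 of the 30 stimuli appear in both questionnaires, 5 only in the Italian one, and 2 only in the German one.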

The specific variables per stimulus are reported in the Questionnaire table in the metadata folder (questionnaire_v1.0.0.ods).

I - Image description task

The image description task is presented in both the Italian and the German questionnaire and is composed of 7 items. It presents speakers with 7 images meant to elicit phrasal verbs.

The specific variables per stimulus are reported in the Questionnaire table in the metadata folder (questionnaire_v1.0.0.ods).

T - Tense transformation task

Tense transformation task from either present tense to past tense or from present tense to future tense, composed of 6 items in total. Stimuli are presented in either standard Italian or standard German, depending on the language of the questionnaire.

The specific variables per stimulus are reported in the Questionnaire table in the metadata folder (questionnaire_v1.0.0.ods).

N - Name truncation task

The multiple-choice items of the name truncation task (N01-N18) were presented to participants of all varieties except the German minority languages and the Ladin varieties. These participants instead saw only N19, an open text question about existing name truncations in their communities. The stimuli are specific to the questionnaire language: N01-N09 are presented only to participants taking the Italian questionnaire, and N10-N18 only to participants taking the German questionnaire.

N19 was implemented after 13-08-2023, which means that the Cimbrian speakers U0003-U0017 and Ladin speaker U033 received and responded to N01-N09 rather than N19.

Variables for the task are gender and age of speaker and addressee. The specific variables per stimulus are reported in the Questionnaire table in the metadata folder (questionnaire_v1.0.0.ods).

G - Sentence completion task

This section is only present in the German questionnaire and is composed of 10 items. It investigates word formation in Tyrolean with ge-N-e (or -erei) for specific phonological environments (simple/complex onset, obstruent/sonorant, stop/fricative/lateral, etc.).

The specific variables per stimulus are reported in the Questionnaire table in the metadata folder (questionnaire_v1.0.0.ods).
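As a quick reference, the task letters and item counts given in the task descriptions can be collected in a lookup table, for example to check which stimulus IDs to expect when looking for missing recordings. This is a minimal sketch: the names TASKS and expected_stimulus_ids are hypothetical, and the zero-padded two-digit numbering (e.g. G01 to G10) is an assumption based on IDs such as M01, I01, and N19 mentioned in this README.

```python
# Task letter -> (task name, number of items), as stated in this README.
TASKS = {
    "M": ("Free production", 1),
    "S": ("Translation task", 30),
    "I": ("Image description task", 7),
    "T": ("Tense transformation task", 6),
    "N": ("Name truncation task", 19),
    "G": ("Sentence completion task", 10),
}


def expected_stimulus_ids(task: str) -> list:
    """List the stimulus IDs of a task, assuming IDs run from 01 upward."""
    if task not in TASKS:
        raise KeyError(f"unknown task letter: {task!r}")
    _, count = TASKS[task]
    return [f"{task}{number:02d}" for number in range(1, count + 1)]
```

Note that not every participant sees every item (for example, the G items appear only in the German questionnaire), so the list is an upper bound on the stimuli recorded for a given speaker.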

Metadata folder

This folder contains two csv files with the relevant information about the speakers and the linguistic stimuli:

users_results_v1.0.0.csv

The speaker information includes:

questionnaire_v1.0.0.csv

Includes the information of the linguistic questionnaire:

4. Additional Information

Websites

Dissemination

Data collection and dissemination of results in the period June 2023 to September 2023 were carried out primarily through projects in collaboration with local high schools. For Veneto and Friuli Venezia Giulia, see https://sites.hss.univr.it/vinkiamo/ for more information; for VinKiamo in Südtirol, see https://vinkiamo.projects.unibz.it/. In the Aosta Valley, a school project with 3 participating schools was conducted in autumn 2023 under the name "Tutelare le lingue minoritarie con il crowd-sourcing".

5. Error Reporting

The collected files are raw audio data, and some may be missing or empty. If you spot any inconsistency, error, or corrupted recording, please contact us at vinko@ateneo.univr.it.

6. Updates

The AlpiLinK corpus is updated on a regular basis to include new data, analyses, and/or new versions of the questionnaire. The corpus is versioned according to the following logic: updates that add new responses to the questionnaire are reflected in the third numeral (e.g., v1.0.0 to v1.0.1); updates to the metadata, e.g. transcriptions or analyses, are reflected in the second numeral (e.g., v1.0.0 to v1.1.0); and any changes to the questionnaire are reflected in the first numeral (e.g., v1.0.0 to v2.0.0).
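The versioning logic can be illustrated with a short Python sketch. The function name bump_version and the change labels "responses", "metadata", and "questionnaire" are hypothetical. Resetting the lower numerals to zero when a higher one increases is an assumption borrowed from common versioning practice; the examples above are consistent with it but do not state it.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next corpus version for a given kind of update.

    change is one of the hypothetical labels:
      "responses"     -> new questionnaire responses (third numeral)
      "metadata"      -> new transcriptions, analyses (second numeral)
      "questionnaire" -> changes to the questionnaire (first numeral)
    """
    first, second, third = (int(part) for part in version.lstrip("v").split("."))
    if change == "responses":
        return f"v{first}.{second}.{third + 1}"
    if change == "metadata":
        # Assumption: the third numeral restarts at 0.
        return f"v{first}.{second + 1}.0"
    if change == "questionnaire":
        # Assumption: the second and third numerals restart at 0.
        return f"v{first + 1}.0.0"
    raise ValueError(f"unknown change type: {change!r}")
```

With these labels, bump_version("v1.0.0", "responses") gives "v1.0.1", bump_version("v1.0.0", "metadata") gives "v1.1.0", and bump_version("v1.0.0", "questionnaire") gives "v2.0.0", matching the examples above.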