<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <title>README</title>
  <style type="text/css">
      code{white-space: pre-wrap;}
      span.smallcaps{font-variant: small-caps;}
      span.underline{text-decoration: underline;}
      div.column{display: inline-block; vertical-align: top; width: 50%;}
  </style>
  <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" />
  <!--[if lt IE 9]>
    <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
  <![endif]-->
</head>
<body>
<h1 id="vinko-varieties-in-contact-corpus-v1.2">AlpiLinK Corpus 1.0.3</h1>
<h2 id="description">Description</h2>
<p>The AlpiLinK (Alpine Languages in Contact) Corpus is a corpus of spoken language built from crowdsourced linguistic data. It consists mainly of audio recordings and, to a lesser extent, of multiple-choice and written responses. The data is being collected as part of the AlpiLinK project (<a href="https://alpilink.it/" class="uri">https://alpilink.it/</a>). AlpiLinK aims to gather numerically relevant information about the Germanic and Romance dialects and minority languages spoken across the Alpine regions of Italy, with a particular focus on language contact between Germanic varieties (Cimbrian, Mòcheno, Saurano, Sappadino, Tyrolean, Timavese, Kanaltal and Walser German) and Romance varieties (Francoprovençal, Friulian, Ladin, Lombard, Occitan, Piemontese, Trentino, and Veneto dialects). Data collection for this release took place from June 2023 to December 2023.</p>
<p><strong>URL:</strong> <a href="https://alpilink.it/" class="uri">https://alpilink.it/</a><br />
<strong>Contact:</strong> <a href="mailto:vinko@ateneo.univr.it">vinko@ateneo.univr.it</a></p>
<h2 id="authors">Authors</h2>
<ul>
<li>Stefan Rabanus (University of Verona)</li>
<li>Anne Kruijt (University of Verona)</li>
<li>Birgit Alber (Free University of Bozen-Bolzano)</li>
<li>Ermenegildo Bidese (University of Trento)</li>
<li>Livio Gaeta (University of Turin)</li>
<li>Gianmario Raimondi (University of Aosta Valley)</li>
</ul>
<h1 id="readme-structure">Readme structure</h1>
<ol type="1">
<li><a href="#general">General</a></li>
<li><a href="#abbreviations">Abbreviations</a></li>
<li><a href="#data-structure">Data Structure</a></li>
<li><a href="#additional-information">Additional Information</a></li>
<li><a href="#error-reporting">Error Reporting</a></li>
<li><a href="#updates">Updates</a></li>
</ol>
<h2 id="general">1. General</h2>
<ul>
<li><strong>Creator</strong> Anne Kruijt</li>
<li><strong>Date of creation v1.0.3.</strong> 2023-12</li>
<li><strong>Date of creation v1.0.2.</strong> 2023-11</li>
<li><strong>Date of creation v1.0.1.</strong> 2023-10</li>
<li><strong>Date of creation v1.0.0.</strong> 2023-09</li>
<li><strong>Last updated</strong> 2023-11</li>
<li><strong>Size</strong> 12.316 audio files; 12.328 tokens</li>
<li><strong>License</strong> Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/" class="uri">https://creativecommons.org/licenses/by-nc-sa/4.0/</a></li>
</ul>
<h3 id="acknowledgments">Acknowledgments</h3>
<ul>
<li><p><em>Project PRIN AlpiLinK. German-Romance Language Contact in
the Italian Alps: documentation, explanation, participation</em>, 2022-2025,
University of Verona. Project code: 2020SYSYBS. Funding organization:
Ministero dell’Università e della Ricerca</p></li>
</ul>
<h3 id="how-to-cite">How to cite </h3>
<ul>
<li><strong>Reference</strong>: Please cite the dataset in the following manner:<br />
Rabanus, Stefan, Anne Kruijt, Birgit Alber, Ermenegildo Bidese, Livio Gaeta, and Gianmario Raimondi. 2023. AlpiLinK Corpus 1.0.3. In collaboration with Paolo Benedetto Mas, Sabrina Bertollo, Jan Casalicchio, Raffaele Cioffi, Patrizia Cordin, Michele Cosentino, Silvia Dal Negro, Alexander Glück, Joachim Kokkelmans, Adriano Murelli, Andrea Padovan, Aline Pons, Matteo Rivoira, Marta Tagliani, Caterina Saracco, Emily Siviero, Alessandra Tomaselli, Ruth Videsott, Alessandro Vietti &amp; Barbara Vogt. DOI:<a href="https://doi.org/10.5281/zenodo.10418858" class="uri">10.5281/zenodo.10418858</a>.</li>
</ul>
<h2 id="abbreviations">2. Abbreviations</h2>
<h3 id="languages">Languages</h3>
<ul>
<li>"cim" = Cimbrian</li>
<li>"frp" = Francoprovençal</li>
<li>"fur" = Friulian</li>
<li>"lld" = Ladin</li>
<li>"lmo" = Lombard</li>
<li>"mhn" = Mòcheno</li>
<li>"oci" = Occitan</li>
<li>"pms" = Piemontese</li>
<li>"tir" = Tyrolean</li>
<li>"tre" = Trentino</li>
<li>"vec" = Venetan</li>
<li>"wae" = Walser German</li>
</ul>

<h3 id="languages">Tasks</h3>
<ul>
<li>"M" = Free production; audio response</li>
<li>"S" = Translation task; audio response</li>
<li>"I" = Image description task; audio response</li>
<li>"T" = Tense transformation task; audio response</li>
<li>"N" = Name truncation task; multiple choice</li>
<li>"G" = Sentence completion task; audio response</li>
</ul>

<h2 id="data-structure">3. Data Structure</h2>
<p>File structure under each language variety is identical and organized as follows:</p>
<pre><code> AlpiLinK 
        |--- README.txt/.html
        +--- metadata.zip
        ¦     |--- users_results_v1.0.3.csv
        ¦     |--- questionnaire_v1.0.3.csv
        ¦     +--- I_images
        ¦          |--- I01.jpg
        ¦          |...
        +--- cim.zip
        ¦     |--- I01_cim_U0003.wav
        ¦     |--- I01_cim_U0004.wav
        ¦     |--- S02_cim_U0012.wav
        ¦     |...
        ¦       
        +--- frp.zip
        ¦     |--- ... equivalent to "cim"
        +--- fur.zip
        ¦     |--- ... equivalent to "cim"
        +--- lld.zip
        ¦     |--- ... equivalent to "cim"
        +--- lmo.zip
        ¦     |--- ... equivalent to "cim"
        +--- mhn.zip
        ¦     |--- ... equivalent to "cim"
        +--- oci.zip
        ¦     |--- ... equivalent to "cim"
        +--- pms.zip
        ¦     |--- ... equivalent to "cim"
        +--- tir.zip
        ¦     |--- ... equivalent to "cim"
        +--- tre.zip
        ¦     |--- ... equivalent to "cim"
        +--- vec.zip
        ¦     |--- ... equivalent to "cim"
        +--- wae.zip
             |--- ... equivalent to "cim"
</code></pre>
<p>As can be seen, the AlpiLinK Corpus consists of:</p>
<ul>
<li><p>1 readme file: contains the main information about the corpus and the AlpiLinK project.</p></li>
<li><p>1 metadata folder: contains tables with the questionnaire content and with the users' sociolinguistic information and linguistic data, plus an images folder with the stimuli for the image description task.</p></li>
<li><p>12 audio folders: contain the raw audio recordings collected from speakers, organized by language variety (max. size 2GB).</p></li>
</ul>
<p>
There are audio recordings in 12 language varieties.</p>
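<p>As a minimal sketch, the recordings in one language archive could be listed with Python's standard <code>zipfile</code> module. The function name <code>list_wav_names</code> is hypothetical, and whether the .wav files sit at the archive root or in a subfolder is an assumption; the check on the file extension works in either case. The demo builds a tiny in-memory archive rather than reading a real one.</p>

```python
import io
import zipfile


def list_wav_names(archive):
    """Return the sorted names of all .wav entries in a zip archive.

    `archive` may be a path (e.g. "cim.zip") or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            name for name in zf.namelist() if name.lower().endswith(".wav")
        )


# Demo: a small in-memory archive standing in for a real language archive.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    for name in ("I01_cim_U0003.wav", "S02_cim_U0012.wav", "notes.txt"):
        zf.writestr(name, b"")
demo.seek(0)

wav_names = list_wav_names(demo)
print(wav_names)
```

<p>With the downloaded corpus, the same function could be called on a path such as <code>cim.zip</code>.</p>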
<p>The audio file name always mentions the stimulus ID (e.g. S01) followed by the abbreviation of the language variety (e.g., cim) and ending in the user ID (e.g., U0003). This means that audio file S01_cim_U0003 is a Cimbrian translation of stimulus S01 by speaker U0003. The first letter of the stimulus ID indicates the task design of the stimulus:</p>
<ul>
<li>"M" = Free production; audio response</li>
<li>"S" = Translation task; audio response</li>
<li>"I" = Image description task; audio response</li>
<li>"T" = Tense transformation task; audio response</li>
<li>"N" = Name truncation task; multiple choice</li>
<li>"G" = Sentence completion task; audio response</li>
</ul>
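<p>The naming scheme above can be sketched as a small parser. The names <code>Recording</code> and <code>parse_audio_name</code> are hypothetical, and the pattern assumes that stimulus IDs are one task letter followed by digits, that variety codes are three lowercase letters, and that user IDs are "U" followed by digits, as in the examples given here.</p>

```python
import re
from typing import NamedTuple

# Task letters as listed in this README.
TASKS = {
    "M": "Free production",
    "S": "Translation task",
    "I": "Image description task",
    "T": "Tense transformation task",
    "N": "Name truncation task",
    "G": "Sentence completion task",
}

# Stimulus ID (task letter + digits), variety code, user ID (U + digits).
# The ".wav" extension is optional so that bare labels also parse.
NAME_RE = re.compile(r"^([MSITNG]\d+)_([a-z]{3})_(U\d+)(?:\.wav)?$")


class Recording(NamedTuple):
    stimulus_id: str
    variety: str
    speaker_id: str
    task: str


def parse_audio_name(name):
    """Split an audio file name into stimulus, variety, speaker and task."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected audio file name: {name!r}")
    stimulus_id, variety, speaker_id = match.groups()
    return Recording(stimulus_id, variety, speaker_id, TASKS[stimulus_id[0]])


rec = parse_audio_name("S01_cim_U0003.wav")
print(rec)
```

<p>Names that do not follow the scheme raise a <code>ValueError</code>, which may help to spot stray files in an archive.</p>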
<h3 id="languages">Task descriptions</h3>
<h4 id="phonology">M - Free production</h4>
<p>Single item (M01) in which participants are asked to talk, in their own words, about the languages present in their life, e.g. which languages they spoke growing up, what their parents spoke, etc. Answers are recorded as audio (max. 5 min) in whatever language the participant prefers, so a response may be in the standard language as well as in a dialect or minority language.</p>
<h4 id="morphology">S - Translation task</h4>
<p>Translation task from either standard Italian or standard German, composed of 30 stimuli in total. Most stimuli are presented in both Italian and German; the exceptions are S09, S15, S25, S26 and S28 (presented only in Italian) and S16 and S29 (presented only in German).</p>
<p>The specific variables per stimulus are reported in the Questionnaire table in the metadata folder (questionnaire_v1.0.3.csv).</p>
<h4 id="morphology">I - Image description task</h4>
<p>The image description task is presented in both the Italian and the German questionnaire and is composed of 7 items. It presents speakers with 7 images meant to elicit phrasal verbs.</p>
<p>The specific variables per stimulus are reported in the Questionnaire table in the metadata folder (questionnaire_v1.0.3.csv).</p>
<h4 id="syntax">T - Tense transformation task</h4>
<p>Tense transformation task from either present tense to past tense or from present tense to future tense, composed of 6 items in total. Stimuli are presented in either standard Italian or standard German depending on the questionnaire-language. </p>
<p>The specific variables per stimulus are reported in the Questionnaire table in the metadata folder (questionnaire_v1.0.3.csv).</p>
<h4 id="syntax">N - Name truncation task</h4>
<p>The multiple-choice items of the name truncation task (N01-N18) were presented to participants of all varieties except the German minority languages and the Ladin varieties. These participants instead saw only N19, an open text question about existing name truncations in their communities. The multiple-choice stimuli are specific to the questionnaire language: N01-N09 are presented only to participants taking the Italian questionnaire, and N10-N18 only to participants taking the German questionnaire.</p>
<p>N19 was implemented after 13-08-2023, which means that the Cimbrian speakers U0003-U0017 and Ladin speaker U033 received and responded to N01-N09 rather than N19.</p>
<p>Variables for the task are gender and age of speaker and addressee. The specific variables per stimulus are reported in the Questionnaire table in the metadata folder (questionnaire_v1.0.3.csv).</p>
<h4 id="phonology">G - Sentence completion task</h4>
<p>This section is only present in the German questionnaire and is composed of 10 items. It investigates word formation in Tyrolean with ge-N-e (or -erei) for specific phonological environments (simple/complex onset, obstruent/sonorant, stop/fricative/lateral, etc.).</p>
<p>The specific variables per stimulus are reported in the Questionnaire table in the metadata folder (questionnaire_v1.0.3.csv).</p>

<h3 id="metadata-folder">Metadata folder</h3>
<p>This folder contains two csv files with the relevant information about the speakers and the linguistic stimuli:</p>
<h4 id="users">users_results_v1.0.3.csv</h4>
<p>The speaker information includes:</p>
<ul>
<li>General information about the survey: speaker_id (included in the audio file name), date of participation, note (comments on the survey), and survey_language (it or de).</li>
<li>Sociolinguistic information about the speaker: the linguistic variety of the speaker written in full (language_variety) and as a three-letter code (language_variety_iso; the ISO 639-3 code where one is available, otherwise an abbreviation based on the language name), the location they are from and whether they still live there at the time of participation (resp. location_province, location_municipality, location_presence), age (age), gender (gender), and language proficiency, frequency of use, and use of the language variety with family and friends (resp. language_compentence, language_use, language_family, language_friends; Likert scale). It also includes information about which other languages they speak (other_languages) and whether they use their dialect/minority language in written form and, if so, where (written_use).</li>
<li>Geographical information linked to the municipality: columns ISTAT_it, GID, and coordinates(lat, lon). The ISTAT name refers to the Italian name of the municipality as used by the geographical data of ISTAT. The GID codes can be used to import data into the GIS system of the REDE (<a href="https://www.regionalsprache.de/home.aspx" class="uri">regionalsprache.de</a>) project of the University of Marburg, a collaborating partner of the AlpiLinK project. Some GIDs refer to municipalities which have since merged into new ones; this is always indicated in the note column. The coordinates column gives the centre of the municipality's shapefile in latitude, longitude format.</li>
<li>Answers to name truncation task: N01-N19; written responses.</li>
<li>Answers to other tasks: labels of audio files included.</li>
</ul>
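<p>As a rough illustration of working with this table, the sketch below counts speakers per variety using Python's standard <code>csv</code> module. It assumes a comma-delimited file with a header row that uses the column names listed above and one row per speaker; none of this is confirmed here, so the delimiter may need adjusting. The function name <code>speakers_per_variety</code> and the sample rows are hypothetical.</p>

```python
import csv
import io
from collections import Counter


def speakers_per_variety(csv_file):
    """Count rows (assumed: one per speaker) per language_variety_iso code."""
    reader = csv.DictReader(csv_file)
    return Counter(row["language_variety_iso"] for row in reader)


# Hypothetical sample rows; the real table has many more columns.
sample = io.StringIO(
    "speaker_id,language_variety_iso,survey_language\n"
    "U0003,cim,it\n"
    "U0004,cim,it\n"
    "U0059,wae,de\n"
)
counts = speakers_per_variety(sample)
print(counts)
```

<p>For the real file, an open file handle (for example <code>open("users_results_v1.0.3.csv", newline="", encoding="utf-8")</code>, where the encoding is likewise an assumption) could be passed in place of the sample.</p>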
<h4 id="words">questionnaire_v1.0.3.csv</h4>
<p>Includes the information of the linguistic questionnaire:</p>
<ul>
<li>Stimulus information: stimulus_id (included in the audio file name, image name, and in the users_results file), target_it (target in standard Italian), stimulus_it (presented stimulus in standard Italian), target_de (target in standard German), stimulus_de (presented stimulus in standard German), task_format (task per stimulus, e.g. image description, translation, multiple choice etc.) and response_format (type of response, e.g. audio, written multiple choice options, etc.).</li>
<li>Variables present in the target sentence:
<ul>
<li>variable_sbjpron (subject pronouns/clitics)</li>
<li>variable_objpron (object pronouns/clitics)</li>
<li>variable_verbinflection (verb class + inflection)</li>
<li>variable_wordformation (word formation of nouns)</li>
<li>variable_pn (proper nouns/names)</li>
<li>variable_indef (indefiniteness)</li>
<li>variable_pos (possession)</li>
<li>variable_deixis (deixis)</li>
<li>variable_syntax (syntax)</li>
<li>variable_phonology (phonology)</li>
</ul>
</li>
<li>note</li>
</ul>
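<p>The variable columns could be used to select stimuli, as in the sketch below. It assumes a comma-delimited table in which an empty cell means that the variable is not targeted by the stimulus; the actual cell contents are not described here, so the "x" values in the sample are placeholders. The function name <code>stimuli_with_variable</code> is hypothetical.</p>

```python
import csv
import io


def stimuli_with_variable(csv_file, variable):
    """Return the stimulus IDs whose `variable` column is non-empty."""
    reader = csv.DictReader(csv_file)
    return [
        row["stimulus_id"]
        for row in reader
        if (row.get(variable) or "").strip()
    ]


# Hypothetical sample; "x" is a placeholder for whatever the cells contain.
sample = io.StringIO(
    "stimulus_id,task_format,variable_deixis,variable_syntax\n"
    "S01,translation,,x\n"
    "S02,translation,x,\n"
    "I01,image description,,x\n"
)
syntax_ids = stimuli_with_variable(sample, "variable_syntax")
print(syntax_ids)
```

<p>The resulting IDs match the first part of the audio file names, so they can be used to pick out the corresponding recordings.</p>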

<h2 id="additional-information">4. Additional information</h2>
<h3 id="websites">Websites</h3>
<ul>
<li>AlpiLinK website<br />
<a href="https://alpilink.it/" class="uri">https://alpilink.it/</a></li>
</ul>
<h3 id="dissemination">Dissemination</h3>
<p>Data collection and dissemination of results in the period June 2023 to December 2023 were primarily carried out through projects in collaboration with local high schools. For Veneto and Friuli Venezia Giulia, see <a href="https://sites.hss.univr.it/vinkiamo/" class="uri">https://sites.hss.univr.it/vinkiamo/</a>; for VinKiamo in Südtirol, see <a href="https://vinkiamo.projects.unibz.it/" class="uri">https://vinkiamo.projects.unibz.it/</a>; and for VinKiamo in Trentino, see <a href="https://vinkiamo.unitn.it/" class="uri">https://vinkiamo.unitn.it/</a>. In the Aosta Valley, a school project with 3 participating schools, named "Tutelare le lingue minoritarie con il crowd-sourcing", was conducted in autumn 2023.</p>
<h2 id="error-reporting">5. Error reporting</h2>
<p>The collected files are raw audio data, and some may be missing or empty. If you spot any inconsistency, error, or corrupted recording, please contact us at <a href="mailto:vinko@ateneo.univr.it">vinko@ateneo.univr.it</a>.</p>
<h2 id="updates">6. Updates</h2>
<p>The AlpiLinK corpus is updated on a regular basis to include new data, analysis, and/or new versions of the questionnaire. The versioning of the corpus follows this logic: updates that add new responses to the questionnaire are reflected in the third numeral (e.g., v1.0.0 to v1.0.1); updates to the metadata, e.g. transcriptions or analysis, are reflected in the second numeral (e.g., v1.0.0 to v1.1.0); and any changes to the questionnaire are reflected in the first numeral (e.g., v1.0.0 to v2.0.0).</p>
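<p>The versioning logic can be illustrated with a short sketch that compares two version labels. The function names <code>parse_version</code> and <code>describe_update</code> and the returned descriptions are hypothetical; the sketch only encodes the three-numeral rule stated above and is not part of the corpus tooling.</p>

```python
def parse_version(label):
    """Turn 'v1.0.3' (or '1.0.3') into a tuple of three integers."""
    parts = label.lstrip("v").split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three numerals: {label!r}")
    return tuple(int(part) for part in parts)


def describe_update(old, new):
    """Name the kind of change between two corpus versions."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v[0] != old_v[0]:
        return "questionnaire changed"
    if new_v[1] != old_v[1]:
        return "metadata updated"
    if new_v[2] != old_v[2]:
        return "new responses added"
    return "no change"


print(describe_update("v1.0.2", "v1.0.3"))
```

<p>The most significant numeral that differs decides the description, so a change in the first numeral takes precedence over changes in the other two.</p>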

<ul>
<li>v1.0.0 > v1.0.1</li>
<ul>
<li>New data added</li>
<ul>
<li>New speakers: U0058-U0080.</li>
<li>Audio files from 2.056 to 2.851.</li>
</ul>
<li>Folder for Mòcheno (mhn) added.</li>
<li>Correction in the User_results_v1.0.1.csv table: v1.0.0 included an extra, incorrect column for S12.</li>
<li>Dissemination: VinKiamo Trentino website added.</li>
</ul>
<li>v1.0.1 > v1.0.2</li>
<ul>
<li>New data added</li>
<ul>
<li>New speakers: U0081-U0240.</li>
<li>Audio files from 2.851 to 8.025.</li>
</ul>
<li>Folder for Friulian (fur) added.</li>
<li>User_results_v1.0.2.csv: survey_language and location_municipality columns corrected. Three new columns added: ISTAT_it, GID and coordinates(lat, lon).</li>
</ul>
<li>v1.0.2 > v1.0.3</li>
<ul>
<li>New data added</li>
<ul>
<li>New speakers: U0241-U0336.</li>
<li>Audio files from 8.025 to 12.316.</li>
</ul>
<li>Errors corrected</li>
<ul>
<li>Missing data added: audio files from speakers U0228-U0240 were missing in version 1.0.2 and have been added in 1.0.3.</li>
<li>Empty audio files deleted: S30_oci_U0022, S30_oci_U0022, S13_wae_U0059.</li>
</ul>
</ul>
</ul>
</body>
</html>