Cinema & Civiltà corpus

Roberta Bianca Luzietti; Niccolò Pretto; Sergio Canazza; Frédéric Kaplan; Alain Dufaux; Alessandra Origani; Ilenia Maschietto; Costanza Blaskovic

doi:10.5281/zenodo.5645827

Published November 18, 2021 | Version 1

Dataset Restricted


1. University of Padova
2. EPFL
3. University of Zurich
4. Giorgio Cini Foundation

The corpus consists of eight .xml files containing the transcriptions of the Cinema & Civiltà conference recordings, produced from the digitized copies; see Luzietti, R.B., Pretto, N., Kaplan, F., Dufaux, A., Canazza, S. (2021). FONTI 4.0: evaluating speech-to-text automatic transcription of digitized historical oral sources. In Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021).

The name of each file coincides with the name of the corresponding preservation copy, e.g., FCINI002a. FCINI stands for Giorgio Cini Foundation, the number identifies the magnetic tape on which that part of the conference was recorded (002, 003, etc.), and the letter a or b indicates the side of the tape.

The correct order of the files is as follows (note that for tapes 003 and 004, side b precedes side a):

1) FCINI002a

2) FCINI002b

3) FCINI003b

4) FCINI003a

5) FCINI004b

6) FCINI004a

7) FCINI005a

8) FCINI005b
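The naming convention and file order described above can be sketched in Python. This is a minimal illustration, not part of the corpus tooling: the function names are hypothetical, and the pattern assumes every name has the form FCINI, a three-digit tape number, and a side letter a or b, optionally followed by an .xml extension.

```python
import re

# Order of the files exactly as listed in the record description.
CORRECT_ORDER = [
    "FCINI002a", "FCINI002b", "FCINI003b", "FCINI003a",
    "FCINI004b", "FCINI004a", "FCINI005a", "FCINI005b",
]

# Assumed shape of a name: archive prefix, three-digit tape number, side letter.
NAME_PATTERN = re.compile(r"^(FCINI)(\d{3})([ab])$")


def parse_name(name):
    """Split a preservation-copy name into (archive, tape number, side)."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    archive, tape, side = match.groups()
    return archive, int(tape), side


def sort_in_conference_order(filenames):
    """Sort file names (with or without '.xml') into the listed order."""
    rank = {name: index for index, name in enumerate(CORRECT_ORDER)}

    def key(filename):
        # Drop an assumed '.xml' extension before looking up the rank.
        stem = filename[:-4] if filename.lower().endswith(".xml") else filename
        return rank[stem]

    return sorted(filenames, key=key)
```

An explicit lookup table is used rather than alphabetical sorting because the listed order is not alphabetical: a plain sort would place FCINI003a before FCINI003b.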

Each transcription file is annotated according to the Text Encoding Initiative (TEI) guidelines and tag set (https://tei-c.org/release/doc/tei-p5-doc/en/html/index.html). The first part of each TEI transcription file is the Header, which holds technical metadata about the recorded file: the title, owner, editor, extent, duration, recording equipment, date of the recording, date of digitization, languages, and speakers.

The second part of the TEI file is the Body, which contains the text and further annotations. Two things are indicated at the beginning of each segment (i.e., <annotation block>): the language, if different from Italian, using the tag <foreign> (e.g., <foreign xml:lang="fr-FR">...</foreign> for French and <foreign xml:lang="es-ES">...</foreign> for Spanish); and the audio quality, using the tag <note> with one of the values excellent, good, fair, poor, or bad (e.g., <note> good </note>). The annotations within the text are: <shift>; <distinct>; <unclear>; <del>; <overlap>; <gap>; <vocal> and <incident>.
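As a rough illustration of how these annotations might be inspected, the following Python sketch tallies the in-text annotation tags and collects the language codes of <foreign> elements using the standard library. The sample document is invented for the example and is not taken from the corpus; the function names are hypothetical, and the actual files may nest or namespace their elements differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# In-text annotation tags named in the record description.
ANNOTATION_TAGS = {
    "shift", "distinct", "unclear", "del",
    "overlap", "gap", "vocal", "incident",
}

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical miniature document, NOT taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<u><foreign xml:lang="fr-FR">bonjour <unclear>mes amis</unclear></foreign></u>
<u>buonasera <gap/> a tutti <vocal/> <unclear>grazie</unclear></u>
</body></text></TEI>"""


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def count_annotations(xml_text):
    """Count occurrences of each in-text annotation tag."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        if isinstance(element.tag, str):
            name = local_name(element.tag)
            if name in ANNOTATION_TAGS:
                counts[name] += 1
    return counts


def foreign_languages(xml_text):
    """List the xml:lang values of all <foreign> elements, in document order."""
    root = ET.fromstring(xml_text)
    return [
        element.get(XML_LANG)
        for element in root.iter()
        if isinstance(element.tag, str) and local_name(element.tag) == "foreign"
    ]
```

On the invented sample, count_annotations(SAMPLE) reports two <unclear>, one <gap>, and one <vocal>, and foreign_languages(SAMPLE) returns ["fr-FR"]. Matching on the local tag name keeps the sketch independent of whether the files declare the TEI namespace.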

For more details, see Luzietti, R.B., Pretto, N., Kaplan, F., Dufaux, A., Canazza, S. (2021). FONTI 4.0: evaluating speech-to-text automatic transcription of digitized historical oral sources. In Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021).

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

Request via e-mail to: archive@cini.it, niccolo.pretto@gmail.com & roberta.luzietti@yahoo.com


Additional details

Luzietti, R.B., Pretto, N., Kaplan, F., Dufaux, A., Canazza, S. (2021). FONTI 4.0: evaluating speech-to-text automatic transcription of digitized historical oral sources. In Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021).

