Data set name: Vlexique: French Verbal Paradigms in Phonemic Notation

Citation (if available): Beniamine, Coavoux, Bonami. Vlexique2.0 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10638682

Data set developer(s): Sacha Beniamine

Data sheet author(s): Sacha Beniamine

Others who contributed to this document: None

# Motivation

**For what purpose was the dataset created?** Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

This dataset was created in order to provide data on French Verbal Inflection. It is intended for use in NLP and linguistic investigation. 

**Who created the dataset (for example, which team, research group) and on behalf of which entity (for example, company, institution, organization)?**

This dataset was created by Sacha Beniamine¹, Maximin Coavoux², and Olivier Bonami³. It is largely based on [Flexique v.1.3](http://www.llf.cnrs.fr/fr/flexique-fr.php), itself derived from [Lexique](http://www.lexique.org/). Orthographic forms are taken from [Démonette](http://redac.univ-tlse2.fr/lexiques/demonette.html).

Affiliations: 
¹Surrey Morphology Group, University of Surrey, United Kingdom
²Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
³Laboratoire de Linguistique Formelle, Université Paris Cité, France

**Who funded the creation of the dataset?** If there is an associated grant, please provide the name of the grantor and the grant name and number.

This work was partially funded by the Leverhulme Early Career Fellowship (ECF-2022-286) awarded to Sacha Beniamine.


# Composition

Paralex datasets document paradigms of inflected forms.

**Are forms given as orthographic, phonetic, and/or phonemic sequences?**

Forms are given in both phonemic and orthographic notation.

**How many instances are there in total?**

- Number of inflected forms: 274855 distinct inflected forms
- Number of lexemes: 5273 lexemes
- Maximal paradigm size in cells: 51 cells

**Language varieties** 
> Languages differ from each other in structural ways that can interact with NLP algorithms. Within a language, regional or social dialects can also show great variation (Chambers and Trudgill, 1998). The language and language variety should be described with a language tag from BCP-47 identifying the language variety (e.g., en-US or yue-Hant-HK), and a prose description of the language variety, glossing the BCP-47 tag and also providing further information (e.g., "English as spoken in Palo Alto, California", or "Cantonese written with traditional characters by speakers in Hong Kong who are bilingual in Mandarin").

-   BCP-47 language tag: fr
-   Language variety description: French

**Does the data pertain to specific dialects, geographical locations, genre, etc ?**

The lexicon aims to reproduce the conventions of standard French.

**Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?** 
> If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (for example, geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (for example, to cover a more diverse range of instances, because instances were withheld or unavailable).

The dataset does not contain all French verbs: more certainly exist. The set of lexemes was taken from Flexique v.1.3, which comprises numerous additions relative to the previous version. Flexique originally aimed to inflect all verbs found in Lexique 3.7; more lexemes were added through recent updates, targeting in particular any missing frequent verbs.

**Is any information missing from individual instances?** 
> If so, please provide a description, explaining why this information is missing (for example, because it was unavailable). This does not include intentionally removed information, but might include, for example, redacted text.

Defective verbs are annotated. This information originates from two resources:

- Flexique, for phonemic forms
- Demonette, for orthographic forms

Unfortunately, the two resources do not fully agree on defectiveness. For example, both mark "éclore" as defective in the indicative past and subjunctive past. However, Vlexique marks "éclore" as defective for the imperfect and for the present 1pl and 2pl, whereas Demonette lists forms such as "éclosais, éclosais, éclosait... éclosons, éclosez". We did not attempt to homogenize the two sources, so one column might provide a form while the other indicates "#DEF#". This perhaps reflects the vagueness of the category of defectives. When working on defectiveness, one therefore needs to decide which resource to follow, or how to merge the information.

Frequencies provide another source of information on defectiveness. A frequency of 0 does not by itself indicate defectiveness (due to the Zipfian distribution of forms), but when the resources disagree, a positive frequency might indicate a form that is actually in use. For example, "surfaire" in the ptcp.pst.f.sg is marked as defective for the phonemic form (Flexique), but Demonette provides "surfaite", which has 43 occurrences in the Open Subtitle Corpus. Compare "quérir", which for the same cell presents #DEF# according to Flexique, but "quise" according to Demonette, with frequency 0.
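As a rough illustration of how such disagreements could be located, the sketch below flags rows where exactly one of the two form columns is marked "#DEF#", and then uses frequency as a tie-breaking hint. The column names (`lexeme`, `cell`, `phon_form`, `orth_form`, `freq`), the cell label of the "éclore" row, and its frequency are assumptions made for this example; only the "surfaire" and "quérir" values come from the description above.

```python
import csv
import io

DEFECTIVE = "#DEF#"

# Illustrative rows only; column names and layout are assumed, not taken
# from the released tables.
SAMPLE = """lexeme,cell,phon_form,orth_form,freq
surfaire,ptcp.pst.f.sg,#DEF#,surfaite,43
quérir,ptcp.pst.f.sg,#DEF#,quise,0
éclore,ind.pst.1sg,#DEF#,#DEF#,0
"""


def defectiveness_conflicts(rows):
    """Yield rows where exactly one of the two form columns is defective."""
    for row in rows:
        phon_def = row["phon_form"] == DEFECTIVE
        orth_def = row["orth_form"] == DEFECTIVE
        if phon_def != orth_def:
            yield row


def attested(row):
    """A positive frequency hints that the form is actually in use."""
    return int(row["freq"]) > 0


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
conflicts = list(defectiveness_conflicts(rows))
in_use = [row["lexeme"] for row in conflicts if attested(row)]
```

A zero frequency is deliberately not treated as evidence of defectiveness here; it only fails to support the form.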

**Are there any errors, sources of noise, or redundancies in the dataset?** If so, please provide a description.

Not to the best of our knowledge, although we welcome any feedback if mistakes were made.

**Is the dataset self-contained, or does it link to or otherwise rely on external resources (for example, websites, tweets, other datasets)?**
> If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (that is, including the external resources as they existed at the time the dataset was created); c) are there any restrictions (for example, licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.

The dataset is complete and self-contained.

**If linking to vocabularies from other databases (such as databases of features, cells, sounds, languages, or online dictionaries), were there any complex decisions in the matching of entries from this dataset to those of the vocabularies (e.g. inexact language code)?**

Matching of the paradigm cells was straightforward, and was done for a large number of vocabularies: that of GRACE (Adda et al., 1998), the original Flexique scheme (Bonami et al. 2014), Unimorph (Batsuren 2022), Universal Dependencies (Zeman, 2023), French Treebank, and a semantic decomposition (Bonami & Boyé, 2005).

Matching of lexemes between Demonette and Flexique required the following adjustments:

- Some derived verbs from Flexique were absent from Demonext. We added orthographic entries by prefixing the base, when it existed. This affected: re-rentrer, re-blinder, ex-officier, après-déjeuner, entr'égorger, re-décorer, re-échanger, entr'aimer, re-déménager, re-signer, co-signer, co-exister, re-biberonner, re-vérifier, re-réparer, re-chiader, re-respirer.
- Some compounds from Flexique are written without a hyphen, while they have one in Demonext. In order to match lexemes, we added copies of Demonext compounds with the hyphen removed.
- Missing verbs from Demonext were added by copying parts of other verbs:
  - éliciter (after féliciter)
  - désenrouler (by prefixing dés- to enrouler)
  - choser (by replacing p- by ch- in poser)
  - liposucer (by prefixing lipo- to sucer)
  - rafantir (by replacing n- by raf- in nantir)
  - friseler (by replacing c- by fr- in ciseler)
  - onder (by removing the initial s- of sonder)
  - fier (by removing dé- in défier)
  - pugiler (by replacing vi- by pu- in vigiler)
  - jogger 
  - two entries for ressortir ('exit again' and 'pertain to') were made by splitting the overabundant forms from the only entry in Demonette.
  - two entries for saillir were similarly made from the overabundant forms given in Demonette.
- Orthographic modifications were made to match conventions: using <oe> (not <œ>), and taking the NFKC normalisation of the citation form.
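The orthographic normalisation in the last item can be sketched as follows. This is a minimal illustration and not the script actually used; the function name `normalize_citation` is hypothetical. Note that NFKC alone leaves <œ> untouched, so the replacement by <oe> is a separate step.

```python
import unicodedata


def normalize_citation(form: str) -> str:
    """Apply NFKC normalisation, then spell the ligature <œ> as <oe>."""
    form = unicodedata.normalize("NFKC", form)
    # NFKC does not decompose the ligature, so it is replaced explicitly.
    return form.replace("œ", "oe").replace("Œ", "Oe")


# A decomposed "é" (e + combining acute accent) becomes the composed character.
composed = normalize_citation("de\u0301fier")
no_ligature = normalize_citation("œuvrer")
```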

Moreover, overabundant entries were matched semi-manually by writing rules, some lexeme-specific and some more general, and iteratively adding rules until all cases were matched. This involved extensive manual verification. The rules also assign overabundance tags.

The rules are as follows. Where an orthographic or phonemic match wasn't found, it was created by modifying the existing one:

- Purely orthographic variants, where a compound form is given both with and without hyphens but the citation form has the hyphen, were normalized to follow the citation form.
- In becqueter, phonemic forms were copied for all three variants, the standard becquete-, the old orthography becquett- and the new orthography becquèt-.
- For verbs in -ayer, overabundant stems in -aye- were matched to phonemic forms in /-ɛj-/, while stems in -aie- were matched to phonemic forms in /-E-/.
- For ouïr, phonemic variants in /wa-/, /oy-/ and /o-/ were created to match orthographic variants "oi-", "oy-", and "orr-". Existing forms in "wi-" were matched to the orthographic ouï-.
- Orthographic variant forms in -tisse (non-standard variant) and -tît (standard variant) were matched respectively to phonemic variants in /-is/ and /-i/ (e.g. abattre).
- In absoudre, dissoudre, résoudre, we matched respectively -olves to /-olv/ and -ous to /-u/ 
- In verbs in -éer with overabundance in:
  - -éions vs -éons, these were matched respectively to forms in /-ejɔ̃/ and /-eɔ̃/ 
  - -éiez vs -éez, these were matched respectively to forms in /-eje/ and /-ee/ 
- In verbs in -illir that hesitate between endings in "-iller-" and "-illir-", we associate the first with forms in /jəʁ/ and the second with forms in /jiʁ/ (e.g. défailleriez vs défailliriez).
- In verbs in -uire, orthographic -uisi- is matched to /ɥizi/ and -ui- to /ɥi/.
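To give an idea of what such rules can look like, here is a minimal sketch for the -ayer case only. It is not the code used to build the dataset: the rule format, the function name `match_variants`, and the example transcriptions are assumptions made for illustration.

```python
import re

# Each rule pairs an orthographic pattern with a phonemic pattern.
# Only the -ayer rule from the list above is encoded here.
RULES = [
    (re.compile(r"aye"), re.compile(r"ɛj")),
    (re.compile(r"aie"), re.compile(r"E")),
]


def match_variants(orth_forms, phon_forms, rules=RULES):
    """Pair overabundant orthographic and phonemic variants of one cell.

    A pair is produced only when a rule selects exactly one form on each
    side; anything else is left unmatched for manual inspection.
    """
    pairs = []
    for orth_re, phon_re in rules:
        orths = [o for o in orth_forms if orth_re.search(o)]
        phons = [p for p in phon_forms if phon_re.search(p)]
        if len(orths) == 1 and len(phons) == 1:
            pairs.append((orths[0], phons[0]))
    return pairs


# Hypothetical variants for one cell of "payer".
pairs = match_variants(["paies", "payes"], ["pE", "pɛj"])
```

Requiring a unique candidate on each side keeps the sketch conservative: ambiguous cells stay unmatched rather than being paired arbitrarily, which fits the iterative, manually verified process described above.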

Moreover, for a number of more specific lexemes, the following matches were made:

| lexeme     | orthographic forms in | phonemic forms in |
|------------|-----------------------|------------------|
| arguer     | argue-                | aʁgə-            |
| arguer     | arguë-,argüe-         | aʁgyə-           |
| entre-haïr | entre-hai             | ɑ̃tʁE-           |
| entre-haïr | entre-haï             | ɑ̃tʁai-          |
| ouïr       | oi-                   | wa-              |
| ouïr       | orr-                  | o-               |
| ouïr       | ouï                   | wi-              |
| surseoir   | sursié-               | syʁsjE-          |
| surseoir   | surseoi-              | syʁswa-          |
| asseoir    | assey-                | asEj-            |
| asseoir    | assoi-;asseoi-        | aswa-            |
| bénir      | béni                  | bEni             |
| bénir      | bénite                | bEnit            |
| catir      | -is-                  | -i-              |
| catir      | -isse-                | -is-             |
| choir      | cherr-                | ʃEʁ-             |
| choir      | choir-                | ʃwaʁ-            |
| communier  | -ni-                  | -nj-             |
| communier  | -nii-                 | -nij-            |
| -dire      | -disez                | -dize            |
| -dire      | -dites                | -dit             |
| dire       | die                   | di               |
| dire       | dise-                 | diz-             |
| défaillir  | défaille              | -faj             |
| défaillir  | défaut                | -fo              |
| départir   | départiss-            | dEpaʁtis-        |
| départir   | dépar-                | dEpaʁ-           |
| départir   | départi-              | dEpaʁti-         |
| départir   | départe-              | dEpaʁt-          |
| faillir    | fail-                 | faj-             |
| faillir    | fau-                  | fo-              |
| -fleurir   | fleur-                | flØʁ-            |
| -fleurir   | flor-                 | flOʁ-            |
| justifier  | -fi-                  | -fj-             |
| justifier  | -fii-                 | -fij-            |
| pouvoir    | peux                  | pø               |
| pouvoir    | puis                  | pɥi              |
| proscrire  | -cris-                | -kʁiz-           |
| proscrire  | -criv-                | -kʁiv-           |
| seoir      | seyant-               | sEjɑ̃-           |
| seoir      | séant-                | sEɑ̃-            |
| surseoir   | -sey-                 | -Ej-             |
| surseoir   | -soy-                 | -waj-            |
| surseoir   | -sie-                 | -sjE-            |
| surseoir   | -soi-                 | -swa-            |
| vouloir    | veuill-               | vØj-             |
| vouloir    | voul-                 | vul-             |
| vouloir    | veu-                  | vø-              |
| échoir     | éche                  | eʃE-             |
| échoir     | échoi                 | eʃwa-            |
| -faire     | faites                | fɛt              |
| -faire     | faisez                | fəze             |
| haïr       | haï                   | ai               |
| haïr       | hai                   | E                |


**Does the dataset contain data that might be considered confidential (for example, data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)?** If so, please provide a description.

No.

**Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?** If so, please describe why.

No.

**Any other comments?**

No.

# Collection process

**What is the provenance of each table (lexemes, cells, forms, frequencies, sounds, features), as well as of segmentation marks if any? Is any information derived from other datasets?**
> Was any information (forms, lexemes, cells, frequency) extracted from a corpus, a dictionary, elicited, extracted from field notes, digitized from grammars, generated? What are the sources?

- Frequencies (in all tables): OpenSubtitle corpus, French section, tagged automatically. See Beniamine et al. 2024 for details.
- Forms: 
  - Transcriptions: Flexique
  - Orthographic: Demonext
- Lexemes: Originally from Flexique. Inflection classes were added manually following the 3 broad macroclasses of French. 
- Phonemes: the table was adapted from that provided in Beniamine (2018), which itself followed Dell (1973).

**How were paradigms separated between lexemes (eg. in the case of homonyms or variants) ? What theoretical or practical choices were made ?**

Following Flexique, any orthographic variants or homonyms with identical pronunciation were merged into single lexemes. For example, pécher and pêcher share a single entry. The `variants` column of the lexemes table lists those variants, separated by ":".
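Reading the `variants` column then amounts to splitting on ":". The sketch below is only illustrative: the helper name `split_variants` is hypothetical, and the example value is assumed rather than copied from the lexemes table.

```python
def split_variants(cell_value: str) -> list[str]:
    """Split a ':'-separated variants cell, ignoring empty values."""
    return [variant for variant in cell_value.split(":") if variant]


# Assumed cell content for the merged entry mentioned above.
variants = split_variants("pécher:pêcher")
```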

**How was the paradigm structure (set and labels of paradigm cells) decided ? What theoretical or practical choices were made ?**

Our selection of cells reflects the structure of the French verbal paradigm: six person/number combinations combine with 8 tenses, except for the imperative, which exists for only three persons. In addition, non-finite forms include the infinitive, the present participle, and the past participle, the latter inflected for gender and number.
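The following sketch enumerates such a paradigm to show how this description yields the maximal paradigm size of 51 cells reported under Composition. The cell labels are illustrative and are not necessarily the identifiers used in the cells table.

```python
from itertools import product

PERSON_NUMBER = ["1sg", "2sg", "3sg", "1pl", "2pl", "3pl"]
# Seven tenses with all six person/number combinations (illustrative labels).
FULL_TENSES = ["ind.prs", "ind.ipfv", "ind.pst", "ind.fut",
               "cond", "sbjv.prs", "sbjv.pst"]
# The eighth tense, the imperative, exists for three persons only.
IMPERATIVE = ["2sg", "1pl", "2pl"]

finite = [f"{tense}.{pn}" for tense, pn in product(FULL_TENSES, PERSON_NUMBER)]
imperative = [f"imp.{pn}" for pn in IMPERATIVE]
# The past participle inflects for gender and number.
past_participle = [f"ptcp.pst.{gender}.{number}"
                   for gender, number in product(["m", "f"], ["sg", "pl"])]
non_finite = ["inf", "ptcp.prs"] + past_participle

cells = finite + imperative + non_finite
# 7 * 6 + 3 + 1 + 1 + 4 = 51
n_cells = len(cells)
```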


**What is the expertise of the contributors with the documented language ?**
> Are they areal experts, language experts, native speakers?

All contributors to the dataset are native speakers and language experts.

**How was the data collected (for example, manual human curation, generation by software programs, software APIs, etc)?** How were these mechanisms or procedures validated?

Forms for Flexique were in part generated (see Bonami et al. 2014). Both Flexique and Demonext received automated rule-based additions in order to merge the resources. These additions were few enough that we could check them manually.

**If the dataset is a sample from a larger set, what was the sampling strategy (for example, deterministic, probabilistic with specific sampling probabilities)?**
> Curation rationale: Which lemmas, forms, cells were included and what were the goals in selecting entries, both in the original collection and in any further sub-selection? This can be especially important in datasets too large to thoroughly inspect by hand. An explicit statement of the curation rationale can help dataset users make inferences about what other kinds of texts systems trained with them could conceivably generalize to.

NA

**Who was involved in the data collection process (for example, students, crowdworkers, contractors) and how were they compensated (for example, how much were crowdworkers paid)?**

NA

**Over what timeframe was the data collected?** Does this timeframe match the creation timeframe of the data associated with the instances (for example, recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

NA

**Were any ethical review processes conducted (for example, by an institutional review board)?** If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.

NA 

**Any other comments?**

No.

# Preprocessing/cleaning/labeling

**How were the inflected forms obtained ?**  If generated, what was the generation process ? Is the software for generation available ?

See above regarding sources of forms, and in particular Bonami & al 2014. 

**If relevant, how were the forms segmented ?**

NA

**Was any preprocessing/cleaning/labeling of the data done (for example, discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values, cleaning of labels, mapping between vocabularies, etc)?** If so, please provide a description. If not, you may skip the remaining questions in this section. This includes estimation of frequencies.

The process for estimating frequencies is detailed in Beniamine et al. (2024).

**Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (for example, to support unanticipated future uses)?** If so, please provide a link or other access point to the "raw" data.

The "raw" data is available as Vlexique 1.3; Demonette; and the (public and open) resources used in calculating frequencies.

**Is the software that was used to preprocess/clean/label the data available?** If so, please provide a link or other access point.

The tagger used to generate frequencies is available at: 

**Any other comments?**


# Uses

**Has the dataset been used for any published work already?** If so, please provide a description.

Not yet.

**Is there a repository that links to any or all papers or systems that use the dataset?** If so, please provide a link or other access point.

No.

**What (other) tasks could the dataset be used for?**

Any NLP task concerned with inflection and based on phonemic form; linguistic investigations into inflection, whether quantitative or qualitative.

**Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?** For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (for example, stereotyping, quality of service issues) or other risks or harms (for example, legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms?

NA 

**Are there tasks for which the dataset should not be used?** If so, please provide a description.

NA 

**Any other comments?**

No.

# Distribution

**Will the dataset be distributed to third parties outside of the entity (for example, company, institution, organization) on behalf of which the dataset was created?** If so, please provide a description.

No.

**How will the dataset be distributed (for example, tarball on website, API, GitHub)?** Does the dataset have a digital object identifier (DOI)?

- DOI: https://doi.org/10.5281/zenodo.10638682 (the DOI points to a Zenodo deposit)
- The dataset is available as a repository on GitLab: https://gitlab.com/sbeniamine/vlexique
- It is presented as a user-friendly website at: https://sbeniamine.gitlab.io/vlexique

**When will the dataset be distributed?**

It is already distributed.

**Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?** If so, please describe this license and/ or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

**Have any third parties imposed IP-based or other restrictions on the data associated with the instances?** If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.

No.

**Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?** If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

No.

**Any other comments?**

No.

# Maintenance

**Who will be supporting/hosting/maintaining the dataset?**

Sacha Beniamine

**How can the owner/curator/manager of the dataset be contacted (for example, email address)?**

Please raise an issue on the GitLab repository, or send an email to s.my-last-name@surrey.ac.uk.

**Is there an erratum?** If so, please provide a link or other access point.

No.

**Will the dataset be updated (for example, to correct labeling errors, add new instances, delete instances)?** If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (for example, mailing list, GitHub)?

Yes, whenever relevant. Updates will be pushed to GitLab and will lead to new versions, which will in turn be deposited on Zenodo.

**If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (for example, were the individuals in question told that their data would be retained for a fixed period of time and then deleted)?** If so, please describe these limits and explain how they will be enforced.

No.

**Will older versions of the dataset continue to be supported/hosted/maintained?** If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.

Yes, older versions remain available through Zenodo and GitLab.

**If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?** If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.

We welcome merge requests on GitLab.

# Any other comments?

No.