Published June 7, 2019 | Version v1.0
Dataset Open

word2vec model trained on lemmatized French Wikipedia 2018

Creators

  • 1. Lyon Neuroscience Research Center, Auditory Cognition and Psychoacoustics, CNRS UMR 5292, INSERM UMRS 1028, Université Claude Bernard Lyon 1, Université de Lyon, Lyon, France | University of Groningen, University Medical Center Groningen, Groningen, Netherlands
  • 2. LLING, CNRS UMR 6310, Université de Nantes, Nantes, France | University of Groningen, University Medical Center Groningen, Groningen, Netherlands

Contributors

  • 1. Lyon Neuroscience Research Center, Auditory Cognition and Psychoacoustics, CNRS UMR 5292, INSERM UMRS 1028, Université Claude Bernard Lyon 1, Université de Lyon, Lyon, France | INSA Lyon, Lyon, France
  • 2. University of Groningen, University Medical Center Groningen, Groningen, Netherlands

Description

The files presented are trained word2vec models for French.

Corpus

The base corpus used for training is a dump of the French Wikipedia taken on 20 October 2018. The corpus was then processed to remove, as far as possible, MediaWiki syntax, links, and other markup. This cleanup is not perfect, but the remaining artifacts should have little effect on the trained weights.
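The record does not name the cleanup tool, so the following is only a rough illustration of the kind of markup removal involved; the function name and patterns are hypothetical, not the actual pipeline:

```python
import re

def strip_mediawiki(text):
    """Rough illustration of MediaWiki markup removal (hypothetical, not the actual pipeline)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                     # drop {{template}} calls
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # keep only the label of [[links]]
    text = re.sub(r"'{2,}", "", text)                              # strip ''italic'' / '''bold''' quotes
    return re.sub(r"\s+", " ", text).strip()

print(strip_mediawiki("Il a sauté dans [[Automobile|sa voiture]] '''rouge'''."))
# Il a sauté dans sa voiture rouge.
```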

The corpus was then POS-tagged and lemmatized with TreeTagger (the tagset is listed in the TreeTagger documentation). During tagging, each word is replaced by its lemma and written as `[lemma]_[tag]`. For instance, the sentence "Il a sauté dans sa voiture" ("He jumped into his car") produces this output:

il_PRO:PER avoir_VER:pres sauter_VER:pper dans_PRP son_DET:POS voiture_NOM
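The tagging script itself is not included in the record; a minimal sketch of how this conversion could be reproduced with the treetaggerwrapper Python package (the package choice is an assumption, not the authors' method, and it requires a local TreeTagger installation):

```python
import treetaggerwrapper  # assumes TreeTagger itself is installed locally

# Hypothetical reconstruction of the lemma_tag conversion; not the authors' actual script.
tagger = treetaggerwrapper.TreeTagger(TAGLANG="fr")
tags = treetaggerwrapper.make_tags(tagger.tag_text("Il a sauté dans sa voiture"))
print(" ".join(f"{t.lemma}_{t.pos}" for t in tags))
# il_PRO:PER avoir_VER:pres sauter_VER:pper dans_PRP son_DET:POS voiture_NOM
```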

Word2vec training

The training was performed with the gensim Python module (v3.5.0), using the skip-gram method with a vector size of 500, a window of 5 and a minimum count of 5. The model was created as follows:

model = Word2Vec(size=500, window=5, min_count=5, workers=workers, sg=1)
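Created this way, the model still needs a vocabulary and a training corpus. A minimal end-to-end sketch under assumed details (the corpus file name, the worker count and the save step are illustrative; only the hyperparameters above come from the record):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the lemmatized corpus, one whitespace-tokenized sentence per line.
sentences = LineSentence("frwiki-20181020.treetag.2.txt")  # assumed file name

workers = 4  # number of training threads; the value used by the authors is not given
model = Word2Vec(size=500, window=5, min_count=5, workers=workers, sg=1)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

model.wv.save_word2vec_format("frwiki.s500_w5_skip.word2vec.bin", binary=True)
```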

Two versions of the model were trained:

  • frwiki-20181020.treetag.2__2019-01-24_10.41__.s500_w5_skip.word2vec.bin
    No extra processing was performed on the corpus before training.
  • frwiki-20181020.treetag.2.ngram-pass2__2019-04-08_09.02__.s500_w5_skip.word2vec.bin
    Two passes of 2-gram detection were run before training, allowing 2-, 3- and 4-grams to appear in the vocabulary (see the sketch after this list).
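The n-gram detection code is not shown in the record; a plausible sketch with gensim's Phrases model (the file name is an assumption, and default detection thresholds are used for illustration):

```python
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence

sentences = LineSentence("frwiki-20181020.treetag.2.txt")  # assumed file name

# First pass: join frequent adjacent token pairs into 2-grams.
bigram = Phraser(Phrases(sentences))
# Second pass over the joined stream: pairs of (possibly already joined) tokens,
# so a detected token can span 2, 3 or 4 of the original words.
ngram = Phraser(Phrases(bigram[sentences]))

transformed = ngram[bigram[sentences]]  # corpus stream fed to Word2Vec
```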

The files are in the word2vec binary format, so they can be used with the original C implementation of word2vec, with the Python gensim module, or with any other library that supports that format.
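For example, loading one of the released files with gensim (note that queries must use the `[lemma]_[tag]` token format described above):

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "frwiki-20181020.treetag.2__2019-01-24_10.41__.s500_w5_skip.word2vec.bin",
    binary=True,
)

# Tokens follow the lemma_tag convention, e.g. the noun "voiture".
print(wv.most_similar("voiture_NOM", topn=5))
```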


Notes

Funding: CNRS International Mobility Support program, 2017; European Community PRESTIGE Mobility program N°2017-2-0044; NWO / ZonMW VICI: 918-17-603; LABEX CeLyA (ANR-10-LABX-0060) of Université de Lyon, within the program « Investissements d'Avenir » (ANR-16-IDEX-0005) operated by the French National Research Agency (ANR).

References

  • Mikolov T., Chen K., Corrado G., Dean J. (2013) Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781v3