Dataset Open Access

word2vec model trained on lemmatized French Wikipedia 2018

Gaudrain, Etienne; Crouzet, Olivier

Researcher(s)
Huet, Moïra-Phoebé; Başkent, Deniz

The files presented are trained word2vec models for French.

Corpus

The base corpus used for training is a dump of the French Wikipedia taken on 20 October 2018. The corpus was then processed to remove, as far as possible, the MediaWiki syntax, links and similar markup. This cleanup is not perfect, but the residual markup should have little effect on the trained weights.

The corpus was then POS-tagged and lemmatized with TreeTagger. The list of tags can be found here. During POS-tagging, each word is replaced with its lemma followed by its tag, written as `[lemma]_[tag]`. For instance, the sentence "Il a sauté dans sa voiture" produces this output:

il_PRO:PER avoir_VER:pres sauter_VER:pper dans_PRP son_DET:POS voiture_NOM
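Since every vocabulary entry follows the `[lemma]_[tag]` pattern, it can be useful to split tokens back into their two parts. The following is a minimal sketch; the helper name `split_token` is hypothetical, and splitting on the last underscore is an assumption made so that a lemma containing an underscore would still be handled.

```python
def split_token(token):
    """Split a '[lemma]_[tag]' token into (lemma, tag).

    Splits on the LAST underscore (an assumption), so that the tag is
    always the final part. Returns (token, None) if there is no underscore.
    """
    lemma, sep, tag = token.rpartition("_")
    if not sep:
        return token, None
    return lemma, tag


# The example output line from the description above.
line = "il_PRO:PER avoir_VER:pres sauter_VER:pper dans_PRP son_DET:POS voiture_NOM"
pairs = [split_token(tok) for tok in line.split()]

for lemma, tag in pairs:
    print(lemma, tag)
```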

Word2vec training

The training was performed using the gensim Python module (v3.5.0). The skip-gram method was used, with a vector size of 500, a window size of 5 and a minimum count of 5. This is how the model creation was invoked:

model = Word2Vec(size=500, window=5, min_count=5, workers=workers, sg=1)
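For context, a full training run around that call might look like the sketch below. Only the constructor parameters come from the description; the corpus layout (one tokenized sentence per line, read with `LineSentence`), the file paths, the default number of epochs and the function name `train_model` are all assumptions. The `size` keyword matches gensim 3.x; gensim 4 and later renamed it to `vector_size`.

```python
# Constructor parameters as stated in the description (gensim 3.x naming).
PARAMS = dict(size=500, window=5, min_count=5, sg=1)


def train_model(corpus_path, out_path, workers=4):
    """Hypothetical training run; requires gensim 3.x to be installed.

    corpus_path is assumed to be a text file with one sentence per line,
    tokens separated by spaces (this layout is an assumption).
    """
    # Imported here so the module can be read without gensim installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence(corpus_path)
    model = Word2Vec(workers=workers, **PARAMS)
    model.build_vocab(sentences)
    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
    # Save the vectors in the word2vec binary format.
    model.wv.save_word2vec_format(out_path, binary=True)
    return model
```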

Two versions of the model were trained:

  • frwiki-20181020.treetag.2__2019-01-24_10.41__.s500_w5_skip.word2vec.bin
    No extra processing was performed on the corpus before training.
  • frwiki-20181020.treetag.2.ngram-pass2__2019-04-08_09.02__.s500_w5_skip.word2vec.bin
    Two passes of 2-gram detection were run on the corpus before training. This allows 2-, 3- and 4-grams to be detected and included in the vocabulary.
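To see why two passes of 2-gram detection can yield up to 4-grams, consider the toy sketch below. It is not the tool used to build the corpus: the pairs are hand-picked rather than detected statistically, and the helper `merge_pairs` and the `+` delimiter are hypothetical choices for illustration only.

```python
def merge_pairs(tokens, pairs, delim="+"):
    """Join adjacent tokens whose pair is in `pairs` (one 2-gram pass)."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            out.append(tokens[i] + delim + tokens[i + 1])
            i += 2  # both tokens were consumed by the merge
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = ["train_NOM", "à_PRP", "grand_ADJ", "vitesse_NOM"]

# Pass 1: pairs of single tokens become 2-grams.
pass1 = merge_pairs(tokens, {("train_NOM", "à_PRP"), ("grand_ADJ", "vitesse_NOM")})

# Pass 2: a pair of 2-grams becomes a 4-gram
# (a 2-gram next to a single token would give a 3-gram).
pass2 = merge_pairs(pass1, {("train_NOM+à_PRP", "grand_ADJ+vitesse_NOM")})

print(pass1)
print(pass2)
```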

The files are in the word2vec binary format, so they can be used either with the original C implementation of word2vec or with the Python gensim library (and possibly with other libraries that support that format).
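As a minimal sketch of loading one of these files with gensim, `KeyedVectors.load_word2vec_format` with `binary=True` reads the word2vec binary format. The helper names `vocab_key` and `load_vectors` are hypothetical, and the sketch assumes that vocabulary keys follow the `[lemma]_[tag]` pattern described in the Corpus section, so that "voiture" as a noun would be stored as `voiture_NOM`.

```python
MODEL_FILE = "frwiki-20181020.treetag.2__2019-01-24_10.41__.s500_w5_skip.word2vec.bin"


def vocab_key(lemma, tag):
    """Build a lookup key in the '[lemma]_[tag]' form (assumed key format)."""
    return "{}_{}".format(lemma, tag)


def load_vectors(path):
    """Load word vectors from a word2vec binary file; requires gensim."""
    # Imported here so the helpers above work without gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True)


key = vocab_key("voiture", "NOM")

# Usage, once the model file has been downloaded locally:
# vectors = load_vectors(MODEL_FILE)
# print(vectors[key].shape)                 # vector for 'voiture_NOM'
# print(vectors.most_similar(key, topn=5))  # nearest neighbours
```

Loading the full file keeps all vectors in memory; `load_word2vec_format` also accepts a `limit` argument to read only the first entries of the file.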

Funding: CNRS International Mobility Support program, 2017; European Community PRESTIGE Mobility program N°2017-2-0044; NWO / ZonMW VICI: 918-17-603; LABEX CeLyA (ANR-10-LABX-0060) of Université de Lyon, within the program « Investissements d'Avenir » (ANR-16-IDEX-0005) operated by the French National Research Agency (ANR).
  • Mikolov T., Chen K., Corrado G., Dean J. (2013) Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781v3
