Dataset Open Access
The files presented are trained word2vec models for French.
The base corpus used for training is a dump of the French Wikipedia performed on 20 October 2018. The corpus was then processed to remove, as far as possible, the MediaWiki syntax, links, and other markup. This cleaning is not perfect, but should have little effect on the computed weights.
The corpus was then POS-tagged and lemmatized with TreeTagger. The list of tags can be found here. During POS-tagging, each word is replaced by its lemma, written in the form `[lemma]_[tag]`. For instance, the sentence "Il a sauté dans sa voiture" ("He jumped into his car") produces this output:
il_PRO:PER avoir_VER:pres sauter_VER:pper dans_PRP son_DET:POS voiture_NOM
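The token format described above can be sketched in a few lines of Python; the `tagged` pairs below are a hypothetical stand-in for TreeTagger's (lemma, tag) output, not a call to TreeTagger itself:

```python
# Hypothetical (lemma, tag) pairs, as TreeTagger would produce for
# "Il a sauté dans sa voiture" after lemmatization.
tagged = [("il", "PRO:PER"), ("avoir", "VER:pres"), ("sauter", "VER:pper"),
          ("dans", "PRP"), ("son", "DET:POS"), ("voiture", "NOM")]

# Each word becomes a single [lemma]_[tag] token in the training corpus.
tokens = ["{}_{}".format(lemma, tag) for lemma, tag in tagged]
print(" ".join(tokens))
# il_PRO:PER avoir_VER:pres sauter_VER:pper dans_PRP son_DET:POS voiture_NOM
```

This means the vocabulary of the trained models consists of such compound tokens, so queries must use the same `[lemma]_[tag]` form (e.g. `voiture_NOM`, not `voiture`).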
The training was performed using the gensim Python module (v3.5.0). The skip-gram method was used, with a vector size of 500, a window size of 5 and a minimum count of 5. This is how the model creation was invoked:
model = Word2Vec(size=500, window=5, min_count=5, workers=workers, sg=1)
Two versions of the model were trained:
The files are in the word2vec binary format, so they can be used either with the original C implementation of word2vec, or with the Python gensim module (and possibly other libraries that support that format).
Mikolov T., Chen K., Corrado G., Dean J. (2013) Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781v3