Published July 14, 2017 | Version v1
Dataset Open

Wembedder wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20-iter=25

  • 1. Technical University of Denmark

Description

Wikidata embedding
==================

Gensim model:
wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20-iter=25

The Wikidata dump was downloaded from::

    https://dumps.wikimedia.org/wikidatawiki/entities/

Trigram construction (one item, property, item line per statement)::

    from bz2 import BZ2File
    import re

    dump_filename = 'wikidata-20170613-truthy-BETA.nt.bz2'
    trigram_filename = 'wikidata-20170613-truthy-BETA.trigrams'

    # Match statements that link two items through a direct ("truthy") property.
    pattern = re.compile(
        r'^<http://www\.wikidata\.org/entity/(Q\d+)> '
        r'<http://www\.wikidata\.org/prop/direct/(P\d+)> '
        r'<http://www\.wikidata\.org/entity/(Q\d+)>',
        flags=re.UNICODE)

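    # Write one "item property item" line per matching statement;
    # lines whose object is a literal or another kind of node are skipped.
    with BZ2File(dump_filename) as dump, open(trigram_filename, 'w') as f:
        for line in dump:
            line = line.decode('utf-8')
            match = pattern.search(line)
            if match:
                f.write(" ".join(match.groups()) + '\n')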
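As an illustration of what the pattern keeps, the sketch below applies a pattern mirroring the one in the trigram construction script to two N-Triples lines. The sample lines are written for this example (they follow the layout of the truthy dump but are not copied from it), and ``line_to_trigram`` is a hypothetical helper name.

```python
import re

# Pattern mirroring the one used for trigram construction.
pattern = re.compile(
    r'^<http://www\.wikidata\.org/entity/(Q\d+)> '
    r'<http://www\.wikidata\.org/prop/direct/(P\d+)> '
    r'<http://www\.wikidata\.org/entity/(Q\d+)>',
    flags=re.UNICODE)


def line_to_trigram(line):
    """Return 'Q... P... Q...' for an item-to-item statement, else None."""
    match = pattern.search(line)
    return " ".join(match.groups()) if match else None


# Illustrative input lines (assumed for this example, not taken from the dump).
item_line = ('<http://www.wikidata.org/entity/Q42> '
             '<http://www.wikidata.org/prop/direct/P31> '
             '<http://www.wikidata.org/entity/Q5> .')
literal_line = ('<http://www.wikidata.org/entity/Q42> '
                '<http://www.wikidata.org/prop/direct/P569> '
                '"1952-03-11T00:00:00Z"'
                '^^<http://www.w3.org/2001/XMLSchema#dateTime> .')

print(line_to_trigram(item_line))     # Q42 P31 Q5
print(line_to_trigram(literal_line))  # None
```

Statements with literal values, such as dates or strings, therefore never reach the trigram file; only links between items contribute to the embedding.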


Construction of the Gensim model::

    import logging
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    logging.basicConfig(
        format='%(asctime)s : %(levelname)s : %(message)s',
        level=logging.INFO)

    sentences = LineSentence('wikidata-20170613-truthy-BETA.trigrams')

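    filename = 'wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20-iter=25'
    # Argument names follow the Gensim API before version 4.0, which renamed
    # `size` to `vector_size` and `iter` to `epochs`. The default sg=0 selects CBOW.
    w2v = Word2Vec(sentences, size=100, window=1, min_count=20, workers=10, iter=25)
    w2v.save(filename)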
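With ``window=1`` and three tokens per line, each token is predicted only from its direct neighbours: the property from both items, and each item from the property alone. The sketch below enumerates these target and context pairs in plain Python. It is an illustration of the windowing under that reading, not Gensim's implementation: it ignores Gensim's downsampling of frequent tokens and the removal of tokens below ``min_count``, both of which can change which tokens end up adjacent. ``cbow_contexts`` is a hypothetical helper name.

```python
def cbow_contexts(tokens, window=1):
    """List (target, context_tokens) pairs for one line of tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        # Tokens up to `window` positions before and after the target.
        context = tokens[start:i] + tokens[i + 1:i + 1 + window]
        pairs.append((target, context))
    return pairs


for target, context in cbow_contexts("Q42 P31 Q5".split()):
    print(target, "<-", context)
# Q42 <- ['P31']
# P31 <- ['Q42', 'Q5']
# Q5 <- ['P31']
```

Under this reading, two items never appear in each other's context directly; they become related through the properties they share.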


Notes

DABAI

Files

wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20-iter=25.zip
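
Once the archive is unpacked, the model can presumably be loaded with Gensim and queried by Wikidata identifier. The following is a minimal sketch under stated assumptions: the helper names ``load_vectors`` and ``similar_entities`` are hypothetical, and whether a model saved in 2017 loads under a current Gensim release has not been verified here.

```python
def load_vectors(model_path):
    """Load a saved Word2Vec model and return its keyed vectors."""
    from gensim.models import Word2Vec  # imported lazily; requires Gensim
    return Word2Vec.load(model_path).wv


def similar_entities(keyed_vectors, identifier, topn=5):
    """Return (identifier, cosine similarity) pairs, or [] if out of vocabulary."""
    # Tokens seen fewer than min_count times during training have no vector.
    if identifier not in keyed_vectors:
        return []
    return keyed_vectors.most_similar(identifier, topn=topn)
```

For example, ``similar_entities(load_vectors(path), 'Q5')`` would list the tokens closest to Q5, where ``path`` is assumed to point at the unpacked model file. Because properties were embedded alongside items, the results may include P identifiers as well as Q identifiers.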