Nielsen, Finn Årup
2017-07-05
<p>Wikidata embedding<br>
==================</p>
<p>Gensim model:<br>
wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20</p>
<p>Download of Wikidata from::</p>
<p> https://dumps.wikimedia.org/wikidatawiki/entities/</p>
<p>Trigram construction::</p>
<p> from bz2 import BZ2File<br>
import re</p>
<p> dump_filename = 'wikidata-20170613-truthy-BETA.nt.bz2'<br>
trigram_filename = 'wikidata-20170613-truthy-BETA.trigrams'</p>
<p> pattern = re.compile(<br>
    (r'^<http://www.wikidata.org/entity/(Q\d+)> '<br>
     r'<http://www.wikidata.org/prop/direct/(P\d+)> '<br>
     r'<http://www.wikidata.org/entity/(Q\d+)>'),<br>
    flags=re.UNICODE)</p>
<p> with open(trigram_filename, 'w') as f:<br>
    for line in BZ2File(dump_filename):<br>
        line = line.decode('utf-8')<br>
        match = pattern.search(line)<br>
        if match:<br>
            f.write(" ".join(match.groups()) + '\n')</p>
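<p>As a minimal sketch of what the pattern keeps and discards, the snippet below applies it to two sample lines. The sample lines and the helper name <code>line_to_trigram</code> are assumptions made for illustration and are not taken from the dump. Only triples that link an item to another item through a direct property become trigrams; triples with literal values are skipped::</p>

```python
import re

# Same pattern as in the trigram construction step.
pattern = re.compile(
    (r'^<http://www.wikidata.org/entity/(Q\d+)> '
     r'<http://www.wikidata.org/prop/direct/(P\d+)> '
     r'<http://www.wikidata.org/entity/(Q\d+)>'),
    flags=re.UNICODE)


def line_to_trigram(line):
    """Return 'Q... P... Q...' for an item-property-item triple, else None.

    Hypothetical helper, introduced only for this illustration.
    """
    match = pattern.search(line)
    return " ".join(match.groups()) if match else None


# Sample N-Triples lines written by hand for illustration.
item_triple = ('<http://www.wikidata.org/entity/Q42> '
               '<http://www.wikidata.org/prop/direct/P31> '
               '<http://www.wikidata.org/entity/Q5> .')
literal_triple = ('<http://www.wikidata.org/entity/Q42> '
                  '<http://www.wikidata.org/prop/direct/P569> '
                  '"1952-03-11T00:00:00Z"'
                  '^^<http://www.w3.org/2001/XMLSchema#dateTime> .')

print(line_to_trigram(item_triple))     # Q42 P31 Q5
print(line_to_trigram(literal_triple))  # None
```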
<p>Construction of Gensim model::</p>
<p> import logging<br>
from gensim.models import Word2Vec<br>
from gensim.models.word2vec import LineSentence</p>
<p> logging.basicConfig(<br>
    format='%(asctime)s : %(levelname)s : %(message)s',<br>
    level=logging.INFO)</p>
<p> sentences = LineSentence('wikidata-20170613-truthy-BETA.trigrams')</p>
<p> filename = 'wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20'<br>
# Gensim 4.0 and later call the size parameter vector_size<br>
w2v = Word2Vec(sentences, size=100, window=1, min_count=20, workers=10)<br>
w2v.save(filename)</p>
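<p>To illustrate what <code>window=1</code> means for a three-token line, the sketch below lists the CBOW training examples (context tokens and the target they predict) for one trigram. It is a plain-Python illustration, not Gensim's implementation. The function name <code>cbow_examples</code> is an assumption, and the sketch ignores the <code>min_count</code> filtering, which would drop rare tokens before windowing. With this window the property is the only context for each item, and the two items are never in each other's context::</p>

```python
def cbow_examples(tokens, window=1):
    """Yield (context_tokens, target) pairs for one sentence.

    Hypothetical helper for illustration: the context is every token at
    most `window` positions away from the target.
    """
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


# One line of the trigram file: subject item, property, object item.
trigram = "Q42 P31 Q5".split()
examples = list(cbow_examples(trigram, window=1))

for context, target in examples:
    print(context, "->", target)
# ['P31'] -> Q42
# ['Q42', 'Q5'] -> P31
# ['P31'] -> Q5
```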
DABAI
https://doi.org/10.5281/zenodo.823195
oai:zenodo.org:823195
Zenodo
https://doi.org/10.5281/zenodo.823194
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
embedding
Wikidata
Wembedder wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20
info:eu-repo/semantics/other