823195
doi
10.5281/zenodo.823195
oai:zenodo.org:823195
Wembedder wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20
Nielsen, Finn Årup
Technical University of Denmark
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
embedding
Wikidata
<p>Wikidata embedding<br>
==================</p>
<p>Gensim model:<br>
wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20</p>
<p>The Wikidata dump was downloaded from::</p>
<p> https://dumps.wikimedia.org/wikidatawiki/entities/</p>
<p>Trigram construction::</p>
<p> from bz2 import BZ2File<br>
import re</p>
<p> dump_filename = 'wikidata-20170613-truthy-BETA.nt.bz2'<br>
trigram_filename = 'wikidata-20170613-truthy-BETA.trigrams'</p>
<p> pattern = re.compile(<br>
(r'^<http://www.wikidata.org/entity/(Q\d+)> '<br>
r'<http://www.wikidata.org/prop/direct/(P\d+)> '<br>
r'<http://www.wikidata.org/entity/(Q\d+)>'),<br>
flags=re.UNICODE)</p>
<p> with open(trigram_filename, 'w') as f:<br>
    for line in BZ2File(dump_filename):<br>
        line = line.decode('utf-8')<br>
        match = pattern.search(line)<br>
        if match:<br>
            f.write(" ".join(match.groups()) + '\n')</p>
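<p>As a rough illustration of what the pattern above keeps, the sketch below applies it to three sample N-Triples lines. The sample lines are assumptions written for illustration, not lines copied from the dump, and the helper name extract_trigrams is hypothetical. Only statements that link one Q-item to another through a direct property produce a trigram; statements with literal objects, such as dates or labels, are skipped.</p>

```python
import re

# Same pattern as in the trigram construction step.
pattern = re.compile(
    (r'^<http://www.wikidata.org/entity/(Q\d+)> '
     r'<http://www.wikidata.org/prop/direct/(P\d+)> '
     r'<http://www.wikidata.org/entity/(Q\d+)>'),
    flags=re.UNICODE)

# Hypothetical sample lines in N-Triples syntax (illustrative only).
sample_lines = [
    # Item -> item statement: kept.
    '<http://www.wikidata.org/entity/Q42> '
    '<http://www.wikidata.org/prop/direct/P31> '
    '<http://www.wikidata.org/entity/Q5> .',
    # Item -> date literal: skipped, the object is not a Q-item.
    '<http://www.wikidata.org/entity/Q42> '
    '<http://www.wikidata.org/prop/direct/P569> '
    '"1952-03-11T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .',
    # Label triple: skipped, the predicate is not a direct property.
    '<http://www.wikidata.org/entity/Q42> '
    '<http://www.w3.org/2000/01/rdf-schema#label> '
    '"Douglas Adams"@en .',
]


def extract_trigrams(lines):
    """Return 'subject property object' strings for the matching lines."""
    trigrams = []
    for line in lines:
        match = pattern.search(line)
        if match:
            trigrams.append(" ".join(match.groups()))
    return trigrams


trigrams = extract_trigrams(sample_lines)
print(trigrams)
```

<p>Each kept statement becomes one line of three tokens in the trigram file, which is what LineSentence later reads as a single sentence.</p>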
<p><br>
Construction of Gensim model::<br>
<br>
import logging<br>
from gensim.models import Word2Vec<br>
from gensim.models.word2vec import LineSentence</p>
<p> logging.basicConfig(<br>
format='%(asctime)s : %(levelname)s : %(message)s',<br>
level=logging.INFO)</p>
<p> sentences = LineSentence('wikidata-20170613-truthy-BETA.trigrams')</p>
<p> filename = 'wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20'<br>
w2v = Word2Vec(sentences, size=100, window=1, min_count=20, workers=10)<br>
w2v.save(filename)</p>
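<p>To make the effect of window=1 more concrete: each line of the trigram file is a three-token sentence, so with a window of one the property token has both items as context, while each item has only the property. The sketch below enumerates these (target, context) pairs in plain Python. It is an illustration under simplifying assumptions, not Gensim's implementation: the helper name context_pairs is hypothetical, and the sketch ignores that tokens occurring fewer than min_count=20 times are dropped, which may change which tokens end up adjacent.</p>

```python
def context_pairs(tokens, window=1):
    """Return (target, context_tokens) for every position in one sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` tokens before
        right = tokens[i + 1:i + 1 + window]      # up to `window` tokens after
        pairs.append((target, left + right))
    return pairs


# One line of the trigram file, split on whitespace as LineSentence would.
pairs = context_pairs("Q42 P31 Q5".split(), window=1)
for target, context in pairs:
    print(target, "<-", context)
```

<p>The Word2Vec call above does not set sg, so Gensim uses its default CBOW training, in line with the "cbow" part of the model name. Note that Gensim 4.0 and later name the size argument vector_size.</p>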
DABAI
Zenodo
2017-07-05
info:eu-repo/semantics/other
823194
1579893965.076073
465563262
md5:7318edfe1803317f6efa05b89b146b2e
https://zenodo.org/records/823195/files/wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20.zip
public
10.5281/zenodo.823194
isVersionOf
doi