Published August 20, 2018 | Version v1
Dataset Open

Embeddings Augmented by Random Permutations (EARP)

  • 1. Biomedical and Health Informatics. University of Washington, Seattle.
  • 2. Grab, Inc.

Description

This data set includes the word embeddings used for our CoNLL 2018 paper, "Bringing Order to Neural Word Embeddings with Embeddings Augmented By Random Permutations".

As described in the paper, encoding word order with random permutations leads to substantive improvements in performance across multiple analogical retrieval tasks, most notably for "syntactic" analogies that involve mapping between morphological derivatives. Smaller improvements are also evident in downstream sequence labeling tasks. 
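To make "analogical retrieval" concrete, the following is a minimal sketch of the widely used vector-offset approach: for "a is to b as c is to ?", it looks for the word whose vector is closest (by cosine) to b - a + c. The toy vectors and the function names are invented for illustration and are not taken from this data set. That the paper scores analogies in exactly this way is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three cue words are excluded from the candidates.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy 3-d vectors, invented for illustration only (NOT from the data set).
# Dimensions loosely stand for: "walk", "jump", "past tense".
toy = {
    "walk":   [1.0, 0.0, 0.0],
    "walked": [1.0, 0.0, 1.0],
    "jump":   [0.0, 1.0, 0.0],
    "jumped": [0.0, 1.0, 1.0],
    "table":  [0.5, 0.5, -1.0],
}

print(solve_analogy(toy, "walk", "walked", "jump"))  # jumped
```

The walk/walked example is a "syntactic" analogy in the sense used above: the two pairs differ only by a morphological derivation.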

Both baseline Skipgram-with-negative-sampling and permutation-based vectors are provided in two formats: the Semantic Vectors (https://github.com/semanticvectors/semanticvectors) binary format (.bin) and the word2vec binary format (.w2v.bin).
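For readers who want to inspect a .w2v.bin file without extra dependencies, here is a minimal sketch of a reader for the word2vec binary layout: a text header line with the vocabulary size and dimensionality, then for each term its bytes, a single space, and the vector as 32-bit floats. The function name and the `limit` parameter are hypothetical, and the sketch assumes little-endian floats, as typically written on x86 machines. Whether the files in this record follow that byte order has not been checked here.

```python
import struct

def read_word2vec_binary(path, limit=None):
    """Read a word2vec binary file into a dict mapping word -> list of floats.

    Assumes little-endian float32 values (an assumption, see above).
    `limit` optionally stops after the first `limit` entries.
    """
    vectors = {}
    with open(path, "rb") as f:
        # Header: "<vocab_size> <dimension>\n" in plain text.
        vocab_size, dim = (int(x) for x in f.readline().split())
        count = vocab_size if limit is None else min(limit, vocab_size)
        for _ in range(count):
            # The word is every byte up to the next space.
            word_bytes = bytearray()
            while True:
                ch = f.read(1)
                if ch in (b" ", b""):
                    break
                if ch != b"\n":  # some writers put a newline after each vector
                    word_bytes.extend(ch)
            raw = f.read(4 * dim)
            if len(raw) < 4 * dim:
                raise ValueError("file ended before all vectors were read")
            word = word_bytes.decode("utf-8", errors="replace")
            vectors[word] = list(struct.unpack(f"<{dim}f", raw))
    return vectors
```

Each file here is several hundred megabytes, so a pure-Python loop like this will be slow on a full file; passing a small `limit` is enough to peek at the first entries. For real use, a library loader such as gensim's `KeyedVectors.load_word2vec_format(path, binary=True)` is likely the more practical choice.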

All models were created with two different sliding window configurations, radius 2 (r2) and radius 5 (r5), both with and without subword embeddings (sw). The models are as follows:

basic, fastText: standard Skipgram-with-negative-sampling, implemented in Semantic Vectors and fastText respectively

earp_directional: these models use random permutations (EARP) to distinguish between positions before and after the focus term in a sliding window

earp_positional: these models use random permutations (EARP) to distinguish between each position in a sliding window

earp_proximity: these models use random permutations (EARP) to generate encodings that are different, but not orthogonal, for neighboring positions in a sliding window. 

earpx variants: these models use the exact sliding window position and do not replace subsampled terms, whereas the other models do replace them.
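The building block shared by the EARP models above is a random permutation of a vector's coordinates. The following is a minimal sketch of that idea only: it shows how one permutation per direction (directional) or per window offset (positional) could be generated, that a permuted high-dimensional vector is nearly orthogonal to the original, and that the inverse permutation recovers it. All function names are hypothetical, and how the released models apply permutations during training is described in the paper, not reproduced here.

```python
import math
import random

def random_permutation(dim, rng):
    """Return a random reordering of the coordinate indices 0..dim-1."""
    perm = list(range(dim))
    rng.shuffle(perm)
    return perm

def permute(vector, perm):
    """Move coordinate i of `vector` to position perm[i]."""
    out = [0.0] * len(vector)
    for i, target in enumerate(perm):
        out[target] = vector[i]
    return out

def invert(perm):
    """Return the permutation that undoes `perm`."""
    inverse = [0] * len(perm)
    for i, target in enumerate(perm):
        inverse[target] = i
    return inverse

def directional_permutations(dim, rng):
    """One permutation for positions before the focus term, one for after."""
    return {"before": random_permutation(dim, rng),
            "after": random_permutation(dim, rng)}

def positional_permutations(dim, radius, rng):
    """A distinct permutation for each window offset in -radius..radius (not 0)."""
    return {offset: random_permutation(dim, rng)
            for offset in range(-radius, radius + 1) if offset != 0}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
dim = 500               # illustrative dimensionality, chosen arbitrarily
v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
perms = positional_permutations(dim, radius=2, rng=rng)
shifted = permute(v, perms[-1])

print(cosine(v, shifted))                        # close to 0 in high dimensions
print(permute(shifted, invert(perms[-1])) == v)  # True
```

The proximity variant is not covered by this sketch, since it needs permutations for neighboring positions that are related rather than independent.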

The word2vec binary editions include random vectors for some characters and terms that were eliminated by Semantic Vectors' Lucene-based tokenization procedure - in particular punctuation marks - as these are useful for certain downstream machine learning tasks.

Notes

This work was supported in part by US National Library of Medicine Grant R01-LM011563.

Files (24.5 GB)

Checksum                              Size
md5:238fb38bb3455aa804ac8b2b7b5fbb83  424.6 MB
md5:25131ba83622d0b3ca5e6cb0f62e0665  450.9 MB
md5:f56eed0447d0d367528eb91e37eb9d12  424.6 MB
md5:104083741f11e385b7898a6764cbf754  450.9 MB
md5:813673b0e5ea3111b4ad94fa26b410c3  424.6 MB
md5:ab5d53abcaf7eb639e0f2082cd806f74  450.9 MB
md5:9a8cc559f4429b854b2af9c153b1c21d  424.6 MB
md5:3d1ecb0130fc02dbd58d1511b0e7a6ab  450.9 MB
md5:75e742997b8f7a4fc838924d58e83c05  424.6 MB
md5:73d41894511bf541eb2ea25b1cc857b3  450.9 MB
md5:75f0564eed00f3be6c9097e42ee8b029  424.6 MB
md5:f87d62aae11b4cacc45c8c1c551a23ff  450.9 MB
md5:26f5fe6397c0dae7f86fe4988a74da2f  424.6 MB
md5:c86541e4fbf0f8e9640c153d4099c25c  451.0 MB
md5:3ef01f152411e359ea02fb33e96b74b2  424.6 MB
md5:ae441ebf936ad29ec4f2f650e82096c1  451.0 MB
md5:b7372aa1df7038303389f1e2423fe1b1  424.6 MB
md5:ae83c8009ad4df83b01ac4e2ff8f26bb  451.0 MB
md5:35fa5ca4a2de18e8e0e6c7d1febec759  424.6 MB
md5:c3f6b85ff17656066ef85d3d6ed3be5f  450.9 MB
md5:f3ea9755c9818729035e1f54d21931c0  424.6 MB
md5:d909203e27481dcb7f80f406d24b79df  450.9 MB
md5:d68d60208fedab70d6837f804bec3e58  424.6 MB
md5:7e91a28b89793c8758123d2bb7d2a8a3  450.9 MB
md5:b36f51b5911436794a9bb2796ac7fdea  424.6 MB
md5:8ebf710ad7f6841290106e2b5975eb0e  450.9 MB
md5:94b736fed3ab1a7f5692eb29c839ac9d  424.6 MB
md5:2f5ce51b0c231f464bbe5cd97ca93e7d  450.9 MB
md5:fdcfa4191f49342c206ebd12c4408ae0  424.6 MB
md5:0fa21a88713d547dda58f22fc67bc548  450.9 MB
md5:63362c44b1d865ddfa183440f9b7cdfa  424.6 MB
md5:eaa019da490fc957fe2f03f273101952  450.9 MB
md5:c4cbd483ee7a0cbcaf6ec00e9df81f55  424.6 MB
md5:79cfba706ab577c2bc392677de8c7cfd  450.9 MB
md5:11ad93af718607de88e712466b7ceb37  424.6 MB
md5:dfad2691ad7bdad36e71787b817e6664  450.9 MB
md5:93ad4162c83fb2baf300e152e683e63d  424.6 MB
md5:1d747199d4f285e1673aebccf3a980a3  450.9 MB
md5:a70a02ec596a8cbb034eb2311dc31fbc  424.6 MB
md5:008a4f9431dc557c9abccbc03f473701  450.9 MB
md5:e9f560474d1082c79f6216e0e0c6ba27  424.6 MB
md5:5044ac96465d6edd0bf5fee318fa34f0  450.9 MB
md5:da3a11229ab8faa4d7a607235281018e  428.8 MB
md5:d395bad6a74f0e165e3266c3907d4a64  455.1 MB
md5:f66032e58133f308041497330d904783  424.6 MB
md5:96348497f153dfd68ad4a189bd789c64  450.9 MB
md5:b33ca799f665a4d9388c7efabf0f15b4  428.8 MB
md5:3cba1568217374b0600b4eb9668bcf9d  455.1 MB
md5:a29bc00e96560ce3c8dc775699edbfc1  424.6 MB
md5:76e90e9151758c3236a6ff8ceb9c6d23  451.0 MB
md5:9649caf3f181dc1923b191a105da8984  428.8 MB
md5:a84380feba74abbd9fea778a22ec4148  455.1 MB
md5:a6012fa7a78c53b0ab3218014c56c7cb  424.6 MB
md5:7df95c8f0fad6c75265a024228cfb4f3  450.9 MB
md5:b2aa3190b6280fa843fe66acee437734  428.8 MB
md5:576a2e134db7cd2d8cc60324cf3d7c5c  455.1 MB
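
The listing above gives an MD5 checksum for each file, which can be used to confirm that a large download arrived intact. The following is a minimal sketch using Python's standard `hashlib`; the function names and the example file name in the usage note are hypothetical.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listing(path, listed):
    """Compare a file against a listed checksum such as 'md5:238f...'.

    The 'md5:' prefix used in the file listing is optional.
    """
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listing("downloaded_file.bin", "md5:238fb38bb3455aa804ac8b2b7b5fbb83")` returns True only if the local file (a placeholder name here) has that checksum. Reading in chunks keeps memory use low for files of several hundred megabytes.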