Text Datasets for DSI
Creators
Description
Text data used in the article Tatsuya Haga, Yohei Oseki, Tomoki Fukai, "A unified neural representation model for spatial and semantic computations" (preprint on bioRxiv, doi: https://doi.org/10.1101/2023.05.11.540307). Code and instructions for using the data are available at https://github.com/TatsuyaHaga/DSI_codes
Main dataset (enwiki_processed_pickle): This file contains preprocessed text data from 100,000 articles randomly sampled from an English Wikipedia dump taken on 22 May 2020 (https://dumps.wikimedia.org/enwiki/latest/).
Additional dataset (wikitext103train_processed_pickle): This file contains preprocessed text data based on the WikiText-103 dataset (Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer Sentinel Mixture Models. http://arxiv.org/abs/1609.07843).
Both datasets have already been preprocessed: all characters were lowercased, punctuation was removed, and the text was tokenized into words. The data are stored in Python pickle format.
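As a minimal sketch, the preprocessing described above (lowercasing, punctuation removal, word tokenization) and the pickle file format might look as follows. The simple whitespace tokenizer, the file name, and the internal structure of the published pickle files are assumptions here; consult the linked repository for the authoritative usage.

```python
import pickle
import string

def preprocess(text):
    # Lowercase, strip punctuation, and tokenize on whitespace
    # (a simple stand-in for the preprocessing described above).
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

# Hypothetical round trip: the published files are plain Python pickles,
# so pickle.load is enough to read them (the object structure inside,
# e.g. a list of tokenized articles, is an assumption).
corpus = [preprocess("A unified neural representation model."),
          preprocess("Spatial and semantic computations!")]
with open("sample_processed.pkl", "wb") as f:
    pickle.dump(corpus, f)
with open("sample_processed.pkl", "rb") as f:
    loaded = pickle.load(f)
assert loaded == corpus
```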
We publish the data under CC-BY-SA, following the licenses of the original datasets.
Files
(2.5 GB total)

| Name | Size | md5 |
|---|---|---|
| | 1.9 GB | 1358cc090c2c95b7062cc6fa8208d451 |
| | 678.9 MB | fd17f0e366b2733299709db49ceaba95 |
Additional details

Related works
- Is required by: Preprint 10.1101/2023.05.11.540307 (DOI)

Dates
- Submitted: 2024-06-14