Published June 14, 2024 | Version 1.0
Dataset Open

Text Datasets for DSI

  • 1. National Institute of Information and Communications Technology
  • 2. Center for Information and Neural Networks

Description

Text data used in the article by Tatsuya Haga, Yohei Oseki, and Tomoki Fukai, "A unified neural representation model for spatial and semantic computations" (bioRxiv preprint, doi: https://doi.org/10.1101/2023.05.11.540307). Code and instructions for using the data are available at https://github.com/TatsuyaHaga/DSI_codes

Main dataset (enwiki_processed_pickle): This file contains preprocessed text from 100,000 articles randomly sampled from the English Wikipedia dump taken on 22-May-2020 (https://dumps.wikimedia.org/enwiki/latest/).

Additional dataset (wikitext103train_processed_pickle): This file contains preprocessed text based on the WikiText-103 dataset (Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer Sentinel Mixture Models. http://arxiv.org/abs/1609.07843).

Both datasets have already been preprocessed: all characters were lowercased, punctuation characters were removed, and the text was tokenized into words. Both files are in Python pickle format.
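Because the files are Python pickles, they can be opened with the standard `pickle` module. The following is a minimal sketch, not the authors' loading code: the local path is an assumption, and the internal structure of the unpickled object is not described on this page, so the hypothetical `describe` helper only reports its type, length, and first few items. See the linked repository for the intended usage.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickle file and return the unpickled object."""
    # Only unpickle files from a source you trust: unpickling can run code.
    with open(path, "rb") as f:
        return pickle.load(f)


def describe(obj, max_items=3):
    """Summarize an object whose structure is not known in advance."""
    info = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        info["length"] = len(obj)
    if isinstance(obj, (list, tuple)):
        info["head"] = list(obj[:max_items])
    elif isinstance(obj, dict):
        info["head"] = list(obj)[:max_items]
    return info


# Assumed local path: adjust to wherever the file was downloaded.
path = Path("enwiki_processed_pickle")
if path.exists():
    data = load_pickle(path)
    print(describe(data))
```

The larger file is about 1.9 GB on disk, so loading it may need considerably more memory than that.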

We publish data under CC-BY-SA following the license of original datasets.

Files

Files (2.5 GB)

Checksum Size
md5:1358cc090c2c95b7062cc6fa8208d451 1.9 GB
md5:fd17f0e366b2733299709db49ceaba95 678.9 MB
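The MD5 checksums listed above can be used to confirm that a download is complete. The sketch below uses only the standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are hypothetical. This listing does not say which checksum belongs to which file name, so pair them by file size.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with a value written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(local_path, "md5:1358cc090c2c95b7062cc6fa8208d451")` should return `True` for an intact copy of the 1.9 GB file, where `local_path` is wherever that file was saved.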

Additional details

Related works

Is required by
Preprint: 10.1101/2023.05.11.540307 (DOI)

Dates

Submitted
2024-06-14