Text Datasets for DSI
Creators
Description
Text data used in the article Tatsuya Haga, Yohei Oseki, Tomoki Fukai, "A unified neural representation model for spatial and semantic computations" (preprint on bioRxiv, doi: https://doi.org/10.1101/2023.05.11.540307). Code and instructions for using the data are available at https://github.com/TatsuyaHaga/DSI_codes
Main dataset (enwiki_processed_pickle): This file contains preprocessed text data from 100,000 articles randomly sampled from an English Wikipedia dump taken on 22 May 2020 (https://dumps.wikimedia.org/enwiki/latest/).
Additional dataset (wikitext103train_processed_pickle): This file contains preprocessed text data based on the WikiText-103 dataset (Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer Sentinel Mixture Models. http://arxiv.org/abs/1609.07843).
Both datasets have already been preprocessed: all characters were lowercased, punctuation was removed, and the text was tokenized into words. The data are stored in Python pickle format.
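As a minimal sketch, the preprocessing described above (lowercasing, punctuation removal, word tokenization) and the pickle file format might look as follows. The simple whitespace tokenizer, the file name, and the internal structure of the published pickle files are assumptions here; consult the linked repository for the authoritative usage.

```python
import pickle
import string

def preprocess(text):
    # Lowercase, strip punctuation, and tokenize on whitespace
    # (a simple stand-in for the preprocessing described above).
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

# Hypothetical round trip: the published files are plain Python pickles,
# so pickle.load is enough to read them (the object structure inside,
# e.g. a list of tokenized articles, is an assumption).
corpus = [preprocess("A unified neural representation model."),
          preprocess("Spatial and semantic computations!")]
with open("sample_processed.pkl", "wb") as f:
    pickle.dump(corpus, f)
with open("sample_processed.pkl", "rb") as f:
    loaded = pickle.load(f)
assert loaded == corpus
```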
We publish the data under CC-BY-SA, following the licenses of the original datasets.
Files
(2.5 GB total)

| Name | Size | md5 |
|---|---|---|
| | 1.9 GB | 1358cc090c2c95b7062cc6fa8208d451 |
| | 678.9 MB | fd17f0e366b2733299709db49ceaba95 |
Additional details

Related works
- Is required by: Preprint 10.1101/2023.05.11.540307 (DOI)

Dates
- Submitted: 2024-06-14