Written and spoken digits database for multimodal learning

Khacef, Lyes; Rodriguez, Laurent; Miramond, Benoit

doi:10.5281/zenodo.4452953

Published October 1, 2020 | Version 2.0

Dataset Open

1. Université Côte d'Azur, CNRS, LEAT, France

Database description:

The written and spoken digits database is not a new database. It is built from existing ones to provide a ready-to-use database for multimodal fusion [1].

The written digits database is the original MNIST handwritten digits database [2] with no additional processing. It consists of 70000 images (60000 for training and 10000 for test), each of 28 x 28 = 784 dimensions.

The spoken digits database was extracted from Google Speech Commands [3], an audio dataset of spoken words that was proposed to train and evaluate keyword spotting systems. It consists of 105829 utterances of 35 words, of which 38908 are utterances of the ten digits (34801 for training and 4107 for test). Pre-processing consisted of extracting the Mel Frequency Cepstral Coefficients (MFCC) with a framing window of 50 ms and a frame shift of 25 ms. Since the speech samples are approximately 1 s long, this yields 39 time slots. For each slot, we extract 12 MFCC coefficients plus an energy coefficient, giving a final vector of 39 x 13 = 507 dimensions. Standardization and normalization were applied to the MFCC features.
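The frame count and feature dimension quoted above can be checked with a short calculation. This is a minimal sketch that assumes exactly 1000 ms of audio and no padding of the last frame; the helper name `num_frames` is hypothetical and is not part of the dataset or of any MFCC library.

```python
def num_frames(duration_ms: int, window_ms: int, shift_ms: int) -> int:
    """Count the full analysis windows that fit in a signal (no padding assumed)."""
    if duration_ms < window_ms:
        return 0
    return (duration_ms - window_ms) // shift_ms + 1


# Values taken from the description: 1 s utterances, 50 ms window, 25 ms shift.
frames = num_frames(duration_ms=1000, window_ms=50, shift_ms=25)

# 12 MFCC coefficients plus one energy coefficient per frame.
coeffs_per_frame = 12 + 1
feature_dim = frames * coeffs_per_frame

print(frames, feature_dim)  # 39 507
```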

To construct the multimodal digits dataset, we associated written and spoken digits of the same class, respecting the initial partitioning in [2] and [3] for the training and test subsets. Since there are fewer samples for the spoken digits, we duplicated some random samples to match the number of written digits, giving a multimodal digits database of 70000 samples (60000 for training and 10000 for test).
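One way to picture the class-wise association with duplication is sketched below. It is an illustration only, not the authors' actual procedure, which the description does not detail: the function name `pair_by_class`, the seed, and the policy of using each spoken sample once before drawing random duplicates are all assumptions.

```python
import numpy as np


def pair_by_class(written_labels, spoken_labels, seed=0):
    """Return, for each written sample, the index of a spoken sample of the same class.

    Hypothetical sketch: spoken samples of a class are used once in random order,
    then random duplicates fill the remaining written samples of that class.
    """
    rng = np.random.default_rng(seed)
    written_labels = np.asarray(written_labels)
    spoken_labels = np.asarray(spoken_labels)
    paired = np.empty(len(written_labels), dtype=np.int64)

    for digit in np.unique(written_labels):
        wr_idx = np.flatnonzero(written_labels == digit)
        sp_idx = np.flatnonzero(spoken_labels == digit)
        if len(sp_idx) == 0:
            raise ValueError(f"no spoken sample for class {digit}")
        # Shuffle the available spoken samples, then draw duplicates if needed.
        order = rng.permutation(sp_idx)
        missing = max(0, len(wr_idx) - len(sp_idx))
        extra = rng.choice(sp_idx, size=missing, replace=True)
        paired[wr_idx] = np.concatenate([order, extra])[: len(wr_idx)]

    return paired


# Toy labels: three written zeros but only two spoken zeros, so one is duplicated.
written = np.array([0, 0, 0, 1, 1])
spoken = np.array([1, 0, 0])
indices = pair_by_class(written, spoken)
```

Indexing the spoken features with `indices` would then give an array aligned row by row with the written digits.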

The dataset is provided in six files as described below. Because the written digits, spoken digits and labels are stored in separate files and matched by sample order, any shuffle of the training or test subset must apply the same order to all three.
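A shuffle in unison can be done by drawing one permutation and applying it to all three arrays. The sketch below assumes the arrays are already in memory as NumPy arrays with samples along the first axis; the function name `shuffle_in_unison` and the `seed` parameter are illustrative choices.

```python
import numpy as np


def shuffle_in_unison(written, spoken, labels, seed=None):
    """Shuffle three aligned arrays with a single shared permutation."""
    if not (len(written) == len(spoken) == len(labels)):
        raise ValueError("written, spoken and labels must have the same length")
    perm = np.random.default_rng(seed).permutation(len(labels))
    return written[perm], spoken[perm], labels[perm]


# Toy data in which every row stores its own label, so alignment is easy to check.
toy_labels = np.arange(5)
toy_written = np.repeat(toy_labels[:, None], 3, axis=1)
toy_spoken = np.repeat(toy_labels[:, None], 2, axis=1)

wr, sp, lb = shuffle_in_unison(toy_written, toy_spoken, toy_labels, seed=42)
```

Shuffling each array with its own call to a random shuffle would break the pairing between a written digit, its spoken counterpart and its label.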

Files:

data_wr_train.npy: 60000 samples of 784-dimensional written digits for training;
data_sp_train.npy: 60000 samples of 507-dimensional spoken digits for training;
labels_train.npy: 60000 labels for the training subset;
data_wr_test.npy: 10000 samples of 784-dimensional written digits for test;
data_sp_test.npy: 10000 samples of 507-dimensional spoken digits for test;
labels_test.npy: 10000 labels for the test subset.
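The files listed above can be read with `numpy.load`. The following is a minimal sketch: the helper `load_split` is hypothetical, and the array layout (samples along the first axis, for example 60000 x 784 for the written training digits) is an assumption inferred from the description rather than something stated in it.

```python
from pathlib import Path

import numpy as np


def load_split(data_dir, split):
    """Load the written digits, spoken digits and labels of one subset.

    `split` is "train" or "test"; file names follow the list in the description.
    """
    if split not in ("train", "test"):
        raise ValueError("split must be 'train' or 'test'")
    data_dir = Path(data_dir)
    written = np.load(data_dir / f"data_wr_{split}.npy")
    spoken = np.load(data_dir / f"data_sp_{split}.npy")
    labels = np.load(data_dir / f"labels_{split}.npy")
    # The three arrays are matched by sample order, so their lengths must agree.
    if not (len(written) == len(spoken) == len(labels)):
        raise ValueError("written, spoken and labels have different lengths")
    return written, spoken, labels
```

With the six files downloaded into one directory, `load_split(directory, "train")` would return the three training arrays and `load_split(directory, "test")` the three test arrays.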

References:

[1] Khacef, L. et al. (2020), "Brain-Inspired Self-Organization with Cellular Neuromorphic Computing for Multimodal Unsupervised Learning".
[2] LeCun, Y. & Cortes, C. (1998), "MNIST handwritten digit database".
[3] Warden, P. (2018), "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition".

Files (723.5 MB)

Name	MD5	Size
data_sp_test.npy	85dfcbec8e18d0a2a5bc484bec7331fb	40.6 MB
data_sp_train.npy	4e9ac96a6cd81250a96633b8997cee44	243.4 MB
data_wr_test.npy	4220bf5b4f16b8c34d4b1edcc9ffe8f6	62.7 MB
data_wr_train.npy	127bf195bce96aacc776e4f2449ec762	376.3 MB
labels_test.npy	10bd739acbb3520b133e2c4d4f4b21f1	80.1 kB
labels_train.npy	88b689ce98c01167aff3de5f722be2ab	480.1 kB
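The MD5 checksums listed above can be used to confirm that a download is intact. This sketch uses only the Python standard library; the helper names `md5sum` and `verify_files` are hypothetical, and the files are assumed to sit together in one local directory.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "data_sp_test.npy": "85dfcbec8e18d0a2a5bc484bec7331fb",
    "data_sp_train.npy": "4e9ac96a6cd81250a96633b8997cee44",
    "data_wr_test.npy": "4220bf5b4f16b8c34d4b1edcc9ffe8f6",
    "data_wr_train.npy": "127bf195bce96aacc776e4f2449ec762",
    "labels_test.npy": "10bd739acbb3520b133e2c4d4f4b21f1",
    "labels_train.npy": "88b689ce98c01167aff3de5f722be2ab",
}


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(data_dir, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its checksum matches."""
    data_dir = Path(data_dir)
    results = {}
    for name, checksum in expected.items():
        path = data_dir / name
        results[name] = path.is_file() and md5sum(path) == checksum
    return results
```

Reading in chunks keeps memory use low for the larger files, such as the 376.3 MB written training set.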
