Published April 25, 2019 | Version v5
Dataset Open

Japanese FAQ dataset for e-learning system

  • 1. Tokyo Metropolitan University


This dataset contains FAQ data and their categories for training a chatbot specialized for the e-learning system used at Tokyo Metropolitan University. We report the accuracy of the chatbot in the following paper.

Yasunobu Sumikawa, Masaaki Fujiyoshi, Hisashi Hatakeyama, and Masahiro Nagai "Supporting Creation of FAQ Dataset for E-learning Chatbot", Intelligent Decision Technologies, Smart Innovation, IDT'19, Springer, 2019, to appear.

Yasunobu Sumikawa, Masaaki Fujiyoshi, Hisashi Hatakeyama, and Masahiro Nagai "An FAQ Dataset for E-learning System Used on a Japanese University", Data in Brief, Elsevier, in press.

This dataset is based on real Q&A data about how to use the e-learning system, asked by students and teachers who use it in practical classes. The Q&A data were collected from April 2015 to July 2018.

We also attach an English version, translated from the Japanese dataset, to make the contents of the dataset easier to understand. Note that we did not perform any evaluation on the English version, so there are no results on how accurately a chatbot responds to its questions.

File contents:

  • FAQ data (*.csv)
    1. Answer2Category.csv: Categories of answers.
    2. Answer2Tag.csv: Titles of answers.
    3. Answers.csv: IDs for answers and texts of answers.
    4. Categories.csv: Names of categories for answers.
    5. Questions.csv: Texts of questions and their corresponding answer IDs.
    6. Answers_english.csv: IDs for answers and texts of answers written in English.
    7. Categories_english.csv: Names of categories for answers and their corresponding English names.
    8. Questions_english.csv: Texts of questions and their corresponding answer IDs written in English.

  • Statistics (*.tsv)

     Results of statistical analyses of the dataset. We used the Calinski-Harabasz method, mutual information, the Jaccard index, TF-IDF+KL divergence, and TF-IDF+JS divergence to measure the quality of the dataset. In the analyses, we regard each answer as a cluster of questions. We also performed the same analyses for categories by regarding them as clusters of answers.
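
As a rough illustration of the cluster view described above, the following Python sketch groups questions by answer ID, builds one TF-IDF distribution per answer, and compares two answers with the Jaccard index and the JS divergence. The toy rows, the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions made for illustration only; they are not taken from the dataset files or from the authors' analysis code.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy rows standing in for (question text, answer ID) pairs.
# The real column layout of Questions.csv is not shown here; this is an assumption.
questions = [
    ("how do i reset my password", "A1"),
    ("i forgot my password", "A1"),
    ("how do i submit a report", "A2"),
    ("where do i upload my report file", "A2"),
]


def tokenize(text):
    """Whitespace tokenizer (an assumption; Japanese text would need a real segmenter)."""
    return text.lower().split()


def jaccard(tokens_a, tokens_b):
    """Jaccard index between two token collections, computed on their sets."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_tfidf(rows):
    """Treat each answer ID as one cluster and return a normalized TF-IDF
    distribution over the shared vocabulary for every cluster."""
    clusters = defaultdict(Counter)
    for text, answer_id in rows:
        clusters[answer_id].update(tokenize(text))
    n_clusters = len(clusters)
    doc_freq = Counter()
    for counts in clusters.values():
        doc_freq.update(counts.keys())
    vocab = sorted(doc_freq)
    dists = {}
    for answer_id, counts in clusters.items():
        # Smoothed IDF (assumed variant) keeps every weight non-negative.
        weights = [
            counts[t] * (math.log((1 + n_clusters) / (1 + doc_freq[t])) + 1)
            for t in vocab
        ]
        total = sum(weights)
        dists[answer_id] = [w / total for w in weights]
    return dists


def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """JS divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


dists = cluster_tfidf(questions)
print("JS(A1, A2):", js_divergence(dists["A1"], dists["A2"]))
print("Jaccard(q1, q2):", jaccard(tokenize(questions[0][0]), tokenize(questions[1][0])))
```

A larger JS divergence between two answers suggests that their question sets use more distinct vocabulary. Applying plain KL divergence between two clusters would additionally need smoothing, because a term missing from one cluster gives a zero probability in the denominator.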

Grants: JSPS KAKENHI Grant Number 18H01057



Files (1.5 MB)
