Published November 16, 2020 | Version v1
Dataset Open

Rule-based Synthetic Data for Japanese GEC

  • 1. M.Eng., Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
  • 2. Ph.D. Student, Center for Computational Engineering, Massachusetts Institute of Technology
  • 3. Senior Lecturer, Global Languages, Massachusetts Institute of Technology

Description

Title: Rule-based Synthetic Data for Japanese GEC
Dataset Contents:
This dataset contains two parallel corpora intended for training and evaluating models for the NLP (natural language processing) subtask of Japanese GEC (grammatical error correction). These are as follows:

Synthetic Corpus - *synthesized_data.tsv*
This corpus file contains 2,179,130 parallel sentence pairs synthesized using the process described in [1]. Each line of the file consists of two sentences delimited by a tab. The first sentence is the erroneous sentence while the second is the corresponding correction.

These paired sentences are derived from data scraped from the keyword-lookup site <yourei.jp>. The data within this file is primarily intended to serve as, or to augment, a training set for a Japanese GEC model. Overall, the sentences cover a broad array of primarily simple Japanese grammatical errors.
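As an illustration of the line format described above, the following is a minimal sketch of reading the sentence pairs in Python. The function names `parse_pair` and `read_pairs` are hypothetical, and UTF-8 encoding is an assumption, since the encoding is not stated here.

```python
from typing import Iterator, Tuple


def parse_pair(line: str) -> Tuple[str, str]:
    """Split one line into (erroneous sentence, corrected sentence)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-delimited fields, got {len(fields)}")
    return fields[0], fields[1]


def read_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Lazily yield sentence pairs so the large corpus file is never fully loaded."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8", newline="") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_pair(line)
```

For example, `for erroneous, corrected in read_pairs("synthesized_data.tsv"): ...` would stream the pairs one at a time, which matters for a file of this size.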

Teacher Corpus - *teacher_data.tsv*
This corpus file contains 6,345 parallel sentence pairs created via what we call the "teacher-sourcing" project [2]. The corpus sentences were written by Japanese language teachers, and this "teacher-sourcing" was funded by the Japan Foundation, Los Angeles. The overall format of the file is similar to that of *synthesized_data.tsv*, with each line containing an erroneous sentence and a corresponding correction separated by a comma.

In addition, each erroneous sentence and each correction sentence contains a pair of characters that delimits the specific location within the sentence where the error or correction occurs. For the erroneous sentence, these characters are `<` and `>`, while for the correction sentence, these are `(` and `)`.

For example, consider the following sentence pair:

 - Error: <汚れる服>をあらいました。
 - Correction: (汚れた服)をあらいました。

The delimiter characters indicate that the error phrase is `汚れる服` while the corresponding correction is `汚れた服`.
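The delimiter convention above can be handled with a pair of regular expressions. The sketch below is illustrative only: the names `ERROR_SPAN`, `CORRECTION_SPAN`, and `extract_span` are hypothetical, and it assumes the delimiters are the ASCII characters shown in the example and that each sentence carries exactly one marked span.

```python
import re
from typing import Pattern, Tuple

# Assumption: ASCII delimiters, one marked span per sentence.
ERROR_SPAN = re.compile(r"<(.+?)>")
CORRECTION_SPAN = re.compile(r"\((.+?)\)")


def extract_span(sentence: str, pattern: Pattern[str]) -> Tuple[str, str]:
    """Return (marked phrase, sentence with the delimiters removed)."""
    match = pattern.search(sentence)
    if match is None:
        raise ValueError("no delimited span found in sentence")
    phrase = match.group(1)
    plain = sentence[:match.start()] + phrase + sentence[match.end():]
    return phrase, plain


error_phrase, error_plain = extract_span("<汚れる服>をあらいました。", ERROR_SPAN)
fixed_phrase, fixed_plain = extract_span("(汚れた服)をあらいました。", CORRECTION_SPAN)
print(error_phrase, "->", fixed_phrase)  # 汚れる服 -> 汚れた服
```

Stripping the delimiters yields plain sentence pairs suitable for model input, while the extracted phrases can be kept for span-level evaluation.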

These paired sentences were written to mimic common grammatical errors produced by Japanese language learners; thus this file's data is primarily intended to serve as an evaluation set for Japanese GEC models.

______

In addition, the dataset contains the rule file used to generate the synthetic data within *synthesized_data.tsv*:

### Rule File - *rule_set.tsv*
This file contains the 400 "syntactic rules" used to generate the data within *synthesized_data.tsv*. Each line contains a single rule, with different attributes delimited by tabs. Consult pages 41-66 of [1] for a more detailed analysis of these "syntactic rules" and the manner in which they are used to produce the synthetic data.

Notes

Associated Work

 - Kimn, A. (May, 2020). *A syntactic rule-based framework for parallel data synthesis in Japanese GEC* (Master's thesis, Massachusetts Institute of Technology, Cambridge, MA, United States of America). Retrieved from [1]
 - Aikawa, T. & T. Takahashi. (2019), 「AIチュータの実現に向け:誤用例文コーパスデータの構築と誤用文修正知識の習得」,『ICT×日本語教育:ICTが作る新しい日本語教育への挑戦』當作靖彦(監修),李在鎬(編集),ひつじ書房, pp.84-98. (Toward the Development of AI Tutor: the Development of Error Corpus Data and the Acquisition Process of Grammar Error Correction, Information and Communication Technology (ICT) x Japanese Language Education: ICT's Challenges for Japanese Language Education, Eds., Yasuhiko Tohsaku, Jae-Ho LEE, Hitsuji Shobo, Tokyo, Japan, pp. 84-98.) [2]

Files

README.md

Files (615.8 MB)

md5:49f8ab07e7262acbfab7010de6ada19f (4.1 kB)
md5:283f8896cf1075adb9bfaa4cfda767cc (123.5 kB)
md5:066df30dc803ba23be63dfd9b5a99775 (615.1 MB)
md5:6016d762fa0921e756940d69422d835f (604.2 kB)
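
The MD5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names are hypothetical, and because the listing here does not state which checksum belongs to which file, the pairing of file and checksum is left to the reader.

```python
import hashlib
from typing import Iterable


def md5_of_chunks(chunks: Iterable[bytes]) -> str:
    """Return the hex MD5 digest of a sequence of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files are not read into memory at once."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def same_checksum(actual_hex: str, listed: str) -> bool:
    """Compare a computed digest with a listed value such as 'md5:<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return actual_hex.lower() == expected
```

A call such as `same_checksum(md5_of_file(local_path), listed_value)` returns `True` only when the local copy matches the listed checksum.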

Additional details

Related works

Is documented by
Book chapter: 10.5281/zenodo.4281322 (DOI)