Published November 16, 2020
| Version v1
Dataset
Open
Rule-based Synthetic Data for Japanese GEC
Creators
- 1. M.Eng., Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
- 2. Ph.D. Student, Center for Computational Engineering, Massachusetts Institute of Technology
- 3. Senior Lecturer, Global Languages, Massachusetts Institute of Technology
Description
Title: Rule-based Synthetic Data for Japanese GEC

Dataset Contents:

This dataset contains two parallel corpora intended for training and evaluating models for the NLP (natural language processing) subtask of Japanese GEC (grammatical error correction). These are as follows:

### Synthetic Corpus - *synthesized_data.tsv*

This corpus file contains 2,179,130 parallel sentence pairs synthesized using the process described in [1]. Each line of the file consists of two sentences delimited by a tab. The first sentence is the erroneous sentence, while the second is the corresponding correction.

These paired sentences are derived from data scraped from the keyword-lookup site <yourei.jp>. The data within this file is primarily intended to serve as, or to augment, a training set for a Japanese GEC model. Overall, the sentences cover a broad array of primarily simple Japanese grammatical errors.

### Teacher Corpus - *teacher_data.tsv*

This corpus file contains 6,345 parallel sentence pairs created via what we call the "teacher-sourcing" project [2]. The corpus sentences were created by Japanese language teachers, and this "teacher-sourcing" was funded by the Japan Foundation, Los Angeles.

The overall format of the file is similar to that of *synthesized_data.tsv*, with each line containing an erroneous sentence and a corresponding correction separated by a comma. In addition, each erroneous sentence and each correction sentence contains a pair of characters that delimit the specific location within the sentence where the error or correction occurs. For the erroneous sentence, these characters are `<` and `>`, while for the correction sentence, these are `(` and `)`.
For example, consider the following sentence pair:

- Error: <汚れる服>をあらいました。
- Correction: (汚れた服)をあらいました。

The delimiter characters indicate that the error phrase is `汚れる服` while the corresponding correction is `汚れた服`.

These paired sentences were written to mimic common grammatical errors produced by Japanese language learners; thus this file's data is primarily intended to serve as an evaluation set for Japanese GEC models.

______

In addition, the dataset contains the rule file used to generate the synthetic data within *synthesized_data.tsv*:

### Rule File - *rule_set.tsv*

This file contains the 400 "syntactic rules" used to generate the data within *synthesized_data.tsv*. Each line contains a single rule, with different attributes delimited by tabs. Consult pages 41-66 of [1] for a more detailed analysis of these "syntactic rules" and the manner in which they are used to produce the synthetic data.
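The corpus formats described above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the dataset: the names `SentencePair`, `parse_pair`, `read_pairs`, `marked_spans`, and `strip_markers` are hypothetical, UTF-8 encoding is an assumption, and the field delimiter is left as a parameter because the description gives a tab for *synthesized_data.tsv* and a comma for *teacher_data.tsv*, which should be checked against the actual files. It also assumes each teacher sentence carries exactly one marked span and no other `<`, `>`, `(`, or `)` characters.

```python
import re
from typing import Iterator, NamedTuple, Optional, Tuple


class SentencePair(NamedTuple):
    error: str        # erroneous sentence (first field on the line)
    correction: str   # corrected sentence (second field on the line)


def parse_pair(line: str, delimiter: str = "\t") -> SentencePair:
    """Split one corpus line into an (error, correction) pair."""
    error, correction = line.rstrip("\n").split(delimiter, 1)
    return SentencePair(error, correction)


def read_pairs(path: str, delimiter: str = "\t") -> Iterator[SentencePair]:
    """Lazily yield sentence pairs from a corpus file (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_pair(line, delimiter)


# Markers used in the teacher corpus: <...> in the erroneous sentence,
# (...) in the correction.
ERROR_SPAN = re.compile(r"<(.*?)>")
CORRECTION_SPAN = re.compile(r"\((.*?)\)")


def marked_spans(pair: SentencePair) -> Tuple[Optional[str], Optional[str]]:
    """Return the marked error phrase and correction phrase, if present."""
    err = ERROR_SPAN.search(pair.error)
    cor = CORRECTION_SPAN.search(pair.correction)
    return (err.group(1) if err else None, cor.group(1) if cor else None)


def strip_markers(pair: SentencePair) -> SentencePair:
    """Remove the span markers, leaving plain sentences for a GEC model."""
    return SentencePair(
        ERROR_SPAN.sub(r"\1", pair.error),
        CORRECTION_SPAN.sub(r"\1", pair.correction),
    )


# The example pair from the description, joined here with a tab.
example = parse_pair("<汚れる服>をあらいました。\t(汚れた服)をあらいました。")
print(marked_spans(example))   # ('汚れる服', '汚れた服')
print(strip_markers(example))
```

Because `read_pairs` is a generator, the 2,179,130 pairs of the synthetic corpus can be streamed without loading the whole file into memory, for example with `for pair in read_pairs("synthesized_data.tsv"): ...`.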
Files
README.md
Additional details
Related works
- Is documented by
- Book chapter: 10.5281/zenodo.4281322 (DOI)