Published February 12, 2016 | Version v1
Dataset Open

NNSeval: Evaluating Lexical Simplification for Non-Natives

  • 1. Gustavo Henrique
  • 2. Lucia

Description

We conducted a user study to learn more about word complexity for non-native speakers. A total of 400 non-native speakers, all university students or staff, participated in the experiment. They were asked to judge whether or not they could understand the meaning of each content word (nouns, verbs, adjectives and adverbs, as tagged by Freeling (Padró and Stanilovsky 2012)) in a set of sentences, each of which was judged independently. Volunteers were instructed to annotate every word that they could not understand on its own, even if they could comprehend the meaning of the sentence as a whole.

All sentences were taken from Wikipedia, LSeval and LexMTurk. A total of 35,958 distinct words from 9,200 sentences were annotated (232,481 total), of which 3,854 distinct words (6,388 total) were deemed complex by at least one annotator.
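The distinct and total counts above can be illustrated with a minimal sketch. It assumes a hypothetical layout in which each judgment is an `(annotator_id, word, is_complex)` tuple; the record does not describe the raw annotation format, and the function name `summarize_complex` is illustrative only.

```python
from collections import Counter


def summarize_complex(annotations):
    """Summarize judgments given as (annotator_id, word, is_complex) tuples.

    The tuple layout is an assumption made for illustration; it is not
    the documented format of the study data.
    """
    complex_counts = Counter()  # word -> number of "complex" judgments
    all_words = set()
    total = 0
    for _annotator, word, is_complex in annotations:
        key = word.lower()
        all_words.add(key)
        total += 1
        if is_complex:
            complex_counts[key] += 1
    return {
        "distinct_words": len(all_words),
        "total_annotations": total,
        # A word counts as complex if at least one annotator flagged it.
        "distinct_complex": len(complex_counts),
        "total_complex": sum(complex_counts.values()),
    }
```

Under this layout, a word flagged by two annotators contributes one distinct complex word and two total complex judgments.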

Using the data produced in the user study, we first assessed the reliability of the LSeval and LexMTurk datasets in evaluating LS systems for non-native speakers. We found that the proportion of target words deemed complex by at least one annotator was only 30.8% for LexMTurk and 15% for LSeval. As for the candidate substitutions, 21.7% of those in LSeval and 13.4% of those in LexMTurk were deemed complex by at least one annotator.

These results show that, although they may not be used in their entirety, both datasets contain instances that are suitable for our purposes. To create our dataset, we first used the Text Adorning module of LEXenstein (Paetzold and Specia 2015; Burns 2013) to inflect all candidate verbs and nouns in both datasets to the same tense as the target word. We then used the Spelling Correction module of LEXenstein to correct any misspelled words among the candidates of both datasets. Next, we removed all candidate substitutes which were deemed complex by at least one annotator in our user study. Finally, we discarded all instances in which the target word was not deemed complex by any of our annotators. The resulting dataset, which we refer to as NNSeval, contains 239 instances.
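The two filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: instances are assumed to be `(target, candidates)` pairs, `complex_words` is assumed to be a lowercase set of words flagged by at least one annotator, and the function name `filter_instances` is hypothetical. The LEXenstein inflection and spelling correction steps are not reproduced here.

```python
def filter_instances(instances, complex_words):
    """Apply the two filtering steps to (target, candidates) pairs.

    complex_words: lowercase set of words deemed complex by at least
    one annotator (an assumed representation, for illustration).
    """
    kept = []
    for target, candidates in instances:
        # Discard instances whose target no annotator found complex.
        if target.lower() not in complex_words:
            continue
        # Remove candidate substitutes that some annotator found complex.
        simple = [c for c in candidates if c.lower() not in complex_words]
        kept.append((target, simple))
    return kept
```

The description does not say what happens to an instance left with no candidates, so this sketch keeps such instances unchanged rather than guessing.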

Notes

http://ghpaetzold.github.io/data/NNSeval.zip

Files

NNSeval.zip

Files (27.4 kB)

NNSeval.zip, 27.4 kB, md5:b89615056b7b0db921ad2e8f35b23a52

Additional details

Funding

SIMPATICO – SIMplifying the interaction with Public Administration Through Information technology for Citizens and cOmpanies 692819
European Commission