Published August 21, 2021 | Version 4
Dataset Open

CWID-hi: A Dataset for Complex Word Identification in Hindi Text

  • 1. Symbiosis Institute of Computer Studies and Research, Symbiosis International (Deemed University)
  • 2. Symbiosis Centre for Information Technology, Symbiosis International (Deemed University)

Description

This dataset was created by conducting a human intelligence test in which native and non-native Hindi speakers annotated the words they could not understand in Hindi text. The annotators were then asked to rank the complexity of these words along with their synonyms. A word that received an average rank of <=3 (out of 5) is labeled 1 (complex), and a word that received an average rank of >3 is labeled 0 (simple).
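The labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the stated threshold, not the authors' actual processing code: the function name `label_word` and the representation of a word's ranks as a plain list of numbers are assumptions made for the example.

```python
from statistics import mean

# Threshold taken from the description: average rank <= 3 (out of 5) means complex.
COMPLEX_THRESHOLD = 3


def label_word(ranks):
    """Return 1 (complex) if the average rank is <= 3, otherwise 0 (simple).

    `ranks` is a hypothetical list of the ranks (1 to 5) that the
    annotators gave to a single word.
    """
    if not ranks:
        raise ValueError("at least one rank is required")
    return 1 if mean(ranks) <= COMPLEX_THRESHOLD else 0


if __name__ == "__main__":
    print(label_word([1, 2, 3]))  # average rank 2, at or below the threshold
    print(label_word([4, 5, 3]))  # average rank 4, above the threshold
```

Note that a word whose average rank is exactly 3 falls on the complex side, since the description uses <=3 for label 1.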

Files

dataset.csv (863.4 kB)
md5:e848d395304103a1ab510edd6227e1ae
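After downloading, the file can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_published_md5` are hypothetical, and the code assumes the file was saved locally under the name `dataset.csv`.

```python
import hashlib

# Checksum as published on this page for dataset.csv.
EXPECTED_MD5 = "e848d395304103a1ab510edd6227e1ae"


def md5_of_file(path, chunk_size=65536):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path):
    """Return True if the file's MD5 digest equals the published checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, `matches_published_md5("dataset.csv")` should return `True` for an intact download and `False` if the file was truncated or altered.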