Published August 21, 2021 | Version 4
Dataset Open

CWID-hi: A Dataset for Complex Word Identification in Hindi Text

  • 1. Symbiosis Institute of Computer Studies and Research, Symbiosis International (Deemed University)
  • 2. Symbiosis Centre for Information Technology, Symbiosis International (Deemed University)


This dataset was created by conducting a human intelligence test in which native and non-native Hindi speakers annotated the words they could not understand in Hindi text. The annotators were then asked to rank the complexity of these words along with their synonyms. A word with an average rank of <=3 (out of 5) is labeled 1 (complex), and a word with an average rank of >3 is labeled 0 (simple).
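
The labeling rule described above can be illustrated with a minimal Python sketch. The function name `label_word`, the `threshold` parameter, and the sample ranks are hypothetical and used only for illustration; they are not taken from the dataset files, whose layout is not described here. The sketch assumes each word comes with a list of per-annotator ranks on the 1 to 5 scale.

```python
from statistics import mean


def label_word(ranks, threshold=3.0):
    """Return 1 (complex) if the average rank is <= threshold, else 0 (simple).

    `ranks` is a hypothetical list of per-annotator ranks on a 1-5 scale.
    The threshold of 3 follows the rule stated in the description.
    """
    if not ranks:
        raise ValueError("at least one rank is required")
    return 1 if mean(ranks) <= threshold else 0


# Made-up ranks, for illustration only.
print(label_word([2, 3, 4]))  # average is exactly 3, so the word is labeled complex
print(label_word([4, 5, 3]))  # average is 4, so the word is labeled simple
```

Note that the boundary case (an average of exactly 3) falls on the complex side, as the description specifies <=3 for label 1.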



Files (863.4 kB)
