There is a newer version of the record available.

Published April 29, 2022 | Version 5
Other Open

CWID-hi: A Dataset for Complex Word Identification in Hindi Text

  • 1. Symbiosis Institute of Computer Studies and Research, Symbiosis International (Deemed University)
  • 2. Symbiosis Centre for Information Technology, Symbiosis International (Deemed University)
  • 3. Cognitive Science Research Group, Queen Mary University of London


This dataset was created by conducting a human intelligence test in which native and non-native Hindi speakers annotated the words they could not understand in Hindi text. The annotators were then asked to rank the complexity of these words along with their synonyms on a scale of 1 to 5. A word with an average rank of 3 or lower is labeled 1 (complex), and a word with an average rank above 3 is labeled 0 (simple).
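The labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the stated threshold, not the authors' processing code: the function name `label_word`, the constant `COMPLEX_THRESHOLD`, and the sample ranks are hypothetical, and the record does not describe how ranks are stored in the dataset files.

```python
from statistics import mean

# Threshold taken from the description: an average rank of 3 or lower
# (on the 1-5 scale) marks a word as complex.
COMPLEX_THRESHOLD = 3


def label_word(ranks):
    """Return 1 (complex) if the average rank is <= 3, otherwise 0 (simple).

    `ranks` is a hypothetical list of per-annotator ranks for one word.
    """
    if not ranks:
        raise ValueError("at least one rank is required")
    return 1 if mean(ranks) <= COMPLEX_THRESHOLD else 0


if __name__ == "__main__":
    # Made-up ranks for illustration only; they are not from the dataset.
    print(label_word([1, 2, 3]))  # average 2 -> complex
    print(label_word([3, 3, 3]))  # average 3 -> complex (boundary case)
    print(label_word([4, 5, 3]))  # average 4 -> simple
```

Note that the boundary value counts as complex: under the rule as stated, an average rank of exactly 3 yields label 1.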



Files (906.5 kB)
