CWID-hi: A Dataset for Complex Word Identification in Hindi Text

doi:10.5281/zenodo.5229160

Natural Language Processing

There is a newer version of the record available.

Published April 29, 2022 | Version 5

Other Open

1. Symbiosis Institute of Computer Studies and Research, Symbiosis International (Deemed University)
2. Symbiosis Centre for Information Technology, Symbiosis International (Deemed University)
3. Cognitive Science Research Group, Queen Mary University of London

This dataset was created by conducting a human intelligence test in which native and non-native Hindi speakers annotated the words they could not understand in Hindi text. The annotators were then asked to rank the complexity of these words along with their synonyms. A word with an average rank of 3 or lower (out of 5) is labeled 1 (complex); a word with an average rank above 3 is labeled 0 (simple).
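The labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the stated threshold, not the authors' code: the function name `label_word` and the idea of passing the individual ranks as a list are assumptions, and the record does not say how ranks were stored or aggregated beyond taking an average.

```python
from statistics import mean


def label_word(ranks, threshold=3.0):
    """Return 1 (complex) if the average rank is <= threshold, else 0 (simple).

    `ranks` is a hypothetical list of per-annotator ranks on the 1-5 scale.
    """
    if not ranks:
        raise ValueError("at least one rank is required")
    return 1 if mean(ranks) <= threshold else 0
```

For example, `label_word([2, 3, 4])` averages to exactly 3 and is therefore labeled 1, while `label_word([3, 4])` averages to 3.5 and is labeled 0.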

Files

Files (906.5 kB)

Name	Size
Ranked_dataset.csv (md5:d1f305cbb16f4ca528304af6dbb5cd7d)	906.5 kB
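
A downloaded copy of Ranked_dataset.csv can be checked against the MD5 checksum listed for the file. The following is a minimal sketch using Python's standard `hashlib` module; the function names `md5_of_file` and `verify` and the local file path are illustrative assumptions, not part of the record.

```python
import hashlib

# Checksum as listed for Ranked_dataset.csv on this record.
EXPECTED_MD5 = "d1f305cbb16f4ca528304af6dbb5cd7d"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify("Ranked_dataset.csv")` on a local copy (the path is an assumption) returns `True` only if the file matches the listed checksum. Because the record has several versions, a mismatch may simply mean the copy comes from a different version.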

Views and downloads

	All versions	This version
Views	1,155	254
Downloads	285	136
Data volume	297.0 MB	151.4 MB

Resource type

Other

Publisher

Zenodo

Languages

Hindi

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: August 21, 2021
Modified: April 29, 2022
