CWID-hi: A Dataset for Complex Word Identification in Hindi Text

doi:10.5281/zenodo.5790833

Natural Language Processing

Published August 21, 2021 | Version 4

Dataset Open

1. Symbiosis Institute of Computer Studies and Research, Symbiosis International (Deemed University)
2. Symbiosis Centre for Information Technology, Symbiosis International (Deemed University)

This dataset was created through a human intelligence test in which native and non-native Hindi speakers marked the words they could not understand in Hindi text. The annotators were then asked to rank the complexity of these words along with their synonyms. A word with an average rank of <=3 (out of 5) is labeled 1 (complex), and a word with an average rank of >3 is labeled 0 (simple).
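The labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the stated threshold, not the creators' actual pipeline: the function name `label_from_ranks` and the representation of annotator ranks as a plain list of numbers are assumptions made for this example.

```python
def label_from_ranks(ranks):
    """Return 1 (complex) if the average rank is <= 3, else 0 (simple).

    `ranks` is a hypothetical list of per-annotator ranks on the 1-5 scale;
    the dataset description does not specify how ranks are stored.
    """
    if not ranks:
        raise ValueError("at least one rank is required")
    average = sum(ranks) / len(ranks)
    # Threshold taken from the description: average <= 3 means complex.
    return 1 if average <= 3 else 0
```

Under this rule a word ranked 2, 3 and 3 by three annotators would be labeled 1, while a word ranked 4, 4 and 5 would be labeled 0.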

Files (863.4 kB)

Name	Size	MD5 checksum
dataset.csv	863.4 kB	e848d395304103a1ab510edd6227e1ae
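Since the record publishes an MD5 checksum for dataset.csv, a downloaded copy can be checked against it. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `verify`, and the assumption that the file sits in the current directory as `dataset.csv`, are illustrative choices and not part of the record.

```python
import hashlib
import os

# Checksum as listed on the record for dataset.csv.
EXPECTED_MD5 = "e848d395304103a1ab510edd6227e1ae"


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected


# Assumed local path; the check runs only if the file is present.
if os.path.exists("dataset.csv"):
    print("checksum OK" if verify("dataset.csv") else "checksum mismatch")
```

A mismatch usually points to an incomplete download or to a different version of the record, since each version has its own files.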

	All versions	This version
Views	1,155	423
Downloads	285	101
Data volume	297.0 MB	101.0 MB

DOI

10.5281/zenodo.5790833

Resource type

Dataset

Publisher

Zenodo

Languages

Hindi

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: December 18, 2021
Modified: April 29, 2022
