BhashaHMPV Dataset: Multilingual HMPV News and Fact-Check Articles Dataset for Indian Regional Languages
Description
To collect Google News articles on HMPV in the Indian context, we scraped the articles using Splinter, a Python-based browser-automation framework. In the script, we queried terms and phrases such as “HMPV” and “hmpv india”. The results included articles from the Google News website in a variety of languages, and we noted each website's domain and language. We automated the search URL by changing the language filter of the website. In some cases, all articles were scraped and those unrelated to HMPV were filtered out in the pre-processing stage. All collected samples were then combined into one CSV file.
We retrieved articles in eleven languages supported by Google News, namely: Bengali, English, Gujarati, Hindi, Marathi, Malayalam, Punjabi, Tamil, Telugu, Urdu, and Kannada. We also performed stemming for each language, and the stemmed outputs were added as separate columns in the respective language-specific sheets of the final CSV file.
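The URL automation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the language codes, the `hl`/`gl`/`ceid` query parameters, and the helper names `build_search_url`, `all_search_urls`, and `fetch_html` are assumptions made for the example. Only Splinter and the two query strings come from the description.

```python
from urllib.parse import urlencode

# Assumed ISO 639-1 codes; the description names the languages, not the codes.
LANGUAGES = {
    "Bengali": "bn", "English": "en", "Gujarati": "gu", "Hindi": "hi",
    "Kannada": "kn", "Malayalam": "ml", "Marathi": "mr", "Punjabi": "pa",
    "Tamil": "ta", "Telugu": "te", "Urdu": "ur",
}
QUERIES = ["HMPV", "hmpv india"]  # query terms quoted in the description


def build_search_url(query, lang_code, country="IN"):
    """Build a Google News search URL with the language filter set (assumed scheme)."""
    params = {
        "q": query,
        "hl": lang_code,                      # interface language
        "gl": country,                        # country edition
        "ceid": f"{country}:{lang_code}",     # edition identifier
    }
    return "https://news.google.com/search?" + urlencode(params)


def all_search_urls():
    """Return (language, query, url) for every language/query combination."""
    return [
        (language, query, build_search_url(query, code))
        for language, code in LANGUAGES.items()
        for query in QUERIES
    ]


def fetch_html(url):
    """Load a page in a headless browser via Splinter and return its HTML."""
    from splinter import Browser  # imported lazily so URL building works without it

    browser = Browser("chrome", headless=True)
    try:
        browser.visit(url)
        return browser.html
    finally:
        browser.quit()
```

Calling `all_search_urls()` yields one URL per language and query, each of which could be passed to `fetch_html` before the article links are parsed out of the returned page.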
The following information was extracted along with the news articles:
1) language of the Google News article
2) title of the Google News article
3) source of the Google News article (if available)
4) link of the Google News article
5) content of the Google News article
6) domain of the article
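As a rough sketch of how the six fields above might be held in memory, filtered for relevance, and written to CSV: the class name `NewsArticle`, the keyword list, and the helper names are hypothetical, and the actual pre-processing filter used for the dataset is not specified in the description.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class NewsArticle:
    """One scraped record, mirroring the six fields listed above."""
    language: str
    title: str
    source: Optional[str]  # not always available
    link: str
    content: str
    domain: str


# Assumed keywords; articles that name the virus only in a native script
# would need additional, language-specific keywords.
KEYWORDS = ("hmpv", "metapneumovirus")


def is_hmpv_related(article, keywords=KEYWORDS):
    """Keep an article if any keyword occurs in its title or content."""
    text = f"{article.title} {article.content}".lower()
    return any(keyword in text for keyword in keywords)


def to_csv(articles):
    """Serialise the articles to CSV text with one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(NewsArticle)])
    writer.writeheader()
    for article in articles:
        writer.writerow(asdict(article))
    return buffer.getvalue()
```

A typical use would be `to_csv([a for a in scraped if is_hmpv_related(a)])`, where `scraped` is the list of all collected records.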
To collect Google Fact-Check articles, we used a Google Fact-Check API key to fetch the articles. In the Python script, we queried terms and phrases such as “HMPV” and “hmpv india”. We also performed stemming for each language, and the stemmed outputs were added as separate columns in the respective language-specific sheets of the final CSV file. The following information was extracted along with the fact-check articles:
1) claim-text of the Google fact-check article
2) claimant of the Google fact-check article
3) claim-date of the Google fact-check article
4) review-publisher of the Google fact-check article
5) review-title of the Google fact-check article
6) review-url of the Google fact-check article
7) review-date of the Google fact-check article
8) textual-rating of the Google fact-check article
9) extracted-content of the Google fact-check article
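The fact-check fields listed above correspond closely to the response of the `claims:search` method of the Google Fact Check Tools API. The sketch below assumes that endpoint was the one used; the function names and output column names are illustrative. The extracted-content field is left empty because it is not part of the API response and would have to be scraped from the review URL separately.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the claims:search method of the Fact Check Tools API.
ENDPOINT = "https://factchecktools.googleapis.com/v1alpha1/claims:search"


def search_claims(query, api_key, language_code=None, page_size=50):
    """Fetch one page of fact-checked claims matching the query."""
    params = {"query": query, "key": api_key, "pageSize": page_size}
    if language_code:
        params["languageCode"] = language_code
    with urlopen(f"{ENDPOINT}?{urlencode(params)}") as response:
        return json.load(response)


def flatten_claims(payload):
    """Turn an API response into one row per (claim, review) pair."""
    rows = []
    for claim in payload.get("claims", []):
        for review in claim.get("claimReview", []):
            rows.append({
                "claim_text": claim.get("text"),
                "claimant": claim.get("claimant"),
                "claim_date": claim.get("claimDate"),
                "review_publisher": review.get("publisher", {}).get("name"),
                "review_title": review.get("title"),
                "review_url": review.get("url"),
                "review_date": review.get("reviewDate"),
                "textual_rating": review.get("textualRating"),
                "extracted_content": None,  # filled in later from review_url
            })
    return rows
```

For example, `flatten_claims(search_claims("HMPV", api_key))` would return rows ready to be written to CSV, with `api_key` supplied by the caller.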
Files
(1.9 MB)
Checksum | Size |
---|---|
md5:dcecef35b0f08e7fda08952633e71b64 | 119.2 kB |
md5:cde4ad2242a6902d57ef3b543bc8b1a6 | 1.7 MB |