Event Registry dataset with multiple extracted features (both sparse and dense)

Guillaume Bernard

doi:10.5281/zenodo.6630367

Published June 10, 2022 | Version 1.0

Dataset Open

Guillaume Bernard¹

1. Laboratoire L3i, Université de La Rochelle

This is a republication of the Event Registry dataset originally published by:

Rupnik, Jan, Andrej Muhic, Gregor Leban, Primoz Skraba, Blaz Fortuna, and Marko Grobelnik. 2016. "News Across Languages - Cross-Lingual Document Similarity and Event Tracking". Journal of Artificial Intelligence Research 55 (January): 283-316. https://doi.org/10.1613/jair.4780.

And reorganised for document tracking by:

Miranda, Sebastião, Artūrs Znotiņš, Shay B. Cohen, and Guntis Barzdins. 2018. "Multilingual Clustering of Streaming News". In 2018 Conference on Empirical Methods in Natural Language Processing, 4535-44. Brussels, Belgium: Association for Computational Linguistics. https://www.aclweb.org/anthology/D18-1483/.

This dataset provides multiple features extracted from the article texts. For copyright reasons, the texts themselves are not included in the published CSV files. To restore them, download the original datasets from the publications cited above and add the missing texts manually.

Features are extracted using:

- A corpus of reference articles in multiple languages for TF-IDF weighting (called features_news) [1]

- A corpus of tweets reporting news for TF-IDF weighting (called features_tweets) [1]

- An S-BERT model [2] that uses distiluse-base-multilingual-cased-v1 (called features_use) [3]

- An S-BERT model [2] that uses paraphrase-multilingual-mpnet-base-v2 (called features_mpnet) [4]
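The two TF-IDF feature sets are weighted against an external reference corpus rather than against the dataset itself. The sketch below illustrates that idea only. It assumes whitespace tokenisation and a smoothed IDF formula, neither of which is confirmed as the pipeline used to build this dataset, and the function names `fit_idf` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter


def fit_idf(reference_docs):
    """Compute smoothed IDF weights from a reference corpus (list of strings)."""
    n_docs = len(reference_docs)
    doc_freq = Counter()
    for doc in reference_docs:
        # Count each term at most once per reference document.
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1.
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector (a dict) for one text.

    Terms that never occur in the reference corpus are dropped.
    """
    counts = Counter(t for t in text.lower().split() if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: (count / total) * idf[term] for term, count in counts.items()}


# Toy reference corpus standing in for the news or tweet resources in [1].
reference = [
    "the election results were announced",
    "the storm hit the coast",
    "the election campaign continues",
]
idf = fit_idf(reference)
vector = tfidf_vector("election storm election unknownword", idf)
```

Because the IDF weights come from a fixed external corpus, vectors for new documents stay comparable as a news stream grows, which is presumably why a reference corpus is used here instead of the documents being clustered.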

References:

[1]: Guillaume Bernard. (2022). Resources to compute TF-IDF weightings on press articles and tweets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6610406

[2]: Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks". In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-92. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.

[3]: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1

[4]: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2

Files (3.2 GB)

Name	MD5	Size
miranda_test.csv	e4f11f867757ea3dff2b1e90df86e824	787.3 kB
miranda_test_corpus_features_gold.csv	7ca16cb7fae40e2aeec4123c15b165a6	259.3 MB
miranda_test_corpus_features_mpnet.csv	03a3362469a0cf2060b832fc322c4a2a	480.8 MB
miranda_test_corpus_features_news.csv	4f207d0f61af94bff8998660882b07da	255.3 MB
miranda_test_corpus_features_use.csv	3566186e715e0a36072fa506b57de4ce	305.0 MB
miranda_train.csv	799be6ede2232d6a444316e586ff49c4	1.3 MB
miranda_train_corpus_features_gold.csv	31285536eff4dc0fb4d56c7931d1571f	334.1 MB
miranda_train_corpus_features_mpnet.csv	fd2055149a176cb4de9cc071b5f94d15	769.0 MB
miranda_train_corpus_features_news.csv	26d04288837b756102ada5bc896fadf1	326.3 MB
miranda_train_corpus_features_use.csv	4b7e3107be3ee2612cd679f54837dc6a	487.7 MB
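
Since several of these files are hundreds of megabytes, it can be worth checking a download against the MD5 digest listed with each file. The sketch below uses only the Python standard library. The helper names `md5_of_file` and `verify_downloads` are hypothetical, and only two of the listed digests are copied in for brevity.

```python
import hashlib
from pathlib import Path

# Digests copied from the file listing; add the remaining files as needed.
EXPECTED_MD5 = {
    "miranda_test.csv": "e4f11f867757ea3dff2b1e90df86e824",
    "miranda_train.csv": "799be6ede2232d6a444316e586ff49c4",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large CSVs never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, md5 in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == md5) if path.is_file() else None
    return results
```

Calling `verify_downloads(".")` from the download directory reports which files are intact, corrupted, or not yet downloaded.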

Additional details

Funding: European Commission
NewsEye: A Digital Investigator for Historical Newspapers (grant 770299)
