Event Registry dataset with multiple extracted features (both sparse and dense)
Description
This is a republication of the Event Registry dataset originally published by:
Rupnik, Jan, Andrej Muhic, Gregor Leban, Primoz Skraba, Blaz Fortuna, and Marko Grobelnik. 2016. "News Across Languages - Cross-Lingual Document Similarity and Event Tracking". Journal of Artificial Intelligence Research 55 (January): 283-316. https://doi.org/10.1613/jair.4780.
And reorganised for document tracking by:
Miranda, Sebastião, Artūrs Znotiņš, Shay B. Cohen, and Guntis Barzdins. 2018. "Multilingual Clustering of Streaming News". In 2018 Conference on Empirical Methods in Natural Language Processing, 4535-44. Brussels, Belgium: Association for Computational Linguistics. https://www.aclweb.org/anthology/D18-1483/.
In this dataset, we provide multiple features extracted from the text itself. Please note that, for copyright reasons, the article text is not included in the published CSV files. You can download the original datasets from the publications above and add the missing texts manually.
Features are extracted using:
- A corpus of reference articles in multiple languages for TF-IDF weighting (called features_news) [1]
- A corpus of tweets reporting news for TF-IDF weighting (called features_tweets) [1]
- An S-BERT model [2] that uses distiluse-base-multilingual-cased-v1 (called features_use) [3]
- An S-BERT model [2] that uses paraphrase-multilingual-mpnet-base-v2 (called features_mpnet) [4]
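To illustrate what TF-IDF weighting against a separate reference corpus means, here is a minimal sketch in Python. It assumes a common smoothed IDF formula and pre-tokenised documents; the exact formula, tokenisation, and vocabulary handling used to build features_news and features_tweets are not stated here, and the function names `idf_from_reference` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter


def idf_from_reference(reference_docs):
    """Compute a smoothed IDF for every token seen in the reference corpus.

    Assumption: idf(t) = ln((1 + N) / (1 + df(t))) + 1, a common variant,
    not necessarily the one used for this dataset.
    """
    n_docs = len(reference_docs)
    doc_freq = Counter()
    for tokens in reference_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    """Return a sparse TF-IDF vector as {token: weight}.

    Tokens absent from the reference corpus are dropped (an assumption).
    """
    counts = Counter(tok for tok in tokens if tok in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: (count / total) * idf[tok] for tok, count in counts.items()}


# Toy reference corpus (illustrative data only).
reference = [
    ["election", "vote", "president"],
    ["match", "goal", "vote"],
    ["storm", "flood"],
]
idf = idf_from_reference(reference)
vec = tfidf_vector(["vote", "vote", "storm", "unknown"], idf)
```

In this sketch the IDF table comes only from the reference corpus, so the weights of a new document do not depend on the other documents being processed alongside it.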
References:
[1]: Guillaume Bernard. (2022). Resources to compute TF-IDF weightings on press articles and tweets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6610406
[2]: Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks". In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-92. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.
[3]: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1
[4]: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
Files
miranda_test.csv
Total size: 3.2 GB
| File (MD5 checksum) | Size |
|---|---|
| md5:e4f11f867757ea3dff2b1e90df86e824 | 787.3 kB |
| md5:7ca16cb7fae40e2aeec4123c15b165a6 | 259.3 MB |
| md5:03a3362469a0cf2060b832fc322c4a2a | 480.8 MB |
| md5:4f207d0f61af94bff8998660882b07da | 255.3 MB |
| md5:3566186e715e0a36072fa506b57de4ce | 305.0 MB |
| md5:799be6ede2232d6a444316e586ff49c4 | 1.3 MB |
| md5:31285536eff4dc0fb4d56c7931d1571f | 334.1 MB |
| md5:fd2055149a176cb4de9cc071b5f94d15 | 769.0 MB |
| md5:26d04288837b756102ada5bc896fadf1 | 326.3 MB |
| md5:4b7e3107be3ee2612cd679f54837dc6a | 487.7 MB |
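Since each file is listed with an MD5 checksum, a download can be checked for corruption before use. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file path in the usage comment is a placeholder, since the mapping between file names and checksums is not shown here.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks
    so that multi-hundred-megabyte files do not need to fit in memory."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 with an expected value such as 'md5:e4f1...'."""
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (placeholder path and checksum, to be replaced with a real pair):
# ok = matches_checksum("path/to/downloaded_file.csv", "md5:<checksum from the table>")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.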