Published May 13, 2024 | Version v3
Dataset Open

A Comprehensive Dataset of Classified Citations with Identifiers from Multilingual Wikipedia (2024)

  • 1. University of Amsterdam

Description

This is a collection of translated citation datasets extracted from the February 2024 dumps of multilingual Wikipedia. The extraction and template harmonization pipeline is the same as the one used for English Wikipedia: https://zenodo.org/records/10782978

Note: Versions 2 and 3 fix issues with the large Italian, French and German datasets, which were corrupted (failed to upload in full) in the initial version.

In each language, Wikipedia authors can cite sources using language-specific or English templates. Our main effort in compiling these datasets was to assemble lists of citation templates for each language and convert the relevant fields into a common English template. We started with the known citation templates for each language (typically covering books, journals, web pages and news) and, in some cases, augmented these lists with additional frequently used templates (films, links, web archives, etc.) that we were able to locate via dictionaries of template usage frequency within XML reference tags. For the list of accepted templates, see our source code: https://github.com/albatros13/wikicite/tree/multilang (templates are listed in the __init__.py files of the wikiciteparser library).
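The field conversion described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the mapping table, the English template name 'citation' and the harmonize function are hypothetical and are not taken from the wikiciteparser library, whose __init__.py files hold the actual template lists.

```python
# Hypothetical field mapping for one German template, for illustration only.
# The project's real lists live in the __init__.py files of wikiciteparser.
GERMAN_LITERATUR_FIELDS = {
    "Autor": "author",
    "Titel": "title",
    "Verlag": "publisher",
    "Jahr": "year",
    "ISBN": "isbn",
    "DOI": "doi",
}

# (language code, local template name) -> (English template name, field map)
TEMPLATE_MAPS = {
    ("de", "Literatur"): ("citation", GERMAN_LITERATUR_FIELDS),
}


def harmonize(language, template_name, fields):
    """Rename the fields of a language-specific template to English names.

    Returns None when the template is not in the accepted list. Fields that
    have no English counterpart, or that are empty, are dropped.
    """
    entry = TEMPLATE_MAPS.get((language, template_name))
    if entry is None:
        return None
    english_name, field_map = entry
    harmonized = {"template": english_name}
    for local_key, value in fields.items():
        english_key = field_map.get(local_key)
        if english_key is not None and value.strip():
            harmonized[english_key] = value.strip()
    return harmonized


example = harmonize(
    "de",
    "Literatur",
    {
        "Autor": "Max Mustermann",
        "Titel": " Beispielbuch ",
        "ISBN": "978-3-16-148410-0",
        "Kommentar": "not mapped",
    },
)
```

Keying the lookup on both language and template name keeps identically named templates from different Wikipedias apart, which is one plausible way to organise per-language lists.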

A classification label is assigned to each citation (either 'news', 'book', 'journal' or 'other') by a deterministic rule-based classifier that analyses the available identifiers (see the code documentation for details). Please note that the book and journal counts in the table below are not estimates of the overall number of book and journal citations: we count only citations with DOI, PMID, PMC and ISBN identifiers assigned by authors (prior to the lookup process that augments citations with missing identifiers). The number of news citations depends on our list of 22.646 recognised news agency domains.
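As a rough illustration of what a deterministic identifier-based rule set can look like, here is a minimal sketch. It is not the project's classifier: the rule order, the record layout (an 'ids' dictionary and a 'url' field) and the placeholder domain set are all assumptions, and the code documentation remains the reference for the actual rules.

```python
from urllib.parse import urlparse

# Placeholder for the list of recognised news agency domains (assumption).
NEWS_DOMAINS = {"example-news.org"}


def classify(citation, news_domains=NEWS_DOMAINS):
    """Assign 'journal', 'book', 'news' or 'other' to one citation record.

    Assumed rule order: journal identifiers first, then ISBN, then the
    news-domain check, otherwise 'other'.
    """
    ids = citation.get("ids", {})
    if any(ids.get(key) for key in ("DOI", "PMID", "PMC")):
        return "journal"
    if ids.get("ISBN"):
        return "book"
    # hostname is None for missing or scheme-less URLs, hence the fallback.
    host = (urlparse(citation.get("url") or "").hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if host in news_domains:
        return "news"
    return "other"


labels = [
    classify({"ids": {"DOI": "10.1000/example"}}),
    classify({"ids": {"ISBN": "978-3-16-148410-0"}}),
    classify({"url": "https://www.example-news.org/story"}),
    classify({"url": "https://example.com/page"}),
]
```

Because every rule is a plain membership or presence test, the same input always receives the same label, which is what makes the approach deterministic.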

Language    Code  Link                                          Dump size  Citations  Books    Journals  News
German      de    https://dumps.wikimedia.org/dewiki/20240220/  6.7 GB     4.854.945  320.179  105.542   901.091
French      fr    https://dumps.wikimedia.org/frwiki/20240220/  5.9 GB     9.552.768  798.525  264.560   1.907.183
Russian     ru    https://dumps.wikimedia.org/ruwiki/20240220/  5.1 GB     7.437.100  420.828  130.470   1.370.665
Spanish     es    https://dumps.wikimedia.org/eswiki/20240220/  4.2 GB     6.918.442  522.910  213.767   1.699.396
Italian     it    https://dumps.wikimedia.org/itwiki/20240220/  3.6 GB     5.545.082  384.816  128.366   917.517
Polish      pl    https://dumps.wikimedia.org/plwiki/20240220/  2.4 GB     4.744.158  463.783  95.988    513.006
Portuguese  pt    https://dumps.wikimedia.org/ptwiki/20240220/  2.2 GB     4.775.025  243.593  142.216   1.176.140
Dutch       nl    https://dumps.wikimedia.org/nlwiki/20240220/  1.8 GB     566.549    27.074   12.706    114.110
Swedish     sv    https://dumps.wikimedia.org/svwiki/20240220/  1.5 GB     3.802.416  112.748  155.740   869.662
Catalan     ca    https://dumps.wikimedia.org/cawiki/20240220/  1.2 GB     2.239.714  261.779  105.125   423.241
Finnish     fi    https://dumps.wikimedia.org/fiwiki/20240220/  900.9 MB   1.697.731  209.556  12.068    286.420
Turkish     tr    https://dumps.wikimedia.org/trwiki/20240220/  883.9 MB   1.993.177  85.079   56.202    339.122
Norwegian   no    https://dumps.wikimedia.org/nowiki/20240220/  763.7 MB   796.500    43.314   12.373    151.780
Danish      da    https://dumps.wikimedia.org/dawiki/20240220/  413.3 MB   437.239    23.303   7.522     70.760
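The counts in the table use a dot as the thousands separator, so they need to be normalised before any arithmetic. The sketch below parses the German row and derives the share of each label; the function names are hypothetical, and treating 'other' as the remainder assumes that every citation carries exactly one of the four labels.

```python
def parse_count(text):
    """Convert a count written with dot thousands separators to an int."""
    return int(text.replace(".", ""))


def label_shares(citations, books, journals, news):
    """Return each label's count and its share of all citations.

    'other' is computed as the remainder, assuming one label per citation.
    """
    total = parse_count(citations)
    counts = {
        "book": parse_count(books),
        "journal": parse_count(journals),
        "news": parse_count(news),
    }
    counts["other"] = total - sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    return counts, shares


# German row of the table: Citations, Books, Journals, News.
german_counts, german_shares = label_shares(
    "4.854.945", "320.179", "105.542", "901.091"
)
```

For the German dump this gives 1,326,812 citations labelled as book, journal or news, a little over a quarter of the total, with the rest falling under 'other'.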

These datasets have not been augmented with identifiers located via the lookup process (they have no 'acquired_ID_list' field), but they can be. If you are interested in augmented versions, see the source code for instructions or contact the authors for assistance with this task.

This research was supported in part by the University of Amsterdam Data Science Centre.

Files

de.zip

Files (892.2 MB)

Name Size
md5:b4eabe6a71e2db48bd373baf21d754dd
892.2 MB
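Since the initial version suffered from incomplete uploads, it may be worth checking a downloaded archive against the MD5 checksum listed above. The following is a minimal sketch using Python's standard hashlib module; the local path passed to verify is up to the user, and pairing this checksum with a particular archive is an assumption based on the listing.

```python
import hashlib

# Checksum copied from the file listing above.
EXPECTED_MD5 = "b4eabe6a71e2db48bd373baf21d754dd"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Reading in fixed-size chunks keeps memory use flat even for archives of several hundred megabytes. A call such as verify("de.zip") returning False would suggest a truncated or corrupted download.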

Additional details

Related works

Continues
Dataset: 10.5281/zenodo.10782978 (DOI)

Software

Repository URL
https://github.com/albatros13/wikicite/tree/multilang
Programming language
Python