A Comprehensive Dataset of Classified Citations with Identifiers from Multilingual Wikipedia (2024)
Description
This is a collection of translated citation datasets extracted from the February 2024 dumps of multilingual Wikipedia. They were produced with the same extraction and template harmonization pipeline as the English Wikipedia dataset: https://zenodo.org/records/10782978.
Note: Versions 2 and 3 fix issues with large Italian, French and German datasets that were corrupted (failed to upload in full) in the initial version.
In each language, Wikipedia authors can cite sources using language-specific or English templates. Our main effort in compiling these datasets was to assemble lists of citation templates for each language and convert the relevant fields into a common English template. We started with the known citation templates for each language (typically covering books, journals, web pages and news) and, in some cases, augmented these lists with additional frequently used templates (films, links, web archives, etc.), which we located via dictionaries of XML reference tags and their usage frequencies. For the list of accepted templates, see our source code: https://github.com/albatros13/wikicite/tree/multilang (templates are listed in the __init__.py files of the wikiciteparser library).
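The field conversion described above can be pictured with a minimal sketch. The mapping tables, the pairing of the German "Literatur" template with "cite book", and the `harmonize` function are hypothetical and exist only for illustration; the accepted templates and field mappings actually used are those in the wikiciteparser source.

```python
# Hypothetical mappings for illustration only; the real template lists live in
# the __init__.py files of the wikiciteparser library and may differ.
TEMPLATE_MAP_DE = {"Literatur": "cite book"}  # assumed, simplified pairing

FIELD_MAP_DE = {
    "Autor": "author",
    "Titel": "title",
    "Verlag": "publisher",
    "Jahr": "year",
    "ISBN": "isbn",
}


def harmonize(template_name, fields, template_map, field_map):
    """Convert a language-specific template into a common English one.

    Returns (english_template, english_fields), or None when the template
    is not in the list of accepted templates.
    """
    english_name = template_map.get(template_name)
    if english_name is None:
        return None
    # Keep only mapped, non-empty fields and rename them to English keys.
    english_fields = {
        field_map[key]: value.strip()
        for key, value in fields.items()
        if key in field_map and value.strip()
    }
    return english_name, english_fields
```

Fields without a mapping (for example a place of publication in this sketch) are dropped, which mirrors the idea of converting only the relevant fields.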
A classification label ('news', 'book', 'journal' or 'other') is assigned to each citation by a deterministic rule-based classifier that analyses the available identifiers (see the code documentation for details). Please note that the counts in the table below do not estimate the overall number of book and journal citations: we count only citations with DOI, PMID, PMC and ISBN identifiers assigned by authors (prior to the lookup process that augments citations with missing identifiers). The number of news citations depends on our list of 22.646 recognised news agency domains.
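A rule-based classifier of this kind might look roughly like the sketch below. The citation layout (an `identifiers` dictionary and a `url` key), the precedence of the rules, and the placeholder domain list are all assumptions made for illustration; the actual rules are documented in the source code.

```python
from urllib.parse import urlparse

# Placeholder domains; the dataset relies on its own list of recognised
# news agency domains, which is not reproduced here.
NEWS_DOMAINS = {"news.example.com", "example-times.org"}


def classify(citation, news_domains=NEWS_DOMAINS):
    """Return 'journal', 'book', 'news' or 'other' for one citation.

    The rule order (journal identifiers, then ISBN, then news domain)
    is an assumption of this sketch, not the documented behaviour.
    """
    identifiers = citation.get("identifiers") or {}
    present = {name.lower() for name, value in identifiers.items() if value}

    if present & {"doi", "pmid", "pmc"}:
        return "journal"
    if "isbn" in present:
        return "book"

    # Compare the host name of the cited URL against the news domain list.
    host = (urlparse(citation.get("url") or "").hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in news_domains:
        return "news"
    return "other"
```

Because every rule is a fixed lookup, the same citation always receives the same label, which is what makes the classifier deterministic.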
Language | Acronym | Link | Dump size | Citations | Books | Journals | News |
---|---|---|---|---|---|---|---|
German | de | https://dumps.wikimedia.org/dewiki/20240220/ | 6.7GB | 4.854.945 | 320.179 | 105.542 | 901.091 |
French | fr | https://dumps.wikimedia.org/frwiki/20240220/ | 5.9GB | 9.552.768 | 798.525 | 264.560 | 1.907.183 |
Russian | ru | https://dumps.wikimedia.org/ruwiki/20240220/ | 5.1GB | 7.437.100 | 420.828 | 130.470 | 1.370.665 |
Spanish | es | https://dumps.wikimedia.org/eswiki/20240220/ | 4.2GB | 6.918.442 | 522.910 | 213.767 | 1.699.396 |
Italian | it | https://dumps.wikimedia.org/itwiki/20240220/ | 3.6GB | 5.545.082 | 384.816 | 128.366 | 917.517 |
Polish | pl | https://dumps.wikimedia.org/plwiki/20240220/ | 2.4GB | 4.744.158 | 463.783 | 95.988 | 513.006 |
Portuguese | pt | https://dumps.wikimedia.org/ptwiki/20240220/ | 2.2GB | 4.775.025 | 243.593 | 142.216 | 1.176.140 |
Dutch | nl | https://dumps.wikimedia.org/nlwiki/20240220/ | 1.8GB | 566.549 | 27.074 | 12.706 | 114.110 |
Swedish | sv | https://dumps.wikimedia.org/svwiki/20240220/ | 1.5GB | 3.802.416 | 112.748 | 155.740 | 869.662 |
Catalan | ca | https://dumps.wikimedia.org/cawiki/20240220/ | 1.2GB | 2.239.714 | 261.779 | 105.125 | 423.241 |
Finnish | fi | https://dumps.wikimedia.org/fiwiki/20240220/ | 900.9MB | 1.697.731 | 209.556 | 12.068 | 286.420 |
Turkish | tr | https://dumps.wikimedia.org/trwiki/20240220 | 883.9MB | 1.993.177 | 85.079 | 56.202 | 339.122 |
Norwegian | no | https://dumps.wikimedia.org/nowiki/20240220 | 763.7MB | 796.500 | 43.314 | 12.373 | 151.780 |
Danish | da | https://dumps.wikimedia.org/dawiki/20240220 | 413.3MB | 437.239 | 23.303 | 7.522 | 70.760 |
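The counts in the table use a dot as the thousands separator, so they need to be normalised before any arithmetic. The following sketch parses two values copied from the German row and computes the share of citations labelled as books; the helper names `parse_count` and `share` are hypothetical. As noted above, such a share covers only citations that carry author-assigned identifiers.

```python
def parse_count(text):
    """Parse a count written with dots as thousands separators, e.g. '4.854.945'."""
    return int(text.replace(".", ""))


def share(part, total):
    """Fraction of `total` represented by `part`, both given as table strings."""
    return parse_count(part) / parse_count(total)


# Values taken from the German (de) row of the table.
german_book_share = share("320.179", "4.854.945")
print(f"Books among German citations: {german_book_share:.1%}")
```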
These datasets can be augmented with identifiers located via the lookup process; the published versions are not (they have no 'acquired_ID_list' field). If you are interested in augmented versions, see the source code for instructions or contact the authors for assistance with this task.
This research was supported in part by the University of Amsterdam Data Science Centre.
Files
Name | Size | Checksum |
---|---|---|
de.zip | 892.2 MB | md5:b4eabe6a71e2db48bd373baf21d754dd |
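The md5 checksum listed for the archive can be used to confirm that a download is complete, which matters given the earlier upload corruption mentioned in the note. A minimal sketch using Python's standard `hashlib` module follows; the function names and the local file path passed to them are assumptions.

```python
import hashlib

# Checksum as listed in this record for the 892.2 MB archive.
EXPECTED_MD5 = "b4eabe6a71e2db48bd373baf21d754dd"


def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_MD5):
    """Return True when the file at `path` has the expected md5 digest."""
    return file_md5(path) == expected.lower()
```

Calling `matches_checksum("de.zip")` on the downloaded archive should return `True` for an intact file; reading in chunks keeps memory use small for archives of this size.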
Additional details
Related works
- Continues: Dataset 10.5281/zenodo.10782978 (DOI)
Software
- Repository URL: https://github.com/albatros13/wikicite/tree/multilang
- Programming language: Python