Published April 6, 2025 | Version 1.0.0
Dataset Open

Complete List of Mathematical Expressions in all Wikimedia Projects, including Wikipedia

Authors/Creators

  • 1. FIZ Karlsruhe – Leibniz Institute for Information Infrastructure

Description

This dataset contains a deduplicated list of all mathematical expressions used in all Wikimedia projects. The data is provided as a JSON file in which each key is the MD5 hash of the input. The input is the string that was extracted from the wikitext sources. The extraction was done in the following way:

  1. All current dumps were filtered for the math tag (see https://doi.org/10.5281/zenodo.15107679 for details).
  2. Those dumps were imported into a MediaWiki installation with the MathSearch extension. One database was used per wiki.
  3. The data from all the mathlog tables were combined into one table, which was exported to a JSON file. The JSON file contains key-value pairs where the keys are the MD5 hashes of the inputs and the values are the inputs themselves.
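The combination in step 3 can be pictured with a small sketch. This is only an illustration in Python, not the SQL that was actually used (that script is referenced below). It assumes that inputs are hashed as UTF-8 bytes, and the wiki names and formulae in the sample data are made up for the example.

```python
import hashlib


def combine(per_wiki_inputs):
    """Merge per-wiki lists of formula inputs into one dict keyed by MD5 hash.

    Identical inputs from different wikis share a hash, so they collapse
    into a single entry (this is the deduplication).
    Assumption: inputs are encoded as UTF-8 before hashing.
    """
    combined = {}
    for inputs in per_wiki_inputs.values():
        for tex in inputs:
            key = hashlib.md5(tex.encode("utf-8")).hexdigest()
            combined.setdefault(key, tex)
    return combined


# Illustrative sample data, not taken from the dataset.
sample = {
    "enwiki": ["E = mc^2", "a^2 + b^2 = c^2"],
    "dewiki": ["E = mc^2"],
}
merged = combine(sample)
print(len(merged), "distinct inputs")
```

Keying on the hash of the exact input string means that two inputs differing only in whitespace are kept as separate entries.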

The scripts are available from

swh:1:cnt:faec2206a154db5a2711791f4211097e36bf1413; origin=https://github.com/MaRDI4NFDI/wikiFilter; visit=swh:1:snp:28ed43d0e16ca3d6ce4bad1b484cec9d1124cd48; anchor=swh:1:rev:855735a5c90a0db3ccfd20c3899af4c82bc6704f; path=/wmcloud/allFormulae.sql

Example: The Wikipedia article on mass-energy equivalence contains the following wikitext

<math qid=Q35875>E = mc^2</math>

the MathSearch extension extracts the user input

E = mc^2

the MD5 hash of this input is

281a70c20b16a38d7781189936e1ac9f

and thus the row

    "281a70c20b16a38d7781189936e1ac9f": "E = mc^2",

in the JSON file corresponds to that input.
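The example can be retraced with a short Python sketch. It is a simplification under stated assumptions: the regular expression only approximates what the MathSearch extension does when it extracts the content of a math tag, and the input is assumed to be hashed as UTF-8 bytes without trimming. The function names are chosen for illustration only.

```python
import hashlib
import re

# Simplified pattern for <math ...>...</math>; the real extension's parsing
# of wikitext is more involved (assumption: no nested or escaped tags).
MATH_TAG = re.compile(r"<math\b[^>]*>(.*?)</math>", re.DOTALL)


def extract_inputs(wikitext):
    """Return the raw content of every math tag found in the wikitext."""
    return MATH_TAG.findall(wikitext)


def input_hash(tex):
    """Return the hex MD5 digest of an input string (UTF-8 assumed)."""
    return hashlib.md5(tex.encode("utf-8")).hexdigest()


wikitext = "<math qid=Q35875>E = mc^2</math>"
for tex in extract_inputs(wikitext):
    # Print the input in the same shape as a row of the JSON file.
    print(f'"{input_hash(tex)}": "{tex}",')
```

To look up a formula in the loaded file, the same hash can be used as the key, for example `data.get(input_hash("E = mc^2"))` after `data = json.load(...)`.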

Notes (English)

Except as discussed below, all original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see our Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain. See https://dumps.wikimedia.org/legal.html for details.

Files

Files (322.1 MB)

Name                     Size
wmf_texvc_inputs.json    322.1 MB
md5:d1813da95a6915ea75bc1f614b9eb846
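The MD5 checksum listed for the file can be used to check a download. The following is a minimal sketch, assuming the file has been saved locally under its original name; the helper names `file_md5` and `verify` are illustrative. The file is read in chunks so that the full 322.1 MB does not need to be held in memory.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return file_md5(path) == expected.lower()
```

A call such as `verify("wmf_texvc_inputs.json", "d1813da95a6915ea75bc1f614b9eb846")` should return `True` for an intact download.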

Additional details

Related works

Is derived from
Dataset: 10.5281/zenodo.15107679 (DOI)

Funding

Deutsche Forschungsgemeinschaft
MaRDI 460135501