Jupyter in the Wikimedia ecosystem

A talk by Daniel Mietchen at JupyterCon 2020,
available via https://doi.org/10.5281/zenodo.4031806.

Structure of the talk¶

00:00 min — Start
00:30 min — Motivation
02:00 min — Jupyter in the Wikimedia ecosystem
08:00 min — Wikimedia in the Jupyter ecosystem
10:00 min — Potential for further integration
11:00 min — Conclusions
11:30 min — Credits
11:50 min — Contact

Motivation¶

Part 1: Mission alignment¶

Jupyter:
- ... to support interactive data science and scientific computing across all programming languages.
Wikimedia:
- Imagine a world in which every single human being can freely share in the sum of all knowledge.
Me:
- ... data scientist integrating open research and education workflows with the web
Synthesis:
- How can we leverage Jupyter-based open research and education resources to support the Wikimedia mission, and vice versa?

Part 2: Reproducibility¶

Reproducibility was the topic of various JupyterCon talks, including
- I don't like notebooks by Joel Grus in 2018
- Post-publication peer review of Jupyter Notebooks referenced in articles on PubMed Central by myself in 2017, where I briefly mentioned Wikimedia usage
Good progress has been made since, e.g. via the REPRODUCE-ME ontology (seen in use as a component of the Jupyter extension ProvBook) and the visualization tool ReproduceMeGit, which assesses computational reproducibility.

Jupyter in the Wikimedia ecosystem¶

The Wikimedia ecosystem	Project Jupyter

Some easy finds¶

Wikipedia articles about
Wikimedia Commons category for Project Jupyter (ca. 24 pageviews per day)
This notebook (public copy) runs on PAWS, a JupyterHub installation in the Wikimedia Cloud

Digging a little deeper¶

Jupyter in the context of the Wikidata knowledge graph¶

7-min introductory video about Wikidata

One way to look at Jupyter through the lens of Wikidata:

WikiDP

Another way to look at Jupyter through Wikidata

Wikidata-based scholarly profile of Project Jupyter on Scholia

We are in the process of upgrading Scholia, and Jupyter-based approaches like Voilà and papermill are amongst the options we are considering. Feedback on this is most welcome.

Features of programming languages¶

from Python syntax and semantics: Data structures

>>> alist = ['a', 'b', 'c']
>>> def my_func(al):
...     al.append('x')
...     print(al)
...
>>> my_func(alist)
['a', 'b', 'c', 'x']
>>> alist
['a', 'b', 'c', 'x']

But is this correct? Wikimedia projects can be edited by anyone, so a lot of emphasis is put on Verifiability, commonly abbreviated as Citation needed.

What if each such code snippet were linked to a Jupyter implementation hosted on PAWS, Binder or similar, so that users could readily verify it and interact with it right away?

def my_func(al):
    al.append('x')
    print(al)

alist = ['a', 'b', 'c']
alist

['a', 'b', 'c']

my_func(alist)
alist

['a', 'b', 'c', 'x']

['a', 'b', 'c', 'x']
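The behaviour shown above follows from Python passing a reference to the same list object into the function, so `append` changes the caller's list. A minimal sketch of how a linked notebook could turn that claim into something checkable; the variant `my_func_copy` is hypothetical and not part of the Wikipedia snippet:

```python
def my_func(al):
    # Appends in place: the caller's list object is modified.
    al.append('x')
    return al


def my_func_copy(al):
    # Hypothetical variant: works on a shallow copy and leaves the caller's list untouched.
    new_list = list(al)
    new_list.append('x')
    return new_list


alist = ['a', 'b', 'c']
result = my_func(alist)
assert result is alist                      # same object, not a copy
assert alist == ['a', 'b', 'c', 'x']        # the original list has changed

blist = ['a', 'b', 'c']
copied = my_func_copy(blist)
assert blist == ['a', 'b', 'c']             # the original list is unchanged
assert copied == ['a', 'b', 'c', 'x']
```

Assertions like these would let a reader verify the statement in the article by running a single cell, and change the input to explore other cases.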

A similar case: Centripetal Catmull–Rom spline has an implementation in Python:

import numpy
import matplotlib.pyplot as plt

def CatmullRomSpline(P0, P1, P2, P3, nPoints=100):
    """
    P0, P1, P2, and P3 should be (x,y) point pairs that define the Catmull-Rom spline.
    nPoints is the number of points to include in this curve segment.
    """
    # Convert the points to numpy so that we can do array multiplication
    P0, P1, P2, P3 = map(numpy.array, [P0, P1, P2, P3])

    # Parametric constant: 0.5 for the centripetal spline, 0.0 for the uniform spline, 1.0 for the chordal spline.
    alpha = 0.5
    # Premultiplied power constant for the following tj() function.
    alpha = alpha/2
    def tj(ti, Pi, Pj):
        xi, yi = Pi
        xj, yj = Pj
        return ((xj-xi)**2 + (yj-yi)**2)**alpha + ti

    # Calculate t0 to t3
    t0 = 0
    t1 = tj(t0, P0, P1)
    t2 = tj(t1, P1, P2)
    t3 = tj(t2, P2, P3)

    # Only calculate points between P1 and P2
    t = numpy.linspace(t1, t2, nPoints)

    # Reshape so that we can multiply by the points P0 to P3
    # and get a point for each value of t.
    t = t.reshape(len(t), 1)
    A1 = (t1-t)/(t1-t0)*P0 + (t-t0)/(t1-t0)*P1
    A2 = (t2-t)/(t2-t1)*P1 + (t-t1)/(t2-t1)*P2
    A3 = (t3-t)/(t3-t2)*P2 + (t-t2)/(t3-t2)*P3
    B1 = (t2-t)/(t2-t0)*A1 + (t-t0)/(t2-t0)*A2
    B2 = (t3-t)/(t3-t1)*A2 + (t-t1)/(t3-t1)*A3

    C = (t2-t)/(t2-t1)*B1 + (t-t1)/(t2-t1)*B2
    return C

def CatmullRomChain(P):
    """
    Calculate Catmull–Rom for a chain of points and return the combined curve.
    """
    sz = len(P)

    # The curve C will contain an array of (x, y) points.
    C = []
    for i in range(sz-3):
        c = CatmullRomSpline(P[i], P[i+1], P[i+2], P[i+3])
        C.extend(c)

    return C

# Define a set of points for the curve to go through
Points = [[0, 1.5], [2, 2], [3, 1], [4, 0.5], [5, 1], [6, 2], [7, 3]]

# Calculate the Catmull-Rom splines through the points
c = CatmullRomChain(Points)

# Convert the Catmull-Rom curve points into x and y arrays and plot
x, y = zip(*c)
plt.plot(x, y)

# Plot the control points
px, py = zip(*Points)
plt.plot(px, py, 'or')

plt.show()
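One property of this spline that a linked notebook could let readers verify is that each segment starts at P1 and ends at P2. The following is a dependency-free sketch of that check, written for this purpose rather than taken from the Wikipedia article; the function names `lerp` and `catmull_rom_point` and the parameter `u` are assumptions introduced here:

```python
import math


def tj(ti, pi, pj, alpha=0.5):
    # Knot spacing: Euclidean distance between the points raised to the power alpha.
    return math.hypot(pj[0] - pi[0], pj[1] - pi[1]) ** alpha + ti


def lerp(ta, tb, pa, pb, t):
    # Linear interpolation between pa (at knot ta) and pb (at knot tb), evaluated at t.
    wa = (tb - t) / (tb - ta)
    wb = (t - ta) / (tb - ta)
    return tuple(wa * a + wb * b for a, b in zip(pa, pb))


def catmull_rom_point(p0, p1, p2, p3, u, alpha=0.5):
    """Point on the segment between p1 and p2 for u in [0, 1]."""
    t0 = 0.0
    t1 = tj(t0, p0, p1, alpha)
    t2 = tj(t1, p1, p2, alpha)
    t3 = tj(t2, p2, p3, alpha)
    t = t1 + u * (t2 - t1)
    # Same pyramid of interpolations as A1..A3, B1..B2 and C in the code above.
    a1 = lerp(t0, t1, p0, p1, t)
    a2 = lerp(t1, t2, p1, p2, t)
    a3 = lerp(t2, t3, p2, p3, t)
    b1 = lerp(t0, t2, a1, a2, t)
    b2 = lerp(t1, t3, a2, a3, t)
    return lerp(t1, t2, b1, b2, t)


def close(p, q, tol=1e-9):
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(p, q))


# First four control points from the example above.
pts = [(0, 1.5), (2, 2), (3, 1), (4, 0.5)]
assert close(catmull_rom_point(*pts, 0.0), pts[1])  # segment starts at P1
assert close(catmull_rom_point(*pts, 1.0), pts[2])  # segment ends at P2
```

Passing `alpha=0.0` or `alpha=1.0` gives the uniform and chordal variants mentioned in the comments of the original code, so the same check can be repeated for those.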

Algorithms¶

Lots of algorithms covered across the Wikimedia ecosystem
- working out the details of how to cover them can be tedious
Example: Wikimedia Commons category "Animations of sort algorithms" — some of these are used on hundreds of pages that combined receive thousands of pageviews per day
- Some problems with such GIFs:
  - parameters of the animation are static
  - users cannot change the underlying data or code, e.g. to explore edge cases of the algorithm or implementations in different programming languages
  - it's hard to use them to compare different algorithms (example)
  - usually there is no link, or no direct link, to the code that generated the animation
- Bead sort article provides the essential bits of an implementation in Python, but nothing that can be run or explored.
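As an illustration of what a runnable companion to such an article could look like, here is a minimal bead sort sketch for non-negative integers. It is written for this purpose and is not the code from the Wikipedia article; the function name `bead_sort` is an assumption:

```python
def bead_sort(values):
    """Sort non-negative integers in ascending order by simulating falling beads."""
    if any((not isinstance(v, int)) or v < 0 for v in values):
        raise ValueError("bead sort only handles non-negative integers")
    if not values:
        return []
    # Each value is a row of beads; column j receives one bead for every value greater than j.
    column_counts = [0] * max(values)
    for v in values:
        for j in range(v):
            column_counts[j] += 1
    # After the beads have fallen, row i (counted from the bottom) holds a bead
    # in every column whose count exceeds i.
    rows_bottom_up = [sum(1 for c in column_counts if c > i) for i in range(len(values))]
    # Reading the rows from top to bottom gives ascending order.
    return rows_bottom_up[::-1]


assert bead_sort([3, 1, 2]) == [1, 2, 3]
assert bead_sort([2, 2, 1, 0]) == [0, 1, 2, 2]   # duplicates and zero
assert bead_sort([]) == []                       # empty input
```

In a notebook, readers could change the input list to explore edge cases, which the static animations do not allow.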

Some more examples¶

Wikipedia articles like Time-based One-time Password algorithm or HMAC-based One-time Password algorithm link to Jupyter notebooks with Python implementations of the algorithms
PAWS-based Jupyter notebooks are used in the manual for PyWikiBot, one of the main frameworks for automated curation or exploration of Wikimedia content
- examples:
  - Jupyter notebooks as well as Jupyter-Leaflet and Jupyter-widgets were used at a map making workshop at Wikimania 2019
  - Jupyter notebook that formed the basis of several visualizations for a manuscript about Wikidata's coverage of the COVID-19 pandemic
  - Jupyter notebook for computing traffic to the Scholia service
Several Jupyter notebooks used to document various aspects of a research project on content translation in Wikipedia
Jupyter is used as part of the internal analytics workflows at the Wikimedia Foundation
- sample notebook on blocked users by operating system
Several Jupyter notebooks that formed the basis of the Wikimedia Foundation's portal for COVID-19-related traffic-based analytics
Jupyter chapter in a scientific computation book on the French Wikibooks
Jupyter notebook for exporting metadata of 19th century publications from the German Wikisource into a format commonly used by libraries
Jupyter notebook demonstrating the mapping between diseases and their phenotypes, using the Human Phenotype Ontology database and Wikidata
Jupyter notebooks serving as an example for working with text files, in Introducing Julia on the English Wikibooks
Jupyter notebook as an illustration of a Wikipedia article on the quantum computing kit Qiskit
- Qiskit course materials in French
JupyterCon in Wikidata
PAWS-related tickets on the Wikimedia Phabricator
- e.g. RStudio Server running on JupyterHub on internal PAWS
IPython question as part of a Python quiz on the English Wikiversity
Template:User_Jupyter (Chinese & English Wikipedia, Wikidata)
List of software using the Qt toolkit in Wikidata
draft of a MediaWiki extension for rendering Jupyter notebooks
web search / search inside a MediaWiki instance
Wikimedia Commons
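To give a flavour of the first example in the list above, here is a minimal sketch of the HMAC-based and time-based one-time password algorithms, following RFC 4226 and RFC 6238 with HMAC-SHA-1. It is not the code from the notebooks linked from Wikipedia; the function names and default parameters are assumptions:

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA-1."""
    # The counter is hashed as an 8-byte big-endian integer.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, timestamp=None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over the number of elapsed time steps."""
    if timestamp is None:
        timestamp = time.time()
    return hotp(secret, int(timestamp) // step, digits)


# Test values published in the RFCs for the ASCII secret "12345678901234567890".
secret = b"12345678901234567890"
assert hotp(secret, 0) == "755224"
assert hotp(secret, 1) == "287082"
assert totp(secret, timestamp=59, digits=8) == "94287082"
```

Checking against the published test values is the kind of verification that a notebook linked from the article makes easy for readers.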

Barriers to reuse of Jupyter in Wikimedia contexts¶

(software) dependencies: still a problem, but with tools like ReproduceMeGit, we're getting better
licensing: many notebooks do not have clear licensing, or use licenses more suitable for code than for documentation, data or other associated materials
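The dependency barrier can at least be surfaced early, before a notebook fails halfway through. A minimal sketch, assuming the notebook author lists the modules the notebook imports; the helper name `missing_modules` is hypothetical:

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised, for example, when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# "json" ships with Python; the second name is made up and should not exist.
print(missing_modules(["json", "no_such_module_for_this_sketch"]))
```

Run as the first cell of a notebook on PAWS or Binder, such a check tells the reader which packages need installing before anything else is executed.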

Data sets¶

LIGO Binder — a great resource on a fascinating subject, the detection of gravitational waves:
- problem: like many other Jupyter resources, it does not come with clear and open licensing, which limits reuse in Wikimedia contexts
  - for example, the original paper on the gravitational wave detection was openly licensed, so some of its images have been reused to illustrate Wikipedia articles on the matter (example)
    - the paper was accompanied by the Jupyter notebook on Binder, which contained figures also found in the paper as well as some sonifications not shared as part of the paper; due to the unclear licensing, these sonifications did not find their way into Wikipedia.
Seismo-Live — Jupyter notebooks covering different aspects of seismology, e.g. determining an earthquake's location (on Binder) as an inverse problem.

Wikimedia in the Jupyter ecosystem¶

Wikipedia-related research was featured prominently in the announcement of Jupyter integration on GitHub
Wikipedia is referenced widely across the web, including in GitHub repos full of Jupyter notebooks (example)
Demo of R package integration into Jupyter shows how to fetch Wikipedia pageview data
Jupyter kernel for SPARQL, showing candidate items to be portrayed in Jupyter notebooks: items about an algorithm or mathematical formula but with no associated image or Wikimedia Commons category (executed via MyBinder).
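A query of the kind described in the last item could look roughly like the sketch below, which only builds the request URL for the public Wikidata Query Service and does not send it. It is not the query from the talk; the identifiers Q8366 (algorithm), P31 (instance of), P18 (image) and P373 (Commons category) are assumptions to be checked against Wikidata:

```python
from urllib.parse import urlencode

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed identifiers: Q8366 = algorithm, P31 = instance of,
# P18 = image, P373 = Commons category.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8366 .
  FILTER NOT EXISTS { ?item wdt:P18 ?image . }
  FILTER NOT EXISTS { ?item wdt:P373 ?category . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 20
"""


def build_query_url(query, endpoint=WIKIDATA_SPARQL_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query.strip(), "format": "json"})


url = build_query_url(QUERY)
print(url)
```

The resulting URL can be opened in a browser or fetched from a notebook cell; with the SPARQL kernel, the query text alone would be the cell content.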

Potential for further integration¶

Adding more Jupyter to Wikimedia
- Jupyter can be (and is being) used to illustrate things like
  - features of programming languages 🔍
  - mathematical formulas
  - algorithms 🔍
  - datasets or data structures 🔍
  - math problems in general, or math aspects of any other problem
  - computational reproducibility
- Wikimedia sites are widely consulted for education about any of these subjects
  - If you wrote a Jupyter notebook for educational purposes, chances are that it could serve these purposes through Wikimedia projects too.
  - If you wrote code that adds functionality to Jupyter notebooks, JupyterHub, Binder or other parts of the Jupyter ecosystem, chances are that the functionality would be of interest to Jupyter users within the Wikimedia community, or even to users of Wikimedia contents.
  - What about systematically
    - integrating existing Jupyter resources into the Wikimedia ecosystem?
    - creating Jupyter resources to fill relevant gaps in the Wikimedia coverage?
    - citing such Jupyter resources as references for claims that they can address?
      - This requires reproducibility, as per introduction
Adding more Wikimedia to Jupyter
- Complement 'Hello world' examples for Jupyter kernels with some 'Hello wiki' examples (prototype), e.g. fetching the first paragraph of a Wikipedia article about the kernel's language, or running a Wikidata query?
- Use Wikidata identifiers in Jupyter documentation when referring to general concepts?
- Use Wikibase to build a registry of Jupyter-related installations?
  - Turn Project Jupyter's A gallery of interesting Jupyter Notebooks into a Wikibase that has a (robust) NotebookViewer extension installed to view notebooks, links to Binder to facilitate re-running them, and a SPARQL endpoint that allows querying across the collection, perhaps in a federated way.
Combining Wikimedia and Jupyter in new contexts
- imagine if Rosetta Code (which already uses MediaWiki) had Jupyter/Binder layers, so that more users could get a glimpse of how programming languages unfamiliar to them might behave in specific contexts
- the Internet Archive has been actively archiving links cited from Wikimedia projects, including Jupyter notebooks
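The 'Hello wiki' idea from the list above could be sketched for the Python kernel as follows, using the page summary endpoint of the Wikipedia REST API. This is not the prototype mentioned in the talk; the function names and the User-Agent string are assumptions, and the network call is only defined here, not executed:

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def summary_url(title, lang="en"):
    """URL of the Wikipedia REST API page summary for the given article title."""
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{encoded}"


def first_paragraph(summary):
    """Extract the plain-text lead from a decoded page summary response."""
    return summary.get("extract", "")


def hello_wiki(title, lang="en"):
    """Fetch the lead of an article; needs network access, so it is not called here."""
    request = Request(
        summary_url(title, lang),
        headers={"User-Agent": "hello-wiki-sketch/0.1 (example)"},  # placeholder value
    )
    with urlopen(request) as response:
        return first_paragraph(json.load(response))


print(summary_url("Python (programming language)"))
```

In a notebook with network access, `hello_wiki("Python (programming language)")` would return the opening text of the article about the kernel's language.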

Further notes¶

Jupyter notebooks to increase replicability of research studies, including about Wikimedia projects like Wikidata
- example:
  - paper: Commonsense Knowledge in Wikidata
  - notebooks
Usage examples
- Wiki pages about Jupyter
  - Wikidata
    - Map Making Workshop by Royal Dutch Library
  - Wikibooks
    - Découverte de Python et de Jupyter
  - Wikimania
    - 2019
    - pre-2019
      - e.g. Wikimania 2014 talk Open Scholarship Tools
  - Meta
    - search
- Jupyter notebooks in wiki pages
  - as references
    - baseball stats
    - optimization software
  - infoboxes
    - documentation
- Jupyter notebooks to generate illustrations
  - Human disease network
    - plotly and igraph are not installed on the Binder server
      - "Note: you may need to restart the kernel to use updated packages."
  - Riemann zeta function — provenance unclear, so worth trying it out on Binder, but matplotlib is missing on the Binder server
    - DomainColoring.ipynb
  - Most Popular Wikipedia Articles of the Week (May 19 to 25, 2019)
    - "We can't seem to find the Binder page you are looking for."
  - Qiskit screenshot
    - points to a URL that gives a 404; code has been reorganized
- Jupyter notebooks for educational purposes
- Jupyter notebooks for Wikidata queries
  - examples:
    - Binder link
    - COVID-19 information
- Jupyter notebooks in discussions
  - finite element modelling
- Jupyter notebooks for presentational purposes
  - see also under Wikimania
- Jupyter notebooks for testing purposes
- Jupyter notebooks to analyze event/campaign participation data
  - Wiki Loves Africa 2019
    - notebook
- Jupyter notebooks for editing
  - page redirects
  - Wikidata Integrator example
- PAWS-Internal
- Mention in WMF 2017-2018 Annual Plan
- Colab
- StackOverflow
  - Jupyter notebook as a comment on a SPARQL question
- Internet in a Box
  - via Kolibri
- lexemes
- tool use stats
  - e.g. Scholia
    - public notebook
    - non-public
Wikipedia as a gateway to biomedical research: The relative distribution and use of citations in the English Wikipedia
There are lots of code examples in Jupyter-supported languages in Wikipedia pages about algorithms, standards and similar concepts relevant to computational sciences. It would be nice to have a robust and straightforward way to link these example snippets to a Jupyter notebook that runs out of the box.
- e.g. Python and Unity C# in Centripetal Catmull–Rom spline
Wikidata:WikiProject PersonalData/Jupyter wikidata

Conclusions¶

Interactions between the Jupyter and Wikimedia systems already happen in many contexts.
Now is a good time to start thinking about more systematic interactions between the two.

Credits¶

Images¶

All images embedded in this presentation are available from Wikimedia Commons, and their metadata is also given in the Zenodo repository at https://doi.org/10.5281/zenodo.4031806.


General¶

Thanks to the open communities supporting Jupyter, Wikimedia and the ecosystems around them.

Contact¶

daniel [dot] mietchen [at] virginia [dot] edu or @EvoMRI on Twitter