Jupyter in the Wikimedia ecosystem

Structure of the talk

Motivation

Part 1: Mission alignment

Part 2: Reproducibility

Jupyter in the Wikimedia ecosystem

The Wikimedia ecosystem and Project Jupyter

Digging a little deeper

Jupyter in the context of the Wikidata knowledge graph

One way to look at Jupyter through the lens of Wikidata:

Another way to look at Jupyter through Wikidata

We are in the process of upgrading Scholia, and Jupyter-based approaches like Voilà and papermill are amongst the options we are considering. Feedback on this is most welcome.

Features of programming languages

>>> alist = ['a', 'b', 'c']
>>> def my_func(al):
...     al.append('x')
...     print(al)
...
>>> my_func(alist)
['a', 'b', 'c', 'x']
>>> alist
['a', 'b', 'c', 'x']
  • But is this correct? Wikimedia projects can be edited by anyone, so a lot of emphasis is put on Verifiability, commonly abbreviated as Citation needed.

What if each such code snippet were linked to a Jupyter implementation hosted on PAWS, Binder or similar, so that users could readily verify it and interact with it right away?
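As a rough sketch of what such a link could look like, the snippet below builds a Binder launch URL for a notebook kept in a GitHub repository. The URL pattern (mybinder.org/v2/gh/... with a filepath parameter) is an assumption about the public mybinder.org service, and the organisation, repository and notebook names are hypothetical placeholders, not resources from this talk.

```python
from urllib.parse import quote, urlencode


def build_binder_url(org, repo, ref, notebook_path):
    """Return a mybinder.org launch link for one notebook.

    Assumes the common pattern
    https://mybinder.org/v2/gh/<org>/<repo>/<ref>?filepath=<path>;
    check the Binder documentation before relying on it.
    """
    base = "https://mybinder.org/v2/gh/{}/{}/{}".format(
        quote(org, safe=""), quote(repo, safe=""), quote(ref, safe="")
    )
    return base + "?" + urlencode({"filepath": notebook_path})


# Hypothetical repository and notebook, for illustration only.
link = build_binder_url(
    "example-org", "example-repo", "main", "notebooks/mutable_lists.ipynb"
)
print(link)
```

A wiki page could then place such a link next to the static snippet, so that readers can open the same code in a live session.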

In [1]:
def my_func(al):
    al.append('x')
    print(al)
In [3]:
alist = ['a', 'b', 'c']
alist
Out[3]:
['a', 'b', 'c']
In [4]:
my_func(alist)
alist
['a', 'b', 'c', 'x']
Out[4]:
['a', 'b', 'c', 'x']
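The cells above confirm that the function changes the caller's list. A reader verifying the claim interactively might go one step further and compare that behaviour with a variant that works on a copy. The following is a minimal sketch of such a check; the helper names append_in_place and append_to_copy are made up for this illustration and do not come from the original snippet.

```python
def append_in_place(items):
    """Mutate the list that was passed in (same behaviour as my_func)."""
    items.append('x')
    return items


def append_to_copy(items):
    """Build a new list and leave the caller's list untouched."""
    return items + ['x']


alist = ['a', 'b', 'c']

# The copying variant returns a different object; alist keeps three items.
result = append_to_copy(alist)
print(alist, result, result is alist)

# The in-place variant returns the very same object, now with four items.
same = append_in_place(alist)
print(alist, same is alist)
```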

A similar case: the article on the centripetal Catmull–Rom spline includes an implementation in Python:

In [5]:
import numpy
import matplotlib.pyplot as plt

def CatmullRomSpline(P0, P1, P2, P3, nPoints=100):
    """
    P0, P1, P2, and P3 should be (x,y) point pairs that define the Catmull-Rom spline.
    nPoints is the number of points to include in this curve segment.
    """
    # Convert the points to numpy so that we can do array multiplication
    P0, P1, P2, P3 = map(numpy.array, [P0, P1, P2, P3])

    # Parametric constant: 0.5 for the centripetal spline, 0.0 for the uniform spline, 1.0 for the chordal spline.
    alpha = 0.5
    # Premultiplied power constant for the following tj() function.
    alpha = alpha/2
    def tj(ti, Pi, Pj):
        xi, yi = Pi
        xj, yj = Pj
        return ((xj-xi)**2 + (yj-yi)**2)**alpha + ti

    # Calculate t0 to t3
    t0 = 0
    t1 = tj(t0, P0, P1)
    t2 = tj(t1, P1, P2)
    t3 = tj(t2, P2, P3)

    # Only calculate points between P1 and P2
    t = numpy.linspace(t1, t2, nPoints)

    # Reshape so that we can multiply by the points P0 to P3
    # and get a point for each value of t.
    t = t.reshape(len(t), 1)
    A1 = (t1-t)/(t1-t0)*P0 + (t-t0)/(t1-t0)*P1
    A2 = (t2-t)/(t2-t1)*P1 + (t-t1)/(t2-t1)*P2
    A3 = (t3-t)/(t3-t2)*P2 + (t-t2)/(t3-t2)*P3
    B1 = (t2-t)/(t2-t0)*A1 + (t-t0)/(t2-t0)*A2
    B2 = (t3-t)/(t3-t1)*A2 + (t-t1)/(t3-t1)*A3

    C = (t2-t)/(t2-t1)*B1 + (t-t1)/(t2-t1)*B2
    return C

def CatmullRomChain(P):
    """
    Calculate Catmull–Rom for a chain of points and return the combined curve.
    """
    sz = len(P)

    # The curve C will contain an array of (x, y) points.
    C = []
    for i in range(sz-3):
        c = CatmullRomSpline(P[i], P[i+1], P[i+2], P[i+3])
        C.extend(c)

    return C

# Define a set of points for curve to go through
Points = [[0, 1.5], [2, 2], [3, 1], [4, 0.5], [5, 1], [6, 2], [7, 3]]

# Calculate the Catmull-Rom splines through the points
c = CatmullRomChain(Points)

# Convert the Catmull-Rom curve points into x and y arrays and plot
x, y = zip(*c)
plt.plot(x, y)

# Plot the control points
px, py = zip(*Points)
plt.plot(px, py, 'or')

plt.show()

Algorithms

  • Lots of algorithms covered across the Wikimedia ecosystem
    • working out the details of how to cover them can be tedious
  • Example: Wikimedia Commons category "Animations of sort algorithms" — some of these are used on hundreds of pages that combined receive thousands of pageviews per day
    • Some problems with such GIFs:
      • parameters of the animation are static
      • users cannot change the underlying data or code, e.g. to explore edge cases of the algorithm or implementations in different programming languages
      • it's hard to use them to compare different algorithms (example)
      • usually there is no link, or no direct link, to the code that generated the animation
    • Bead sort article provides the essential bits of an implementation in Python, but nothing that can be run or explored.
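To illustrate what a runnable counterpart to such an article could offer, here is a minimal bead sort sketch for non-negative integers. It is written for this purpose and is not the code from the Bead sort article; the function name and the counting approach are this sketch's own choices. In a notebook, readers could change the input list to explore edge cases such as empty input, zeros or duplicates.

```python
def bead_sort(values):
    """Sort non-negative integers in ascending order by simulating beads.

    Each value is a row of beads on vertical rods. Letting the beads
    fall and reading the rows back yields the sorted sequence.
    """
    if any(not isinstance(v, int) or v < 0 for v in values):
        raise ValueError("bead sort only handles non-negative integers")
    if not values:
        return []

    # columns[j] counts the beads on rod j, i.e. how many values exceed j.
    columns = [0] * max(values)
    for v in values:
        for j in range(v):
            columns[j] += 1

    # After falling, row i (counted from the bottom) holds one bead for
    # every rod that carries more than i beads.
    fallen = [sum(1 for c in columns if c > i) for i in range(len(values))]

    # Rows come out largest first; reverse them for ascending order.
    return fallen[::-1]


data = [5, 3, 1, 7, 4, 1, 1]
print(bead_sort(data))
```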

Some more examples

Barriers to reuse of Jupyter in Wikimedia contexts

  • (software) dependencies: still a problem, but with tools like ReproduceMeGit, we're getting better
  • licensing: many notebooks have no clear license, or use licenses more suitable for code than for documentation, data or other associated materials

Data sets

  • LIGO Binder — a great resource on a fascinating subject, the detection of gravitational waves: Binder
    • problem: like many other Jupyter resources, it does not come with clear and open licensing, which limits reuse in Wikimedia contexts
      • for example, the original paper on the gravitational wave detection was openly licensed, so some of its images have been reused to illustrate Wikipedia articles on the matter (example)
        • the paper was accompanied by a Jupyter notebook on Binder that contained figures also found in the paper, plus some sonifications that were not shared as part of the paper; because of the unclear licensing, these sonifications did not find their way into Wikipedia.
  • Seismo-Live — Jupyter notebooks covering different aspects of seismology, e.g. determining an earthquake's location (on Binder) as an inverse problem.

Wikimedia in the Jupyter ecosystem

Potential for further integration

  • Adding more Jupyter to Wikimedia

    • Jupyter can be (and is being) used to illustrate things like
      • features of programming languages 🔍
      • mathematical formulas
      • algorithms 🔍
      • datasets or data structures 🔍
      • math problems in general, or math aspects of any other problem
      • computational reproducibility
    • Wikimedia sites are widely consulted for education about any of these subjects
      • If you wrote a Jupyter notebook for educational purposes, chances are that it could serve these purposes through Wikimedia projects too.
      • If you wrote code that adds functionality to Jupyter notebooks, JupyterHub, Binder or other parts of the Jupyter ecosystem, chances are that the functionality would be of interest to Jupyter users within the Wikimedia community, or even to users of Wikimedia contents.
      • What about systematically
        • integrating existing Jupyter resources into the Wikimedia ecosystem?
        • creating Jupyter resources to fill relevant gaps in the Wikimedia coverage?
        • citing such Jupyter resources as references for claims that they can address?
          • This requires reproducibility, as per the introduction
  • Adding more Wikimedia to Jupyter

    • Complement 'Hello world' examples for Jupyter kernels with some 'Hello wiki' examples (prototype), e.g. fetching the first paragraph of a Wikipedia article about the kernel's language, or running a Wikidata query?
    • Use Wikidata identifiers in Jupyter documentation when referring to general concepts?
    • Use Wikibase to build a registry of Jupyter-related installations?
      • Turn Project Jupyter's A gallery of interesting Jupyter Notebooks into a Wikibase that has a (robust) NotebookViewer extension installed to view notebooks, links to Binder to facilitate re-running them, and a SPARQL endpoint that allows querying across the collection, perhaps in a federated way.
  • Combining Wikimedia and Jupyter in new contexts
    • imagine if Rosetta Code (which already uses MediaWiki) had Jupyter/Binder layers, so that more users could get a glimpse of how programming languages unfamiliar to them might behave in specific contexts
    • the Internet Archive has been actively archiving links cited from Wikimedia projects, including Jupyter notebooks
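The 'Hello wiki' idea mentioned above could look roughly like the following for a Python kernel. This is a sketch under assumptions, not the prototype referred to in the talk: it assumes the Wikipedia REST summary endpoint and the Wikidata Query Service SPARQL endpoint, uses only the standard library, and the User-Agent string and helper names are placeholders. The summary extract is used here as a stand-in for the article's first paragraph.

```python
import json
import urllib.parse
import urllib.request

# Placeholder; a real client should identify itself with contact details.
USER_AGENT = "hello-wiki-sketch/0.1 (illustrative example)"


def summary_url(title, lang="en"):
    """Build the REST URL for the summary of a Wikipedia article."""
    page = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{page}"


def sparql_url(query):
    """Build a GET URL for a query against the Wikidata Query Service."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return "https://query.wikidata.org/sparql?" + params


def fetch_json(url):
    """Fetch a URL and decode the JSON response (needs network access)."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def hello_wiki(title, lang="en"):
    """Return the plain-text summary extract of an article, if present."""
    return fetch_json(summary_url(title, lang)).get("extract", "")


# Example call, left commented out because it requires network access:
# print(hello_wiki("Python (programming language)"))
```

The URL builders are kept separate from the network call so that they can be checked offline, while the actual request is only made when hello_wiki is invoked.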

Further notes

Conclusions

  • Interactions between the Jupyter and Wikimedia systems already happen in many contexts.
  • Now is a good time to start thinking about more systematic interactions between the two.

Credits

Images

All images embedded in this presentation are available from Wikimedia Commons, and their metadata is also given in the Zenodo repository at https://doi.org/10.5281/zenodo.4031806.

Overview:

General

Thanks to the open communities supporting Jupyter, Wikimedia and the ecosystems around them.

Contact

daniel [dot] mietchen [at] virginia [dot] edu or @EvoMRI on Twitter
