Published March 28, 2022 | Version v1
Dataset Open

Geographic Diversity in Public Code Contributions — Replication Package

  • 1. University of Bologna, Italy
  • 2. LTCI, Télécom Paris, Institut Polytechnique de Paris

Description

This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Geographic Diversity in Public Code Contributions - An Exploratory Large-Scale Study Over 50 Years. In 19th International Conference on Mining Software Repositories (MSR ’22), May 23-24, Pittsburgh, PA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3524842.3528471

This document comes with the software needed to mine and analyze the data presented in the paper.

Prerequisites

These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and common *nix shell utilities (cat, pv, …), all of which are available for multiple architectures and OSs.
It is advisable to create a Python virtual environment and install the following PyPI packages:

click==8.0.4
cycler==0.11.0
fonttools==4.31.2
kiwisolver==1.4.0
matplotlib==3.5.1
numpy==1.22.3
packaging==21.3
pandas==1.4.1
patsy==0.5.2
Pillow==9.0.1
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2022.1
scipy==1.8.0
six==1.16.0
statsmodels==0.13.2
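
Before running the scripts, it may be worth checking that the active environment matches the pinned versions above. The following is a minimal sketch, not part of the replication package: it assumes the list has been saved as text (for instance in a file the reader names requirements.txt, a hypothetical name) and relies only on the standard library.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return (name, wanted, installed) for every package that does not match.

    'installed' is None when the package is not installed at all.
    """
    problems = []
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            problems.append((name, wanted, installed))
    return problems
```

Calling check_pins(parse_pins(text)) on the contents of the saved list returns an empty list when every pinned package is installed at the stated version.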

Initial data

  • swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/.
    We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both their volume and the presence of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030.
    Once retrieved, the data can be loaded into PostgreSQL to populate swh-replica.
  • names.tab - forenames and surnames per country with their frequency
  • zones.acc.tab - countries/territories, timezones, population and world zones
  • c_c.tab - matches between ccTLD entities and world zones

Data preparation

  • Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst

    sh> ./export.sh
  • Run the authors cleanup script to create authors--clean.csv.zst

    sh> ./cleanup.sh authors.csv.zst
  • Filter out implausible names and create authors--plausible.csv.zst

    sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
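
The filtering step follows a common stream pattern: records arrive on standard input, accepted ones go to standard output and rejected ones to standard error (captured above in the .log file). The sketch below illustrates that pattern only. The plausibility criteria are hypothetical and are not taken from filter_names.py, and the input is assumed to be one name per line, which may differ from the actual CSV layout.

```python
import re
import sys

# Hypothetical criteria, NOT taken from filter_names.py:
# a token must start with a letter and contain no digits and no "@".
TOKEN = re.compile(r"^[^\W\d_][^\d@]*$")


def is_plausible(name):
    """Accept names made of at least two tokens that all look like words."""
    tokens = name.split()
    return len(tokens) >= 2 and all(TOKEN.match(t) for t in tokens)


def filter_stream(lines, out=sys.stdout, log=sys.stderr):
    """Copy plausible names to `out` and rejected ones to `log`."""
    for line in lines:
        name = line.rstrip("\n")
        target = out if is_plausible(name) else log
        target.write(name + "\n")
```

Used as filter_stream(sys.stdin), this would slot into a pipeline shaped like the one above, with decompression and recompression left to the surrounding shell commands.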

Zone detection by email

  • Run the email detection script to create author-country-by-email.tab.zst

    sh> pv authors--plausible.csv.zst | zstdcat | ./guess_country_by_email.py -f 3 2> author-country-by-email.csv.log | zstdmt > author-country-by-email.tab.zst
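
To give an idea of what detection by email can look like, the sketch below extracts a two-letter top-level domain from an address and looks it up in a caller-supplied mapping. It is an illustration under stated assumptions and not the logic of guess_country_by_email.py: the function names are hypothetical, and the mapping is expected to be built by the reader (for example from c_c.tab, whose exact layout is not described here).

```python
def cctld_of(email):
    """Return the lowercase two-letter top-level domain of `email`, or None."""
    _, sep, domain = email.strip().rpartition("@")
    if not sep or "." not in domain:
        return None
    tld = domain.rsplit(".", 1)[1].lower()
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        return tld
    return None


def guess_zone(email, tld_to_zone):
    """Look up the address's ccTLD in `tld_to_zone` (a dict supplied by the caller).

    Returns None when the address has no two-letter TLD or the TLD is unmapped.
    """
    tld = cctld_of(email)
    return tld_to_zone.get(tld) if tld else None
```

Some two-letter domains (for instance .io) are widely used regardless of location, so a realistic implementation probably needs additional filtering on top of this lookup.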

Database creation and initial data ingestion

  • Create the PostgreSQL DB

    sh> createdb zones-commit

    From now on, commands prefixed with the psql> prompt are assumed to be run inside psql connected to the zones-commit database.

  • Import data into PostgreSQL DB

    sh> ./import_data.sh

Zone detection by name

  • Extract commit data from the DB to create commits.tab, which is used as input for the zone detection script

    sh> psql -f extract_commits.sql zones-commit
  • Run the world zone detection script to create commit_zones.tab.zst

    sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
    Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
  • Ingest the zone assignment data into the DB

    psql> \copy commit_zone from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
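
In the \copy command above, cut -f1,6 keeps the first and sixth tab-separated columns of each row, and grep -Ev '\s$' drops rows that end in whitespace, that is, rows whose sixth column is empty. For readers who want to inspect or adapt that step outside psql, a rough Python equivalent is sketched below. The function name is hypothetical, and rows with fewer than six fields are simply skipped, which is a simplification with respect to cut.

```python
import re

TRAILING_SPACE = re.compile(r"\s$")


def select_commit_zone(lines):
    """Yield 'field1<TAB>field6' for each input line.

    Rows whose result ends in whitespace (an empty sixth field) are dropped,
    mirroring `grep -Ev '\\s$'`.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            # Simplification: `cut` would still print such rows; here they are skipped.
            continue
        row = fields[0] + "\t" + fields[5]
        if TRAILING_SPACE.search(row):
            continue
        yield row
```

Feeding it the decompressed lines of commit_zones.tab.zst should produce the same two-column rows that the shell pipeline hands to \copy, provided every row has at least six fields.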

Extraction and graphs

  • Run the script that executes the queries extracting the data to plot from the DB. This creates commit_zones_7120.tab, author_zones_7120_t5.tab, commit_zones_7120.grid and author_zones_7120_t5.grid.
    Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, …).

    sh> ./extract_data.sh
  • Run the script to create the graphs from the previously extracted tab files.

    sh> ./create_stackedbar_chart.py -w 20 -s 1971 -f commit_zones_7120.grid -f author_zones_7120_t5.grid -o chart.pdf

Files

Files (11.0 MB)

Checksum                              Size
md5:d499c0fcbd227d3730199016a70e7d31  17.2 kB
md5:6ab1f2bb6be52787d0924080ddb32ae3  6.5 kB
md5:0521d04701dfa60ea40c9f88f651b6a6  26.2 kB
md5:e098c88883cd223b7ecde68ad7d1f151  805 Bytes
md5:1f8727b9fdbcc36868e426a1d6921b7f  2.5 kB
md5:8f9993bcc87a366166af914860edcdc1  381 Bytes
md5:dd54dbb324aaafe5aa27c04efd0bb5d2  570 Bytes
md5:2aac44c523b95a77b62233edea21fd72  2.1 kB
md5:8fa39a682b82c58e20d07b324418aead  2.9 kB
md5:370cfc7260e3cd6e5c6afe5069859abd  669 Bytes
md5:a1add2fa6bb5f3741b8a73ff496db373  1.2 kB
md5:a7191804aa66ad464a915d9f26c4c83f  10.9 MB
md5:6edfa408177828c7e3174e874d8e7e09  5.4 kB
md5:899ab766e987aa1782d977c2cae1fb61  2.4 kB
md5:0a8a506d4e876840e0cd681437d7c6cb  1.0 kB
md5:ca3a11f0581c6c6267c9400cef29c2d4  776 Bytes
md5:8e1e029fd0427bf3c789ee8a2c66dc93  2.6 kB
md5:4dc86424f8a3a352700d6401a0a47abc  6.4 kB
md5:2a499fd43c213e5a9319a3ccb8f06e51  5.1 kB
md5:a9d0a56ddda3234711b7ce05eef8fee3  2.4 kB
md5:0bd2010d7f8e6e9b843f31d33e7844cc  533 Bytes
md5:1ffeb5ee22b7c897a941ae3e368058aa  1.5 kB
md5:328e5aee098f32474459b237c83146f7  2.2 kB
md5:83e979248dc609bc4eeb6db19a796973  23.7 kB
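
The MD5 checksums listed above can be used to verify downloaded files. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the path and expected value are whatever the caller supplies.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected):
    """Compare a file against a listed value, with or without the 'md5:' prefix."""
    wanted = expected.split(":", 1)[-1].strip().lower()
    return md5_of(path) == wanted
```

Reading in chunks keeps memory use flat even for the largest file in the package.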

Additional details

Related works

Is supplement to
Conference paper: 10.1145/3524842.3528471 (DOI)