Geographic Diversity in Public Code Contributions — Replication Package
Authors/Creators
- Davide Rossi, University of Bologna, Italy
- Stefano Zacchiroli, LTCI, Télécom Paris, Institut Polytechnique de Paris
Description
This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Geographic Diversity in Public Code Contributions - An Exploratory Large-Scale Study Over 50 Years. In 19th International Conference on Mining Software Repositories (MSR ’22), May 23-24, Pittsburgh, PA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3524842.3528471
This document comes with the software needed to mine and analyze the data presented in the paper.
Prerequisites
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, …), all of which are available for multiple architectures and OSs.
It is advisable to create a Python virtual environment and install the following PyPI packages:
click==8.0.4
cycler==0.11.0
fonttools==4.31.2
kiwisolver==1.4.0
matplotlib==3.5.1
numpy==1.22.3
packaging==21.3
pandas==1.4.1
patsy==0.5.2
Pillow==9.0.1
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2022.1
scipy==1.8.0
six==1.16.0
statsmodels==0.13.2
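Once the virtual environment is set up, it can be useful to confirm that the installed versions match the pinned ones. The following is a minimal sketch, not part of the replication package: the helper names `parse_pins` and `find_mismatches` are hypothetical, and it relies only on the standard library module `importlib.metadata`.

```python
from importlib import metadata

# Pinned versions copied from the list above.
PINNED = """\
click==8.0.4
cycler==0.11.0
fonttools==4.31.2
kiwisolver==1.4.0
matplotlib==3.5.1
numpy==1.22.3
packaging==21.3
pandas==1.4.1
patsy==0.5.2
Pillow==9.0.1
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2022.1
scipy==1.8.0
six==1.16.0
statsmodels==0.13.2
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name] = version
    return pins


def find_mismatches(pins, get_version=metadata.version):
    """Return {name: (wanted, found)} for packages that are missing or differ."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in sorted(find_mismatches(parse_pins(PINNED)).items()):
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Running the script inside the virtual environment prints one line per package that is missing or installed at a different version, and prints nothing when everything matches.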
Initial data
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/.
We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both their volume and the presence of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030.
Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.
- names.tab - forenames and surnames per country with their frequency
- zones.acc.tab - countries/territories, timezones, population and world zones
- c_c.tab - ccTLD entities - world zones matches
Data preparation
- Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst
  sh> ./export.sh
- Run the authors cleanup script to create authors--clean.csv.zst
  sh> ./cleanup.sh authors.csv.zst
- Filter out implausible names and create authors--plausible.csv.zst
  sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Zone detection by email
- Run the email detection script to create author-country-by-email.tab.zst
  sh> pv authors--plausible.csv.zst | zstdcat | ./guess_country_by_email.py -f 3 2> author-country-by-email.csv.log | zstdmt > author-country-by-email.tab.zst
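The general idea of guessing a country from an email address can be illustrated with a country-code top-level domain (ccTLD) lookup. The sketch below is only an illustration under stated assumptions: it is not the logic of guess_country_by_email.py, it does not model the `-f 3` option, and both the function name `country_from_email` and the four-entry mapping are hypothetical placeholders rather than data from this package.

```python
# Hypothetical excerpt of a ccTLD -> country mapping (illustration only,
# not taken from the replication package data files).
CCTLD_TO_COUNTRY = {
    "it": "Italy",
    "fr": "France",
    "de": "Germany",
    "jp": "Japan",
}


def country_from_email(email, cctld_map=CCTLD_TO_COUNTRY):
    """Return the country suggested by the email's ccTLD, or None if unknown."""
    _, sep, domain = email.strip().lower().rpartition("@")
    if not sep or "." not in domain:
        return None  # no "@" found, or a domain without any dot
    tld = domain.rsplit(".", 1)[1]
    # Generic TLDs such as "com" or "org" are absent from the map and yield None.
    return cctld_map.get(tld)
```

Addresses under generic top-level domains carry no country signal in this scheme, which is one reason a complementary name-based detection step is useful.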
Database creation and initial data ingestion
- Create the PostgreSQL DB
  sh> createdb zones-commit
  Notice that from now on, when a command is prefixed with the psql> prompt, we assume that psql is running on the zones-commit database.
- Import data into PostgreSQL DB
  sh> ./import_data.sh
Zone detection by name
- Extract commits data from the DB and create commits.tab, which is used as input for the zone detection script
  sh> psql -f extract_commits.sql zones-commit
- Run the world zone detection script to create commit_zones.tab.zst
  sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
  Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
- Ingest zones assignment data into the DB
  psql> \copy commit_zone from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
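The shell pipeline inside the \copy command keeps the first and sixth tab-separated columns of each row and discards rows that end in whitespace, which happens when the sixth column is empty. A rough Python equivalent of the `cut -f1,6 | grep -Ev '\s$'` part is sketched below; the function name `select_commit_zone` is hypothetical, and, unlike `cut`, this sketch assumes every well-formed row has at least six columns and skips shorter ones.

```python
import re


def select_commit_zone(lines):
    """Yield 'col1<TAB>col6' for rows whose sixth column is non-empty."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # assumption: rows with fewer than six columns are malformed
        row = f"{fields[0]}\t{fields[5]}"
        # Mirrors grep -Ev '\s$': drop rows ending in whitespace, which is the
        # case when the sixth column is empty and the row ends with the tab.
        if re.search(r"\s$", row):
            continue
        yield row
```

For example, feeding it the decompressed lines of commit_zones.tab.zst would produce the two-column rows that \copy loads into the commit_zone table.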
Extraction and graphs
- Run the script to execute the queries to extract the data to plot from the DB. This creates commit_zones_7120.tab, author_zones_7120_t5.tab, commit_zones_7120.grid and author_zones_7120_t5.grid. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, …).
  sh> ./extract_data.sh
- Run the script to create the graphs from all the previously extracted tab files.
  sh> ./create_stackedbar_chart.py -w 20 -s 1971 -f commit_zones_7120.grid -f author_zones_7120_t5.grid -o chart.pdf
Files
Total size: 11.0 MB
| MD5 checksum | Size |
|---|---|
| d499c0fcbd227d3730199016a70e7d31 | 17.2 kB |
| 6ab1f2bb6be52787d0924080ddb32ae3 | 6.5 kB |
| 0521d04701dfa60ea40c9f88f651b6a6 | 26.2 kB |
| e098c88883cd223b7ecde68ad7d1f151 | 805 Bytes |
| 1f8727b9fdbcc36868e426a1d6921b7f | 2.5 kB |
| 8f9993bcc87a366166af914860edcdc1 | 381 Bytes |
| dd54dbb324aaafe5aa27c04efd0bb5d2 | 570 Bytes |
| 2aac44c523b95a77b62233edea21fd72 | 2.1 kB |
| 8fa39a682b82c58e20d07b324418aead | 2.9 kB |
| 370cfc7260e3cd6e5c6afe5069859abd | 669 Bytes |
| a1add2fa6bb5f3741b8a73ff496db373 | 1.2 kB |
| a7191804aa66ad464a915d9f26c4c83f | 10.9 MB |
| 6edfa408177828c7e3174e874d8e7e09 | 5.4 kB |
| 899ab766e987aa1782d977c2cae1fb61 | 2.4 kB |
| 0a8a506d4e876840e0cd681437d7c6cb | 1.0 kB |
| ca3a11f0581c6c6267c9400cef29c2d4 | 776 Bytes |
| 8e1e029fd0427bf3c789ee8a2c66dc93 | 2.6 kB |
| 4dc86424f8a3a352700d6401a0a47abc | 6.4 kB |
| 2a499fd43c213e5a9319a3ccb8f06e51 | 5.1 kB |
| a9d0a56ddda3234711b7ce05eef8fee3 | 2.4 kB |
| 0bd2010d7f8e6e9b843f31d33e7844cc | 533 Bytes |
| 1ffeb5ee22b7c897a941ae3e368058aa | 1.5 kB |
| 328e5aee098f32474459b237c83146f7 | 2.2 kB |
| 83e979248dc609bc4eeb6db19a796973 | 23.7 kB |
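The MD5 checksums listed above can be used to check that a downloaded file is intact. The following is a minimal sketch using the standard library module `hashlib`; the function names `md5_of_file` and `verify` are hypothetical, and the path passed to them is whatever local name the downloaded file has.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 matches the expected checksum.

    The expected value may carry the 'md5:' prefix used in the listing.
    """
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `verify(local_path, "md5:a7191804aa66ad464a915d9f26c4c83f")` returns True only when the file at `local_path` is the one with that checksum.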
Additional details
Related works
- Is supplement to
- Conference paper: 10.1145/3524842.3528471 (DOI)