This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Geographic Diversity in Public Code Contributions - An Exploratory Large-Scale Study Over 50 Years. In 19th International Conference on Mining Software Repositories (MSR ’22), May 23-24, Pittsburgh, PA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3524842.3528471
This document comes with the software needed to mine and analyze the data presented in the paper.
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, …), all of which are available for multiple architectures and OSs.
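Before starting, it can help to confirm that the command-line tools used in the steps of this document are on the PATH. The following is a minimal sketch and not part of the replication package: the tool list is collected from the commands shown in this document, and the helper name missing_tools is made up for illustration.

```python
import shutil

# Command-line tools invoked by the steps in this document.
REQUIRED_TOOLS = [
    "bash", "psql", "createdb", "zstd", "unzstd",
    "zstdcat", "zstdmt", "pv", "cut", "grep",
]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

The lookup function is a parameter only so that the check can be exercised without touching the real PATH.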
It is advisable to create a Python virtual environment and install the following PyPI packages:
click==8.0.4
cycler==0.11.0
fonttools==4.31.2
kiwisolver==1.4.0
matplotlib==3.5.1
numpy==1.22.3
packaging==21.3
pandas==1.4.1
patsy==0.5.2
Pillow==9.0.1
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2022.1
scipy==1.8.0
six==1.16.0
statsmodels==0.13.2
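Once the packages are installed, a quick check that the environment matches the pinned versions can save debugging time later. The sketch below is an illustration under stated assumptions, not a script shipped with the package: it covers only a handful of the pins listed above, and the function names are hypothetical.

```python
from importlib import metadata

# A few of the pins listed above; extend with the remaining entries as needed.
PINNED = {
    "matplotlib": "3.5.1",
    "numpy": "1.22.3",
    "pandas": "1.4.1",
    "scipy": "1.8.0",
    "statsmodels": "0.13.2",
}


def installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in version_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

A package that is not installed at all is reported with None as the found version.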
The following data are required:
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - ccTLD entities - world zones matches
Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst
sh> ./export.sh
Run the authors cleanup script to create authors--clean.csv.zst
sh> ./cleanup.sh authors.csv.zst
Filter out implausible names and create authors--plausible.csv.zst
sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Run the email detection script to create author-country-by-email.tab.zst
sh> pv authors--plausible.csv.zst | zstdcat | ./guess_country_by_email.py -f 3 2> author-country-by-email.csv.log | zstdmt > author-country-by-email.tab.zst
Create the PostgreSQL DB
sh> createdb zones-commit
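As an aside on the email detection step run earlier, the internals of guess_country_by_email.py are not described in this document. Purely to illustrate the general idea of reading a country-code top-level domain from an address, here is a rough sketch. The mapping is a tiny hand-written example and not the content of c_c.tab, the function name is hypothetical, and the real script may well work differently.

```python
# Illustrative subset only; NOT the mapping shipped in c_c.tab.
CCTLD_TO_COUNTRY = {
    "it": "Italy",
    "fr": "France",
    "jp": "Japan",
    "br": "Brazil",
}


def country_from_email(email, table=CCTLD_TO_COUNTRY):
    """Return the country suggested by the email's ccTLD, or None if unknown."""
    _, sep, domain = email.strip().rpartition("@")
    if not sep or "." not in domain:
        # No "@" at all, or a domain without any dot: nothing to infer.
        return None
    tld = domain.rsplit(".", 1)[1].lower()
    return table.get(tld)
```

Generic domains such as .com carry no country signal in this sketch and yield None.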
Note that, from now on, commands shown with the psql> prompt are assumed to be run in psql connected to the zones-commit database.
Import data into the PostgreSQL DB
sh> ./import_data.sh
Extract commits data from the DB and create commits.tab, which is used as input for the zone detection script
sh> psql -f extract_commits.sql zones-commit
Run the world zone detection script to create commit_zones.tab.zst
sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
Ingest zone assignment data into the DB
psql> \copy commit_zone from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
Run the script to execute the queries to extract the data to plot from the DB. This creates commit_zones_7120.tab, author_zones_7120_t5.tab, commit_zones_7120.grid and author_zones_7120_t5.grid.
Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, …).
sh> ./extract_data.sh
Run the script to create the graphs from all the previously extracted tabfiles.
sh> ./create_stackedbar_chart.py -w 20 -s 1971 -f commit_zones_7120.grid -f author_zones_7120_t5.grid -o chart.pdf
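As a closing note on the ingestion step, the shell filter passed to \copy (cut -f1,6 | grep -Ev '\s$') keeps the first and sixth tab-separated fields of each line and drops lines that end in whitespace, that is, lines whose sixth field is empty. The sketch below mirrors that behavior in Python for readers who want to inspect the data outside PostgreSQL. The function name is hypothetical, and the code assumes every input line has at least six tab-separated fields.

```python
import re

# Same pattern as the grep filter: a whitespace character at end of line.
TRAILING_WHITESPACE = re.compile(r"\s$")


def select_and_filter(lines):
    """Yield 'field1<TAB>field6' for lines whose result does not end in whitespace.

    Assumes every input line has at least six tab-separated fields.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        kept = fields[0] + "\t" + fields[5]
        if not TRAILING_WHITESPACE.search(kept):
            yield kept
```

A row whose sixth field is empty produces a string ending in a tab, which the pattern rejects, so that row is never handed to the commit_zone table.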