This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions and how they have been affected by the COVID-19 pandemic. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011
This document comes with the software needed to mine and analyze the data presented in the paper.
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility, and various common *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs.
It is advisable to create a Python virtual environment and install the following PyPI packages:
click==8.0.3
cycler==0.10.0
gender-guesser==0.4.0
kiwisolver==1.3.2
matplotlib==3.4.3
numpy==1.21.3
pandas==1.3.4
patsy==0.5.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
pytz==2021.3
scipy==1.7.1
six==1.16.0
statsmodels==0.13.0
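Once the packages are installed, it can be useful to confirm that the active environment matches the pinned versions. The following is a minimal sketch, not part of the replication package: the helper names parse_pins and check_pins are hypothetical, and it assumes the pins are available as plain "name==version" lines like the ones listed above.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, version = line.partition("==")
        if sep:  # ignore lines that are not exact pins
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins, installed_version=metadata.version):
    """Return {name: (wanted, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed at all.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems
```

For example, check_pins(parse_pins(open("requirements.txt").read())) returns an empty dict when every pin is satisfied; the file name requirements.txt is an assumption, as the package list above is not stated to ship under that name.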
The following data sources are required:
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - matches between ccTLD entities and world zones
Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst:
sh> ./export.sh
Create authors--clean.csv.zst:
sh> ./cleanup.sh authors.csv.zst
Create authors--plausible.csv.zst:
sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Create author-fullnames-gender.csv.zst:
sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst
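To give an idea of what a gender-guessing filter of this kind does, here is a minimal sketch. It is not the actual guess_gender.py: it assumes the input is standard CSV, that field 2 (1-based) holds the author's full name, and that the guess is made from the first whitespace-separated token of that name. The functions annotate and first_token_guesser are hypothetical names; Detector.get_gender is the real entry point of the gender-guesser package listed above.

```python
def first_token_guesser():
    """Build a guess function backed by gender-guesser (imported lazily)."""
    import gender_guesser.detector as gender

    detector = gender.Detector()

    def guess(fullname):
        tokens = fullname.split()
        # Assumption: the forename is the first token of the full name.
        return detector.get_gender(tokens[0]) if tokens else "unknown"

    return guess


def annotate(rows, field, guess):
    """Append guess(row[field]) to each row; `field` is 1-based.

    Rows too short to contain the field are passed an empty name.
    """
    for row in rows:
        name = row[field - 1] if len(row) >= field else ""
        yield row + [guess(name)]
```

A possible use is csv.writer(sys.stdout).writerows(annotate(csv.reader(sys.stdin), 2, first_token_guesser())), reading decompressed CSV on standard input as in the pipeline above.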
Create the PostgreSQL DB
sh> createdb gender-commit
Note that from now on, commands prefixed with the psql> prompt are assumed to be run in a psql session connected to the gender-commit database.
Import data into PostgreSQL DB
sh> ./import_data.sh
Extract commits.tab, which is used as input for the gender detection script:
sh> psql -f extract_commits.sql gender-commit
Create commit_zones.tab.zst:
sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
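The shell pipeline embedded in the \copy command keeps only columns 1 and 6 of the tab-separated file and discards rows whose last kept column is empty (or ends in whitespace). The same logic can be sketched in Python as follows; the function name cut_and_filter is hypothetical, and what column 6 actually contains (presumably the assigned world zone) is an assumption, as the column layout is not described here.

```python
import re


def cut_and_filter(lines, keep=(1, 6)):
    """Mimic `cut -f1,6` followed by `grep -Ev` on a trailing-whitespace pattern."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Like cut, silently skip requested fields that a short line lacks.
        picked = "\t".join(fields[i - 1] for i in keep if i <= len(fields))
        # Like grep -v with the pattern \s$, drop rows ending in whitespace,
        # which is what happens when the last kept column is empty.
        if re.search(r"\s$", picked):
            continue
        yield picked
```

Rows without a value in column 6 therefore never reach the commit_culture table.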
Extract commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...).
sh> ./extract_data.sh
Create the charts commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf:
sh> ./create_charts.sh
This package also includes some ready-made graphs:
authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period
authors_zones_2.pdf: ditto with at least two commits per period
authors_zones_10.pdf: ditto with at least ten commits per period
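The effect of the "at least N commits per period" threshold can be illustrated with a small sketch. This is not the code used to produce the graphs: the function name female_ratio and the record layout (one tuple of author id, gender, world zone, and year per commit) are hypothetical, a period is assumed to be a calendar year, and the ratio is assumed to be female / (female + male) among the authors that meet the threshold. The paper's exact definition may differ.

```python
from collections import Counter, defaultdict


def female_ratio(commits, min_commits=1):
    """Ratio of female authors per (zone, year).

    commits: iterable of (author_id, gender, zone, year), one tuple per commit.
    Only authors with at least `min_commits` commits in that zone and year,
    and with gender 'female' or 'male', are counted.
    """
    per_author = Counter(commits)  # commits per (author, gender, zone, year)
    totals = defaultdict(lambda: {"female": 0, "male": 0})
    for (author, gender, zone, year), n in per_author.items():
        if n >= min_commits and gender in ("female", "male"):
            totals[(zone, year)][gender] += 1
    return {
        key: counts["female"] / (counts["female"] + counts["male"])
        for key, counts in totals.items()
    }
```

Raising min_commits from 1 to 2 to 10 restricts the population to progressively more active authors, which is the difference between the three graphs listed above.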