Worldwide Gender Differences in Public Code Contributions - Replication Package

Davide Rossi; Stefano Zacchiroli

doi:10.5281/zenodo.6020475

Published February 9, 2022 | Version v1

1. University of Bologna, Italy
2. LTCI, Télécom Paris, Institut Polytechnique de Paris, France

This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011

This document comes with the software needed to mine and analyze the data presented in the paper.

Prerequisites

These instructions assume the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility, and common *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and operating systems.
It is advisable to create a Python virtual environment and install the following PyPI packages:
click==8.0.3
cycler==0.10.0
gender-guesser==0.4.0
kiwisolver==1.3.2
matplotlib==3.4.3
numpy==1.21.3
pandas==1.3.4
patsy==0.5.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
pytz==2021.3
scipy==1.7.1
six==1.16.0
statsmodels==0.13.0
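
Before running the pipeline it can help to confirm that the active environment matches the pinned versions. The following is a minimal sketch that uses only the standard library; the names `PINS`, `parse_pins`, and `check_pins` are illustrative and not part of the replication package, and `PINS` is abbreviated to three of the packages listed above.

```python
from importlib import metadata

# Pins copied from the package list above (abbreviated to three entries
# for illustration; extend with the remaining ones as needed).
PINS = "click==8.0.3 gender-guesser==0.4.0 pandas==1.3.4"


def parse_pins(spec):
    """Turn 'name==version name==version ...' into a {name: version} dict."""
    return dict(item.split("==", 1) for item in spec.split())


def check_pins(pins, installed_version=metadata.version):
    """Return {name: (wanted, found)} for every package that does not match.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # Print one line per package whose installed version differs from the pin.
    for name, (wanted, found) in check_pins(parse_pins(PINS)).items():
        print(f"{name}: expected {wanted}, found {found}")
```

An empty result from `check_pins` means every listed package is installed at the pinned version.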

Initial data

swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/.
We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package because of both their volume and the personal information they contain, such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030.
Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - matches between ccTLD entities and world zones

Data preparation

Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst
sh> ./export.sh
Run the authors cleanup script to create authors--clean.csv.zst
sh> ./cleanup.sh authors.csv.zst
Filter out implausible names and create authors--plausible.csv.zst
sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
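
The filtering step is a stream filter: decompressed CSV arrives on standard input, accepted rows go to standard output, and diagnostics go to standard error (captured in authors--plausible.csv.log). The sketch below shows only that general shape. The plausibility rule in `looks_plausible` and the column index are hypothetical and are NOT the criteria implemented in filter_names.py.

```python
import csv
import sys


def looks_plausible(name):
    """Hypothetical rule (not the one in filter_names.py):
    at least two letters and no digits."""
    letters = sum(ch.isalpha() for ch in name)
    return letters >= 2 and not any(ch.isdigit() for ch in name)


def classify(rows, field=1):
    """Yield (is_plausible, row) pairs; `field` is an assumed 0-based column."""
    for row in rows:
        yield (len(row) > field and looks_plausible(row[field])), row


def main(inp=sys.stdin, out=sys.stdout, log=sys.stderr):
    """Copy plausible rows to `out` and report the others on `log`."""
    writer = csv.writer(out)
    for ok, row in classify(csv.reader(inp)):
        if ok:
            writer.writerow(row)
        else:
            print("rejected:", row, file=log)
```

Calling `main()` at the end of a script turns this into a pipeline stage that can sit between unzstd and zstdmt in the same way as the command above.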

Gender detection

Run the gender guessing script to create author-fullnames-gender.csv.zst
sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst
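
To give an idea of what guessing a gender from a full name can involve, here is a minimal sketch built on the gender-guesser package listed in the prerequisites. It assumes that each word of the full name is tried in turn and the first decisive answer wins; this is an illustrative strategy and may differ from what guess_gender.py does with `--fullname`. The function names are hypothetical.

```python
def guess_fullname(fullname, guess_word):
    """Return the first decisive guess among the words of `fullname`.

    `guess_word` maps a single word to a label; the label 'unknown'
    means the word carries no information.
    """
    for word in fullname.split():
        label = guess_word(word)
        if label != "unknown":
            return label
    return "unknown"


def make_detector_guess():
    """Build a `guess_word` callable on top of gender-guesser.

    The import is local so that the rest of the sketch can be used
    (and tested) without the third-party package installed.
    """
    import gender_guesser.detector as gender

    return gender.Detector().get_gender
```

With the package installed, `guess_fullname("Ada Lovelace", make_detector_guess())` returns one of the labels produced by gender-guesser: male, female, mostly_male, mostly_female, andy, or unknown.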

Database creation and data ingestion

Create the PostgreSQL DB
sh> createdb gender-commit
From here on, commands shown with the psql> prompt are assumed to run in psql connected to the gender-commit database.
Import data into PostgreSQL DB
sh> ./import_data.sh

Zone detection

Extract commits data from the DB and create commits.tab, which is used as input for the world zone detection script
sh> psql -f extract_commits.sql gender-commit
Run the world zone detection script to create commit_zones.tab.zst
sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
Read zones assignment data from the file into the DB
psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
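
The shell pipeline inside the \copy command keeps the first and sixth tab-separated columns of each row and discards rows that end in whitespace, which is what happens when the sixth column is empty. The Python sketch below mirrors that behaviour for readers less familiar with cut and grep. The function name is illustrative, and reading the two columns as a commit identifier and its assigned zone is an assumption based on the target table name.

```python
import re


def select_zone_rows(lines):
    """Mimic `cut -f1,6 | grep -Ev '\\s$'` on tab-separated lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # cut -f1,6: keep the 1st and 6th fields (when present), tab-joined
        picked = "\t".join(fields[i] for i in (0, 5) if i < len(fields))
        # grep -Ev '\s$': drop lines ending in whitespace (empty 6th field)
        if not re.search(r"\s$", picked):
            yield picked
```

Rows for which no zone could be assigned are therefore never loaded into the commit_culture table.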

Extraction and graphs

Run the extraction script, which executes the queries that pull the data to plot from the DB. This creates commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify the extraction parameters (start/end year, sampling, ...).
sh> ./extract_data.sh
Run the script that creates the graphs from all the previously extracted tab files. This generates commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf.
sh> ./create_charts.sh

Additional graphs

This package also includes some pre-generated graphs:

authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period
authors_zones_2.pdf: ditto with at least two commits per period
authors_zones_10.pdf: ditto with at least ten commits per period
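
The three graphs differ only in the activity threshold applied to authors. As a rough sketch of the quantity being plotted, the function below computes the share of female authors per world zone for one period, given a minimum number of commits. The record layout `(zone, gender, commits_in_period)` is hypothetical, and restricting the denominator to authors with a detected gender is an assumption; neither is taken from the package's SQL.

```python
from collections import defaultdict


def female_ratio_by_zone(records, min_commits=1):
    """Share of female authors per zone for one period.

    `records` is an iterable of (zone, gender, commits_in_period), one
    entry per author. Authors below `min_commits` and authors whose
    gender is neither 'female' nor 'male' are left out (assumption).
    """
    female = defaultdict(int)
    total = defaultdict(int)
    for zone, gender, commits in records:
        if commits < min_commits or gender not in ("female", "male"):
            continue
        total[zone] += 1
        female[zone] += gender == "female"
    return {zone: female[zone] / total[zone] for zone in total}
```

Calling it with `min_commits` set to 1, 2, and 10 corresponds to the thresholds used for the three files above.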

Files


Name	MD5	Size
README.html	1c8488cdcda3030224072e558b36aa75	6.4 kB
README.md	00d9db0cf0077cc620e16d00fea9c832	5.5 kB
replication-package.zip	d5ac44b000bbbe8389f116d878148f82	2.9 MB
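
After downloading, the files can be checked against the MD5 checksums listed above. The sketch below uses only the standard library; the function names and the assumption that the files sit in a single local directory are illustrative.

```python
import hashlib
import os

# Checksums as listed for the three files of this record.
EXPECTED_MD5 = {
    "README.html": "1c8488cdcda3030224072e558b36aa75",
    "README.md": "00d9db0cf0077cc620e16d00fea9c832",
    "replication-package.zip": "d5ac44b000bbbe8389f116d878148f82",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Return {file name: True/False} for the files found in `directory`."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        if os.path.exists(path):
            results[name] = md5_of(path) == expected
    return results
```

A `False` entry in the result of `verify()` indicates a corrupted or modified download; files that are not present are simply omitted.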

Additional details

Is supplement to: Conference paper: 10.1145/3510458.3513011 (DOI)
