Gender Differences in Public Code Contributions: a 50-year Perspective
Replication Package (version 1.0)
Table of Contents
- 1. commit list
- 2. export data from Software Heritage
- 3. author cleanup
- 4. extract samples
- 5. filter out implausible names
- 6. count unique full names
- 7. count unique words in full names
- 8. guess gender of author full names
- 9. import data into local Postgres database
- 10. quantify non-plausible commit timestamps
- 11. aggregate totals by period and gender
- 12. analyze totals and plot results
- 13. software versions
This page details the steps needed to replicate the findings of the paper: Stefano Zacchiroli, Gender Differences in Public Code Contributions: a 50-year Perspective, IEEE Software, 2021.
1 commit list
The starting corpus, obtained with the first step below ("export data from Software Heritage"), cannot be made available in full due to the presence of personal information, such as author names and emails. Instead, we provide the full list of identifiers of the 1'661'391'281 commits that constitute the starting corpus. Each commit is identified by its Software Heritage Identifier (SWHID), from which additional commit information can be retrieved from the Software Heritage archive, e.g., via its API.
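As a minimal sketch of that retrieval, a revision SWHID can be turned into the corresponding public Software Heritage API URL; the function name is ours, and the SWHID below is a zero-filled placeholder, not a real commit:

```python
def swhid_to_api_url(swhid: str) -> str:
    """Map a revision SWHID (swh:1:rev:<sha1>) to the public
    Software Heritage API endpoint for that revision."""
    scheme, version, objtype, obj_id = swhid.split(":")
    assert scheme == "swh" and objtype == "rev", "expected a revision SWHID"
    return f"https://archive.softwareheritage.org/api/1/revision/{obj_id}/"

# placeholder SWHID, for illustration only
print(swhid_to_api_url("swh:1:rev:" + "0" * 40))
```

Fetching that URL (e.g., with curl or requests) returns the commit metadata as JSON.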
For ease of management the commit list is split into 16 lists, based on the first hex digit of the commit SHA1 digest. Each list is compressed using zstd and weighs about 2.2 GB, for a total of 35 GB. The lists are stored in the following files included in the replication package:
commit-ids-0.txt.zst
commit-ids-1.txt.zst
commit-ids-2.txt.zst
commit-ids-3.txt.zst
commit-ids-4.txt.zst
commit-ids-5.txt.zst
commit-ids-6.txt.zst
commit-ids-7.txt.zst
commit-ids-8.txt.zst
commit-ids-9.txt.zst
commit-ids-a.txt.zst
commit-ids-b.txt.zst
commit-ids-c.txt.zst
commit-ids-d.txt.zst
commit-ids-e.txt.zst
commit-ids-f.txt.zst
Once information for all commits in the above lists has been obtained, either from the same source (Software Heritage) or from others, the subsequent steps (and code) detailed below can be run to replicate the paper's findings. The code itself can also be inspected, independently of the data, for audit purposes.
2 export data from Software Heritage
Code: export.sh, export_commits.sql, export_authors.sql
Requirements: Postgres >= 12, moreutils, zstd
Note: this step is needed only if you haven't already obtained commit information by other means and if you have access to a Software Heritage archive copy to export from.
./export.sh
cat commits.csv.log authors.csv.log
Timing is on.
COPY 1661391281
Time: 9505246,344 ms (02:38:25,246)
Timing is on.
COPY 33660524
Time: 32075,002 ms (00:32,075)
commits | 1661391281 | 1.66 B
authors | 33660524 | 33.7 M
3 author cleanup
Specifically: convert author names to UTF-8, skipping rows that cannot be converted
Code: cleanup.sh, pgcsv2utf8.py, smudge_authors.py
Requirements: Python 3, dateutil, pv, zstd
./cleanup.sh authors.csv.zst
zstdcat authors--clean.csv.zst | wc -l > authors--clean.csv.count
tail -n 1 authors--clean.csv.log
cat authors--clean.csv.count
ERROR:root:skipped 3025 row(s) due to conversion errors
33657517
skipped authors | 3025 | 0.009%
remaining authors | 33657517 | 33.7 M
4 extract samples
Extract samples for manual data inspection: ~1.5 M commits, ~300 K authors.
Code: stocat.pl
Note that these samples are random and do not respect foreign key constraints (i.e., not all author/committer IDs will be resolvable).
pv commits.csv.zst | unzstd \
  | ./stocat.pl -p .001 \
  | zstdmt > commits--sample.csv.zst
pv authors--clean.csv.zst | unzstd \
  | ./stocat.pl -p .01 \
  | zstdmt > authors--clean--sample.csv.zst
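stocat.pl is not reproduced here; judging from its usage, it appears to emit each input line independently with the given probability p. A minimal Python sketch of that behavior (the function name is ours):

```python
import random

def stochastic_cat(lines, p, rng=random):
    """Yield each input line independently with probability p,
    mimicking what stocat.pl -p appears to do."""
    for line in lines:
        if rng.random() < p:
            yield line

# demo on synthetic input: keep roughly 1% of 100000 lines;
# the count is close to 1000, not exact, since sampling is Bernoulli
demo = list(stochastic_cat((f"line {i}" for i in range(100000)), 0.01,
                           rng=random.Random(0)))
print(len(demo))
```

Such per-line Bernoulli sampling streams in constant memory, which matters at the scale of 1.66 B commits.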
5 filter out implausible names
Code: filter_names.py
Requirements: Python 3, dateutil
pv authors--clean.csv.zst | unzstd \
  | ./filter_names.py 2> authors--plausible.csv.log \
  | zstdmt > authors--plausible.csv.zst
cat authors--plausible.csv.log
ERROR:root:total_in: 33657499
ERROR:root:total_out: 25986273
ERROR:root:nonalpha: 7492927
ERROR:root:email: 153224
ERROR:root:empty: 25044
ERROR:root:toolong: 31
ERROR:root:skipped: 7671226
total in | 33657499 | 33.7 M
total out | 25986273 | 26.0 M
skipped | 7671226 | 7.67 M
skipped breakdown:
nonalpha | 7492927
email | 153224
empty | 25044
toolong | 31
6 count unique full names
Note: we ignore case, and we use GNU sed to lowercase instead of tr, as the latter is not Unicode-aware.
We also take a shuffle of the unique names for ease of manual inspection.
pv authors--plausible.csv.zst | unzstd \
  | cut -f 2 \
  | sed -e 's/./\L\0/g' \
  | sort -u -S 2G \
  | tee >( shuf | zstdmt > authors--plausible--shuffled.txt.zst ) \
  | wc -l
13219629
There are 13.2 M unique full names to check.
7 count unique words in full names
Splitting on all sequences of (Unicode) non-word characters.
Underlying idea: as we do not know how to reliably split between given and family names (and in some languages the notion is flawed anyway), we split names into individual words and check the gender of each of them, excluding meaningless results (which we will likely get on family names). We can then apply a majority criterion to the genders of all words in a given full name to determine the author's gender.
Code: split_words.py
Requirements: Python 3
pv authors--plausible--shuffled.txt.zst | unzstd \
  | ./split_words.py \
  | sort -u -S 2G \
  | tee >( shuf | zstdmt > author-words--shuffled.txt.zst ) \
  | zstdmt > author-words--sorted.txt.zst
zstdcat author-words--shuffled.txt.zst | wc -l > author-words--shuffled.txt.count
zstdcat author-words--sorted.txt.zst | wc -l > author-words--sorted.txt.count
cat author-words--sorted.txt.count
9315899
There are 9.3 M unique words in full names.
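The core of split_words.py, splitting on sequences of Unicode non-word characters as described above, can be sketched as follows (the function name is ours; the actual script may differ in details such as empty-token handling):

```python
import re

def split_words(fullname: str):
    """Split a full name on runs of non-word characters.
    In Python 3, \\W is Unicode-aware by default, so accented
    letters count as word characters."""
    return [w for w in re.split(r"\W+", fullname) if w]

print(split_words("jean-pierre o'neill"))  # ['jean', 'pierre', 'o', 'neill']
print(split_words("maría josé"))           # ['maría', 'josé']
```

Note that apostrophes also split, so names like o'neill yield two words; this is one source of the "meaningless results" the majority criterion later discards.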
8 guess gender of author full names
Using a majority criterion that assigns to each full name the most popular gender detected by gender-guesser on the individual words that compose it. This avoids having to split given names from family names.
Code: guess_gender.py
Requirements: Python 3, gender-guesser
First the real run to produce a table that keeps the mapping between author names (and their IDs) and detected gender:
pv authors--plausible.csv.zst | unzstd \
  | ./guess_gender.py --fullname --field 2 \
  | zstdmt > author-fullnames-gender.csv.zst
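The majority criterion can be sketched as below; the word_gender callable is a stand-in for gender-guesser's per-word detection, the toy lookup is made up for illustration, and tie-breaking and the handling of gender-guesser's "mostly_*" results in the actual guess_gender.py may differ:

```python
from collections import Counter

def guess_fullname_gender(words, word_gender):
    """Majority vote over per-word gender guesses, ignoring
    'unknown' results (e.g., family names)."""
    votes = Counter(word_gender(w) for w in words)
    votes.pop("unknown", None)  # meaningless results do not vote
    if not votes:
        return "unknown"
    return votes.most_common(1)[0][0]

# toy per-word lookup, for illustration only
LOOKUP = {"maria": "female", "rossi": "unknown", "john": "male"}
lookup = lambda w: LOOKUP.get(w, "unknown")
print(guess_fullname_gender(["maria", "rossi"], lookup))  # female
```

Because "unknown" votes are discarded before the majority is taken, a full name is classified as unknown only when no word in it yields a gender.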
Then compute stats about the distribution of recognized genders. Note that we need to lowercase (because detection is case insensitive) and deduplicate full names, as they may be repeated under different author IDs.
pv author-fullnames-gender.csv.zst | unzstd \
  | cut -f 2,3 \
  | sed -e 's/./\L\0/g' \
  | sort -t $'\t' -k 1 -u -S 2G \
  | cut -f 2 | sort -S 2G | uniq -c
542198 female
2975681 male
9701860 unknown
how many | ratio on total (%) | ratio on known (%) | gender
---|---|---|---
542198 | 4.1 | 15.4 | female
2975681 | 22.5 | 84.6 | male
9701860 | 73.4 | 275.8 | unknown
3517879 | 26.6 | 100.0 | known
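As a sanity check, the percentages in the table can be recomputed from the raw counts:

```python
# raw counts from the uniq -c output above
female, male, unknown = 542198, 2975681, 9701860
known = female + male
total = known + unknown

for label, n in [("female", female), ("male", male), ("unknown", unknown)]:
    print(f"{label}: {100 * n / total:.1f}% of total, "
          f"{100 * n / known:.1f}% of known")
print(f"known: {100 * known / total:.1f}% of total")
```

Note that the "ratio on known" column is also computed for the unknown row (hence the value above 100%).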
9 import data into local Postgres database
This is for further data analysis, to ease:
- range queries on timestamps
- joining from person IDs to fullnames and genders
Code: import_data.sh, schema.sql, import_data.sql, schema_indexes.sql
Requirements: Postgres >= 12
createdb gender-commit
./import_data.sh
Warning: this requires quite a bit of disk space: ~160 GB for the DB. Total import time is ~1h10m with SSDs.
cat import_data.log
DROP TABLE
DROP TABLE
DROP TYPE
DROP DOMAIN
CREATE DOMAIN
CREATE TYPE
CREATE TABLE
CREATE TABLE
Timing is on.
COPY 25986273
Time: 18713,424 ms (00:18,713)
COPY 1661391281
Time: 2844759,562 ms (47:24,760)
Timing is on.
CREATE INDEX
Time: 13631,734 ms (00:13,632)
CREATE INDEX
Time: 1234839,400 ms (20:34,839)
10 quantify non-plausible commit timestamps
"non-plausible" as in:
- before epoch (1970-01-01)
- at the epoch (we know from previous work that these are disproportionately overrepresented, e.g., due to VCS conversions gone wrong)
- in the future w.r.t. export date (2020-05-13), with 1 day of slack
select count(*)
from commit
where author_date <= '1970-01-01' or author_date > '2020-05-14'
   or committer_date <= '1970-01-01' or committer_date > '2020-05-14';
There aren't many: 10966017 (11 M, 0.66%)
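The same plausibility window can be expressed as a standalone predicate, mirroring the SQL filter above (a sketch; the cutoff encodes the 2020-05-13 export date plus one day of slack):

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)
EXPORT_CUTOFF = datetime(2020, 5, 14)  # export date 2020-05-13 + 1 day slack

def plausible(author_date, committer_date):
    """A commit is plausible iff both timestamps are strictly after
    the epoch and not after the export cutoff, matching the negation
    of the SQL predicate above."""
    return all(EPOCH < d <= EXPORT_CUTOFF for d in (author_date, committer_date))

print(plausible(datetime(2015, 6, 1), datetime(2015, 6, 1)))  # True
```

The strict lower bound excludes both pre-epoch and at-epoch timestamps with a single comparison.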
11 aggregate totals by period and gender
We aggregate commit and author totals by hour, for further analysis with pandas. We do each aggregation twice: once by absolute time (for the long-term trend), and once by local time (for daily/weekly patterns).
Code: extract_totals.sh, extract_totals.sql
Requirements: Postgres >= 12
./extract_totals.sh
Took ~2 hours of processing time:
cat extract_totals.log
Timing is on.
CREATE VIEW
Time: 1,849 ms
COPY 584972
Time: 412813,205 ms (06:52,813)
COPY 584972
Time: 2886563,792 ms (48:06,564)
COPY 584972
Time: 1455051,286 ms (24:15,051)
COPY 584972
Time: 2979160,501 ms (49:39,161)
Results are in {commit,author}-gender-by-{utc,local}-hour.csv. Each row maps a timestamp to 3 columns: the total amounts in the matching period, broken down by gender (male/female/unknown).
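Assuming comma-separated values and the column order described above (both assumptions, as is the made-up sample row), each file can be processed along these lines:

```python
import csv
import io

# made-up sample row: timestamp, then male/female/unknown totals
sample = "2020-05-01 13:00:00,120,10,45\n"

for ts, male, female, unknown in csv.reader(io.StringIO(sample)):
    known = int(male) + int(female)
    # per-period share of female contributions among known genders
    print(ts, f"female share of known: {100 * int(female) / known:.1f}%")
```

In analyze.py the equivalent loading is presumably done with pandas.read_csv before resampling and plotting.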
12 analyze totals and plot results
Code: analyze.py
Requirements: Python 3, numpy, matplotlib, pandas, statsmodels
./analyze.py
Generated charts are available under ./figures/.
13 software versions
The above experiments have been run within a Python virtual environment with the following versions of PyPI packages installed:
- certifi==2020.4.5.1
- chardet==3.0.4
- click==7.1.2
- cycler==0.10.0
- gender-guesser==0.4.0
- idna==2.9
- joblib==0.15.0
- kiwisolver==1.2.0
- Markdown==3.2.2
- matplotlib==3.2.1
- nltk==3.5
- numpy==1.18.4
- pandas==1.0.3
- patsy==0.5.1
- pyparsing==2.4.7
- python-dateutil==2.8.1
- pytz==2020.1
- regex==2020.5.14
- requests==2.23.0
- scikit-learn==0.23.0
- scipy==1.4.1
- six==1.14.0
- statsmodels==0.12.0
- threadpoolctl==2.0.0
- tqdm==4.46.0
- Unidecode==1.1.1
- urllib3==1.25.9