Gender Differences in Public Code Contributions: a 50-year Perspective
Replication Package (version 1.0)
Table of Contents
- 1. commit list
- 2. export data from Software Heritage
- 3. author cleanup
- 4. extract samples
- 5. filter out implausible names
- 6. count unique full names
- 7. count unique words in full names
- 8. guess gender of author full names
- 9. import data into local Postgres database
- 10. quantify non-plausible commit timestamps
- 11. aggregate totals by period and gender
- 12. analyze totals and plot results
- 13. software versions
This page details the steps needed to replicate the findings of the paper: Stefano Zacchiroli, Gender Differences in Public Code Contributions: a 50-year Perspective, IEEE Software, 2021.
1 commit list
The starting corpus, obtained with the first step below ("export data from Software Heritage"), cannot be made available in full due to the presence of personal information, such as author names and emails. Instead, we provide the full list of identifiers of the 1'661'391'281 commits that constitute the starting corpus. Each commit is identified by its Software Heritage Identifier (SWHID), from which additional commit information can be retrieved from the Software Heritage archive, e.g., via its API.
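As a minimal sketch of that retrieval, a revision SWHID can be turned into the corresponding public Software Heritage API URL; the function name is ours, and the SWHID below is a zero-filled placeholder, not a real commit:

```python
def swhid_to_api_url(swhid: str) -> str:
    """Map a revision SWHID (swh:1:rev:<sha1>) to the public
    Software Heritage API endpoint for that revision."""
    scheme, version, objtype, obj_id = swhid.split(":")
    assert scheme == "swh" and objtype == "rev", "expected a revision SWHID"
    return f"https://archive.softwareheritage.org/api/1/revision/{obj_id}/"

# placeholder SWHID, for illustration only
print(swhid_to_api_url("swh:1:rev:" + "0" * 40))
```

Fetching that URL (e.g., with curl or requests) returns the commit metadata as JSON.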
For ease of management the commit list is split into 16 lists, based on the first hex digit of the commit SHA1 digest. Each list is compressed using zstd and weighs about 2.2 GB, for a total of 35 GB. The lists are stored in the following files included in the replication package:
commit-ids-0.txt.zst
commit-ids-1.txt.zst
commit-ids-2.txt.zst
commit-ids-3.txt.zst
commit-ids-4.txt.zst
commit-ids-5.txt.zst
commit-ids-6.txt.zst
commit-ids-7.txt.zst
commit-ids-8.txt.zst
commit-ids-9.txt.zst
commit-ids-a.txt.zst
commit-ids-b.txt.zst
commit-ids-c.txt.zst
commit-ids-d.txt.zst
commit-ids-e.txt.zst
commit-ids-f.txt.zst
Once information for all commits in the above lists has been obtained, either from the same source (Software Heritage) or from others, the subsequent steps (and code) detailed below can be run to replicate the paper's findings. The code itself can also be inspected, independently of the data, for audit purposes.
2 export data from Software Heritage
Code: export.sh, export_commits.sql, export_authors.sql
Requirements: Postgres >= 12, moreutils, zstd
Note: this step is needed only if you haven't already obtained commit information by other means and if you have access to a Software Heritage archive copy to export from.
./export.sh
cat commits.csv.log authors.csv.log
Timing is on.
COPY 1661391281
Time: 9505246,344 ms (02:38:25,246)
Timing is on.
COPY 33660524
Time: 32075,002 ms (00:32,075)
commits | 1661391281 | 1.66 B
authors | 33660524 | 33.7 M
3 author cleanup
Specifically: convert author names to UTF-8, skipping rows that cannot be converted
Code: cleanup.sh, pgcsv2utf8.py, smudge_authors.py
Requirements: Python 3, dateutil, pv, zstd
./cleanup.sh authors.csv.zst
zstdcat authors--clean.csv.zst | wc -l > authors--clean.csv.count
tail -n 1 authors--clean.csv.log
cat authors--clean.csv.count
ERROR:root:skipped 3025 row(s) due to conversion errors
33657517
skipped authors | 3025 | 0.009%
remaining authors | 33657517 | 33.7 M
4 extract samples
Extract samples for manual data inspection: ~1.5 M commits, ~300 K authors.
Code: stocat.pl
Note that these samples are random and do not respect foreign key constraints (i.e., not all author/committer IDs will be resolvable).
pv commits.csv.zst | unzstd \
  | ./stocat.pl -p .001 \
  | zstdmt > commits--sample.csv.zst
pv authors--clean.csv.zst | unzstd \
  | ./stocat.pl -p .01 \
  | zstdmt > authors--clean--sample.csv.zst
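stocat.pl is not reproduced here; judging from its usage, it appears to emit each input line independently with the given probability p. A minimal Python sketch of that behavior (the function name is ours):

```python
import random

def stochastic_cat(lines, p, rng=random):
    """Yield each input line independently with probability p,
    mimicking what stocat.pl -p appears to do."""
    for line in lines:
        if rng.random() < p:
            yield line

# demo on synthetic input: keep roughly 1% of 100000 lines;
# the count is close to 1000, not exact, since sampling is Bernoulli
demo = list(stochastic_cat((f"line {i}" for i in range(100000)), 0.01,
                           rng=random.Random(0)))
print(len(demo))
```

Such per-line Bernoulli sampling streams in constant memory, which matters at the scale of 1.66 B commits.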
5 filter out implausible names
Code: filter_names.py
Requirements: Python 3, dateutil
pv authors--clean.csv.zst | unzstd \
  | ./filter_names.py 2> authors--plausible.csv.log \
  | zstdmt > authors--plausible.csv.zst
cat authors--plausible.csv.log
ERROR:root:total_in: 33657499
ERROR:root:total_out: 25986273
ERROR:root:nonalpha: 7492927
ERROR:root:email: 153224
ERROR:root:empty: 25044
ERROR:root:toolong: 31
ERROR:root:skipped: 7671226
total in | 33657499 | 33.7 M
total out | 25986273 | 26.0 M
skipped | 7671226 | 7.67 M
skipped breakdown:
nonalpha | 7492927
email | 153224
empty | 25044
toolong | 31
6 count unique full names
Note: we ignore case, and we use GNU sed to lowercase instead of tr, as the latter is not Unicode-aware.
We also take a shuffle of the unique names for ease of manual inspection.
pv authors--plausible.csv.zst | unzstd \
  | cut -f 2 \
  | sed -e 's/./\L\0/g' \
  | sort -u -S 2G \
  | tee >( shuf | zstdmt > authors--plausible--shuffled.txt.zst ) \
  | wc -l
13219629
There are 13.2 M unique full names to check.
7 count unique words in full names
Splitting on all sequences of (Unicode) non-word characters.
Underlying idea: as we do not know how to reliably split between given and family names (and in some languages the notion is flawed anyway), we split names into individual words and check the gender of each of them, excluding meaningless results (which we will likely get on family names). We can then apply a majority criterion to the genders of all words in a given full name to determine the author's gender.
Code: split_words.py
Requirements: Python 3
pv authors--plausible--shuffled.txt.zst | unzstd \
  | ./split_words.py \
  | sort -u -S 2G \
  | tee >( shuf | zstdmt > author-words--shuffled.txt.zst ) \
  | zstdmt > author-words--sorted.txt.zst
zstdcat author-words--shuffled.txt.zst | wc -l > author-words--shuffled.txt.count
zstdcat author-words--sorted.txt.zst | wc -l > author-words--sorted.txt.count
cat author-words--sorted.txt.count
9315899
There are 9.3 M unique words in full names.
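The core of split_words.py, splitting on sequences of Unicode non-word characters as described above, can be sketched as follows (the function name is ours; the actual script may differ in details such as empty-token handling):

```python
import re

def split_words(fullname: str):
    """Split a full name on runs of non-word characters.
    In Python 3, \\W is Unicode-aware by default, so accented
    letters count as word characters."""
    return [w for w in re.split(r"\W+", fullname) if w]

print(split_words("jean-pierre o'neill"))  # ['jean', 'pierre', 'o', 'neill']
print(split_words("maría josé"))           # ['maría', 'josé']
```

Note that apostrophes also split, so names like o'neill yield two words; this is one source of the "meaningless results" the majority criterion later discards.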
8 guess gender of author full names
Using a majority criterion that assigns to each full name the most popular gender detected by gender-guesser on the individual words that compose it. This avoids having to split given names from family names.
Code: guess_gender.py
Requirements: Python 3, gender-guesser
First the real run to produce a table that keeps the mapping between author names (and their IDs) and detected gender:
pv authors--plausible.csv.zst | unzstd \
  | ./guess_gender.py --fullname --field 2 \
  | zstdmt > author-fullnames-gender.csv.zst
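The majority criterion can be sketched as below; the word_gender callable is a stand-in for gender-guesser's per-word detection, the toy lookup is made up for illustration, and tie-breaking and the handling of gender-guesser's "mostly_*" results in the actual guess_gender.py may differ:

```python
from collections import Counter

def guess_fullname_gender(words, word_gender):
    """Majority vote over per-word gender guesses, ignoring
    'unknown' results (e.g., family names)."""
    votes = Counter(word_gender(w) for w in words)
    votes.pop("unknown", None)  # meaningless results do not vote
    if not votes:
        return "unknown"
    return votes.most_common(1)[0][0]

# toy per-word lookup, for illustration only
LOOKUP = {"maria": "female", "rossi": "unknown", "john": "male"}
lookup = lambda w: LOOKUP.get(w, "unknown")
print(guess_fullname_gender(["maria", "rossi"], lookup))  # female
```

Because "unknown" votes are discarded before the majority is taken, a full name is classified as unknown only when no word in it yields a gender.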
Then compute stats about the distribution of recognized genders. Note that we need to lowercase (because detection is case insensitive) and deduplicate full names, as they may be repeated under different author IDs.
pv author-fullnames-gender.csv.zst | unzstd \
  | cut -f 2,3 \
  | sed -e 's/./\L\0/g' \
  | sort -t $'\t' -k 1 -u -S 2G \
  | cut -f 2 | sort -S 2G | uniq -c
542198 female
2975681 male
9701860 unknown
how many | ratio on total (%) | ratio on known (%) | gender
---|---|---|---
542198 | 4.1 | 15.4 | female
2975681 | 22.5 | 84.6 | male
9701860 | 73.4 | 275.8 | unknown
3517879 | 26.6 | 100.0 | known
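As a sanity check, the percentages in the table can be recomputed from the raw counts:

```python
# raw counts from the uniq -c output above
female, male, unknown = 542198, 2975681, 9701860
known = female + male
total = known + unknown

for label, n in [("female", female), ("male", male), ("unknown", unknown)]:
    print(f"{label}: {100 * n / total:.1f}% of total, "
          f"{100 * n / known:.1f}% of known")
print(f"known: {100 * known / total:.1f}% of total")
```

Note that the "ratio on known" column is also computed for the unknown row (hence the value above 100%).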
9 import data into local Postgres database
This is for further data analysis, to ease:
- range queries on timestamps
- joining from person IDs to fullnames and genders
Code: import_data.sh, schema.sql, import_data.sql, schema_indexes.sql
Requirements: Postgres >= 12
createdb gender-commit
./import_data.sh
Warning: this requires quite a bit of disk space: ~160 GB for the DB. Total import time is ~1h10m with SSDs.
cat import_data.log
DROP TABLE
DROP TABLE
DROP TYPE
DROP DOMAIN
CREATE DOMAIN
CREATE TYPE
CREATE TABLE
CREATE TABLE
Timing is on.
COPY 25986273
Time: 18713,424 ms (00:18,713)
COPY 1661391281
Time: 2844759,562 ms (47:24,760)
Timing is on.
CREATE INDEX
Time: 13631,734 ms (00:13,632)
CREATE INDEX
Time: 1234839,400 ms (20:34,839)
10 quantify non-plausible commit timestamps
"non-plausible" as in:
- before epoch (1970-01-01)
- at the epoch (we know from previous work that these are disproportionately overrepresented, e.g., due to VCS conversions gone wrong)
- in the future w.r.t. export date (2020-05-13), with 1 day of slack
select count(*)
from commit
where author_date <= '1970-01-01' or author_date > '2020-05-14'
   or committer_date <= '1970-01-01' or committer_date > '2020-05-14';
There aren't many: 10966017 (11 M, 0.66%)
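The same plausibility window can be expressed as a standalone predicate, mirroring the SQL filter above (a sketch; the cutoff encodes the 2020-05-13 export date plus one day of slack):

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)
EXPORT_CUTOFF = datetime(2020, 5, 14)  # export date 2020-05-13 + 1 day slack

def plausible(author_date, committer_date):
    """A commit is plausible iff both timestamps are strictly after
    the epoch and not after the export cutoff, matching the negation
    of the SQL predicate above."""
    return all(EPOCH < d <= EXPORT_CUTOFF for d in (author_date, committer_date))

print(plausible(datetime(2015, 6, 1), datetime(2015, 6, 1)))  # True
```

The strict lower bound excludes both pre-epoch and at-epoch timestamps with a single comparison.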
11 aggregate totals by period and gender
We aggregate commit and author totals by hour, for further analysis with pandas. We do each aggregation twice: once by absolute time (for the long-term trend), and once by local time (for daily/weekly patterns).
Code: extract_totals.sh, extract_totals.sql
Requirements: Postgres >= 12
./extract_totals.sh
Took ~2 hours of processing time:
cat extract_totals.log
Timing is on.
CREATE VIEW
Time: 1,849 ms
COPY 584972
Time: 412813,205 ms (06:52,813)
COPY 584972
Time: 2886563,792 ms (48:06,564)
COPY 584972
Time: 1455051,286 ms (24:15,051)
COPY 584972
Time: 2979160,501 ms (49:39,161)
Results are in {commit,author}-gender-by-{utc,local}-hour.csv. Each row maps a timestamp to 3 columns: the total amounts in the matching period, broken down by gender (male/female/unknown).
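Assuming comma-separated values and the column order described above (both assumptions, as is the made-up sample row), each file can be processed along these lines:

```python
import csv
import io

# made-up sample row: timestamp, then male/female/unknown totals
sample = "2020-05-01 13:00:00,120,10,45\n"

for ts, male, female, unknown in csv.reader(io.StringIO(sample)):
    known = int(male) + int(female)
    # per-period share of female contributions among known genders
    print(ts, f"female share of known: {100 * int(female) / known:.1f}%")
```

In analyze.py the equivalent loading is presumably done with pandas.read_csv before resampling and plotting.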
12 analyze totals and plot results
Code: analyze.py
Requirements: Python 3, numpy, matplotlib, pandas, statsmodels
./analyze.py
Generated charts are available under ./figures/.
13 software versions
The above experiments have been run within a Python virtual environment with the following versions of PyPI packages installed:
- certifi==2020.4.5.1
- chardet==3.0.4
- click==7.1.2
- cycler==0.10.0
- gender-guesser==0.4.0
- idna==2.9
- joblib==0.15.0
- kiwisolver==1.2.0
- Markdown==3.2.2
- matplotlib==3.2.1
- nltk==3.5
- numpy==1.18.4
- pandas==1.0.3
- patsy==0.5.1
- pyparsing==2.4.7
- python-dateutil==2.8.1
- pytz==2020.1
- regex==2020.5.14
- requests==2.23.0
- scikit-learn==0.23.0
- scipy==1.4.1
- six==1.14.0
- statsmodels==0.12.0
- threadpoolctl==2.0.0
- tqdm==4.46.0
- Unidecode==1.1.1
- urllib3==1.25.9