An empirical comparison of ethnic and gender diversity of DevOps and non-DevOps contributions to open-source projects

Diversity has been recognized as a high-value team characteristic. Both open-source and proprietary software organizations have been investing heavily in creating more diverse teams. Prior work has raised diversity concerns about open-source communities; however, to the best of our knowledge, it is not yet clear if those diversity concerns permeate across all of the subteams of the project. Studying diversity in subteams would provide more detailed empirical evidence about the role of diversity in software development teams. Therefore, we perform an empirical study on 110,336 developers who contributed to artifacts of 450 large and thriving open-source projects. We opt to study diversity of the DevOps team because it plays a central role in a project. In particular, we analyze the perceptible ethnic and gender diversity among DevOps contributors to open-source, and we ground our analysis in a comparison to non-DevOps contributors. Overall, our results show that, with respect to perceptible ethnic diversity, contributors with perceptibly White names in a project are the majority of DevOps contributors (median = 87.70%) and non-DevOps contributors (median = 85.50%). With respect to gender diversity, contributors who are perceptible as men in a project are the majority of DevOps contributors (median = 93.75%) and non-DevOps contributors (median = 92.82%). We statistically measure the perceptible ethnic and gender diversity of both DevOps and non-DevOps contributors using diversity metrics, and we find that the diversity of DevOps contributors is significantly less than that of non-DevOps contributors. When analyzing the distribution of diversity change as projects evolve, we find that contributors perceptible as non-Whites (such as Hispanic and Black) are greatly underrepresented. 
Although the percentage of contributors perceptible as White is decreasing over time, the percentage of contributors perceptible as non-White is still low, i.e., it varies between 0%–16.02% for DevOps and 0%–18.77% for non-DevOps. We observe similar results for gender diversity, where contributors perceptible as men dominate over contributors perceptible as women. Our study provides empirical evidence contributing towards a better understanding of diversity aspects from a different perspective (DevOps vs. non-DevOps contributors). Our findings call for higher awareness, not only of the overall diversity but also of the diversity in specific subteams of the project.


Introduction
Diversity in any profession benefits everyone involved and can lead to a large positive impact on different communities (Galinsky et al. 2015; Reynolds et al. 2017). More specifically, in the software engineering community, greater diversity has been associated with more resilient software solutions, problem-solving from broader perspectives, and more effective teamwork (Vasilescu et al. 2015b). At present, large organizations, such as Google, Microsoft, and Facebook, are investing heavily in fostering diverse teams.
Previous work has explored some aspects of diversity in open-source communities (Bosu and Sultana 2019; Catolino et al. 2019; Nadri et al. 2021b; Vasilescu et al. 2014, 2015b). For example, Nadri et al. (2021b) empirically analyzed the impact of race and ethnicity on the likelihood of acceptance of the proposed contributions in open-source projects. They showed that contributors with perceptibly White names are responsible for the vast majority of the contributions, and their contributions are more likely to be accepted than the contributions of contributors with perceptibly non-White names. Vasilescu et al. (2015b) observed that contributors who are perceptible as women account for a smaller proportion of GitHub project teams than men. In addition, they found that teams with higher rates of gender diversity tend to be more productive.
While diversity in software development teams has been extensively studied (Catolino et al. 2019; Vasilescu et al. 2014, 2015b), much of this research has focused on general investigations of diversity without specific attention to particular types of subteams within teams, e.g., quality assurance teams and DevOps teams. Studying diversity in subteams would provide more precise and targeted empirical evidence about the role of diversity in software development teams, as it would allow us to examine whether the diversity concerns that have been observed overall permeate all of the subteams of a project. For instance, developers in the DevOps team of a project require a specialized skill set (Wiedemann and Wiesche 2018). In particular, DevOps contributors typically require expertise in collaboration, automation, measurement, monitoring (Humble and Molesky 2011), and Agile methodologies (Bang et al. 2013; Bass et al. 2015), as well as experience with cloud-based tools and infrastructure (Bang et al. 2013; Cukier 2013). Such specialized skills and tooling expertise may not be as common in the wider open-source community. This is evident in the Stack Overflow survey (2022), as it shows that only 10.06% of respondents identified themselves as DevOps developers. The survey results also reveal a gap in the gender diversity of DevOps developers. Similar diversity concerns have been raised in popular DevOps reports as well. For example, DORA's State of DevOps report (2021) (Smith et al. 2019) highlighted that the percentage of DevOps developers who found themselves underrepresented has increased from 13.7% in 2019 to 17% in 2021.
Therefore, we set out to study the diversity among DevOps contributors to open-source projects. Note that, to the best of our knowledge, this is the first study on the diversity of subteams within project teams, and we opt to focus on DevOps teams because DevOps plays a central role in a project (Guşeilȃ et al. 2019). In particular, the DevOps contributors in a project are the ones who are responsible for supporting other teams during software development, continuous integration (Vasilescu et al. 2015e), continuous product delivery, and managing and allocating resources, such as computing and storage, to other teams (Erich et al. 2017; Leite et al. 2019). Diversity in DevOps can help to improve the overall performance of a project. In fact, a report, "Why diversity matters in DevOps," discusses the potential of diversity in DevOps teams to lead to more streamlined product development. Moreover, there is a growing focus on diversity of DevOps contributors in the industry as well. For example, the global movement "Women in DevOps" aims to close the gender gap in DevOps contributors, believing that a balanced and diverse workforce drives innovation. We ground our analysis of diversity metrics of developers in DevOps teams in comparison to other developers. More specifically, we perform an empirical study involving data from 450 active and mature projects that are hosted on GitHub. A preliminary analysis of this data indicates that the majority (median = 86.70%) of the developers in a project concentrate their contributions on non-DevOps artifacts only. Yet, a non-negligible percentage (median = 13.30%) of developers contribute to DevOps artifacts. Therefore, we formulate and address the following research questions:

(RQ1) Does the perceptible ethnic and gender diversity of DevOps contributors differ from ethnic and gender diversity of non-DevOps contributors?
Motivation: While there is a lack of research specifically focused on the impact of diversity in DevOps teams, there is evidence regarding the positive impact of diversity in software development teams (Earley and Mosakowski 2000; Hui and Farnham 2016; Vasilescu et al. 2015b). Since DevOps plays a major role in the software development process (Ebert et al. 2016), a diverse DevOps team would ensure the DevOps process is more robust, inclusive, and successful. Therefore, this RQ aims to provide the community with more awareness of the presence or lack of ethnic and gender diversity of DevOps contributors to open-source projects.
Results: With respect to ethnic diversity, contributors with perceptibly White names are the majority among both DevOps contributors (median = 87.70%) and non-DevOps contributors (median = 85.50%). With respect to gender diversity, contributors perceptible as men are the majority among both DevOps contributors (median = 93.75%) and non-DevOps contributors (median = 92.82%). We statistically measure the ethnic and gender diversity of both DevOps and non-DevOps contributors using diversity metrics, and we find that the diversity of DevOps contributors is significantly less than that of non-DevOps contributors (Wilcoxon signed rank test, p < α = 0.0023, one-tailed, paired).

(RQ2) How does the distribution of perceptible ethnic and gender diversity change as projects age?
Motivation: While a concerning lack of ethnic and gender diversity in open-source communities has been reported for decades now (Davidson et al. 2014; Ford et al. 2016; Wang and Redmiles 2019), it is not yet clear where the current trend is headed. By analyzing the diversity metrics over time, we can better understand whether the trend of diversity is improving or further degrading. Therefore, we examine the evolution of ethnic and gender diversity of DevOps and non-DevOps contributors in the projects.
Results: As projects evolve, contributors perceptible as non-White remain greatly underrepresented. The overall percentage of DevOps contributors perceptible as White is decreasing over time; however, the percentage of DevOps contributors perceptible as non-White is still low, i.e., it varies between 0% and 16.02%. For non-DevOps contributors, the percentage varies between 0% and 18.77%. With respect to gender diversity, both DevOps and non-DevOps contributors who are perceptible as men dominate over contributors who are perceptible as women. The percentage of contributors perceptible as women varies between 0% and 12.5% for DevOps and between 0% and 9.48% for non-DevOps.
The above answers to our research questions revealed that minorities with respect to perceived ethnicity and gender are more underrepresented among DevOps contributors compared to non-DevOps contributors. However, Ross et al. (2020) discovered that developers at the intersection of being Black and being a woman (i.e., Black women) often have different experiences than Black men and non-Black women in the US. In our study, independently analyzing ethnic and gender diversity may not help to understand how diverse the contributors are when considering the intersection of minority groups of perceptible ethnicity and gender (e.g., contributors with perceptibly non-White and women's names). To that end, we perform an intersectional analysis of ethnic and gender diversity to obtain a more realistic picture with regard to ethnicity and gender (Gren 2018). Our results show that perceptible Whites are the majority of DevOps and non-DevOps contributors who are perceptible as women. More importantly, we find that DevOps contributors' lack of perceived diversity is amplified when considering intersectionality.
We believe that our study provides empirical evidence that contributes towards a better understanding of the perceptible diversity among contributors in open source. We complement prior studies that raise awareness of the lack of perceptible diversity among contributors in open-source projects. While solutions and strategies have been proposed to increase diversity in open-source projects (Davidson et al. 2014; Ford et al. 2016; Wang and Redmiles 2019), our results underscore the importance of encouraging open-source communities to foster a more diverse and inclusive environment, not only considering the overall project team but also the different subteams within the project.
Paper Organization. The remainder of the paper is organized as follows. Section 2 provides an overview of the design of our study. Section 3 presents the results for our research questions. Section 4 discusses our results further. We present the threats to the validity of the study in Section 5. Section 6 discusses work related to our study. Finally, Section 7 draws conclusions and discusses the broader implications of our study.

Study Design
In this section, we describe our process for collecting and curating the dataset we use to address our research questions. Figure 1 provides an overview of our study design, which is composed of Project Selection (PS), Data Curation (DC), and Preliminary Analysis (PA) steps. Next, we explain each step in detail.

(PS) Project Selection
Our study aims to analyze the diversity of contributors to DevOps and non-DevOps artifacts in open-source projects. Thus, we need to collect a dataset of open-source projects that adopt tools and technologies for DevOps activities. We begin with the public dataset of Gallaba et al. (2022). This dataset contains data from 23,330,690 software builds that span 7,795 GitHub projects that have been using CircleCI, a leading cloud-based Continuous Integration (CI) platform that has served over one million unique contributors during its nine years of operation. This dataset ensures that the projects have potentially been using DevOps tools and technologies to help them automate software development processes, such as building, testing, and deploying the software. Since GitHub hosts projects that are not representative of projects we aim to investigate (e.g., toy or immature projects) (Munaiah et al. 2017), we follow the methodology recommended by previous work (Kalliamvakou et al. 2016) to further curate our dataset by applying the following inclusion criteria:
(PS1) Select non-forked projects. We remove forks because they largely contain duplicated project history, which would bias our analysis. To do so, we use the GitHub API to determine whether a project is a fork or not. If the project is a fork, the GitHub API returns the fork status True, and we filter out all such projects. This step reduces our dataset to 7,068 projects.
(PS2) Select active and large projects. Active and large projects are likely to showcase a long-running and collaborative software development process to examine diversity. To detect active and large projects, we consider different thresholds of (1) the number of builds, (2) the number of commits, and (3) the number of contributors.

Number of builds.
Figure 2 plots candidate threshold values of the number of builds against the number of surviving projects. We select a threshold of 500 builds because it is closer to a "knee" in the curve. Selecting this threshold further reduces the number of projects in the dataset to 2,124.

Number of commits.
Figure 3 plots candidate threshold values of the number of commits against the number of surviving projects. We select a threshold of 1,500 commits because it is also closer to a "knee" in the curve. Doing so reduces the number of projects in the dataset to 850.

Number of contributors.
Figure 4 plots candidate threshold values for the number of contributors against the number of surviving projects. We select a threshold of 50 contributors because we want to study projects with a substantial number of contributors. Doing so reduces the number of projects in our dataset to 450. This dataset of 450 projects comprises large projects from popular organizations, e.g., Facebook, Google, and Angular. Overall, the projects in our dataset run a considerable number of builds (a median of 8,711 builds per project) and have a rich development history (a median of 4,935 commits per project and 145 contributors per project).
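
The selection criteria above can be summarized as a single filter. The following is a minimal sketch, not the study's actual tooling: the `Project` record and its field names are hypothetical, and treating each threshold as an inclusive lower bound is an assumption, since the text does not state whether the cut-offs are strict.

```python
from dataclasses import dataclass

# Thresholds reported in the text; treating each one as an inclusive
# lower bound (>=) is an assumption of this sketch.
MIN_BUILDS = 500
MIN_COMMITS = 1_500
MIN_CONTRIBUTORS = 50


@dataclass(frozen=True)
class Project:
    """Hypothetical per-project record; the field names are illustrative."""
    name: str
    is_fork: bool      # e.g., the boolean `fork` field of a GitHub repository
    builds: int
    commits: int
    contributors: int


def select_projects(projects):
    """Apply PS1 (drop forks) and PS2 (activity and size thresholds)."""
    return [
        p for p in projects
        if not p.is_fork
        and p.builds >= MIN_BUILDS
        and p.commits >= MIN_COMMITS
        and p.contributors >= MIN_CONTRIBUTORS
    ]


# Made-up candidates, for illustration only.
candidates = [
    Project("org/large-app", False, 8_711, 4_935, 145),
    Project("someone/fork-of-app", True, 9_000, 5_000, 200),
    Project("org/small-tool", False, 120, 300, 4),
]
selected = [p.name for p in select_projects(candidates)]
```

Only the first candidate survives: the second is a fork and the third falls below every threshold.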

(DC) Data Curation
In our study, we integrate data from various sources to acquire all the data we need for our analysis. Figure 1 (Step 2 DC) provides an overview of our data curation process. In the following, we describe each data curation step in detail.

(DC1) Collecting commit information.
To analyze the diversity of contributors, we need to collect historical data on project development activity. Therefore, we first clone local copies of the 450 repositories in our dataset. Then, for each repository, we mine the commit records that appear on its master/main branch (doing so ensures that the most impactful changes to the source code are considered and mitigates the inconsistencies that may arise due to the deletion of temporary branches). For each commit, we extract (meta)data, such as the unique commit ID (i.e., SHA), timestamp, name of the contributor who authored the commit, and files modified (or created) in the commit.
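
As an illustration of this extraction step, the sketch below parses the text output of a `git log` call into commit records. The exact command, the `@@COMMIT@@` marker, and the `Commit` record are assumptions made for this example; the text does not specify how the commit data was mined.

```python
from dataclasses import dataclass, field

# Assumed invocation (not taken from the text), run on the main branch:
#   git log --name-only --pretty=format:@@COMMIT@@%H%x1f%an%x1f%ae%x1f%aI
MARKER = "@@COMMIT@@"   # hypothetical prefix that flags a header line
SEP = "\x1f"            # unit separator emitted by %x1f in the format string


@dataclass
class Commit:
    sha: str
    author_name: str
    author_email: str
    timestamp: str                      # ISO 8601 author date
    files: list = field(default_factory=list)


def parse_git_log(text):
    """Turn the output of the assumed `git log` call into Commit records."""
    commits = []
    for line in text.split("\n"):
        if line.startswith(MARKER):
            sha, name, email, when = line[len(MARKER):].split(SEP)
            commits.append(Commit(sha, name, email, when))
        elif line.strip() and commits:
            # Non-empty lines after a header are paths changed by that commit.
            commits[-1].files.append(line.strip())
    return commits


sample = (
    "@@COMMIT@@a1b2c3\x1fSandy White\x1fsandyw@gmail.com\x1f2021-03-04T10:00:00+00:00\n"
    ".circleci/config.yml\n"
    "src/app.py\n"
    "\n"
    "@@COMMIT@@d4e5f6\x1fSandy W\x1fsandyw@gmail.com\x1f2021-03-05T09:30:00+00:00\n"
    "README.md\n"
)
commits = parse_git_log(sample)
```

Using a non-printing field separator avoids splitting on characters that can legitimately appear in author names.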
(DC2) Filtering out commits from non-human contributors. Our goal is to examine ethnic and gender diversity among contributors in open-source projects, and hence, any contribution made by non-humans (e.g., bots) should not be considered in our data analysis. Therefore, we filter out commits made by non-human contributors. To do so, we use the GitHub API to determine the commit author type, i.e., if the commit is made by a human contributor, the GitHub API returns the type USER. We filter out all commits made by authors of a different type.
To do so, we first search for DevOps tools that have the potential to be used in the studied projects. Then, we identify the adoption of such tools in the projects.
Selecting DevOps tools. We opt to select the tools that are used for CI, Deployment, Containers, Builds, and Configuration because such DevOps practices are commonly used in projects that perform DevOps (Sánchez-Gordón and Colomo-Palacios 2018). The periodic table of DevOps tools (Digital.ai 2022) offered by Digital.ai contains a list of popular DevOps tools used for these practices. For instance, for CI, the listed tools are CircleCI, TravisCI, Jenkins, AWS CodeBuild, and Codeship. Note that there are several other websites that list DevOps tools (e.g., Atlassian, Guru, Geekflare, etc.). We choose the Digital.ai periodic table for two reasons. First, this periodic table offers a comprehensive and organized representation of the DevOps tools (Digital.ai 2022). More specifically, it defines 17 categories of tools across five different licensing models from open source through enterprise, covering a wide range of functionalities that cater to diverse IT processes in the DevOps domain. Also, it reflects the votes of over 18,000 DevOps practitioners for over 400 tools. Second, the Digital.ai periodic table is widely acknowledged and cited as a reference in academic papers and industry articles in the DevOps field.
Identifying DevOps files. We inspect the documentation of each tool to identify the filenames that are used to configure those tools. For example, to configure CircleCI, a .circleci/config.yml file is required; any project that contains a commit on the .circleci/config.yml file is using (or used) the CircleCI service. Some DevOps technologies do not have a filename convention. For example, Ansible and Kubernetes are configured using .yml files, but the team may choose to name the file however they see fit. Since .yml is a popular extension used for all sorts of files, we cannot rely solely on the extension to identify DevOps files. Therefore, the first two authors inspect a sample of 400 .yml files from our dataset that are not explicitly classified as DevOps files based on the filename. The inspectors prepare a list of keywords found in the filename or file path that make a file a DevOps file or a non-DevOps file. For example, devops, deployment, kubernetes, and chef are some keywords in a file path that indicate that a .yml file is a DevOps file; in contrast, if the file path contains changelog, docs/, package, or pullapprove, then that .yml file is a non-DevOps file. During this coding, the inspectors also prepare a list of keywords by inspecting the content of a .yml file. For example, any .yml file related to Kubernetes must contain kind and apiVersion key settings.
After listing the potential keywords, we identify the .yml files that are likely to be DevOps files by matching keywords to the file paths. For the .yml files that are not easily classifiable, we parse the content in search of key settings that we identified as DevOps-related above. After applying these automatic classification steps, 3.4% of the .yml files remain ambiguous. To ensure the integrity of our categories, we remove the ambiguous files from our analysis.
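
The keyword-based classification can be sketched as follows. This is an illustrative approximation rather than the classifier used in the study: it uses only the example keywords quoted in the text (the full keyword lists are not given), the precedence of non-DevOps over DevOps path keywords is an assumption, and the content check is a crude regular expression instead of a YAML parser.

```python
import re

KNOWN_DEVOPS_FILES = (".circleci/config.yml",)   # documented filename
DEVOPS_PATH_KEYWORDS = ("devops", "deployment", "kubernetes", "chef")
NON_DEVOPS_PATH_KEYWORDS = ("changelog", "docs/", "package", "pullapprove")
# Key settings that a Kubernetes-related .yml file must contain, per the text.
KUBERNETES_KEYS = ("kind", "apiVersion")


def has_top_level_key(content, key):
    """Crude check for `key:` at the start of a line (no YAML parser)."""
    pattern = rf"^{re.escape(key)}\s*:"
    return re.search(pattern, content, flags=re.MULTILINE) is not None


def classify_yml(path, content=""):
    """Return 'devops', 'non-devops', or 'ambiguous' for a .yml file."""
    lowered = path.lower()
    if lowered.endswith(KNOWN_DEVOPS_FILES):
        return "devops"
    # Assumption: non-DevOps path keywords take precedence over DevOps ones.
    if any(k in lowered for k in NON_DEVOPS_PATH_KEYWORDS):
        return "non-devops"
    if any(k in lowered for k in DEVOPS_PATH_KEYWORDS):
        return "devops"
    # Fall back to the file content when the path is not conclusive.
    if all(has_top_level_key(content, k) for k in KUBERNETES_KEYS):
        return "devops"
    return "ambiguous"
```

For instance, `classify_yml("conf/app.yml", "apiVersion: v1\nkind: Pod\n")` is resolved by the content check, while a file with neither a telling path nor the Kubernetes keys stays ambiguous and would be dropped.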
After that, we evaluate our classifier. The second author manually classifies a sample of 400 .yml files from our dataset. Then, we calculate the Cohen's Kappa coefficient (Cohen 1968) of agreement between the author and our classifier. This coefficient is commonly used to evaluate inter-rater agreement for categorical items between two raters. The value of Cohen's Kappa coefficient ranges from -1 to +1, with values greater than zero indicating an agreement better than chance. We obtain a Kappa coefficient of 0.82, indicating near-perfect strength of agreement between the author and the classifier. The false positive rate is 0.005, meaning that 0.5% of negative instances are falsely identified as positive, while the false negative rate is 0.227, indicating that 22.7% of positive instances are incorrectly labelled as negative.
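
To make the evaluation measures concrete, the sketch below computes Cohen's Kappa, the false positive rate, and the false negative rate for a binary labelling. The function name and the toy labels are hypothetical; the manual labels are treated as the reference when computing the two error rates.

```python
def agreement_stats(manual, predicted):
    """Cohen's Kappa, false positive rate, and false negative rate for two
    parallel lists of booleans (True = DevOps file)."""
    pairs = list(zip(manual, predicted))
    n = len(pairs)
    tp = sum(m and p for m, p in pairs)
    tn = sum((not m) and (not p) for m, p in pairs)
    fp = sum((not m) and p for m, p in pairs)
    fn = sum(m and (not p) for m, p in pairs)

    observed = (tp + tn) / n
    # Chance agreement derived from each rater's marginal label frequencies.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return {"kappa": kappa, "fpr": fp / (fp + tn), "fnr": fn / (fn + tp)}


# Toy data: 10 files, one false negative and one false positive.
manual = [True] * 4 + [False] * 6
predicted = [True, True, True, False] + [False] * 5 + [True]
stats = agreement_stats(manual, predicted)
```

In this toy case the observed agreement is 0.8 and the chance agreement is 0.52, so Kappa is 0.28 / 0.48, roughly 0.58. The sketch does not guard against degenerate inputs, such as a sample in which every file has the same label.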
Cleaning files dataset. Through a preliminary inspection of the files in our dataset, we observe that a non-negligible proportion is not relevant to the purpose of our analysis, e.g., node_modules/node-inspector/node_modules/v8-debug/node_modules/node-pre-gyp/node_modules/mkdirp/package.json and vendor/k8s.io/kubernetes/cmd/mungedocs/links.go are automatically generated files that are not maintained by hand. The first two authors inspect a sample of 200 files from our dataset and prepare a list of keywords in a file path that indicate whether a file is an auto-generated dependency file. For example, the files within the folders named vendor and node_modules contain third-party dependencies. After listing the potential keywords, we identify the files that are likely to be dependency files by matching keywords to the file paths. The percentage of such files amounts to 15.7% of all the files in our dataset, and we remove them from our analysis to ensure the validity of the results.
To evaluate the aforementioned classifier, the second author manually classifies a sample of 400 files from our dataset. Then, we calculate the Cohen's Kappa coefficient of agreement between the author and our classifier. We obtain a Kappa coefficient of 0.81, showing near-perfect agreement between the author and the classifier. The false positive rate is 0.056, meaning that 5.6% of negative instances are falsely identified as positive, while the false negative rate is 0.047, indicating that 4.7% of positive instances are incorrectly labelled as negative.
Lastly, for each project, we include in our dataset the commits made after the first DevOps-related commit in the project and until the end of the year 2021. Analyzing the commits after the first DevOps-related commit ensures that the timeframes for non-DevOps and DevOps artifacts align.
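
The time-window rule can be expressed as a small filter. This is a sketch under stated assumptions: commits are represented as hypothetical `(timestamp, touches_devops_file)` tuples, timestamps are in UTC, and the first DevOps-related commit itself is kept, which the text does not spell out.

```python
from datetime import datetime, timezone

# Last instant of the study period (end of the year 2021), assumed in UTC.
END_OF_STUDY = datetime(2021, 12, 31, 23, 59, 59, tzinfo=timezone.utc)


def commits_in_study_window(commits):
    """Keep commits from the first DevOps-related commit through end of 2021.

    `commits` is a list of (timestamp, touches_devops_file) tuples; this
    representation is hypothetical.
    """
    devops_times = [t for t, is_devops in commits if is_devops]
    if not devops_times:
        return []   # no DevOps-related commit, so there is no window
    start = min(devops_times)
    return [(t, d) for t, d in commits if start <= t <= END_OF_STUDY]


def ts(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


history = [
    (ts(2015, 1, 1), False),   # before the first DevOps commit: dropped
    (ts(2016, 6, 1), True),    # first DevOps-related commit: window start
    (ts(2019, 3, 1), False),
    (ts(2022, 2, 1), True),    # after 2021: dropped
]
kept = commits_in_study_window(history)
```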
(DC4) Unifying identities. On GitHub, commits may not be attributed properly due to variations in a committer's name and email address (see https://git-scm.com/book/en/v2/Git-Basics-Git-Aliases). For example, consider the email address sandyw@gmail.com that could be associated with two names based on the local configurations of the contributor: Sandy W and Sandy White. Besides, the same contributor may use different email addresses. For example, Sandy White may use their personal email sandyw@gmail.com as well as the noreply email address (sandy@users.noreply.github.com) as their commit-email address (see https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-email-preferences/setting-your-commit-email-address). In order to unify such cases with different identities of the same GitHub contributor, we use the GitHub-alias-merging script by Vasilescu et al. (2015c). This script uses heuristics to link different aliases and email addresses belonging to the same GitHub contributor. By using this script, we identify 110,336 unique contributors from 138,012 <name, email> pairs in our original dataset. For the rest of the study, we use these unique contributor identities.
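
To illustrate the idea of alias merging, the sketch below groups <name, email> pairs that share either a normalized name or a normalized email address, using a union-find structure. It is a deliberately simplified stand-in: the alias-merging script cited above applies richer heuristics, and the exact-match rule and the name "Alex Doe" are assumptions made for this example.

```python
from collections import defaultdict


def merge_identities(pairs):
    """Group (name, email) pairs that share a normalized name or email.

    Simplified stand-in for heuristic alias merging: only exact matches
    after lower-casing and trimming are linked.
    """
    parent = list(range(len(pairs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    first_seen = {}
    for idx, (name, email) in enumerate(pairs):
        keys = (("name", name.strip().lower()), ("email", email.strip().lower()))
        for key in keys:
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx, pair in enumerate(pairs):
        groups[find(idx)].append(pair)
    return list(groups.values())


pairs = [
    ("Sandy W", "sandyw@gmail.com"),
    ("Sandy White", "sandyw@gmail.com"),
    ("Sandy White", "sandy@users.noreply.github.com"),
    ("Alex Doe", "alex@example.org"),
]
identities = merge_identities(pairs)
```

The three Sandy aliases collapse into one identity because the first two share an email address and the last two share a name. Exact name matching would wrongly merge distinct people with the same name, which is one reason real alias merging needs more careful heuristics.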
(DC5) Inferring perceptible demographics of contributors. Since we aim to investigate gender and ethnic diversity, we need to classify contributors accordingly. Since gender and ethnic identity are difficult to ascertain at scale, similar to prior work (Nadri et al. 2021a, b; Qiu et al. 2019; Vasilescu et al. 2015b), we rely on perceived identity characteristics that can be inferred based on publicly visible profile data.

Inferring perceptible genders.
It is not our aim to establish new means of inferring gender. There already exist several open-source tools that are capable of inferring gender based on the name of the contributor, such as Gender-guesser (https://pypi.org/project/gender-guesser/), GenderComputer (https://github.com/tue-mdse/genderComputer), and Wiki-Gendersort (https://github.com/nicolasberube/Wiki-Gendersort). Unfortunately, these tools are not without limitations. For example, Santamaría and Mihaljević (2018) evaluated the Gender-guesser tool and found that it left 20.12% of names unrecognized. In such cases, the tool predicts the gender as None; for example, the gender predicted from names such as rd, ga, and xulin was None. To tackle the problem of the high rate of unrecognized names, inspired by prior work (Sebo 2021), we combine the outcomes from all three gender-inferring tools and evaluate the rate of unrecognized names. Among the different combinations we study, the most effective combination to infer gender for the test dataset (provided by Santamaría and Mihaljević (2018)) is first to infer the gender using Wiki-Gendersort, and if the gender is unrecognized, then infer the gender using Gender-guesser and GenderComputer. Doing so reduces the rate of unrecognized names from 20.12% to 5.15%. Thus, we use the aforementioned combination of tools to infer the perceptible genders of contributors in our dataset. We discard from our analysis the names for which none of the tools is able to infer a perceptible gender.
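
The fallback combination described above can be sketched as a cascade over an ordered list of inferrers. The code does not call the real tools: the three lookup functions are toy stand-ins with made-up tables, the `'man'`/`'woman'`/`None` interface is hypothetical, and placing Gender-guesser before GenderComputer in the fallback order is an assumption.

```python
def infer_gender(name, inferrers):
    """Return the first recognized label from an ordered list of inferrers.

    Each inferrer is a callable mapping a name to 'man', 'woman', or None
    (unrecognized). Real tools would be wrapped to fit this interface.
    """
    for infer in inferrers:
        label = infer(name)
        if label is not None:
            return label
    return None   # unrecognized by every tool; such names are discarded


def make_lookup(table):
    """Build a toy inferrer that looks up the lower-cased first name."""
    return lambda name: table.get(name.split()[0].lower())


# Toy lookup tables standing in for the three tools (hypothetical data).
wiki_gendersort = make_lookup({"maria": "woman"})
gender_guesser = make_lookup({"sandy": "woman", "john": "man"})
gender_computer = make_lookup({})

cascade = [wiki_gendersort, gender_guesser, gender_computer]
```

With these tables, "Maria Lopez" is resolved by the first inferrer, "Sandy White" falls through to the second, and "xulin" remains unrecognized by all three.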

Inferring perceptible ethnicities.
Inspired by prior work on diversity (Nadri et al. 2021a, b; Else et al. 2022; Griffin et al. 2021), we use the Name-Prism (Ye et al. 2017) tool to infer the perceptible ethnicity of contributors. Name-Prism (Ye et al. 2017) is a name-based perceptible ethnicity classification tool, which uses name embeddings to predict the ethnicity of a person using their name. This classifier was trained using US Census Bureau data, and it predicts the probabilities of a given name belonging to a person of the White, Black, Asian/Pacific Islander (API), Hispanic, American Indian/Alaskan Native (AIAN), and Mixed Race (2PRACE) groups. A recent study by Preoţiuc-Pietro and Ungar (2018) evaluated the performance of Name-Prism's (Ye et al. 2017) ethnicity classifier. They measured the Area Under the Receiver Operating Characteristic curve (AUC) of Name-Prism with respect to the top four ethnic groups: White, API, Hispanic, and Black. Their evaluation showed that the AUC values for classifying people as White, API, Hispanic, and Black were 0.719, 0.765, 0.740, and 0.681, respectively. To strengthen the validity of our study, we discard names where the ethnicity is unknown from our ethnic diversity analysis; we find only 0.03% (five) and 0.06% (59) names in which the ethnicity is unknown among DevOps and non-DevOps contributors, respectively. Furthermore, following prior studies that rely on the US Census Bureau's classification to label developers across different countries and contexts (Nadri et al. 2021b), we only use contributors' names where the perceptible ethnicity is inferred by Name-Prism (Ye et al. 2017) with a confidence level greater than 0.8.
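
The confidence filter can be illustrated as follows. The sketch assumes that the classifier output is available as a mapping from group labels to probabilities and that the "confidence level" is the probability of the most likely group; neither the dictionary layout nor the example probabilities come from Name-Prism itself.

```python
CONFIDENCE_THRESHOLD = 0.8   # cut-off stated in the text


def perceptible_ethnicity(probabilities, threshold=CONFIDENCE_THRESHOLD):
    """Return the most probable label if its probability exceeds the threshold.

    `probabilities` maps group labels to probabilities; this layout is a
    hypothetical stand-in for a classifier's output.
    """
    if not probabilities:
        return None   # ethnicity unknown: excluded from the analysis
    label, score = max(probabilities.items(), key=lambda item: item[1])
    return label if score > threshold else None


# Made-up probability vectors, for illustration only.
confident = {"White": 0.91, "API": 0.04, "Hispanic": 0.03, "Black": 0.02}
uncertain = {"White": 0.55, "Hispanic": 0.40, "API": 0.03, "Black": 0.02}
```

Here `perceptible_ethnicity(confident)` yields a label, whereas `perceptible_ethnicity(uncertain)` returns `None` and the name would be left out of the ethnic diversity analysis.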

The descriptive statistics of the curated dataset.
Table 1 shows the descriptive statistics of our final dataset. It contains 1,197,829 DevOps file changes and 29,750,595 non-DevOps file changes made by 110,336 contributors to 450 GitHub projects. An anonymized version of this dataset is available online along with our replication package. Table 2 shows a sample of data from this dataset. Some data items in this preview table are altered to protect the contributors' identities.

(PA) Preliminary Analysis: Identifying DevOps and Non-DevOps Contributors
The goal of our study is to perform a comparison of the diversity of contributors to DevOps and non-DevOps artifacts. Thus, we conduct a preliminary analysis to identify contributors to DevOps files and distinguish them from those who contribute only to non-DevOps files. To do so, for each contributor in our dataset, we check for the type of files changed in the commits made by the contributor.
Figure 5 presents box-plots showing the distribution of contributors in a project based on their contributions to DevOps and non-DevOps artifacts. As shown in the figure, the majority of the contributors in a project (median = 86.70%) only contribute to non-DevOps files; we refer to them as non-DevOps contributors. However, a non-negligible percentage (median = 13.30%) of contributors made at least one commit to DevOps files. We refer to such contributors as DevOps contributors.
One may speculate that considering every developer who contributes to at least one DevOps file as a DevOps contributor may not be a precise definition because a DevOps contributor should make more than one DevOps file change. However, if we consider a DevOps contributor as a developer who contributes to more than one DevOps file, we need to determine the threshold number of DevOps file changes a developer should make to be included in our definition of a DevOps contributor. Thus, we plot the distribution of the number of DevOps file changes made by a developer vs. the number of such developers in our dataset; Fig. 6 shows the distribution.
From the figure, we observe that using a threshold (t) of two file changes (t = 2) reduces the number of DevOps contributors included in our original definition from 22,580 (t = 1) to 16,246 (t = 2); using a threshold of ten file changes (t = 10 which is closer to a knee in
However, setting a threshold may lead to the exclusion of developers who make valuable contributions to DevOps processes but do not meet the threshold. Therefore, a more inclusive and realistic approach is to define a DevOps contributor as a developer who has contributed to at least one DevOps file, recognizing that even a single contribution can have a notable impact on the DevOps process and culture. Thus, for the rest of this paper, we adhere to the initial definition that we employed: contributors who made at least one commit to a DevOps file are DevOps contributors.
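
The definition adopted above, together with the alternative thresholds that were considered, can be sketched as a function of the threshold t. The input format (one hypothetical `(contributor, is_devops_file)` tuple per file change) and the contributor identifiers are illustrative.

```python
from collections import Counter


def devops_contributors(file_changes, t=1):
    """Return contributors with at least `t` changes to DevOps files.

    `file_changes` is an iterable of (contributor, is_devops_file) tuples,
    one per file change; t = 1 matches the definition adopted in the text.
    """
    counts = Counter(who for who, is_devops in file_changes if is_devops)
    return {who for who, n in counts.items() if n >= t}


changes = [
    ("c1", True), ("c1", True), ("c1", False),
    ("c2", True),
    ("c3", False),
]
everyone = {who for who, _ in changes}
devops = devops_contributors(changes)           # {"c1", "c2"}
non_devops = everyone - devops                  # {"c3"}
stricter = devops_contributors(changes, t=2)    # {"c1"}
```

Raising t from 1 to 2 drops contributor c2, which mirrors how a stricter threshold shrinks the set of DevOps contributors.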

Independent Analysis of Ethnic and Gender Diversity
In this section, we discuss our research questions (RQs). For each RQ, we describe our approach and then present our results.
(RQ1) Does the perceptible ethnic and gender diversity of DevOps contributors differ from ethnic and gender diversity of non-DevOps contributors?
Approach. In this RQ, we quantitatively analyze the ethnic and gender diversity of DevOps and non-DevOps contributors in the studied projects. In particular, we perform two analyses. First, we compute the percentages of DevOps and non-DevOps contributors perceptible as different ethnicities and genders. Then, we statistically test the significance of the difference between the percentages of DevOps contributors and non-DevOps contributors. We use the Wilcoxon signed rank test with a 95% confidence level. We use this non-parametric test because we cannot assume our data to be normally distributed. Second, we compute four diversity metrics, namely richness, evenness, the Blau index (Simpson 1949), and prevalence rankings and diffusion score, which are used to measure the diversity in a community. These metrics have been used in prior studies for similar purposes (Buch et al. 2021; Parrotta et al. 2016; Vasilescu et al. 2015b; Zohar and Belmaker 2005).
The first metric, richness (S), measures the number of groups in a community (DeJong 1975). We use the richness to measure the number of ethnicities of contributors to DevOps and non-DevOps files in a project. The higher the richness, the more diverse the community is. For example, if a project contains only contributors with perceptibly White names, the ethnic richness of that project is one. If a project contains contributors of two perceptible ethnicities (e.g., White and Asian), the richness value is two. Similarly, for gender diversity, if a project includes contributors who are perceptible as men only, the richness is one. If a project includes perceptible men as well as perceptible women, the richness is two.
The second metric, evenness (E), measures the relative abundance of the groups in a community. We use the Brillouin metric (Brillouin 1951) to measure the evenness. Equation (1) shows the formula for computing the evenness in a community, where $n_i$ is the number of individuals in group $i$, $S$ is the number of groups, and $N$ is the total number of individuals in the community.
$$E = \frac{\ln N! - \sum_{i=1}^{S} \ln n_i!}{N} \qquad (1)$$
In the context of our study, suppose a project contains 10 contributors ($N = 10$) spanning four ethnicities ($S = 4$): three perceptibly Asian contributors ($n_1 = 3$), two perceptibly Black contributors ($n_2 = 2$), one perceptibly Hispanic contributor ($n_3 = 1$), and four perceptibly White contributors ($n_4 = 4$). Applying Equation (1) to these counts yields $E = 0.94$. This value shows a high level of evenness, indicating a high level of ethnic diversity of contributors. Conversely, the closer this number is to zero, the lower the evenness, and thus the lower the diversity. Following the same equation, we compute the evenness for gender diversity as well.
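The worked example above can be checked with a short sketch. It assumes that Equation (1) is the standard Brillouin index with natural logarithms, $(\ln N! - \sum_i \ln n_i!)/N$; that form is our assumption, chosen because it reproduces the reported value for the example counts. The function names are illustrative only.

```python
from math import lgamma


def log_factorial(k):
    """Return ln(k!) using the log-gamma function: ln(k!) = lgamma(k + 1)."""
    return lgamma(k + 1)


def richness(counts):
    """Richness S: the number of groups with at least one member."""
    return sum(1 for n in counts if n > 0)


def brillouin(counts):
    """Brillouin index (ASSUMED form of Equation (1)): (ln N! - sum ln n_i!) / N."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return (log_factorial(total) - sum(log_factorial(n) for n in counts)) / total


# Example counts from the text: Asian = 3, Black = 2, Hispanic = 1, White = 4.
example = [3, 2, 1, 4]
print(richness(example))             # 4
print(round(brillouin(example), 2))  # 0.94
```

Under this assumed form, a project whose contributors all fall into one group scores exactly zero, which matches the later remark that the evenness is zero in years where every DevOps contributor is perceptible as White.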
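The next metric, the Blau index, a.k.a. the diversity index (D) (Simpson 1949), computes the probability that two individuals randomly selected from a community belong to two different groups within the community. Equation (2) shows the formula to compute the Blau index of a community, where $n_i$ is the number of individuals in group $i$, $N$ is the total number of individuals, and $S$ is the number of groups.
$$D = 1 - \sum_{i=1}^{S} \left(\frac{n_i}{N}\right)^2 \qquad (2)$$
The Blau index ranges from zero to one. The closer the Blau index is to one, the more diverse the community is.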
In the previous example with 10 contributors, the Blau index is computed as $D = 1 - \left[\left(\frac{3}{10}\right)^2 + \left(\frac{2}{10}\right)^2 + \left(\frac{1}{10}\right)^2 + \left(\frac{4}{10}\right)^2\right] = 0.7$; accordingly, there is a high probability that two contributors randomly selected from the project belong to two different ethnicities, thus indicating a high level of ethnic diversity in the project. Similarly, we compute the Blau index for gender diversity as well.
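The Blau index computation in the example can be reproduced with a minimal sketch. The function name is illustrative, and the code uses the one-minus-sum-of-squared-proportions form that the worked example implies.

```python
def blau_index(counts):
    """Blau index: 1 minus the sum of squared group proportions."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts)


# Example counts from the text: Asian = 3, Black = 2, Hispanic = 1, White = 4.
print(round(blau_index([3, 2, 1, 4]), 2))
```

A single-group community yields zero, and the value approaches one as members spread evenly over more groups.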
The final diversity metric we compute in our analysis is the prevalence rankings and diffusion score.39 The prevalence ranking in a community is the ranking of groups in descending order of the number of individuals belonging to each group. From the prevalence rankings of a community, the first-, second-, and third-prevalent groups can be obtained. The percentage of the individuals that are not in those first-, second-, or third-prevalent groups combined is called the diffusion score. The first-, second-, and third-prevalent ranks in the previous example are held by perceptibly White, Asian, and Black contributors, with proportions of 40%, 30%, and 20%, respectively. Thus, the diffusion score is 100% − (40% + 30% + 20%) = 10%, which is equal to the percentage of the least-prevalent ethnicity (Hispanic) in this example since there are only four ethnicities. For gender diversity in the context of our study, the diffusion score is equal to the percentage of contributors from the minority gender. This is because we consider a binary classification of gender, and accordingly, there are only first- and least-prevalent groups with respect to the perceptible gender. Thus, we consider the percentage of contributors from the least-prevalent gender in a particular project/setting as the diffusion score of that case.
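The prevalence ranking and diffusion score described above can be sketched as follows. The function names and the `top` parameter are our own illustrative choices: `top=3` mirrors the ethnicity example, while `top=1` mirrors the binary gender case, where the score reduces to the minority share.

```python
def prevalence_ranking(group_counts):
    """Rank groups in descending order of their number of members."""
    return sorted(group_counts.items(), key=lambda item: item[1], reverse=True)


def diffusion_score(group_counts, top=3):
    """Percentage of individuals outside the `top` most prevalent groups."""
    total = sum(group_counts.values())
    if total == 0:
        return 0.0
    ranked = prevalence_ranking(group_counts)
    outside = sum(count for _, count in ranked[top:])
    return 100.0 * outside / total


ethnicities = {"Asian": 3, "Black": 2, "Hispanic": 1, "White": 4}
print(prevalence_ranking(ethnicities)[:3])
print(diffusion_score(ethnicities))  # 10.0
```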
To test the statistical difference between diversity metrics for DevOps and non-DevOps, we first compute the metrics for each project. Then, to compare the metrics of DevOps and non-DevOps contributors, we perform the Wilcoxon signed rank test with a 95% confidence level and apply the Bonferroni correction40 to account for multiple comparisons. Our adjusted significance level (α) is 0.0023.
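The correction and effect-size steps can be illustrated with a small sketch. Two points are assumptions rather than statements from the text: the magnitude thresholds for Cliff's delta (0.147, 0.33, 0.474) are the commonly used ones, which agree with the labels reported in this section, and the number of comparisons (22) is inferred only because 0.05 / 22 rounds to 0.0023. The Wilcoxon test statistic itself is left to a statistics package.

```python
def bonferroni_alpha(alpha, num_tests):
    """Bonferroni-adjusted significance level for `num_tests` comparisons."""
    return alpha / num_tests


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta):
    """Label |delta| using commonly cited thresholds (an assumption here)."""
    size = abs(delta)
    if size < 0.147:
        return "negligible"
    if size < 0.33:
        return "small"
    if size < 0.474:
        return "medium"
    return "large"


# 22 comparisons is a hypothetical count that happens to give 0.0023.
print(round(bonferroni_alpha(0.05, 22), 4))
print(magnitude(0.415))
```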

(RQ1-a) Perceptible Ethnic Diversity
Observation 1) Contributors to DevOps artifacts tend to have less ethnic diversity than contributors to non-DevOps artifacts.
Figure 7 shows the distribution of the percentage of contributors to DevOps and non-DevOps artifacts in the studied projects, per ethnicity. From the figure, we observe that perceptible Whites are the majority of DevOps contributors (median = 87.70%) and non-DevOps contributors (median = 85.50%).
The second dominating ethnicity in our dataset is Asian/Pacific Islander (API). The median percentages of DevOps and non-DevOps contributors with perceptibly API names are 9.10% and 10.53%, respectively. We find a statistically significant difference between the percentages of DevOps and non-DevOps contributors with perceptibly API names but with a negligible effect size (Wilcoxon, p << α = 0.0023, one-tailed, paired; Cliff's |δ| = 0.132). The third dominating ethnicity is Hispanic. The median percentages of DevOps and non-DevOps contributors with perceptibly Hispanic names are 0% and 3.25%, respectively. We find a statistically significant difference between DevOps and non-DevOps contributors with a medium effect size (Wilcoxon, p << α = 0.0023, one-tailed, paired; Cliff's |δ| = 0.415). The fourth dominating group is the contributors having perceptibly Black names. For those contributors, we find a statistically significant difference between the percentages of DevOps (median = 0%) and non-DevOps (median = 0.77%) contributors with a large effect size (Wilcoxon, p << α = 0.0023, one-tailed, paired; Cliff's δ = 0.783). Finally, we find that contributors with perceptibly American Indian/Alaskan Native (AIAN) names are the least dominating ethnic group among DevOps and non-DevOps contributors because only two of the 450 projects we use for the analysis have contributors perceptible as AIAN. Moreover, none of the DevOps contributors in those two projects are perceptible as AIAN.
Fig. 7 Bean plots showing the distribution of percentages of DevOps and non-DevOps contributors from four perceptible ethnicities (i.e., White, API, Hispanic, and Black). The solid lines represent the median percentages, and the dotted lines represent the first and third quartiles
Furthermore, Table 3 presents the analysis results of ethnic diversity metrics (i.e., richness, evenness, etc.) for DevOps and non-DevOps contributors. From the table, we observe that the median values of ethnicity richness of DevOps and non-DevOps contributors are two and three, respectively. Also, we find statistically significant differences in terms of richness of ethnic diversity between DevOps and non-DevOps contributors, with a large effect size.
Similarly, the ethnic evenness of DevOps contributors is significantly less than that of non-DevOps contributors, with a medium effect size. For the Blau index, we find statistically significant differences between the ethnic diversity of DevOps and non-DevOps contributors, but the effect size is negligible. Also, note that for the diffusion score, we do not find a statistically significant difference between DevOps and non-DevOps contributors.

(RQ1-b) Perceptible Gender Diversity
Observation 2) The perceptible gender diversity of DevOps contributors tends to be less than that of non-DevOps contributors.
Figure 8 shows the distribution of percentages of DevOps and non-DevOps contributors who are perceptible as men and women. From the figure, we observe that contributors who are perceptible as men are the majority of both DevOps contributors (median = 93.75%) and non-DevOps contributors (median = 92.82%). We find a statistically significant difference between the percentages of DevOps and non-DevOps contributors who are perceptible as men, with a small effect size (Wilcoxon, p << α = 0.0023, one-tailed, paired; Cliff's |δ| = 0.149).
Moreover, we find that the median percentage of DevOps contributors who are perceptible as women (median = 6.25%) is less than that of non-DevOps contributors (median = 7.18%). We further find a statistically significant difference between DevOps and non-DevOps contributors who are perceptible as women, with a small effect size. This is in line with the results of the Stack Overflow survey (2022)41 as well. This survey shows that the percentage of DevOps developers who identify themselves as women is 2.10%, while the percentage of the other developers, i.e., non-DevOps, who identify themselves as women is 5.13%.
Furthermore, Table 4 shows the analysis results of gender diversity metrics for DevOps and non-DevOps contributors. From the table, we observe that the median value of richness in terms of gender is two for both DevOps and non-DevOps contributors. However, we find a statistically significant difference in gender richness between DevOps and non-DevOps contributors, with a small effect size. Similarly, the evenness, Blau index, and diffusion score of DevOps contributors in projects, in terms of gender diversity, are statistically less than those of non-DevOps contributors, with a small effect size. Furthermore, we find that 29.11% of the projects in our dataset did not contain any DevOps contributors perceptible as women. In contrast, only 2.00% of the projects in our dataset did not contain non-DevOps contributors perceptible as women.
Note that we exclude contributors whose perceptible gender is not determined by gender-inferring tools (Section 2 DC5). However, studying the effects of perceiving all unknown genders as either men or women can provide valuable insights into gender representation and potential biases in various contexts. Thus, we follow the approach of Vasilescu et al. (2014), who faced a similar issue. In particular, we investigate the impact of assuming all unknown genders as women. Since women contributors are typically underrepresented in GitHub (Vasilescu et al. 2015b), assuming all unknown genders as women allows us to evaluate whether biases persist even when we artificially increase the representation of women contributors. From this analysis, we observe that our findings still hold.
41 https://insights.stackoverflow.com/survey
As for perceptible ethnic diversity, we observe that contributors perceptible as non-White (API, Hispanic, Black, and AIAN) are more underrepresented among DevOps contributors compared to non-DevOps contributors. With respect to perceptible gender diversity, we find contributors perceptible as women are more underrepresented among DevOps contributors compared to non-DevOps contributors. Overall, there is a statistically significant difference in the diversity metrics between DevOps and non-DevOps contributors.
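The sensitivity analysis on unknown genders described in this subsection can be sketched as a relabeling step followed by recomputing the share of women. The label strings and the function name are hypothetical; the sample data is made up purely for illustration.

```python
def women_percentage(labels, unknown_as=None):
    """Share of perceptible women among contributors.

    labels: list of 'man', 'woman', or 'unknown' (hypothetical label names).
    unknown_as=None drops unknowns (the main analysis);
    unknown_as='woman' reassigns them (the sensitivity analysis).
    """
    if unknown_as is None:
        resolved = [g for g in labels if g != "unknown"]
    else:
        resolved = [unknown_as if g == "unknown" else g for g in labels]
    if not resolved:
        return 0.0
    return 100.0 * resolved.count("woman") / len(resolved)


# Illustrative data only: 15 men, 1 woman, 4 unknown.
labels = ["man"] * 15 + ["woman"] + ["unknown"] * 4
print(women_percentage(labels))
print(women_percentage(labels, unknown_as="woman"))
```

Running the same comparison between DevOps and non-DevOps contributors under both settings shows whether a conclusion depends on how unknown names are treated.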

(RQ2) How does the Distribution of Perceptible Ethnic and Gender Diversity Change as Projects Age?
Approach. In this RQ, we aim to examine the evolution of gender and ethnic diversity of DevOps and non-DevOps contributors in the studied projects. To do so, we partition the commits in our dataset into segments, where each segment represents the set of commits made during a year. Then, for each segment (year), we identify DevOps and non-DevOps contributors who contributed to the commits made during that year and analyze perceptible ethnic and gender diversity. In particular, we analyze the percentages of and compute the diversity metrics for DevOps and non-DevOps contributors in terms of ethnic (RQ2-a) and gender diversity (RQ2-b). Lastly, we compute the average growth in diversity metrics over the last ten years. Equation (3) shows how we compute the average growth, for example, in evenness, for the last ten years where n = 10.
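The yearly segmentation and the growth computation can be sketched as below. This is a hedged illustration: the `(author, commit_date)` input shape is hypothetical, and the average growth is assumed to be the mean of consecutive year-over-year differences, which is one plausible reading of Equation (3) rather than a confirmed definition.

```python
from collections import defaultdict
from datetime import date


def contributors_by_year(commits):
    """Group authors by commit year.

    commits: iterable of (author, commit_date) pairs (hypothetical shape).
    """
    by_year = defaultdict(set)
    for author, when in commits:
        by_year[when.year].add(author)
    return dict(by_year)


def average_growth(yearly_values):
    """Mean year-over-year change of a metric (ASSUMED form of Equation (3))."""
    steps = [later - earlier
             for earlier, later in zip(yearly_values, yearly_values[1:])]
    return sum(steps) / len(steps) if steps else 0.0


# Illustrative data only.
commits = [("alice", date(2010, 3, 1)), ("bob", date(2010, 7, 9)),
           ("alice", date(2011, 1, 15))]
print(contributors_by_year(commits))
print(average_growth([0.10, 0.20, 0.40]))
```

Each yearly contributor set would then be passed to the diversity metrics, and the resulting per-year values to `average_growth`.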

(RQ2-a) Change in Perceptible Ethnic Diversity Over Time
Observation 3) Over time, the overall perceptible ethnic diversity of DevOps and non-DevOps contributors increases.Still, the contributors perceptible as non-White (API, Hispanic, and Black) are the minority.
Figure 9 shows (a) the percentage of DevOps contributors from each perceptible ethnicity per year and (b) the values of the diversity metrics per year. From Fig. 9a, we observe that the percentage of DevOps contributors who are perceptible as White generally varies between 83.98% and 100%.
Also, we observe that all DevOps contributors who made commits in the studied projects before 2010 are perceptible as White (except for the year 2006). Hence, the evenness metric of these years is zero (Fig. 9b). From 2010 onward, DevOps contributors with perceptibly non-White names have started to participate in the studied projects; however, the percentage remains low. For example, the percentage of DevOps contributors with perceptibly API names ranges from 0%–12.63%, and the percentage of contributors with perceptibly Hispanic names ranges from 0%–3.30%. This is further evident in Fig. 9b. The figure shows that the diversity metrics for the ethnicity of DevOps contributors are increasing over time, yet the evenness, Blau index, and diffusion score mostly remain constant over the last ten years. In particular, the evenness ranges from 0.068–0.138, and its average growth is 0.008 over the last ten years; the Blau index ranges from 0.136–0.278 with an average growth of 0.016; similarly, the normalized diffusion score42 ranges from 0.045–0.126, and its average growth is 0.009 over the last ten years.
The diversity of non-DevOps contributors belonging to different ethnicities follows a similar trend to that of DevOps contributors. Figure 10 shows that diversity metrics for ethnicity of non-DevOps contributors are gradually increasing over time. We observe that the ethnic evenness, Blau index, and diffusion score for non-DevOps contributors range between 0.079–0.157, 0.158–0.315, and 0.051–0.149, respectively, over the last ten years; their average growth values are 0.009, 0.017, and 0.010, respectively.
Empirical Software Engineering (2023) 28:150
Fig. 9 Change in the perceptible ethnic diversity of DevOps contributors over time
Note that the above observation discussing the increase of ethnic diversity over time is in line with the diversity analysis of US-based DevOps job-seekers by Zippia,43 a web application used to search for US-based jobs. Zippia's analysis revealed that the percentage of non-White developers increased from 37.16% in 2010 to 42.77% in 2021.

(RQ2-b) Change in Perceptible Gender Diversity Over Time
Observation 4) Over time, the perceptible gender diversity of DevOps and non-DevOps contributors is increasing.Still, the contributors who are perceptible as women remain underrepresented.
Figures 11 and 12 show (a) the percentage of perceived genders of DevOps and non-DevOps contributors and (b) the change in diversity metrics over time. We observe that the percentage of DevOps and non-DevOps contributors who are perceptible as men varies between 87.50%–100% and between 90.52%–100%, respectively. Moreover, we observe that the overall richness, evenness, Blau index, and diffusion score for DevOps and non-DevOps contributors have been increasing over time, yet the improvement is gradual. For example, the evenness for DevOps contributors over the past ten years ranges from 0.040–0.086, while it is 0.047–0.085 for non-DevOps contributors. In addition, we find that the average growth in evenness is 0.005 for DevOps contributors and 0.004 for non-DevOps contributors. We observe similar trends for the Blau index and diffusion score as well. This observation complements the study by Prana et al. (2021), which found that gender diversity in open-source projects (GHTorrent44 dataset) has increased over time (2014–2018); however, there is still much room for improvement. Finally, similar to the RQ1-b sensitivity analysis, for RQ2-b, we investigate the impact of assuming all contributors with unknown genders as women (Vasilescu et al. 2014). We find that the sensitivity analysis results do not change the observation above. A full preview of our results assuming all unknown genders are women is available in our Appendix B.
Although the perceptible ethnic diversity of contributors is increasing over time, DevOps and non-DevOps contributors with perceptibly non-White names are greatly underrepresented. Similarly, the perceptible gender diversity of contributors is gradually increasing over time. Still, perceptible women remain substantially underrepresented.

Intersectional Analysis of Perceptible Ethnic and Gender Diversity
Our study thus far has independently analyzed the perceptible gender and ethnic diversity of DevOps and non-DevOps contributors. However, it is unclear how diverse DevOps and non-DevOps contributors are when considering the intersection of minority groups of perceptible ethnicity and gender (e.g., perceptibly women contributors with perceptibly non-White names). In fact, prior studies (Albusays et al. 2021; Green 2017; Modi et al. 2012; Prana et al. 2021; Ross et al. 2020; Varma et al. 2023) have shown that individuals who belong to the intersection of two minority groups face specific challenges in the fields of Science, Technology, Engineering, and Mathematics (STEM). Indeed, contributors who belong to two or more underrepresented categories tend to receive fewer opportunities (Green 2017).
Hence, to shed light on the intersectionality of gender and ethnic diversity of DevOps and non-DevOps contributors, we examine the participation of contributors with perceptibly non-White and women's names. We first compute the percentage of DevOps and non-DevOps contributors perceptible as women of different perceived ethnicities. Then, we test the significance of the difference between such DevOps and non-DevOps contributors in the studied projects using the Wilcoxon signed rank test with a 95% confidence level (with Bonferroni correction; α = 0.0023). Lastly, to examine the trend of intersectionality over time, we compute the diversity metrics (e.g., richness and evenness) for DevOps and non-DevOps contributors per year.
Observation 5) Among perceptibly women DevOps and non-DevOps contributors, perceptibly White contributors are the majority.
Table 5 shows the percentage of perceptible ethnicities of perceptibly women DevOps and non-DevOps contributors (since none of the DevOps contributors in our dataset are perceptible as AIAN, we do not consider AIAN in this intersectional analysis). From the table, we observe that contributors with perceptibly White names are the majority among perceptibly women DevOps contributors (median = 100%) and non-DevOps contributors (median = 70%) in a project. The second ethnic majority among perceptibly women contributors is API; the median percentages of DevOps and non-DevOps contributors are 0% and 25%, respectively. Perceptibly women contributors with perceptibly Hispanic and Black names are the least included among DevOps and non-DevOps contributors (median = 0%).
From the table, we also observe statistically significant differences in the perceived ethnic diversity between DevOps and non-DevOps contributors with perceptibly women names. In particular, the contributors who are perceptible as White women are more represented in DevOps compared to non-DevOps, with a small effect size (Wilcoxon, p << α = 0.0023, one-tailed, paired; Cliff's |δ| = 0.221). With respect to contributors who belong to the intersection of two minorities, perceptibly API women and Hispanic women contributors are more underrepresented among DevOps contributors compared to non-DevOps contributors, with a small effect size. Lastly, we do not observe a significant difference between the DevOps and non-DevOps contributors who are perceptible as Black women. However, we observe that a substantial proportion of projects do not contain such contributors. In particular, 34% of the projects do not have perceptibly Black women among DevOps contributors, while 4% of the projects do not have any perceptibly Black women among non-DevOps contributors.
Observation 6) Over time, the perceptible ethnic diversity of perceptibly women DevOps and non-DevOps contributors is increasing.Perceptibly women contributors with perceptibly White names remain the majority for both DevOps and non-DevOps contributors.
Figure 13 presents (a) the percentages of different perceptible ethnicities of perceptibly women DevOps contributors per year, and (b) the values of the corresponding diversity metrics for perceptibly women DevOps contributors per year. From Fig. 13a, we observe that the number of ethnicity groups of DevOps contributors with perceptibly women names is increasing. For example, the percentage of perceptibly API women contributors has increased from 0% in 2010 to 22.9% in 2021 (excluding the exception of 2006). This is further evident in Fig. 13b, as it shows an overall increase in diversity metrics with respect to the perceived ethnic diversity of perceptibly women DevOps contributors. For example, the ethnic evenness of perceptibly women DevOps contributors increased from zero in 2010 to 0.218 in 2021. Still, we observe that the improvement in evenness over the last few years is not substantial; the average growth in evenness over the last ten years is 0.007. We observe similar trends for the Blau index and diffusion score as well.
Fig. 13 Change in the perceptible ethnic diversity of the perceptibly women DevOps contributors
For a subset of authors (7.4%), ethnic and gender-inferring tools could not identify both the perceptible ethnicity and the gender with the expected confidence level. Thus, for years 2005, 2007, 2008, and 2009 in Fig. 13a, we do not observe perceptibly women DevOps contributors with a perceptible ethnicity confidence level greater than 0.8. For 2006, we find an exception where only one perceptibly woman DevOps contributor is available with a perceptible ethnicity confidence level greater than 0.8 in our dataset.
With respect to non-DevOps contributors, Fig. 14 shows that perceptibly women contributors of different ethnicities follow a similar trend to DevOps contributors. In particular, the richness is increasing, yet the other metrics have remained almost constant in the last ten years; for example, the evenness ranges between 0.165–0.229, and the average growth is only 0.005.
Contributors who are perceptible as White women are the majority of perceptible women contributors, followed by contributors with perceptibly API, Hispanic, and Black names. We find statistical differences between the participation of perceptible women in DevOps and non-DevOps (e.g., contributors who are perceptibly API and Hispanic women are more underrepresented in DevOps compared to non-DevOps). Additionally, it is worth noting that perceptibly Black women are absent among DevOps contributors in 34% of the projects, while among non-DevOps contributors, this is only 4%. Over time, the richness of perceptible ethnic diversity of perceptibly women contributors tends to increase, yet the diversity metrics remain low.

Threats to Validity
In this section, we describe the threats to the validity of our study.

Construct Validity
We use the Name-Prism (Ye et al. 2017) tool to infer the perceptible ethnicity of contributors.
Using Name-Prism to infer ethnicity has three main limitations. First, the ethnicity of an individual is a complex and multifaceted social construct that is not always easily identifiable (Fearon and Laitin 2000), and the tool may misassign ethnicities to contributors. For example, contributors with Brazilian names who are not Hispanic (because their ancestors do not speak Spanish) may be perceived as Hispanic by Name-Prism. Mixed-race contributors also may be perceived as belonging to a particular ethnicity, which may not necessarily align with the ethnicity that they identify with personally. Also, some Black individuals may have names perceived as White (Fryer and Levitt 2004), and using the perceived ethnicity may exclude individuals who identify as Black. However, note that in open-source communities, such as GitHub, one would only perceive the ethnicity of another contributor unless it is otherwise revealed. Thus, our goal is to study the perceived diversity instead of the actual diversity. Moreover, as suggested by prior work (Nadri et al. 2021b), we only use contributors' names for which the perceptible ethnicity is inferred by the tool with a confidence level greater than 0.8 for better validity of our results.
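The confidence filter mentioned above can be sketched as follows. The input shape (a dictionary from ethnicity label to probability) is a hypothetical stand-in and not Name-Prism's actual response format; only the 0.8 threshold comes from the text.

```python
def confident_ethnicity(scores, threshold=0.8):
    """Return the top-scoring label only if its probability exceeds `threshold`.

    scores: dict mapping an ethnicity label to a probability
            (hypothetical shape, not a real tool's output format).
    """
    if not scores:
        return None
    label, probability = max(scores.items(), key=lambda item: item[1])
    return label if probability > threshold else None


# Illustrative values only.
print(confident_ethnicity({"White": 0.91, "API": 0.05, "Hispanic": 0.04}))
print(confident_ethnicity({"White": 0.60, "Hispanic": 0.40}))
```

Names that return `None` would be left out of the ethnicity analyses, which is why some early years contain no contributors at the required confidence level.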
The second limitation of using Name-Prism is that the US government officially categorizes people with origins in Lebanon, Iran, Egypt, and other countries in the Middle East and North Africa (MENA) region as White.45 According to Maghbouleh et al. (2022), both non-MENA Whites and MENA individuals consider MENA-related traits as MENA rather than White. In addition, treating individuals from the MENA region as Whites does not necessarily correspond to the discrimination-related experiences of many of these people.46 Because of this, considering individuals from the MENA region as White as per the US Census Bureau may not accurately reflect the perceptible diversity of the DevOps and non-DevOps contributors we study. Therefore, to check the validity of our conclusions, we recompute our results by considering the perceptibly White contributors with perceptible MENA-related traits as a separate ethnic group. However, the contributor names available in our current dataset do not include the necessary information (e.g., nationality) to check if a contributor is associated with a MENA-related trait or not. Thus, we use Name-Prism's perceptible nationality-inferring API (which has an F-score of 0.795). We find our conclusions remain unchanged. A full preview of our results corresponding to this new distinction of contributors perceptible as from MENA countries is available in our Appendix E.
The third limitation of using Name-Prism is that the results obtained with respect to the US Census Bureau's classification may not be completely generalizable for contributors who are based outside the US (Albusays et al. 2021; Sansone 2003). To reflect on the validity of our study, we perform a new analysis to understand the impact of using Name-Prism for both US-based and non-US-based contributors. To categorize contributors into "US-based" and "Outside US," we attempt to use the GitHub API to obtain the geographical locations, specifically the countries, of the contributors within our dataset. Because our current dataset does not include GitHub usernames, which are required to retrieve locations via the GitHub API,47 we manually check the GitHub profiles of those contributors for locations. Note that Name-Prism's perceptible nationality-inferring API that we used in our previous analysis on contributors perceptibly associated with MENA-related traits does not infer locations but rather nationalities; for example, it does not make the distinction between perceptible Whites from the US and perceptible Whites from other countries. For our manual check, we obtain a statistically representative sample of 500 contributors48 by searching on GitHub, and we check if there is a substantial difference in the observations we made with respect to the two distinct groups based on their locations: "US-based" and "Outside US." We do not observe substantial changes in the observations concerning the two groups. For example, we find that, regardless of the location, perceptibly White contributors are the majority; contributors from other perceptible ethnicities are generally underrepresented. This is consistent with the observations we made in Section 3. A full preview of our results corresponding to this new analysis is available in our Appendix J.
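The minimum sample size of 383 cited for the 500-contributor sample can be checked with a short sketch. It assumes Cochran's formula with a finite population correction, a z-score of 1.96, and maximum variability (p = 0.5); it also assumes the population is the 110,336 developers reported in the abstract. None of these choices is stated explicitly in the text.

```python
import math


def min_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size with finite population correction.

    z=1.96 corresponds to a 95% confidence level; p=0.5 is the most
    conservative assumption about the underlying proportion.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))


# 110,336 is the developer count from the abstract (assumed population).
print(min_sample_size(110_336))  # 383
```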
Furthermore, to facilitate future studies on diversity, we encourage researchers to train more sophisticated ethnicity-inferring tools with datasets spanning many countries and diverse populations.
We use gender-inferring tools (e.g., Gender-guesser) to infer the perceptible gender of contributors. Using such tools to infer gender has two main limitations. First, gender-inferring tools may not recognize the perceptible gender correctly, as discussed in Section 2 DC4 (e.g., the rate of unrecognized names in Gender-guesser was 20.12% (Santamaría and Mihaljević 2018)). We tackle this problem by combining the outcomes from three gender-inferring tools to take advantage of the strengths of all the tools, as discussed in Section 2. We use the most effective combination, which we obtain by testing different combinations of the tools against the labelled dataset provided by Santamaría and Mihaljević (2018). The selected combination approach reduces the rate of unrecognized names to 5.15%. Then, we removed the names with unrecognized genders from the analyses on perceived gender diversity for better accuracy.
46 https://www.npr.org/2022/02/17/1079181478/us-census-middle-eastern-white-north-african-mena
47 https://docs.github.com/en/free-pro-team@latest/rest/users/users?apiVersion=2022-11-28#get-a-user
48 The sample size of 500 individuals exceeds the minimum recommended threshold (383) for achieving a 95% confidence level with a 5% margin of error when making inferences about the entire contributor population within our dataset.
The second limitation of using gender-inferring tools is that the tools have different classifications of gender given a name. To ensure consistency across all the tools used in this study, we follow the definition of gender as a binary classification (either men or women) as used in prior studies by Vasilescu et al. (2012, 2014, 2015b). As mentioned in Section 2 (DC5), we use three gender-inferring tools: Wiki-Gendersort, Gender-guesser, and GenderComputer. The first tool, Wiki-Gendersort, classifies a name into one of three gender groups: male, female, and unisex. We consider the female category as women and the male category as men; we consider the remaining category, unisex, to be unknown because we limit our discussion to a binary classification of gender (Vasilescu et al. 2012, 2014, 2015b). The next tool, Gender-guesser, classifies a name into one of five categories: female, mostly female, androgynous, mostly male, and male. We consider the female and mostly female categories as women; similarly, we consider the male and mostly male categories as men. We consider Gender-guesser's remaining category (androgynous) as unknown, similar to Wiki-Gendersort's unisex category.
The third tool, GenderComputer, classifies gender into two main categories: male and female.49 We consider the female category inferred from GenderComputer to be women, and the male category to be men. Although we are aware that gender is a more complex issue than a binary variable (Santamaría and Mihaljević 2018), we believe that following the terminologies used in prior work mitigates the risk of misrepresenting genders. Future researchers may consider training more sophisticated gender-inferring tools with much larger datasets, taking the diverse representations of gender into consideration.
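The category mappings described for the three tools can be written down as a small sketch. The category strings follow the wording of the text rather than each tool's literal output, and the majority-vote `combine` rule is purely an illustrative assumption, since the text does not state which combination of tools was selected.

```python
from collections import Counter

# Category names follow the prose above, not necessarily the tools' raw outputs.
WIKI_GENDERSORT = {"male": "man", "female": "woman", "unisex": "unknown"}
GENDER_GUESSER = {
    "male": "man", "mostly male": "man",
    "female": "woman", "mostly female": "woman",
    "androgynous": "unknown",
}
GENDER_COMPUTER = {"male": "man", "female": "woman"}


def to_binary(mapping, raw_label):
    """Map a tool-specific category to 'man', 'woman', or 'unknown'."""
    return mapping.get(raw_label, "unknown")


def combine(votes):
    """Hypothetical combination rule: majority vote, with ties left unknown."""
    counts = Counter(vote for vote in votes if vote != "unknown")
    if not counts:
        return "unknown"
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "unknown"
    return ranked[0][0]


votes = [
    to_binary(WIKI_GENDERSORT, "unisex"),
    to_binary(GENDER_GUESSER, "mostly female"),
    to_binary(GENDER_COMPUTER, "female"),
]
print(combine(votes))
```

Names that remain `unknown` after combination correspond to the contributors excluded from the gender analyses.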
We use Vasilescu et al.'s GitHub-alias-merging approach (Vasilescu et al. 2015c) to track name changes of users, as discussed in Section 2 (DC4). If a user changes their name to a completely different one but keeps using the same email address, we are able to unify the identities in such cases. However, if the user makes substantial changes to both their name and email address, we may miss such cases. In fact, it is infeasible even for humans to recognize such cases since no ground truth data is available for users. Nonetheless, we only discuss the perceived diversity in this paper, and missing such extreme cases would not substantially impact our study's conclusions.
Lastly, in our study, we use four measures of diversity: richness (DeJong 1975), evenness (Brillouin 1951), the Blau index (Simpson 1949), and prevalence rankings and diffusion score.40 Other indices, such as the Gini index (Gini 1936) and the Theil index (Theil 1961), are not applicable to our study because they are not comparable across different teams when teams have different numbers of ethnicities or genders, as in our study.

Internal Validity
To identify DevOps files, we rely on filename and keyword conventions. Despite our best effort, our automated classification script may still misclassify files. On the other hand, our initial results from a validation set (of 400 files) show that the agreement between a human labeller and the automated classification script is near-perfect (Cohen's Kappa = 0.82).
Our preliminary analysis of the dataset shows that some DevOps files are automatically generated and hence are not valid targets for our analyses. We filter out such files using keyword matching. Initial results from a validation set of 400 randomly selected files that are classified as DevOps show that the agreement between a human labeller and the automated classification script is near-perfect (i.e., Cohen's Kappa of 0.81).
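The agreement statistic used in both validation steps can be computed with a minimal sketch of Cohen's Kappa for two raters. The label values and the sample lists are made up for illustration; they are not the validation data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels only: human labeller vs. automated script.
human = ["devops", "devops", "other", "other"]
script = ["devops", "other", "other", "other"]
print(cohens_kappa(human, script))
```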
We define DevOps contributors as those who made at least one change to DevOps files. A threshold of one DevOps file change may introduce selection bias into our definition of DevOps contributors. As discussed in Section 2, we recompute our results with two more thresholds: two file changes and ten file changes. We find that our results still hold, indicating that our choice of threshold does not introduce a substantial selection bias.
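The threshold sensitivity check described above can be sketched as follows. This is an illustrative example under assumptions, not the study's pipeline: the input is assumed to be a flat list with one contributor identifier per DevOps file change, and the function name and contributor names are hypothetical.

```python
from collections import Counter


def devops_contributors(devops_changes, threshold=1):
    """Return contributors with at least `threshold` DevOps file changes.

    `devops_changes` holds one contributor id per DevOps file change
    (ASSUMPTION: a simplified stand-in for the curated commit data).
    """
    counts = Counter(devops_changes)
    return {who for who, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    # Hypothetical change log: 12 changes by alice, 2 by bob, 1 by carol.
    changes = ["alice"] * 12 + ["bob"] * 2 + ["carol"]
    for t in (1, 2, 10):
        group = devops_contributors(changes, threshold=t)
        print(t, sorted(group))
```

Recomputing the diversity metrics on each resulting group, and checking that the conclusions agree, is the kind of robustness check the paragraph above refers to.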

External Validity
We start with a dataset of projects using CircleCI, and carefully select a meaningful sample of projects that are worthy of study. To do so, we filter out the projects that are immature and have little development history by applying filtration criteria used by prior studies (Kalliamvakou et al. 2016). This selection results in a dataset of 450 projects that adopt DevOps tools and technologies. Our dataset might be considered small when compared to the whole population of GitHub projects. To check the validity of our analyses beyond the projects that adopt CircleCI, we recompute the results of our research questions using another dataset that contains commits of projects that adopt GitHub Actions. We find that the overall conclusions of the study remain unchanged for this new dataset of projects that adopt GitHub Actions. For the complete analysis and its corresponding figures, we direct readers to Appendix F.

Related Work
In this section, we discuss the work most related to ours. In particular, we describe the prior work that discusses diversity in the context of software engineering.
A plethora of prior work focused on analyzing the diversity of project teams (Canedo et al. 2020; Izquierdo et al. 2018; Vasilescu et al. 2015a, d). For example, Vasilescu et al. (2015d) analyzed diversity data for 23,493 GitHub projects. They found that 75.3% of the studied projects have no gender diversity at all. On the other hand, Canedo et al. (2020) surveyed 34 manually gender-confirmed women core developers and found that most of them (65.7%) have never experienced gender discrimination. Canedo et al. (2020) performed another analysis of commits, and they found that the total number of commits and the frequency of commits made by a woman developer are not significantly different from those made by a man developer. However, Canedo et al. analyzed commits and commit messages of contributors in 711 GitHub projects, and found that women developers tend to contribute more to re-engineering tasks than men developers. Vasilescu et al. (2015a) surveyed GitHub developers to identify their perceptions of diversity on GitHub. They found that 48.24% of the developers are aware of the gender of their fellow developers, while 29.94% are aware of their ethnicity. Izquierdo et al. (2018) analyzed the diversity of OpenStack, a well-known open-source project involving thousands of contributors. They found that women make up 10% of all OpenStack contributors.
Other work investigated how the demographic characteristics of contributors can influence the acceptance of Pull Requests (PRs) (Furtado et al. 2020; Iyer et al. 2019; Nadri et al. 2021a, b; Rastogi 2016; Rastogi et al. 2018; Terrell et al. 2017). Terrell et al. (2017) found that, among external contributors, women whose gender is identifiable have a lower PR acceptance rate. Rastogi (2016) and Rastogi et al. (2018) also found that PRs are more likely to get accepted when both the submitter and the integrator are from the same geographical location. Moreover, Nadri et al. (2021a) found that contributors of non-White races tend to receive rejections without rationale. Later, Nadri et al. (2021b) found that PRs from perceptibly White developers have a higher acceptance chance, and perceptibly non-White contributors are more likely to get their PRs accepted if the integrator is also from the same race. Furtado et al. (2020) found that contributors from countries with low development in health, education, and income submit fewer PRs but receive the most rejections. Iyer et al. (2019) found that personality traits and social factors are significant and comparable to technical factors that influence the likelihood of PR acceptance.
Other research work focused on the relationship between diversity and team productivity (Bosu and Sultana 2019; Catolino et al. 2019; Horwitz and Horwitz 2007; Terrell et al. 2017; Vasilescu et al. 2015b). Bosu and Sultana (2019) studied ten popular open-source projects hosted on GitHub. They found that despite no project indicating any significant productivity differences between men and women developers, women developers tend to receive slower feedback during code reviews and are underrepresented in leadership roles. Catolino et al. (2019) studied the gender imbalance of developers in projects and its effect on community smells (e.g., deliberate cessation of communications and defiant community members). They found that women were more restrained in the discussion forums, regardless of how senior or productive they were. Terrell et al. (2017) studied 4,500 GitHub contributors, and found that gender diversity is a significant and positive predictor of productivity. Vasilescu et al. (2015b) and Ortu et al. (2017) have also studied gender diversity in GitHub projects, and found that gender diversity has a positive effect on team productivity. However, a few studies found contradictory results as well (Gila et al. 2014; Horwitz and Horwitz 2007). For example, Horwitz and Horwitz (2007) found that bio-demographic (e.g., gender and ethnicity) diversity has no relationship with team performance.
Prior studies have also investigated the diversity of developers participating in online socio-technical platforms (Oliveira et al. 2018; Vasilescu et al. 2012). Vasilescu et al. (2012) studied the gender representation in Stack Overflow (a Q&A platform for developers). They found that men represent the vast majority of contributors to Stack Overflow. Also, they found that men earn more reputation and are more engaged than women. Oliveira et al. (2018) studied the design of Stack Overflow and contrasted their findings with the views of participants of various cultural backgrounds. They found that the design of Stack Overflow follows individualist values, such as productivity and reputation scores, which can discourage the engagement of those with collectivist values.
Recent surveys set out to understand the trend of the research area concerned with perceived diversity in Software Engineering (Rodríguez-Pérez et al. 2021; Trinkenreich et al. 2022). Rodríguez-Pérez et al. (2021) highlighted that researchers have been exploring gender bias problems in software engineering more than presenting solutions to mitigate these problems. Trinkenreich et al. (2022) analyzed the literature to identify strategies proposed to mitigate the challenges faced by women in open-source projects, such as de-stereotyping the open-source project contributors, training mentors to assist women, creating and enforcing a code of conduct, and placing women in leadership positions.
Several studies (Albusays et al. 2021; Green 2017; Modi et al. 2012; Prana et al. 2021; Ross et al. 2020; Varma et al. 2023) indicated that individuals who identify themselves as being in the intersection of two or more minority groups encounter specific obstacles in technical fields. For example, Black women tend to have less exposure to Software Engineering (Ross et al. 2020) and face systematic impediments to upward career mobility (Green 2017). Prana et al. (2021) surveyed 122 developers worldwide and measured gender diversity with respect to different regions using the Blau index (Simpson 1949). Prana et al. found that gender diversity is low worldwide, but there is a perceived distinction in diversity across regions that is not statistically substantial; for example, the gender diversity of respondents from Africa is lower than that among respondents from America. Our intersectional analysis (of ethnicity and gender) adds a complementary perspective to these existing studies, since we analyze the representation of developers with respect to their contributions, i.e., contributors to DevOps artifacts and non-DevOps artifacts. Our findings show that the lack of perceived diversity among DevOps contributors is amplified when considering the intersection of ethnicity and gender.
Several surveys and/or analyses by online communities have also raised concerns about the diversity of DevOps developers: yearly surveys of Stack Overflow and the US-based analysis from "Zippia." Both of the above show that men are the majority among DevOps developers. For example, the Stack Overflow survey (2022) reported that the percentage of developers who identified themselves as DevOps developers is 10.06%, and of them, the vast majority are men (94.37%). A study by Wurzelová et al. (2019) also analyzed the dataset of the Stack Overflow survey (2018). They found that 27.41% of the women who report contributing to open source are DevOps specialists. In the context of our study, we find that 19.89% of the perceptibly women contributors in open source are also contributing to DevOps artifacts. While the Stack Overflow-related analyses and Zippia's analysis provide insights into the overall diversity of the DevOps community, our study complements them with a contribution-based analysis of perceptible ethnic and gender diversity at the project level and over time. Moreover, we ground our analysis in comparison with non-DevOps contributors in the same set of studied projects.

Conclusion
In this paper, we investigate the perceived gender and ethnic diversity of DevOps and non-DevOps contributors to open-source projects by quantitatively analyzing 4,207,735 commits made by 110,336 contributors to 450 open-source GitHub projects. Below, we discuss the implications of our findings.
The lack of perceived diversity in the project subteams (i.e., DevOps) is even more prominent than for teams overall. Our findings indicate disparities in the perceived ethnic and gender representation among DevOps and non-DevOps contributors in the GitHub community. In particular, our observations (1) and (2) show that the group of contributors who contribute to DevOps tend to have less perceptible diversity than contributors to non-DevOps artifacts. Prior studies (Khan et al. 2022; Leite et al. 2019) have discussed several challenges of adopting DevOps, such as difficulties in choosing appropriate tools from a diverse set of tools, which may be contributing to the lack of perceptible diversity. Potential reasons for the lack of diversity in DevOps have been discussed in various online reports as well. For instance, the report "Inspiring Women to Join the DevOps Movement" discussed a lack of awareness among women developers about the availability of career paths such as DevOps. Several reports also discussed the stereotypical perception that DevOps is man-dominated, which may discourage women from pursuing careers in this area. Concerns were raised about the gender pay gap in DevOps, which may also contribute to the lack of diversity. We encourage future studies to direct efforts towards further understanding why DevOps contributions are less attractive to perceptibly non-White and women developers.
The perceived diversity of DevOps and non-DevOps contributors is slowly increasing over time, but there is still room for improvement. Observations (3) and (4) show that perceptible diversity is increasing over time, but the improvement is not substantial. To bridge this gap, one particular effort is the "Women in DevOps" platform, which was established in 2017 specifically to address the issue of gender imbalance in the DevOps industry. Furthermore, to improve diversity in software teams in general, several studies (Nadri et al. 2021b; Prana et al. 2021; Wang and Redmiles 2019) have suggested strategies. For example, Wang and Redmiles (2019) have proposed designing a series of carefully crafted and empirically tested training courses that aim to reduce gender bias in both educational institutions and software development organizations. Prana et al. (2021) emphasized the fact that current automatic tools, such as bug assignment tools (Jonsson et al. 2016; Lee et al. 2019), make recommendations based on the similarity between developers (homophily), restricting the promotion of diversity. That said, to the best of our knowledge, whether projects benefit from implementing such strategies has yet to be investigated. Future work could examine the impact of various strategies, such as mentorship programs (Prana et al. 2021; Wang and Redmiles 2019) and code of conduct amendments (Ford et al. 2019; Prana et al. 2021), shedding light on specific interventions required to promote diversity and inclusion for developers contributing to different project activities (e.g., DevOps and non-DevOps).
The lack of perceived diversity is amplified when considering the intersection of ethnicity and gender of DevOps and non-DevOps contributors. Much prior work (e.g., Bosu and Sultana (2019); Nadri et al. (2021a, b); Vasilescu et al. (2015b)) that studied diversity has mainly considered independent analyses of ethnic and gender diversity. Considering the intersection of perceived ethnic and gender diversity is important to provide a richer understanding of the problems that individuals in this intersection encounter. For example, our observations (5) and (6) show that DevOps and non-DevOps contributors perceptible as Hispanic women and Black women are greatly underrepresented (median = 0%), while DevOps contributors perceptible as API women and Hispanic women tend to be more underrepresented than non-DevOps contributors. We believe that the challenges faced by those who are perceptible as being at the intersection of a minority ethnicity and a minority gender may not be addressed with a single overarching solution targeted towards one identity factor. Thus, we encourage future studies to explore those challenges and amplify the voices of developers situated at the intersection.

Fig. 1 An overview of our study design showing Project Selection (PS), Data Curation (DC), and Preliminary Analysis (PA) steps

Fig. 2 Threshold plot for the number of builds

Fig. 4 Threshold plot for the number of contributors

Fig. 6 Threshold plot for the number of DevOps file changes (t) made by a contributor vs. the number of contributors

Fig. 8 Bean plots showing the distribution of percentages of contributors perceptible as men and women among DevOps contributors as well as among non-DevOps contributors. The solid lines present the median, and the dotted lines present the first and third quartiles

Fig. Change in the perceptible ethnic diversity of non-DevOps contributors over time

Fig. 12 Change in the perceptible gender diversity of non-DevOps contributors over time

Fig. 14 Change in the perceptible diversity of the perceptibly women non-DevOps contributors

Table 1
Descriptive statistics of our curated dataset of 450 projects

Table 2
A sample of our curated dataset

Box-plots showing the percentages of contributors to non-DevOps files only and the percentages of contributors to DevOps files in the studied projects

(Satopaa et al. 2011) ... the number of DevOps contributors included in our original definition from 22,580 (t = 1) to 6,616 (t = 10), which is a substantial reduction of the number of DevOps contributors included in our definition; lastly, using a threshold of 286 file changes (t = 286, which is the knee detected by the Kneedle algorithm (Satopaa et al. 2011)) is extra conservative and reduces the number of DevOps contributors included in our original definition from 22,580 (t = 1) to 625 (t = 286), which is an extreme reduction of the number of DevOps contributors included in our definition.

Table 3
Results of statistical analysis of ethnic diversity metrics for DevOps and Non-DevOps contributors

Table 4
Results of statistical analysis of gender diversity metrics for DevOps and Non-DevOps contributors

Table 5
Results of the statistical analysis of the percentages of the perceptible ethnicities of perceptibly women DevOps and non-DevOps contributors