Configuration smells in continuous delivery pipelines: a linter and a six-month study on GitLab

An effective and efficient application of Continuous Integration (CI) and Delivery (CD) requires software projects to follow certain principles and good practices. Configuring a CI/CD pipeline is challenging and error-prone. Therefore, automated linters have been proposed to detect errors in the pipeline. While existing linters identify syntactic errors, detect security vulnerabilities, or flag misuses of the features provided by build servers, they do not support developers who want to prevent common misconfigurations of a CD pipeline that potentially violate CD principles ("CD smells"). To this end, we propose CD-Linter, a semantic linter that can automatically identify four different smells in pipeline configuration files. We have evaluated our approach through a large-scale and long-term study that consists of (i) monitoring 145 issues (opened in as many open-source projects) over a period of 6 months, (ii) manually validating the detection precision and recall on a representative sample of issues, and (iii) assessing the magnitude of the observed smells on 5,312 open-source projects on GitLab. Our results show that most developers accept and fix the reported CD smells and that our linter achieves a precision of 87% and a recall of 94%. The smells can be frequently observed in the wild, as 31% of the projects with long configurations are affected by at least one smell.


INTRODUCTION
Continuous Integration (CI) and Delivery (CD) are widely adopted practices in software development. A CI/CD pipeline automates the process of building, testing, and deploying new software versions. There is plenty of empirical evidence for the positive effects of using CI/CD, including early defect discovery [29], increased developer productivity [42], and fast release cycles [22]. To achieve those benefits, it is recommended to follow various principles and best practices. For example, developers should build and test the software on every change that is committed to a project's version control system [24]. Several catalogs of CI/CD best practices exist [22,23,30,38], but while their adoption has been advocated in research papers, white papers, and books, developers have difficulties applying them in practice [28], deviating from principles and generating anti-patterns [43].
Some of these anti-patterns are related to the way developers use a CI/CD pipeline. For example, developers do not integrate their changes frequently, or they remove failed tests to repair a build failure; previous research [43] implemented tools that help developers avoid such bad practices by analyzing logs and past changes. Other anti-patterns emerge when the CI/CD pipeline is configured. To support developers in configuring CI/CD pipelines, DevOps build servers such as GitLab [3] can validate configuration files using online linters [4]. However, those tools only spot basic syntactic errors, such as the use of reserved keywords when naming build steps.
Previous work has proposed approaches for detecting misuses of specific configuration options. In particular, Gallaba et al. [26] achieved a high user acceptance when pointing out the misuse of four different configuration options, like executing commands in the wrong build step. Rahman et al. [37] focused on security-related issues and Sharma et al. [39] on Infrastructure-as-Code (IaC) smells. While these works show that semantic linting of CD pipelines is useful, they do not solve the problem of avoiding CD anti-patterns in configuration files. For example, systematically executing jobs manually is not a misuse, but it violates a CD principle.
In this work, we support developers in configuring their CD pipelines by helping them avoid violations of accepted CD principles in their configuration files. We propose a novel semantic linter named CD-Linter to detect process-related violations of CD principles, in the following referred to as "CD smells". CD-Linter is currently capable of detecting four types of CD smells that are related to violations of principles and best practices described in the literature [23,30]. We evaluated CD-Linter through a large-scale and long-term study consisting of 145 issues opened in as many projects. We monitored the reactions to those issues over a period of 6 months and found that 53% of the project maintainers agreed with the reported CD smells, either accepting the issues (9%) or directly fixing them (44%). We also analyzed the reasons for rejected issues and used them to further improve CD-Linter. Finally, we measured the accuracy of the latest version of CD-Linter and investigated the occurrence of the four CD smells in the wild.
The contributions of this paper can be summarized as follows:
(1) The operationalization of four violations of CD principles in pipeline configurations (CD smells), and the empirical validation of their relevance.
(2) CD-Linter, an open-source semantic linter that can detect these CD smells in configuration files of GitLab pipelines. We show that, overall, CD-Linter has a precision of 87% and a recall of 94%.
(3) A large-scale empirical investigation of the extent to which the considered CD smells occur in a large set of 5,312 open-source projects.
All datasets and scripts used in our studies (together with the CD-Linter implementation) are available in our online appendix [44].

METHODOLOGY OVERVIEW
This paper investigates the problem of violating CD principles (i.e., CD smells) in configuration files of CD pipelines. We propose CD-Linter, a semantic linter that detects CD smells, and we evaluate its usefulness by answering the following research questions:
RQ 1: Are the four CD-Linter CD smells relevant to developers?
RQ 2: How accurate is CD-Linter?
RQ 3: How frequent are the investigated CD smells in practice?
Figure 1 provides a high-level overview of the different parts of this paper; the details of the empirical study design are covered in Section 4. Inspired by existing literature in this area, we started by selecting four CD smells that affect the definition of CD pipelines (1) (Section 3.2 provides more details about the selection). To study these CD smells, we selected a dataset of 5,312 open-source projects that are publicly available on the GitLab platform (2). We built detectors, ran them against the dataset, and incrementally improved the corresponding detection strategies (3). The four CD smell types and their detection strategies are introduced in Section 3.
Initially, we detected 5,237 smells in our dataset (4). To validate the relevance of the selected CD smells and the correctness of our detectors, we started to open issues in the issue trackers of the affected open-source projects. We used feedback from the early iterations to further improve the detection strategies; once we were confident that the detectors worked properly, we created a balanced sample of 168 issues (5). After validating the reports manually, we rejected 23 issues and posted the approved 145 issues to the issue trackers of the corresponding open-source projects. We then monitored for 6 months how professional developers reacted to the opened issues (6). In Section 5.1, we answer RQ 1 by analyzing the internal rating of the authors and the reactions of the original developers to our reports (7).
The feedback that we received through rejected issues also enabled us to further improve our detection strategies, which reduced the total number of detected smells to 5,011 (4). We created a stratified sample of 868 issues to validate the precision of our detectors on a large scale, and we validated the recall by manually inspecting 100 projects (8). The sample size made it infeasible to open further issues on GitLab, because we could not have followed up on all of them, so we only rated the validity of these issues internally. Our rating provided the data required to answer RQ 2 (9), which is discussed in Section 5.2.
Finally, we analyzed the results for the complete dataset of 5,312 projects to investigate how frequently CD smells occur in practice (10). These results will be discussed in Section 5.3.

CD-LINTER DESCRIPTION
Organizations implement CD pipelines using technologies such as Jenkins, TravisCI, or GitLab. While the smells we consider are build-server agnostic, we implement CD-Linter for GitLab. GitLab is an integrated platform that hosts both the repository and the issue tracker, which is particularly interesting for our evaluation. In February 2020, a search via the GitLab API revealed that the site hosted more than 1.57M projects. GitLab can also be run in private installations, which makes it a popular solution for enterprises [18]. By supporting GitLab, CD-Linter targets industrial and open-source projects alike.

Background
In the following, we provide background information about the configuration parts relevant to this work.
Build Server. A build server is a reusable infrastructure that enables developers to define custom CD pipelines; it is configured through configuration files. In GitLab, the configuration file is .gitlab-ci.yml; other build servers use similar configuration files.
An example of such a configuration file is shown in Figure 2. The top part of the file defines the stages that every change committed to a version control system such as Git has to pass during the build. If no stages are defined, the default stages in GitLab are build, test, and deploy. The automation tasks are defined as jobs, the basic units of the CD pipeline. The example defines two jobs, code_quality and unit_test, which invoke specific shell scripts that are defined in the script line. For example, a Java project could include the script line script: mvn test to start all unit tests through the Maven build tool. Developers can also configure when to run a job (e.g., when: manual), how many times a job can be auto-retried in case of failures (e.g., retry: 3), and whether a job is allowed to fail (i.e., allow_failure: true).
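To make the structure concrete, the following is a minimal, hypothetical configuration in the same spirit as the example just described (it is not the paper's actual Figure 2), loaded with PyYAML so that the jobs and the parameters mentioned above can be inspected programmatically:

```python
# A minimal, hypothetical configuration resembling the example described above
# (not the actual Figure 2), parsed with PyYAML.
import yaml

CONFIG = """
stages:
  - build
  - test
  - deploy

code_quality:              # job in the build stage, started manually
  stage: build
  script: mvn pmd:check
  when: manual

unit_test:                 # job in the test stage, retried and allowed to fail
  stage: test
  script: mvn test
  retry: 3
  allow_failure: true
"""

pipeline = yaml.safe_load(CONFIG)
stages = pipeline.pop("stages")
for name, job in pipeline.items():
    print(name, job["stage"], job.get("when"), job.get("retry"),
          job.get("allow_failure"))
```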
Specialized Build Tools. In addition to the configuration of the high-level orchestration of the build pipeline, most CD pipelines use specialized build tools to perform the actual automation tasks, which require separate configuration parameters. In contrast to the build server, these build tools depend on the programming language used in the project. For this paper, we chose to support the typical build tools of Java and Python to have one representative each of statically-typed and dynamically-typed languages. The typical configuration differs between these languages and is too complex to be covered here. We will introduce the relevant bits once we have described the CD smells that CD-Linter supports.

Selection of Relevant CD Smells
CD-Linter features the implementation of an initial set of CD smells to be evaluated. Clearly, there may be many smells in CD pipelines (e.g., Duvall [23] defined 50 anti-patterns). Practically speaking, CD-Linter can detect a limited subset of smells and, being a linter, only those that can be statically identified. Therefore, we aimed to find a set of suitable CD smells rather than an exhaustive set of the most relevant ones. We collected all the good and bad practices that are illustrated in the Foundations part of Humble and Farley [30], a well-known book about CD practices. Some CD smells require historical information for their detection (using artifacts like logs or repositories), which is only available once the CD pipeline is in use, not when it is configured [43]. This is out of scope for a static linter, so we judged the feasibility of detecting the anti-patterns from configuration files alone, without relying on other artifacts. The complete list is available in our replication package [44]; from it, we selected four CD smells.
Fake Success. Each stage of the CD pipeline checks for several categories of defects. For example, jobs executed in the code quality stage can reveal the presence of poorly-written code snippets, while jobs in the test stage typically spot bugs at unit and integration levels. Every executed job should be able to fail the build. If not, developers can miss or ignore the underlying issue, which adds technical debt and might result in problems later. A Fake Success arises when a failure in a job does not affect the overall build result.
Retry Failure. The build process has to be deterministic. Flaky behavior, e.g., tests that sometimes fail [34,36], should be avoided at any cost, because it hinders the development experience, slows down progress, and hides real bugs. Some pipelines address this issue by rerunning a job multiple times after failures. However, this not only hides the underlying problem but also makes issues harder to debug when they occur only sporadically.
Manual Execution. CD means keeping the code base in a deployable state at any given time. Thus, a fully automated build process up until the deploy stage is required. Manual jobs might introduce errors and delay the delivery of code changes to the customers. This CD smell occurs when a job (that is executed before the deploy stage) needs to be explicitly started by a user.
Fuzzy Version. Developers should always specify the exact version of the external libraries they use. Otherwise, a build might not be reproducible. Failing to be specific about versions can also lead to long debugging sessions tracking down errors caused by the use of different library versions. Using the terminology of semantic versioning, we differentiate between the following sub-types of the smell: (i) Missing Version: no version number is defined; (ii) Only Major Version: only a major release number is defined; (iii) Any Minor Version: any equal or higher minor release with the same major version is allowed; or (iv) Any Upper Version: every equal or higher version can be used.

Parsing CD Configuration Files
To detect the CD smells, we parse the configuration files and map their content onto meta-models that we have created for each type of configuration file CD-Linter supports. These meta-models cover the parts of the configuration files that matter for the detection of the CD smells. CD-Linter considers three types of configurations: GitLab, Maven, and pip (Python). In the following, we describe the different analyses.
GitLab Config. From the .gitlab-ci.yml file, we capture the list of jobs, the stages and the variables. For each job, we record the name, the stage and the script lines (script, before_script, after_script) as well as the retry, allow_failure, when and the environment parameters. For the retry parameter, we keep track of the maximum number of retries (max) and for which kinds of failures the job is allowed to retry (when). We filter out dot-prefixed jobs as GitLab does not process them.
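As an illustration of this step, the sketch below shows one way the job-level information listed above could be extracted from a parsed .gitlab-ci.yml; it is an assumption about, not a reproduction of, CD-Linter's implementation, and the set of reserved top-level keys is deliberately incomplete:

```python
# Hedged sketch of the GitLab meta-model extraction described above; the names
# and the (incomplete) set of reserved keys are illustrative assumptions.
import yaml

RESERVED_KEYS = {"stages", "variables", "include", "default", "workflow",
                 "image", "services", "before_script", "after_script", "cache"}

def extract_gitlab_model(raw_yaml):
    doc = yaml.safe_load(raw_yaml) or {}
    stages = doc.get("stages", ["build", "test", "deploy"])   # GitLab defaults
    variables = doc.get("variables", {})
    jobs = {}
    for name, body in doc.items():
        if name in RESERVED_KEYS or name.startswith(".") or not isinstance(body, dict):
            continue  # reserved keys, dot-prefixed templates, non-job entries
        retry = body.get("retry")
        jobs[name] = {
            "stage": body.get("stage", "test"),               # GitLab's default stage
            "script": body.get("script", []),
            "before_script": body.get("before_script", []),
            "after_script": body.get("after_script", []),
            "retry_max": retry.get("max") if isinstance(retry, dict) else retry,
            "retry_when": retry.get("when") if isinstance(retry, dict) else None,
            "allow_failure": body.get("allow_failure", False),
            "when": body.get("when", "on_success"),
            "environment": body.get("environment"),
        }
    return stages, variables, jobs
```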
Maven Config. The Maven build tool is very popular among Java projects. Maven can automate various tasks, for example, dependency resolution and automated downloads from a centralized repository. Figure 3 shows a configuration excerpt (pom.xml) in which two dependencies, foo.x and bla.blubb, are defined.
From the pom.xml, we capture the unique coordinate of the artifact (artifact ID, group ID, version), all defined properties, and the coordinates of all dependencies. All properties are automatically replaced with their actual value. We also include all referenced modules recursively and link them together. Values such as versions are then inherited from ancestor POMs where available.
pip Config. For Python, two things are relevant. First, the file requirements.txt is often used to define all dependencies that are required in the Python environment to run a particular piece of software. These requirement files can be hierarchical and include other requirement files that are inherited. Second, the script line in the GitLab configuration file often contains manual calls to the package manager pip to download external dependencies. To find these, we search for the keyword pip install, strip other pip options, and remove quotes from the arguments. It is also possible to specify dependencies by pointing to files, folders, and URLs of version control systems. We use simple heuristics to detect these cases and exclude them from linting.
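As a rough illustration (our own sketch, not CD-Linter's code), such script lines could be scanned as follows; option handling is deliberately simplified:

```python
# Simplified, illustrative heuristic for extracting dependency specifiers from
# "pip install" invocations in script lines; not CD-Linter's actual code.
import re

PIP_INSTALL = re.compile(r"\bpip3?\s+install\s+(.*)")

def pip_dependencies(script_line):
    match = PIP_INSTALL.search(script_line)
    if not match:
        return []
    deps = []
    for token in match.group(1).split():
        token = token.strip("'\"")
        if token.startswith("-"):
            continue  # strip pip options such as --upgrade or -r
        if token.startswith(("git+", "http://", "https://", ".", "/")):
            continue  # files, folders, and VCS URLs are excluded from linting
        deps.append(token)
    return deps

print(pip_dependencies("pip install --upgrade 'numpy>=10.4' flake8"))
# -> ['numpy>=10.4', 'flake8']
```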

Detection of the CD Smells
With the parsed information available in the meta-models, we implemented the various analyses that detect instances of the four CD smells.
Fake Success. Setting the allow_failure parameter to true allows a job to fail without impacting the rest of the build. Figure 2 shows an example in which the build execution can succeed despite potential errors in the unit_test job. We detect a Fake Success every time a job's definition contains allow_failure: true. Note that we do not report Fake Success for the stages sast (static application security testing) and dast (dynamic application security testing): GitLab defines templates [6,8] that set allow_failure to true for the default jobs used in those stages.
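Assuming a job is represented as the raw mapping parsed from its YAML definition (an illustrative assumption), the check boils down to:

```python
# Illustrative Fake Success check (names are assumptions, not CD-Linter's API).
SECURITY_STAGES = {"sast", "dast"}   # GitLab's security templates set allow_failure

def is_fake_success(job):
    return job.get("allow_failure") is True and job.get("stage") not in SECURITY_STAGES
```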
Retry Failure. The retry parameter allows developers to configure how many times a job is retried in case of a failure (see the example in Figure 2). We detect all cases in which retry is set to a positive value. The recommended remedy in such a case is to restrict retry to a specific failure cause (e.g., when: runner_system_failure). We found only very few cases in which projects used when, so we decided to simplify the detection in CD-Linter and report all such usages of retry for now; handling these cases properly is a simple matter of implementation.
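A corresponding sketch (again an assumption about the job representation) that treats both the scalar and the mapping form of retry as smelly, as described above:

```python
# Illustrative Retry Failure check: GitLab's retry can be an integer or a
# mapping with "max"/"when"; both forms are reported here, as described above.
def is_retry_failure(job):
    retry = job.get("retry")
    if isinstance(retry, dict):
        retry = retry.get("max", 0)
    return isinstance(retry, int) and retry > 0
```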
Manual Execution. The when parameter can also be used to specify when a job shall be executed. To detect manual triggers of steps, we select all jobs that contain when: manual in their definitions. For example, the job code_quality in stage build (Figure 2) needs to be explicitly started by a user.
Not all manual triggers are a problem, though. CI/CD advocates the automated execution of all stages to ensure a releasable project state at every point in time; however, it is acceptable to manually decide when a release should happen. Therefore, we do not report cases in which the manual execution only affects deploy stages. Apart from using the default deploy stage, GitLab users can also define custom deploy stages [5]. To build a comprehensive list of deploy stage names, we extracted the stage names from a random project sample from our dataset (see Section 4). We identified all keywords that hint at a deploy stage, such as 'deploy', 'release' or 'publish', and exclude jobs and stages that contain these keywords in their name. Also, we do not report Manual Execution for jobs in the triage and review stages, because GitLab suggests that these stages should be started manually [7,9]. Furthermore, we exclude cases in which the action parameter of environment is set to stop, which is used to define a manual way to shut down an environment used in the build. Putting these exclusions together, a sketch of the check is shown below.
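The sketch uses an illustrative, partial keyword list standing in for the one derived from the project sample described above; the job representation is the same assumed raw YAML mapping as before:

```python
# Illustrative Manual Execution check; DEPLOY_KEYWORDS is a partial stand-in
# for the keyword list derived from the project sample described above.
DEPLOY_KEYWORDS = ("deploy", "release", "publish")
MANUAL_STAGES = {"triage", "review"}     # stages GitLab suggests starting manually

def is_manual_execution(job_name, job):
    if job.get("when") != "manual":
        return False
    stage = str(job.get("stage", ""))
    name_and_stage = (job_name + " " + stage).lower()
    if any(keyword in name_and_stage for keyword in DEPLOY_KEYWORDS):
        return False                     # deploy-related jobs/stages are excluded
    if stage in MANUAL_STAGES:
        return False
    environment = job.get("environment") or {}
    if isinstance(environment, dict) and environment.get("action") == "stop":
        return False                     # manual environment shutdown is acceptable
    return True
```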
Fuzzy Version. The way dependencies are declared is specific to the programming language and the corresponding dependency management tool. With respect to versioning, CD-Linter supports Python and Java projects (the latter using Maven). Table 1 shows a comparison of the version syntax.
Python projects typically use pip to manage their dependencies, and our meta-model contains information about all dependencies that are either defined in the requirements.txt file or through direct invocations of pip install. CD-Linter distinguishes between several Fuzzy Version sub-categories: (i) if no version is defined, we report a Missing Version; (ii) if a version specifier only consists of a single number, we report an Only Major Version; (iii) we report an Any Minor Version when the minor release number is an asterisk (*); and (iv) an Any Upper Version if the version number only defines a lower bound but omits the upper bound (e.g., numpy>=10.4).
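The following sketch classifies a pip-style requirement into the sub-types described above; it is a simplification (no extras, environment markers, or multi-clause specifiers) rather than CD-Linter's actual rules:

```python
# Simplified classification of pip-style version specifiers into the Fuzzy
# Version sub-types; real requirement parsing handles many more forms.
import re

def classify_pip_specifier(spec):
    match = re.match(r"^[A-Za-z0-9_.\-]+(?P<rest>.*)$", spec.strip())
    rest = match.group("rest") if match else ""
    if not rest:
        return "Missing Version"                    # e.g. "numpy"
    if re.fullmatch(r"==\d+", rest):
        return "Only Major Version"                 # e.g. "numpy==10"
    if re.fullmatch(r"==\d+\.\*", rest):
        return "Any Minor Version"                  # e.g. "numpy==10.*"
    if rest.startswith(">=") and "<" not in rest and "," not in rest:
        return "Any Upper Version"                  # e.g. "numpy>=10.4"
    return None                                     # pinned or out of scope

for example in ("numpy", "numpy==10", "numpy==10.*", "numpy>=10.4", "numpy==10.4.1"):
    print(example, "->", classify_pip_specifier(example))
```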
In Java projects, dependency resolution is typically handled by the build tool. In the case of Maven, dependencies are defined in pom.xml. To detect Missing Version, we identify dependencies that do not specify a <version> tag. For dependencies that include the tag, we detect Only Major Version as we do for Python projects, and Any Upper Version by checking whether the upper bound of a version range is missing (e.g., [1.2.3,)). Any Minor Version is impossible by design, because a range must be declared to allow arbitrary minor releases. When analyzing dependencies, CD-Linter handles transitive dependencies by traversing the POM hierarchy recursively.
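For a single, stand-alone pom.xml (no parent resolution, module linking, or property substitution), a much-simplified version of this check might look as follows; the namespace handling and sub-type rules are our assumptions:

```python
# Much-simplified Fuzzy Version check for a single pom.xml; CD-Linter's real
# analysis also resolves parent POMs, modules, and property references.
import xml.etree.ElementTree as ET

POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

def fuzzy_maven_versions(pom_path):
    root = ET.parse(pom_path).getroot()
    for dep in root.findall(".//m:dependencies/m:dependency", POM_NS):
        artifact = dep.findtext("m:artifactId", default="?", namespaces=POM_NS)
        version = dep.findtext("m:version", namespaces=POM_NS)
        if version is None:
            yield artifact, "Missing Version"
        elif version.isdigit():
            yield artifact, "Only Major Version"      # e.g. <version>1</version>
        elif version.startswith(("[", "(")) and version.rstrip(")]").endswith(","):
            yield artifact, "Any Upper Version"       # e.g. <version>[1.2.3,)</version>
```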
When reviewing the detection strategies (step 3 of Figure 1), we realized that some libraries manage dependency versions themselves, as Spring Boot, a popular framework for web applications, does. We have compiled a list of affected dependencies for which we do not report a Fuzzy Version CD smell, because omitting the version is acceptable in these cases.
As a reaction to developer feedback (RQ 1), we differentiate between libraries used in production code and tools used in the pipeline. Not specifying a version for a tool is less critical, because no source code relies on an API that might break in newer versions. On the contrary, a new version with fixed bugs and updated features might even be desirable. To this end, we compiled a list of tools used in Python and Java projects. These include, for example, pipenv [12] [19] for Python and PMD [13] for Java (the complete list is in our online appendix [44]).

EMPIRICAL STUDY DESIGN
The goal of this study is to evaluate CD-Linter and to determine whether it can help developers avoid CD smells in their CD pipelines. The quality focus is two-fold: the perceived usefulness as judged by the original developers of projects in which CD smells are detected, and the accuracy of CD-Linter. The perspective is that of researchers who have developed CD-Linter and want to transfer it to practice. The context consists of 5,312 open-source projects hosted on GitLab and using CD. More specifically, the study answers the three research questions formulated in Section 2.

Context Selection
To answer our research questions, we selected open-source projects hosted on GitLab. Using the GitLab API, we filtered out projects that do not have at least one star or that are forked from other projects, to avoid duplicates. From the resulting 26,984 projects, we removed all projects that do not contain a .gitlab-ci.yml file in their repositories (i.e., that do not use GitLab as CD server). This last filter left us with 5,312 projects that we could analyze for the presence of CD smells. These projects have diverse team sizes (from 1 to 633 members, with a median of 2) and histories (from 1 to 133 thousand commits, with a median of 75). Regarding languages, our dataset mainly includes JavaScript (16%), Python (14%), C (10%), Java (7%), Go (4%), and Ruby (4.4%) repositories. Also, the projects have varying CD adoption histories.

Monitoring of the Opened Issues
We first ran CD-Linter on the dataset of 5,312 projects described in Section 4.1. Then, we identified a random set of CD smells such that (i) the set is balanced across the CD smell types, and (ii) it contains at most one CD smell per project owner, to avoid flooding the same owner with many issues. For Manual Execution we could detect a maximum of 42 smells across owners, and we ended up with a total of 168 CD smells.
Once the detected CD smells were uploaded to the CD-Linter web-based platform, which can automatically report issues, each issue was shown to two independent evaluators (two of the authors, one of whom was not involved in the CD-Linter implementation) to remove false positives (false positives are the object of a separate analysis in Section 5.2). Each evaluator could read the report generated by CD-Linter, browse the file in which the CD smell was found, and, if needed, browse the entire repository and its history through a GitLab link. When both evaluators reported a positive assessment, the issue was automatically posted and opened on GitLab. The disagreement cases were discussed and, in case of positive agreement, an issue was also opened. In total, we opened 145 issues.
We monitored the issues over a period of 6 months (from August 2019 to February 2020). During this period, we collected 64 reactions, counting issues that have been upvoted/downvoted, commented, assigned, or closed. 59 projects did not show any activity during the observation period, so we decided to ignore them in our analysis. The response rate of the remaining, active projects was 74%. We performed a card sorting [40] of the 120 comments we received to identify agreements and disagreements with our issues and their motivations. The card sorting was performed by two authors who, after a first round of independent tagging, met and merged their annotations.
In addition to the reactions, we checked the source code to see whether a reported smell had been removed or reintroduced during the observation period. In some cases, the smell was fixed despite a negative reaction or without any reaction to the issue whatsoever.
Based on the developers' reactions, issues were classified into the following five categories.
Ignored. The issue has been closed without any further reaction.
Rejected. The issue has been closed with a majority of downvotes or with negative feedback in the comments.
Pending. The issue is still open and under discussion among the maintainers, without a clear agreement or disagreement.
Accepted. The issue has been assigned for fixing, or has a majority of upvotes or positive feedback.
Fixed. The smell reported in the issue has been removed.
To address RQ 1, we report and discuss the responses to the opened issues for each smell type. We report the numbers and percentages of positive and negative reactions and the rationale for rejected issues, and provide examples of positive feedback and false positives. The results of RQ 1 directly improved CD-Linter (see Section 3.4).

Manual Validation of CD Smells
We executed the enhanced version of CD-Linter on the 5,312 projects, which resulted in the detection of 5,011 CD smells. Then, we formed a sample to be manually validated. We selected, for each owner, one CD smell of each type, if detected. Since Fuzzy Version has four sub-categories, we considered one of each sub-category (Missing Version, Only Major Version, Any Minor Version, and Any Upper Version), if present. The result is a sample of 868 issues, which achieves an error margin of ±3% (with a confidence level of 95% and a percentage of 50%). Then, similarly to what was done for RQ 1, each issue was independently validated by two authors. After each annotator concluded the tagging, we measured Cohen's kappa inter-rater agreement (k) [20]. We obtained k = 0.76, i.e., a high agreement; therefore, no re-coding was necessary. Finally, the two annotators discussed and resolved the disagreement cases. To address RQ 2, we report the overall precision of CD-Linter on the validated sample, defined as TP/(TP + FP), where TP is the number of true positives and FP the number of false positives. We also computed the recall, defined as TP/TTP (TTP: total number of true positives), using a randomly selected sample of 100 projects (a methodology similar to Gallaba et al. [26]), making sure that those projects were not the same ones used to calculate the precision.

Measurement of CD Smell Occurrences
To address RQ 3, we ran CD-Linter on the latest snapshot of the 5,312 projects described in Section 4.1. The analysis was performed on an Intel Xeon E5-2640 CPU at 2.50 GHz (4 cores) with 4 GB of available main memory and took a total of 74 seconds. We report the number of CD smells of each type we detected, as well as the percentage of projects and owners affected by at least one CD smell of each type. The latter gives us an idea of the diffuseness of the considered CD smells.

EMPIRICAL STUDY RESULTS
In this section, we will answer the three research questions and report on the results of the study defined in Section 4.

Are the four CD-Linter CD smells relevant to developers?
During the observation period of 6 months, 64 projects reacted to our issues (a response rate of 74%); Figure 4 illustrates the reactions. Overall, 53% of the project maintainers reacted positively to our issues: 9% acknowledged the presence of a problem and are about to solve it, and 44% fixed the reported CD smells. We also verified that the fixes were not reverted later and could not find cases in which the reported CD smells were re-introduced. Developers took on average 50 days to fix a CD smell, with a maximum of 5.5 months and a minimum of 1 hour. Looking at the negative cases, 9% of the issues were closed without reactions (i.e., ignored issues) and 32% were rejected. Several project maintainers that rejected issues provided us with reasons why they want to keep the CD smell. We found other cases in which developers rejected our issues simply because of a lack of trust in automated issue-reporting tools. In the following, we describe the reactions to each CD smell type (see Figure 4), the feedback we received when the issues were rejected, and how we refined our detection strategies based on the analyzed comments. We also report the percentage of false positives that we found during the assessment stage.
Fake Success. We found only one false-positive case and opened 27 issues reporting Fake Success cases, achieving a response rate of 70%. No issues were ignored, 10% of them were accepted, and the CD smell was removed in 37% of the projects. 9 opened issues (47% of the total) were rejected by the project maintainers. In several cases, developers generally agreed with the reported violation but decided to keep the CD smell nevertheless. Some developers prefer to let non-essential jobs fail, for example checks for outdated dependencies, static analysis tools, or external tools that might fail for unknown reasons. Other developers allow jobs to fail that are not fully implemented yet and, thus, should not impact the final build status. These projects typically state that they plan to remove the CD smell when the pipeline design is finished. Another developer, while agreeing on the violation, did not fix the CD smell because allow_failure: true is recommended for certain jobs that run static and dynamic application security tests provided by GitLab. It is no longer necessary to use this configuration flag explicitly (it has been moved to a template and will be used implicitly through inheritance), but old tutorials still describe it as a best practice. While failing the overall build because of warnings raised by static analysis tools or errors in external tools is a key CD principle [22], we completely agree that projects should follow the configuration recommendations of their CD provider. CD-Linter recognizes these cases and does not report them.
Retry Failure. We found 7 false positives for Retry Failure issues, which corresponds to 16.7% of the validated sample. We reported the remaining 19 smells with a response rate of 58% (the lowest rate among all CD smell types). 9% of the maintainers confirmed the existence of the problem and 55% removed the CD smell from the configuration of their projects. Only 4 issues (36%) were rejected. This CD smell seems to be introduced to "hide" flakiness instead of solving it; thus, we decided not to modify our detection strategy. One developer mentioned that she deploys her application to a remote service that fails randomly. Because the tool is out of her control, she decided to automatically retry the job multiple times, hoping that it will succeed without breaking the overall build.
Manual Execution. Manual Execution is the category where we found the largest percentage of false positives (26.2%), due to some periodic deployment jobs that CD-Linter did not recognize. We opened 16 issues and achieved a response rate of 81%. While 8% of the reported CD smells were accepted and 38% were fixed, 31% were ignored. Only 2 issues (15%) were rejected. In both cases, developers agreed on the importance of detecting this CD smell, but they also provided reasons for rejecting the issue. One of them set when: manual in a job executed in a stage that is not yet fully integrated with the rest of the pipeline. This can be addressed by allowing developers to directly configure CD-Linter and ignore jobs that are not part of the CD pipeline. The other developer rejected the issue because of a lack of trust in an automated reporting tool (CD-Linter). While this can be a threat to the study, it does not constitute a problem in a usage scenario where a developer uses the tool herself.
Fuzzy Version. We found only 3 Fuzzy Version false-positive instances and could open 24 issues, achieving the highest response rate (87%) among all CD smell types. 9% of the reported CD smells were accepted and 48% were fixed. While we cannot learn from the 9% of the issues that were ignored, we used the comments from the remaining 28% of rejected issues to refine our detection strategy. Most complaints concerned reports about tools whose version was left unspecified. In contrast to libraries, tools that are invoked in the pipeline (e.g., tools that compute code coverage) should always be updated to the latest version, especially because newer versions might contain security improvements. Furthermore, tools are dependencies of the project rather than of the source code. Thus, in the case of uncontrolled updates, such tools would neither affect the outcome of the build nor introduce errors, so we decided to incorporate this feedback and exclude tools from the detection of Fuzzy Version.
RQ 1 summary: We received reactions from 74% of the projects. 53% of the project maintainers reacted positively to our issues, either accepting (9%) or fixing (44%) the reported CD smells. In the rejected issues, we received valuable suggestions on how to improve CD-Linter, which we incorporated whenever possible.

How accurate is CD-Linter?
Table 2 reports the detection precision of CD-Linter. As the table shows, the detection precision ranges between 73% for Manual Execution and 100% for Retry Failure and Fake Success, with 81% for Fuzzy Version. Looking at the results for the different Fuzzy Version sub-types, Table 3 indicates that the detection precision is lowest for the Missing Version category (77%), which, however, is the most common one. When a version is incompletely specified (i.e., an Any Upper Version, Only Major Version, or Any Minor Version CD smell), the precision of CD-Linter rises to 95% or above.
In the following, we discuss the false positives. As shown in Table 2, we found no false positives for Retry Failure and Fake Success. Note that this does not mean that CD-Linter is always correct in such cases, because developers might use these options for a specific, valid purpose.
For Manual Execution, false positives were mostly related to cases where the job name, its content, or even comments added to the .gitlab-ci.yml file suggested that the job is related to a deployment activity that developers intentionally perform periodically and therefore trigger manually (e.g., issuing a release). Despite filtering out jobs related to deployment, as explained in Section 3.4, we still encountered unforeseen cases. Examples include a job named test-prerelase (the typo in the job name made our filtering fail), but also a job named push, which was pushing Docker images to a repository (this case may or may not be fully automated). Also, the names of several jobs with the when parameter set to manual suggested that they should not be manually triggered; however, both the implementation and a comment left in the file indicate that the developers intentionally configured a manual job trigger. Future work could improve CD-Linter by using Natural Language Processing (NLP) techniques to analyze comments in CD configuration files and infer the rationale of choices made by developers.
False positives of Fuzzy Version mostly relate to cases in which dependencies on pipeline-related tools lacked a version number. As a consequence of the preliminary analysis conducted with developers in RQ 1, we argue that libraries used in production code should be pinned to an exact version, to avoid build failures or the introduction of bugs. However, this may not be strictly necessary for tools, because developers may want to always use the latest version to benefit from fixed bugs and enhanced features. We derived an initial list of such tools based on the feedback from developers and exclude tools like coala [1] (rule-based linter), sphinx [17] (documentation generation), or wheel [16] (packaging utility).
We estimated the recall of CD-Linter on a sample of 100 randomly-selected projects. Two authors individually inspected the configuration files and agreed on the presence of 90 CD smells. We applied CD-Linter to the same sample and could detect 85 of the manually identified instances, achieving a recall of 94%. 3 false negatives were Fuzzy Version smells: two of them did not occur in script lines, while the other affected a requirements.pip file, a file name that our tool does not consider. The remaining 2 false negatives were Manual Execution smells. Those smells were not detected by our tool because their names contain deploy-related keywords, even though the jobs were executed in the stages metrics and build_unit_test. Section 3.4 established name-based inclusion/exclusion criteria by inspecting a sample of projects, but it is not feasible to derive a simple heuristic that covers all cases.
We believe that future iterations of CD-Linter can remove these false negatives by considering other features of the .gitlab-ci.yml (e.g., non-script lines in jobs) or other files that are currently not supported, and by enabling developers to configure their inclusion/exclusion criteria for job and stage names.

How frequent are the investigated CD smells in practice?
To understand the frequency of CD smells in practice, we analyzed the latest snapshot of the 5,312 projects (as described in Section 4.1). Among them, 863 projects are either written in Java (and built with Maven) or in Python and, therefore, qualify for an analysis of all considered CD smells, including Fuzzy Version, the only language-specific smell. Note that 136 of the initially considered projects had been deleted in the meantime and were not available for our analysis. Table 4 illustrates the occurrence of CD smells in the analyzed projects. We detected 2,874 instances of CD smells, affecting 13% of the projects (14% of the investigated owners). Fuzzy Version is the most common CD smell (54.6% of the instances) and is present in 37.1% of the analyzed projects. Fake Success and Retry Failure account for 22% and 18.5% of the identified CD smells, respectively. While Fake Success occurs in 5.4% of the projects, Retry Failure is present in 1.6% of them. 4.9% of the CD smells were Manual Execution and affected 69 projects (1.3% of the total).
Humble and Farley advocate that a CD pipeline should be composed of at least three separate stages, i.e., compile, test, and deploy [30]. However, organizations can name those stages differently, introduce additional stages, and define multiple jobs within one stage. This raises the question of whether more complex pipelines are also more prone to contain CD smells, which would make a tool like CD-Linter even more relevant. A qualitative insight from our manual analysis in Section 5.2 is that longer .gitlab-ci.yml files seem to contain more complex CD pipeline definitions. We therefore decided to split the analysis and discuss the different subgroups separately.
We distinguish three groups, small, medium, and long, defined through the first and third quartiles of the length distribution of all .gitlab-ci.yml files. Small .gitlab-ci.yml files have up to 15 lines (9.9% of the smelly projects have them), while a long .gitlab-ci.yml file has at least 55 lines (59.6% of the files with CD smells are long). The other projects (30.6%) are medium. Table 5 illustrates how CD smell instances are spread across the different clusters, and it is immediately clear that the cluster of long .gitlab-ci.yml files contains most of the CD smells.
The cluster with small .gitlab-ci.yml files includes 7.6% of the detected CD smells, with 13.1% of the total Fuzzy Version smells, 3.6% of the Manual Execution incidents, and a few of the other CD smell types. Projects with medium .gitlab-ci.yml sizes contain 35 Table 4).
Being the most common smell, we further analyzed Fuzzy Version and investigated its sub-categories with respect to the different files that it can affect (Table 6). Overall, the Fuzzy Version incidents are mainly detected in requirements.txt files.
RQ 3 summary: The most frequent CD smell among the ones CD-Linter detects is the Fuzzy Version CD smell (54.6% of the instances). Overall, CD smells affect 13% of the analyzed projects and 14% of their owners, mainly occurring in long configuration files.

THREATS TO VALIDITY
Threats to construct validity are related to possible imprecisions in our measurements. They mainly concern possible mistakes in CD-Linter's implementation beyond what we could discover by testing it. The extensive manual evaluation performed for RQ 2 mitigates this threat. In addition, the results of RQ 2, as well as the feedback provided by developers (RQ 1), gave us indications of how to make CD-Linter more accurate. Threats to internal validity concern factors, internal to our evaluation, that could influence the results. One threat is the subjectiveness of the manual validation of detected smells in RQ 2 (precision and recall). To limit this threat, we employed two evaluators, who discussed and resolved the cases of disagreement. Also for the coding of the comments that developers posted on the opened issues (RQ 1), having two coders limited the subjectiveness of the results. The reactions we got in RQ 1 and the results of RQ 3 may depend on the characteristics of the analyzed projects. In particular, projects with different degrees of maturity may adopt CD pipelines of different complexity, and may or may not adhere to CD principles and good practices. We mitigated this threat through the project selection criteria illustrated in Section 4.1.
Threats to external validity concern the generalization of our findings. While we are aware that GitLab is not as popular as GitHub, its adoption and the number of repositories it hosts are increasing. As explained in Section 4.1, it gives us the advantage of analyzing projects that use the same CD infrastructure. Besides being limited to a (relatively large) sample of projects, our evaluation is, because of the current limitations of CD-Linter's implementation, restricted to GitLab configuration files, Maven builds, and Python dependencies. The detection principles explained in Section 3 can be applied to other pieces of technology, as the underlying concepts would not change. In this paper, our purpose was to study the reaction of developers to the detection of CD smells, rather than to cover every possible technology.

DISCUSSION
The empirical evaluation of CD-Linter, especially the developers' feedback collected in RQ 1, allowed us to distill useful lessons learned and formulate implications for future research in this area.
Linters are fast and can support the pipeline definition. Undoubtedly, a paramount advantage of linters is that they are fast and can already be applied in early development phases. Our experiments have shown that CD-Linter can analyze configuration files from thousands of projects in a matter of seconds. Many of the contacted developers acknowledged (and often fixed) the CD smells that we pointed out in their projects. We can conclude that using linters to support the pipeline definition and to catch smells early on is, indeed, a promising research direction.
Issue reporting is useful, but must be carefully dosed. One problem we experienced in our empirical evaluation is that some developers are irritated by (and tend to discard) automatically-posted issues. While we tried to make clear in the opened issues that they were the result of a manual review process, some developers still considered them a sort of spam, even when the suggestion was meaningful. Recent work shows promising results when bots are used to aid software engineers [32,33], but we found that developers seem to be sensitive in the context of the issue tracker. Despite some negative reactions, our efforts were generally well-received by developers. To mitigate the negative effect described above, one author followed up on all comments on the opened issues to explain the purpose of CD-Linter, justify the opened issue, and, most importantly, show that there was a human in the loop. Overall, involving open-source developers in our research was valuable for both sides, but it was crucial to take the time to talk to developers, show respect, and emphasize the importance of the research.
Linters are intrinsically imprecise. A common issue of linters is their intrinsic imprecision. Not every deviation from an advocated principle is a smell and, often, a violation can only be assessed when the specific context is considered. Such decisions have to be taken on a case-by-case basis for a project and go beyond the scope of static analysis tools. This phenomenon is not specific to CD-Linter; a low precision of static analysis tools has been reported as an adoption barrier multiple times [21,31,47]. In our case, CD-Linter seems to balance precision and recall well. Despite many rejected smell reports, the number of fixed reports and the generally positive feedback that we received from developers indicate that developers appreciate the effort and that tools like CD-Linter can have a positive effect on CD practices.
Long and complex CD configurations are often smelly. While we find relatively few instances of the CD smells in simple configuration files, their density increases with the length (and complexity) of the configuration. One explanation could be that developers of complex pipelines have to cope with phenomena such as flakiness, the need for manual task triggers, the acceptance of failures in some jobs, or special requirements for dependency management. For such reasons, we expect CD-Linter to be particularly beneficial for projects with a complex pipeline.
Quickly report findings. One of our lessons learned from RQ 1 is that identified issues need to be reported promptly, or the issue may disappear. In some of our cases, the CD smell had already been resolved by the time we were done validating it, so the reported issue was unnecessary. Timely reporting is also essential for reports that include line numbers, because these are fragile when the source code changes and can quickly become outdated. In such cases, it might be helpful not to link to the latest version in the repository, but to the exact commit that was analyzed for the report.
Overall, this paper shows a promising future for linters of CI/CD pipelines. Future linters can extend the ideas in several ways, for example, by considering not only dependency versions, but also other versioned entities in the build configuration, like build plugins or the container images in which the build is run. The results in this paper emphasize the need for more research on linters in this domain.

RELATED WORK
This section describes related work about bad practices and their identification in CI/CD and infrastructure-as-code scripts.

Bad Practices in CI/CD
In their landmark books about CI [22] and CD [30], previous researchers outlined wrong decisions made when applying CI/CD. The lack of build automation and project visibility, together with the inability to create deployable software, are a few examples of practices that prevent organizations from achieving the expected benefits. Duvall collected these and other bad practices in a catalog of 50 anti-patterns (and their corresponding patterns) that occur during several steps of a CI/CD pipeline [24]. Zampetti et al. [48] empirically characterized CI bad practices, finding commonalities but also differences with the ones advocated by Duvall [24]. Anti-patterns also occur because developers face several barriers when adopting CI/CD [28]. For instance, developers need to debug failures occurring on a remote server and maintain complex build infrastructures.
The catalogs of anti-patterns and the studies discussed above constitute the foundations of our work, as we use them to derive principles for which CD-Linter detects smells.

Detection of Smells in Development Workflows
Several researchers have proposed approaches to automate the identification and, in some cases, the removal of problems arising in build scripts and, more generally, Infrastructure-as-Code (IaC) scripts. Gallaba et al. [26] developed an approach for detecting and eliminating misuses, such as the presence of unused properties and bypassed security checks, in Travis-CI build scripts. While we also statically analyze configuration files, our approach detects anti-patterns that are violations of CI/CD principles.
Deviations from such principles have also been investigated by Vassallo et al. [43]. They proposed CI-Odor, a tool that analyzes artifacts produced during CI, such as logs and revisions, to detect anti-patterns (e.g., builds becoming slow, developers working on feature branches for long periods) that occur over time and cause CI decay. In contrast to that work, we focus on anti-patterns that can be statically detected in configuration files.
Troubleshooting build failures is challenging and often causes delays in the delivery process. Previous work [46] proposed a taxonomy of build failures based on their root causes. Researchers have implemented solutions that automatically repair some of these build failure types [35,41]. Another tool [45] improves the understandability of build failures through log summarization. Despite those approaches, developers still allow failures [25,27]. This strengthens our motivation for including Fake Success in our linter.
Finally, other related works are devoted to the detection of smells in IaC scripts. Sharma et al. [39] leveraged best practices associated with code quality management to assess configuration code quality and derived a catalog of configuration smells for IaC scripts developed in Puppet. While those smells are more similar to traditional code smells (i.e., they concern the maintainability and understandability of Puppet code), CD-Linter detects smells specific to the CI/CD configuration, i.e., violations of CD principles. Rahman et al. [37] implemented a linter that detects seven types of security problems in IaC scripts. Their work is complementary to ours, as it deals with a very specific category of problems related to IaC scripts. Many of their security smells can also occur in CI/CD pipelines.

SUMMARY
Previous work has introduced generic [43] or specialized [26,37] linters that can help developers improve their CD configuration. In contrast to previous work on CI smells that relies on historical information [43], in this paper we proposed CD-Linter, a static analysis tool able to identify four types of CD smells in CD pipelines right when they are introduced in the pipeline configuration. Our empirical evaluation has shown that the supported CD smells are relevant in practice, that CD-Linter is accurate, and that the supported smells frequently occur in the wild. Linters generally suffer from many false positives, sometimes up to 90% and more [47], but CD-Linter reaches a precision of 87% and a recall of 94%, which represents an acceptable result and a good compromise. In a large set of 5,312 projects, we found that 31% of the pipelines with long configuration files are affected by at least one instance of the detected smells. The empirical evaluation of CD-Linter, and especially the developer feedback that we received for RQ 1, illustrates the usefulness of CD-Linter and allowed us to distill insights that can foster the adoption of CD-Linter in practice and stimulate research on similar tools to further advance this area.