Published April 2, 2021 | Version v1
Presentation Open

An Empirical Study of Flaky Tests in Python

  • 1. BMW Group

Description

Tests that cause spurious failures without any code
changes, i.e., flaky tests, hamper regression testing, increase
maintenance costs, may shadow real bugs, and decrease trust
in tests. While the prevalence and importance of flakiness are
well established, prior research focused on Java projects, thus
raising the question of how the findings generalize. In order to
provide a better understanding of the role of flakiness in software
development beyond Java, we empirically study the prevalence,
causes, and degree of flakiness within software written in Python,
currently one of the most popular programming languages. For
this, we sampled 22 352 open source projects from the popular
PyPI package index, and analyzed their 876 186 test cases for
flakiness. Our investigation suggests that flakiness is equally
prevalent in Python as it is in Java. The reasons, however, are
different: Order dependency is a much more dominant problem
in Python, causing 59% of the 7 571 flaky tests in our dataset.
Another 28% were caused by test infrastructure problems, which
represent a previously undocumented cause of flakiness. The
remaining 13% can mostly be attributed to the use of network
and randomness APIs by the projects, which is indicative of the
type of software commonly written in Python. Our data also
suggests that finding flaky tests requires more runs than are
often done in the literature: On average, 170 reruns would be
required to be 95% confident that a passing test case is not flaky.
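
To make the relationship between reruns and confidence concrete, the following is a minimal sketch under a simplifying assumption that is not taken from the description above: a flaky test fails independently on each run with a fixed probability. Under that model, the chance of seeing n consecutive passes from a flaky test is (1 - p)^n, so the number of reruns follows from solving (1 - p)^n <= 1 - confidence. The function name `reruns_needed` and its parameters are hypothetical, and this is not necessarily how the study derived its figure.

```python
import math


def reruns_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that a test failing independently with probability
    `failure_rate` per run would show at least one failure in n runs with
    probability >= `confidence`.

    Assumption (not from the source): failures are independent across runs
    and the per-run failure probability is constant.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - failure_rate) ** n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Illustrative per-run failure rates, chosen only for demonstration.
    for rate in (0.5, 0.1, 0.0175):
        print(f"failure rate {rate}: {reruns_needed(rate)} reruns")
```

In this simplified model, a requirement of 170 reruns corresponds to a per-run failure rate of roughly 1.75%, which illustrates why a handful of reruns can miss tests that fail only rarely.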

Files

An empirical analysis of flaky tests in Python.mp4

Files (120.9 MB)