SPLASH 2020
Sun 15 - Sat 21 November 2020 Online Conference
Tue 17 Nov 2020 09:20 - 09:40 at SPLASH-I - T-2 Chair(s): Karim Ali, Aritra Sengupta
Tue 17 Nov 2020 21:20 - 21:40 at SPLASH-I - T-2 Chair(s): Yaoda Zhou, Iulian Neamtiu

Flaky tests are tests that can non-deterministically pass or fail for the same code version. These tests undermine regression testing efficiency, because developers cannot easily tell whether a test fails due to their recent changes or due to flakiness. Ideally, one would detect flaky tests right when flakiness is introduced, so that developers can immediately remove it. Some software organizations, e.g., Mozilla and Netflix, run tools, called detectors, to detect flaky tests as soon as possible. However, detecting flaky tests is costly due to their inherent non-determinism, so even state-of-the-art detectors are often impractical to run on all tests for every project change. To combat the high cost of applying detectors, these organizations typically run a detector solely on newly added or directly modified tests, i.e., not on unmodified tests or when other changes occur (including changes to the test suite, the code under test, and library dependencies). However, it is unclear how many flaky tests are detected, and how many are missed, when detectors are applied only in these limited circumstances.
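
For illustration, here is a minimal sketch of a timing-dependent flaky test in Java (hypothetical code written for this summary, not taken from the paper or from the studied projects), using JUnit 4. The assertion can pass or fail for the same code version depending on how the background thread is scheduled, which is the kind of non-determinism that makes detection costly: detectors typically have to rerun tests, possibly under perturbed schedules or test orders, to expose such behavior.

import static org.junit.Assert.assertEquals;

import java.util.concurrent.atomic.AtomicInteger;

import org.junit.Test;

// Hypothetical example of an "async wait" flaky test.
public class CounterTest {

    @Test
    public void incrementsInBackground() throws Exception {
        AtomicInteger counter = new AtomicInteger(0);

        // Code under test (inlined for brevity): increment the counter
        // on a background thread.
        Thread worker = new Thread(counter::incrementAndGet);
        worker.start();

        // Flaky step: the test sleeps instead of joining the worker thread,
        // so the assertion below races with the increment. On a fast, idle
        // machine it usually passes; under load it can fail with no code change.
        Thread.sleep(10);
        assertEquals(1, counter.get());
    }
}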

To better understand this problem, we conduct a large-scale longitudinal study of flaky tests to determine when flaky tests become flaky and what changes cause them to become flaky. We apply two state-of-the-art detectors to 55 Java projects, identifying a total of 245 flaky tests that can be compiled and run in the code version where each test was added. We find that 75% of flaky tests (184 out of 245) are flaky when added, indicating substantial potential value for developers to run detectors specifically on newly added tests. However, running detectors solely on newly added tests would still miss 25% of flaky tests. The percentage of flaky tests that can be detected increases to 85% when detectors are run on newly added or directly modified tests. The remaining 15% of flaky tests become flaky due to other changes and can be detected only when detectors are always applied to all tests. Our study is the first to empirically evaluate when tests become flaky and to recommend guidelines for applying detectors in the future.
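
For concreteness, in absolute terms out of the 245 flaky tests: 184 are flaky when added (75%); roughly 208 (85%) would be detected by also running detectors on directly modified tests (the count is approximate because the reported percentage is rounded); and the remaining roughly 37 (15%) become flaky due to other changes and are detected only if detectors are applied to all tests.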

Tue 17 Nov

Displayed time zone: Central Time (US & Canada)

09:00 - 10:20: T-2 (OOPSLA) at SPLASH-I +12h
Chair(s): Karim Ali (University of Alberta), Aritra Sengupta (Amazon Web Services, USA)
09:00 (20m) Talk: Formulog: Datalog for SMT-Based Static Analysis (OOPSLA)
Aaron Bembenek (Harvard University), Michael Greenberg (Pomona College), Stephen Chong (Harvard University)
09:20 (20m) Talk: A Large-Scale Longitudinal Study of Flaky Tests (OOPSLA)
Wing Lam (University of Illinois at Urbana-Champaign), Stefan Winter (TU Darmstadt), Anjiang Wei (Peking University), Tao Xie (Peking University), Darko Marinov (University of Illinois at Urbana-Champaign), Jonathan Bell (Northeastern University)
09:40 (20m) Talk: Handling Bidirectional Control Flow (OOPSLA)
Yizhou Zhang (University of Waterloo), Guido Salvaneschi (University of St. Gallen), Andrew Myers (Cornell University)
10:00 (20m) Talk: WATCHER: In-Situ Failure Diagnosis (OOPSLA)
Hongyu Liu (Purdue University), Sam Silvestro (University of Texas at San Antonio), Xiangyu Zhang (Purdue University), Jian Huang (University of Illinois at Urbana-Champaign), Tongping Liu (University of Massachusetts at Amherst)

21:00 - 22:20: T-2 (OOPSLA) at SPLASH-I
Chair(s): Yaoda Zhou (University of Hong Kong), Iulian Neamtiu (New Jersey Institute of Technology)
21:00 (20m) Talk: Formulog: Datalog for SMT-Based Static Analysis (OOPSLA)
Aaron Bembenek (Harvard University), Michael Greenberg (Pomona College), Stephen Chong (Harvard University)
21:20 (20m) Talk: A Large-Scale Longitudinal Study of Flaky Tests (OOPSLA)
Wing Lam (University of Illinois at Urbana-Champaign), Stefan Winter (TU Darmstadt), Anjiang Wei (Peking University), Tao Xie (Peking University), Darko Marinov (University of Illinois at Urbana-Champaign), Jonathan Bell (Northeastern University)
21:40 (20m) Talk: Handling Bidirectional Control Flow (OOPSLA)
Yizhou Zhang (University of Waterloo), Guido Salvaneschi (University of St. Gallen), Andrew Myers (Cornell University)
22:00 (20m) Talk: WATCHER: In-Situ Failure Diagnosis (OOPSLA)
Hongyu Liu (Purdue University), Sam Silvestro (University of Texas at San Antonio), Xiangyu Zhang (Purdue University), Jian Huang (University of Illinois at Urbana-Champaign), Tongping Liu (University of Massachusetts at Amherst)