Software permeates every aspect of human life. We are all well aware of the software in our phones, laptops, and web services, but we also rely on large software systems, without thinking about it, when we fly, drive, cook, play, bank, or shop: in just about any human activity in the modern world. Software engineering researchers have made tremendous advances in producing software faster, at lower cost, and at higher quality for all these different uses.

Despite great advances in quality, programmers still make mistakes every now and then. Because of the sheer ubiquity of software, these software “bugs” can cause serious disruptions. So software engineering researchers are still hard at work, finding ways to detect, localize, and sometimes even automatically recover from bugs.

These software quality researchers, as they test their ideas, need to know how well their ideas work. Can they find, diagnose, localize, and even fix real bugs, the kind that real programmers make? For this they need access to large collections of REPRODUCIBLE software defects that they can actually re-create on demand; access to a large, real dataset of actual defects would help them assess the value of their ideas and learn how to make them even better. Software defects, however, are hard to reproduce. Typically, a large set of hardware and software elements must be present simultaneously, in the same versions, for a defect to be reproduced.

A team of researchers at UC Davis, led by Prof. Cindy Rubio Gonzalez of the Computer Science department, has discovered a way to create a very large dataset of such reproducible defects, and has received over one million dollars from the National Science Foundation to do just that. The team includes Dr. Bogdan Vasilescu, a post-doc (who will be starting as a Professor at Carnegie Mellon University in the fall); Prof. Prem Devanbu, also a faculty member in the CS department; and two undergraduate students, Yichen Wang and Naji Dmeiri.

The work of Prof. Rubio Gonzalez's group began with the recognition that most modern software development takes place in the cloud: programmers create code and fix mistakes on the cloud, leaving an extensive, fully transcribed and archived record of all their activities. These records can then be extracted and processed to create “replication packages” that allow every single creation, error, and repair activity that programmers engage in to be reproduced. Cloud development services such as GitHub offer the possibility of creating literally THOUSANDS of such replication packages. The team has proposed to leverage this and other popular technologies to create a large-scale repository of reproducible defects, tests, and patches called BugSwarm. This idea was recognized by the NSF as innovative, important, and transformative for future research in software quality, and was funded at the full amount requested by the researchers.

The NSF Award Abstract can be found here.