ICST 2017: The Theory of Composite Faults

Fault masking happens when the effect of one fault masks that of another fault for particular test inputs. Testing practitioners rely on the coupling effect to ensure that fault masking is rare. It states that complex faults are coupled to simple faults in such a way that a test data set that detects all simple faults in a program will also detect a high percentage of the complex faults.
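To make the two notions concrete, here is a minimal sketch (all functions hypothetical): a composite fault built from two simple faults is masked only on inputs where the two faults cancel, so test data that detects the simple faults will usually detect the composite one as well.

```python
def correct(x):
    return (x + 1) * 2

def simple_fault_a(x):
    return (x - 1) * 2   # single fault: '+' mutated to '-'

def simple_fault_b(x):
    return (x + 1) + 2   # single fault: '*' mutated to '+'

def composite_fault(x):
    return (x - 1) + 2   # both simple faults together

# Fault masking: solve (x - 1) + 2 == (x + 1) * 2, i.e. x + 1 == 2x + 2,
# so the composite fault is hidden exactly at x == -1.
masked = [x for x in range(-5, 6) if composite_fault(x) == correct(x)]
print(masked)  # [-1]

# Coupling: a test set that kills both simple faults also kills the
# composite, as long as it is not composed solely of masking inputs.
tests = [0, 3]
def kills(f):
    return any(f(x) != correct(x) for x in tests)

assert kills(simple_fault_a) and kills(simple_fault_b) and kills(composite_fault)
```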

While this effect has been evaluated empirically, our theoretical understanding of the coupling effect is as yet incomplete. Wah proposed a theory of the coupling effect on finite bijective (or near-bijective) functions with the same domain and co-domain, assuming a uniform distribution over candidate functions. This model, however, was criticized as too simple to model real systems, as it did not account for the differing domains and co-domains of real programs, or for syntactic neighborhood. We propose a new theory of fault coupling for general functions (with certain constraints). We show that there are two kinds of fault interaction, of which only the weak interaction can be modeled by the theory of the coupling effect. The strong interaction can produce faults that are semantically different from the original faults; these faults should hence be considered independent atomic faults. Our analysis shows that the theory holds even when the effect of the syntactic neighborhood of the program is considered. We analyze numerous real-world programs with real faults to validate our hypothesis.


FSE 2016: Can Testedness be Effectively Measured?

Among the major questions that a practicing tester faces are deciding where to focus additional testing effort, and deciding when to stop testing. Test the least-tested code, and stop when all code is well-tested, is a reasonable answer. Many measures of “testedness” have been proposed; unfortunately, we do not know whether these are truly effective.

In this paper we propose a novel evaluation of two of the most important and widely-used measures of test suite quality. The first measure is statement coverage, the simplest and best-known code coverage measure. The second measure is mutation score, a supposedly more powerful, though expensive, measure.
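At their core, both measures reduce to simple ratios, sketched below with toy, hypothetical numbers: the fraction of statements the suite executes, and the fraction of generated mutants the suite kills.

```python
# Toy, hypothetical numbers illustrating the two measures under evaluation.

covered = {1, 2, 3, 5}           # statements executed by the test suite
statements = {1, 2, 3, 4, 5, 6}  # all executable statements in the program
statement_coverage = len(covered & statements) / len(statements)

killed, generated = 30, 50       # mutants detected vs. mutants generated
mutation_score = killed / generated

print(f"coverage: {statement_coverage:.2f}, mutation score: {mutation_score:.2f}")
# coverage: 0.67, mutation score: 0.60
```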

We evaluate these measures using the actual criteria of interest: if a program element is (by these measures) well tested at a given point in time, it should require fewer future bug-fixes than a “poorly tested” element. If not, then it seems likely that we are not effectively measuring testedness. Using a large number of open source Java programs from Github and Apache, we show that both statement coverage and mutation score have only a weak negative correlation with bug-fixes. Despite the lack of strong correlation, there are statistically and practically significant differences between program elements for various binary criteria. Program elements (other than classes) covered by any test case see about half as many bug-fixes as those not covered, and a similar line can be drawn for mutation score thresholds. Our results have important implications for both software engineering practice and research evaluation.


Software Quality Journal 2016: Does The Choice of Mutation Tool Matter?

Mutation analysis is the primary means of evaluating the quality of test suites, though it suffers from inadequate standardization. Mutation analysis tools vary based on language, when mutants are generated (phase of compilation), and target audience. Mutation tools rarely implement the complete set of operators proposed in the literature, and most implement at least a few domain-specific mutation operators. Thus, different tools may not always agree on the mutant kills of a test suite, and few criteria exist to guide a practitioner in choosing a tool, or a researcher in comparing previous results. We investigate an ensemble of measures, such as traditional difficulty of detection, strength of minimal sets, diversity of mutants, and the information carried by the mutants produced, to evaluate the efficacy of mutant sets. By these measures, mutation tools rarely agree, often with large differences, and the variation due to project, even after accounting for differences due to test suites, is significant. However, the mean difference between tools is very small, indicating that no single tool consistently skews mutation scores high or low for all projects. These results suggest that research using a single tool, a small number of projects, or small increments in mutation score may not yield reliable results. There is a clear need for greater standardization of mutation analysis; we propose one approach for such a standardization.


ICSTW 2016: Measuring Effectiveness of Mutant Sets

Redundant mutants, where multiple mutants end up producing the same semantic variant of the program, are a major problem in mutation analysis, and a measure of effectiveness is an essential tool for evaluating mutation tools, new operators, and reduction techniques. Previous research suggests using the size of the disjoint mutant set as an effectiveness measure.

We start from a simple premise: that test suites need to be judged both on the number of unique variations in specifications they detect (a measure of variation), and on how good they are at detecting harder-to-find bugs (a measure of subtlety). Hence, any set of mutants should be judged on how well it supports these measurements.

We show that the disjoint mutant set, when used as a measure of effectiveness in variation, has two major inadequacies stemming from its reliance on minimal test suites: the single variant assumption and the large test suite assumption. We also show that, when used to emulate hard-to-find bugs (as a measure of subtlety), it discards useful mutants.

We propose two alternative measures: one oriented toward effectiveness in variation, vulnerable to neither the single variant assumption nor the large test suite assumption, and the other oriented toward effectiveness in subtlety. We provide a benchmark of these measures using diverse tools.
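The core objects can be sketched from a toy kill matrix (all names hypothetical): mutants with identical kill sets are indistinguishable to the suite, and a mutant whose killing tests form a strict subset of another's is harder and subsumes it; the disjoint set keeps only the hardest representatives.

```python
# kill_matrix[m] = set of tests that kill mutant m (toy, hypothetical data)
kill_matrix = {
    "m1": {"t1", "t2"},
    "m2": {"t1", "t2"},   # indistinguishable from m1: identical kill set
    "m3": {"t1"},         # harder: killed only by t1, subsumes m1 and m2
}

def subsumes(a, b, km):
    # a subsumes b if every test killing a also kills b (a is strictly harder)
    return bool(km[a]) and km[a] <= km[b] and km[a] != km[b]

# keep one representative per distinct kill set, then drop subsumed mutants
distinct = {frozenset(v): k for k, v in kill_matrix.items()}
disjoint = [m for m in distinct.values()
            if not any(subsumes(o, m, kill_matrix)
                       for o in kill_matrix if o != m)]
print(sorted(disjoint))  # ['m3']
```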


ICSE 2016: On the limits of mutation reduction strategies

Although mutation analysis is considered the best way to evaluate the effectiveness of a test suite, its hefty computational cost often limits its use. To address this problem, various mutation reduction strategies have been proposed, all seeking to gain efficiency by reducing the number of mutants while maintaining the representativeness of an exhaustive mutation analysis. While research has focused on the efficiency of reduction, the effectiveness of these strategies in selecting representative mutants, and the limits in doing so, have not been investigated.

We investigate the practical limits to the effectiveness of mutation reduction strategies, and provide a simple theoretical framework for thinking about the absolute limits. Our results show that the limit on improvement in effectiveness over random sampling for real-world open source programs is 13.078% (mean). Interestingly, there is no limit to the improvement that can be made by the addition of new mutation operators.

Given that this is the maximum that can be achieved with perfect advance knowledge of mutant kills, what can be achieved in practice is likely to be much worse. We conclude that more effort should be focused on enhancing mutations than on removing operators in the name of selective mutation, for questionable benefit.


ISSRE 2015: How hard does mutation analysis have to be, anyway?

Mutation analysis is considered the best method for measuring the adequacy of test suites. However, the number of test runs required for a full mutation analysis grows faster than project size, which makes it infeasible for real-world software projects, which often have more than a million lines of code. It is for projects of this size, however, that developers most need a method for evaluating the efficacy of a test suite. Various strategies have been proposed to deal with the explosion of mutants; however, these strategies at best reduce the number of mutants required to a fraction of all mutants, a number that still grows with program size. Running, for example, 5% of all mutants of a 2MLOC program usually requires analyzing over 100,000 mutants. Similarly, while various approaches have been proposed to tackle equivalent mutants, none eliminates the problem completely, and the fraction of equivalent mutants remaining is hard to estimate, often requiring manual analysis of equivalence.

In this paper, we provide both theoretical analysis and empirical evidence that a small, constant-size sample of mutants yields statistically similar results to a full mutation analysis, regardless of the size of the program or the similarity between mutants. We show that a similar approach, using a constant-size sample of inputs, can estimate the degree of stubbornness of the remaining mutants to a high degree of statistical confidence, and we provide a mutation analysis framework for Python that incorporates analysis of mutant stubbornness.
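The claim can be sketched with simulated data (all numbers below are hypothetical): the error of an estimated proportion depends on the sample size n, not on the population size, so a constant sample suffices for programs of any size.

```python
# Simulated sketch: estimating a mutation score from a constant-size sample.
import random

random.seed(0)
true_score = 0.7
# one boolean per mutant: True if some test kills it
population = [random.random() < true_score for _ in range(200_000)]

n = 1000                                  # constant sample size
sample = random.sample(population, n)
estimate = sum(sample) / n

# normal-approximation 95% half-width for a proportion
half_width = 1.96 * (estimate * (1 - estimate) / n) ** 0.5
print(f"estimated score: {estimate:.3f} +/- {half_width:.3f}")
# prints a value near 0.700, with a half-width around 0.03,
# no matter how large the population is made
```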


ASE 2015: How Verified is My Code? Falsification-Driven Verification


ESEM 2015: An empirical study of design degradation: how software projects get worse over time

Software decay is a key concern for large, long-lived software projects. Systems degrade over time as design and implementation compromises and exceptions pile up. However, there has been little research quantifying this decay, or examining how software projects deal with this issue. While the best approach to improving the quality of a project is to spend time both reducing software defects (bugs) and addressing design issues (refactoring), we find that design issues are frequently ignored in favor of fixing defects. We find that design issues have a higher chance of being fixed in the early stages of a project, and that efforts to correct them stall as projects mature and code bases grow, leading to a build-up of design problems. From studying a large set of open source projects, our research suggests that while core contributors tend to fix design issues more often than non-core contributors, there is no difference once the relative quantity of commits is accounted for.


ISSRE 2014: Mutations: How close are they to real faults?

Mutation analysis is often used to compare the effectiveness of different test suites or testing techniques. One of the main assumptions underlying this technique is the Competent Programmer Hypothesis, which proposes that programs are very close to a correct version, or that the difference between current and correct code for each fault is very small. Researchers have assumed on the basis of the Competent Programmer Hypothesis that the faults produced by mutation analysis are similar to real faults. While there exists some evidence that supports this assumption, these studies are based on analysis of a limited and potentially non-representative set of programs and are hence not conclusive. In this paper, we separately investigate the characteristics of bugfixes and other changes in a very large set of randomly selected projects using four different programming languages. Our analysis suggests that a typical fault involves about three to four tokens, and is seldom equivalent to any traditional mutation operator. We also find the most frequently occurring syntactical patterns, and identify the factors that affect the real bug-fix change distribution. Our analysis suggests that different languages have different distributions, which in turn suggests that operators optimal in one language may not be optimal for others. Moreover, our results suggest that mutation analysis stands in need of better empirical support of the connection between mutant detection and detection of actual program faults in a larger body of real programs.
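The contrast can be sketched with hypothetical snippets: a traditional mutation operator changes a single token, while the typical real fault reported above involves a change of three to four tokens at once.

```python
# Single-token mutation: relational operator '<' replaced by '<='.
def in_bounds(i, n):
    return 0 <= i < n

def in_bounds_mutant(i, n):
    return 0 <= i <= n           # differs from the original by one token

# A real-fault-style fix usually touches several tokens at once, e.g. a
# missing-default bug whose fix replaces indexing with a .get() call:
def lookup_buggy(table, key):
    return table[key]            # raises KeyError on missing keys

def lookup_fixed(table, key):
    return table.get(key, None)  # fix changes ~4 tokens, not one operator

# The boundary input i == n distinguishes the mutant from the original.
print(in_bounds(3, 3), in_bounds_mutant(3, 3))  # False True
```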

The full paper is available here


Sunbelt 2014: Temporal Visualization of Dynamic Collaboration Graphs of OSS Software Forks

In this work, we studied the collaboration networks of three open source projects using a combined method of temporal visualization and temporal quantitative analysis. We based our study on two papers, [Robles and Gonzalez-Barahona 2012] and [Hanneman and Klamma 2013], and identified three projects that had forked in the recent past. We mined the collaboration data, formed dynamic collaboration graphs, and measured social network analysis metrics over 18-month time windows. We also visualized the dynamic graph (available online) and plotted stacked area charts over time. The visualizations and the quantitative results showed the differences among the projects across the three forking reasons: personal differences among the developer teams, technical differences (addition of new functionality), and more community-driven development. The project representative of personal differences was identifiable, as was the date it forked, to within a month. The novelty of the approach lies in applying temporal rather than static analysis, and in the temporal visualization of community structure. We showed that this approach sheds light on the structure of these projects and reveals information that cannot be seen otherwise.


ICSE 2014: Code Coverage for Suite Evaluation by Developers

One of the most fundamental concerns of developers testing code is how to determine whether a test suite strikes a good balance between the cost of undetected faults and the cost of further testing. The most common approach may be to use code coverage as a measure of test suite quality, with diminishing returns in increased coverage or high absolute coverage as a stopping rule. In testing research, suite quality is often evaluated by measuring the ability of a suite to kill mutations: artificially seeded code changes. Mutation testing is effective but expensive and complex, and thus seldom used by practitioners. Determining which criteria best predict mutation kills is therefore critical to practical estimation of suite quality. Previous work uses only a small set of programs, and usually compares multiple suites for a single program. Practitioners, however, seldom compare suites; they evaluate one suite. Using suites (both manual and automatically generated) from a large set of real-world open-source projects, we show that results for evaluation differ from those for suite comparison: statement (not block, branch, or path) coverage predicts mutation kills best.

The full paper is available here


OSS 2014: An Exploration of Factors Affecting Code Quality in FOSS Projects

It is a widely held belief that Free/Open Source Software (FOSS) development leads to the creation of software of the same, if not higher, quality compared to that created using proprietary software development models. However, there is little research evaluating the quality of FOSS code, or the impact of project characteristics such as age, number of core developers, and code-base size. In this exploratory study, we examined 110 FOSS projects, measuring the quality of the code and architectural design using code smells. We found that, contrary to our expectations, the overall quality of the code is not affected by the size of the code base, but that it was negatively impacted by growth in the number of code contributors. Our results also show that projects with more core developers do not necessarily have better code quality.


OSS 2014: Drawing the Big Picture: Temporal Visualization of Dynamic Collaboration Graphs of OSS Software Forks

How can we understand FOSS collaboration better? Can social issues that emerge be identified and addressed as they happen? Can the community heal itself, become more transparent and inclusive, and promote diversity? We propose a technique to address these issues through quantitative analysis and temporal visualization of social dynamics in FOSS communities. We used social network analysis metrics to identify growth patterns and unhealthy dynamics; this gives the community a heads-up while it can still take action to ensure the sustainability of the project.


CHI 2014: Abandonment of Social Networks: Shift from Use to Non-Use and Experiences of Technology Non-Use

In this paper we describe qualitative research on the abandonment of a social network, namely Facebook, and why some people opt to terminate their use. Interviews were conducted with subjects who previously used the site daily and have now opted for non-use. Four major themes were found to contribute to this technology abandonment. The insider stories shared by the interviewees about their technology non-use shed light on the contributing factors leading to the shift from user to non-user.

