Publications

Published works directly discussed in my dissertation

01 :: Learning to predict software testability

Software testability is the propensity of code to reveal its existing faults, particularly during automated testing. Testing success therefore depends on the testability of the program under test; at the same time, it relies on the coverage achieved by the test data that a given test data generation algorithm provides. However, little empirical evidence exists to clarify whether and how software testability affects test coverage. In this article, we propose a method to shed light on this subject. Our framework uses the coverage of the Software Under Test (SUT), obtained from different automatically generated test suites, to build machine learning models that determine the testability of programs based on a large set of source code metrics. The resulting models can predict the code coverage that a given test data generation algorithm would provide before the algorithm is run, reducing the cost of additional testing. The predicted coverage is used as a concrete proxy to quantify source code testability. Experiments show an acceptable accuracy of 81.94% in measuring and predicting software testability.
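
A minimal sketch of the kind of pipeline described above, assuming a hypothetical CSV of class-level source code metrics labeled with the coverage achieved by automatically generated test suites (the file name, column names, and model choice are illustrative assumptions, not the paper's exact setup):

    # Sketch: predict test coverage (used as a proxy for testability) from source code metrics.
    # The dataset path, column names, and regressor are illustrative assumptions.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("class_metrics_with_coverage.csv")  # hypothetical labeled dataset
    X = df.drop(columns=["coverage"])                     # source code metrics per class
    y = df["coverage"]                                    # coverage achieved by generated tests

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=300, random_state=42)
    model.fit(X_train, y_train)

    # The trained model estimates coverage before any test generation is run.
    print("R2 on held-out classes:", r2_score(y_test, model.predict(X_test)))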

02 :: Learning to predict test effectiveness

The high cost of testing can be dramatically reduced, provided that the coverability, an inherent feature of the code under test, is predictable. This article offers a machine learning model to predict the extent to which tests can cover a class, expressed by a new metric called Coverageability. The prediction model is an ensemble of four regression models. The learning samples consist of feature vectors whose features are source code metrics computed for a class, labeled with the Coverageability values computed for the corresponding classes. We offer a mathematical model to evaluate test effectiveness in terms of the size and coverage of the test suite generated automatically for each class. We extend the feature space by introducing a new approach for defining sub-metrics in terms of existing source code metrics. Using feature importance analysis on the learned prediction models, we rank source code metrics by their impact on test effectiveness and find class strict cyclomatic complexity to be the most influential metric. Our experiments on a large corpus of Java projects containing about 23,000 classes demonstrate a Mean Absolute Error (MAE) of 0.032, a Mean Squared Error (MSE) of 0.004, and an R2 score of 0.855. Compared with the state-of-the-art coverage prediction models, our models improve MAE, MSE, and the R2 score by 5.78%, 2.84%, and 20.71%, respectively.
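
As an illustration of the ensemble idea (four regressors combined into a single predictor) and the feature importance analysis mentioned above, a hedged sketch follows; the base learners, the placeholder data, and the hyperparameters are assumptions, not the paper's exact configuration:

    # Sketch: an ensemble of four regressors predicting Coverageability from metric vectors,
    # followed by permutation feature importance. Data and base learners are illustrative.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.neural_network import MLPRegressor
    from sklearn.tree import DecisionTreeRegressor

    X = np.random.rand(500, 20)  # placeholder vectors of 20 hypothetical source code metrics
    y = np.random.rand(500)      # placeholder Coverageability labels

    ensemble = VotingRegressor([
        ("rf", RandomForestRegressor(random_state=0)),
        ("gb", GradientBoostingRegressor(random_state=0)),
        ("dt", DecisionTreeRegressor(random_state=0)),
        ("mlp", MLPRegressor(max_iter=2000, random_state=0)),
    ])
    ensemble.fit(X, y)

    # Rank metrics by their impact on the prediction (cf. the feature importance analysis).
    imp = permutation_importance(ensemble, X, y, n_repeats=5, random_state=0)
    print("Most influential metric indices:", np.argsort(imp.importances_mean)[::-1][:5])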

03 :: An ensemble meta-estimator to predict source code testability

Unlike most other software quality attributes, testability cannot be evaluated solely based on the characteristics of the source code. The effectiveness of the test suite and the budget assigned to testing highly impact the testability of the code under test. The size of a test suite determines the test effort and cost, while the coverage measure indicates the test effectiveness. Therefore, testability can be measured based on the coverage and number of test cases provided by a test suite, considering the test budget. This paper offers a new equation to estimate testability from the size and coverage of a given test suite. The equation has been used to label 23,000 classes belonging to 110 Java projects with their testability measure. The labeled classes were vectorized using 262 metrics, and the labeled vectors were fed into a family of supervised regression algorithms to predict testability in terms of the source code metrics. The regression models predicted testability with an R2 of 0.68 and a mean squared error of 0.03, which is suitable in practice. Fifteen software metrics that highly affect testability prediction were identified using a feature importance analysis technique on the learned model. Thanks to new criteria, metrics, and data, the proposed models improve the mean absolute error by 38% compared with the most relevant study, which predicts branch coverage as a test criterion. As an application of testability prediction, we demonstrate that automated refactoring of 42 smelly Java classes targeted at improving the 15 influential software metrics elevates their testability by an average of 86.87%.
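
The paper's testability equation is not reproduced here; purely to illustrate the idea that coverage is rewarded while test-suite size is penalized against the available budget, one hypothetical form could be written as:

    % Illustrative form only -- not the equation proposed in the paper.
    % Testability of class c grows with achieved coverage and shrinks as the
    % generated test suite consumes more of the test budget.
    T(c) = \mathit{Coverage}(c)\,\Bigl(1 - \frac{|\mathit{TestSuite}(c)|}{\mathit{Budget}}\Bigr)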

04 :: Method name recommendation based on source code metrics

Method naming is a critical factor in program comprehension and affects software quality. State-of-the-art naming techniques use deep learning to compute the similarity of methods based on their textual content. They depend heavily on identifier names and do not capture the semantic interrelations among a method's instructions, whereas source code metrics do capture such interrelations. This article proposes using source code metrics to measure semantic and structural cross-project similarities. The metrics constitute the features of a KNN model that determines the k most similar methods to a given method. Experiments with 4,000,000 Java methods demonstrate that the proposed model improves the precision and recall of state-of-the-art approaches by 4.25% and 12.08%, respectively.
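
A minimal sketch of the KNN idea described above, assuming placeholder metric vectors and method names (the number of metrics, the distance function, and k are illustrative):

    # Sketch: recommend method names from the k most similar methods, where similarity
    # is computed over source code metric vectors. Data, metric count, and k are assumptions.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    metric_vectors = np.random.rand(1000, 30)            # placeholder: 30 metrics per method
    method_names = [f"method_{i}" for i in range(1000)]  # placeholder existing method names

    knn = NearestNeighbors(n_neighbors=5, metric="euclidean")
    knn.fit(metric_vectors)

    query = np.random.rand(1, 30)                        # metrics of the method to be named
    _, indices = knn.kneighbors(query)
    print("Name candidates from similar methods:", [method_names[i] for i in indices[0]])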

05 :: Natural language requirements testability measurement based on requirement smells

Requirements form the basis for defining software systems' obligations and tasks. Testable requirements help prevent failures, reduce maintenance costs, and make it easier to perform acceptance tests. However, despite the importance of measuring and quantifying requirements testability, no automated approach has been proposed to measure it based on requirement smells, which are at odds with testability. This paper presents a mathematical model to evaluate and rank the testability of natural language requirements based on an extensive set of nine requirement smells, detected automatically, and on acceptance test effort determined by requirement length and application domain. Most of the smells stem from uncountable adjectives and from context-sensitive and ambiguous words, and a comprehensive dictionary is required to detect such words. We offer a neural word embedding technique to generate such a dictionary automatically. Using the dictionary, we could automatically detect the Polysemy smell (domain-specific ambiguity), for the first time, in 10 application domains. Our empirical study on nearly 1,000 software requirements from six well-known industrial and academic projects demonstrates that the proposed smell detection approach outperforms Smella, a state-of-the-art tool, in detecting requirement smells. The precision and recall of smell detection improve by an average of 0.03 and 0.33, respectively, compared to the state of the art. The proposed model measures the testability of 985 requirements with a mean absolute error of 0.12 and a mean squared error of 0.03, demonstrating its potential for practical use.
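
A hedged sketch of the dictionary-generation step, assuming a tiny placeholder corpus and hypothetical seed words (a real corpus of requirements and domain-specific seeds would be used instead):

    # Sketch: grow a dictionary of smell-indicating words from a few seed terms using
    # word embeddings trained on requirement texts. Corpus, seeds, and hyperparameters
    # are illustrative assumptions, not the paper's configuration.
    from gensim.models import Word2Vec

    requirement_texts = [
        "The system shall respond quickly to user requests",
        "The interface should be user friendly and intuitive for the user",
    ]  # placeholder requirements; the study uses a much larger corpus

    corpus = [req.lower().split() for req in requirement_texts]
    model = Word2Vec(corpus, vector_size=100, window=5, min_count=2)

    seeds = ["appropriate", "fast", "friendly"]  # hypothetical vague/ambiguous seed words
    dictionary = set(seeds)
    for seed in seeds:
        if seed in model.wv:  # expand only seeds present in the embedding vocabulary
            dictionary.update(word for word, _ in model.wv.most_similar(seed, topn=10))
    print(sorted(dictionary))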

06 :: Testability-driven development: An improvement to the TDD efficiency

Test-first development (TFD) is a software development approach in which automated tests are written before the actual code. TFD offers many benefits, such as improving code quality, reducing debugging time, and enabling easier refactoring. However, it also poses challenges and limitations, requiring more effort and time to write and maintain test cases, especially for large and complex projects. Refactoring for testability means improving the internal structure of source code to make it easier to test; it can reduce the cost and complexity of software testing and speed up the test-first life cycle. Measuring testability is a vital step before refactoring for it, as the measurement provides a baseline for evaluating the current state of the software and identifying the areas that need improvement. This paper proposes a mathematical model for calculating class testability based on test effectiveness and effort, and a machine learning regression model that predicts testability from source code metrics. It also introduces a testability-driven development (TsDD) method that guides the TFD process toward developing testable code. TsDD focuses on improving testability and reducing testing costs by measuring testability frequently and refactoring to increase it, without running the program. Our testability prediction model has a mean squared error of 0.0311 and an R2 score of 0.6285. We illustrate the usefulness of TsDD by applying it to 50 Java classes from three open-source projects, achieving an average improvement of 77.81% in the testability of these classes. Experts' manual evaluation confirms the potential of TsDD in accelerating the TDD process.
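
A minimal sketch of the TsDD gating idea (predict testability statically and refactor when it drops below an acceptance level); the metric extractor, the prediction model, and the threshold are hypothetical placeholders:

    # Sketch: gate each test-first iteration on predicted testability, computed from
    # static source code metrics without executing the program. The helper names and
    # the threshold are illustrative assumptions.
    TESTABILITY_THRESHOLD = 0.6  # hypothetical acceptance level

    def tsdd_step(class_source, extract_metrics, testability_model):
        """Decide whether to refactor for testability before writing the next test."""
        metrics = extract_metrics(class_source)              # static metrics of the class
        predicted = testability_model.predict([metrics])[0]  # no test execution needed
        return "refactor for testability" if predicted < TESTABILITY_THRESHOLD else "write next test"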

07 :: A systematic literature review on the code smells datasets and validation mechanisms

The accuracy reported for code smell detection tools varies depending on the dataset used to evaluate them. Our survey of 45 existing datasets reveals that the adequacy of a dataset for smell detection depends highly on properties such as size, severity levels, project types, the number of each type of smell, the total number of smells, and the ratio of smelly to non-smelly samples. Most existing datasets support God Class, Long Method, and Feature Envy, while six smells in Fowler and Beck's catalog are not supported by any dataset. We conclude that existing datasets suffer from imbalanced samples, a lack of severity-level support, and restriction to the Java language.

08 :: A systematic literature review on source code similarity measurement and clone detection: Techniques, applications, and challenges

Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation and the detection of duplicate code, plagiarism, malware, and code smells. This paper presents a systematic literature review and meta-analysis of code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics across applications. We initially found over 10,000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques in five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while many programming languages have no support at all. A noteworthy point is the existence of 12 datasets related to source code similarity measurement and duplicate code, of which only eight are publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and attention to multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance phase.

---

Published works relevant to my dissertation

09 :: An automated extract method refactoring approach to correct the long method code smell

Long Method is among the most common code smells in software systems. Despite various attempts to detect it, few automated approaches have been presented to refactor this smell. Extract Method refactoring is mainly applied to eliminate the Long Method smell. However, current approaches still face serious problems, such as insufficient accuracy in detecting refactoring opportunities, limitations on correction types, the need for human intervention in the refactoring process, and a lack of attention to object-oriented principles, mainly the single responsibility and cohesion-coupling principles. This paper aims to automatically identify and refactor Long Method smells in Java code using graph analysis techniques, addressing the aforementioned difficulties. First, a graph representing the project entities is created. Then, Long Method smells are detected by considering the methods' dependencies and sizes. All possible refactorings are extracted and ranked by a modularity metric that favors high-cohesion, low-coupling classes for the detected methods. A proper name is then assigned to the extracted method based on its responsibility, and the best destination class is determined such that design modularity is maximized. Experts' opinions were used to evaluate the proposed approach on five different Java projects. The results show the applicability of the proposed method in establishing the single responsibility principle, with a 21% improvement compared to state-of-the-art Extract Method refactoring approaches.
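
As an illustration of ranking candidate extractions by modularity, a sketch follows; the statement dependency graph and the candidate partitions are made-up placeholders, and a real implementation would build the graph from program analysis:

    # Sketch: rank candidate Extract Method splits of a long method by graph modularity,
    # preferring high cohesion and low coupling. Graph and candidates are illustrative.
    import networkx as nx
    from networkx.algorithms.community import modularity

    # Hypothetical dependency graph: nodes are statements, edges are data/control dependencies.
    g = nx.Graph()
    g.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)])

    # Candidate refactorings: each partitions the statements into "kept" and "extracted" parts.
    candidates = [
        [{1, 2, 3, 4}, {5, 6, 7}],
        [{1, 2}, {3, 4, 5, 6, 7}],
    ]

    best = max(candidates, key=lambda parts: modularity(g, parts))
    print("Highest-modularity extraction:", best)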

10 :: Format-aware learn&fuzz: deep test data generation for efficient fuzzing

Appropriate test data are crucial to the success of fuzz testing. Most real-world applications, however, accept inputs with a complex structure, in which data are surrounded by meta-data and processed in several stages comprising parsing and rendering (execution). The complex structure of some input files makes it difficult to generate efficient test data automatically. The success of deep learning in coping with complex, especially generative, tasks has motivated us to exploit it for generating test data with complicated structures, such as PDF files. In this respect, a neural language model (NLM) based on deep recurrent neural networks (RNNs) is used to learn the structure of complex inputs. To target both the parsing and rendering steps of the software under test (SUT), our approach generates new test data while distinguishing between data and meta-data, which significantly improves input fuzzing. To assess the proposed approach, we have developed a modular file format fuzzer, IUST-DeepFuzz. Our experimental results demonstrate the relatively high coverage of MuPDF code achieved by IUST-DeepFuzz in comparison with state-of-the-art tools such as learn&fuzz, AFL, Augmented-AFL, and random fuzzing. In summary, our experiments with many deep learning models revealed that the simpler the model applied to generate test data, the higher the code coverage of the software under test.
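
A hedged sketch of a character-level recurrent language model of the kind the abstract describes, assuming a hypothetical training file of PDF objects; the corpus path, architecture, and hyperparameters are illustrative, not those of IUST-DeepFuzz:

    # Sketch: learn the structure of complex input files with a small character-level
    # RNN language model, then sample new test data from it. All names are assumptions.
    import numpy as np
    import tensorflow as tf

    text = open("pdf_objects.txt", "rb").read().decode("latin-1")  # hypothetical corpus
    chars = sorted(set(text))
    char2id = {c: i for i, c in enumerate(chars)}

    seq_len = 40
    X = np.array([[char2id[c] for c in text[i:i + seq_len]] for i in range(len(text) - seq_len)])
    y = np.array([char2id[text[i + seq_len]] for i in range(len(text) - seq_len)])

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(len(chars), 64),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(len(chars), activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    model.fit(X, y, epochs=5, batch_size=128)
    # New test files are then sampled character by character from the trained model.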

11 :: Supporting single responsibility through automated extract method refactoring

The responsibility of a method/function is to perform some desired computation and disseminate the results to its caller through various deliverables, including object fields and variables in output instructions. Based on this definition of responsibility, this paper offers a new algorithm to refactor long methods into methods with a single responsibility. We propose a backward slicing algorithm that decomposes a long method into slightly overlapping slices. The slices are computed for each output instruction, representing the outcome of a responsibility delegated to the method; slices whose criteria address the same output variable do not overlap. The slices are then extracted as independent methods, invoked by the original method, provided that certain behavior-preserving conditions are met. The proposed method has been evaluated on the GEMS extract method refactoring benchmark and three real-world projects. On average, our experiments demonstrate at least a 29.6% improvement in precision and a 12.1% improvement in recall in uncovering refactoring opportunities compared to state-of-the-art approaches. Furthermore, our tool improves method-level cohesion metrics by an average of 20% after refactoring. Experimental results confirm the applicability of the proposed approach in extracting methods with a single responsibility.
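
A simplified sketch of the backward slicing step, modeling each statement only by the variables it defines and uses; a real implementation would operate on a program dependence graph with control dependencies, and the example method below is made up:

    # Sketch: compute a backward slice over a straight-line method body, where each
    # statement is a (defined_variables, used_variables) pair. Illustrative only.
    def backward_slice(statements, criterion_index):
        """Collect the statements that the slicing criterion (an output instruction) depends on."""
        needed = set(statements[criterion_index][1])   # variables used by the criterion
        slice_ids = {criterion_index}
        for i in range(criterion_index - 1, -1, -1):
            defs, uses = statements[i]
            if defs & needed:          # this statement defines a variable we still need
                slice_ids.add(i)
                needed |= uses         # its own uses become needed in turn
                needed -= defs
        return sorted(slice_ids)

    # Hypothetical long method: each output instruction yields one responsibility/slice.
    stmts = [
        ({"a"}, set()),   # 0: a = ...
        ({"b"}, {"a"}),   # 1: b = f(a)
        ({"c"}, set()),   # 2: c = ...
        (set(), {"b"}),   # 3: print(b)   <- output instruction of responsibility 1
        (set(), {"c"}),   # 4: return c   <- output instruction of responsibility 2
    ]
    print(backward_slice(stmts, 3))   # -> [0, 1, 3]
    print(backward_slice(stmts, 4))   # -> [2, 4]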

12 :: Multi-type requirements traceability prediction by code data augmentation and fine-tuning MS-CodeBERT

Requirement traceability is a crucial quality factor that highly impacts the software evolution process and maintenance costs. Automated traceability link recovery techniques are required for a reliable and low-cost software development life cycle. Pre-trained language models have shown promising results on many natural language tasks; however, using them for requirement traceability requires large, high-quality traceability datasets and accurate fine-tuning mechanisms. This paper proposes code augmentation and fine-tuning techniques to prepare the MS-CodeBERT pre-trained language model for several types of requirement traceability prediction, including documentation-to-method, issue-to-commit, and issue-to-method links. Three program transformation operations, namely Rename Variable, Swap Operands, and Swap Statements, are designed to generate new, high-quality samples that increase the diversity of the traceability datasets. Two-stage and three-stage fine-tuning mechanisms are proposed to fine-tune the language model for the three types of traceability prediction on the provided datasets. Experiments on 14 Java projects demonstrate a 6.2% to 8.5% improvement in precision, a 2.5% to 5.2% improvement in recall, and a 3.8% to 7.3% improvement in the F1 score of the traceability prediction models compared to the best results from state-of-the-art methods.
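
As a small illustration of the Rename Variable operator, a sketch follows; a real implementation would rename identifiers through a Java parser rather than a regular expression, and the code sample and names below are made up:

    # Sketch: clone a code sample and consistently rename one local variable to create a
    # new, semantically equivalent training sample. Names and the sample are illustrative.
    import re

    def rename_variable(code, old_name, new_name):
        """Rename every whole-word occurrence of a local variable."""
        return re.sub(rf"\b{re.escape(old_name)}\b", new_name, code)

    original = "int total = 0; for (int i = 0; i < n; i++) { total += i; } return total;"
    augmented = rename_variable(original, "total", "sum")
    print(augmented)
    # The augmented sample keeps the original traceability label, increasing dataset
    # diversity before fine-tuning the pre-trained CodeBERT model.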

---

Preprints and works under review

13 :: Measuring and improving software testability at the design level

Context: The quality of software systems is significantly influenced by design testability, an aspect often overlooked during the initial phases of development. Design testability measures the ease with which a system can be tested based on its original design. Unfortunately, the implementation may deviate from this design, resulting in decreased testability. Objective: To address this issue, we present a learning-based approach that aims to estimate the design testability of extended class diagrams derived from production code. This approach employs design metrics to represent design classes within a substantial corpus of Java projects. We also introduce an automated refactoring tool designed to implement dependency injection and factory method patterns, intending to enhance design testability. Method: The methodology involves creating a machine learning model for design testability prediction using a large dataset of design metrics, followed by developing and implementing an automated refactoring tool. The design classes are labeled with testability scores calculated from a mathematical model. After that, a voting regressor model is trained to predict the design testability of any class diagram based on these design metrics. The refactoring tool is applied to various open-source Java projects, and its impact on design testability is assessed. Results: The proposed design testability model demonstrates its effectiveness by satisfactorily predicting design testability, as indicated by a mean squared error of 0.04 and an R2 score of 0.53. The automated refactoring tool has been successfully evaluated on six open-source Java projects, revealing an enhancement in design testability by up to 19.11%. Conclusion: The proposed automated approach offers software developers the means to continuously evaluate and enhance design testability throughout the entire software development life cycle, mitigating the risk of testability issues stemming from design-to-implementation discrepancies.

  • (Related chapters: Ch. 6)

  • Read more at 2023 SSRN
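
Purely to illustrate the dependency-injection refactoring that the tool described above automates (the tool itself targets Java class designs; the classes below are hypothetical Python stand-ins):

    # Sketch: before/after of constructor-based dependency injection. All classes are made up.
    class DatabaseRepository:            # hypothetical production collaborator
        def fetch(self):
            return ["row from the real database"]

    class ReportService:                 # before: the dependency is hard-wired, hard to test
        def __init__(self):
            self.repository = DatabaseRepository()

    class RefactoredReportService:       # after: the dependency is injected
        def __init__(self, repository):
            self.repository = repository

    class FakeRepository:                # hypothetical test double
        def fetch(self):
            return ["canned row for testing"]

    # A test can now exercise the service in isolation from the database.
    service = RefactoredReportService(FakeRepository())
    print(service.repository.fetch())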

14 :: Flipped boosting of automatic test data generation frameworks through a many-objective program transformation approach

Automated test case generation is critical to testing large-scale systems, as it reduces the manual effort of writing test cases, but it does not guarantee good coverage. The coverage of test data also relies on the testability of the code under test. Testability depends on the code design, which can be improved by refactoring. Despite various studies on automated refactoring, the impact of refactoring on testability has remained an open problem. The main challenge is to preserve other quality attributes while refactoring to improve testability. This paper offers a search-based refactoring approach that uses a many-objective genetic algorithm to find the most appropriate sequence of refactorings for improving the testability of a given program. A machine learning model measures testability as an element of the fitness vector of each individual in the genetic population; it predicts the testability of a given class, represented as a vector of the source code metrics affecting testability. Along with testability, seven other quality attributes, including reusability, flexibility, understandability, functionality, extendability, effectiveness, and modularity, are measured and added to the fitness vector. We evaluated our approach on six open-source projects using 18 different refactorings in different numbers and orders. The results demonstrate an average improvement of 29.95% in testability with a quality gain of 20.06%. In addition, the manual precision of the suggested refactoring solutions improves by 19.5% compared to state-of-the-art refactoring tools.

  • (Related chapters: Ch. 7)

  • Read more at 2023 SSRN
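
A hedged sketch of how one individual (a refactoring sequence) might be scored in such a many-objective search; the prediction model, the quality estimators, and the helper names are placeholders rather than the paper's implementation:

    # Sketch: build the fitness vector of a refactoring sequence, combining predicted
    # testability with seven other quality attributes. Helper objects are assumptions.
    def evaluate(refactored_class_metrics, testability_model, quality_estimators):
        """Return the fitness vector: predicted testability plus seven quality attributes
        (reusability, flexibility, understandability, functionality, extendability,
        effectiveness, modularity)."""
        testability = testability_model.predict([refactored_class_metrics])[0]
        qualities = [estimate(refactored_class_metrics) for estimate in quality_estimators]
        return (testability, *qualities)  # handed to the many-objective GA's selection step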

---