Scientia Iranica D (2020) 27(3), 1333-1351

Sharif University of Technology
Scientia Iranica

Transactions D: Computer Science & Engineering and Electrical Engineering
http://scientiairanica.sharif.edu

Constructing automated test oracle for low observable software

M. Valueian¹, N. Attar, H. Haghighi, and M. Vahidi-Asl*

Faculty of Computer Science and Engineering, Shahid Beheshti University, G.C, Tehran, P.O. Box 1983963113, Iran.

Received 27 July 2018; received in revised form 26 January 2019; accepted 10 August 2019

KEYWORDS
Software testing; Test oracle; Machine learning; Artificial neural network; Software observability.

Abstract. The application of machine learning techniques for constructing automated test oracles has been successful in recent years. However, existing machine learning based oracles are characterized by a number of deficiencies when applied to software systems with low observability, such as embedded software, cyber-physical systems, multimedia software programs, and computer games. This paper proposes a new black box approach to construct automated oracles that can be applied to software systems with low observability. The proposed approach employs an Artificial Neural Network algorithm that uses input values and corresponding pass/fail outcomes of the program under test as the training set. To evaluate the performance of the proposed approach, extensive experiments were carried out on several benchmarks. The results manifest the applicability of the proposed approach to software systems with low observability and its higher accuracy than a well-known machine learning based method. This study also assessed the effect of different parameters on the accuracy of the proposed approach.

© 2020 Sharif University of Technology. All rights reserved.

1. Introduction

Due to the considerable size, complexity, distribution, pervasiveness, and criticality of recent software systems, producing faultless software seems to be an unattainable dream. Furthermore, the increasing demand for software programs and competition in the software industry lead to short time to market, which increases the likelihood of shipping faulty software. These facts highlight the significance of quality assurance activities in improving software quality and reliability, among which software testing plays a vital role.

1. Present address: Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.

*. Corresponding author. Tel.: +98 21 29904131
E-mail addresses: [email protected] (M. Valueian); n [email protected] (N. Attar); h [email protected] (H. Haghighi); mo [email protected] (M. Vahidi-Asl)

doi: 10.24200/sci.2019.51494.2219


Software testing is known as a labor-intensive and expensive task. The huge cost, difficulty, and inaccuracy of manual testing have motivated researchers to seek automated approaches, which aim to improve the accuracy and efficiency of the task. As a consequence, there has been a burgeoning emergence of software testing methodologies, techniques, and tools to support testing automation in recent years.

Among different testing activities, test data generation is one of the most effective ones; it aims to create an appropriate subset of input values to determine whether a program produces the intended outputs. The input values should satisfy some testing criteria such that the testers have a dependable estimation of the program's reliability [1]. To reduce the laboriousness, inaccuracy, and intolerable costs of manual test data generation, a significant amount of research has been dedicated to automating the process of test data generation [2,3].



Despite great advances in different software testing domains, e.g., test data generation, one important challenge has been overlooked by academia and industry. The question arises, "who will evaluate the correctness of the outcome/behavior of a software program according to the given test input data?" The mechanism of labeling program outcomes, as pass or fail, subjected to specific input values is referred to as the "test oracle" [4].

Providing an accurate and precise test oracle is the main prerequisite for achieving robust and realistic software testing techniques and tools. This essential aspect of software testing, left as an open problem in many works, is facing serious challenges. Typically, in real-world applications, there might be no common test oracle except human assessment of the pass or fail outcome of a program run [5]. One primary challenge of manual test oracles is the diversity and complexity of software systems and platforms, each of which has various types of input parameters and output data. Moreover, due to demanding business expansion and thanks to advances in communications technologies, software components may be produced simultaneously by numerous developers, perhaps located in far apart places. This makes the manual judgment on the correctness of diverse and complicated components inaccurate, incomplete, and expensive.

The mentioned challenges emphasize the need for automatic test oracles. A typical automatic test oracle contains a model or specification by which the outcome of a Software Under Test (SUT) could be assessed. Figure 1 illustrates the structure of a conventional automated test oracle. As shown in the figure, the test inputs are given to both the SUT and the automated test oracle. Their outputs are compared with an appropriate comparator, which decides whether the outcome of the SUT, subjected to the given test inputs, is correct.

An automated test oracle can be derived from the formal specification of the SUT. Usually, however, there is a lack of a full formal specification of the SUT's features. In these situations, one can use a partial test oracle that can assess the program behavior with respect to a subset of test inputs [4].

Figure 1. The overall view of a conventional test oracle.

One approach to producing a partial oracle is the application of metamorphic testing, i.e., building the test oracle based on known relationships between expected behaviors of the SUT.

Recently, Machine Learning (ML) methods have been employed to build test oracles. A typical ML-based test oracle involves two main steps: choosing an appropriate learning algorithm for constructing a learning model and preparing an appropriate training dataset from the SUT's historical behavior, represented as input-output pairs; this process is illustrated in Figure 2. According to the figure, the constructed model reflects the correct behavior of the SUT, assuming that the model is of 100% precision.

Figure 2. A block diagram of a Machine Learning (ML) based test oracle.

Most of the ML-based approaches model the relationships between a limited number of inputs and the corresponding output values as the training set. The idea behind these approaches is that the ML-based oracles generalize the limited relationships, involved in the training set, to the whole behavior of the SUT according to its input space. In other words, they produce expected results for the remaining inputs based on what they learned in the training phase. In the testing phase, the inputs of a test set are given to both the SUT and the test oracle. The expected results, generated by the oracle, are compared with the actual outputs of the SUT. If the results are similar or close together (based on a predefined threshold), the outcome is labeled passed; otherwise, the execution is considered to be failed. The choice of the ML method may affect the precision of the test oracle.
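As an illustration of this comparison step, the following minimal Python sketch (with a hypothetical threshold and hypothetical output values) labels an execution by comparing the oracle's expected result with the SUT's actual output:

```python
def verdict(expected: float, actual: float, threshold: float = 0.05) -> str:
    # Compare the oracle's expected result with the SUT's actual output;
    # outputs closer than the predefined threshold are labeled "pass".
    return "pass" if abs(expected - actual) <= threshold else "fail"

print(verdict(expected=1.00, actual=1.02))  # pass
print(verdict(expected=1.00, actual=1.40))  # fail
```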

Some ML-based works are based on classifying the software behavior by using input/output pairs [6] and, sometimes, along with execution traces [6-8] as input features of the classifier. None of the mentioned methods is applicable to low observable software systems. The reason is that, in these systems, the expected results and actual outputs are not easily representable. To be more precise, it is not easy to observe and understand the behavior of low observable systems in terms of their outputs, effects on the environment, and other hardware and software components. Sometimes, even if the outputs are observable, their precise encoding or representation, which is a prerequisite for comparison, is difficult. Therefore, in the cases categorized below, comparing the expected and actual results is inaccurate or even impossible:

1. Embedded software testing, in which the software has a direct effect on hardware device(s), which limits the observability of the software [9]. In these situations, assuming that the hardware works correctly, most testers prefer to evaluate embedded software by observing the behavior of the hardware;

2. Multimedia software programs, e.g., image editors, audio and video players, and computer games, where typically there is no scalar (or even structured) output. In these programs, the types of outputs are commonly unstructured or semi-structured [10]. Therefore, encoding these outputs is mainly difficult, inaccurate, or even infeasible, while labeling them with pass/fail is usually feasible;

3. Graphical User Interface (GUI) based programs, in which users are faced with various graphical fields instead of neat values. Unlike the previous case, these programs have structured outputs. However, encoding them to numerical values is a challenging process. In these situations, testers usually label the outputs with pass/fail by taking the opinion of domain experts;

4. Compatibility testing: In this case, the goal is to evaluate the compatibility of the SUT's behavior on different platforms [11]. For example, the compatibility testing of web-based applications involves evaluating the rendered web pages on different browsers. Encoding, representing, and comparing actual results on different platforms are too difficult and labor-intensive, while labeling them with pass/fail is much simpler.

There are also some other situations where the existing methods are not applicable. For instance, consider the cases in which explicit historical data, represented as input-output pairs, are not available, while there are documents that inform us about failure-inducing inputs or scenarios. The failure-inducing inputs refer to a subset of input values that cause the program to behave incorrectly. Similarly, failure-inducing scenarios address those conditions in which the program generates unexpected results. In many industrial software projects, there exist test reports related to previous iterations of the test execution that indicate pass/fail scenarios. These reports involve historical information and, therefore, can be useful for constructing appropriate test oracles, although they do not include input-output pairs. As another example, consider documents produced by end users during the beta testing process of a software product, containing valuable pass/fail scenarios.

As an idea, in all of the mentioned situations, the relationships between inputs and the corresponding pass/fail behaviors (instead of the corresponding outputs) of the program can be modeled. In this way, without using concrete output values, we can obtain a test oracle that predicts the pass/fail behavior of the program for the given test inputs during the testing phase.

Based on the above idea, in a conference paper [12], a learning based approach is employed, which merely requires input values and the corresponding pass/fail outcomes/behaviors as the training set. The training set is used to train a binary classifier that serves as the program's test oracle. Later, during the testing phase, several input parameters for which the corresponding execution outcome is unknown are given to the classifier. The classifier labels the outcome as pass or fail. Figure 3 illustrates an overview of our approach using an Artificial Neural Network (ANN) as the binary classifier.
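The following is a minimal sketch of this idea using scikit-learn's MLPClassifier (the training data here are hypothetical, and the paper's own experiments use MATLAB's neural network toolbox rather than this library):

```python
from sklearn.neural_network import MLPClassifier

# Hypothetical training set: input vectors and pass(0)/fail(1) outcomes.
X_train = [[1, 5], [2, 5], [9, 0], [4, 2], [8, 1], [3, 3]]
y_train = [0, 0, 1, 0, 1, 0]

# The binary classifier itself serves as the test oracle.
oracle = MLPClassifier(hidden_layer_sizes=(7,), max_iter=1000).fit(X_train, y_train)

# Testing phase: only the inputs are needed; no SUT execution or comparator.
print(oracle.predict([[7, 1], [2, 4]]))  # predicted fail/pass labels
```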

Besides solving the mentioned problems, the approach presented in [12] is characterized by the following advantages:

• It is based on black-box testing. This means that one needs neither the program source code nor the design documents of the SUT to generate an automated oracle;

• Regarding the testing phase of the proposed approach, shown in Figure 3, there is no need to execute the SUT to compare its output with the oracle's output; as mentioned earlier, only the input data is required in the testing phase. This is advantageous specifically when there is no access to the SUT or when software execution is a time- and cost-consuming or risky task. These considerations are usually seen in many low observable and safety-critical systems;

• The test oracles presented in [1,13] are approximators rather than classifiers. They assume that each of the output components in the input-output pairs given to the SUT in the training phase is correct with respect to the corresponding input; in other words, the oracle model constructed from the given inputs in the training set reflects the intended behavior of the SUT. However, this assumption introduces a considerable limitation when preparing historical data for the training phase, since those input values inducing unexpected results must be discarded. In contrast, since the proposed approach considers both passing and failure-inducing inputs for model construction, it does not suffer from this limitation.

Figure 3. The process of building and using an automated test oracle based on the proposed approach.

It should be mentioned that this paper is an extended version of [12], and some parts of this paper have already been presented in [12]. In comparison to [12], the extensions in this paper are given below:

• This study conducts various experiments to evaluate the proposed approach in terms of the following parameters: percentage of passing test cases in the training dataset, Code Coverage Percentage (CCP) of the training dataset, training dataset size, and configuration parameters of the ANN. According to the experiments, these parameters have major impacts on the accuracy of the constructed oracle;

• In [12], only benchmarks with integer input values were considered. However, there are many situations where low observable systems (e.g., embedded software and cyber-physical systems) accept signals instead of ordinary input values such as integers. In these situations, binary files play the role of inputs for the embedded software. These kinds of inputs are not suitable for the classifier and make it unreasonably complex with respect to the ANN size. In this paper, this complexity is decreased by reducing the size of the binary files given to the classifier. In order to demonstrate the capability of the proposed approach on programs with input signals, three related programs are added to our benchmarks;

• The literature review is updated according to studies conducted in recent years.

The remaining parts of the paper are organized as follows. Section 2 presents a review of related works and a brief background on neural networks needed to read the paper. Section 3 describes the details of the proposed approach. Section 4 gives the experimental results and analysis. Section 5 includes the conclusion and some directions for future work.

2. Literature review

2.1. Related works
Different approaches have been proposed for automating test oracles based on available software artifacts. Formal oracles use the existing formal specification of the system's behavior, typically based on mathematical logic. Formal specification languages can be roughly categorized into two groups:

1. Model-based specification languages, which include states and operations, where pre-conditions and post-conditions constrain the operations. Each operation may limit an input state by some pre-conditions, while post-conditions define the effects of the operation on the program state. Peters and Parnas proposed an algorithm to generate a test oracle from program documentation [14]. In their approach, the documentation is written in fully formal tabular expressions;

2. State transition systems, which have a graphical syntax including states and transitions between them. Here, states include abstract sets of concrete states of the modeled system. Gargantini and Riccobene applied Abstract State Machines (ASMs) as an oracle model to predict the expected outputs of the SUT [15].

Although formal specification-based oracles are highly accurate and precise, their applicability is limited. Generally, for most software products, there exists no formal specification from which to construct an adequate and complete test oracle. Furthermore, in most situations, it is costly and difficult to write the documentation of an SUT in a formal way.

Implicit oracles are generated using some implicit knowledge to evaluate the behavior of the SUT. Implicit oracles can be used to detect anomalies, such as buffer overflow, segmentation fault, etc., that may cause programs to crash or show execution failure [16-18]. Therefore, there is no need for any formal specification to generate this kind of oracle, and it can be used for almost all programs. For example, Walsh et al. [19] proposed an automated technique to detect some types of layout failures in responsive web pages using the implicit knowledge of common responsive failure types. In this approach, they distinguished between intended and incorrect behaviors of a layout by checking elements' positions relative to each other in different viewport widths. For example, if two elements of a layout always overlap in different viewports, the effect is considered intended. If the elements overlap infrequently, it may indicate a responsive layout failure. According to the empirical study in [19], their approach detected 33 distinct failures in 16 out of 26 real-world web pages.

There are some approaches that use semi-formal or non-formal documents, data sets collected during system executions, properties of the SUT, etc. to produce an oracle model. Carver and Lei [20] proposed a test oracle for message-passing concurrent programs using Labeled Transition Systems (LTSs). The main challenge of using stateful techniques to generate an oracle for these kinds of programs is the state explosion problem. Therefore, Carver and Lei [20] proposed a stateless technique for generating global and local test oracles from LTS specification models. Local oracles are used to test individual threads without testing the system as a whole. However, a global test oracle tests a global relation between the model of the system and its implementation using test inputs generated from a global LTS model of the complete system. Therefore, using local test oracles decreases the number of executed global test sequences.



Metamorphic testing is used in some approaches to produce a partial oracle. This method utilizes metamorphic relations. For instance, if the function f(x) = sin(x) is implemented as a part of the SUT, a metamorphic relation would be sin(π − x) = sin(x), which must hold across multiple executions. The metamorphic relations are not necessarily limited to arithmetic equations. For example, Zhou et al. proposed an approach for testing search engines, e.g., Google and Yahoo!, using metamorphic relations [21]. They built metamorphic relations in terms of the consistency of search results. Discovering metamorphic relations is an important step in constructing a test oracle. Automating this step is the most challenging part of metamorphic testing. Simons has constructed a lazy systematic unit-testing tool called JWALK [22], in which the specification of a system is lazy and eventually learned through the interaction between JWALK and the developer. This might be a convenient tool for extracting metamorphic relations.
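As a small illustration of the sin(π − x) = sin(x) relation mentioned above, a metamorphic check might look like the following sketch, where sut_sin stands in for a hypothetical implementation under test:

```python
import math

def sut_sin(x: float) -> float:
    """Stand-in for the SUT's sine implementation (hypothetical)."""
    return math.sin(x)

def check_metamorphic(x: float, tol: float = 1e-9) -> bool:
    # The relation sin(pi - x) == sin(x) must hold across executions;
    # a violation signals a fault without knowing the expected value.
    return abs(sut_sin(math.pi - x) - sut_sin(x)) <= tol

assert check_metamorphic(0.7)
```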

Goffi proposed a white-box method to evaluate software behavior in his thesis [23]. Moreover, he and his colleagues proposed two approaches to generate test oracles [24-27]. In the first work, the oracle is generated from software redundancy [24,26,27], which is considered a specific application of metamorphic testing [28]. They used the notion of a cross-checking oracle. The idea behind this oracle is that two similar sequences of method calls are supposed to behave equivalently; however, their actual behaviors might differ because of a faulty implementation. Therefore, if an equivalence check of two similar sequences fails, it is concluded that there is a fault in the code. In order to find identical sequences at the level of method calls, they proposed a search-based technique to synthesize sequences of method invocations that are equivalent to a target method within a finite set of execution scenarios [29]. Considering 47 methods of 7 classes taken from the Stack Java Standard Library and the Graphstream library, they automatically synthesized 123 equivalent method sequences out of 141 manually identified sequences; that is, their approach generates 87% of the equivalent method sequences automatically. They improved their previous work by synthesizing more equivalent method calls for relevant components of the Google Guava library [30].

In the second approach, Goffi et al. [25] proposed a technique that automatically creates test oracles for exceptional behaviors from Javadoc comments. This method utilizes natural language processing and run-time instrumentation and is supported by a tool called 'Toradocu'.

Some program behaviors can be automatically evaluated against extracted/detected invariants. Program invariants are properties that must hold during program execution. For example, a loop invariant is a condition that is true at the beginning and end of, and in each iteration of, the loop. Invariants can be included in test oracles, considering that invariant detection is an essential part of this method. Ernst et al. proposed an ML-based technique to detect invariants [31]. They implemented the Daikon tool for this purpose [32].
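For illustration, the following sketch shows a loop invariant checked at the start of every iteration of a simple summation loop (the program and invariant are our own toy example, not one from [31,32]):

```python
def sum_upto(n: int) -> int:
    # Loop invariant, checked at the start of each iteration:
    # total equals the sum of 0..i-1, i.e. i * (i - 1) // 2.
    total, i = 0, 0
    while i <= n:
        assert total == i * (i - 1) // 2, "loop invariant violated"
        total += i
        i += 1
    return total

assert sum_upto(10) == 55
```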

Elyasov et al. [33] proposed a new type of automated test oracle, called Execution Equivalence (EE) invariants. These invariants can be extracted from software logs, which include events and states of the software. They presented a tool called LOPI (LOg-based Pattern Inferencer) for mining EE-invariants. They also compared Daikon with LOPI on some benchmarks. According to the experiments, the effectiveness of EE-invariants is comparable to that of the invariants found by Daikon.

N-version programming can be applied to produce an oracle model, where the software is implemented in different ways by different development teams, but with the same functionality. Each of the software versions may be seen as a test oracle for the others. This solution is expensive because each version must be implemented by a different development team. Furthermore, there is no guarantee that one implementation is fault-free so as to serve as a reference oracle model for the other versions. In order to decrease the cost of N-version programming, Feldt proposed a method to generate multiple software versions automatically using genetic programming [34].

As another solution to the oracle problem, the notion of the decision table was used in [35], specifically for Web Applications (WAs). In this work, a WA is considered to be composed of pages, serving as a test model for both the client and server side. A decision table is a representation of WA behavior. It contains two main parts: the Condition Section (the list of combinational conditions over the inputs) and the Action Section (the list of responses to be produced when the corresponding conditions are true). Table 1 illustrates a decision table template. In this table, the Input Section demonstrates conditions related to Input Variables, Input Actions, and State Before Test, while, in the Output Section, the actions associated with each condition are described by Expected Results, Expected Output Sections, and Expected State After Test. At the end of the execution of each test case, a comparator compares the actual results against the expected values of output variables, output actions, exceptions, and the environment state obtained after the test execution.

Table 1. Decision table template [55].

         Input section                           Output section
Variant  Input      Input    State before  Expected  Expected          Expected state
         variables  actions  test          results   output sections   after test
...      ...        ...      ...           ...       ...               ...

Memon et al. applied automated planning to generate oracle models for GUIs [36]. Their models include two parts: an expected-state generator and a verifier. In order to generate the expected state, they used a formal model of the GUI, which is composed of GUI elements and actions, in which actions are described by preconditions and effects. This model is derived from specifications. Therefore, this approach is feasible if appropriate specifications exist. When a test case runs, the actual values of the properties of an element or elements are known. At this moment, the verifier can compare these values against the expected values to determine if they are equal. Therefore, the verifier is a process that compares the expected state of the GUI with the actual state and returns a verdict of equal or not equal.

Last et al. demonstrated that data mining models could be employed for recovering system requirements and evaluating software outputs [37]. To prove the feasibility of the approach, they applied an Info Fuzzy Network (IFN) as the data mining algorithm.

Zheng et al. proposed a method to construct an oracle model for web search engines [38]. They collected item sets that include queries and search results. Then, they applied the association analysis technique to extract rules from the items. The derived rules play the role of a pseudo test oracle: given new search results, the mentioned approach detects the search results that violate the mined rules and presents them to testers for manual judging.

Singhal et al. employed the decision tree algorithm to create oracle models [39]. They utilized code predicates to recognize input features for constructing a decision tree. They applied this approach to the triangle benchmark program, where the leaves of the tree are labeled with the triangle classes (equilateral triangles, isosceles triangles, scalene triangles, and invalid triangles).

Wang et al. utilized a Support Vector Machine (SVM) as a supervised machine learning algorithm to train an oracle model [40]. They annotated the program code of the SUTs with the Intelligent Test Oracle Library (InTOL) to collect test traces according to procedure calls. They extracted features from each test trace as input for the SVM algorithm and then used the constructed SVM model as the test oracle.

Vanmali et al. proposed an oracle using an ANN to test a new version of software [41]. Their methodology is based on black-box testing. It is assumed that the functions existing in the previous version are still preserved in the new version. Their training set includes input-output mappings from the previous version of the software. They used predefined thresholds to compare the actual result of the new version with the estimated expected output of the ANN.

Zhang et al. proposed an oracle to test SUTs that work as classifiers [42]. For example, the PRIME program, which is used in their evaluations, works as a classifier: it determines whether the input number is a prime number. Therefore, the output of the program is a member of the set {PRIME, NOT PRIME}. They used a probabilistic neural network as an oracle to test these kinds of SUTs. It is worth mentioning that they limited the SUT to classification problems, while the method can be used for different types of software.

Almaghairbe and his colleagues [43] proposed test oracles based on clustering failures. They utilized anomaly detection techniques using the software's input/output pairs. They conducted the experiments based on the assumption that failures tend to be in small clusters. In a subsequent work [44], they extracted dynamic execution traces using Daikon and used them along with the related input/output pairs in order to improve the accuracy of the approach. Moreover, Almaghairbe and Roper applied classification methods to evaluate the behavior of the software [6]. In another work, Lo et al. [7] proposed a method to classify software behavior by extracting iterative patterns from execution traces. In addition, Yilmaz and Porter [8] proposed a hybrid instrumentation approach that uses hardware counters for collecting program spectra to train a decision tree classifier in order to classify the software behavior.

Shahamiri et al. used a single neural network to build oracle models aiming to test a program's decision-making structures [45] and verify logical modules [46]. Both decision rules and logical modules were modeled using the neural network technique.

In [13], an idea was presented to use multiple neural networks for constructing test oracles. The proposed scheme includes an I/O relationship analysis to create the training dataset, train the ANNs, and test the constructed ANN as an automated oracle. In [1], Shahamiri et al. [46] showed how multiple neural networks could improve the performance of a test oracle on more complicated software, with multiple outputs, compared to their previous work. To this end, a separate ANN was constructed for each output. Therefore, the complexity of the SUT was distributed over several ANNs, and the ultimate oracle was composed of these ANNs.



The main difference between the idea in [1] and ours is the approach to composing the elements of the datasets, an essential issue that has a significant impact on the ANN learning process and its precision. In our approach, there is no need to employ multiple neural networks because the software's concrete output values are not included in our datasets. For each input vector, a single value is assigned, which is the passing or failing state of the software after running the SUT with that input vector.

Some of the existing ML-based oracles, such as [1,13], are constructed based on test sets containing a large number of test cases, each of which includes inputs and expected outputs. The oracle learns how to produce the desired outputs. The next step is comparing the desired outputs with the actual outputs, generated by the SUT, according to predefined thresholds. If the difference is less than the predefined threshold, the actual output is considered correct. Otherwise, it is labeled as a failure. In addition, some of the existing approaches, such as [6-8], consider outputs as input features of the ML method to classify the software behavior. Although these proposed oracles can work on specific programs, they suffer from the following weak points:

• These methods cannot be applied to systems with low observability;

• Some works, such as [1,13], that generate actual outputs and compare them with expected results cannot guarantee that the predefined thresholds are appropriate. This limitation highly affects the assessment of the constructed test oracle;

• Most existing approaches assume that the data samples given in the training phase are all correct. Based on this assumption, the constructed oracle only reflects the correct behavior of the SUT. Since they do not consider failing test cases, they have no notion of the failing patterns of the code. Therefore, due to the lack of failing patterns, their models' approximations are likely to be imprecise;

• The mentioned approaches have to execute the SUT in order to obtain the outputs required for evaluating the software behavior; however, the method proposed in this paper does not need to execute the SUT, at least in the testing phase.

The proposed approach overcomes these deficiencies: It applies to systems that have low observability and/or produce unstructured or semi-structured outputs. Furthermore, since we consider the pass/fail outcome/behavior of an SUT rather than its expected output values to train a binary classifier, there is no need for thresholds or comparators. This significantly increases the accuracy of the constructed test oracle. In addition, we do not use outputs of the SUT in order to evaluate it. Therefore, there is no need to execute the software, which is a costly and sometimes risky operation. Moreover, the proposed test oracle is built according to both the correct and incorrect behaviors of the SUT.

2.2. Artificial Neural Network (ANN)
In recent years, a wide variety of ML methods have been proposed that can be used to discover hidden knowledge from data. Among the many categories of industrial-strength ML algorithms, the ANN has attracted much attention over the past few years. According to [47], an ANN is a computational system that consists of elements, called units or nodes, whose functionality is based on biological neurons. As illustrated in Figure 4, each neural cell consists of two important elements:

1. The cell body, which includes the neuron's nucleus. It is worth noting that the computational tasks take place in this element;

2. The axon, which can be seen as a wire that passes activity from one neuron to another.

Furthermore, each neuron receives thousands of signals from other neurons. Eventually, these incoming signals reach the cell body, where they are integrated together. If the resulting signal exceeds a threshold, the neuron will fire and send a signal to the subsequent neurons [47].

Figure 4. A biological neural cell.

The main idea of the ANN algorithm is inspired by biological neural systems. Each unit or node in an ANN is equivalent to a biological neuron. Each node receives inputs from other nodes or external resources. Moreover, each input is associated with a weight, which indicates the importance of that input. The node applies a function to the weighted sum of its inputs. This function is called the 'Activation Function', and its result is the output of the node. The structure of each node in an ANN is illustrated in Figure 5. The purpose of the activation function is to add a non-linearity feature to the node's output. Since most real-world data are non-linear and we intend to learn these non-linear representations, activation functions are applied. Figure 6 illustrates some activation functions that are useful in practice.

Figure 5. Node structure in Artificial Neural Network (ANN).
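A minimal sketch of the node computation just described, using tanh as one common choice of activation function (the weights and inputs are hypothetical):

```python
import math

def node_output(inputs, weights, bias):
    # Weighted sum of the inputs followed by a non-linear activation,
    # as described above; tanh is one commonly used choice.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(z)

print(node_output(inputs=[0.5, -1.0], weights=[0.8, 0.2], bias=0.1))
```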

ANN can solve complex mathematical problems, such as stochastic and nonlinear problems, using simple computational operations. Another feature of the ANN is its self-organizing capability. This feature enables the ANN to use the knowledge of previous solutions in order to solve current problems. The advantages of ANN in comparison with classical mathematical methods are:

1. Parallel and high-speed processing;

2. Learning and adapting to the environment of the problem [48].

Figure 6. Commonly used activation functions.

Therefore, the ANN is used as our classifier in the rest of this paper. In the following subsection, the type of neural network used in the experiments is described.

2.2.1. Feed-forward neural network
The feed-forward neural network is the simplest type of ANN. It contains multiple nodes arranged in different layers. The nodes in each layer are fully connected to the nodes in the next layer. All connections are associated with weights. Figure 7 illustrates an example of a feed-forward neural network. A feed-forward neural network consists of three types of layers:

• Input layer, which includes nodes whose inputs are provided by outside resources;

• Hidden layer, which includes nodes that perform computational tasks on the outputs of the input layer. A feed-forward neural network has a single input and a single output layer, but zero or more hidden layers;

• Output layer, which includes nodes that are responsible for computational tasks and for transferring information from the network to the outside world.

Figure 7. An example of a feed-forward neural network.

It is worth noting that feed-forward neural networks with no hidden layers and those with at least one hidden layer are called single-layer perceptrons and multi-layer perceptrons, respectively.

As mentioned in the previous section, the weights associated with the inputs of a specific node indicate the importance of each input to that node. Therefore, the main challenge of using an ANN is how to calculate the optimal weights of the connections between nodes in order to produce the desired output. The process of assigning optimal weights to the connections is called the training phase of the ANN. In the next subsection, the algorithm utilized in this paper to train the ANN is introduced.

2.2.2. Back-propagation algorithm
Back-propagation is one of several ways to train an ANN. It falls into the category of supervised learning algorithms, meaning that it learns from labeled training data; that is, we know the expected labels for some input data.

Initially, the weights of all edges are randomly assigned. Then, for each input vector in the training dataset, the ANN calculates the corresponding output. This output is compared with the desired label in the training set, and the error is propagated back to the previous layer. The weights are adjusted according to the propagated error. This process is repeated until the error is less than the predefined threshold. When the algorithm terminates, the ANN is trained and ready to produce output labels for new inputs.
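The following numpy sketch illustrates this procedure on a toy dataset: random initial weights, a forward pass, error propagation backward, and weight adjustment, repeated until the error falls below a threshold or an iteration limit is reached (all sizes, data, and hyperparameters are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training set: 2-dimensional inputs with binary labels (hypothetical data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Random initial weights for one hidden layer, as described in the text.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

lr, max_epochs, threshold = 0.5, 5000, 0.05
for epoch in range(max_epochs):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    err = out - y
    if np.mean(np.abs(err)) < threshold:   # stop once the error is small enough
        break
    # Backward pass: propagate the error and adjust the weights.
    delta_out = err * out * (1.0 - out)
    delta_h = (delta_out @ W2.T) * (1.0 - h ** 2)
    W2 -= lr * h.T @ delta_out
    b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_h
    b1 -= lr * delta_h.sum(axis=0)
```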

3. The proposed method

To resolve the deficiencies of the existing ML-based test oracles, a binary classifier is applied to two classes of passing and failing input test data. The constructed oracle models the relationships between the inputs and the corresponding pass/fail outcomes of a given program.

Our training data can be provided from different resources that have already labeled a subset of program inputs as pass/fail according to the program outcome/behavior. For example, these resources could be human oracles, or documents and reports indicating passing/failing scenarios in previous software versions. Since the cost of some of these resources can be non-trivial, an attempt is made to train the model with as small a dataset as possible.

To provide an appropriate dataset, we look at the available training resources. For example, we may build the dataset from both the available regression test suite and human oracles. For the latter case, we run the SUT with some random input data and ask human oracle(s) or domain experts to label the outcomes as pass/fail. Typically, in real-world systems, the number of failing runs is less than the number of passing ones. Nevertheless, in our experiments, various datasets are considered with various ratios of passing and failing test data to analyze the sensitivity of the proposed approach to different ratios.

The proposed approach has three main phases, which are detailed in the remaining parts of this section. The flowchart of the approach is shown in Figure 8.

3.1. Data preparation
The collected dataset includes inputs and the corresponding execution outcomes. Suppose that the SUT has n input parameters. The inputs can be organized as an input vector X, shown in Eq. (1), where feature x_i represents the i-th input parameter of the SUT:

X = \langle x_1, x_2, x_3, \ldots, x_n \rangle \quad (1)

The set of all possible input vectors, represented by T in Eq. (2), can be expressed as the Cartesian product of the input parameter domains D(x_i) (for i = 1, \ldots, n):

T = D(x_1) \times D(x_2) \times \cdots \times D(x_n) \quad (2)

The number of possible input vectors is the product of the domain sizes of the input parameters:

|T| = |D(x_1)| \times |D(x_2)| \times \cdots \times |D(x_n)| = \prod_{i=1}^{n} |D(x_i)| \quad (3)
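As a small worked example of Eq. (3), with hypothetical parameter domains:

```python
from math import prod

# Hypothetical domains for an SUT with n = 3 input parameters.
domains = [range(0, 10), [True, False], {"low", "mid", "high"}]

# |T| is the product of the domain sizes, per Eq. (3).
t_size = prod(len(d) for d in domains)
print(t_size)  # 10 * 2 * 3 = 60
```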

The SUT is executed on each input vector, and the resulting outcomes/behaviors of the execution are labeled with the members of set C:

C = \{\mathrm{Fail}, \mathrm{Pass}\} \quad (4)

The training dataset is defined as a partial function from T to C. For an SUT with n input parameters and m input vectors, the dataset includes two matrices:

1. An m \times n input data matrix, where x_j^i, 1 \le i \le n and 1 \le j \le m, is the value of input parameter i in input vector j, as shown below:

\begin{pmatrix}
x_1^1 & x_1^2 & \cdots & x_1^n \\
x_2^1 & x_2^2 & \cdots & x_2^n \\
\vdots & \vdots & \ddots & \vdots \\
x_m^1 & x_m^2 & \cdots & x_m^n
\end{pmatrix}

2. An m \times 1 outcome vector, where c_k \in \{0, 1\}, 1 \le k \le m; c_k = 1 indicates that the result of the SUT subjected to the input data vector \langle x_k^1, x_k^2, \ldots, x_k^n \rangle is a failure; otherwise, it is a pass. The vector is given as follows:

\begin{pmatrix}
c_1 \\
c_2 \\
\vdots \\
c_m
\end{pmatrix}
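A minimal sketch of these two structures in Python (the input vectors and pass/fail labels are hypothetical):

```python
import numpy as np

# m = 4 input vectors, n = 3 input parameters (hypothetical values).
X = np.array([
    [1, 5, 0],
    [2, 5, 1],
    [9, 0, 1],
    [4, 2, 0],
])                           # the m x n input data matrix

# c_k = 1 marks a failing run, c_k = 0 a passing one.
c = np.array([0, 0, 1, 0])   # the m x 1 outcome vector

assert X.shape[0] == c.shape[0]
```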


Figure 8. The flowchart of the proposed approach.

3.2. Training a binary classifier with the training dataset

We have utilized a multilayer perceptron neural network for generating the oracle model. The output of each layer is fed into the next layer as input. In multilayer perceptron neural networks, the neurons are fully connected, meaning that the output of each neuron is distributed to all of the neurons of the next layer. At the beginning of the training phase, each connection is initialized with a random weight. As mentioned in Section 3.1, we have produced a set of inputs and their corresponding pass/fail labels as the training dataset in order to apply it to this network. A multilayer perceptron ANN has three types of layers:

• Input layer: The inputs of the SUT are mapped to the neurons of the input layer. Therefore, the number of neurons in this layer equals the number of the SUT's inputs;

• Hidden layer: Outputs of the input layer are fully connected to the hidden layer as inputs. We applied 'tan-sigmoid' as the activation function of this layer;

• Output layer: Outputs of the hidden layer are fully connected to the inputs of this layer. We applied 'sigmoid' as the activation function of this layer. The outputs of this layer are considered the outputs of the created oracle model.
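A minimal sketch of such a network in PyTorch, with a tanh ('tan-sigmoid') hidden layer and a sigmoid output layer as described above (the layer sizes are hypothetical; the paper's per-benchmark configurations appear in Table 3, and the actual experiments use MATLAB's toolbox):

```python
import torch.nn as nn

n_inputs, n_hidden = 8, 7   # hypothetical sizes; see Table 3 for the paper's values

oracle = nn.Sequential(
    nn.Linear(n_inputs, n_hidden),  # input layer -> hidden layer
    nn.Tanh(),                      # 'tan-sigmoid' activation on the hidden layer
    nn.Linear(n_hidden, 1),         # hidden layer -> output layer
    nn.Sigmoid(),                   # sigmoid output: a pass/fail score in (0, 1)
)
```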

Page 11: Constructing automated test oracle for low observable softwarescientiairanica.sharif.edu/article_21524_680a254d1a184c7d07bb2cd… · Software testing is known as a labor-intensive

M. Valueian et al./Scientia Iranica, Transactions D: Computer Science & ... 27 (2020) 1333{1351 1343

There are various parameters that determine how much the ANN should be trained. One of the most important parameters is the 'epoch', which is the number of iterations that the training samples pass through the learning algorithm; in other words, the epoch is the number of times that all of the training data are used to update the weights of the ANN. We considered a constant value of 1000 (parameter Max in Figure 8) for the epoch in our experiments. The other parameter is the 'error': training continues until the error of the training phase falls below this parameter value. In our experiments, if the error rate during the training process drops below 1% (the predefined value in Figure 8), the training phase is stopped; otherwise, training continues until the number of iterations reaches the epoch number. As illustrated in Figure 8, in order to reduce the error of the model, the ANN's iterative algorithm gradually changes the weights of the connections between neurons, up to the epoch limit. For this purpose, a random weight is assigned to each connection at the beginning of the training phase. Afterward, the training dataset is applied to the ANN. At the end of each iteration, the algorithm compares the output of the network with the SUT's corresponding pass/fail labels, which exist in the training dataset. If the error in the comparison process is less than the default value, or the number of iterations exceeds the specified threshold, the algorithm stops and the network is considered the oracle model. Otherwise, in order to improve the model, the algorithm uses the back-propagation method and changes the weights of the connections between neurons.
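A training loop with these two stopping criteria might look like the following PyTorch sketch (the model, loss, learning rate, and data handling are our assumptions; the paper itself uses MATLAB's toolbox):

```python
import torch
import torch.nn as nn

def train_oracle(model, X, y, max_epochs=1000, error_threshold=0.01, lr=0.1):
    # Train until the misclassification error drops below 1% (the predefined
    # value in Figure 8) or until the epoch limit Max = 1000 is reached.
    loss_fn = nn.BCELoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(max_epochs):
        opt.zero_grad()
        out = model(X)
        loss_fn(out, y).backward()
        opt.step()
        error = ((out > 0.5).float() != y).float().mean().item()
        if error < error_threshold:
            break
    return model
```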

3.3. Evaluating the accuracy of the constructed model

To assess the accuracy of the constructed model, we have carried out various experiments by giving inputs from the parameters' domains of the SUT that are not included in the training set. To this end, the training dataset is divided into several groups. Each group is selected such that the ratio of pass/fail input data in that group differs from the other groups. This difference is required for the sensitivity analysis of the constructed model. The sensitivity analysis was done such that, in each experiment, the impact of a particular parameter on the accuracy of the oracle was examined while the remaining parameters were unchanged. In addition to the pass/fail ratio of the training dataset, the studied parameters are the CCP of the training dataset, the size of the training dataset, and the configuration parameters of the ANN.

4. Evaluation

In this section, we have evaluated the proposed method on different types of software programs, especially embedded software. In the following, first, the experimental setup is described and, then, the results of the experiments are presented and analyzed.

4.1. Experiment setup
To evaluate the proposed method, this study used the neural network toolbox of MATLAB version R2018a+update3 for building the neural network model [49]. Experiments were conducted on an "Intel(R) Core i7-7500U CPU 2.70 GHz up to 2.90 GHz, 8.0 GB RAM" machine, and the operating system was "Windows 10 Pro 64-bit".

Five benchmarks with different features and characteristics have been considered. For each benchmark, several faulty versions were generated. Three of the five benchmarks, namely DES [50], ITU-T G718.0 [51], and GSAD [52], fall into the category of embedded software. These benchmarks are considered embedded programs according to the definition presented by Lee [53], by which embedded programs may interact with physical devices; they are not necessarily executed on computers and can run on cars, airplanes, telephones, audio devices, robots, etc. The aim of choosing these three benchmarks is to show the applicability of the proposed method to programs with low observability and to programs with unstructured or semi-structured outputs. The two other benchmarks, namely Scan and TCAS [54], have numerical inputs and outputs. These two benchmarks have been chosen to compare the proposed method with a known ML-based method proposed in [1]. From this point forward, we call the method in [1] the baseline method. It is worth noting that the baseline method is not applicable to low observable programs and programs with non-numerical outputs. Table 2 illustrates the features of the selected benchmarks.

Table 2. Features of benchmarks.

Benchmark      Number of inputs   Number of lines   Programming language
Scan           8                  70                Java
TCAS           12                 173               C
DES            2                  330               C++
ITU-T G718.0   1                  356               C and VHDL
GSAD           1                  411               C, Verilog and VHDL


The application of each benchmark is as follows:

• The Scan benchmark is a scheduling program for an operating system;

• The TCAS benchmark, which is an abbreviated form of Traffic alert and Collision Avoidance System, is used for aircraft traffic control to prevent aircraft from midair collisions;

• The DES benchmark, which is an abbreviated form of Data Encryption Standard, is a block cipher (a form of shared secret encryption) that was selected by the National Bureau of Standards as an official Federal Information Processing Standard (FIPS) for the United States in 1976 [50]. It is based on a symmetric-key algorithm that uses a 56-bit key;

• ITU-T is one of the three sectors of the International Telecommunication Union (ITU); it coordinates standards for telecommunications. ITU-T G718.0 is lossless voice signal compression software, which is used to compress the G.711 bitstream. The purpose of the compression is mainly transmission over IP (e.g., VoIP). The input and output of the benchmark are binary files representing the original and compressed voice signals, respectively [51];

• The GSAD benchmark, which is an abbreviated form of Generic Sound Activity Detector, is an independent front-end processing module that can be used to detect whether a transmission voice line is busy or not. In other words, it indicates whether the input frame is a silence frame or an audible noise frame. The input format of this benchmark is also a binary file [52].

In order to evaluate the constructed model and compare the results with a similar approach, the Accuracy criterion is used, which is calculated through Eq. (5), where TP, TN, FP, and FN are True Positive, True Negative, False Positive, and False Negative, respectively:

Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \times 100 \quad (5)
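Eq. (5) translates directly into code; the counts below are hypothetical:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    # Eq. (5): correct verdicts over all verdicts, as a percentage.
    return (tp + tn) / (tp + fp + fn + tn) * 100

print(accuracy(tp=40, tn=45, fp=10, fn=5))  # 85.0 on these hypothetical counts
```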

4.2. Experiment results and discussion
In this section, the impact of different parameters on the accuracy of the constructed oracles is investigated. These parameters include the percentage of passing test cases in the training dataset, the CCP of the training dataset, the size of the training dataset, and the configuration parameters of the ANN. Samples of input data, the related outcomes as the training set, and the results of the algorithm can be accessed at http://ticksoft.sbu.ac.ir/upload/samples.zip.

4.2.1. Percentage of passing test cases in training dataset

The training dataset in our approach contains input values and the corresponding pass/fail outcomes of the SUT, which are used to train the ANN classifier. The number of passing test cases relative to the number of all test cases in the training dataset may affect the accuracy of the classifier during the testing phase. The term 'ratio', defined in Eq. (6), is the parameter under study in this section:

ratio = \frac{\text{No. of passing test cases}}{\text{No. of test cases}} \times 100 \quad (6)

In order to carry out the experiment for each benchmark, different training datasets with different ratios have been generated. Then, for each benchmark, ANN classifiers have been constructed using the existing training datasets. In the testing phase, new test cases are randomly generated to evaluate the accuracy of the classifier as a test oracle. Figure 9 illustrates the accuracy of the test oracle for each benchmark over training datasets with different ratios.

According to the diagrams of Figure 9, the highest accuracy belongs to the training datasets with a ratio of 50%, which means that half of the training dataset carries pass labels. It can be seen that if the portion of passing test cases in the training dataset is larger than that of failing ones, or vice versa, the accuracy decreases. The reason is that when the passing test cases outnumber the failing ones, TP has a high value but TN has a small value in Eq. (5); therefore, the overall accuracy decreases. The same occurs when the failing test cases outnumber the passing ones.

In real-world systems, more test cases are typically labeled as pass than fail. Therefore, in order to show the usefulness of the proposed approach in real-world systems, the rest of the experiments have been carried out using training datasets that contain 90% pass labels.

Figure 9. The accuracy of the constructed oracle over different ratios of the training dataset.


4.2.2. Code Coverage Percentage (CCP) of training dataset

CCP is a measure that shows the percentage of the executed code when a particular set of test data is given to the program. In test data generation, which is one of the main activities in the software testing process, CCP is a criterion for assessing the adequacy of the generated test data. In this paper, CCP is defined as the percentage of visited statements among all statements (see Eq. (7)). Since generating our classifier-based oracle depends heavily on the test input dataset, it is expected that the CCP of the training dataset is an appropriate indicator of the adequacy of the constructed oracle. In this section, the impact of the CCP of the training dataset on the accuracy of the constructed oracles is investigated:

CCP = \frac{\text{No. of visited statements}}{\text{No. of all statements}} \times 100 \quad (7)

In order to do the experiment for each benchmark, different training datasets with different CCPs have been generated. Then, for each benchmark, ANN classifiers are created using the training datasets. In the testing phase, test data are generated randomly to evaluate the accuracy of the classifier as a test oracle. Figure 10 illustrates the accuracy of the test oracle for each benchmark over the training dataset CCP.

It is worth noting that we are not able to obtain an arbitrary CCP for the training dataset, because some statements of the code are always executed; consequently, for each benchmark in Figure 10, the diagrams start at different points. By applying training datasets with CCPs of 20% to 40%, the accuracy of the proposed oracle ranges between 65% and 80%, which is reasonably acceptable.

Nevertheless, as is clear in Figure 10, the higher the CCP of the training dataset, the more accurate the test oracle will be. This is because the training dataset covers more parts of the SUT and, consequently, the oracle can model more parts of the SUT.

Figure 10. The accuracy of the constructed oracle over the Code Coverage Percentage (CCP) of the training dataset.

Therefore, training datasets with the maximum possible CCP for each benchmark are applied in the rest of the experiments. This decision is acceptable because, in the test data generation phase of real-world projects, one tries to generate test data with maximum code coverage.

4.2.3. The size of training dataset
Another factor that has an impact on the accuracy of the test oracle is the training dataset size, i.e., the number of test cases used in the training dataset.

In order to carry out the experiment for each benchmark, we have generated different training datasets of different sizes, but with a fixed ratio and coverage according to the findings in Sections 4.2.1 and 4.2.2. The ratio has been set to 90% for all training datasets, and the CCP has been set to 98.8%, 100%, 100%, 100%, and 100% for benchmarks Scan, TCAS [54], DES [50], ITU-T G718.0 [51], and GSAD [52], respectively. Then, for each benchmark, ANN classifiers are created using the training datasets.

In the testing phase, test cases are generated randomly to evaluate the accuracy of the classifier as a test oracle. Figure 11 illustrates the accuracy of the test oracle for each benchmark over the training dataset sizes. According to the diagrams, the higher the number of test cases in the training dataset, the more accurate the oracle will be, because the weights of the ANN edges are adjusted more precisely. It is worth mentioning that when the size of the training dataset becomes larger than a specific value, the accuracy of the constructed oracle decreases. This is because the model becomes biased toward the training dataset and, therefore, the testing dataset is classified in a wrong way, a phenomenon usually called 'overfitting'.

4.2.4. Configuration parameters of the ANN

The oracles constructed by the proposed approach are ANNs that model the behavior of SUTs. Therefore, the topology of the ANN has a remarkable impact on the accuracy of the constructed oracles. The number of hidden layers and the number of neurons in each layer are the most important parameters of an ANN, and both are studied in our experiments. Choosing larger values for these parameters, together with an appropriate epoch value, may lead to a more accurate ANN; however, the training time increases accordingly. Therefore, it is necessary to select appropriate values for these parameters in order to save time and cost while still achieving the best accuracy. Nevertheless, it is obvious that a more complex SUT requires a larger ANN (in terms of the number of hidden layers and the number of neurons in each layer).

Figure 11. The impact of the size of the training dataset on the accuracy of the constructed oracle for each benchmark: (a) Scan, (b) TCAS, (c) DES, (d) ITU-T G718.0, and (e) GSAD.

By fixing the other parameters based on the results obtained in Sections 4.2.1, 4.2.2, and 4.2.3, experiments with different numbers of hidden layers and different numbers of neurons in each layer have been conducted.

Figure 12(a) and (b) illustrate the accuracy of the constructed oracle for each benchmark over the number of hidden layers and the number of neurons in each layer, respectively. As both figures make clear, when the value of either parameter is low, the accuracy is low, because the ANN is not large enough to model the complexity of the SUT; when the value of either parameter exceeds a specific point, the accuracy approaches a certain value, which is not necessarily the maximum.

Generally, when the number of hidden layers and the number of neurons in each layer increase, the weights of the network edges no longer change significantly within the epoch limit of 1000 (see Section 3.2); therefore, the accuracy decreases. This is the time-accuracy trade-off mentioned above. The configuration of the ANN for each benchmark is given in Table 3.

It is worth mentioning that the input type of the GSAD and ITU-T G718.0 benchmarks is a binary file.

Table 3. Configuration of the Artificial Neural Network (ANN) for each benchmark.

Benchmark   Neurons in      Hidden    Neurons in each    Neurons in
            input layer     layers    hidden layer       output layer
Scan        8               9         7                  1
TCAS        12              13        7                  1
DES         2               5         8                  1
ITU-T       121             4         11                 1
GSAD        145             4         11                 1
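To make the mapping from Table 3 to a concrete network explicit, the sketch below instantiates each benchmark's topology. Again, scikit-learn's MLPClassifier is only a stand-in for the ANN implementation used in the paper, and `max_iter=1000` mirrors the epoch limit mentioned above.

```python
from sklearn.neural_network import MLPClassifier

# (number of hidden layers, neurons per hidden layer) from Table 3;
# input/output layer sizes follow from the data and the binary label.
TOPOLOGIES = {
    "Scan":  (9, 7),
    "TCAS":  (13, 7),
    "DES":   (5, 8),
    "ITU-T": (4, 11),
    "GSAD":  (4, 11),
}

def build_oracle(benchmark):
    layers, neurons = TOPOLOGIES[benchmark]
    # e.g. "Scan" -> hidden_layer_sizes=(7, 7, 7, 7, 7, 7, 7, 7, 7)
    return MLPClassifier(hidden_layer_sizes=(neurons,) * layers,
                         max_iter=1000)
```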


Figure 12. The impact of the Artificial Neural Network (ANN) configuration parameters on the accuracy: (a) the accuracy of the constructed oracle for each benchmark over the number of hidden layers and (b) the accuracy of the constructed oracle for each benchmark over the number of neurons in each hidden layer.

Therefore, each binary character would be considered a single input. In order to reduce the number of neurons in the input layer of the ANN, the binary values are grouped 4 bytes at a time; to this end, the binary file is converted into a set of integer inputs. These integers are fed to the ANN instead of binary characters, which consequently reduces the size of the constructed models.
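A minimal sketch of this preprocessing step is shown below. The paper only states that the values are grouped 4 bytes at a time; the big-endian byte order and the handling of a short final chunk are assumptions made here for illustration.

```python
def binary_file_to_ints(path, word_size=4):
    """Group the raw bytes of a binary input file into word_size-byte
    chunks and interpret each chunk as one unsigned integer, so the
    ANN needs one input neuron per chunk instead of one per byte."""
    with open(path, "rb") as f:
        data = f.read()
    return [int.from_bytes(data[i:i + word_size], "big")  # byte order assumed
            for i in range(0, len(data), word_size)]
```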

4.2.5. Comparison with the baseline method

This section compares the accuracy of the oracles generated by the proposed approach with that of oracles constructed by a well-known approximation-based method suggested by Shahamiri et al. [1], which generates oracles by modeling the relationship between program inputs and output values using neural networks. Such oracles are only appropriate for particular types of software that produce concrete numerical outputs.

To compare the two types of oracles, experiments were conducted on the Scan and TCAS benchmarks. The accuracy of the two types of oracles was compared over different sizes of the training dataset. Figure 13 illustrates the effect of the dataset size on the accuracy of the two oracle types.

Figure 13. The comparison between our approach and the method of Shahamiri et al. [1]: (a) the accuracy for the Scan benchmark and (b) the accuracy for the TCAS benchmark.

In this figure, the gray lines show the accuracy of our oracles when we have the best training dataset (in terms of the pass ratio and CCP), and the blue lines show the accuracy of our oracles when the training dataset is the same as the one used in [1]. The blue line in Figure 13(a) shows the accuracy of oracles constructed by our approach when the CCP and the pass ratio of the training dataset are 61% and 90%, respectively; the gray line corresponds to a CCP of 98.8% and a pass ratio of 50%. The blue line in Figure 13(b) shows the accuracy of our oracles when the CCP and the pass ratio of the training dataset are 50% and 90%, respectively; the gray line corresponds to a CCP of 100% and a pass ratio of 50%.

The drop in the accuracy of our oracles beyond a specific point is due to the overfitting problem. Nevertheless, according to Figure 13, the accuracy of the oracles generated by the method in [1] is generally lower than that of ours. In addition, when the size of the training dataset is zero (i.e., without any training phase), the proposed approach behaves randomly with an accuracy of about 50%, because there are two classes, pass and fail. In contrast, the accuracy of the method in [1] is zero, since it is an approximator rather than a classifier and therefore cannot produce any result, even a random one.

5. Conclusion and future work

Building automated oracles is challenging for embedded software that has low observability and/or produces unstructured or semi-structured outputs. In this paper, a classifier-based method using Artificial Neural Networks (ANNs) was proposed to address this issue. In the proposed approach, oracles need only input data tagged with the two labels "pass" and "fail" rather than outputs or execution traces. Moreover, unlike oracles such as those in [1,13], no comparison between expected results and actual outputs is required in our approach. Therefore, it can be applied to a wide range of software systems, including embedded software. The experimental results on five benchmarks, three of which are categorized as embedded software systems, demonstrate the capability of the proposed approach in constructing accurate oracles for such systems.

In addition, the experiments yielded the following results:

• When the percentage of pass labels in the training dataset is 50%, the accuracy of the constructed oracle reaches its highest value in comparison with other percentages. Since obtaining training datasets with a pass ratio of 50% is unrealistic for real-world systems, this study considered a pass ratio of 90%. It is worth noting that, even in this situation, the constructed oracles showed acceptable accuracy;

• When the Code Coverage Percentage (CCP) of the training dataset is high, a more accurate test oracle is generated, because the training dataset covers more parts of the Software Under Test (SUT). Fortunately, in the test data generation phase of real-world projects, testers try to generate test data with maximum code coverage; therefore, the prospect of achieving acceptable accuracy in real situations is promising;

• When the size of the training dataset increases, the accuracy of the constructed oracle also increases, because the weights of the edges in the ANN are adjusted more accurately with a larger dataset. However, when the size exceeds a specific value, the accuracy decreases because the model becomes biased toward the training dataset; in other words, overfitting occurs;

• The configuration parameters of the ANN depend on the code complexity of the SUT. The larger the ANN, the longer the time needed for its convergence. Thus, selecting the parameters is a trade-off between time and the accuracy of the oracle. This study used trial and error to find appropriate values of the ANN parameters for each benchmark program.

The proposed approach is based on black-box testing. Therefore, it requires neither the program source code, nor execution traces, nor design documents of the SUT to generate an automated oracle. In addition, unlike the oracles discussed in Section 2, there is no need to execute the SUT in order to obtain its outputs, which is advantageous, especially in testing embedded software and safety-critical applications. Moreover, unlike the majority of machine learning-based oracles, which do not consider the failing patterns of the code during model construction, the proposed approach considers both passing and failure-inducing inputs when building the model. In this way, the proposed model reflects the whole behavior of the SUT and, thus, shows greater accuracy. Finally, the experimental comparison between our approach and the machine learning-based oracle proposed in [1] revealed that our approach is at least as good as that of [1] in common cases, while also being applicable to cases that the method in [1] cannot handle.

In the proposed black-box approach, it was assumed that there is no access to the program code. Given access to the source code, more robust oracles could be constructed. For future work, we plan to study the impact of source code analysis on the accuracy of the oracle. We also intend to investigate the effect of different machine learning techniques on the precision of the constructed oracle. As another line of future work, we will consider different code complexity metrics to examine the effect of code complexity on the topology of the ANN.

References

1. Shahamiri, S.R., Wan-Kadir, W.M., Ibrahim, S., and Hashim, S.Z.M. "Artificial neural networks as multi-networks automated test oracle", Automated Software Engineering, 19(3), pp. 303–334 (2012).

2. Valizadeh, M., Tadayon, M., and Bagheri, A. "Making problem: A new approach to reachability assurance in digraphs", Scientia Iranica, 25(3), pp. 1441–1455 (2018).

3. Rezaee, A. and Zamani, B. "A novel approach to automatic model-based test case generation", Scientia Iranica, 24(6), pp. 3132–3147 (2017).

4. Barr, E.T., Harman, M., McMinn, P., Shahbaz, M., and Yoo, S. "The oracle problem in software testing: A survey", IEEE Transactions on Software Engineering, 41(5), pp. 507–525 (2015).

5. Ammann, P. and Offutt, J., Introduction to Software Testing, Cambridge University Press (2016).

6. Almaghairbe, R. and Roper, M. "Automatically classifying test results by semi-supervised learning", In 27th IEEE Int. Symp. on Software Reliability Engineering, pp. 116–126 (2016).

7. Lo, D., Cheng, H., Han, J., Khoo, S.-C., and Sun, C. "Classification of software behaviors for failure detection: a discriminative pattern mining approach", In 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 557–566 (2009).

8. Yilmaz, C. and Porter, A. "Combining hardware and software instrumentation to classify program executions", In 18th ACM SIGSOFT Int. Symp. on Foundations of Software Engineering, pp. 67–76 (2010).

9. Freedman, R.S. "Testability of software components", IEEE Transactions on Software Engineering, 17(6), pp. 553–564 (1991).

10. Vliegendhart, R., Dolstra, E., and Pouwelse, J. "Crowdsourced user interface testing for multimedia applications", In ACM Multimedia 2012 Workshop on Crowdsourcing for Multimedia, pp. 21–22 (2012).

11. Jan, S.R., Shah, S.T.U., Johar, Z.U., Shah, Y., and Khan, F. "An innovative approach to investigate various software testing techniques and strategies", International Journal of Scientific Research in Science, Engineering and Technology, 2(2), pp. 2395–1990 (2016).

12. Gholami, F., Attar, N., Haghighi, H., Vahidi-Asl, M., Valueian, M., and Mohamadyari, S. "A classifier-based test oracle for embedded software", In 2018 IEEE Real-Time and Embedded Systems and Technologies, pp. 104–111 (2018).

13. Shahamiri, S.R., Kadir, W.M.N.W., Ibrahim, S., and Hashim, S.Z.M. "An automated framework for software test oracle", Information and Software Technology, 53(7), pp. 774–788 (2011).

14. Peters, D.K. and Parnas, D.L. "Using test oracles generated from program documentation", IEEE Transactions on Software Engineering, 24(3), pp. 161–173 (1998).

15. Gargantini, A. and Riccobene, E. "ASM-based testing: Coverage criteria and automatic test sequence generation", Journal of Universal Computer Science, 7(11), pp. 1050–1067 (2001).

16. Cadar, C., Ganesh, V., Pawlowski, P.M., Dill, D.L., and Engler, D.R. "EXE: automatically generating inputs of death", ACM Transactions on Information and System Security, 12(2), p. 10 (2008).

17. Shrestha, K. and Rutherford, M.J. "An empirical evaluation of assertions as oracles", In 4th IEEE Int. Conf. on Software Testing, Verification and Validation, pp. 110–119 (2011).

18. Ricca, F. and Tonella, P. "Detecting anomaly and failure in web applications", IEEE MultiMedia, 13(2), pp. 44–51 (2006).

19. Walsh, T.A., Kapfhammer, G.M., and McMinn, P. "Automated layout failure detection for responsive web pages without an explicit oracle", In 26th ACM SIGSOFT Int. Symp. on Software Testing and Analysis, pp. 192–202 (2017).

20. Carver, R. and Lei, Y. "Stateless techniques for generating global and local test oracles for message-passing concurrent programs", Journal of Systems and Software, 136, pp. 237–265 (2018).

21. Zhou, Z.Q., Zhang, S., Hagenbuchner, M., Tse, T., Kuo, F.C., and Chen, T.Y. "Automated functional testing of online search services", Software Testing, Verification and Reliability, 22(4), pp. 221–243 (2012).

22. Simons, A.J. "JWalk: a tool for lazy, systematic testing of Java classes by design introspection and user interaction", Automated Software Engineering, 14(4), pp. 369–418 (2007).

23. Goffi, A. "Automating test oracles generation", Ph.D. Thesis, Università della Svizzera italiana (2018).

24. Carzaniga, A., Goffi, A., Gorla, A., Mattavelli, A., and Pezzè, M. "Cross-checking oracles from intrinsic software redundancy", In 36th Int. Conf. on Software Engineering, pp. 931–942 (2014).

25. Goffi, A., Gorla, A., Ernst, M.D., and Pezzè, M. "Automatic generation of oracles for exceptional behaviors", In 25th Int. Symp. on Software Testing and Analysis, pp. 213–224 (2016).

26. Pezzè, M. "Towards cost-effective oracles", In 10th IEEE/ACM Int. Workshop on Automation of Software Test, pp. 1–2 (2015).

27. Goffi, A. "Automatic generation of cost-effective test oracles", In 36th ACM Int. Conf. on Software Engineering, pp. 678–681 (2014).

28. Segura, S., Fraser, G., Sanchez, A.B., and Ruiz-Cortés, A. "A survey on metamorphic testing", IEEE Transactions on Software Engineering, 42(9), pp. 805–824 (2016).

29. Goffi, A., Gorla, A., Mattavelli, A., Pezzè, M., and Tonella, P. "Search-based synthesis of equivalent method sequences", In 22nd ACM SIGSOFT Int. Symp. on Foundations of Software Engineering, pp. 366–376 (2014).

30. Mattavelli, A., Goffi, A., and Gorla, A. "Synthesis of equivalent method calls in Guava", In Int. Symp. on Search Based Software Engineering, pp. 248–254 (2015).


31. Ernst, M.D., Cockrell, J., Griswold, W.G., and Notkin, D. "Dynamically discovering likely program invariants to support program evolution", IEEE Transactions on Software Engineering, 27(2), pp. 99–123 (2001).

32. Ernst, M.D., Perkins, J.H., Guo, P.J., McCamant, S., Pacheco, C., Tschantz, M.S., and Xiao, C. "The Daikon system for dynamic detection of likely invariants", Science of Computer Programming, 69(1–3), pp. 35–45 (2007).

33. Elyasov, A., Prasetya, W., Hage, J., Rueda, U., Vos, T., and Condori-Fernández, N. "AB = BA: execution equivalence as a new type of testing oracle", In 30th Annual ACM Symp. on Applied Computing, pp. 1559–1566 (2015).

34. Feldt, R. "Generating diverse software versions with genetic programming: an experimental study", IEE Proceedings-Software, 145(6), pp. 228–236 (1998).

35. Di Lucca, G.A., Fasolino, A.R., Faralli, F., and De Carlini, U. "Testing web applications", In Int. IEEE Conf. on Software Maintenance, pp. 310–319 (2002).

36. Memon, A.M., Pollack, M.E., and Soffa, M.L. "Automated test oracles for GUIs", In 8th ACM SIGSOFT Int. Symp. on Foundations of Software Engineering: Twenty-First Century Applications, pp. 30–39 (2000).

37. Last, M., Friedman, M., and Kandel, A. "Using data mining for automated software testing", International Journal of Software Engineering and Knowledge Engineering, 14(4), pp. 369–393 (2004).

38. Zheng, W., Ma, H., Lyu, M.R., Xie, T., and King, I. "Mining test oracles of web search engines", In 26th IEEE/ACM Int. Conf. on Automated Software Engineering, pp. 408–411 (2011).

39. Vineeta, Singhal, A., and Bansal, A. "Generation of test oracles using neural network and decision tree model", In 5th Int. Conf. - Confluence: The Next Generation Information Technology Summit, pp. 313–318 (2014).

40. Wang, F., Yao, L.-W., and Wu, J.-H. "Intelligent test oracle construction for reactive systems without explicit specifications", In 9th IEEE Int. Conf. on Dependable, Autonomic and Secure Computing, pp. 89–96 (2011).

41. Vanmali, M., Last, M., and Kandel, A. "Using a neural network in the software testing process", International Journal of Intelligent Systems, 17(1), pp. 45–62 (2002).

42. Zhang, R., Wang, Y.-W., and Zhang, M.-Z. "Automatic test oracle based on probabilistic neural networks", In Recent Developments in Intelligent Computing, Communication and Devices, pp. 437–445 (2019).

43. Almaghairbe, R. and Roper, M. "Building test oracles by clustering failures", In 10th IEEE Int. Workshop on Automation of Software Test, pp. 3–7 (2015).

44. Almaghairbe, R. and Roper, M. "Separating passing and failing test executions by clustering anomalies", Software Quality Journal, 25(3), pp. 803–840 (2017).

45. Shahamiri, S.R., Kadir, W.M.N.W., and bin Ibrahim, S. "An automated oracle approach to test decision-making structures", In 3rd IEEE Int. Conf. on Computer Science and Information Technology, pp. 30–34 (2010).

46. Shahamiri, S.R., Kadir, W.M.W., and Ibrahim, S. "A single-network ANN-based oracle to verify logical software modules", In 2nd IEEE Int. Conf. on Software Technology and Engineering, pp. 272–276 (2010).

47. Gurney, K., An Introduction to Neural Networks, CRC Press (2014).

48. Graupe, D., Principles of Artificial Neural Networks, World Scientific (2013).

49. "MathWorks", https://www.mathworks.com/ (2018).

50. Data Encryption Standard, Federal Information Processing Standards Publication, National Bureau of Standards, US Department of Commerce, 4 (1977).

51. "ITU-T technical paper, Telecommunication Standardization Sector of ITU", Technical report, Transmission Systems and Media, Digital Systems and Networks, Digital Sections and Digital Line System (2010).

52. "Generic sound activity detector (GSAD)", https://www.itu.int/pub/T-REC-USB-2020.

53. Lee, E., Embedded Software, University of California at Berkeley, Berkeley (2001).

54. Livadas, C., Lygeros, J., and Lynch, N.A. "High-level modeling and analysis of TCAS", In 20th IEEE Real-Time Systems Symp., pp. 115–125 (1999).

55. Shahamiri, S.R., Kadir, W.M.N.W., and Mohd-Hashim, S.Z. "A comparative study on automated software test oracle methods", In 4th Int. Conf. on Software Engineering Advances, pp. 140–145 (2009).

Biographies

Meysam Valueian is a PhD student at the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran. He received his BS and MS degrees from the Faculty of Computer Science and Engineering at Shahid Beheshti University, Tehran, Iran. Software testing and repair are among his research interests.

Niousha Attar received her BS and MS degrees from the Faculty of Computer Science and Engineering at Shahid Beheshti University, Tehran, Iran, where she is currently a PhD student. Her research interests are complex networks and software testing.

Hassan Haghighi received his PhD degree in Computer Engineering-Software from Sharif University of Technology, Tehran, Iran, in 2009, and is currently an Associate Professor in the Faculty of Computer Science and Engineering at Shahid Beheshti University, Tehran, Iran. His main research interests include formal methods in the software development life cycle, software testing, and software architecture.

Mojtaba Vahidi-Asl is an Assistant Professor in the Faculty of Computer Science and Engineering at Shahid Beheshti University, Tehran, Iran. He received his BS in Computer Engineering from Amirkabir University of Technology and his MS and PhD degrees in Software Engineering from Iran University of Science and Technology. His research areas include software testing and debugging and human-computer interaction.