Rapid Apperception
Machine-Assisted Semantic Understanding of Code
Francis V. Adkins Luke T. Jones
[email protected] [email protected]
Northeastern University
Project Report
prepared for
CSfC INSuRE
24 April, 2015
ABSTRACT
One of the greatest challenges facing source code auditors today is the sheer effort required to understand an unknown code base. However, despite significant academic research and multiple commercial products, relatively little work has been done to assess the usability of code understanding tools. In this research, we return to the basics and address the fundamental question of whether such tools truly can help an auditor. To assess this, we construct a simple grep-like utility to cluster keywords under a given semantic tag. We then use this tool to dissect several vulnerable specimens and compare auditor performance both with and without the tool. In this paper, we highlight our findings and discuss the differences in methodology both using and not using such tools. Ultimately, we conclude that, despite existing issues, code understanding tools can have merit, provided that usability takes precedence over technical novelty.
INTRODUCTION
Source code auditing has long been acknowledged as an invaluable part of the software
development lifecycle, particularly as it relates to the practice of defensive programming. The
key hurdle for auditors, however, has always been the necessity to understand large and
unknown code bases in a limited amount of time. Any programmer who has been newly assigned
to an existing project can attest to the seemingly insurmountable learning curve associated with
this process.
To address this issue, a large body of research and several commercial products have been directed toward helping the security auditor. However, despite the complexity of these tools,
practical surveys have found that auditors seldom use them in production. In fact, this
complexity often dissuades auditors from adopting them for everyday use and forces their
existence into obscurity.
Ultimately, this continuing lack of well-known or applicable tools has allowed the auditors' issues to persist. As a real-world ramification of this, the open-source movement has apparently failed to garner the support that it needs. One of the touted strengths of the open-source mentality is that anyone can read the code, and therefore anyone can audit it. However, recent experience has shown that, just because someone can audit the code does not mean that they will. The year 2014 was popularized as the year of open-source vulnerabilities, with Heartbleed, Shellshock, POODLE, and a number of others. These vulnerabilities lay in the open-source backbone of our modern infrastructure and affected millions of users, yet took years to discover. The subsequent media focus on these events brought the source code auditing process to the attention of the general public and has spurred both greater demand for audits and greater accountability for their success.
Therefore, it seems that there is high demand for source code auditing but a fundamental
disconnect between the users and any existing tools that could aid in their work. This raises the question of what purpose these tools are meant to serve and what can be done to improve them.
As a general trend, each new attempt in the academic domain to address program
comprehension has sought to surmount some technical challenge. Generally, such approaches
proclaim greater extraction of relevant features and improved higher-level design recovery.
However, this technical success does not necessarily beget an improved outcome for the user,
particularly if there are no users of the approach. Therefore, in this research, we return to the root question of the matter and seek to identify whether source code understanding tools truly can help an
auditor.
It is worth noting at this point that we confine our criticisms to those tools that are
directly intended to aid in source code understanding. That is to say that we place the simpler code isolation and navigation tools, such as grep and ctags, in an alternate category. These
tools may be said to increase overall understanding, but ultimately contribute only minute
portions of the overall picture. Instead, our critiques address the more robust understanding tools
that are intended to draw overarching correlations or produce some high-level functionality or
architecture summary. It is these tools that we question and whose validity we seek to test.
To answer this question, we have implemented our own simple automated source code
understanding tool. This tool draws very minor correlations across a code base and presents this
information to the user. We then set out to analyze unknown code bases both with and without
the tool and gathered comparative information on the relative effectiveness of both approaches.
LITERATURE REVIEW
Machine-assisted semantic understanding of code is a subset of program comprehension
which deals with automated processes designed to augment an auditor’s knowledge of a
codebase in useful ways. In this section we examine both the free and commercial tools that
already exist which attempt to assist generic analysts in source code understanding and
vulnerability analysis (the goal of many audits), the current research thrusts into the realm of
machine-assisted semantic understanding of code, and the academic standards of measuring code
understanding. However, we find that both tools and research papers lack the focus we are
looking for: a simple auditorcentric algorithm implemented and evaluated for effectiveness.
After our literature review, we discovered a startling gap between axiomatically practical and effective tools, which are quite often simple, and their much more complex and novel counterparts in the research domain. Our research aims to fill in the gap by providing a step up from a simple tool used every day by most developers, grep, and demonstrating empirical evidence of its
effectiveness.
First, we dive into the domain of the pro bono toolsmith, examining some free tools
which attempt to help analysts understand source code. The first of the free tools we examine is
Source Navigator NG, which enables editing code, jumping to declarations or implementations of
functions, variables and macros, and displaying both relationships between classes, functions and
members, and call trees [Source]. The second, GNU GLOBAL, is less like an integrated
development environment (IDE) than Source Navigator NG, and is instead a self-labeled “source
tagging system” that allows the user to quickly locate functions, macros, structs, classes, etc.,
independent of any specific editor [GNU]. This functionality differs from RA in that RA remains
agnostic concerning the semantics of the tagged language constructs, whereas GNU GLOBAL
provides strong hints about the semantics of functions. Lastly, CScout is a refactoring browser
that enables identifier changes, static call graph construction, and querying for files, identifiers
and functions based on properties, metrics and many other attributes [Spinellis]. These three free
tools, selected from among many, are comparable to IDEs with some extra bells and whistles. The applicability of IDEs and IDE-like programs to source code browsing is axiomatically effective, since IDEs are used to create software, and the history of their usage supports this assumption. But how much do extra features such as static call graph reconstruction or automatic inference of class relationships help in auditing? Free tools implement simple algorithms and solutions, assume that their effectiveness is apparent, and do not bother to gather empirical evidence. Indeed, in some cases, such as the IDE-like tools, evidence of efficacy would be extraneous; however, many simple techniques like static call graph reconstruction would benefit from academic rigor in testing and evaluation.
In the commercial world, a much sparser biome than the free and open-source world,
products such as Imagix 4D are more geared towards the software auditor or reverse engineer
with automated analysis of control flow and dependencies, and visualizations of source code
aimed at improving program comprehension [Imagix]. SciTools Understand is a proprietary IDE
that provides a wide variety of information about code including dependency analysis, call
graphing, code standards testing, and a variety of metrics [Scitools]. It is directly aimed at
helping an analyst understand code more quickly and is therefore very closely aligned with our
own goals. The commercial world of tools, then, is very similar to equivalent free tools, except
the average level of sophistication is higher. Companies present what their tools can do, but they
do not present rigorous academic testing on whether these tools really help or not. Just as with
free and open-source tools, commercial tools stand to gain much from the empirical verification
of their effectiveness.
Next, we dive into the domain of the researcher, examining a range of papers, from
feature location to automatic summarization. Dit et al. created a taxonomy and survey of 89
articles from 25 venues on feature location in source code [Dit], a technique that is applicable to
software reverse engineering, auditing and maintenance, but pursues increased code
understanding very differently from RA. Research in the niche field of feature extraction assumes
that its methods are useful because the latest algorithms are better and faster than their
predecessors. Feature extraction is not the only field in program comprehension that makes this
assumption. De Lucia et al. compared the results of automatic information retrieval (IR) methods for
software artifact labeling to manual methods for labeling and found that simpler IR methods
work better in some cases [De Lucia]. Their work serves as inspiration for RA’s utility testing
because it indicates that complex methodology does not necessarily mean better results. The
question we must ask is: can auditors more quickly understand code assisted by our tool or not?
Ning et al. created Cobol System Renovation Environment (Cobol/SRE), a tool for reusable
component recovery [Jim]. Cobol/SRE aims more at software developers understanding legacy
systems so that useful components can be extracted, although auditors or reverse engineers could
use it as well. Lastly, Moreno et al. created JSummarizer, an Eclipse plugin that automatically
generates natural language summaries of Java classes [Moreno]. We consider this to be one of
the most relevant research results for a software auditor trying to quickly understand code,
though it does not shed light on implementation specifics, so it would not help an
auditor find possible security vulnerabilities. RA fills this gap by providing semantic assistance,
not at the class level, but largely at the function call level. This provides enough granularity to
conduct security audits. Research papers tended to focus on incremental improvements in
algorithms instead of improvements in code auditors' abilities due to better algorithms. Obviously, this by no means invalidates or minimizes this research; however, just as in the world of tools, the world of research into machine-assisted semantic understanding of code could be vastly improved by a thoughtful approach to testing the practical efficacy of esoteric program
comprehension algorithms. In fact, one of the most important papers we reviewed was by Maalej
et al. called “On the Comprehension of Program Comprehension” in which they report
observations of 28 developers in industry and their processes in comprehending new software.
The researchers found that comprehension tools (in the traditional sense) are almost unknown in
industry and have had little-to-no impact in practice. The developers instead opted for more basic
strategies such as GUI tinkering, debug prints, and simply talking to the original developers.
They posit that existing tools can tend to be too esoteric and a simplified approach may have
greater impact [Maalej]. If existing tools are too esoteric, certainly many frontline research
techniques are downright obscure. RA intends to be one of the first research techniques that conceptually extends a familiar and intuitive base, the Unix grep utility, and backs the tool with quantitative and qualitative analysis of its effectiveness for the auditor.
Lastly, we examined previous research that evaluated the tools that test subjects used to
understand code. Most methods erect a framework and compare tools without consideration for
measuring the effect on the user’s understanding. One such method compares tools based on
“data structures”, “visualizations”, “information requesting” and “navigation features”
[Koskinen]. Another opts for evaluation based on “context”, “intent”, “users”, “input”,
“technique”, “output”, “implementation” and “tool” [Guéhéneuc]. These methods seem
reasonable for comparing tools, but not for finding the ground truth about their effectiveness.
The research on evaluating a programmer’s understanding of code is sparse because such a task
is necessarily ambiguous and hard to measure absolutely. In some sense, measuring human
understanding is more the realm of psychology than computer science. However, we take cues
from von Mayrhauser’s work in [Von Mayrhauser] and use anecdotal evidence to evaluate
human understanding. Taking it a step further, we decided to quantify our understanding by
timing our performance on test samples with and without using RA. In this way, we extend the
current standard of academic rigor for empirical testing of human understanding and apply it to
our tool.
PROBLEM STATEMENT
In this age of digital reliance, software security is more important than ever before. With
large portions of the internet's infrastructure based on open-source code, any optimization to the
code auditor’s workflow is eminently useful. However, modern tools are frequently discarded
due to their overwhelming complexity and poor usability. In this research, we have returned to
the root of the problem and questioned if code understanding tools have any merit whatsoever.
To do this, we have created a tool of our own that is designed to address criticisms against
existing approaches by being both simple and highly usable. We then evaluated our use of this
tool and gathered both quantitative and qualitative descriptions of the process. By comparing
these metrics, we can then provide support either for or against pursuing such tools to a greater
degree. Some assumptions are made during the evaluation process, and these are discussed in greater detail in the following section.
METHODS AND PROCEDURES
To evaluate the efficacy of source code understanding tools, we first addressed the
criticisms against existing approaches. Namely, existing tools are often considered too complex
and obscure for real-world use. However, simpler utilities such as grep are used with resounding
frequency and to great success. Therefore, we sought to bridge this gap by creating a source
understanding tool of our own based on the underlying principles of grep. This tool operates on
the concept of pairing a semantic tag to relevant language keywords and then locating all
instances of these keywords within a code base. This effectively allows us to locate and visualize
all portions of the code that deal with some specific functionality, such as user input/output, database interactions, or many others. For the purposes of this research, this tool has been
dubbed Rapid Apperception, or RA.
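To make the idea concrete, the following is a minimal sketch, in Python (the language of RA's tagging engine), of the tag-to-keyword pairing and grep-like lookup described above. It is not RA itself: the tag names, keyword lists, and the ".java" filter are illustrative assumptions, and the real tool stores its pairings in a database rather than a hard-coded dictionary.

```python
# Minimal sketch of the tag-to-keyword concept behind RA (illustrative only).
# Tag names, keywords, and the file-extension filter are assumptions.
import os

TAGS = {
    "user_input": ["HttpServletRequest", "getParameter", "readLine"],
    "file_io": ["FileInputStream", "FileOutputStream", "FileReader"],
}

def scan(root, tags=TAGS, ext=".java"):
    """Group every keyword hit in *root* under its semantic tag, grep-style."""
    hits = {tag: [] for tag in tags}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(ext):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    for tag, keywords in tags.items():
                        if any(kw in line for kw in keywords):
                            hits[tag].append((path, lineno, line.rstrip()))
    return hits

if __name__ == "__main__":
    for tag, locations in scan("path/to/unknown/codebase").items():
        print(f"== {tag}: {len(locations)} hits ==")
        for path, lineno, line in locations[:10]:  # first few hits per tag
            print(f"  {path}:{lineno}: {line}")
```

Presenting hits grouped by tag, rather than by search string as grep does, is what lets the auditor see every part of the code base that touches a given concern at once.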
The intent behind creating such a tool is to provide the minimum amount of functionality
necessary to meet the definition of “code understanding”. For our purposes, this definition
necessitates the ability to derive some higher-level correlation among various parts of the code
base. Therefore, by overcoming criticisms against existing approaches, we are able to more
adequately evaluate the conceptual basis for source understanding tools in general. To conduct
this evaluation, we have constructed the following general experiment:
Given a vulnerable application and a sufficiently vague description of the vulnerability, a
security auditor has two hours to develop a patch that mitigates it. Within this experiment, the
usefulness of RA can then be measured by comparing the quantitative results achieved via
timing as well as qualitative reports produced by the auditors as they record their methodology
and impressions. If a patch could not be developed in the two-hour time frame, then the tester
would compose a description of the patch that they would create if they had enough time.
Due to the limited experimentation timeline, the role of security auditors in this
experiment was performed by the authors of this paper. To derive any reasonably valid results,
the experimental process was repeated on several code bases and the use of RA was rotated
among participants. To mitigate any existing inherent speed differences, we first established a
baseline among the participants and used this as the guide for further comparison. A more
detailed description of this procedure follows.
As a first step, we pre-populated the database of tag-to-keyword pairings with relevant
tags for the Java language. These pairings were taken from the OWASP Code Review Guide
v1.1 [OWASP] where they were specifically identified as being relevant to the security auditing
process. We next identified a set of Java projects that are known to contain at least one
vulnerability. We verified the exploitability of these vulnerabilities by leveraging modules from
the Metasploit project.
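As an illustration of what this pre-population might look like, the snippet below loads a handful of OWASP-style categories for Java into MongoDB, which RA uses for storage. The collection name, document schema, and keyword lists shown here are assumptions made for the sake of example; the actual pairings were drawn from the OWASP Code Review Guide.

```python
# Hypothetical pre-population of the tag-to-keyword database. RA stores its
# pairings in MongoDB, but the "ra.tags" collection name, the document layout,
# and the keyword lists below are illustrative assumptions, not RA's schema.
from pymongo import MongoClient

OWASP_STYLE_TAGS = {
    "authentication": ["HttpSession", "Principal", "login"],
    "input_validation": ["getParameter", "getQueryString", "getHeader"],
    "cryptography": ["MessageDigest", "Cipher", "SecureRandom"],
}

tags = MongoClient("mongodb://localhost:27017")["ra"]["tags"]
for tag, keywords in OWASP_STYLE_TAGS.items():
    # one document per (tag, language) pair; upsert keeps reruns idempotent
    tags.replace_one(
        {"tag": tag, "language": "java"},
        {"tag": tag, "language": "java", "keywords": keywords},
        upsert=True,
    )
```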
To obtain a reasonably vague description of the vulnerabilities, we enlisted the assistance
of a security-knowledgeable colleague. Their assignment was to look at the respective Metasploit
modules and filter the existing vulnerability summary to remove any detailed information that
might indicate the exact location of the vulnerability or any associated functions. The intent
behind this was to narrow the scope of the security audit to a reasonable functionality subset, yet
not reveal so much as to make the test uninformative. After practicing on a preliminary test, not included in the results, we also decided to allow viewing of the exploit code used by Metasploit. We retained the redacted descriptions, but allowed static analysis of the exploit in order to mitigate both the time constraints on our tests and the enormity of the test candidates' code bases.
For testing, the auditors used identical virtual machines running Ubuntu 10.04. Then,
given an uninterrupted time period, they were directed to complete the experiment for each Java
test subject and record their experiences as well as any timing results. Timing began once the testers had verified that the exploit worked on the application and had begun the task of understanding how the application was exploited. Timing stopped when an acceptable patch was implemented and tested, or when a proposed patch was determined to have a high chance of success but would require far more than the allotted two hours to implement. In the case of our tests, every single patch found was determined to have a high likelihood of success but to require far more than two hours to implement. For purposes of quantitative analysis, one auditor was assigned the use of RA and the other was prohibited from using it. The rules for this competition were as follows:
- Static analysis only
  - Exception: an auditor's patch is deemed successful only by running the respective Metasploit module and verifying non-exploitation.
- Permitted applications:
  - Without tool: vim, grep
  - With tool: vim, grep, RA
- Maximum auditing time: 2 hours
As researchers, we acknowledge that, despite the academic rigor that this approach
provides over previous work, this experimental process is still highly subjective and presents
room for a large margin for error. Factors that may contribute to this error include the relative
immaturity of the participants as security auditors as well as the small sample size of the Java
projects selected for experimentation. However, as the measurement of “understanding” is itself
a highly subjective concept, it is our hope that our novel testing procedures and the results
gathered therein will still serve as a benefit to future program comprehension research. The
results from these experiments have been compiled and are presented in the following section.
RESULTS
Over the course of about a week, we analyzed four test subjects: one as practice and three as actual test candidates. First, we found Metasploit exploit modules on Exploit-DB, and then downloaded the source code for the projects from their respective repositories. Our findings are tabulated as follows:
Table 1: Patch Design Speed with and without RA (max 120 minutes)

                                  Tester 1            Tester 2            Ratio
ElasticSearch 1.1.1 (baseline)    w/o tool: 38 min    w/o tool: 40 min    0.95
Apache Struts 2.3.16              w/ tool: 120 min    w/o tool: 120 min   1
Apache Roller 5.0.1               w/o tool: 120 min   w/ tool: 120 min    1
The first test, our baseline, was designed to mitigate any difference in speed that we as
testers would have regardless of using our tool or not. However, it turned out that we had
approximately the same speed when it came to designing a patch for ElasticSearch. We found the problem that the Metasploit module took advantage of to be arbitrary execution of Java, almost certainly included as a feature. Tester 1 proposed sandboxing the Java execution or implementing a domain-specific language. Tester 2 proposed something quite different: an
authentication requirement to be able to use the Java execution feature of ElasticSearch. Both
testers' methodologies were very similar, involving heavy use of grep and starting by searching
for the "script_fields" parameter as seen in the exploit code. From that point, they both "hunted
and pecked" for various components, reading them in vim and finding other files to investigate
using grep. Both testers found a line by which they could disable the exploit, but proposed the above patches as the actually viable options for an in-production server.
In the next test on Apache Struts, we introduced the use of RA. Notably, both testers ran
into the time limit for this test. They found the issue to be the use of Object-Graph Navigation Language (OGNL) expressions, which managed to bypass security measures and execute
arbitrary Java. Both testers arrived at the same patch: filtering access to "staticMethodAccess"
field on the "OgnlUtil" object. Another, much simpler, though trivial, option would be to not run production servers in developer mode. However, we sought to fix the OGNL security bypass even for developer mode. The tester without the use of RA used a method very similar to that described above for ElasticSearch, while the tester with RA used grep to determine jumping-in points into the codebase, but then used RA as a code browser and manually added tags as needed.
This manual addition of tags was required because the prefabricated database of tags from the
OWASP document had no tags for OGNL. Because the test candidate's vulnerability dealt in code that had no pre-made tags, the use of RA was not as readily beneficial as it could have been. However, the tester using RA still found it to yield incremental increases in understanding much more steadily than just using grep had in the baseline test. Observing the audit logs, the tester using RA was able to find the exact location of script execution, while the tester not using RA proposed a more exact possible solution to the OGNL expression execution, though the increased precision was not verified to be tenable.
For the last test (Apache Roller), the testers switched who was using RA and again began
the process of trying to understand the source enough to patch the Metasploit vulnerability. Both
testers also ran into the time limit for this test. Both testers soon found that the same OGNL issue
was being exploited in Roller as in Struts; furthermore, Roller actually included a version of
Struts that was being exploited by the Metasploit module. However, in spite of this, the patch
was not any easier to construct; in fact, neither tester could find the exact location of the payload execution. However, the tester with RA demonstrated a much better understanding of the code, judging by their audit log.
DISCUSSION
Our first and primary goal was to assess whether a simple algorithm augmented auditor
understanding of code or not. To accomplish this goal, we made two contributions to the
program comprehension field: a novel understanding testing methodology, and a novel program
comprehension tool. We'll first discuss our proposed abstract outline of the testing framework,
and then our implementation and usage of it for our tests.
During the development of our testing methodology, we were cognizant that evaluating
human understanding of anything, even of a computer program, is more of a psychological or
social science question than a computer science one. Inherently, surmounting the qualitative nature
of this kind of analysis required creativity and a willingness to investigate analyses mostly
foreign to computer science. We suspect that the difficulty of quantitative analysis of human
understanding of programs is why there is such a lack of any comprehension analysis in modern
literature. This, however, is unacceptable, because as niche program comprehension algorithms gain undeniable, incremental improvements, there can be no intelligent vectoring of effort toward the niche research fields that are most beneficial to real-world auditors, reverse engineers, and software engineers. To begin addressing this issue in the program comprehension field, we propose an abstract outline of a testing framework that
should be applied to existing tools to evaluate their efficacy for front-line auditors, reverse engineers, and others who require source code understanding:
1. Choose qualitative and quantitative metrics
2. Choose test corpus and testers
3. Conduct tests
4. Evaluate effectiveness of metrics, corpus and testers
5. If metrics acceptably relevant, evaluate results
6. Present results, metrics and metric effectiveness
We’ll now discuss how we applied this framework for testing RA.
For our qualitative metric, we drew on von Mayrhauser’s work and adopted recorded
anecdotal evidence. For the quantitative metric, we found no precedents to draw upon, but
decided to time ourselves patching a vulnerability for which an exploit was available, and to compare the timing data to determine, as objectively as possible, whether our tool added value to the code auditing process. Since we lacked the time to solicit external testers and instead had to act
as testers ourselves, timing successful patch development to known exploits gave us a viable
method of testing and comparing our understanding to each other without having to know
anything about the code beforehand or collaborate during the testing to determine our respective
levels of understanding. Next, in evaluating our metrics, corpus and testers, we find that there is
much room for improvement.
First, our quantitative metric of timing was unhelpful in determining whether our tool
was beneficial or not. This is because we set a testing time limit of two hours, and both testers hit
the limit for both tests. Additionally, our stopping point for the timing amounted to “breaking”
the exploit, which can be accomplished by multiple levels of complexity and time commitment,
from disabling the vulnerable service, to implementing authentication. For timing results to
really be comparable, there must be fewer solutions to the problem, preferably one. Otherwise
the testers’ paths of understanding are different and therefore not truly comparable. However,
this raises the question of whether, if the application can only be understood in one way, human understanding is even needed. Our qualitative metric of recorded anecdotes was unsurprisingly more successful, though it, by nature, requires expertise and domain knowledge to interpret. The testers' method of recording their findings and insights would benefit from standardization, which would make the logs more comparable.
Second, our choice of using user-level (non-JVM/JRE) Java applications with Metasploit exploits as our corpus had two unforeseen consequences: one actually desirable and the other less so. Desirably, the exploits in our corpus were design-level vulnerabilities, which require much more understanding of the source code than alternatives such as buffer overflows in a language such as C or sanitization issues as in PHP. Undesirably, we found only about five test candidates in a Metasploit exploit database of 1,350 modules, therefore limiting our choices in terms of
testing.
Lastly, the choice of using ourselves as testers was probably the most detrimental factor
in our whole testing procedure. This prohibited us from constructing our own toy code for more
manageable understanding challenges. Having external testers would likely enable better quantitative timing results and a better, more impartial ability to compare the qualitative audit logs. In spite of all these ways to improve our testing procedure, we still consider our
qualitative metrics effective enough to draw conclusions about RA from the results of our tests.
Qualitatively, we found RA to be somewhat useful. It easily prevented grepping for the same
strings over and over, as often happens when using grep on source code bases. From the
qualitative audit logs, the tester with RA seemed to have more success in understanding the code
than the tester without. In Apache Struts, the tester with RA found the exact place of script
execution and in the Apache Roller example, the tester with RA could construct a much more
comprehensive picture of the application and suggest a much more precise fix for the
vulnerability. Since both testers, when they used RA, added custom tags, we believe that this
indicates that the novel functionality introduced by RA is what helps, not just having a code
browser. However, since the test candidates we examined had vulnerabilities in parts of the code
not addressed by our pre-constructed tag database, we cannot conclusively say whether pre-populated tags truly help or not. It is only reasonable to believe that they do, and that they are in fact
probably much more helpful than needing to construct one’s own tags. However, our tests,
unfortunately, did not provide any empirical evidence for this.
Some definite challenges we encountered during testing included the lethargy of response
when custom tags were added to large projects, the judiciousness and expertise needed for
adding custom tags, and a lack of user interface sophistication. The lethargy was due largely to our Python implementation of the tagging engine. We could replace this engine with a bash one-liner that would be much quicker, though we have not tested this method on any test candidates; a rough sketch of delegating the search to grep follows this paragraph. Not only was adding custom tags painfully slow, but so was installation of the tool, as it required Apache, MongoDB, PHP, and Python. This complexity could be easily mitigated by creating an apt-get package.
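The sketch below illustrates the proposed speed-up by delegating keyword matching to grep from Python via a subprocess call. As noted above, it is untested against our candidates; the flags used are ordinary GNU grep options, the project path is a placeholder, and the "ognl" keywords are merely reminiscent of the custom tag added during the Struts test.

```python
# Untested sketch of the proposed speed-up: let grep do the recursive keyword
# search instead of the pure-Python tagging engine. Assumes GNU grep is on the
# PATH and that keywords are plain identifiers (no regex metacharacters).
import subprocess

def grep_tag(root, keywords):
    """Return grep's file:line:text matches for all of a tag's keywords."""
    pattern = "|".join(keywords)  # keywords combined into one extended regex
    cmd = ["grep", "-rnE", "--include=*.java", pattern, root]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout.splitlines()

# e.g. hits for a hypothetical "ognl" tag like the one added during the Struts test
for hit in grep_tag("apache-struts-2.3.16", ["OgnlUtil", "OgnlContext"])[:20]:
    print(hit)
```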
Concerning the expertise needed to add custom tags, we discovered that adding tags to
increase understanding of an unfamiliar codebase was not as trivial in the general case as it might
seem and required understanding itself. In some cases, trivial custom tags could be eminently useful, for example in identifying wrapper functions or custom implementations of library functions (a small illustration follows this paragraph). But our test candidates included no such low-hanging fruit. This, again, was due to our choice of Java projects as our test corpus. We believe that if we had used C or PHP projects as tests, the usual buffer overflow and input validation culprits would have been easier for RA to highlight, though they would require less understanding of the code.
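To make the wrapper-function case concrete, and reusing the scan() sketch shown earlier, registering such a custom tag amounts to a one-line addition. The tag name, keyword list, and project path below are hypothetical and are not taken from any of our test candidates.

```python
# Hypothetical custom tag for a project-specific wrapper around memory copies,
# added mid-audit; reuses the scan() sketch shown earlier with a C file filter.
TAGS["memory_copy_wrappers"] = ["safe_copy", "copy_wrapper", "memcpy"]
for path, lineno, line in scan("path/to/c_project", ext=".c")["memory_copy_wrappers"]:
    print(f"{path}:{lineno}: {line}")
```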
Lastly, even though we focused on testing the usefulness of semantic tagging as an
approach, sophistication of the user interface is an instrumental force multiplier when
considering human understanding. A poor user interface can be a bottleneck between the user
and an algorithm, even when the user has good comprehension ability and the algorithm is reasonably useful. Having the ability to display multiple tags at the same time and to define sub-tags would greatly increase the effectiveness of RA. Even simple things like keybindings and autocompleting tag names would greatly help the auditor using RA. These changes might yield large increases in usefulness for what amounts to an improvement in presentation.
CONCLUSION
In the domain of machine-assisted semantic understanding of code, a sub-domain of program understanding, there are many tools whose effectiveness is considered apparent and in no need of verification. There are also many bleeding-edge research techniques that objectively improve upon past techniques but do not provide evidence of actually augmenting humans'
understanding. We sought to break this mold with Rapid Apperception, a tool incrementally
more complex than the commonly accepted grep utility but empirically verified with rigorous
testing procedures.
Our hypothesis was that, by pre-populating a semantic tag database with related keywords and making a simple user interface to highlight these tags, a code auditor's job could
be made noticeably easier. To test this hypothesis, we constructed a basic tool and established a
novel testing methodology to evaluate both the tool and the methodology itself. The crux of our
testing ended up depending on analyzing recorded anecdotes of understanding, an approach we
validated via von Mayrhauser. Though we found that our testing procedures could use significant
improvement, we derived enough qualitative evidence from our tests to determine that semantic
tagging is a useful technique to employ in code auditing. However, its usefulness is directly connected to the amount of expertise and applicability embodied by the semantic tag database. If custom tagging is required and no pre-populated tags apply, then understanding starts at square one, but can be gained more quickly than without our tool. Also, with a team of auditors, once
custom tags are added, the understanding process is bootstrapped and no longer needs to be
repeated.
Lastly, though the underlying algorithm of a machine-assisted semantic understanding of
code technique needs to be viable for any utility to be derived from its use, the user interface
which presents the results of the algorithm can be a bottleneck or force multiplier depending on
its design and features. We consider our work to be only the first step in evaluating a field of
research that has long needed more robust empirical evaluation, and more effort towards
bringing the best of program comprehension research to the fingertips of code auditors.
REFERENCES
Dit, B., Revelle, M., Gethers, M. and Poshyvanyk, D. (2013), Feature location in source code: a taxonomy and survey. J. Softw. Evol. and Proc., 25: 53–95.
De Lucia, A.; Di Penta, M.; Oliveto, R.; Panichella, A.; Panichella, S., "Using IR methods for labeling source code artifacts: Is it worthwhile?," Program Comprehension (ICPC), 2012 IEEE 20th International Conference on, pp. 193-202, 11-13 June 2012.
"GNU GLOBAL Source Code Tagging System." GNU Project . N.p., n.d. Web. 07 Feb. 2015. <http://www.gnu.org/software/global/>.
Guéhéneuc, Yann-Gaël. "A Comparative Framework for Design Recovery Tools." Proceedings of the Conference on Software Maintenance and Reengineering (n.d.): n. pag. Web. 24 Apr. 2015.
Imagix Corp. "Analyze Your Source Code." Reverse Engineering and Source Code Analysis Tools . N.p., n.d. Web. 07 Feb. 2015. <http://imagix.com/>.
Jim Q. Ning, Andre Engberts, and W. Voytek Kozaczynski. 1994. Automated support for legacy code understanding. Commun. ACM 37, 5 (May 1994), 50-57.
Koskinen, Jussi, and Tero Lehmonen. "Analysis of Ten Reverse Engineering Tools." Advanced Techniques in Computing Sciences and Software Engineering (n.d.): n. pag. Web. 24 Apr. 2015.
Moreno, L.; Marcus, A.; Pollock, L.; Vijay-Shanker, K., "JSummarizer: An automatic generator of natural language summaries for Java classes," Program Comprehension (ICPC), 2013 IEEE 21st International Conference on, pp. 230-232, 20-21 May 2013.
Maalej, Walid, et al. "On the comprehension of program comprehension." ACM Transactions on Software Engineering and Methodology (TOSEM) 23.4 (2014): 31.
OWASP, “OWASP Code Review Guide V1.1”. Web. 2008. <https://www.owasp.org/images/2/2e/OWASP_Code_Review_GuideV1_1.pdf>
"Source Navigator NG." N.p., n.d. Web. 07 Feb. 2015. <http://sourcenav.sourceforge.net/>.
Spinellis, Diomidis. "The CScout Refactoring Browser." Department of Management Science and Technology, Athens University of Economics and Business, n.d. Web. <http://www.spinellis.gr/cscout/doc/index.html>.
"Understand Your Code | SciTools.com." SciTools.com . N.p., n.d. Web. 10 Feb. 2015. <https://scitools.com/>.
Von Mayrhauser, A. "From Code Understanding Needs to Reverse Engineering Tool Capabilities." Computer-Aided Software Engineering (n.d.): n. pag. Web. 24 Apr. 2015.