
TRUST Autumn 2008 Conference: November 11-12, 2008

Comparison of Blackbox and Whitebox Fuzzers in Finding Software Bugs

Marjan Aslani, Nga Chung, Jason Doherty, Nichole Stockman, and William Quach

Summer Undergraduate Program in Engineering Research at Berkeley (SUPERB) 2008

Team for Research in Ubiquitous Secure Technology


Overview

– Introduction to fuzz testing
– Our research
– Results


What Is Fuzzing?

A method of finding software flaws by feeding purposely invalid data to a program as input.

– Introduced by B. Miller et al.; inspired by line noise

– Applications: image processors, media players, operating systems
– Fuzz testing is generally automated
– Finds many reliability problems, many of which are potential security holes
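The core loop described above fits in a few lines of Python. This is an illustrative sketch only; the target command and output file name are placeholders, not part of the original work.

```python
import random
import subprocess

def fuzz_once(target_cmd, seed_bytes):
    """Feed one purposely corrupted input to the target and report
    whether the process crashed. Sketch only; names are placeholders."""
    data = bytearray(seed_bytes)
    # Corrupt a handful of random bytes in an otherwise valid input.
    for _ in range(8):
        data[random.randrange(len(data))] = random.randrange(256)

    with open("fuzzed.bin", "wb") as f:
        f.write(data)

    # On POSIX, a negative return code means the process died on a
    # signal (e.g. -11 for SIGSEGV), which is what a fuzzer looks for.
    result = subprocess.run(target_cmd + ["fuzzed.bin"])
    return result.returncode < 0
```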


Types of Fuzz Testing

Blackbox: Randomly generated data is fed to a program as input to see if it crashes.

– Does not require knowledge of the program's source code or deep code inspection.
– A quick way of finding defects without knowing the details of the application.

Whitebox: Creates test cases by considering the target program's logical constraints and data structures.

– Requires knowledge of the system and how it uses the data.
– Penetrates deeper into the program.


Zzuf - Blackbox Fuzzer

Finds bugs in applications by corrupting random bits in user-contributed data.

To generate new test cases, Zzuf uses a range of seeds and fuzzing ratios (the fraction of input bits to corrupt).
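The seed/ratio scheme is easy to picture: the seed makes the corruption reproducible, and the ratio controls how much of the input gets flipped. The Python sketch below mimics those two parameters; it is not zzuf's actual implementation.

```python
import random

def mutate(data: bytes, seed: int, ratio: float) -> bytes:
    """Flip roughly `ratio` of the input's bits, reproducibly per seed
    (a sketch of zzuf's seed/ratio idea, not its real code)."""
    rng = random.Random(seed)             # same seed -> same test case
    out = bytearray(data)
    n_flips = max(1, int(len(out) * 8 * ratio))
    for _ in range(n_flips):
        bit = rng.randrange(len(out) * 8)
        out[bit // 8] ^= 1 << (bit % 8)   # flip one bit
    return bytes(out)

# Sweeping both parameters yields a family of distinct test cases:
# for seed in range(1000):
#     for ratio in (0.001, 0.004, 0.01):   # example ratios
#         test_case = mutate(sample, seed, ratio)
```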


Catchconv (CC) - Whitebox Fuzzer

To create test cases, CC starts with a valid input, observes the program's execution on that input, and collects the path condition the program follows on that sample. It then attempts to infer related path conditions that lead to an error and uses these as the starting point for bug-finding.

CC has some downtime while it is only tracing a program and not generating new fuzzed files.
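The path-condition idea can be shown on a toy example. The sketch below captures only the shape of the approach; Catchconv itself works on program binaries via Valgrind and a constraint solver, not on Python source.

```python
def parse(data: bytes):
    """A toy 'parser' with two branches; real targets have thousands."""
    if len(data) < 4:        # branch 1
        return "too short"
    if data[0] == 0x89:      # branch 2
        return data[4]       # bug: index 4 is out of range when len == 4
    return "ok"

def path_condition(data: bytes):
    """The branch predicates this input exercises, in order."""
    return [len(data) >= 4, data[0] == 0x89]

valid = b"\x00abc"           # a valid input: follows [True, False]
# Negating the second branch asks: what input keeps len >= 4 but makes
# data[0] == 0x89 true? Solving that constraint suggests:
crasher = b"\x89abc"         # parse(crasher) raises IndexError
```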


Valgrind

A tool for detecting memory management errors.

Reports the line number in the code where the program error occurred.

Helped us find and report more errors than we would have found had we focused solely on segmentation faults.
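Since Valgrind writes its diagnostics to stderr, harvesting them from a batch run is straightforward. Here is a rough Python wrapper; the matched substrings only approximate Memcheck's message wording.

```python
import subprocess

def valgrind_errors(cmd):
    """Run a command under Valgrind and pull out its error lines,
    which name the offending source location when debug info is present."""
    proc = subprocess.run(["valgrind"] + cmd,
                          capture_output=True, text=True)
    keywords = ("Invalid write", "Invalid read",
                "uninitialised", "Invalid free")
    return [line for line in proc.stderr.splitlines()
            if any(k in line for k in keywords)]

# errors = valgrind_errors(["mplayer", "fuzzed.avi"])  # example use
```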


Types of errors reported by Valgrind

By tracking a program’s execution on a file, Valgrind determines the types of errors that occur, which may include:

– Invalid writes
– Invalid reads
– Double frees
– Uninitialized values
– Syscall param errors
– Memory leaks


Program run under Valgrind


Methodology


Metafuzz

All of the test files that triggered bugs were uploaded to Metafuzz.com. For each bug, the webpage contained:

– A link to the test file
– The bug type
– The program in which the bug was found
– The stack hash identifying where the bug was located


Metafuzz webpage


Target applications

MPlayer, Antiword, ImageMagick Convert, and Adobe Flash Player.

MPlayer was the primary target:
– Open-source software
– Preinstalled on many Linux distributions
– Updates available via Subversion
– Convenient to file a bug report
– Developers would get back to us!

Adobe's bug-reporting protocol requires a bug to receive a number of votes from users before Flash developers will look at it.

VLC requires building from nightly Subversion snapshots.


Research Highlights

In six weeks, we generated more than 1.2 million test cases.

We used the UC Berkeley PSI cluster, which consists of 81 machines (270 processors).
– Zzuf, MPlayer, and CC were installed on these machines.

Created a de-duplication script to find the unique bugs (a sketch of the idea appears below).

Reported 89 unique bugs; developers have already eliminated 15 of them.
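Our actual de-duplication script is not reproduced here, but its essence fits in a few lines: collapse all reports that name the same bug type at the same location, keeping one triggering test case per pair. The tuple format below is an assumption for illustration.

```python
def dedupe(reports):
    """Keep one test case per (bug type, source location) pair.
    `reports` is assumed to be an iterable of
    (bug_type, location, test_file) tuples parsed from Valgrind logs."""
    unique = {}
    for bug_type, location, test_file in reports:
        unique.setdefault((bug_type, location), test_file)
    return unique
```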


Results

To assess the two fuzzers, we gathered several metrics:

– Number of test cases generated
– Number of unique test cases generated
– Total bugs and total unique bugs found by each fuzzer


Results (cont’d)

Generated 1.2 million test cases:
– 962,402 by Zzuf
– 279,953 by Catchconv

From these test cases:
– Zzuf found 1,066,000 errors
– Catchconv reported 304,936

Unique (non-duplicate) errors found:
– 456 by Zzuf
– 157 by Catchconv


Results (cont’d)

Zzuf reports a disproportionately larger number of errors than CC. Is Zzuf better than CC?

No! The two fuzzers generated different numbers of test cases.

How could we make a fair comparison of the fuzzers’ efficiency?
– Gauge the amount of duplicate work performed by each fuzzer.
– Find how many of the test cases were unique.


Average Unique Errors per 100 Unique Test Cases

First, we compared the fuzzers’ performance by the average number of unique bugs found per 100 unique test cases.

– Zzuf: 2.69
– CC: 2.63

Zzuf’s apparent superiority diminishes.


Unique Errors as % of Total Errors

Next, we analyzed the fuzzers’ performance based on the percentage of unique errors found out of the total errors.

– Zzuf: 0.05%
– CC: 0.22%

Less than a quarter of a percentage point of difference between the fuzzers.


Types of Errors (as % of Total Errors)

We also considered comparing the fuzzers based on the types of bugs they found.

Zzuf performed better at finding “invalid write” errors, a more security-critical bug type.

This was not an accurate comparison, however, since we could not tell which bug specifically caused a given crash.


Conclusion

We were not able to draw a solid conclusion about the superiority of either fuzzer based on the metrics we gathered.

Knowing which fuzzer is able to find serious errors more quickly would allow us to make a more informed conclusion about their comparative efficiencies.


Conclusion (cont’d)

We would need to record the number of CPU cycles required to execute test cases and find errors.

Unfortunately, because we did not record this data during our research, we are unable to make such a comparison between the fuzzers.


Guides for Future Research

To perform a precise comparison of Zzuf and CC:

1. Record the difference between the number of test cases generated by Zzuf and CC for a given seed file and time frame.

2. Measure CPU time in order to compare the number of unique test cases each fuzzer generates in a given period.

3. Devise a new method for identifying unique errors and avoiding duplicate bug reports:

– Automatically generate a unique hash for each reported error that can then be used to identify duplicates (see the sketch below).
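One plausible way to build such a hash: fingerprint the top few frames of the error's call stack, so the same bug reached from different inputs collapses to one ID. The frame format and the depth of 5 are arbitrary choices for illustration.

```python
import hashlib

def stack_hash(frames, depth=5):
    """Stable ID for an error, derived from the top of its backtrace.
    `frames` is assumed to be a list of 'function (file:line)' strings
    taken from a Valgrind report."""
    top = "\n".join(frames[:depth])
    return hashlib.sha1(top.encode()).hexdigest()[:12]
```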


Guides for Future Research (cont’d)

4. Use a more robust data-collection infrastructure that can accommodate the massive amount of data collected.

– Our ISP shut Metafuzz down due to excess server load.
– Berkeley's storage filled up.

5. Include an internal issue tracker that records whether or not a bug has already been reported, to avoid filing duplicates (a minimal sketch follows).
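A minimal version of such a tracker only needs a persistent map from stack hash to report URL. The file name and function names below are hypothetical.

```python
import json
from pathlib import Path

LEDGER = Path("reported_bugs.json")   # hypothetical on-disk ledger

def _load():
    return json.loads(LEDGER.read_text()) if LEDGER.exists() else {}

def already_reported(bug_hash: str) -> bool:
    """Consult the ledger before filing a bug upstream."""
    return bug_hash in _load()

def mark_reported(bug_hash: str, tracker_url: str) -> None:
    """Record that this stack hash has been filed, with its report URL."""
    seen = _load()
    seen[bug_hash] = tracker_url
    LEDGER.write_text(json.dumps(seen, indent=2))
```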


Whitebox or Blackbox?

– With a lower budget or less time: use blackbox.
– Once the low-hanging bugs are gone, fuzzing must become smarter: use whitebox.
– In practice, use both.


Acknowledgment

National Science Foundation (NSF) for funding this project through the SUPERB-TRUST (Summer Undergraduate Program in Engineering Research at Berkeley - Team for Research in Ubiquitous Secure Technology) program

Kristen Gates (Executive Director for Education for the TRUST Program)

Faculty advisor: David Wagner

Graduate mentors: Li-Wen Hsu, David Molnar, Edwardo Segura, Alex Fabrikant, and Alvaro Cardenas


Questions?

Thank you