Software Bertillonage: Finding the Provenance of an Entity

25
Software Bertillonage: Finding the Provenance of an Entity Julius Davies, Daniel M. German, Michael W. Godfrey, Abram Hindle

description

Slides from the paper presented at the 2011 IEEE Intl Conf on Mining Software Repositories, by Julius Davies, Daniel German, Mike Godfrey, and Abram Hindle

Transcript of Software Bertillonage: Finding the Provenance of an Entity

Page 1: Software Bertillonage: Finding the Provenance of an Entity

Software Bertillonage: Finding the Provenance of an Entity

Julius Davies, Daniel M. German, Michael W. Godfrey, Abram Hindle

Page 2: Software Bertillonage: Finding the Provenance of an Entity
Page 3: Software Bertillonage: Finding the Provenance of an Entity

“Provenance”

• Originally, used for art / antiques, but now used in science and IT:– Data provenance / audit trails– Component provenance for security– … but what about source code artifacts?

A set of documentary evidence pertaining to the origin, history, or ownership of an artifact.

[From “provenir”, French for “to come from”]

Page 4: Software Bertillonage: Finding the Provenance of an Entity

Who are you?

Alphonse Bertillon(1853-1914)

Page 5: Software Bertillonage: Finding the Provenance of an Entity

Bertillonage metrics

1. Height2. Stretch: Length of body from left

shoulder to right middle finger when arm is raised

3. Bust: Length of torso from head to seat, taken when seated

4. Length of head: Crown to forehead5. Width of head: Temple to temple6. Length of right ear7. Length of left foot8. Length of left middle finger9. Length of left cubit: Elbow to tip of

middle finger10. Width of cheeks

Page 6: Software Bertillonage: Finding the Provenance of an Entity

Forensic Bertillonage

• “Quick and dirty”, and a giant step forward– Can narrow huge set of candidate mugshots down to a

small handful!

• Some problems:– Equipment + training required– Very sensitive to measurement error– The metrics are not independent!– … and fingerprints are much more accurate

Page 7: Software Bertillonage: Finding the Provenance of an Entity

Software Bertillonage• We want quick & dirty ways investigating the provenance of a function

(file, library, binary, etc.)– Who are you, really? – Where did you come from?– Does your mother know you’re here?

• It’s not fingerprinting or DNA analysis!– You may be looking for a cousin or ancestor

• A good software Bertillonage metric should:– be computationally inexpensive– be applicable to the desired level of granularity / prog. language– catch most of the bad guys (recall)– significantly reduce the search space (precision)

Page 8: Software Bertillonage: Finding the Provenance of an Entity

Bertillonage meta-techniques

1. Count based size, LOC, fan-in, McCabe

2. Set based contained string literals [TU Delft], method names

3. Relationship based call sets [“origin analysis”], throws sets, libraries included / used

4. Sequence based methods in order, token-based clone detection

5. Graph based AST and PDG clone detection

Page 9: Software Bertillonage: Finding the Provenance of an Entity

Software Bertillonage

A motivating example

Page 10: Software Bertillonage: Finding the Provenance of an Entity

A real-world problem

• Software packages often bundle in third-party libraries to avoid “DLL-hell” [Di Penta-10]

– In Java world, jars may include library source code or just byte code– Included libs may include other libs too!

• Payment Card Industry Data Security Std (PCI-DSS), Req #6:– “All critical systems must have the most recently released, appropriate

software patches to protect against exploitation and compromise of cardholder data.”

What if a financial software package doesn’t explicitly list the version IDs of its included libraries?

Page 11: Software Bertillonage: Finding the Provenance of an Entity

Identifying included libraries

• The version ID may be embedded in the name of the component!e.g., commons-codec-1.1.jar– … but often the version info is simply not there, or it’s wrong, ….

• Use fully qualified name (FQN) of each class plus a code search engine [Di Penta 10]

– Will return only product, not version

• Compare against all known compiled binaries– But compilers, build-time compilation options may differ

Page 12: Software Bertillonage: Finding the Provenance of an Entity

commons-codec-1.1.jarpublic class org.apache.commons.codec.binary.Base64 implements BinaryEncoder, BinaryDecoder public <init>() private static boolean isBase64(byte) public static boolean isArrayByteBase64(byte[]) public static byte[] encodeBase64(byte[]) public static byte[] encodeBase64Chunked(byte[]) public Object decode(Object) throws DecoderException public byte[] decode(byte[]) throws DecoderException public static byte[] encodeBase64(byte[],boolean) public static byte[] decodeBase64(byte[]) static byte[] discardWhitespace(byte[]) public Object encode(Object) throws EncoderException public byte[] encode(byte[]) throws EncoderException

Page 13: Software Bertillonage: Finding the Provenance of an Entity

commons-codec-1.2.jarpublic class org.apache.commons.codec.binary.Base64 implements BinaryEncoder, BinaryDecoder public <init>() private static boolean isBase64(byte) public static boolean isArrayByteBase64(byte[]) public static byte[] encodeBase64(byte[]) public static byte[] encodeBase64Chunked(byte[]) public Object decode(Object) throws DecoderException public byte[] decode(byte[]) public static byte[] encodeBase64(byte[],boolean) public static byte[] decodeBase64(byte[]) static byte[] discardWhitespace(byte[]) static byte[] discardNonBase64(byte[]) public Object encode(Object) throws EncoderException public byte[] encode(byte[])

Page 14: Software Bertillonage: Finding the Provenance of an Entity

Anchored class signatures• Idea: Compile / acquire all known lib versions but extract only the

signatures, then compare against target binary– Shouldn’t vary by compiler/build settings

• For a class C with methods M1, … , Mn, we define its anchored class signature as:

θ(C) = σ(C), σ(M⟨ ⟨ 1), ..., σ(Mn)⟩⟩

• For an archive A composed of classes C1,…,Ck, we define its anchored class signature as

θ(A) = {θ(C1 ), ..., θ(Ck )}

θ(C) = σ(C), σ(M⟨ ⟨ 1), ..., σ(Mn)⟩⟩

θ(A) = {θ(C1 ), ..., θ(Ck )}

Page 15: Software Bertillonage: Finding the Provenance of an Entity

// This is **decompiled** source!!package a.b;

public class C extends java.lang.Object implements g.h.I { public C() { // default constructor is inserted by javac }

synchronized static int a (java.lang.String s) throws a.b.E {

// decompiled byte code omitted }}

σ(C) = public class a.b.C extends Object implements I σ(M1 ) = public C()σ(M2 ) = default synchronized static int a(String) throws E

θ(C) = σ(C), σ(M⟨ ⟨ 1 ), σ(M2 )⟩⟩

Page 16: Software Bertillonage: Finding the Provenance of an Entity

Archive similarity

• We define the similarity index of two archives as their Jaccard coefficient:

• We define the inclusion index of two archives as:

Page 17: Software Bertillonage: Finding the Provenance of an Entity

Maven 2

Page 18: Software Bertillonage: Finding the Provenance of an Entity

Implementation

• Created byte code (bcel5) and source code signature extractors

• Used SHA1 hash for class signatures to improve performance– We don’t care about near misses at the method or class level!

• Built corpus from Maven2 jar repository– Maven is unversioned + volatile!– 150 GB of jars, zips, tarballs, etc., – 130,000 binary jars (75,000 unique)– 26M .class files, 4M .java source files (incl. duplicates)– Archives contain archives: 75,000 classes are nested 4 levels deep!

Page 19: Software Bertillonage: Finding the Provenance of an Entity

An example

Looking for cewolf-1.0 in Maven2

Page 20: Software Bertillonage: Finding the Provenance of an Entity

Investigation

Target system: An industrial e-commerce app containing 84 jars

RQ1: How useful is the signature similarity index in finding the original binary archive for a given binary archive?

RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?

Page 21: Software Bertillonage: Finding the Provenance of an Entity

Investigation

RQ1: How useful is the signature similarity index at finding the original binary archive for a given binary archive?

• 51 / 84 binary jars (60.7%), we found a single candidate from the corpus with similarity index of 1.0.– 48 exact, 3 correct product (target version not in Maven)

• 20 / 84 we found multiple matches with simIndex = 1.0– 19 exact, 1 correct product

• 12 / 84 we found matches with 0 < simIndex < 1.0– 1 exact, 9 correct product, 2 incorrect product

• 1 / 84 we found no match (product not in Maven as a binary)

More data here: http://juliusdavies.ca/uvic/jarchive/

Page 22: Software Bertillonage: Finding the Provenance of an Entity

Further uses

1. Used the version info from extracted from e-commerce app to perform audits for licensing and security– One jar changed open source licenses (GNU Affero, LGPL) – One jar version was found to have known security bugs

2. When did Google Android developers copy-paste httpclient.jar classes into android.jar?– And how much work would it be to include a newer version?– We narrowed it down to two likely candidates, one of which turned

out to be correct.

Page 23: Software Bertillonage: Finding the Provenance of an Entity

Summary

• Who are you?– Determining the provenance of software entities is a growing and

important problem

• Software Bertillonage:– Quick and dirty techniques applied widely, then expensive techniques

applied narrowly

• Identifying version IDs of included Java libraries is an example of the software provenance problem– And our solution of anchored signature matching is an example of

software Bertillonage

Page 24: Software Bertillonage: Finding the Provenance of an Entity

Production babies(born the week the paper was due)

Dad Julius and Naoki

Uncle Mike and Lilia Biswas (who was born in Mike’s basement)

Page 25: Software Bertillonage: Finding the Provenance of an Entity

Software Bertillonage: Finding the Provenance of an Entity

Julius Davies, Daniel M. German, Michael W. Godfrey, Abram Hindle