Used to be applicable to literary corpus/ academia only Source code similarity/plagiarism detection...



Used to be applicable to literary corpora and academia only

Source code similarity/plagiarism detection is very important

“Moss” is the most widely known software similarity detection tool

Can provide valuable insight into malware detection


Generally not true

In the Android app domain, it can be!

86% of Android malware samples are repackaged versions of legitimate apps with malicious payloads (source: “Dissecting Android Malware: Characterization and Evolution”)

Similarity detection is crucial


Each Android app is packaged as an APK file with a .apk extension

Each APK file contains a .dex file, a Dalvik executable that is run by the Dalvik virtual machine

Fingerprint the APK by hashing its features into a bit vector


The value of K was set to 5, chosen by experiment. Pairs of apps were drawn from 6,000 randomly sampled apps and the distance between each pair was computed. From K = 5 upwards, the value of K has little impact on the distance calculation

The mean basic block size is 5.35 opcodes and the median is 2 opcodes, while the largest basic block in the dataset contains 35,517 opcodes
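The k-gram extraction described above can be sketched as follows. This is a minimal illustration, assuming k-grams are sliding windows of K consecutive opcodes taken within a single basic block; the function names and the sample opcode sequences are hypothetical, not taken from the tool or from a real app.

```python
from typing import Iterable, List, Set, Tuple

K = 5  # k-gram length reported on the slide


def kgrams(opcodes: List[str], k: int = K) -> List[Tuple[str, ...]]:
    """Return every window of k consecutive opcodes in one basic block."""
    return [tuple(opcodes[i:i + k]) for i in range(len(opcodes) - k + 1)]


def app_kgrams(basic_blocks: Iterable[List[str]], k: int = K) -> Set[Tuple[str, ...]]:
    """Collect the distinct k-grams over all basic blocks of an app."""
    features: Set[Tuple[str, ...]] = set()
    for block in basic_blocks:
        features.update(kgrams(block, k))
    return features


# Hypothetical opcode sequences for illustration only.
blocks = [
    ["const/4", "invoke-virtual", "move-result", "if-eqz", "const/4", "return"],
    ["return-void"],  # shorter than k, so it contributes no k-grams
]
features = app_kgrams(blocks)
```

With K = 5, the six-opcode block yields two k-grams and the one-opcode block yields none, which shows why short basic blocks contribute little to the feature set under this assumption.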


The bit vector size m is chosen by experiment. m should be much larger than N (m >> N), where N is the number of k-grams extracted from an application, to limit collisions between two k-gram feature sets

30,000 apps were used to determine m.

m = N90 × 9 = 240,007, a prime number
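A rough sketch of hashing k-grams into a bit vector of size m is shown below. The slides do not name the hash function, so `zlib.crc32` is an assumption chosen only because it is deterministic and in the standard library; the function names are hypothetical, and a Python integer stands in for the bit vector.

```python
import zlib
from typing import Iterable, Tuple

M = 240_007  # bit vector size from the slide


def hash_feature(kgram: Tuple[str, ...], m: int = M) -> int:
    """Map one k-gram to a bit position in [0, m). crc32 is an assumed hash."""
    data = "\x00".join(kgram).encode("utf-8")
    return zlib.crc32(data) % m


def to_bitvector(features: Iterable[Tuple[str, ...]], m: int = M) -> int:
    """Set one bit per k-gram; the result is an int used as a bit vector."""
    bits = 0
    for kgram in features:
        bits |= 1 << hash_feature(kgram, m)
    return bits


# Hypothetical k-grams for illustration only.
fingerprint = to_bitvector({
    ("a", "b", "c", "d", "e"),
    ("b", "c", "d", "e", "f"),
})
```

Two different k-grams can land on the same bit. Choosing m much larger than the number of k-grams per app makes that unlikely, which is consistent with the m >> N requirement above.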


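Given the bit vector representations of two apps A and B, their similarity is computed by the formula:

J(A, B) = |A ∧ B| / |A ∨ B|

where |x| is the number of bits set to 1 in x. This formula is a variation of the original Jaccard similarity, with bitwise AND and OR in place of set intersection and union.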
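The similarity formula above can be computed directly with bitwise operations. The sketch below assumes each fingerprint is held in a Python integer; the function names and the handling of two empty fingerprints are assumptions made for illustration.

```python
def popcount(x: int) -> int:
    """Count the bits set to 1 in a non-negative integer."""
    return bin(x).count("1")


def jaccard_similarity(a: int, b: int) -> float:
    """Bitwise Jaccard: set bits of (a AND b) over set bits of (a OR b)."""
    union = popcount(a | b)
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return popcount(a & b) / union


a = 0b101101
b = 0b100111
# a & b = 0b100101 (3 bits set), a | b = 0b101111 (5 bits set)
print(jaccard_similarity(a, b))  # 0.6
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no set bits.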


If the app is heavily obfuscated, Juxtapp may not perform well

Use of third-party libraries can add a lot of noise and adversely affect the similarity score


Who wrote it?

Identify an anonymous author by comparing their writing style against a corpus of texts of known authorship

Primary application has shifted from the literary domain to forensics: terrorist threats, harassment


2.4 million posts from 100,000 blogs (almost a billion words)

Stylometry: identify an author based on writing style

Are N-gram techniques suitable? Not really, because they reveal more about the context than about the author


Prepare test set and training set

Build a classifier with the training set

Test the classifier with the test set

Which features should be considered?
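The train-and-test workflow above can be illustrated with a toy example. The slides do not say which classifier or features were used here, so everything below is an assumption: function-word frequencies as features (a common stylometric choice), a nearest-neighbour rule with cosine similarity as the classifier, and made-up sample texts. Tokenization is a plain whitespace split, so punctuation is not handled.

```python
import math
from collections import Counter
from typing import List, Tuple

# Assumed feature set: relative frequencies of a few function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "but", "i", "you"]


def features(text: str) -> List[float]:
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def train(labelled: List[Tuple[str, str]]) -> List[Tuple[str, List[float]]]:
    """'Training' here just stores one feature vector per known text."""
    return [(author, features(text)) for author, text in labelled]


def predict(model: List[Tuple[str, List[float]]], text: str) -> str:
    vec = features(text)
    return max(model, key=lambda item: cosine(item[1], vec))[0]


def accuracy(model, test_set: List[Tuple[str, str]]) -> float:
    correct = sum(predict(model, text) == author for author, text in test_set)
    return correct / len(test_set)


# Made-up texts for illustration only.
training_set = [
    ("alice", "the cat and the dog and the bird"),
    ("bob", "i think that you know that i is to go"),
]
test_set = [
    ("alice", "the fox and the hen"),
    ("bob", "you know that i try"),
]
model = train(training_set)
score = accuracy(model, test_set)
```

A realistic setup would use far more text per author and a richer feature set; the point here is only the split into training data, a fitted model, and held-out test data.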


Syntax tree by Stanford parser

Yule’s K

K = 10000 × (M − N) / (N × N)

N = total number of words in the text

M = ∑ i × i × Vi

where Vi is the number of distinct words that occur i times
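A short worked sketch of the Yule's K formula above, assuming the text has already been split into words. The function name and the sample text are hypothetical.

```python
from collections import Counter
from typing import List


def yules_k(words: List[str]) -> float:
    """Yule's K = 10000 * (M - N) / N^2, with M = sum of i^2 * V_i."""
    n = len(words)
    if n == 0:
        raise ValueError("Yule's K is undefined for an empty text")
    word_counts = Counter(words)                    # occurrences per distinct word
    freq_of_freqs = Counter(word_counts.values())   # V_i: distinct words seen i times
    m = sum(i * i * v for i, v in freq_of_freqs.items())
    return 10000 * (m - n) / (n * n)


sample = "a a a b b c".split()
# N = 6; V_1 = 1 (c), V_2 = 1 (b), V_3 = 1 (a)
# M = 1*1*1 + 2*2*1 + 3*3*1 = 14
# K = 10000 * (14 - 6) / 36
k_value = yules_k(sample)
```

A text in which every word occurs exactly once gives M = N and therefore K = 0; more repetition raises K.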


In 20% of cases the classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors

In 35% of cases the correct author is one of the top 20 guesses
