Dec 29, 2015...
![Page 1: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music](https://reader033.fdocuments.in/reader033/viewer/2022042918/5f5d116ca7acec74fa653d46/html5/thumbnails/1.jpg)
Stylistic fingerprints
• Stylometry has been applied to:
– Fine art
– Music
– Unconventional text
– Translated text
– Source code
Supervised stylometry
• Given a set of documents of known authorship, classify a document of unknown authorship
– Classifier trained on undisputed text
• Scenario: Alice the Anonymous Blogger vs. Bob the Abusive Employer
– Alice blogs about abuses in Bob’s company
• Blog posted anonymously (Tor, pseudonym, etc.)
– Bob obtains 5,000 words of each employee’s writing
• Bob uses stylometry to identify Alice as the blogger
channel partner advocate for Cisco Alert
Stylistic fingerprints
– Use stylometric fingerprints to find who “theconnor” is.
– Collect the rest of the tweets in the timeline, compare them to the cover letters submitted to Cisco, and identify theconnor.
• Fingerprints in textual data
– Frequency of function words
– Frequency of punctuation
[Figure: features are extracted from the tweets of theconnor and from the Cisco cover letters of persons A, B, C, D, …; a model for A, B, C, … then answers "Who is theconnor?"]
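The two feature families named on this slide can be sketched as follows. This is only an illustration: the function-word list, the feature names, and the sample sentence are assumptions made for the example and are not taken from the talk.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny function-word list; real stylometry
# systems typically use much longer lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "for", "but"]


def stylometric_features(text):
    """Return relative frequencies of function words and punctuation marks."""
    tokens = text.lower().split()
    # Strip surrounding punctuation so "the," still counts as "the".
    words = [t.strip(string.punctuation) for t in tokens]
    words = [w for w in words if w]
    word_counts = Counter(words)
    punct_counts = Counter(ch for ch in text if ch in string.punctuation)

    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    features = {f"word:{w}": word_counts[w] / n_words for w in FUNCTION_WORDS}
    features.update({f"punct:{p}": punct_counts[p] / n_chars for p in ",.;:!?"})
    return features


sample = "The offer is great, but the commute to the office is long!"
feats = stylometric_features(sample)
```

Feature vectors like `feats`, computed for each candidate's known writing and for the anonymous text, are what a classifier would then compare.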
theconnor made her Twitter profile private and deleted all information on her homepage right after the event, but it was too late: search engines cache search results, which can still lead to the old information.
theconnor → Connor Riley →
What about fingerprints in source code?
DE-ANONYMIZING PROGRAMMERS VIA CODE STYLOMETRY
De-anonymizing Programmers via Code Stylometry. Aylin Caliskan-Islam, Richard Harang, Andrew Liu, Arvind Narayanan, Clare Voss, Fabian Yamaguchi, and Rachel Greenstadt. USENIX Security Symposium, 2015.
Source code stylometry
• Everyone learns to code individually and, as a result, codes in a unique style, which makes de-anonymization possible.
• Software engineering insights
– programmer style changes while implementing sophisticated functionality
– differences in coding styles of programmers with different skill sets
• Identify malicious programmers.
Source code stylometry: Who wrote this code?
• Scenario 1: Alice analyzes a library with malicious source code. Bob has a source code collection with known authors. Bob will search his collection to find Alice’s adversary.
Source code stylometry: Who wrote this code?
• Scenario 2: Alice got an extension for her programming assignment. Bob, the professor, has everyone else’s code. Bob wants to see if Alice plagiarized.
Source code stylometry
Iran confirms death sentence for 'porn site' web programmer
No technical difference between security-enhancing and privacy-infringing…
Comparison to related work

| Related work | Author size | Instances | Average LOC | Language | Features | Method | Result |
|---|---|---|---|---|---|---|---|
| MacDonell et al. | 7 | 351 | 148 | C++ | lexical & layout | Case-based reasoning | 88.0% |
| Frantzeskou et al. | 8 | 107 | 145 | Java | lexical & layout | Nearest neighbor | 100.0% |
| Elenbogen and Seliya | 12 | 83 | 100 | C++ | lexical & layout | C4.5 decision tree | 74.7% |
| Shevertalov et al. | 20 | N/A | N/A | Java | lexical & layout | Genetic algorithm | 80% |
| Frantzeskou et al. | 30 | 333 | 172 | Java | lexical & layout | Nearest neighbor | 96.9% |
| Ding and Samadzadeh | 46 | 225 | N/A | Java | lexical & layout | Nearest neighbor | 75.2% |
| Ours | 35 | 315 | 68 | C++ | lexical & layout & syntactic | Random forest | 100.0% |
| Ours | 250 | 2,250 | 77 | C++ | lexical & layout & syntactic | Random forest | 98.0% |
| Ours | 1,600 | 14,400 | 70 | C++ | lexical & layout & syntactic | Random forest | 93.6% |
Source code stylometry: Machine learning workflow
• Dataset in CPP, ~100,000 users
• Preprocessing
• Fuzzy AST parser
• Extract features
• Random forest classification
• Majority vote among the candidate authors (A, B, C, D)
• 2008-2014
• Same problems
• Limited time
• Problems get harder
• C++ most common
Source code stylometry
• Stylometry can be used in source code to identify the author of a program.
• Extract layout and lexical features from source code.
• Abstract syntax trees (AST) in code represent the structure of the program.
• Preprocess source code to obtain AST.
• Parse AST to extract coding style features.
[Figure: source code and its AST]
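The AST-based features listed in this deck (node frequency, node bigrams, average node depth) can be illustrated with a short sketch. Note the substitution: the talk's pipeline parses C++ with a fuzzy AST parser, whereas the sketch below uses Python's built-in `ast` module on Python source, purely because it is self-contained. The function name `ast_features` is hypothetical.

```python
import ast
from collections import Counter


def ast_features(source):
    """Count AST node types, parent-child node bigrams, and average node depth."""
    tree = ast.parse(source)
    node_counts = Counter()
    bigram_counts = Counter()
    depth_sums = Counter()

    def visit(node, depth):
        name = type(node).__name__
        node_counts[name] += 1
        depth_sums[name] += depth
        for child in ast.iter_child_nodes(node):
            # A bigram is a (parent type, child type) edge in the tree.
            bigram_counts[(name, type(child).__name__)] += 1
            visit(child, depth + 1)

    visit(tree, 0)
    avg_depth = {name: depth_sums[name] / node_counts[name] for name in node_counts}
    return node_counts, bigram_counts, avg_depth


code = "def add(a, b):\n    return a + b\n"
nodes, bigrams, depths = ast_features(code)
```

Two programs that behave identically can still differ in these counts, for example when one author nests expressions deeply and another introduces temporaries, which is what makes such features usable as a style signal.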
Random Forests are made of decision trees
Decision Trees
• Representation
– Each internal node tests an attribute
– Each branch is an attribute value
– Each leaf assigns a classification
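The representation above maps directly onto a nested data structure. The following is a minimal sketch; the attributes (`brace_style`, `indent`), their values, and the author labels are hypothetical and chosen only to fit the topic.

```python
# A tree is either a leaf (a class label string) or an internal node:
# (attribute_name, {attribute_value: subtree}).
# The attributes, values, and labels below are hypothetical.
TREE = (
    "brace_style",
    {
        "same_line": ("indent", {"tabs": "author_A", "spaces": "author_B"}),
        "next_line": "author_C",
    },
)


def classify(tree, example):
    """Walk from the root to a leaf, following the branch for each tested attribute."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree


label = classify(TREE, {"brace_style": "same_line", "indent": "spaces"})
```

Each internal node tests one attribute, each dictionary key is a branch for one attribute value, and each string is a leaf that assigns the classification.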
Choosing an attribute
Which is better?
Information gain
• A chosen attribute A divides the training set E into subsets E1, … , Ev according to their values for A, where A has v distinct values.
• Information Gain (IG) or reduction in entropy from the attribute test:
• remainder(A) is the remaining uncertainty after splitting on the attribute
• Choose the attribute with the largest IG
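The formula on this slide did not survive extraction. A standard textbook form that matches the notation used here is given below as an assumption, not as a transcription of the slide: p and n are the numbers of positive and negative examples in E, p_i and n_i are the corresponding counts in subset E_i, and I is the entropy in bits.

```latex
\begin{align}
I\!\left(\tfrac{p}{p+n}, \tfrac{n}{p+n}\right)
  &= -\tfrac{p}{p+n}\log_2\tfrac{p}{p+n} - \tfrac{n}{p+n}\log_2\tfrac{n}{p+n} \\
\mathrm{remainder}(A)
  &= \sum_{i=1}^{v} \frac{p_i+n_i}{p+n}\,
     I\!\left(\tfrac{p_i}{p_i+n_i}, \tfrac{n_i}{p_i+n_i}\right) \\
IG(A)
  &= I\!\left(\tfrac{p}{p+n}, \tfrac{n}{p+n}\right) - \mathrm{remainder}(A)
\end{align}
```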
Information gain
For the training set, p = n = 6, I(6/12, 6/12) = 1 bit
Consider the attributes Patrons and Type (and others too):
Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root
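The per-attribute numbers for this example are not in the extracted text. The sketch below recomputes them under an assumption: the positive/negative counts per attribute value are taken from the well-known textbook restaurant example that the attribute names Patrons and Type suggest, not from the slide itself.

```python
from math import log2


def entropy(pos, neg):
    """Entropy in bits of a set with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # treat 0 * log2(0) as 0
            result -= (count / total) * log2(count / total)
    return result


def information_gain(subsets):
    """subsets: list of (pos, neg) counts, one pair per attribute value."""
    p = sum(pos for pos, _ in subsets)
    n = sum(neg for _, neg in subsets)
    remainder = sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in subsets)
    return entropy(p, n) - remainder


# Assumed counts from the textbook restaurant example (not on the slide).
patrons = [(0, 2), (4, 0), (2, 4)]            # None, Some, Full
rest_type = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

ig_patrons = information_gain(patrons)
ig_type = information_gain(rest_type)
```

With these assumed counts, Patrons yields roughly 0.54 bits and Type yields 0 bits, which is consistent with Patrons being chosen as the root.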
934 important features of code stylometry by information gain, OUT OF 120,000 FEATURES

| Feature type | Percentage | Count |
|---|---|---|
| Word Unigram Frequency | 55% | 517 |
| AST Node-Bigram Frequency | 31% | 291 |
| AST Node Average Depth | 5% | 48 |
| AST Node Frequency | 4% | 38 |
| AST Node TFIDF | 2% | 19 |
| C++ Keywords | 2% | 15 |
| Layout Features | 1% | 6 |
When machine learning goes wrong
• Bias vs variance
Random Forest
• Individual trees have low bias, but high variance
• Problem: Overfitting
• Solution: A forest, not a tree: build trees on subsets of the training data, using subsets of the features
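The idea of building many trees on subsets of the data and subsets of the features can be sketched in a few lines. To stay short, this sketch swaps full decision trees for one-split decision stumps, so it is a bagged-stump ensemble rather than a real random forest. All names, defaults, and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, features):
    """Pick the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority label
        lab = majority(labels)
        return (features[0], float("inf"), lab, lab)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]       # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)  # random feature subset
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Each stump votes; the forest returns the majority vote."""
    votes = [l_lab if row[f] <= t else r_lab for f, t, l_lab, r_lab in forest]
    return majority(votes)


# Hypothetical two-feature toy data for two authors.
rows = [(1.0, 10.0), (1.2, 11.0), (0.9, 9.5), (3.0, 20.0), (3.2, 21.0), (2.9, 19.0)]
labels = ["A", "A", "A", "B", "B", "B"]
forest = fit_forest(rows, labels)
```

Because every stump sees a different resample and a different feature, their individual mistakes tend to differ, and the vote averages them out; that is the variance reduction the slide refers to.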
Source code stylometry: Method
① Use random forest as the machine learning classifier
– avoids over-fitting
– multi-class classifier by nature
② K-fold cross validation
③ Validate method on a different dataset
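K-fold cross validation, step ② above, splits the data into k folds and uses each fold once as the test set while training on the rest. The sketch below only generates the index splits; the value of k, the absence of shuffling, and the function name are assumptions, since the slide does not specify them.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(9, 3))
```

Every sample is tested exactly once and never appears in the training set of its own fold, so the averaged accuracy is not inflated by testing on training data.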
Case 1: Authorship attribution
• Who is this anonymous programmer?
• Who is Satoshi?
Case 1: Authorship attribution
• 94% accuracy in identifying 1,600 authors of 14,400 anonymous program files.
[Figure: train on 1,600 authors to identify the authors of 14,400 files; 94% accuracy]
Case 1: Authorship attribution
[Figure: train on the suspect set to de-anonymize the initial Bitcoin author; test: Satoshi = git contributor]
• If only we had a suspect set for Satoshi…
Case 2: Obfuscation
• Who is the programmer of this obfuscated source code?
• Code is obfuscated to become unrecognizable.
• Our authorship attribution technique is impervious to common off-the-shelf source code obfuscators.
Case 2: C++ Obfuscation - STUNNIX
Same set of 20 authors with 180 program files

| Source | Classification accuracy |
|---|---|
| Original source code | 99% |
| STUNNIX-obfuscated source code | 99% |
Case 2: C Obfuscation - TIGRESS
Same set of 20 authors with 180 program files

| Source | Classification accuracy |
|---|---|
| Original C source code | 96% |
| TIGRESS-obfuscated source code | 67% |
| Random chance of correct de-anonymization | 5% |
Case 3: Coding style throughout years
• Is programming style consistent?
• If yes, we can use code from different years for authorship attribution.
Case 3: Coding style throughout years
• Coding style is preserved up to some degree throughout years.
[Figure: train on 25 authors from 2012 to identify the author of 25 files in 2014; 96% accuracy]
• 98% accuracy, train and test in 2014
• 96% accuracy, train on 2012, test on 2014
Case 4: Generalizing the approach - Python
Feature set: using only the Python equivalents of syntactic features

| Application | Programmers | Instances | Result |
|---|---|---|---|
| Python programmer de-anonymization | 229 | 2,061 | 53.9% |
| Top-5 relaxed classification | 229 | 2,061 | 75.7% |
| Python programmer de-anonymization | 23 | 207 | 87.9% |
| Top-5 relaxed classification | 23 | 207 | 99.5% |
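"Top-5 relaxed classification" presumably counts a prediction as correct when the true author is among the five highest-ranked candidates. Under that assumed reading, the metric can be sketched as below; the function name, the score format, and the toy scores are hypothetical.

```python
def top_k_accuracy(scores, true_labels, k=5):
    """scores: one dict per sample mapping candidate author -> classifier score."""
    hits = 0
    for sample_scores, truth in zip(scores, true_labels):
        # Rank candidates from highest to lowest score.
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Hypothetical scores for two samples and three candidate authors.
scores = [
    {"alice": 0.5, "bob": 0.3, "carol": 0.2},
    {"alice": 0.4, "bob": 0.35, "carol": 0.25},
]
truth = ["alice", "carol"]
```

With k=1 this reduces to ordinary accuracy, which is why the relaxed rows in the table are always at least as high as the strict ones.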
Results

| Application | Classes | Instances | Accuracy |
|---|---|---|---|
| Stylometric plagiarism detection | 250-class | 2,250 | 98.0% |
| Large scale de-anonymization | 1,600-class | 14,400 | 93.6% |
| Copyright investigation | Two-class | 1,080 | 100.0% |
| Authorship verification | Two-class/One-class | 2,240 | 91.0% |
| Open world problem | Multi-class | 420 | 96.0% |

A new principled method with a robust syntactic feature set for de-anonymizing programmers.
Git Blame Who?
• A lot of code is collaborative
• > 70% accuracy for individual attribution for a single git commit, higher for multiple commits/account
• For accounts, we can attribute single commits and ensemble them (errors are uncorrelated, accuracy quickly reaches close to 100%)
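Why ensembling helps can be checked with a small calculation. The sketch assumes each commit is attributed correctly with probability 0.7, independently of the others, and asks how often a strict majority of the votes is correct. Requiring a strict majority is a simplification: with many candidate authors, a plurality would already suffice, so the real figure could be higher.

```python
from math import comb


def majority_correct_probability(n_commits, p_single):
    """P(strictly more than half of n independent votes are correct)."""
    needed = n_commits // 2 + 1
    return sum(
        comb(n_commits, k) * p_single**k * (1 - p_single) ** (n_commits - k)
        for k in range(needed, n_commits + 1)
    )


# 0.7 is the assumed per-commit accuracy; the commit counts are arbitrary.
probabilities = {n: majority_correct_probability(n, 0.7) for n in (1, 5, 15, 51)}
```

Under these assumptions the probability grows steadily with the number of commits and exceeds 0.99 at 51 commits, in line with the claim that accuracy approaches 100% when errors are uncorrelated.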
Attributing Code Fragments
Open World Attribution
How much data for attribution?
How much code?
Aggregated Samples
What about executable binaries?
Compiled code looks cryptic
00100000 00000000 00001000 00000000 00101000 00000000 00000000 00000000 00110100 00000000 00000000 00000000 00000100 00001000 00000000 00000001 00000000 00000000 00000000 00000001 00000000 00000000 00000101 00000000 00000000 00000000 00000100 00000000 00000000 00000000 00000011 00000000 00000000 00000000 00110100 00000001 00000000 00000000 00110100 10000001 00000100 00001000 00000000 00000000 00010011 00000000 00000000 00000000 00000100 00000000 00000000 00000000 00000001 00000000 00000000 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 10000000 00000100 00001000 00000000 10000000 00000100 00001000 11001000 00010111 00000000 00000000 11001000 00010111 00000000 00000000 00000101 00000000 00000000 00000000 00000000 00010000 00000000 00000000 00000001 00000000 00000000 00000000 11001000 00010111 00000000 00000000 11001000 10100111 00000100 00001000 11001000 10100111 00000100 00001000 00101100 00000001 00000000 00000000 00000000 00000000 00000000 00010000 00000000 00000000 00000010 00000000 00000000 00000000 11011100 00010111
Source Code

    #include <cstdio>
    #include <algorithm>
    using namespace std;
    #define For(i,a,b) for(int i = a; i < b; i++)
    #define FOR(i,a,b) for(int i = b-1; i >= a; i--)
    double nextDouble() {
        double x;
        scanf("%lf", &x);
        return x;
    }
    int nextInt() {
        int x;
        scanf("%d", &x);
        return x;
    }
    int n;
    double a1[1001], a2[1001];
    int main() {
        freopen("D-small-attempt0.in", "r", stdin);
        freopen("D-small.out", "w", stdout);
        int tt = nextInt();
        For(t,1,tt+1) {
            int n = nextInt();
            . . . . . .
Can we identify the author of this binary?
WHEN CODING STYLE SURVIVES COMPILATION: DE-ANONYMIZING PROGRAMMERS FROM EXECUTABLE BINARIES
When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries. Aylin Caliskan-Islam, Fabian Yamaguchi, Edwin Dauber, Richard Harang, Konrad Rieck, Rachel Greenstadt, and Arvind Narayanan. Under submission, 2016; available on arXiv.
Finding the author of an executable binary?
• Coding style in compiled code
• Threat to privacy and anonymity
• Malware classification?
Related work
Our workflow
Features
Feature set from 100 programmers
Large Scale Programmer Deanonymization
Compiler Optimization
The drop in accuracy is not tragic!
Reconstructing original features
• Original vs predicted features
– Average cos distance: 0.81
• Original vs decompiled features
– Average cos distance: 0.35
• This suggests that original features are transformed but not entirely lost in compilation.
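The comparison between feature vectors can be made concrete with a short sketch. The slides label the numbers "cos distance"; the code defines both cosine similarity and cosine distance (1 minus similarity) so that either reading can be checked, since the extracted text does not say which convention the reported values follow. The vectors are hypothetical.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    return 1.0 - cosine_similarity(u, v)


original = [3.0, 0.0, 4.0]   # hypothetical source-level feature vector
predicted = [3.0, 1.0, 4.0]  # hypothetical reconstruction from the binary
sim = cosine_similarity(original, predicted)
```

If the reported values are similarities, 0.81 versus 0.35 would mean the predicted features sit much closer to the originals than the decompiled ones do, which fits the conclusion that the features are transformed rather than lost.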
Insights
More advanced programmers are easier to de-‐anonymize
Real World Binary Attribution
• Code repositories from GitHub
– 65% accuracy (less code to train on)
• Binaries mined from leaked Nulled.io forum
– 4 users, 3 with enough data to train a model. 3 correctly identified, 4th identified as not one of the other 3.
Future work
• Anonymizing executable binaries
– optimization is not the answer
• De-anonymizing collaborative binaries
• Malware family classification
Available tools
• Programmer de-anonymization
– https://github.com/calaylin
• JStylo
– prose authorship attribution framework
• Anonymouth
– writing anonymization
THANKS
CO-AUTHORS
Edwin Dauber, Dr. Richard Harang, Dr. Konrad Rieck, Dr. Arvind Narayanan, Dr. Clare Voss, Dr. Fabian Yamaguchi, Dr. Aylin Caliskan-Islam