Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines
Shuofei Zhu¹, Jianjun Shi¹,², Limin Yang³
Boqin Qin¹,⁴, Ziyi Zhang¹,⁵, Linhai Song¹, Gang Wang³
¹ The Pennsylvania State University
² Beijing Institute of Technology
³ University of Illinois at Urbana-Champaign
⁴ Beijing University of Posts and Telecommunications
⁵ University of Science and Technology of China
VirusTotal
• The largest online anti-malware scanning service
– Applies 70+ anti-malware engines
– Provides analysis reports and rich metadata
• Widely used by researchers in the security community
[Figure: a submitter sends a sample to VirusTotal through the scan API and retrieves reports and metadata through the report API.]
Challenges of Using VirusTotal
• Q1: When are VirusTotal labels trustworthy?
[Figure: McAfee's labels for one sample over days 1 to 6: ✔️ ⚠️ ⚠️ ✔️ ✔️]
• Q2: How to aggregate labels from different engines?
• Q3: Are different engines equally trustworthy?
Engine: Result
McAfee: ⚠️
Microsoft: ✔️
Kaspersky: ✔️
Avast: ⚠️
…
Aggregated label: ⚠️ or ✔️?
Literature Survey on VirusTotal Usages
• Surveyed 115 top-tier conference papers that use VirusTotal
• Our findings:
– Q1: rarely consider label changes
– Q2: commonly use threshold-based aggregation methods
– Q3: often treat different VirusTotal engines equally
[Figure: three Yes/No charts over the surveyed papers: Threshold-Based Method, Reputable Engines, Consider Label Changes.]
Overview
• Q1: the impact of label changes (label flips)
• Q2: threshold-based label aggregation methods
• Q3: the correlation between VirusTotal engines
[Figure: overview diagram of the study; labels include PE, APK, has ground-truth, and no ground-truth.]
Outline
[Figure: the overview diagram again, now labeled Main Dataset.]
Data Collection of the Main Dataset
• We chose “fresh” files without prior VirusTotal history
– Sampled 14,423 files submitted for the first time on 08/31/2018
– Roughly half were labeled as “benign” by all engines on day-1
– The rest were labeled as "malicious" by at least 1 engine on day-1
• We collected “daily” VirusTotal labels over one year
– Used the rescan API to force VirusTotal to scan the samples every day
– Data collection window: 08/31/2018 – 09/30/2019
• Data preprocessing
– 341+ million data points from 65 engines
Label Change or Flip
• We model the label dynamics by sequences of “0” and “1”
– ✔️ (benign): 0
– ⚠️ (malicious): 1
• A flip: 0 → 1 or 1 → 0
– hazard flip: temporary, lasts only one day
– non-hazard flip: long term, lasts at least two days
[Figure: one file's daily labels (0 or 1) over days 1 to 15, marking a hazard flip and a non-hazard flip.]
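The flip definitions on this slide can be expressed as a small classifier. The sketch below is a minimal illustration, not the authors' code: the function name `classify_flips` and the `"undetermined"` category (for a flip on the last observed day, whose duration is unknown) are assumptions. It tags only the flip that begins a one-day excursion as a hazard; counting the reversal as part of the hazard as well would be an equally plausible reading.

```python
def classify_flips(labels):
    """Classify every label change in a daily 0/1 sequence.

    Returns a list of (day_index, kind) pairs, where day_index is the
    0-based position of the first day that carries the new label.
    """
    flips = []
    for i in range(1, len(labels)):
        if labels[i] == labels[i - 1]:
            continue  # no change between day i-1 and day i
        if i + 1 >= len(labels):
            kind = "undetermined"  # assumption: duration cannot be observed
        elif labels[i + 1] != labels[i]:
            kind = "hazard"        # new label lasts exactly one day
        else:
            kind = "non-hazard"    # new label lasts at least two days
        flips.append((i, kind))
    return flips


# Hypothetical sequence: a one-day malicious glitch, then a lasting change.
sequence = [0, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(classify_flips(sequence))
```

On the hypothetical sequence, the change on day index 1 is reported as a hazard, while the changes on day indices 2 and 6 are reported as non-hazard flips.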
Characteristics of Flips
[Figure: three plots: % of files vs. normalized flips per file; number of flips (10K) vs. weeks; % of engines vs. normalized flips per engine. Legend: hazard flips, all flips. Annotation: 9%.]
Both flips and hazard flips widely exist across scan dates, engines and files.
Individual Label Stabilization
• How long to wait for a file’s labels to become stable?
• Stable file: all engines' labels on the file no longer change
[Figure: % of stable files, with curves for 65 engines, 35 engines, and 20 engines.]
Waiting longer does not guarantee more stable files.
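One way to make the "stable file" definition concrete is to find, for each file, the last day on which any engine changed its label. The sketch below assumes a hypothetical data layout (a dict mapping engine name to a daily 0/1 list) and hypothetical function names. Stability can only be judged relative to the end of the observation window, so a file that looks stable here could still flip later.

```python
def last_change_day(labels):
    """0-based index of the last day on which the label changed (0 if never)."""
    last = 0
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            last = i
    return last


def stabilization_day(engine_labels, engines=None):
    """Day index after which no selected engine changes its label again."""
    chosen = engines if engines is not None else list(engine_labels)
    return max(last_change_day(engine_labels[e]) for e in chosen)


def fraction_stable(files, day, engines=None):
    """Fraction of files whose selected engines are all stable by `day`."""
    stable = sum(1 for f in files if stabilization_day(f, engines) <= day)
    return stable / len(files)


# Hypothetical files with two engines each.
file_a = {"E1": [0, 1, 1, 1, 1], "E2": [0, 0, 0, 1, 1]}
file_b = {"E1": [0, 0, 0, 0, 0], "E2": [1, 1, 1, 1, 1]}
print(stabilization_day(file_a))                     # all engines
print(fraction_stable([file_a, file_b], 1))          # all engines
print(fraction_stable([file_a, file_b], 1, ["E1"]))  # only engine E1
```

Restricting the engine set, as the 65/35/20-engine curves do, can only make a file stabilize earlier or on the same day under this definition.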
Aggregated Label Stabilization
• Many researchers use a threshold (t) to aggregate engines' labels
– A file is considered malicious when ≥ t engines detect the file
• How do flips impact this aggregation policy?
– Influenced files: files with both benign and malicious aggregated labels
– Measure % of influenced files for different t
Example (t = 2):
Day                1 2 3 4 5 …
McAfee             1 1 1 1 0 …
Microsoft          1 1 1 0 0 …
Kaspersky          0 1 0 0 0 …
… (62 engines)     … (all 0)
Aggregated labels  1 1 1 0 0 …
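The threshold rule and the notion of an influenced file can be sketched in a few lines. This is an illustration under assumptions: the function names and the dict-of-lists layout are hypothetical, and the 62 all-zero engines from the slide's example are omitted because they never add to the daily count.

```python
def aggregate(engine_labels, t):
    """Daily aggregated label: 1 (malicious) when at least t engines say 1."""
    per_day = zip(*engine_labels.values())
    return [1 if sum(day) >= t else 0 for day in per_day]


def is_influenced(engine_labels, t):
    """True when the aggregated label is benign on some days and malicious on others."""
    return len(set(aggregate(engine_labels, t))) == 2


def percent_influenced(files, t):
    """Percentage of files whose aggregated label changes under threshold t."""
    return 100.0 * sum(is_influenced(f, t) for f in files) / len(files)


# The five-day example from the slide.
file_x = {
    "McAfee":    [1, 1, 1, 1, 0],
    "Microsoft": [1, 1, 1, 0, 0],
    "Kaspersky": [0, 1, 0, 0, 0],
}
print(aggregate(file_x, 2))      # [1, 1, 1, 0, 0]
print(is_influenced(file_x, 2))  # True
print(is_influenced(file_x, 4))  # False
```

With t = 2 the file is influenced because its aggregated label moves from malicious to benign on day 4; with t = 4 no day reaches the threshold, so the aggregated label stays benign throughout.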
[Figure: share of files whose aggregated labels are only benign, only malicious, or both benign and malicious, for different thresholds t.]
Flips can heavily influence labeling aggregation results when threshold t is too small or too large.
Temporary Labeling Similarity
• How to compute the similarity between engines A and B?
– Compute the similarity between the two labeling sequences for each file
– Compute the average sequence-level similarity over all the files
• An example for sequence-level similarity
– Divide time sequence into bins (size=?)
– engine A on file X: 0100000000000001000…
  • per-bin vectors: (1, 1, 1, 5), (0, 0, 0, 7), …
  • concatenated: (1, 1, 1, 5, 0, 0, 0, 7, …)
– engine B on file X: 0000000001000000100…
  • concatenated: (0, 0, 0, 7, 1, 1, 1, 4, …)
– cosine similarity of the two vectors: 0.87
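The slide's example can be reproduced with a short sketch. Two things here are inferred rather than stated on the slide: a bin size of 7 days, and the reading of each 4-tuple as (number of 0→1 flips, number of 1→0 flips, longest run of 1s, longest run of 0s), with flips counted inside each bin only. Both readings match the example vectors. All function names are hypothetical.

```python
import math


def max_run(bits, value):
    """Length of the longest consecutive run of `value` in `bits`."""
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b == value else 0
        best = max(best, cur)
    return best


def bin_features(bits):
    """Assumed per-bin features: (0->1 flips, 1->0 flips, max run of 1s, max run of 0s)."""
    pairs = list(zip(bits, bits[1:]))
    up = sum(1 for p in pairs if p == (0, 1))
    down = sum(1 for p in pairs if p == (1, 0))
    return [up, down, max_run(bits, 1), max_run(bits, 0)]


def sequence_vector(labels, bin_size):
    """Concatenate the feature vectors of consecutive bins."""
    vec = []
    for start in range(0, len(labels), bin_size):
        vec.extend(bin_features(labels[start:start + bin_size]))
    return vec


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def engine_similarity(seqs_a, seqs_b, bin_size):
    """Average sequence-level similarity over files (lists aligned by file)."""
    sims = [cosine(sequence_vector(a, bin_size), sequence_vector(b, bin_size))
            for a, b in zip(seqs_a, seqs_b)]
    return sum(sims) / len(sims)


# First 14 days (two bins) of the sequences shown on the slide.
engine_a = [int(c) for c in "01000000000000"]
engine_b = [int(c) for c in "00000000010000"]
print(sequence_vector(engine_a, 7))  # [1, 1, 1, 5, 0, 0, 0, 7]
print(sequence_vector(engine_b, 7))  # [0, 0, 0, 7, 1, 1, 1, 4]
print(round(engine_similarity([engine_a], [engine_b], 7), 2))  # 0.87
```

Because every non-empty bin has a run of length at least 1, the feature vectors are never all zero, so the cosine is always defined for non-empty sequences.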
Label Correlations Between Engines
[Figure: correlation graph of engines; labeled engines include Avast, AVG, K7GW, K7AntiVirus, GData, ESET-NOD32, BitDefender, Ad-Aware, Emsisoft, and MicroWorld-eScan.]
There are groups of engines with strong correlations in labeling decisions.
Outline
[Figure: the overview diagram again, now labeled Ground-Truth Dataset.]
Ground Truth Dataset
• How do we create “fresh” ground-truth samples?
– Obfuscating ransomware to create malware
– Obfuscation + compiling open-source software to create goodware
• Findings:
– Obfuscation brings many false positives
  • Even for high-reputation engines
– 3 ≤ t ≤ 15 can produce good aggregation results
  • As long as the benign files are not obfuscated
– Inconsistency exists between the desktop and the VirusTotal versions
More results in our paper…
Conclusion and Takeaways
• A paper survey on how researchers use VirusTotal
• Data-driven methods to validate labeling methodologies
• Takeaways and suggestions
– Data preprocessing
  • Submit the same files on 3 consecutive days to detect hazards
  • No need to wait a long time
– Threshold-based label aggregation
  • Stable: when t is within a reasonable range (2-20)
  • Correctness: t = 3 to 15 when benign files are not obfuscated
– Correlation and causality exist between engines
– High-reputation engines are not always accurate
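The three-consecutive-days suggestion can be illustrated with a small check over three daily results. This is a sketch under assumptions: each day's result is represented as a plain dict from engine name to a 0/1 label, which is not the VirusTotal report format, and `hazard_engines` is a hypothetical helper name.

```python
def hazard_engines(day1, day2, day3):
    """Engines whose middle-day label disagrees with matching labels on days 1 and 3."""
    common = set(day1) & set(day2) & set(day3)  # engines present on all three days
    return sorted(e for e in common if day1[e] == day3[e] != day2[e])


# Hypothetical labels from three consecutive daily scans of one file.
day1 = {"A": 0, "B": 1, "C": 0}
day2 = {"A": 1, "B": 1, "C": 0}
day3 = {"A": 0, "B": 1, "C": 1}
print(hazard_engines(day1, day2, day3))  # ['A']
```

Engine A shows the 0, 1, 0 pattern and is flagged; engine C changed on day 3, which three scans alone cannot classify as temporary or lasting.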
Thank you!
• Also thanks to my collaborators
• Contact
– [email protected]
• Artifact
– https://sfzhu93.github.io/projects/vt/index.html