Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label...

25
Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu 1 , Jianjun Shi 1,2 , Limin Yang 3 Boqin Qin 1,4 , Ziyi Zhang 1,5 , Linhai Song 1 , Gang Wang 3 1 The Pennsylvania State University 2 Beijing Institute of Technology 3 University of Illinois at Urbana-Champaign 4 Beijing University of Posts and Telecommunications 5 University of Science and Technology of China 1

Transcript of Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label...

Page 1: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines

Shuofei Zhu1, Jianjun Shi1,2, Limin Yang3

Boqin Qin1,4, Ziyi Zhang1,5, Linhai Song1, Gang Wang3

1The Pennsylvania State University2Beijing Institute of Technology

3University of Illinois at Urbana-Champaign4Beijing University of Posts and Telecommunications

5University of Science and Technology of China

1

Page 2: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

VirusTotal

• The largest online anti-malware scanning service– Applies 70+ anti-malware engines

– Provides analysis reports and rich metadata

• Widely used by researchers in the security community

Submitter Sample

Metadata

Reports

scan API

report API

2

Page 3: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Challenges of Using VirusTotal

• Q1: When VirusTotal labels are trustworthy?

Sample

✔️ ⚠️ ⚠️ ✔️ ✔️

TimeDay 1 2 3 4 5 6

McAfee

3

Page 4: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Challenges of Using VirusTotal

• Q1: When VirusTotal labels are trustworthy?

• Q2: How to aggregate labels from different engines?

• Q3: Are different engines equally trustworthy?

Engines Results

McAfee ⚠️

Microsoft ✔️

Kaspersky ✔️

Avast ⚠️

⚠️

✔️

3

Page 5: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Challenges of Using VirusTotal

• Q1: When VirusTotal labels are trustworthy?

• Q2: How to aggregate labels from different engines?

• Q3: Are different engines equally trustworthy?

3

Equally trustworthy?

Page 6: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Literature Survey on VirusTotal Usages

• Surveyed 115 top-tier conference papers that use VirusTotal

• Our findings: – Q1: rarely consider label changes

– Q2: commonly use threshold-based aggregation methods

– Q3: often treat different VirusTotal engines equally

Yes

No

Threshold-Based Method

Yes

No

Reputable Engines

Yes

No

Consider Label Changes

4

Page 7: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

• Q1: the impact of label changes (label flips)

• Q2: threshold-based label aggregation methods

• Q3: the correlation between VirusTotal engines

Overview

⚠️

✔️

✔️

⚠️

✔️

✔️ ⚠️ ⚠️

5

Equally trustworthy?

××

××

has ground-truthno ground-truth

PE

APK

Page 8: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

• Q1: the impact of label changes (label flips)

• Q2: threshold-based label aggregation methods

• Q3: the correlation between VirusTotal engines

Outline

⚠️

✔️

✔️

⚠️

✔️

✔️ ⚠️ ⚠️

6

Equally trustworthy?

××

××

PE

APK

Main Dataset

has ground-truthno ground-truth

Page 9: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Data Collection of the Main Dataset

• We chose “fresh” files without prior VirusTotal history– Sampled 14,423 files submitted for the first-time on 08/31/2018

– Roughly half were labeled as “benign” by all engines on day-1

– The rest were labeled as "malicious" by at least 1 engine on day-1

• We collected “daily” VirusTotal labels over one year– Use rescan API to force VirusTotal to scan the samples everyday

– Data collection window: 08/31/2018 – 09/30/2019

• Data Preprocessing– 341+ million data points from 65 engines

7

Page 10: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

• Q1: the impact of label changes (label flips)

• Q2: threshold-based label aggregation methods

• Q3: the correlation between VirusTotal engines

Outline

⚠️

✔️

✔️

⚠️

✔️

✔️ ⚠️ ⚠️

8

Equally trustworthy?

××××

PE

APK

has ground-truthno ground-truth

Main Dataset

Page 11: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Label Change or Flip

• We model the label dynamics by sequences of “0” and “1” –✔️(benign): 0⚠️(malicious):1

• A Flip: 0 → 1 or 1 → 0– hazard flip: temporary, lasts only one day

– non-hazard flip: long term, lasts at least two days

a hazard flip a non-hazard flip

0

1

days1 5 10 15

⚠️✔️ ✔️✔️✔️✔️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️

9

Page 12: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Characteristics of Flips

normalized flips per file

% o

f fi

les

weeks

No

. of

flip

s (1

0K

)

normalized flips per engine

% o

f en

gin

eshazard flips

all flips

10

9%

Both flips and hazard flips widely exist across scan dates, engines and files.

Page 13: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

% o

f st

able

d f

iles

• How long to wait for a file’s labels to become stable?

• Stable file: all engines' labels on the file do not change any more

Individual Label Stabilization

65 engines

35 engines

20 engines

11Waiting for longer time does not guarantee to have more stable files.

Page 14: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

• Q1: the impact of label changes (label flips)

• Q2: threshold-based label aggregation methods

• Q3: the correlation between VirusTotal engines

Outline

⚠️

✔️

✔️

⚠️

✔️

✔️ ⚠️ ⚠️

12

Equally trustworthy?

××××

PE

APK

has ground-truthno ground-truth

Main Dataset

Page 15: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Aggregated Label Stabilization

McAfee 1 1 1 1 0 …

Microsoft 1 1 1 0 0 …

Kaspersky 0 1 0 0 0 …

… (62 engines)

Aggregated labels 1 1 1 0 0 …13

Day 1 2 3 4 5 …

(t = 2)

• Many researchers use a threshold (t) to aggregate engines' labels– A file is considered as malicious, when ≥ t engines detect the file

• How flips impact this aggregation policy?– Influenced files: files with both benign and malicious aggregated labels

– Measure % of influenced files for different t

… (all 0)

Page 16: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Aggregated Label Stabilization

benign +

malicious

only benign

only malicious

14

• Many researchers use a threshold (t) to aggregate engines' labels– A file is considered as malicious, when ≥ t engines detect the file

• How flips impact this aggregation policy?– Influenced files: files with both benign and malicious aggregated labels

– Measure % of influenced files for different t

Flips can heavily influence labeling aggregation results when threshold t is too small or too large.

Page 17: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

• Q1: the impact of label changes (label flips)

• Q2: threshold-based label aggregation methods

• Q3: the correlation between VirusTotal engines

Outline

⚠️

✔️

✔️

⚠️

✔️

✔️ ⚠️ ⚠️

15

Equally trustworthy?

××××

PE

APK

has ground-truthno ground-truth

Main Dataset

Page 18: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

• How to compute the similarity between engines A and B?– Compute the similarity between the two labeling sequences for each file

– Compute the average sequence-level similarity over all the files

• An example for sequence-level similarity

Temporary Labeling Similarity

• Divide time sequence into bins (size=?)

0100000000000001000…

(1, 1, 1, 5)(1, 1, 1, 5, 0, 0, 0, 7, …)

engine A on file X:

engine B on file X: 0000000001000000100…

(0, 0, 0, 7, 1, 1, 1, 4, …)

0.87

cosine

16

( )

(0, 0, 0, 7)

( )

Page 19: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Label Correlations Between Engines

AvastAVG K7GW

K7AntiVirus

GdataESET-NOD32BitDefenderAd-AwareEmsisoftMicroWorld-eScan

17

Page 20: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Label Correlations Between Engines

AvastAVG K7GW

K7AntiVirus

GdataESET-NOD32BitDefenderAd-AwareEmsisoftMicroWorld-eScan

MicroWorld-eScan

Emsisoft

GData

Ad-Aware

BitDefender

17

Page 21: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Label Correlations Between Engines

AvastAVG K7GW

K7AntiVirus

GdataESET-NOD32BitDefenderAd-AwareEmsisoftMicroWorld-eScan

MicroWorld-eScan

Emsisoft

GData

Ad-Aware

BitDefender

There are groups of engines with strong correlations in labeling decisions.

17

Page 22: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

• Q1: the impact of label changes (label flips)

• Q2: threshold-based label aggregation methods

• Q3: the correlation between VirusTotal engines

Outline

⚠️

✔️

✔️

⚠️

✔️

✔️ ⚠️ ⚠️

18

Equally trustworthy?

× ×××

PE

APK

has ground-truthno ground-truth

Ground-Truth Dataset

Page 23: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

• How we create “fresh” ground-truth samples?– Obfuscating ransomware to create malware

– Obfuscation + compiling open-source software to create goodware

• Findings: – Obfuscation brings many false positives

• Even for high-reputation engines

– 3 ≤ 𝑡 ≤ 15 can produce good aggregation results• As long as the benign files are not obfuscated

– Inconsistency exists between the desktop and the VirusTotal versions

Ground Truth Dataset

19

More results in our paper…

Page 24: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Conclusion and Takeaways

• A paper survey on how researchers use VirusTotal

• Data-driven methods to validate labeling methodologies

• Takeaways and suggestions – Data preprocessing

• Submit the same files in 3 consecutive days to detect hazards

• No need to wait over long time

– Threshold-based label aggregation• Stable: when t is within a reasonable range (2-20)

• Correctness: t = 3 to 15 when benign files are not obfuscated

– Correlation and causality exists between engines

– High-reputation Engines are not always accurate20

Page 25: Measuring the Label Dynamics of Online Anti-Malware Engines · Measuring and Modeling the Label Dynamics of Online Anti-Malware Engines Shuofei Zhu í, JianjunShi í,, LiminYang ï

Thank you!

• Also thanks to my collaborators

• Contact– [email protected]

• Artifact– https://sfzhu93.github.io/projects/vt/index.html

21