A statistical approach to reduce malware inside an ...

Gustavo A Valencia-Zapata M.Sc. Candidate in Statistics, Juan C Salazar-Uribe, Ph.D.

School of Statistics, Universidad Nacional de Colombia-Sede Medellín

www.gustavovalencia.com

A statistical approach to reduce malware inside an Information System in Banking Sector

WORLDCOMP'12

Paper: A statistical approach to reduce malware inside an Information System in Banking Sector

Paper: CART for Handling Missing Values in a CMBD. Application Malware inside an Information System in Banking Sector

The research question

How can malware incidence be decreased in an Information System (IS)?

Just as treatments (medicine, vaccines, therapies, etc.) must be applied in human epidemiology, the equivalent treatment in a computer environment is the antivirus scan.

How can antivirus scans (medical tests) be dosed in our population (the computer network) to reduce the incidence of malware (diseases) in a banking IS?

Currently the bank scans all computers once a week. This research was intended to change that policy: for example, some computers would be scanned once a week, others twice a week, and others once a month.

First stages

In this research, the first stages of building the model are information extraction (IE), handling of missing values, and statistical analysis. The main information source is the bank's antivirus software. Secondary information sources are web filtering, HCM (Human Capital/Resource Management), and the CMDB (Configuration Management Database).

CMDB

TABLE I
CMDB PARAMETERS

Variable          Meaning/value                     Type     Unit
Class             Laptop, desktop or server         Nominal  NA
Brand             Computer brand                    Nominal  NA
Computer_Age      Operating time                    Scale    Week
Processor_Type    Type of computer processor        Nominal  NA
Processor_Clock   Speed of the computer processor   Scale    GHz
Processors        Number of processors              Integer  Count
Memory (RAM)      Memory size                       Scale    GB
Operation_System  Operating system (OS)             Nominal  NA
Service_Pack      Updates to an OS                  Nominal  NA
Hard_Disk         Hard disk size                    Scale    GB

Around 18.22% of the CMDB data (infected computers) are missing. Classification and Regression Trees (CART) are used to handle these missing values (imputation) and avoid losing valuable information.

CART

Classification tree for the target variable Class. Each node lists its size n (share of the full sample) and the count (share within the node) of each Class category.

Node 0 (root): n = 3151 (100.00%); CPU 2229 (70.74%), Laptop 912 (28.94%), Server 10 (0.32%)
Split on Processor_Clock (Improvement = 0.31)

  Node 1, Processor_Clock <= 2.56: n = 1085 (34.43%); CPU 178 (16.41%), Laptop 906 (83.50%), Server 1 (0.09%)
  Split on Processor_Type (Improvement = 0.06)

    Node 3, Processor_Type in {P01, P02, P03, P04, P05, P06, P07, P09, P11, P12, P13, P14, P15, P16, P17, P18, P19, P20, P21, P22, P24, P25, P26, P27, P28, P30, P31, P32, P33, P34, P35, P36}: n = 850 (26.98%); CPU 9 (1.06%), Laptop 841 (98.94%), Server 0 (0.00%)

    Node 4, Processor_Type in {P08, P10, P23, P29, P37}: n = 235 (7.46%); CPU 169 (71.91%), Laptop 65 (27.66%), Server 1 (0.43%)
    Split on Memory (Improvement = 0.03)

      Node 9, Memory <= 3.36: n = 171 (5.43%); CPU 169 (98.83%), Laptop 1 (0.58%), Server 1 (0.58%)

      Node 10, Memory > 3.36: n = 64 (2.03%); CPU 0 (0.00%), Laptop 64 (100.00%), Server 0 (0.00%)

  Node 2, Processor_Clock > 2.56: n = 2066 (65.57%); CPU 2051 (99.27%), Laptop 6 (0.29%), Server 9 (0.44%)

The CART (Classification and Regression Trees) method was proposed by Breiman et al. [1]. The decision trees produced by CART are strictly binary, containing exactly two branches for each decision node. CART recursively partitions the records into subsets with similar values for the target attribute.
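The Improvement values shown in the tree can be checked by hand. The slides do not name the splitting criterion, so the sketch below assumes the Gini impurity, with each split's gain weighted by the parent node's share of the full sample. Under that assumption the three splits come out close to the reported 0.31, 0.06 and 0.03. The CPU count of Node 4 is taken as 169, the value implied by its total of 235 and its 71.91% share. The function and constant names are illustrative.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def improvement(parent, children, n_root):
    """Decrease in Gini impurity of a split, weighted by the parent's share of the root."""
    n_parent = sum(parent)
    weighted_children = sum(sum(ch) / n_parent * gini(ch) for ch in children)
    return n_parent / n_root * (gini(parent) - weighted_children)


# Class counts (CPU, Laptop, Server) read from the tree nodes.
NODE_0 = (2229, 912, 10)
NODE_1 = (178, 906, 1)
NODE_2 = (2051, 6, 9)
NODE_3 = (9, 841, 0)
NODE_4 = (169, 65, 1)   # CPU count implied by the node total of 235
NODE_9 = (169, 1, 1)
NODE_10 = (0, 64, 0)

N_ROOT = sum(NODE_0)
clock_gain = improvement(NODE_0, [NODE_1, NODE_2], N_ROOT)    # split on Processor_Clock
type_gain = improvement(NODE_1, [NODE_3, NODE_4], N_ROOT)     # split on Processor_Type
memory_gain = improvement(NODE_4, [NODE_9, NODE_10], N_ROOT)  # split on Memory
```

Weighting by the parent's share explains why deeper splits report smaller improvements even when they separate the classes sharply: they act on a smaller part of the sample.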

[1] Breiman et al., "Classification and Regression Trees", 1984

Table II shows the variables for computer number 0022. Three of its ten variables have missing values.

TABLE II
COMPUTER 0022 - CMDB PARAMETERS

Variable          Meaning/value  Unit
Class             Missing        NA
Brand             Missing        NA
Computer_Age      Missing        Week
Processor_Type    P27            NA
Processor_Clock   2.19           GHz
Processors        2              Count
Memory (RAM)      2.14           GB
Operation_System  SO_7           NA
Service_Pack      SP_3           NA
Hard_Disk         80.02          GB


Variable: Class

Node 0 indicates that, if a random imputation were conducted, the CPU (desktop) category would have the highest probability of being selected (0.71). The Laptop category has a smaller probability (0.29), and the Server category a probability close to zero (0.003).


Computer 0022 has a Processor_Clock of 2.19 GHz (<= 2.56), so it falls in Node 1, where the Laptop category has the highest probability of being selected (0.835).


The Processor_Type of computer 0022 (P27) places it in Node 3, where the Laptop category has the highest probability of being selected (0.989).

As a consequence, the Class variable of computer 0022 is imputed as Laptop.
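The path followed by computer 0022 can be written as a small rule function. This is only a sketch of the tree as shown on the slides, under two assumptions: Node 2 is treated as a terminal node, and each terminal node imputes its majority category rather than drawing at random. The function name `impute_class` and the constant `NODE_4_TYPES` are illustrative.

```python
# Processor types that send a record to Node 4 (the other listed types go to Node 3).
NODE_4_TYPES = {"P08", "P10", "P23", "P29", "P37"}


def impute_class(processor_clock, processor_type, memory):
    """Return the majority Class of the terminal node a record falls into."""
    if processor_clock > 2.56:
        return "CPU"        # Node 2: 99.27% CPU
    if processor_type not in NODE_4_TYPES:
        return "Laptop"     # Node 3: 98.94% Laptop
    if memory <= 3.36:
        return "CPU"        # Node 9: 98.83% CPU
    return "Laptop"         # Node 10: 100% Laptop


# Computer 0022 from Table II: 2.19 GHz, processor P27, 2.14 GB of RAM.
imputed = impute_class(processor_clock=2.19, processor_type="P27", memory=2.14)
```

For computer 0022 the function stops at the Node 3 branch, so the Memory value is never consulted.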

Evaluating model prediction

A two-sided McNemar test [2] was used to check whether imputation changes the Class values; the null hypothesis is that it does not.

TABLE III
CHI-SQUARE TEST

                     Value   Exact Sig. (two-sided)
McNemar test         1.058   0.392
No. of valid cases   7049

Binomial distribution used.

According to this analysis we cannot reject the null hypothesis; that is, CART does not change the Class values after imputation (p-value = 0.392).

[2] Conover, "Practical Nonparametric Statistics", 1999

In Table IV, E_Class is the imputed value and Class is the real value. For instance, 5013 (99.6%) of the computers with Class equal to CPU (desktop) were classified correctly by CART, and 2002 (99.3%) of the computers with Class equal to Laptop were classified correctly by the same tree.

TABLE IV
CONTINGENCY TABLE CLASS

                 E_Class
Class      CPU     Laptop   Total
CPU        5013    20       5033
Laptop     14      2002     2016
Total      5027    2022     7049
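The McNemar test depends only on the two discordant cells of Table IV (20 and 14). The sketch below recomputes the statistic without continuity correction and the exact two-sided p-value from the binomial distribution, using only the Python standard library. The function name `mcnemar` is illustrative, and the choice of the uncorrected statistic is an assumption made because it agrees with the value in Table III.

```python
from math import comb


def mcnemar(b, c):
    """McNemar statistic (no continuity correction) and exact two-sided binomial p-value."""
    n = b + c
    statistic = (b - c) ** 2 / n
    # Under the null hypothesis the discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return statistic, p_value


# Discordant cells of Table IV: CPU imputed as Laptop, Laptop imputed as CPU.
statistic, p_value = mcnemar(b=20, c=14)
```

With these counts the statistic is 36/34, about 1.06, and the exact p-value is about 0.39, in line with Table III.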


Antivirus Scanning Dosage Statistics Model

For example, according to the chi-squared test for independence, Malware_Level and USB are independent; that is, in our case computers have the same levels of malware whether their USB ports are disabled or enabled. Nevertheless, disabling USB ports remains a recommended security measure, as it is an effective strategy for preventing information leakage.
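As an illustration of the test mentioned above, the sketch below computes the Pearson chi-squared statistic for a contingency table of counts. The slides do not report the Malware_Level by USB counts, so the table used here is hypothetical, as are the number of malware levels and all names in the code.

```python
def chi_square_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts (not the bank's data): rows are USB disabled / enabled,
# columns are three illustrative Malware_Level categories.
hypothetical = [[30, 20, 10],
                [62, 38, 20]]
statistic, dof = chi_square_independence(hypothetical)
CRITICAL_5PCT_DF2 = 5.991  # chi-squared critical value, 2 degrees of freedom, alpha = 0.05
independent = statistic < CRITICAL_5PCT_DF2
```

When the two rows have nearly proportional counts, as in this made-up table, the statistic stays far below the critical value and independence is not rejected.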

The Kaplan-Meier method is used to estimate the survival function from lifetime data. To use this strategy we define the following outcome: elapsed time to first malware infection in a computer.

Survival curves show, for each time plotted on the X axis, the proportion of all computers surviving (not yet infected) at that time.
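A minimal sketch of the Kaplan-Meier estimator for this outcome follows. At each week with at least one observed infection, the survival probability is multiplied by the fraction of at-risk computers that were not infected. The data are hypothetical and the function name is illustrative; records with event 0 are treated as censored, meaning no infection was seen before observation ended.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival probability) at each event time.

    times  -- weeks until first infection or until observation ended
    events -- 1 if the infection was observed, 0 if the record is censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        infected = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        if infected:
            survival *= 1 - infected / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical data for five computers (not taken from the study).
weeks = [1, 2, 2, 3, 4]
observed = [1, 1, 0, 1, 0]
curve = kaplan_meier(weeks, observed)
```

In this toy data set the curve steps down at weeks 1, 2 and 3; the censored records at weeks 2 and 4 reduce the number at risk without producing a step.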

Hosmer Jr, D.W. and Lemeshow, "Applied Survival Analysis: Regression Modeling of Time to Event Data", 1999.


Figure: Survival Function, Week - Kaplan-Meier Curves (X axis: Week; Y axis: cumulative survival)

Group 0 = computers with USB disabled

Group 1 = computers with USB enabled

The log-rank test indicated no significant differences between the two groups; that is, in both situations (USB ports disabled or enabled) the computers have the same levels of malware.
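The log-rank comparison of two survival curves can be sketched as follows. At every week with an observed infection, the number of infections in one group is compared with the number expected if both groups shared the same hazard; the squared sum of these differences, divided by its variance, is referred to a chi-squared distribution with 1 degree of freedom. The records are hypothetical and the function name is illustrative.

```python
from math import erfc, sqrt


def logrank_test(group_a, group_b):
    """Two-group log-rank test on lists of (time, event) pairs.

    Returns the chi-squared statistic (1 degree of freedom) and its p-value.
    """
    pooled = group_a + group_b
    event_times = sorted({t for t, e in pooled if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x, _ in group_a if x >= t)
        n_b = sum(1 for x, _ in group_b if x >= t)
        d_a = sum(1 for x, e in group_a if x == t and e == 1)
        d_b = sum(1 for x, e in group_b if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    statistic = observed_minus_expected ** 2 / variance
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical (week, infected) records, not the study's data.
usb_disabled = [(3, 1), (5, 1), (8, 0), (10, 1)]
usb_enabled = [(3, 1), (5, 1), (8, 0), (10, 1)]
statistic, p_value = logrank_test(usb_disabled, usb_enabled)
```

The two made-up groups are identical on purpose, so the statistic is 0 and the p-value is 1, the extreme case of "no difference between groups".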


Figure: Survival Function, Week - Kaplan-Meier Curves by Computer_Age (X axis: Week; Y axis: cumulative survival)

Group 0 = "Young" computers (1 to 165 weeks)

Group 1 = "Adult" computers (166 to 248 weeks)

Group 2 = "Old" computers (> 248 weeks)

We can see that group 2 (computers with more than 248 weeks of operating time) showed statistically significant differences when compared with the other groups. That means the "Old" computers were infected more slowly than the other groups.

Conclusion and Future work

Based on this study, we believe that malware level depends on variables such as:

Processors (number of processors in the computer)

Computer_Age (operating time)

Browse_Time (web surfing time)

Class (laptop, desktop or server)

Future directions of this work include performing additional statistical analyses, such as recurrence analysis, and formulating survival models through Cox models. This will also allow identifying significant variables to optimize the malware scanning policy in an IS, as well as measuring their effect sizes.

Acknowledgment

The authors thank Juan Carlos Correa, from the School of Statistics of the Universidad Nacional de Colombia at Medellín, for helpful feedback that contributed to improving this research. The authors also thank the Security Team of the Bank Company for their continuous encouragement and support.

Many thanks to Universidad Nacional de Colombia-Sede Medellín for helping us to achieve these goals.

References

Weiguo J, "Applying Epidemiology in Computer Virus Prevention: Prospects and Limitations", 2010. Thesis, Computer Science, University of Auckland.

Bailey, N.J.T, "The Mathematical Theory of Infectious Diseases and Its Applications", 1975. New York: Oxford University Press.

Kephart J, and White S, "Directed-Graph Epidemiological Models of Computer Viruses", IEEE Computer Symposium on Research in Security and Privacy, Proceedings, pp. 343-359, May 1991.

Kephart J, and White S, "Measuring and Modeling Computer Virus Prevalence", Research in Security and Privacy, 1993, Proceedings, 1993 IEEE Computer Society Symposium on, pp. 2-15, May 1993.

Kephart, J, "How Topology Affects Population Dynamics", in Langton, C.G. (ed.) Artificial Life III. Reading, MA: Addison-Wesley, 1994.

Pastor-Satorras, R. and Vespignani, A, "Epidemic Dynamics and Endemic States in Complex Networks". Barcelona, Spain: Universitat Politecnica de Catalunya, 2001.

Rishikesh P, "Using Plant Epidemiological Methods To Track Computer Network Worms", 2004. Thesis, Computer Science, Virginia Polytechnic and State University.

Daniel T. Larose, "Discovering Knowledge in Data. An introduction to data mining", 2005. John Wiley & Sons, Inc.

Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone, "Classification and Regression Trees", 1984. Chapman & Hall/CRC Press.

Vipin Kumar, "The Top Ten Algorithms in Data Mining", 2009. Chapman & Hall/CRC.

Quinlan, R, "Unknown attribute values in induction". In Proceedings of the Sixth International Workshop on Machine Learning, 1989, pp. 164-168.

Conover, "Practical Nonparametric Statistics", 1999. John Wiley & Sons, Inc.

Hosmer Jr, D.W. and Lemeshow, "Applied Survival Analysis: Regression Modeling of Time to Event Data", 1999. John Wiley & Sons.
