Gustavo A Valencia-Zapata M.Sc. Candidate in Statistics, Juan C Salazar-Uribe, Ph.D.
School of Statistics, Universidad Nacional de Colombia-Sede Medellín
[email protected] * www.gustavovalencia.com, [email protected]
A statistical approach to reduce malware inside
an Information System in Banking Sector
WORLDCOMP'12
Paper: A statistical approach to reduce malware inside an Information System in Banking Sector
Paper: CART for Handling Missing Values in a CMBD. Application Malware inside an Information System in Banking Sector
The research question
How can malware incidence be decreased in an Information System (IS)?
Just as treatments (medicine, vaccines, therapies, etc.) must be applied in
human epidemiology, the equivalent treatment in a computer environment is the
application of antivirus scans.
How can antivirus scans (medical tests) be dosed in our population (the
computer network) to reduce the incidence of malware (diseases) in a banking
IS?
Currently the bank scans all computers once a week. This research is intended
to change that policy. For example, some computers would be scanned once a
week, others twice a week or once a month.
First stages
In this research the first stages in building the model are: information
extraction (IE), handling of missing values, and statistical analysis. The main
information source is the bank's antivirus software. Secondary information
sources are: web filtering, HCM (Human Capital/Resource Management), and CMDB
(Configuration Management Database).
CMDB
TABLE I
CMDB PARAMETERS
Variable          Meaning/value                      Type     Unit
Class             Laptop, desktop or server          Nominal  NA
Brand             Computer brand                     Nominal  NA
Computer_Age      Operating time                     Scale    Week
Processor_Type    Type of computer processor         Nominal  NA
Processor_Clock   Speed of the computer processor    Scale    GHz
Processors        Number of processors               Integer  Count
Memory (RAM)      Memory size                        Scale    GB
Operation_System  Operating system (OS)              Nominal  NA
Service_Pack      Updates to an OS                   Nominal  NA
Hard_Disk         Hard disk size                     Scale    GB
Around 18.22% of the CMDB data for infected computers are missing values.
Classification and Regression Trees (CART) are used to impute these missing
values and avoid losing valuable information.
CART
Classification tree for the target variable Class. Each node lists the
percentage and count (n) per category; the unlabeled category is empty
(0.00%, n = 0) in every node.

Node 0 (root): CPU 70.74% (2229), Laptop 28.94% (912), Server 0.32% (10); Total 100.00% (3151)
Split on Processor_Clock (Improvement = 0.31)
  Node 1 (Processor_Clock <= 2.56): CPU 16.41% (178), Laptop 83.50% (906), Server 0.09% (1); Total 34.43% (1085)
  Split on Processor_Type (Improvement = 0.06)
    Node 3 (P01, P02, P03, P04, P05, P06, P07, P09, P11, P12, P13,
            P14, P15, P16, P17, P18, P19, P20, P21, P22, P24, P25,
            P26, P27, P28, P30, P31, P32, P33, P34, P35, P36):
            CPU 1.06% (9), Laptop 98.94% (841), Server 0.00% (0); Total 26.98% (850)
    Node 4 (P08, P10, P23, P29, P37): CPU 71.91% (169), Laptop 27.66% (65), Server 0.43% (1); Total 7.46% (235)
    Split on Memory (Improvement = 0.03)
      Node 9 (Memory <= 3.36): CPU 98.83% (169), Laptop 0.58% (1), Server 0.58% (1); Total 5.43% (171)
      Node 10 (Memory > 3.36): CPU 0.00% (0), Laptop 100.00% (64), Server 0.00% (0); Total 2.03% (64)
  Node 2 (Processor_Clock > 2.56): CPU 99.27% (2051), Laptop 0.29% (6), Server 0.44% (9); Total 65.57% (2066)
The CART (Classification and Regression Trees) method was proposed by
Breiman et al. [1]. The decision trees produced by CART are strictly binary,
containing exactly two branches for each decision node. CART recursively
partitions the records into subsets with similar values for the target
attribute.
[1] Breiman et al., "Classification and Regression Trees", 1984
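The Improvement value reported at each split can be reproduced from the node counts. The following is a minimal sketch, assuming the tree was grown with the Gini impurity criterion (the slides do not name the criterion); the function names are illustrative and not part of the original work.

```python
def gini(counts):
    """Gini impurity of a node, given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_improvement(parent, left, right):
    """Decrease in Gini impurity obtained by splitting parent into left/right."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Class counts (CPU, Laptop, Server) read from the tree for the root split.
node_0 = [2229, 912, 10]  # root node
node_1 = [178, 906, 1]    # Processor_Clock <= 2.56
node_2 = [2051, 6, 9]     # Processor_Clock > 2.56

improvement = split_improvement(node_0, node_1, node_2)
print(round(improvement, 2))  # 0.31, the Improvement shown for Processor_Clock
```

Under this assumption the root split on Processor_Clock yields the reported improvement of 0.31, which suggests (but does not prove) that Gini was the criterion used.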
Table II shows the variables for computer number 0022. Three of its ten
variables have missing values.
TABLE II
COMPUTER 0022 – CMDB PARAMETERS
Variable          Meaning/value  Units
Class             Missing        NA
Brand             Missing        NA
Computer_Age      Missing        Week
Processor_Type    P27            NA
Processor_Clock   2.19           GHz
Processors        2              Count
Memory (RAM)      2.14           GB
Operation_System  SO_7           NA
Service_Pack      SP_3           NA
Hard_Disk         80.02          GB
Variable: Class
Node 0 indicates that the CPU (desktop) category has the highest probability
(0.71) of being selected if a random imputation is conducted. The Laptop
category has a smaller probability (0.29), and the Server category has a
probability close to zero (0.003).
Computer 0022 has a Processor_Clock of 2.19 GHz (<= 2.56), so it falls into
Node 1, where the Laptop category has the highest probability (0.835) of being
selected.
Its Processor_Type is P27, which leads to Node 3, where the Laptop category
has the highest probability (0.989) of being selected.
As a consequence, the Class variable for computer number 0022 is imputed as
Laptop.
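The walk through the tree for computer 0022 can be written as a small lookup. This is a minimal sketch, not the authors' implementation: the function name and return format are hypothetical, thresholds and probabilities are copied from the tree, and the most probable category is returned instead of drawing a random category from the node distribution.

```python
# Processor types routed to Node 4. Every other type is treated as Node 3,
# which is a simplification: the tree lists the Node 3 types explicitly.
NODE_4_TYPES = {"P08", "P10", "P23", "P29", "P37"}


def impute_class(processor_clock, processor_type, memory):
    """Return (terminal node, most probable Class, its probability)."""
    if processor_clock > 2.56:
        return ("Node 2", "CPU", 0.9927)
    # Processor_Clock <= 2.56: Node 1, split on Processor_Type.
    if processor_type not in NODE_4_TYPES:
        return ("Node 3", "Laptop", 0.9894)
    # Node 4, split on Memory.
    if memory <= 3.36:
        return ("Node 9", "CPU", 0.9883)
    return ("Node 10", "Laptop", 1.0)


# Computer 0022 (Table II): 2.19 GHz, processor type P27, 2.14 GB of RAM.
print(impute_class(2.19, "P27", 2.14))  # ('Node 3', 'Laptop', 0.9894)
```

Only the non-missing variables of the computer are needed to reach a terminal node, which is why the tree can be used to fill in the missing Class value.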
Evaluating model prediction
A two-sided McNemar test [2] was used to compare the real and imputed Class
values. Under the null hypothesis, CART imputation does not change the Class
values.
TABLE III
CHI-SQUARE TEST
                  Value  Exact Sig. (two-sided)
McNemar Test      1.058  0.392 (binomial distribution used)
N of Valid Cases  7049
According to this analysis we cannot reject the null hypothesis; that is, CART
does not change the Class values after imputation (p-value = 0.392).
[2] Conover, "Practical Nonparametric Statistics", 1999
In this case E_Class is the imputed value and Class is the real value. For
instance, 5013 (99.6%) computers with Class equal to CPU (desktop) were
classified correctly by CART, and 2002 (99.3%) computers with Class equal to
Laptop were classified correctly by the same tree.
TABLE IV
CONTINGENCY TABLE CLASS
                      E_Class
                  CPU   Laptop   Total
Class   CPU      5013       20    5033
        Laptop     14     2002    2016
Total            5027     2022    7049
Antivirus Scanning Dosage Statistics Model
For example, according to the chi-squared test for independence, Malware_Level
and USB are independent. That is, in our case the computers have the same
levels of malware in both situations (USB ports disabled or enabled).
Nevertheless, disabling USB ports is still recommended as a security measure,
since it is an effective strategy for preventing information leakage.
The Kaplan-Meier method is used to estimate the survival function from
lifetime data. To use this method we define the following outcome: elapsed time
to the first malware infection in a computer.
Survival curves show, for each time plotted on the X axis, the proportion of
all computers surviving (not yet infected) at that time.
Hosmer Jr, D.W. and Lemeshow, "Applied Survival Analysis: Regression Modeling of Time to Event Data", 1999.
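The Kaplan-Meier estimate described above can be sketched in a few lines. This is a minimal illustration with made-up data, not the bank's records: times are weeks until the first malware infection, and an event flag of 0 marks a computer that was still clean when observation ended (censored). The function name is hypothetical.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct infection time t."""
    infections = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # infected and censored computers both leave the risk set
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = infections.get(t, 0)
        if d:
            survival *= 1 - d / at_risk  # product-limit step
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical data: four computers, one censored at week 2.
weeks = [1, 2, 2, 3]
infected = [1, 1, 0, 1]
for week, s in kaplan_meier(weeks, infected):
    print(week, round(s, 3))
```

Each step multiplies the running survival probability by the fraction of at-risk computers that were not infected in that week, which is what produces the staircase shape of the survival curves.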
Figure: Survival Function Week, Kaplan-Meier curves (X axis: Week; Y axis: Cum Survival)
Group 0 = computers with USB disabled
Group 1 = computers with USB enabled
The log-rank test indicates that there are no significant differences between
these groups; that is, in both situations (USB ports disabled or enabled) the
computers have the same levels of malware.
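The log-rank comparison of two survival curves can be sketched as follows. This is a minimal two-group version with hypothetical data; the function name is illustrative, and the p-value uses the chi-square distribution with one degree of freedom, computed through the standard library's `math.erfc`.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({u for u, e, _ in data if e}):
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk, group B
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi_square / 2))  # chi-square tail, 1 df
    return chi_square, p_value


# Hypothetical weeks to first infection (event flag 0 = censored).
usb_disabled = ([3, 5, 8, 12, 12], [1, 1, 0, 1, 0])
usb_enabled = ([2, 6, 7, 10, 12], [1, 1, 1, 0, 0])
chi_square, p_value = logrank_test(*usb_disabled, *usb_enabled)
print(round(chi_square, 3), round(p_value, 3))
```

A large p-value, as reported for the USB groups, means the observed infections in each group are close to what would be expected if both groups shared the same survival function.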
Figure: Survival Function Week, Kaplan-Meier curves by Computer_Age (X axis: Week; Y axis: Cum Survival)
Group 0 = "Young" computers (1 to 165 weeks)
Group 1 = "Adult" computers (166 to 248 weeks)
Group 2 = "Old" computers (> 248 weeks)
We can see that group 2 (computers with more than 248 weeks of operating time)
showed statistically significant differences when compared with the other
groups. That means the "Old" computers became infected more slowly than the
other groups.
Conclusion and Future work
In this study we believe that malware level depends on variables such as:
Processors (number of processors in the computer)
Computer_Age (operating time)
Browse_Time (web surfing time)
Class (laptop, desktop or server)
Future directions of this work include performing additional statistical
analyses, such as recurrence analysis and the formulation of survival models
through Cox models. This will also make it possible to identify significant
variables for optimizing the malware scanning policy in an IS, as well as to
measure their effect sizes.
Acknowledgment
The authors thank Juan Carlos Correa of the School of Statistics of the
Universidad Nacional de Colombia at Medellín for helpful feedback that
improved this research. The authors also thank the Security Team of the Bank
Company for their continuous encouragement and support.
Many thanks to Universidad Nacional de Colombia-Sede Medellín for helping us
to achieve these goals.
References
Weiguo J, "Applying Epidemiology in Computer Virus Prevention: Prospects and Limitations", 2010. Thesis, Computer Science, University of Auckland.
Bailey, N.J.T, "The Mathematical Theory of Infectious Diseases and Its Applications", 1975, New York: Oxford University Press.
Kephart J, and White S, "Directed-Graph Epidemiological Models of Computer Viruses", IEEE Computer Symposium on Research in Security and Privacy, Proceedings, pp. 343-359, May 1991.
Kephart J, and White S, "Measuring and Modeling Computer Virus Prevalence", Research in Security and Privacy, 1993, Proceedings, 1993 IEEE Computer Society Symposium on, pp. 2-15, May 1993.
Kephart, J, "How Topology Affects Population Dynamics" in Langton, C.G. (ed.) Artificial Life III. Reading, MA: Addison-Wesley, 1994.
Pastor-Satorras, R. and Vespignani, A, "Epidemic Dynamics and Endemic States in Complex Networks". Barcelona, Spain: Universitat Politecnica de Catalunya, 2001.
Rishikesh P, "Using Plant Epidemiological Methods To Track Computer Network Worms", 2004. Thesis, Computer Science, Virginia Polytechnic and State University.
Daniel T. Larose, "Discovering Knowledge in Data. An introduction to data mining", 2005. John Wiley & Sons, Inc.
Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone, "Classification and Regression Trees", 1984. Chapman & Hall/CRC Press.
Vipin Kumar, "The Top Ten Algorithms in Data Mining", 2009. Chapman & Hall/CRC.
Quinlan, R, "Unknown attribute values in induction". In Proceedings of the Sixth International Workshop on Machine Learning, 1989, pp. 164-168.
Conover, "Practical Nonparametric Statistics", 1999. John Wiley & Sons, Inc.
Hosmer Jr, D.W. and Lemeshow, "Applied Survival Analysis: Regression Modeling of Time to Event Data", 1999. John Wiley & Sons.
THANKS