
A Graph Based Framework for Malicious Insider Threat Detection

Anagi Gamachchi, Li Sun and Serdar Boztaş
School of Science, RMIT University, Melbourne, Victoria, Australia

(anagi.gamachchi, li.sun, serdar.boztas)@rmit.edu.au

Abstract

While most security projects have focused on fending off attacks coming from outside the organizational boundaries, a real threat has arisen from the people who are inside those perimeter protections. Insider threats have shown their power by hugely affecting national security, financial stability, and the privacy of many thousands of people. What is in the news is the tip of the iceberg, with much more going on under the radar, and some threats never being detected. We propose a hybrid framework based on graphical analysis and anomaly detection approaches to combat this severe cyber security threat. Our framework analyzes heterogeneous data in isolating possible malicious users hiding behind others. Empirical results reveal this framework to be effective in distinguishing the majority of users who demonstrate typical behavior from the minority of users who show suspicious behavior.

1. Introduction

The battle between malicious but trusted insiders and organizations in safeguarding information assets is the biggest and fastest growing cyber security threat in this digital age. Insider threat, or the threat from a malicious insider who is defined as “a current or former employee, contractor or business partner who has, or had, authorized access to an organization’s network, system or data and intentionally exceeded or misused that access in a manner that negatively affected the confidentiality, integrity or availability of the organization’s information or information systems” [7], has been identified as a primary concern within the cybersecurity community. Financial loss and reputation damage caused by this “known unknown” cybersecurity threat far outweighs that caused by external attacks. Thus, the majority of private and governmental organizations have identified the severity of the threat posed by insiders and focused on security control improvements. However, the seriousness of this problem is still rising at an alarming rate, threatening most critical infrastructure segments.

One of the most recent articles from CSO magazine [1] compared the cost between external and internal attacks and noted that while it takes about 50 days to fix a data breach caused by an internal attack, it only takes 2 to 5 days in the case of external attacks. Moreover, “attacks by malicious insiders are also the costliest to fix ($145,000), followed by denial of service ($127,000) and Web-based attacks ($96,000)”, indicating the severity of this problem.

The unpredictable nature of human behavior makes this complex issue much more complicated than expected. This is aggravated by the mobility and hyper-connectivity of people. Insider threat research and surveys suggest this problem cannot be considered only as a data-driven problem; it needs to be treated as a data- and behavior-driven problem [5]. A close examination of user behavior can spot trends, and such information can be used in tightening the radar on suspicious users. Different parameters govern users’ day-to-day actions, and behavioral changes exposed in workplace environments extend the possibilities of isolating suspicious users from the rest of the employees. Organizations can suffer after-effects such as unmotivated employees and inefficient work behavior if an innocent user is classified as suspicious. Thus, the decision to label a person as suspicious should be a measured, carefully considered process. Obviously, insider threat detection will focus on isolating suspicious users from the others, but it may not be practical to point to an employee as a malicious attacker. Also, the effectiveness of the process depends on the ability to analyze as many parameters as possible. Consideration of the above facts led us to design the insider threat detection framework described in the rest of the paper.

Our Contribution: In this paper, we propose a framework for the isolation of malicious users based on graphical and anomaly detection techniques. The proposed architecture is given in Figure 1 and has two major components, the “Graphical Processing Unit” (GPU) and the “Anomaly Detection Unit” (ADU). Data from multidimensional sources of an enterprise network is formatted and fed into the GPU, which generates a graph representing the interrelationships between informational assets of the network. These input streams can be from different, i.e., heterogeneous, informational sources with different data formats.

Figure 1. Framework. Multidimensional input records (logon/logoff records, removable media usage records, psychometric data and HTTP access records) feed the Graphical Processing Unit, which generates the original graph, induced subgraphs and graph/subgraph attributes; these attributes feed the Anomaly Detection Unit, which runs the isolation forest algorithm and produces anomaly scores for individual users.

These data streams can be from event logs (logon/logoff), email logs, HTTP records, social network access data and various HR records such as psychometric data. Once the informational assets are mapped into a network, several graph parameters are calculated for each user. Since the final goal is to isolate the most anomalous users from the rest, all the attributes are computed for individual users. The next task of the GPU is to generate induced subgraphs of each user for different levels of neighborhoods. Several relevant subgraph properties (vertex count, edge count, density, diameter and number of peers) are calculated for each level of subgraph. The calculated graph and subgraph parameters are fed into the ADU. In parallel to the above process, time-varying data is also fed into the ADU. The isolation forest algorithm is executed within the ADU to isolate anomalous users, and an anomaly score for each user is generated as the output of the ADU. These values are used in identifying and separating possible malicious users from the rest of the workforce.

The remainder of this paper is organized as follows. Section 2 describes the work most closely related to insider threat detection research. Section 3 describes the dataset we used in this research, while Section 4 describes the adopted methodology. Experimental results are discussed in Section 5. Section 6 concludes the paper and indicates future directions.

2. Related Work

Intrusion Detection System (IDS) based approaches, visualization strategies, honeypot/honeynet approaches and system call based methods are several techniques adopted from external threat detection in finding solutions for the insider problem [17]. The work presented here focuses on the combination of graph-based techniques and anomaly detection approaches. This section therefore discusses only the most closely related work under the above two techniques, specifically in the insider threat domain.

The specialized network anomaly detection (SNAD) model proposed by [6] is a graph based approach used for detecting insider actions in collaborative information systems. In their model, access logs are mapped into a bipartite graph. The similarity of users is compared based on the number of subjects a user accesses from a collaborative information system, using the cosine similarity measure. In order to determine whether a particular access is anomalous or not, the authors considered the influence of a user on the similarity of the access network by suppressing one user at a time. Even though they achieved better performance than their competitors (the spectral anomaly detection model), they identified difficulties in implementation on real world networks.

Another study [4] proposed proactive insider threat detection by graph learning and psychological modeling of users. The proposed model is a combination of structural anomaly detection and psychological profiling, and it explored the possibility of including dynamical properties of nodal attributes. They evaluated results based on a publicly available gaming data set, which might not be very similar to enterprise system and network data. Althebyan and Panda [2] have also suggested the use of graph theory to formalize two components, knowledge graphs and object dependency graphs. A knowledge graph represents knowledge units for a given insider and is updated over time. The dependency graph is a global hierarchical graph that shows all dependencies among various objects. Even though this model tries to include the accumulated knowledge of the insider over time on systems and objects, it could be improved by including several other parameters such as the user’s behavioral patterns and psychological aspects.

Another study performed by Nance and Marty [12] introduced the use of bipartite graphs for identifying and visualizing insider threats. They tried to establish acceptable insider behavior patterns based on workgroup role classifications. Although this approach is quite useful for detecting certain insider threats, it has the limitation of a high false positive rate. The framework suggested by [8] is another graph-based approach for malicious insider threat detection, which uses the minimum description length principle for detecting anomalous activities. A recent paper [10] proposed the use of authentication subgraphs for analyzing user behavior within an enterprise network, utilizing a set of subgraph attributes in user profiling. Time series analysis of subgraphs and the use of bipartite graphs are also introduced in their work, targeting a more comprehensive analysis in their ongoing work. Another related framework, BANDIT (Behavioral Anomaly Detection for Insider Threat) [3], also proposed a two-stage anomaly detection approach for managing insider attacks, which compared users’ behaviour against their own and peer baselines.


3. The Data Set

Due to the lack of availability of proper insider threat datasets, we have utilized the insider threat dataset published by the CERT Division of Carnegie Mellon University for this research [15]. The dataset “R4.2.tar.bz” has been used for this analysis. According to the dataset owners, this is a “dense needle” dataset with a fair number of red team scenarios. This dataset consists of six broad types of data records (HTTP, logon, device, file, email and psychometric) of 1000 employees over a 17 month period. All HTTP records contain user, PC, URL and web page content with time stamps. “Logon.csv” consists of user logon/logoff activities with the corresponding PC and timestamps. “Logon” and “Logoff” are the two types of activities that can be found in the data records. A “Logon” activity corresponds to either a user login event or a screen unlock event, while a “Logoff” event corresponds to a user logoff event; screen locks are not recorded in this dataset. The third data file, “device.csv”, is a collection of data records of removable media usage. It indicates insert/remove actions with the relevant user, PC, and timestamp. Details of file copies are stored in the “file.csv” file with date, user, PC, filename, and content. To build the friendship network of users, the CERT dataset provides email communication records including from, to, cc and bcc fields. “psychometric.csv” provides psychometric scores based on the big five personality traits, or five-factor model (FFM), for the definition of personality.
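As a concrete illustration of how these records feed the rest of the pipeline, the following minimal R sketch loads the logon records and keeps only the “Logoff” events used later for graph construction; the column names and the timestamp format are assumptions about the CSV layout rather than details given in this paper.

# Minimal sketch; assumed columns in logon.csv: id, date, user, pc, activity.
logon <- read.csv("r4.2/logon.csv", stringsAsFactors = FALSE)

# Parse timestamps (the "%m/%d/%Y %H:%M:%S" format is an assumption).
logon$date <- as.POSIXct(logon$date, format = "%m/%d/%Y %H:%M:%S")

# Keep only "Logoff" events, since logons and screen unlocks are
# indistinguishable in this dataset (see Section 4).
logoff_events <- subset(logon, activity == "Logoff")
head(logoff_events[, c("user", "pc", "date")])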

Among the different work roles in the dataset, we performed our analysis on three job roles. Table 1 shows the statistics of the selected data records.

4. Methodology

The goal of this paper is to introduce a framework for mitigating the insider threat problem using a combination of a graph-based approach and an anomaly detection technique. This framework utilizes multidimensional inputs such as user interactions with hardware assets, web access records, email correspondence and psychometric figures. The graph-based approach is a prominent method for identifying inter-relationships between multidimensional entities. A user’s interactions with devices are illustrated in a weighted, undirected, large scale bipartite graph G = (V, E, W), where V is the set of vertices, E is the set of edges, and W is the set of edge weights. The set of vertices comprises two types of entities, users and devices, while an edge represents a user’s interaction with a device. Edge weights correspond to the number of “Logoff” activities which appeared between an individual user and a device during the whole time span of the dataset. Graph visualization was carried out using NodeXL [14] and all the other calculations were done using the R statistical computing language [13].

Figure 2. First order subgraph topology for all users

The following subsections describe the theoretical background and the implemented methodology in detail.

Even though the dataset comprises both “logon” and “logoff” records for individual users, we utilize only the “logoff” events for network mapping. The reason is that we cannot distinguish login activities from screen unlocks, as both are recorded as “logon” events; screen locks, however, are not recorded, so only genuine logoff events are recorded as “logoff” events. The use of removable media could also be represented by an edge between the corresponding user and device. However, this would convert the graph into a multigraph, where multiple edges are possible between two vertices. To keep the first phase of the analysis simple, we exclude such edges. Integration of the friendship network into the same network would change the structure of the network and it would lose the multipartite property. Therefore, inter-user relationships based on email data have not been considered for the graphical representation. The following graph attributes for individual users are captured from the above graph G for further analysis.

4.1. Graph Parameters

User’s vertex degree (d_u): The degree of a vertex is the number of edges connected to it. In the context of this analysis, we calculated the degree only for the users; therefore, this value represents the number of devices accessed by an individual user.
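A minimal R sketch of this construction and of the degree computation is given below, using the igraph package and the logoff_events data frame from the sketch in Section 3; the use of igraph (rather than NodeXL) and the column names are illustrative assumptions, not the exact tooling reported above.

library(igraph)

# One edge per (user, pc) pair, weighted by the number of "Logoff" events.
edge_list <- aggregate(list(weight = rep(1, nrow(logoff_events))),
                       by = list(user = logoff_events$user, pc = logoff_events$pc),
                       FUN = sum)

# Weighted, undirected bipartite user-device graph G = (V, E, W).
g <- graph_from_data_frame(edge_list, directed = FALSE)
V(g)$type <- V(g)$name %in% unique(logoff_events$pc)   # TRUE = device, FALSE = user

# User's vertex degree d_u: number of distinct devices accessed by each user.
users <- V(g)[!V(g)$type]
user_degree <- degree(g, v = users)
head(sort(user_degree, decreasing = TRUE))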

4.2. User Subgraph (USG) Parameters

As the next step of the analysis, the focus is on the construction of subgraphs for each user. A deep dive into the different orders of subgraphs is how we extract the relevant graph parameters in this analysis. In this work, user subgraphs are constructed up to the fifth order of neighborhood. We define the user subgraph (USG) as a weighted undirected graph G_u = (V_u, E_u, W_u) for the entire period of the dataset, where V_u represents the vertex set of G_u and E_u represents the edge set of G_u. Link weights are the same as in the original graph.


TABLE 1. Data statistics

Functional Unit             Department            Number of Users   HTTP        Logon     Device   File     Psychometric
Research And Engineering    Engineering           129               4,196,817   101,782   67,916   75,335   129
Research And Engineering    Software Management   101               3,295,774   82,187    44,049   58,173   101
Research And Engineering    Research              101               3,332,576   79,362    30,906   41,292   101

We have noticed that all the first order subgraphs have a star topology, as shown in Figure 2. The number of external nodes can range from 1 to the total number of devices. This is a consequence of the disjoint nature of the two types of vertices. Even though we cannot extract much information on device access similarities from first order subgraphs, we continue to use the following subgraph properties of first order subgraphs, in addition to the higher order subgraph properties, for the completeness of this analysis.

1) Vertex count v_u(j), for j = 1, ..., 5

2) Edge count e_u(j), for j = 1, ..., 5

3) Density ρ_u(j), for j = 1, ..., 5
The density of a graph is the ratio of the number of edges to the number of all possible edges given the number of vertices.

4) Diameter d_u(j), for j = 1, ..., 5
The diameter of a graph is the largest shortest path between any two vertices.

5) Number of peers p_u(j), for j = 1, ..., 5
Since the main focus of this analysis is to identify the most anomalous users, we have chosen this parameter for evaluation in addition to the basic graph properties.

The distribution of the above properties has been illustrated as histograms and is further discussed in the “Experimental Results” section of this paper.
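One way to extract these five attributes for every user and every neighborhood order is sketched below with igraph’s ego-graph facilities, reusing the graph g and the users vertex set from the earlier sketch; this is an illustrative reconstruction of the described computation, not the authors’ code.

# j-th order user subgraph G_u: the induced neighborhood of each user u.
usg_attributes <- function(g, users, orders = 1:5) {
  do.call(rbind, lapply(orders, function(j) {
    subs <- make_ego_graph(g, order = j, nodes = users)
    data.frame(
      user     = users$name,
      order    = j,
      vertices = sapply(subs, vcount),          # v_u(j)
      edges    = sapply(subs, ecount),          # e_u(j)
      density  = sapply(subs, edge_density),    # density of the subgraph
      diameter = sapply(subs, function(s)       # weighted diameter d_u(j)
        diameter(s, weights = E(s)$weight)),
      peers    = sapply(subs, function(s)       # p_u(j): other users in the subgraph
        sum(!V(s)$type) - 1))
  }))
}

usg <- usg_attributes(g, users)
head(subset(usg, order == 2))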

4.3. Time Dependent Parameters

We believe that it is not sufficient to consider only the above parameters in identifying malicious insiders without the temporal properties. To carry out a complete and comprehensive analysis, the following time-based parameters and personality values have been identified as important input parameters for the anomaly detection algorithm.

4.3.1. Individual logon/logoff events. This parameter can be used in identifying users’ abnormal logon/logoff activities, as most disgruntled insiders tend to commit malicious activities after hours [5]. Identifying users’ baseline behavior on system/device access is an essential part of the malicious insider threat detection problem. For each user, four statistics (minimum, maximum, mean and mode) of the logon and logoff times have been calculated. These values are fed as input parameters to the anomaly detection unit.
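The following base-R sketch shows how these per-user logon and logoff statistics could be derived from the logon data frame read in Section 3; binning to the nearest hour for the mode is our own simplification for illustration.

# Time of day in hours, so min/max/mean/mode are comparable across days.
logon$tod <- as.numeric(format(logon$date, "%H")) +
             as.numeric(format(logon$date, "%M")) / 60

# Mode of (hour-binned) times; the hour resolution is an assumption.
mode_of <- function(x) as.numeric(names(which.max(table(round(x)))))

logon_stats <- do.call(rbind, lapply(split(logon, logon$user), function(d) {
  on  <- d$tod[d$activity == "Logon"]    # assumes every user has both event types
  off <- d$tod[d$activity == "Logoff"]
  data.frame(user = d$user[1],
             min_logon  = min(on),  max_logon  = max(on),
             mean_logon = mean(on), mode_logon = mode_of(on),
             min_logoff  = min(off),  max_logoff  = max(off),
             mean_logoff = mean(off), mode_logoff = mode_of(off))
}))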

4.3.2. Removable media usage events. Removable media is among the most popular methods used in theft of Intellectual Property (IP) and extraction of confidential information from organizations [5]. Tracking the use of removable media can therefore be an excellent information source for identifying suspicious events by trusted insiders. The baseline behavior of removable media usage is captured by the minimum, maximum, mean and mode times of “Insert” and “Remove” activities, as in the logon/logoff event analysis. The time gap between consecutive “Insert”/“Remove” actions has also been identified as a good source of information for capturing large file downloads. The daily number of files copied by an individual is also used in this analysis.

4.3.3. Web Access Patterns. We can think of users’ online behavior as a reflection of their offline behavior, as they tend to publish their feelings, thoughts, likes and dislikes through social media. In addition, web access patterns are also a good indication of their online behavior. Disgruntled insiders tend to access competitors’ websites and recruitment agency websites to understand and gather information on potential opportunities. We have identified that the analysis of users’ online behavior would be more comprehensive if we included multiple social media data sources along with web access records. However, due to the limitations of data availability in these domains, we restrict ourselves to web access records for this research work. Also, it is evident that the content of web pages may have a direct link with users’ suspicious behavior. However, we again exclude content analysis of web pages from this analysis and consider it a future direction of our continuing work.

4.4. Personality Parameters

Psychometric data: Psychological behavior is one of the other aspects linked to insider attacks. Sudden behavioral changes can be indications of misuse of privileges. Verbal behavior, personality traits, unauthorized absences and aggressive behavior are a few indicators which can be considered small markers that come before the big attack. Therefore, nontechnical behavioral changes are captured through the psychometric data provided in the dataset.

Table 2 is a summary of all the properties we have identified in this analysis to consider as input parameters to the isolation forest algorithm to separate the most anomalous users.


TABLE 2. Selected parameter set

Module                      Parameter
Graph                       Degree of vertex
Sub Graphs                  Vertex Count
                            Edge Count
                            Density
                            Weighted Diameter
                            Number of Peers
Logon/Logoff Events         Minimum/Maximum Logon Time
                            Mean/Mode Logon Time
                            Minimum/Maximum Logoff Time
                            Mean/Mode Logoff Time
Removable Media             Minimum/Maximum Insert Time
                            Mean/Mode Insert Time
                            Minimum/Maximum Remove Time
                            Mean/Mode Remove Time
                            Maximum number of daily file copies
                            Mode of number of daily file copies
Web Access Patterns         Number of Unique URLs
Psychometric Observations   O (Openness to experience)
                            C (Conscientiousness)
                            E (Extroversion)
                            A (Agreeableness)
                            N (Neuroticism)

Due to space limitations, we have listed some pairs of parameters on the same line of the table; e.g., the line “Minimum/Maximum Insert Time” covers two parameters, “Minimum Insert Time” and “Maximum Insert Time”.

4.5. Anomaly Detection

Due to the complex nature of the insider threat problem, it is extremely hard to pinpoint a user as a malicious insider. Therefore, the first step should be the identification of possible malicious insiders who deviate maximally from their peers as well as from their own normal behavior. Hence, as the second stage of our analysis, we focus on implementing an anomaly detection algorithm based on the important graphical properties and time dependent properties identified at the previous stage of this analysis. The anomaly detection algorithm adopted in this analysis is the “isolation forest” algorithm, which stands out in effectively separating anomalous events from the rest of the instances [11].

Isolation Forest Algorithm - iForest: The isolation forest algorithm is a model-based approach which explicitly isolates anomalies without constructing a typical profile instance. Linear time complexity with a low constant and low memory requirements drive us to use it in our experiments, since an enormous amount of information needs to be analyzed in the field of insider threat. The use of the isolation forest algorithm for this work is part of the overall research effort within our research group at RMIT University and CA Pacific, with the details presented in [16], where it is applied to a very large enterprise system for anomaly detection.

TABLE 3. Summary of number of parameters

Property                    Number of input parameters
Graph properties            1
Sub graph properties        25
Logon/Logoff behavior       8
Removable media usage       10
Web access patterns         1
Psychometric observations   5

This algorithm also performs well with a large number of irrelevant attributes, and in cases where the training data set does not contain any anomalies.

This method generates an ensemble of iTrees for a given dataset, and the instances with a short average path length over the iTrees are considered to be anomalies. If the calculated anomaly score s is very close to 1, the instance can be regarded as a definite anomaly. Instances with s much smaller than 0.5 can be considered normal. If all the instances return s ≈ 0.5, then the entire sample is deemed to have no distinct anomalies.
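For reference, the standardized iForest score of Liu et al. [11] is s(x, n) = 2^(−E(h(x)) / c(n)), where E(h(x)) is the average path length of instance x over the trees and c(n) = 2H(n−1) − 2(n−1)/n is the average path length of an unsuccessful search in a binary search tree over n instances, with H(i) ≈ ln(i) + 0.5772156649. The small R sketch below implements this scoring step; the wrapper is ours, following the published formula.

# Average path length of an unsuccessful BST search over n instances.
c_n <- function(n) {
  if (n <= 1) return(0)
  harmonic <- log(n - 1) + 0.5772156649   # H(i) ~ ln(i) + Euler-Mascheroni constant
  2 * harmonic - 2 * (n - 1) / n
}

# Standardized anomaly score: s close to 1 => likely anomaly, s << 0.5 => normal.
iforest_score <- function(avg_path_length, n) 2^(-avg_path_length / c_n(n))

iforest_score(1, 129)        # isolated after a single split on average: score ~ 0.92
iforest_score(c_n(129), 129) # average-length path: score = 0.5 by construction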

Based on the above-described algorithm, anomaly scores were calculated for each user and for each order of subgraph (1 to 5) separately, based on the five subgraph properties identified in subsection 4.2. In this case the iForest algorithm is executed with 5 input parameters. We believe it is more effective to incorporate the parameter values calculated for the different orders of subgraphs when calculating anomaly scores. Therefore, anomaly scores corresponding to the subgraph properties have also been computed using the 25 distinct values obtained for the 5 parameters of the 1st, 2nd, 3rd, 4th and 5th order subgraphs. Similarly, anomaly scores corresponding to the graph parameters, time-dependent parameters and personality parameters (as summarized in Table 2) were calculated independently using the iForest algorithm. Finally, an anomaly score for each user is calculated as a combination of all the parameters described in Table 2, in which case the number of input parameters for the algorithm was 50. The breakdown of the number of parameters is summarized in Table 3.
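A minimal sketch of this final scoring run is shown below. It assumes a data frame features with one row per user and the 50 numeric columns of Table 2, and it uses the isotree package as the isolation-forest implementation; the paper does not name the R package or settings it used, so the package choice and arguments here are assumptions for illustration.

library(isotree)

set.seed(1)
iso_model <- isolation.forest(features, ntrees = 100)

# Standardized anomaly scores in [0, 1]; values near 1 flag possible malicious insiders.
anomaly_scores <- predict(iso_model, features, type = "score")

ranked <- data.frame(user = rownames(features), score = anomaly_scores)
head(ranked[order(-ranked$score), ])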

5. Experimental Results

This section is dedicated to a comprehensive discussion of the results obtained through our analysis. The discussion is based on the “Research and Engineering/Engineering” work role, and final results for all three work roles considered in this analysis are summarized in Table 4.

Figure 3 is an illustration of the users’ device access network. Users are represented by spheres while devices are represented by triangles. Vertex size corresponds to the degree of the vertex, which is an indication of the number of devices accessed by a particular user.


Figure 3. Users’ device access network

Figure 4. Degree distribution of users

The width of each edge corresponds to the number of “Logoff” events which occurred during the entire period of the dataset. We continue to use the graphical representation of informational assets as in our previous work [9], as it can be used to precisely indicate interrelationships between informational assets. It is also an efficient means of extracting basic essential parameters from massively dense log data.

5.1. Graph Parameters

The individual degree distribution is illustrated as a histogram in Figure 4. The degree of a user is, in this analysis, the number of devices the user accesses. This figure reveals that the majority of users have a low degree, while a minority of users have a larger degree than the others. We can think of two possible reasons behind the small number of users in the tail of the distribution: either these employees are assigned multiple devices to perform their day-to-day operations, or their behavior is anomalous. As previously mentioned, we should not directly conclude that any of the insiders in the tail of the distribution is suspicious just by looking at the number of devices they accessed. However, we can regard them as high-risk profiles among the others.

5.2. Subgraph Parameters

Figure 5(a) illustrates the histogram of vertex count for the different orders of user subgraphs. These histograms show that the majority of users have a small number of vertices in their subgraphs, while a minority of users have a larger number of vertices, resulting in much more complex user subgraphs. Figure 5(b) is an illustration of the distribution of edge count across the different orders of subgraphs. These values also follow a distribution which is very similar to the vertex count. The subgraph density and the weighted diameter for all USGs are also shown in Figures 5(c) and 5(d) respectively. The density histograms show a similar pattern for most of the cases. Even though we could not find an obvious reason for this type of distribution, we think that subgraph density is an important attribute for this kind of evaluation; therefore, we continued to use that parameter as an input for the ADU. The diameter distribution for lower order subgraphs shows similar behavior with a single peak, while the higher order subgraphs show similar behavior with two distinct peaks. These peaks are an indication of clusters of users who have similar behavioral patterns. The other subgraph property, the number of peers, also indicates two broader groups of users, corresponding to two significant data bars, and a few other small groups of users. These results indicate the significance of higher order subgraph analysis in finding small clues among the enormous amount of data.

5.3. Time Dependent Parameters

This subsection discusses the results obtained for the other time-varying properties, which were identified as some of the other governing parameters of the insider threat problem.

5.3.1. Individual Logon-Logoff Behavior. Figure 6(a) is an illustration of users’ logon behavior for the entire period of the dataset. Looking at this graph, it is evident that the majority of logon activities occur during early office hours, which can be interpreted as the first logon event of the day. There are some other logon events, especially when we consider the “maximum” logon time, which occur during regular working hours and can be treated as logins following screen locks during the day. The logon times to which we need to pay more attention are the events which happen after office hours. We can identify a few users whose minimum and maximum logon times occur late at night, which might be unusual for normal operations. In real world enterprise networks, we can expect system user logon activities during such time periods for scheduled jobs such as backups, log rotations and routine activities. However, if we find such logon activities for non-system (human) users, they need to be further investigated to differentiate between genuine and suspicious activity.

Figure 5. Histograms of subgraph properties: (a) vertex count, (b) edge count, (c) density, (d) weighted diameter, (e) number of peers.

One of the other critical parameters of insider threat detection, the “logoff” behavior of users, is illustrated in Figure 6(b). This graph also shows the mean, mode, minimum and maximum logoff times of each user for the entire period. As can be seen in the graph, the majority of “logoff” events happen during late office hours. As in the case of “logon” behavior, we are concerned about after-hours logoff events, which are abnormal compared to the majority of events that happen during regular business hours.

5.3.2. Removable media usage. Figure 7 is an illustration of users’ removable media usage statistics. Similar to the logon/logoff analysis, the time dependencies of removable media usage have also been investigated. Figures 7(a) and (b) show the maximum, minimum, mean and mode times for USB connect and disconnect events respectively. One important factor noticed through this analysis is that only 37 of the 129 employees (about 29%) from the selected designation used USB file copies, which cannot be considered a typical behavioral pattern among the chosen group. In this case too, events which occur during regular office hours can be regarded as normal, while events that happen after hours can be either suspicious or work related. Also, we have to be vigilant about large file copying after hours as well as during normal business hours. This property could be obtained by analyzing the time gap between consecutive connect and disconnect events, which we have not computed in this exercise. Figures 7(c) and (d) demonstrate the variation of users’ daily number of file accesses. To identify suspicious file copies, we have considered only the maximum and the mode of the number of file copies per day by an individual. If the difference between the maximum and the mode of the number of file copies is unusual, it can be considered a suspicious file download. Looking at the histogram, it is clear that the above difference is less than four for the majority of employees, while a few users deviate from this pattern.

5.3.3. Web Access Patterns. For the completeness of this work, we have selected a single parameter based on individual users’ web access patterns. The distribution of the number of unique URLs accessed by individual users is illustrated in Figure 8. These results also show a few outliers relative to the rest of the group, which can be directed for further investigation. We will explore ways of integrating our previous work [9] to obtain more input parameters based on web access patterns in the continuation of this work.

5.4. Anomaly Detection

Figure 9(a) illustrates the anomaly score distribution of the user base for the different orders of subgraphs and for the combination of all subgraph properties. The anomaly score distribution clearly indicates two major types of users based on the above-proposed subgraph properties. The users in the tail of the distribution have small anomaly score values, which do not change with the order of the subgraph. However, the other set of users, who are in the main segment of the distribution, have higher anomaly scores that vary with the order of the subgraph. Based on the isolation forest algorithm, users with anomaly scores very close to 1 can be considered definite anomalies, while instances with anomaly scores much smaller than 0.5 are safe to consider as typical cases.

Figure 6. Users’ logon and logoff behavior: (a) logon behavior, (b) logoff behavior (minimum, maximum, mean and mode times per user).

Figure 7. Removable media usage behavior: (a) USB connect, (b) USB disconnect, (c) USB file transfer statistics, (d) difference between maximum and mode number of USB file transfers.

5.5. Parameter Dependency

Figure 9(b) is an indication of how the anomaly scores are distributed across the different input parameter groups chosen in this analysis. The dense graph shows a few points above the red horizontal line, which corresponds to an anomaly score of 0.8. Users belonging to those points can be considered anomalous users.

Figure 8. Distribution of unique URLs accessed by users

Figure 9. Anomaly score distribution: (a) subgraph properties, (b) all properties.

To get a better understanding of the results illustrated in Figure 9(b), the distribution of users with respect to anomaly scores computed based on all identified parameters is shown as a histogram in Figure 10. We find that the majority of users have anomaly scores in the interval [0.4, 0.7], which can be considered to be normal, while a minority of users have anomaly score values above 0.7, which can be regarded as suspicious. The single outlier with an anomaly score value higher than 0.8 can be tagged for further investigation.

In the case of insider threat detection and prevention, the priority is on the isolation of suspicious users from the rest of the workforce. Since this was the major intention of this work, we have used the following technique to validate the above results, by computing the percentage of suspicious users based on the different parameters identified. To perform this, the calculated anomaly scores are mapped into a binary vector (0, 1) based on a predefined threshold value for each parameter. The threshold value for each parameter is selected as (maximum value − 0.1), keeping a margin of 10% as in most of the experimental cases. In the particular case of anomaly scores of ≈ 0.5 for the entire dataset, we exclude that parameter from the validation process, as it does not contribute much to finding suspicious activities. For all the other cases, anomaly scores are mapped as described above. We propose the following two methods to check parameter dependencies, as discussed below.

Case I: In this approach, the percentage of suspicious users is calculated considering the predicted number of anomalous parameters at a time, from 1 to 6. The calculated percentages are summarized in Table 4. These percentage values reveal that more than 79% of users can be considered to have normal behavior, while the others show suspicious behavior. We also note that there are no users who are flagged as suspicious on more than three parameters at once. The complex nature of the insider threat problem can govern this type of result. A sketch of this mapping is given after Case II below.

Case II: In this method, we propose looking at all possible combinations of parameters from the six main categories selected for this analysis. Given the results of Case I above, we do not expect a significant change in the results. However, we believe this would be a possible approach to the identification of parameter dependencies.
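The Case I mapping can be sketched in base R as follows, assuming a data frame score_by_module with one row per user and one anomaly-score column per parameter category (graph, subgraph, logon/logoff, removable media, web, psychometric); the helper names and the 0.05 tolerance used to detect the “all scores ≈ 0.5” case are our own illustrative choices.

# Binarize one category: 1 if the user's score is within 0.1 of that category's maximum.
flag_suspicious <- function(scores) as.integer(scores > max(scores) - 0.1)

# Drop categories whose scores sit at ~0.5 for every user (no distinct anomalies).
informative <- sapply(score_by_module, function(s) any(abs(s - 0.5) > 0.05))
flags <- as.data.frame(lapply(score_by_module[informative], flag_suspicious))

# Case I: percentage of users flagged by 0, 1, ..., 6 categories simultaneously.
n_flagged <- rowSums(flags)
round(100 * table(factor(n_flagged, levels = 0:6)) / nrow(flags), 2)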

6. Conclusions and Future Work

In this paper, we have introduced a framework based on graphical and anomaly detection approaches for identifying potential malicious insiders. This model generates anomaly scores based on different input parameters for each user. Considering the nature of insider attacks, a user can be deemed suspicious even if only a single parameter has been found to be suspicious. We have adopted graph and subgraph properties and statistical methods in generating input parameters for the anomaly detection algorithm from multi-domain real world information. Empirical results reveal the importance of the selected properties in combating this patient and smart attack. We also found that more than 79% of users show common behavioral patterns, while the rest of the group shows suspicious behavior based on different parameters. Users belonging to the minority group can be tagged and directed for further investigation.

In the continuation of this work, we will focus on integrating as many input parameters as possible to improve the effectiveness of the proposed framework. These would include social network data, which can be considered a good source of online behavior, and other statistical inputs from email and instant messaging communications. We believe this framework would be much more useful if we could include data from content analysis of other sources such as websites accessed by users, emails sent/received by users and file transfers to removable media. We will also focus on integrating several other graphical parameters and a few other statistical parameters, such as the time gap between USB insert/remove actions, in expanding this framework.


TABLE 4. Statistics on parameter dependency

                                                   Percentage of users corresponding to number of anomalous parameters (%)
Functional unit             Department             0       1       2      3      4      5      6
Research And Engineering    Engineering            83.72   16.28   0.00   0.00   0.00   0.00   0.00
Research And Engineering    Software Management    82.18   15.84   0.99   0.99   0.00   0.00   0.00
Research And Engineering    Research               79.21   17.82   2.97   0.00   0.00   0.00   0.00

Figure 10. Anomaly score distribution of users

Also, we will explore the possibility of integrating the similarity-based clustering mechanism, which we introduced in our previous work on web access pattern analysis [9], with the proposed model. Another focus is on extending the analysis to the entire dataset and validating the results against the two insider threat cases simulated in the dataset. The other main aspect we are looking towards is the use of temporal properties in graph analysis, to incorporate the time factor instead of statistical analysis. Finally, the main goal of this work, as well as our previous and future work, is to formulate an efficient, scalable and automated insider threat detection and prediction framework.

Acknowledgments

The authors were supported by Australian Research Council (ARC) linkage grants LP110200321 and LP140100698. In addition, Anagi Gamachchi is supported by an ATN (Australian Technological Network) Industrial Doctoral Training Center (IDTC) PhD scholarship.

References

[1] “By the numbers: Cyber attack costs compared,” 2016, accessed on 31/05/2016. [Online]. Available: http://www.csoonline.com/article/3074826/security/by-the-numbers-cyber-attack-costs-compared.html

[2] Q. Althebyan and B. Panda, “A Knowledge-Base Model for Insider Threat Prediction,” in Information Assurance and Security Workshop, 2007. IAW ’07. IEEE SMC, 2007, pp. 239–246.

[3] V. H. Berk, G. Cybenko, I. G. d. Souza, and J. P. Murphy, “Managing malicious insider risk through bandit,” in System Science (HICSS), 2012 45th Hawaii International Conference on, Jan 2012, pp. 2422–2430.

[4] O. Brdiczka, L. Juan, B. Price, S. Jianqiang, A. Patil, R. Chow, E. Bart, and N. Ducheneaut, “Proactive Insider Threat Detection through Graph Learning and Psychological Context,” in Security and Privacy Workshops (SPW), 2012 IEEE Symposium on, 2012, pp. 142–149.

[5] D. Cappelli, A. Moore, and R. Trzeciak, The CERT Guide to Insider Threats: How to Prevent, Detect, and Respond to Information Technology Crimes (Theft, Sabotage, Fraud). Addison-Wesley Professional, 2012.

[6] Y. Chen, S. Nyemba, W. Zhang, and B. Malin, “Specializing network analysis to detect anomalous insider actions,” Security Informatics, vol. 1, no. 1, p. 5, 2012.

[7] D. Cappelli, A. Moore, R. Trzeciak, and T. J. Shimeall, “Common Sense Guide to Prevention and Detection of Insider Threats, 3rd Edition,” 2009.

[8] W. Eberle and L. Holder, “Applying graph-based anomaly detection approaches to the discovery of insider threats,” in Intelligence and Security Informatics, 2009. ISI ’09. IEEE International Conference on, 2009, pp. 206–208.

[9] A. Gamachchi and S. Boztas, “Web access patterns reveal insiders behavior,” in 2015 Seventh International Workshop on Signal Design and its Applications in Communications (IWSDA), Sept 2015, pp. 70–74.

[10] A. D. Kent, L. M. Liebrock, and J. C. Neil, “Authentication graphs: Analyzing user behavior within an enterprise network,” Computers & Security, vol. 48, pp. 150–166, Feb 2015.

[11] F. T. Liu, K. M. Ting, and Z. H. Zhou, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining, Dec 2008, pp. 413–422.

[12] K. Nance and R. Marty, “Identifying and Visualizing the Malicious Insider Threat Using Bipartite Graphs,” in System Sciences (HICSS), 2011 44th Hawaii International Conference on, 2011, pp. 1–9.

[13] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2013. [Online]. Available: http://www.R-project.org/

[14] M. Smith, N. Milic-Frayling, B. Shneiderman, E. Mendes Rodrigues, J. Leskovec, and C. Dunne, “NodeXL: a free and open network overview, discovery and exploration add-in for Excel 2007/2010.” [Online]. Available: http://nodexl.codeplex.com

[15] “CERT Insider Threat Data Set,” Software Engineering Institute, Carnegie Mellon University, CERT Division and ExactData LLC. [Online]. Available: https://www.cert.org/insider-threat/tools/

[16] L. Sun, S. Versteeg, S. Boztas, and A. Rao, “Detecting anomalous user behaviour using an extended Isolation Forest algorithm: An enterprise case study,” Sept 2016, arXiv:1609.06676.

[17] S. Zeadally, B. Yu, D. H. Jeong, and L. Liang, “Detecting Insider Threats: Solutions and Trends,” Information Security Journal: A Global Perspective, vol. 21, no. 4, pp. 183–192, 2012.
