

Digital Investigation 15 (2015) 83–100


An ontology-based approach for the reconstruction and analysis of digital incidents timelines

Yoan Chabot a, b, *, Aurélie Bertaux a, Christophe Nicolle a, Tahar Kechadi b

a CheckSem Team, Laboratoire Le2i, UMR CNRS 6306, Faculté des Sciences Mirande, Université de Bourgogne, BP47870, 21078 Dijon, France
b School of Computer Science & Informatics, University College Dublin, Belfield, Dublin 4, Ireland

Article info

Article history:
Received 15 March 2015
Received in revised form 13 July 2015
Accepted 14 July 2015
Available online 3 August 2015

Keywords:
Digital forensics
Event reconstruction
Forensic ontology
Knowledge extraction
Ontology population
Timeline analysis

* Corresponding author. CheckSem Team, Laboratoire Le2i, UMR CNRS 6306, Faculté des Sciences Mirande, Université de Bourgogne, BP47870, 21078 Dijon, France. E-mail address: [email protected] (Y. Chabot).

http://dx.doi.org/10.1016/j.diin.2015.07.005
1742-2876/© 2015 Elsevier Ltd. All rights reserved.

Abstract

Due to the democratisation of new technologies, computer forensics investigators have to deal with volumes of data that are becoming increasingly large and heterogeneous. Indeed, on a single machine, hundreds of events occur per minute, produced and logged by the operating system and various pieces of software. Therefore, the identification of evidence, and more generally the reconstruction of past events, is a tedious and time-consuming task for investigators. Our work aims at automatically reconstructing and analysing the events related to a digital incident while respecting legal requirements. To tackle the three main problems involved (volume, heterogeneity and legal requirements), we identify seven criteria that an efficient reconstruction tool must meet. This paper introduces an approach based on a three-layered ontology, called ORD2I, to represent any digital event. ORD2I is associated with a set of operators to analyse the resulting timeline and to ensure the reproducibility of the investigation.

© 2015 Elsevier Ltd. All rights reserved.

Introduction

Nowadays, digital investigations require the analysis of a large amount of heterogeneous data. The study of these volumes of data is a tedious task which leads to cognitive overload due to the very large amount of information to be processed. To help investigators, many tools have been developed. Most of them extract unstructured data from various sources without bridging the gap of semantic heterogeneity and without addressing the problem of cognitive overload. To resolve these issues, a promising perspective consists of using a precise and reliable representation that structures data on the one hand and standardises their representation on the other. A structured and formal knowledge representation has two goals: 1) to build automated processes more easily by making information understandable by machines and 2) to give investigators an easy way to query, analyse and visualise information. Computer forensics investigations also have to fulfil a set of legal rules to ensure the admissibility of results in a court. It is particularly necessary to ensure that all evidence presented at a trial is credible and that the methods used to produce evidence are reproducible and did not alter the objects found at the crime scene. Problems of traceability and reproducibility of reasoning are widely discussed in the literature, and provenance is particularly relevant to digital investigations. Indeed, as defined by (Gil and Miles, 2013), the provenance of a resource is a record describing the entities and processes involved in the creation, dispersion or other activities that affect that resource. Provenance provides a fundamental basis for assessing the authenticity and truth value of a resource and its reproducibility.


In our work, we propose an innovative digital forensic approach based on a knowledge model that accurately represents a digital incident and all the steps used during an investigation to produce each result. In addition, a full set of operators to manipulate the content of this ontology is proposed. We introduce extraction and instantiation operators to automatically build the knowledge base using digital traces extracted from disk images. Automatic analysis operators taking advantage of this ontology are also proposed. In particular, we focus on an operator used to identify potential correlations between events.

This paper is structured as follows: Section 2 gives a comprehensive state of the art of event reconstruction approaches for digital forensics. Section 3 introduces the SADFC approach; in particular, the structure of the three-layered ontology is detailed and its various operators are described. Section 4 evaluates the performance of our approach and illustrates its capabilities on a case study.

State of the art

Event reconstruction is a complex process because of three main issues: the large volume of data, the heterogeneity of the information due to the use of a large number of sources, and the legal requirements that the results have to meet. This section reviews the existing solutions described in the literature to reconstruct and analyse past events. The quality and relevance of nine reconstruction approaches are reviewed with respect to these three issues and their limitations. These observations are then synthesised and used to guide the development of our approach, to ensure that the three main issues of event reconstruction are taken into account.

Data volume

Currently, investigators are facing huge data volumes on a digital crime scene. The growth of the data size is caused by several factors. Digital devices are more and more present in our daily lives. This increases the number of devices owned by each person and therefore the number of devices found at crime scenes. In addition, the frequency of use and the increase in the storage capacity of digital devices have caused an increase in the quantity of data stored on each device. This very large amount of data makes the analysis complex and tedious, even causing cognitive overload. Information that is potentially relevant to the objectives of an investigation is diluted in the mass of data, making the investigation difficult. For example, the Plaso toolbox (Gudhjonsson, 2010), which produces timelines from hard disk images, can identify thousands of events from a wide range of sources (Apache logs, Skype conversations, Google Chrome history, Windows event logs, etc.) in an image of only a few gigabytes. In conclusion, an event reconstruction approach should be able to efficiently process information and to retrieve and visualise the results in a clear and intuitive way.

In the literature, a large number of approaches provide tools to automatically extract information and populate a central storage constituting the timeline. ECF (Chen et al., 2003) is an approach aiming to extract, store in a database and manipulate events from heterogeneous sources. It introduces a set of extractors to collect events and store them in a database, from which a temporally ordered sequence of events can quickly be generated. Such automatic extractors, a widely used concept, can also generate the timeline, as in FORE (Schatz et al., 2004), FACE (Case et al., 2008), CyberForensic TimeLab (Olsson and Boldt, 2009), Plaso and PyDFT (Hargreaves and Patterson, 2012). However, in some approaches, including (Gladyshev and Patel, 2004) and (James et al., 2010), the lack of automation seems difficult to address and they present very high complexity (combinatorial explosion).

The automated extraction of events from a large number of sources leads to the creation of large timelines that are difficult to read, interpret and analyse. To assist the investigators during this phase, event reconstruction approaches should provide analysis tools to carry out all or part of the reasoning, and visualisation of the data in a clear and intuitive way. The use of a textual representation or a database is not suited to building high-performance analysis processes. In approaches such as ECF, the reasoning capabilities are limited by the data structure used for the storage of information.

ECF stores data in a database consisting of a table of common information about events and tables of information specific to each type of event. One of the objectives of ECF is to provide a canonical form for representing events uniformly regardless of their sources. Adopting a generic information level and a specialised information level is also done in CybOX (Barnum, 2011). This conceptual separation allows a canonical representation of events to be introduced that facilitates information processing (analysis tasks for example) while preserving the specificities of each event. However, the use of a database does not allow full advantage to be taken of this feature. Unlike ontologies, databases cannot explicitly represent the semantics of data, which constrains the understanding of the data by analysis algorithms. The FORE approach (Schatz et al., 2004) proposes a correlation tool based on rules to identify causal relationships between events (the event A causes the event B if the event A has to happen before the event B). Despite the relevance of this tool, the use of a rule-based system requires the rules to be defined beforehand. The construction of such a set of rules is a tedious task and cannot take into account all cases. Event reconstruction approaches must implement algorithms that can adapt to any kind of situation, even those unknown to the investigators. It is, therefore, necessary to develop analysis tools that do not rely on rules defined by the user. In (Gladyshev and Patel, 2004), the reconstruction process is seen as a process of finding the sequences of transitions that satisfy the constraints imposed by the evidence. The authors perform the event reconstruction by representing the behaviour of the system as a finite state machine. Scenarios that do not match the collected evidence are then removed. After reducing the number of potential scenarios, a backtracking algorithm is used to extract all possible scenarios. However, the lack of automation does not allow complex cases to be handled. Indeed, the investigation of a single computer may involve several processes such as web browsers, the file system, instant messaging software, etc. Thus, the representation of such a system with a finite state machine does not seem possible. Second, the use of a finite state machine is very time consuming in real cases. (James et al., 2010) propose to convert the finite state machine into a deterministic finite state machine to limit the exponential growth of the size of the machine and therefore the number of scenarios to examine during the backtracking algorithm. Despite the reduction in size of the state machine, the experiments show that the approach is still not usable on real forensic cases. To allow the handling of large volumes of data, (Khan et al., 2007) introduce a neural network-based approach using footprints left on a machine to detect software activities. The main motivation of this work is to show the ability of machine learning techniques to process large datasets. The proposed tool is able to extract footprints from various sources including log files and the registry. A preprocessor is then used to make the data usable by the neural network, which is then used to identify launched software. Unfortunately, the performance of the proposed tool is poor. The training of the neural network and the need to use it several times to get a complete scenario make the use of this tool time-consuming. In (Case et al., 2008), an approach called FACE, for Forensics Automated Correlation Engine, focussing on the interpretation and the analysis of the data was proposed. It aims at collecting data from various sources and then identifying correlated events. At the end of the process, the investigators get a report describing the events, the objects and the logical relationships between events and between events and objects. The main contribution of this work is the introduction of a correlation tool handling a part of the analysis. PyDFT (Hargreaves and Patterson, 2012) is a system allowing high-level events to be reconstructed from low-level events (which are extracted from information sources by log2timeline or Zeitline) in order to improve the readability of the final timeline. The authors state that the number of events generated by super-timeline approaches makes the visualisation, and therefore the analysis, of the timeline complex. To facilitate the reading of the timeline, this tool produces a timeline summary. Once the low-level timeline is built, patterns are searched for and their corresponding high-level events are then added accordingly. The summarisation of a timeline is a relevant functionality as it facilitates the reading of the timeline and, by extension, its analysis. A similar system is presented in (Turnbull and Randhawa, 2015). In this work, a rule-based tool, called ParFor, identifies events from low-level artefacts extracted from disk images.

Concerning visualisation, some approaches offer ergonomic ways to access information, like (Case et al., 2008), where reports describing the entities and the relationships between them are generated by the system. Zeitline (Buchholz and Falk, 2005) is an editor for creating a timeline from multiple sources of information. The tool does not offer automatic analysis processes, but invites the investigators to handle the aggregation of events themselves to create high-level events. The interface of the tool allows new events to be added to a given timeline, several events to be aggregated into a complex event, or specific events to be searched for using a keyword-based query tool. In (Case et al., 2008), the proposed tool is able to generate a report that offers different views on events and objects, in addition to hyperlinks between them, to make the timeline easier to read and more intuitive. In (Olsson and Boldt, 2009), a system called CyberForensic TimeLab was proposed. The focus of this work is on the design of a system to view and navigate the data related to an investigation in an intuitive way, in order to discover evidence more easily. The timeline is displayed in a graphical browser, which is the main added value of this approach as it constitutes an improvement in ergonomics.

Many tools do not offer an intuitive interface but only a query tool, which is a powerful but complex and tedious way to access information. This is the case for approaches using databases and providing a SQL query interface, which is efficient but not intuitive for an untrained user.

To handle large volumes of data, we conclude that event reconstruction approaches must meet the following requirements:

• The automation of certain tasks, which are becoming too complex to be carried out manually. In order to produce tools able to process large volumes of data, it is necessary to automatically reconstruct and analyse timelines. The design of automated tools requires a data representation that is understandable by machines (structured, formal and with clearly defined and complete semantics). Therefore, we evaluate the level of structure of the data, which enables its analysis by machines, in addition to the availability of mechanisms facilitating automation.

• The availability of tools for interpreting and analysing information and drawing conclusions. The aim of analysis tools is to highlight relevant information with a view to guiding the investigators. This includes making the timeline easier to read by filtering data or summarising the timeline as proposed in (Abbott et al., 2006) and (Hargreaves and Patterson, 2012), identifying correlations between events as in (Schatz et al., 2004) and (Case et al., 2008), and producing conclusions from the knowledge contained in the timeline.

• The availability of visualisation tools allowing data to be browsed in an efficient, clear and intuitive way. In addition to the availability of the data, the model must also allow data and information to be accessed easily. In particular, models should allow the use of search and visualisation tools to present the data in an understandable and intuitive form.

Heterogeneity

The second issue is heterogeneity, since information is spread across the digital crime scene. During an investigation, many sources of information can be used: web browser histories, Windows event logs, the file system, logs of various software, etc. In order not to miss information and to obtain a real and accurate picture of the incident, scenario reconstruction approaches must be able to extract information from all of these sources and process it appropriately. It should be noted that there are several kinds of heterogeneity: syntactic, semantic and temporal. In this work, we study the techniques related to the first two, as they are both related to our desire to reduce cognitive overload by providing a clear and complete view of past events.

A large part of the existing approaches is able to deal with multiple and heterogeneous sources, including Windows event logs, web browser histories, Apache server logs, file metadata, instant messaging software, the registry, memory dumps, network activities and user configuration files. The handling of heterogeneous sources requires, on the one hand, the development of automated extraction of information specific to each source and, on the other hand, a sufficiently complete knowledge model to represent all aspects of a digital incident. In (Chen et al., 2003), the model introduced in ECF defines the characteristic dimensions needed to represent forensic events, such as a temporal dimension, one related to the objects used by events and one related to the protagonists involved in events. Unlike the FORE approach (Schatz et al., 2004), the ECF model does not define relationships between entities (subject, object, event). For example, it allows the fact that an event interacts with an object to be modelled, but the nature of this interaction cannot be defined accurately. FORE (Schatz et al., 2004) is an approach introducing a knowledge representation model for the events occurring during an incident. In addition, a set of extractors able to extract knowledge from sources such as Apache server logs and Windows 2000 logs to populate the proposed ontology was introduced. The FORE ontology consists of classes that represent the notion of Entity (i.e. the objects composing the world) and the notion of Event (i.e. the change of state of an object over time). The Event class may itself be inherited by other classes in order to describe the different types of events that can occur on a machine. A number of attributes are used by the classes inheriting from the Event class to provide information about the user or the process that produced the event, the files involved, etc. The use of an ontology to represent events is an efficient way to deal with heterogeneity issues and allows the introduction of semantically rich models that are able to capture the semantics of all entities and the relationships between them. The precise and formal description of the components of the model can significantly increase analysis capabilities by enabling machines to understand the meaning of the data (in contrast to unstructured formats). The Plaso toolbox (Gudhjonsson, 2010) (http://plaso.kiddaland.net/) can handle a large number of information sources and is able to generate super-timelines (timelines integrating many sources of events) from a wide range of sources. Among the tools proposed in Plaso, log2timeline extracts events from a disk image and psort can be used to format the result produced by log2timeline as a text file, a CSV file, a database, etc. The output format is composed of a limited number of fields to store the date and time of events, the source that was used for the extraction of the event and a message describing the event. Using a small number of features offers better performance than more complex models. However, the features introduced in these data formats do not allow any event occurring in the system to be represented accurately. Moreover, the use of a text format does not allow the relationships between entities to be shown explicitly and simply. Data formats offer interesting performance for simple use-cases but are limited in terms of expressiveness for complex use-cases. Some approaches suffer from an insufficient number of sources, which can lead to a loss of relevant information. In addition, most of the proposed models are not complete enough to accurately describe an incident.

To deal with the problems of heterogeneity, an approach must meet the following points:

• The completeness of the knowledge model. The proposed model must be complete enough to accurately represent the events that occur during an incident. In other words, the model must provide a sufficiently elaborate vocabulary to fully represent the entities related to a digital incident, their characteristics and the relationships between them.

• The implementation of mechanisms (e.g., parsers) to process heterogeneous sources.

Legal requirements

The last issue we focus on is the legal requirements that any digital forensic approach must meet. Indeed, the results produced as outputs of an approach must have certain characteristics, including credibility, integrity and reproducibility (Baryamureeba and Tushabe, 2004). The challenge is to ensure that the results are admissible in a court of law.

Few approaches are able to explain how their results are obtained. Traceability is particularly lacking in (Khan et al., 2007). Unlike other machine learning techniques, the use of neural networks does not usually allow the results to be easily interpreted (one of the requirements of justice), as neural networks can be considered black boxes (some parameters used during the learning phase remain unknown). The model proposed in (Hargreaves and Patterson, 2012) allows low-level and high-level events to be stored and is composed of two structures to represent them. The first one includes attributes to model the date, the type of the event and the source used to identify it. The second class has a similar structure; in addition, it includes attributes to memorise how each high-level event is generated from low-level events. This allows information about data provenance to be stored. Only (Gladyshev and Patel, 2004) has theoretical support. As a prelude to their work, the authors argued that a formalisation of the event reconstruction problem is needed to simplify the automation of the process and to ensure the completeness of the reconstruction. The use of finite state machines, a theory well known in the scientific community, gives strong theoretical foundations to this approach.

To meet legal requirements, an event reconstruction approach must meet the following points:

Page 5: An ontology-based approach for the reconstruction and analysis of digital … · 2016-02-23 · An ontology-based approach for the reconstruction and analysis of digital incidents

Y. Chabot et al. / Digital Investigation 15 (2015) 83e100 87

• The integration of information into the model to ensure the traceability of information. Indeed, to meet legal requirements, the evidence produced in a court must meet several criteria, including the credibility, integrity and reproducibility of the digital evidence (Baryamureeba and Tushabe, 2004). The level of credibility of evidence is directly related to the level of accuracy and verifiability of the method and the source of data used to produce the evidence. By detailing the investigation process, modelling the provenance of information can significantly increase the credibility of all the evidence produced. Reproducibility allows independent bodies to reproduce the same conclusions using the same input data and allows the quality of the process to be evaluated. It allows a court to fully understand the path (composed of data manipulation steps, reasoning steps, etc.) used to reach each conclusion presented during a trial. Thus, a model must be able to integrate information on the provenance of the evidence produced during the investigation. This includes storing each step of an investigation, the investigators involved in the activities and the tools or techniques used to produce the information (e.g. extraction from a data source, deduction from two pieces of information already known, etc.).

• In order to produce credible results, a model based on formal theory is needed.

To summarise, we have identified seven criteria necessary to reduce cognitive overload, deal with heterogeneity and meet the legal requirements. These criteria were used to guide the development of an innovative approach based on an ontology, which is presented in the following section.

An ontology-based approach for event reconstruction

To fulfil the seven criteria highlighted in Section 2, we introduce an approach based on an ontology. This section aims to show how this approach meets the problem requirements and is structured as follows. Section 3.1 gives an overview of the proposed approach and justifies the conceptual choices that have guided its development. Section 3.2 introduces the concepts and relationships of the ontology accurately representing a digital incident. Section 3.3 presents the tools implemented to extract knowledge from a crime scene and populate the ontology accordingly. Finally, Section 3.4 and Section 3.5 respectively introduce the tools used to carry out the reasoning-related aspects and the interfaces for the visualisation of the results.


Overview

To process the large amount of data collected during an investigation, there is a need to develop automatic processes to assist investigators in the processing and interpretation of data. Currently, a large number of forensic tools (Bodyfile, Mactime (Farmer and Venema, 2004)) work on unstructured data such as plain text, which does not allow advanced analysis processes to be built. This is mainly due to the lack of semantic information. A promising perspective is to use a precise and reliable representation that structures data on the one hand and standardises their representation on the other. Introducing a structured and formal knowledge representation pursues two goals: to build automated processes more easily by including semantic information, and to make data easy to use and understandable by users (graph visualisation, query tools, etc.). The main component of the proposed approach is an ontology implemented using OWL 2 (http://www.w3.org/TR/owl2-overview/), a language based on description logic. An ontology is a model that represents the knowledge of a given domain by structuring it using entities, relationships and logical constraints. Thus, ontologies are able to accurately represent the knowledge generated during an investigation (knowledge about footprints, events, protagonists, etc.). (Gruber, 1993) defined an ontology as "an explicit, formal specification of a shared conceptualisation". Unlike more rudimentary data formats, an ontology can represent relationships between entities in addition to the underlying logic of the data. The explicit and formal nature of an ontology facilitates the design and use of interpretation and analysis tools (inference of new information, checking of knowledge consistency, etc.). Therefore, the ontology is the data structure best suited to the requirements mentioned above, because its inherent characteristics answer the criteria of soundness, ease of automation and ease of use.

The ontology ORD2I (Ontology for the Representation of Digital Incidents and Investigations) is implemented using the OWL 2 RL profile (http://www.w3.org/TR/owl2-profiles/), a subset of OWL 2 DL, a language based on the SROIQ(D) description logic. The use of this profile is motivated by several reasons, including the availability of sufficient expressiveness to model digital forensic cases. OWL 2 RL constrains some aspects of the language (class expressions in particular) to ensure the decidability and the (polynomial) execution time of rule-based reasoning (Motik et al., 2009). The need to operate on large volumes of data and the desire to provide powerful inference and analysis tools make OWL 2 RL pertinent for implementing an ontology for the representation of digital incident timelines.

The OWL 2 DL language also gives credibility to our model, as it has strong theoretical foundations. Indeed, this ontology language is based on description logic, which formally defines the semantics of concepts and relationships. The components of the ontology and the logic associated with each of them are therefore mathematically defined, which allows the coherence and consistency of the knowledge to be checked. In addition, the ontology presented in this paper is based on the formal definitions given in (Chabot et al., 2014).
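As an illustration of the kind of axioms involved (the axioms below are simplified examples written by us, not ORD2I's actual definitions), description logic allows statements such as "every executable file is a file", "only events create things" and "nothing is both an event and an object" to be written formally:

ExeFile ⊑ File    ∃creates.⊤ ⊑ Event    Event ⊓ Object ⊑ ⊥

A reasoner can then use such axioms to classify individuals and to detect inconsistencies, for example if the same individual were asserted to be both an event and an object.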

Ontologies are also very easy to manipulate due to the use of tools such as SPARQL (http://www.w3.org/TR/rdf-sparql-query/), a query language designed to query knowledge graphs. Because the structure is in the form of triples <subject, predicate, object>, it is possible to visually represent ontologies as graphs. This kind of graphical representation is very intuitive and clearer than a textual representation of the data. Finally, as stated in Gruber's definition, ontologies are designed to help build a common view of a domain, which means that this kind of structure is particularly relevant to build a consensus on the representation of events among digital forensics investigators and software developers.

Fig. 1. Overview of the ontology-based approach made of four layers (boxes in light grey) and fourteen modules (white boxes with solid black lines: operational modules, white boxes with dashed black lines: modules not yet implemented, hatched boxes: modules from the Plaso toolbox; data streams are represented using solid black arrows).

Our overall system consists of an ontology and four layers: knowledge extraction, ontology management, knowledge analysis and visualisation of the final results (Fig. 1). First, the Extraction Layer provides tools to extract knowledge from an image seized at a crime scene. In order to process heterogeneous and large information sources, the extraction process takes advantage of Plaso, a toolbox that automatically extracts information from a large number of sources including the Windows registry, browser histories, file systems and others (e.g. Skype, Google Drive, Java IDX, etc.). The extracted knowledge is then used to populate the ontology. The Knowledge Layer is used to filter the extracted knowledge and to instantiate the ontology with it. This layer also provides mechanisms to manage traceability information. After the instantiation of the ontology, the Reasoning Layer is used to help analyse and interpret the knowledge about the incident. This layer, on the one hand, deduces new facts using knowledge about the incident (Enhancement) and, on the other hand, analyses this knowledge and draws conclusions. Finally, the Interface Layer provides visualisation tools and a query tool to display the data in an efficient and intuitive way.

ORD2I: an Ontology for the Representation of Digital Incidentsand Investigations

The ontology ORD2I is built on four bases:

• Design an ontology that is complete and accurate enough to model any digital incident.

• Integrate generic entities into the ontology to facilitate the analysis (canonical form).

• Take into account specialised knowledge from forensic experts and software developers to integrate the characteristics of each information source.

• Model the provenance of information.

To meet these needs, ORD2I is divided into three layers, each modelling a different kind of knowledge and answering different questions: the Common Knowledge Layer (CKL) addresses the need to integrate generic entities into the ontology, the Specialised Knowledge Layer (SKL) handles specialised knowledge and the Traceability Knowledge Layer (TKL) deals with traceability. Each of these layers is presented below and illustrated in Fig. 2. This figure shows the classes (owl:Class) composing each layer and the properties linking these classes. Object properties (owl:ObjectProperty) are represented using standard solid arrows and inheritance links (rdfs:subClassOf) are represented using the UML generalisation relationship. To enhance the readability of the figure, class attributes and namespaces are not represented.

The proposed ontology is inspired by CybOX (Barnum, 2011) and the PROV-O ontology (Lebo et al., 2013). In the CybOX project (Barnum, 2011), a set of XSD schemas is introduced to represent any entity (process or object, i.e. cyber-observables) observed during an incident, in addition to the interactions between the cyber-observables and the events affecting them. The proposed schemas also allow knowledge from experts to be integrated through a system of objects specialising the abstract notion of an object. The large set of objects offers the ability to represent a PDF file, an HTTP session, a web history, a Windows process, a network connection, etc. CybOX was then enriched by DFAX (Casey et al., 2015), an ontology representing information about the provenance of information. DFAX provides an extra layer to represent forensic actions initiated by the investigators. PROV-O (Lebo et al., 2013) is a W3C recommendation that describes an ontology for representing the provenance of information. PROV-O was used as a starting point for the design of the Traceability Knowledge Layer. This ontology is composed of concepts and relationships allowing a piece of information to be defined, this information to be assigned to a user or an entity, and the process used to produce the information to be represented.

Fig. 2. Overview of ORD2I.

Common knowledge about the incident

The Common Knowledge Layer (CKL) stores common knowledge about the events that occurred during an incident. This layer contains time information about the events, information about their resources and information on the subjects involved in the events. It provides a canonical representation of events using the principle of polymorphism. Due to the variety of sources, all events have a number of specific attributes and a set of common characteristics. For example, if we consider an event produced by an Apache server and another event from the Firefox browser, it is difficult to analyse such events to look for new knowledge because each of them has specific characteristics. An Apache event carries information about the IP address from which the connection is requested, whereas a Firefox event does not need it. However, these two events also have common characteristics (e.g., the time at which the events occurred). All events belong to the same general concept of Event. The use of polymorphism creates a consistent representation for all events without removing the data specific to each type of event (specific data is stored in the Specialised Knowledge Layer (SKL)). This uniform representation allows events to be analysed in the same way. The concepts and relationships within the CKL are illustrated in the bottom right part of Fig. 2. Concerning the CKL, the class ord2i:Entity subsumes the classes ord2i:Event, ord2i:Subject and ord2i:Object. First, the ord2i:Event class models any digital action happening on a machine and is defined by the type of the action performed (e.g. "File Creation", "Webpage Visit", etc.), the date and time when the action occurred, the accuracy and granularity of the date, and a status (e.g. "success", "fail", "error", etc.). Dates are defined using the Interval class of the OWL-Time ontology. An interval is defined by a start and an end, allowing the notion of uncertainty to be represented. For example, when the start time of an event cannot be determined accurately, the use of a time interval allows us to approximate it. The ord2i:Subject, ord2i:Process and ord2i:Person classes are used to model the protagonists involved in the events. A subject can participate in an event (ord2i:isInvolved) or undergo an event (ord2i:undergoes). The ord2i:Object class is used to represent the resources that are used (ord2i:uses), modified (ord2i:modifies), removed (ord2i:removes) or created (ord2i:creates) by events. An object has a state used to define the status of a resource (e.g., exists, does not exist, closed, locked, unlocked, etc.).

Specialised knowledge about the incident

The Specialised Knowledge Layer (SKL) is used to store specialised information about objects. The SKL models technical knowledge about every kind of object that can be found in a cyber-environment. The concepts composing the SKL are illustrated in the left part of Fig. 2. One advantage of the SKL is that it allows technical knowledge about events to be manipulated. For example, IP addresses, file paths and file metadata are all valuable information during analysis. The objective of the SKL is to provide the structure needed for this technical knowledge. The SKL is not intended to be exhaustive, and the use of an ontology allows new classes to be integrated easily or existing ones to be modified. Not all classes and attributes of the SKL appear in the figure, to improve its readability. This layer provides an important range of classes to represent a large number of objects:

• Files (ord2i:File): ord2i:Link, ord2i:ArchiveFile, ord2i:ImageFile, ord2i:PDFFile, ord2i:ExeFile, etc.

• User accounts (ord2i:Account): ord2i:UnixUserAccount, ord2i:WinUserAccount, ord2i:ComAccount.

• Objects related to the Web (ord2i:Web): ord2i:Webpage, ord2i:WebResource, ord2i:Bookmark, ord2i:Cookie, ord2i:Website, etc.

• Objects related to digital communications (ord2i:Communication): ord2i:MMS, ord2i:SMS, ord2i:ChatMessage, ord2i:Call, ord2i:EmailMessage.

• Registry keys (ord2i:RegisterKey).

• etc.

Traceability and reproducibility of the investigation process

The Traceability Knowledge Layer (TKL) stores information about how the investigation is conducted. This includes information about investigative activities, the information used and the agents involved. For example, this layer is used to memorise how each result of the investigation is produced. The aim of this layer is to satisfy some legal requirements. It ensures the reproducibility of the results by storing all actions taken at each stage of the investigation. It also ensures that the conclusions are supported by credible data and evidence. The concepts and relationships within the TKL are illustrated in the upper part of Fig. 2. The class ord2i:InvestigativeOperation represents any task undertaken during an investigation. Each task is characterised by a set of attributes including: the type of technique used (e.g. extraction from an information source, inference of new knowledge, event correlation, etc.), the information source used (e.g. Windows registry, web browser histories, etc.), the date and the place where the task was performed, a description of the task and a numerical value quantifying the degree of confidence in the result of the task. For example, in the case of an operation using information that may have been corrupted or obfuscated by attackers, the truthfulness will be low because the results of the operation are unreliable. Each task belonging to the ord2i:InvestigativeOperation class is used to deduce new entities in order to know what happened in the past, by identifying the events that occurred and the subjects and objects that interacted with them. Therefore, the ord2i:identifiedBy object property models the fact that every entity is identified by an investigative operation (e.g. the task of extracting information from a web history can lead to the identification of a bookmark creation event or the identification of a webpage visited by the user). For some kinds of investigative tasks, investigators have to use information that is already known to deduce new knowledge. The ord2i:isSupportedBy object property models this fact by linking an investigative operation to the information it uses to deduce new knowledge. Each instance belonging to the ord2i:InvestigativeOperation class is linked to the tools (ord2i:Tool) used and the investigators (ord2i:Investigator) involved using, respectively, the ord2i:isPerformedWith property and the association class ord2i:Contribution.

Extraction and instantiation of ORD2I

This section aims to illustrate the instantiation of ORD2I using data extracted from a crime scene. The integration of automated instantiation techniques is critical to process the large volumes of data extracted during an investigation. The method of instantiation used in our approach is a sequential process, starting with the collection of digital traces found on a disk image and ending with the instantiation of the concepts and properties of ORD2I.

Information extraction of digital footprints using the Plaso toolbox

The first step is the extraction of all the information contained in the various sources of the disk image under investigation. During an investigation, many sources can be used to obtain information about what happened. To manage all these sources, the log2timeline tool (Gudhjonsson, 2010), proposed as a component of Plaso, is used. log2timeline collects information from many sources including: sources inherent to the operating system (e.g. registry, file system, recycle bin, etc.); histories, cookies and cache files of web browsers; and files and logs generated by various software including Skype, Google Drive, etc. The result is a dump file containing all the footprints retrieved from the disk image. Then, transforming the result is necessary to make the data usable by downstream processes. For this, the psort tool of Plaso is used. This tool serialises the data produced by log2timeline in many formats including CSV. An example of the output of the tool is given in Fig. 3. This dataset was constructed from the disk image of a virtual machine. Each entry of this file describes an action that has occurred:

• E1: Execution of the process chrome.exe, corresponding to the start of the web browser Google Chrome.

• E2 and E3: Visits of two webpages hosted on the website http://malwareWebsite.com.

• E4: Download of the executable contagioMalware.exe from http://malwareWebsite.com/MalwareDL/contagioMalware.exe, saved locally in the Download folder.

• E5: Execution of the process contagiomalware.exe, corresponding to the use of the malware previously downloaded.

• E6: Visit of the homepage of the website http://www.bbc.co.uk.

• E7: Deletion of the file contagioMalware.exe previously downloaded and launched.

Fig. 3. Data produced by log2timeline and formatted using psort.

Advanced information extraction and serialisation

The serialisation deals with seventeen attributes, including temporal information (date, time and timezone), a description of the source of the information (source and sourcetype), a description of the event and its consequences (MACB, type, short, desc and extra) and information about the tools and the files used during the extraction (version, filename, inode, notes and format). Some fields (date, time, etc.) do not require additional processing as they are highly structured. Other fields, such as the desc field, require more treatment because their content depends on the type and the status of the event. The variability of their content makes the extraction of the knowledge difficult. The size of the timeline produced by Plaso is the second issue. Indeed, a timeline produced using these tools from the disk image of a machine that operated for about thirty minutes contains about 300,000 entries. The analysis of the produced file is mostly done manually (e.g. grep, search by dates, ad hoc analysis tools such as analysis of the browser search history, etc.) by the investigators, and the interpretation of the timeline is, therefore, tedious and complex.

With the aim of facilitating the instantiation of ORD2I from the data produced by Plaso, an intermediate step of information extraction and XML serialisation is used in our process. For this, structured information such as temporal information, information about the user, information about the host and information about the type of the event is first extracted. The information contained in less structured fields, such as desc, is then extracted. The content of the desc field depends on the source of information and the type of event; therefore, a set of regular expressions is defined to properly extract the knowledge. In the case of E4 (see Fig. 3), which corresponds to the download of a file using Google Chrome, a regular expression is used to extract from desc the URL of the downloaded file, the local path used to store it and the size of the file. The next step consists of filtering the collected data to keep only the data relevant to instantiate ORD2I, in order to reduce the amount of data, improve the readability of the visualisation and optimise processing times. After filtering, the footprints are serialised in an XML file, which is a simple and universal way to store the data.
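To make this step concrete, the following Python sketch applies such a regular expression to the desc field of a Chrome download entry. Both the desc string and the pattern are illustrative assumptions; they are not the exact format produced by Plaso nor the exact expressions used in our implementation.

import re

# Illustrative desc value for a Chrome download event (the exact wording
# produced by Plaso may differ).
desc = ("http://malwareWebsite.com/MalwareDL/contagioMalware.exe "
        "(C:/Users/UserX/Downloads/contagioMalware.exe). "
        "Received: 724992 bytes")

# Source- and type-specific pattern capturing the remote URL, the local
# path and the size of the downloaded file.
CHROME_DOWNLOAD = re.compile(
    r"(?P<url>https?://\S+)\s+"
    r"\((?P<path>[^)]+)\)\.\s+"
    r"Received:\s+(?P<size>\d+)\s+bytes")

match = CHROME_DOWNLOAD.search(desc)
if match:
    print(match.group("url"))   # remote URL of the downloaded file
    print(match.group("path"))  # local path used to store it
    print(match.group("size"))  # size of the file in bytes

In a complete extractor, each supported (source, event type) pair would be associated with its own pattern.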

ORD2I instantiation

This step consists of populating the ontology using the result of the extraction process. For each footprint item of the XML file, the CKL and the SKL are populated by creating instances of events, objects and subjects, and links between individuals, according to the formal properties defined in ORD2I. The relationships between an event and an object or a subject are deduced from the type of the event. For example, if a file is sent to the recycle bin, the event representing this action is linked to the file using the property ord2i:removes. In the case of a file download, the related event is linked to the web resource using the property ord2i:uses and to the local file using the property ord2i:creates. Fig. 4 shows a Turtle (http://www.w3.org/TR/turtle/) serialisation of the knowledge related to the event E4 of Fig. 3. The individual ord2i:FileDownload40 (lines 11–18) represents this event and is linked to the individuals ord2i:webResource44 and ord2i:File46. The individual ord2i:webResource44 (lines 31–34), belonging to ord2i:WebResource, represents the downloaded remote resource. ord2i:webResource44 has a location represented by ord2i:Location45 (lines 35–36), allowing the URL and the hostname of the remote resource to be stored. ord2i:FileDownload40 is linked to ord2i:webResource44 using the property ord2i:uses, as the download is performed by using this remote object. ord2i:File46 (lines 38–41), belonging to ord2i:File, is an individual representing the downloaded local file. ord2i:File46 has a location represented by ord2i:Location47 (lines 42–43), allowing its path and its filename to be stored. ord2i:FileDownload40 is linked to ord2i:File46 using the property ord2i:creates. Regarding subjects, Google Chrome, used to carry out the download, is represented by the individual ord2i:Process21 (lines 45–48) and is linked to the event using the property ord2i:isInvolved. The individual ord2i:Person22 (lines 50–53) represents the user who started the download (also linked to the event using the property ord2i:isInvolved).

The knowledge describing how the previous knowledge was identified is then added to the TKL of the ontology. ord2i:investigativeOperation1 (lines 20–29) represents the task of information extraction performed by the tool Plaso (lines 25–26). As the property ord2i:identifiedBy states, this task allows several entities to be identified, including ord2i:FileDownload40, ord2i:WebResource44, ord2i:File46, ord2i:Process21 and ord2i:Person22. To conclude, the event ord2i:FileDownload40 and the task ord2i:InvestigativeOperation36 are localised in time using the OWL-Time ontology (lines 17–18 and lines 28–29).

Fig. 4. Turtle serialisation of knowledge generated by the instantiation of ORD2I.
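As an illustration, a fragment of this instantiation for E4 can be reproduced with rdflib. This is a minimal sketch: the namespace IRI is a placeholder and only a subset of the properties shown in Fig. 4 appears here.

from rdflib import Graph, Namespace, RDF

# Placeholder IRI: the real ORD2I namespace is the one defined by the ontology.
ORD2I = Namespace("http://example.org/ord2i#")

g = Graph()
g.bind("ord2i", ORD2I)

event, remote, local = ORD2I.FileDownload40, ORD2I.webResource44, ORD2I.File46
process, person = ORD2I.Process21, ORD2I.Person22
operation = ORD2I.investigativeOperation1

g.add((event, RDF.type, ORD2I.FileDownload))
g.add((remote, RDF.type, ORD2I.WebResource))
g.add((local, RDF.type, ORD2I.File))
g.add((process, RDF.type, ORD2I.Process))
g.add((person, RDF.type, ORD2I.Person))

# CKL links: the download uses the remote resource and creates the local file.
g.add((event, ORD2I.uses, remote))
g.add((event, ORD2I.creates, local))

# The browser process and the user participate in the event.
g.add((process, ORD2I.isInvolved, event))
g.add((person, ORD2I.isInvolved, event))

# TKL links: every entity is identified by an investigative operation.
for entity in (event, remote, local, process, person):
    g.add((entity, ORD2I.identifiedBy, operation))

print(g.serialize(format="turtle"))

Serialising g produces a Turtle document analogous to Fig. 4.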

Analysis

After the instantiation of the ontology, a large knowledge graph representing the information contained in the output of Plaso is obtained. The structure of this graph is derived from the schema of the ORD2I ontology. This graph has many advantages: it structures information, and it facilitates the interpretation of the information and the development of automatic analysis processes. As mentioned in Section 2, the analysis of this large volume of knowledge requires sophisticated tools. In this paper, we present the inference tools and automated analysis processes proposed in our approach. It is important to note that the proposed operators are based on assumptions made by the authors that can be subject to discussion. However, the main objective is to demonstrate the relevance of an ontology, as it greatly facilitates the development of analysis tools.


Fig. 5. Patterns used to deduce new relationships between users and events.

Table 1. Identification of users' involvement using knowledge enhancement (users inferred during the enhancement phase are marked with an asterisk).

Event                        Time                  User
ApplicationPrefetch5 (E1)    2015-04-03T08:13:13   (unknown)
WebpageVisit17 (E2)          2015-04-03T08:13:24   Person22
WebpageVisit30 (E3)          2015-04-03T08:13:27   Person22
FileDownload40 (E4)          2015-04-03T08:13:28   Person22
ApplicationPrefetch52 (E5)   2015-04-03T08:14:12   Person22*
WebpageVisit62 (E6)          2015-04-03T08:14:37   Person22
FileDeletion72 (E7)          2015-04-03T08:14:47   (unknown)


Knowledge enhancement

The goal of the enhancement step is to enrich the knowledge base with new deduced facts to complete the investigators' knowledge of the incident. The inference of new facts increases the effectiveness of the analysis and the quality of the final results. Indeed, adding new knowledge about the incident during the enhancement stage improves the quality of the conclusions.

Inference is an operation taking as input one or more established facts and outputting a new fact, called the conclusion. In our work, we have implemented several inference rules allowing, for example, the type of a file to be inferred from its extension (e.g. an individual belonging to the ord2i:File class which has the extension .exe is automatically classified in the ord2i:ExeFile class) or the OS user session associated with a given event to be deduced when this information is unknown.
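The first of these rules translates directly into a SPARQL update over the knowledge graph. In the Python sketch below, g is the rdflib graph built during instantiation; the property names ord2i:isLocatedAt and ord2i:fileName are assumptions standing in for ORD2I's actual attribute names.

g.update("""
    PREFIX ord2i: <http://example.org/ord2i#>
    INSERT { ?f a ord2i:ExeFile }
    WHERE {
        ?f a ord2i:File ;
           ord2i:isLocatedAt ?loc .
        ?loc ord2i:fileName ?name .
        FILTER (STRENDS(LCASE(?name), ".exe"))
    }
""")

Because OWL 2 RL reasoning is rule-based, such classification rules can equally be delegated to the reasoner instead of being run as explicit updates.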

This second example of inference is detailed below. Any information about the user session associated with an event is valuable for an investigation, so it is worth providing a tool that identifies associations between users and events. Some kinds of events directly provide information about the users involved in them. This is the case for Windows EVTX entries, for which the user field is filled in by Plaso. This is also the case for entries generated by web browsers, for which the user name is contained in the path of the history files used to generate the entry (in the case of Google Chrome, every history file is located in TSK:/Users/UserX/AppData/Local/Google/Chrome/User Data/Default/History). To deduce the user associated with an event, we define two inference rules based on the temporal location of events, which is defined using Allen's interval algebra (Allen, 1983) and the notion of time interval. Our ontology represents the beginning and the end of each event to model its duration (e.g. downloading a file that lasts several minutes). However, because of the granularity of the timestamps and the execution speed, most events are considered instantaneous. In addition, the end of an event is often not available in the output of Plaso. In most cases, therefore, we consider that each event is localised in time by an interval whose start and end times are both equal to the datetime provided by Plaso.

The two inference rules used to identify users' involvement are:

• Rule N°1: Given an event A for which the user is unknown and an event B associated with the user P, if A and B are temporally linked by a property p ∈ {equal, overlaps, during, starts, meets, finishes}, then the user associated with A is P.

• Rule N°2: Given two events A and B associated with the user P and located at both extremities of a chain of events exclusively composed of events for which the user is unknown, then each event composing this chain is associated with the user P.

These two rules are illustrated in Fig. 5 (users in bold italic are the results of the enhancement). The upper part of the figure illustrates rule N°1, which is used to deduce the user associated with the events e2, e4, e6, e8, e10 and e12. The bottom part of the figure illustrates rule N°2 and shows that UserX is involved in e17 and e18.
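Rule N°1 boils down to propagating a known user across an Allen relation. A minimal sketch, under our own data layout, is given below; since most events are instantaneous points (start = end), the equal and meets relations dominate in practice.

    # Sketch of rule N°1: propagate a known user to a temporally linked
    # event whose user is unknown. Names and data layout are illustrative.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Event:
        name: str
        start: float          # interval bounds; start == end for instant events
        end: float
        user: Optional[str] = None

    def allen_linked(a: Event, b: Event) -> bool:
        """True if a and b satisfy one of: equal, overlaps, during,
        starts, meets, finishes."""
        return (
            (a.start == b.start and a.end == b.end)       # equal
            or (a.start < b.start < a.end < b.end)        # overlaps
            or (b.start < a.start and a.end < b.end)      # during
            or (a.start == b.start and a.end < b.end)     # starts
            or (a.end == b.start)                         # meets
            or (a.end == b.end and b.start < a.start)     # finishes
        )

    def apply_rule_1(events: List[Event]) -> None:
        for a in events:
            if a.user is None:
                for b in events:
                    if b is not a and b.user is not None and allen_linked(a, b):
                        a.user = b.user
                        break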

After the instantiation of the ontology using the entries given in Fig. 3 as input, the knowledge we have about the user involved in each event can be retrieved using a SPARQL query. The column User in Table 1 shows the relation between users and events. Most users are known directly after the instantiation of the ontology (as the information is available in the output of Plaso), while the user marked with * is deduced during the enhancement phase. After the enhancement phase and a new execution of the SPARQL query, we see that a link between ord2i:ApplicationPrefetch52 and ord2i:Person22 has been identified by rule N°2. As the user ord2i:Person22 is involved in ord2i:FileDownload40 and ord2i:WebpageVisit62, and as ord2i:ApplicationPrefetch52 is


located between these two events, the system states that ord2i:Person22 is also involved in this event. It should be noted that the inference of new associations between users and events cannot be made with certainty. Indeed, another user can take the place of the current user, or the events can be initiated by a person or a process beyond the control of the user who opened the session. Thus, the confidence score associated with the operation (belonging to ord2i:InvestigativeOperation) that deduced the association is low, to invite the investigators to use this information cautiously.
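In the graph, such a deduction could be recorded as an investigative-operation individual carrying a confidence value, in the spirit of the TKL described earlier. The property names below (ord2i:confidence in particular), the link direction and the numeric scale are all assumptions.

    # Sketch: record a rule-2 deduction with a low confidence score.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    ORD2I = Namespace("http://example.org/ord2i#")  # placeholder namespace URI
    g = Graph()

    op = ORD2I["InvestigativeOperation_rule2_001"]       # hypothetical URI
    g.add((op, RDF.type, ORD2I.InvestigativeOperation))
    g.add((op, ORD2I.isIdentifyBy, ORD2I.ApplicationPrefetch52))  # deduced link (direction assumed)
    g.add((op, ORD2I.confidence, Literal(0.2, datatype=XSD.float)))  # assumed scale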

Timeline analysis

The first analysis tool proposed in our approach is a process (based on Chabot et al., 2014) detecting correlations between pairs of events. Correlation is a relationship with broad semantics that covers causal relationships and other semantic links. The identification of such relationships is particularly relevant for the investigators, as it allows them to quickly obtain an overall view of the events composing an incident and of the links between them. The relevance of this tool will be shown in Section 4. The identification of correlated pairs of events is performed using four criteria: the interaction of the two events with 1) common objects or 2) common subjects, 3) their temporal proximity and 4) a set of rules that may or may not be satisfied. For each pair of events (e1, e2), a correlation score is calculated using the following formula:

Corr(e1, e2) = CorrT(e1, e2) + CorrS(e1, e2) + CorrO(e1, e2)   (1)

where each score is between 0.0 and 1.0 after normalisation and where:

• CorrT(e1, e2) is a score quantifying the temporal correlation of the two events. The hypothesis used in this criterion is that if two events occur at the same time, the temporal correlation is equal to 1.0; otherwise, the closer the two events are in time, the more they are correlated.

• CorrS(e1, e2) is a score quantifying the correlation in the light of the subjects interacting with the two events. The more common subjects (processes or persons) the two events interact with, the greater the subject correlation. The subject correlation is therefore maximal for two events interacting with only two shared subjects, for example the Google Chrome process and the same user. In the case of two events interacting with the Google Chrome process but with two different users, the score is equal to 0.5.

• CorrO(e1, e2) is a score quantifying the correlation in the light of the objects interacting with the two events. The more common objects the two events interact with, the greater the object correlation. To obtain a finer system, we consider not only common objects but also objects whose locations are close: in the case of two events interacting with two objects sharing the same hostname (for remote resources) or the same path (for local resources), the score is increased.

Each criterion can be weighted to give more importance to one of the correlation scores (in this study, all scores are weighted equally). The overall correlation score, the average of the three scores, is between 0.0 (the two events are not correlated) and 1.0 (the two events are strongly correlated). The first three criteria do not depend on the type of the events, as they are based on generic features (time, objects and subjects). This makes it possible to discover correlations involving events from information sources that are unknown and therefore not anticipated by the investigators. To take into account the specific knowledge attached to every type of object modelled in ORD2I, we introduce a fourth score: CorrEK(e1, e2). This score is computed using a set of rules defined by experts (called expert knowledge-based rules). If two events satisfy a given rule, the overall correlation score between them is maximal. For example, in our experiments, we add the following two rules. The first rule produces a high correlation score in the case of two events, one representing the download of an executable file and the second representing its execution. The second rule produces a high correlation score in the case of an event representing the creation of a bookmark for a webpage and an event representing a visit to this same webpage using the bookmark. The sketch after this paragraph summarises how the four scores could combine.
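A minimal sketch of this combination, assuming the three generic scores are already normalised to [0.0, 1.0] and that an expert-rule match simply forces the maximal score (function and parameter names are ours):

    # Sketch of the overall correlation score described above.
    def overall_correlation(corr_t: float, corr_s: float, corr_o: float,
                            expert_rule_satisfied: bool,
                            w_t: float = 1.0, w_s: float = 1.0,
                            w_o: float = 1.0) -> float:
        # An expert knowledge-based rule match makes the score maximal.
        if expert_rule_satisfied:
            return 1.0
        # Otherwise, the weighted average of the three generic criteria.
        return (w_t * corr_t + w_s * corr_s + w_o * corr_o) / (w_t + w_s + w_o)

    # E.g. App.Prefetch52 / FileDownload40 in Table 2 satisfies the
    # download-then-execute rule, so the overall score is 1.0.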

To illustrate the correlation process, we apply our approach to the dataset composed of the seven events shown in Listing 3. The story of this dataset appears to be the following. First, the user started the web browser Chrome (ord2i:ApplicationPrefetch5). Then, he visited two webpages (ord2i:WebpageVisit17 and ord2i:WebpageVisit30) on a website that seems to provide sources and binaries of malware. On the second webpage, he found a link allowing him to download an executable file named contagiomalware.exe (ord2i:FileDownload40). Once the download was complete, the user ran the executable (ord2i:ApplicationPrefetch52). He continued to browse the web on an unrelated website (ord2i:WebpageVisit62). Finally, the user deleted the file contagiomalware.exe (ord2i:FileDeletion72). Among these events, it is possible to identify a logical chain of events using intuition. This chain is composed of the events ord2i:WebpageVisit17, ord2i:WebpageVisit30, ord2i:FileDownload40, ord2i:ApplicationPrefetch52 and ord2i:FileDeletion72. These events make up a coherent story in which the user obtains a malware provided by a website, executes it and attempts to erase the traces by deleting the file. The main objective of the proposed correlation tool is to highlight this type of chain among the large number of events composing a case, so that one can both understand the case quickly and see whole chains of events. The third quartile of correlation scores is given in Table 2. This table contains nine pairs of correlated events, for which five scores are given: the temporal correlation, the subject correlation, the object correlation, the rule-based correlation and the overall correlation. The scores show four pairs of highly correlated events (overall score > 0.5):

• ord2i:ApplicationPrefetch52 and ord2i:FileDownload40: these two events satisfy the first rule defined above (download of an executable file and execution of the same file).


Table 2. Third quartile of correlation scores for the events from Fig. 3.

evt1                 evt2                 T     S    O     EK   Ov.
App.Prefetch52 (E5)  FileDownload40 (E4)  0.01  0.5  1.0   1.0  1.0
FileDownload40 (E4)  WebpageVisit30 (E3)  1.0   1.0  0.25  0.0  1.0
WebpageVisit17 (E2)  WebpageVisit30 (E3)  0.33  1.0  0.25  0.0  0.7
FileDownload40 (E4)  WebpageVisit17 (E2)  0.24  1.0  0.25  0.0  0.66
App.Prefetch52 (E5)  FileDeletion72 (E7)  0.02  0.0  1.0   0.0  0.45
FileDownload40 (E4)  WebpageVisit62 (E6)  0.0   1.0  0.0   0.0  0.45
WebpageVisit30 (E3)  WebpageVisit62 (E6)  0.0   1.0  0.0   0.0  0.45
WebpageVisit17 (E2)  WebpageVisit62 (E6)  0.0   1.0  0.0   0.0  0.45
FileDeletion72 (E7)  FileDownload40 (E4)  0.0   0.0  1.0   0.0  0.45


• ord2i:FileDownload40 and ord2i:WebpageVisit30: the subject correlation between these events is maximal, as they are both related to only two subjects, Google Chrome and the user ord2i:Person22. In addition, as only one second elapsed between the visit of the webpage containing the download link of contagiomalware.exe and the start of the download, the temporal correlation is high too. Finally, as the URL of the remote resource used for the download and the URL of the visited webpage are hosted by the same website, the object correlation is modified accordingly.

• ord2i:WebpageVisit17 and ord2i:WebpageVisit30: the object and subject correlation scores can be explained in the same way as above.

• ord2i:FileDownload40 and ord2i:WebpageVisit17: the same explanation applies to this case too.

The correlation score in the fifth position (ord2i:ApplicationPrefetch52 and ord2i:FileDeletion72) can be explained by the interaction with the same object, the file being launched by one event and deleted by the other. The same explanation applies to the last correlation score (ord2i:FileDeletion72 and ord2i:FileDownload40), where one event downloads the object and the other deletes it. However, for these two correlation scores, the subject and temporal correlation scores are low, as the events do not interact with common subjects and are not close in time (compared to the other events). Finally, the three correlations involving ord2i:WebpageVisit62 are exclusively due to the interaction with common subjects (Google Chrome and ord2i:Person22).

Fig. 6. Correlation graph of the events given in Fig. 3 (solid edges: score ≥ 0.6; dashed edges: 0.4 ≤ score < 0.6; dotted edges: score < 0.4).

Interface: visualisation and SPARQL querying

The Interface Layer has several functions:

• It allows investigators to visualise the data via an intuitive and clear visualisation tool.

• It provides a monitoring interface for managing the settings of the system, including the numerical values used in the correlation process (weights, size of the sliding window, etc.) and the information sources given as input to the instantiation step.

• It provides an advanced query tool for needs that the existing visualisation tools cannot satisfy.

A tool has been implemented to visualise the correlations between the events. This visualisation takes the form of a graph containing all the events composing the incident and the correlations between them. This graph gives a quick overall view of the case, and the links between events reveal chains of related events. The visual style (colour, line type, thickness) of each link depends on the strength of the correlation. The investigator can interact with the graph by clicking on its elements. A click on a link gives information about the correlation, namely the overall, temporal, subject, object and rule-based correlation scores. A click on a node gives information about the event (URI, description, objects and subjects interacting with the event). This information is collected from the triple store through dynamically created SPARQL queries.
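The dynamic query creation can be as simple as splicing the clicked node's URI into a template; a sketch follows, with an assumed namespace and our own function name.

    # Sketch: build the SPARQL query sent to the triple store when the
    # investigator clicks on an event node; the URI comes from the graph widget.
    def node_details_query(event_uri: str) -> str:
        return f"""
            PREFIX ord2i: <http://example.org/ord2i#>
            SELECT ?property ?value
            WHERE {{ <{event_uri}> ?property ?value . }}
        """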

An overview of the correlation graph corresponding to the dataset of Fig. 3 is given in Fig. 6. A chain of strongly correlated events appears, including the visit of the webpage, the download of the resource, the launch of this resource and its removal. This graph allows the identification of the chain of events composed of ord2i:WebpageVisit17, ord2i:WebpageVisit30, ord2i:FileDownload40, ord2i:ApplicationPrefetch52 and ord2i:FileDeletion72 discussed above.


Fig. 7. SPARQL query retrieving information about all webpages visited by the user named UserX.

Table 4. Quantification of processed data volumes through the reconstruction and analysis process.

Criterion \ datasets                Malware case   Dataset N°2   Dataset N°3
Number of lines (timeline)          8664           5241          21,424
Supported entries (extraction)      7918           4600          17,863
Entries after filtering             222            130           774
Generated triples (instantiation)   10,186         12,400        33,498
Deduced triples (enhancement)       21             17            28
Correlations (≥ 0.5)                1411           485           15,356



To go further and display information which does not appear in the visualisation tool, one can use the SPARQL endpoint to query directly the knowledge contained in the triple store (the knowledge base). For example, a query retrieving information about all webpages visited by the user named UserX is given in Fig. 7. The results of this query are given in Table 3. Other examples of the use of SPARQL are given in Section 4. Despite its power, the SPARQL language (like SQL) is not an intuitive and simple way to access knowledge, and it requires technical expertise to be used efficiently. Therefore, we aim to minimise the use of SPARQL by integrating a large amount of data into the visualisation tools.
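The query of Fig. 7 is not reproduced in this text; a sketch of such a query, run through rdflib, could look like the following. Every ord2i class and property name below (WebpageVisit, isInvolved, accesses, name, title, host) and the namespace URI are our guesses, not the paper's actual schema.

    # Hedged sketch of a Fig. 7-style query over the knowledge base.
    from rdflib import Graph

    g = Graph()
    g.parse("timeline.ttl", format="turtle")  # hypothetical knowledge base dump

    results = g.query("""
        PREFIX ord2i: <http://example.org/ord2i#>
        SELECT ?titleWebpage ?hostWebpage
        WHERE {
            ?event a ord2i:WebpageVisit ;
                   ord2i:isInvolved ?user ;     # assumed link direction
                   ord2i:accesses   ?page .     # assumed event->object link
            ?user  ord2i:name  "UserX" .
            ?page  ord2i:title ?titleWebpage ;
                   ord2i:host  ?hostWebpage .
        }
    """)
    for row in results:
        print(row.titleWebpage, row.hostWebpage)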

This section has presented the SADFC approach, demonstrating its technical feasibility and the relevance of its features. Among them, the analysis and visualisation tools based on the ORD2I ontology appear promising for investigating a forensic case efficiently. These features save time by identifying and displaying on screen the strength of the correlations between events, which helps to quickly understand the overall shape of the incident.

Experimentation and results

This section aims to demonstrate the capabilities of our approach in the context of a digital investigation analysis. For this, we propose a study of the performance and relevance of the approach through simulated cases. The machine used to run the experiment and host the triple store (Stardog server 2.2.4) has a 3.20 GHz Intel Core i5-3470 processor and 8 GB of RAM. The disk images used were generated from a virtual machine running Windows 7.

Table 3. Results of the query retrieving information about all webpages visited by the user named UserX.

titleWebpage                            hostWebpage
Download Malware Sources and Binaries   malwarewebsite.com
Download contagio                       malwarewebsite.com
BBC News - Home                         www.bbc.co.uk


The first task of the experiment is to evaluate the execution time of our tool and the volume of data handled and generated throughout the process. For comparison, we give the results for two other datasets of different sizes. Table 4 provides information on the volumes: the number of entries composing the timeline, the number of entries supported by our tool (which does not support all sources of information), the number of entries after filtering, the number of triples generated during the instantiation phase, the number of triples deduced during the enhancement phase and, finally, the number of significant correlations found (score > 0.5; this threshold can be modified by the user via the interface). Table 5 gives the execution times (in seconds) for the different phases of the process.

Note that we reduce the number of entries using filtering mechanisms during the instantiation phase. These filters keep only the entries of certain sources of information; entries from Chrome cookies, the Chrome cache, the Internet Explorer cache and the file system are filtered out. From these results, we can see that the extraction, instantiation and enhancement phases are carried out quickly. On the other hand, the analysis phase is time-consuming, which emphasises the importance of filtering out non-critical sources of information to reduce the amount of data to analyse. It is also important to note that the number of triples does not increase linearly with the number of entries in the timeline. Indeed, the number of triples generated for an event depends on its type: some types of events carry more information and therefore require a larger number of triples to be modelled in the ontology.
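Such source-based filtering can sit in front of the instantiation step. A toy sketch follows; the column name 'sourcetype' and the excluded labels are assumptions about the Plaso CSV export rather than its exact vocabulary.

    # Toy sketch of source-based filtering over a Plaso timeline export.
    import csv

    # Placeholder labels for the excluded sources named above.
    EXCLUDED_SOURCES = {"chrome_cookies", "chrome_cache", "msie_cache", "filesystem"}

    def filtered_entries(timeline_csv: str):
        """Yield only the entries whose source is not excluded."""
        with open(timeline_csv, newline="") as f:
            for entry in csv.DictReader(f):
                if entry.get("sourcetype") not in EXCLUDED_SOURCES:
                    yield entry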

The second task of the experiment illustrates the capabilities and the relevance of our approach. The investigation started after Phil, an employee of CompanyAndCo, reported to his IT manager that his machine was exhibiting abnormal behaviour.

Table 5. Execution times (in seconds) of the different stages of the process.

Steps \ datasets   Malware case   Dataset N°2   Dataset N°3
Extraction         1.17           1.135         1.85
Instantiation      26.81          19.9          186.295
Enhancement        0.31           0.231         0.353
Analysis           814.2          254.3         13,541.8


Fig. 8. SPARQL query used to retrieve all the executables.


To understand what had happened, the investigator in charge started by collecting the hard disk of the suspect machine and applied the SADFC approach to study the disk image. After successfully completing the extraction, instantiation and enhancement phases, the investigator starts the analysis by searching for all significant correlations between events (score > 0.5). From Phil's testimony on the behaviour of the machine, the investigator assumes that a malware is the cause of the problem. He therefore decided to search for all traces of potentially suspicious executables run on the machine. To do this, he used the SPARQL query shown in Fig. 8 to retrieve all executables that can be identified in the timeline. Among the results of the query, given in Table 6, he identified two entries with a suspicious name (contagiomalware.exe) located in c:\users\andrew\downloads\ and c:\users\bobby\downloads\. The investigator then tried to figure out who interacted with this malware and in which circumstances.

Table 6. Results of the query retrieving all the executable files.

Path                                          Filename
c:\windows\temp\cr_e7cf0.tmp\                 setup.exe
c:\windows\system32\                          sppsvc.exe
c:\windows\system32\wbem\                     wmiprvse.exe
c:\windows\system32\wbem\                     wmiadap.exe
c:\users\andrew\downloads\                    contagiomalware.exe
c:\program files\google\update\               googleupdate.exe
c:\windows\system32\                          wermgr.exe
c:\windows\system32\                          notepad.exe
c:\windows\system32\                          smss.exe
c:\windows\system32\                          csrss.exe
c:\windows\system32\                          winlogon.exe
c:\windows\system32\                          taskhost.exe
c:\windows\system32\                          dllhost.exe
c:\windows\system32\                          userinit.exe
c:\windows\system32\                          dwm.exe
c:\windows\system32\                          taskeng.exe
c:\windows\                                   explorer.exe
c:\program files\vmware\vmware tools\         tpautoconnect.exe
c:\windows\system32\                          conhost.exe
c:\program files\vmware\vmware tools\         vmtoolsd.exe
c:\windows\system32\                          searchprotocolhost.exe
c:\windows\system32\                          searchfilterhost.exe
c:\users\bobby\downloads\                     contagiomalware.exe
c:\program files\google\chrome\application\   chrome.exe
c:\windows\system32\                          logonui.exe

This information can be obtained using the correlation graph viewer. Indeed, the viewer allows the events interacting with the executables to be retrieved manually, along with information about these events. The information can also be retrieved using the SPARQL query given in Fig. 9.

This query retrieves each event interacting with the executable named contagiomalware.exe. For each of these events, the query retrieves the name of the user involved, the type and subtype of the event, and the date and time at which it occurred (a sketch of such a query is given after this paragraph). The result of this query is shown in Table 7. The investigator noted that the users Andrew and Bobby have both used contagiomalware.exe. Andrew downloaded it and then removed it. Bobby, on his side, downloaded the malware, launched it and finally deleted it. The investigator then tried to find out whether Andrew and Bobby are related and, if so, why. To answer these questions, he uses the correlation graph viewer. The consultation of a wide view of the graph allows different clusters of events to be distinguished (Fig. 10). In particular, there are three clusters corresponding to the events belonging to each user. A quick study of the cluster related to Phil shows that he browsed the web and consulted his emails. The most relevant information for the investigation is located in the clusters related to Andrew and Bobby. The existence of many links between these two clusters indicates that these two users have strong interactions. The investigator then zooms in on the small group of events between the two clusters, looking for information about the use of the executable contagiomalware.exe by one of the two users. After having found one of the events, he can view the closest neighbourhood of the event using the correlation links shown in Fig. 11. The chain of correlated events found represents the life cycle of the malware on the computer. By clicking on the nodes, he can consult the information about the events (date and time, related objects and subjects, description).
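The Fig. 9 query itself is not reproduced in this text; under the same assumed schema as the earlier sketches, it could resemble the following (every ord2i name is again a placeholder, and the string would be run with g.query() as in the Fig. 7 sketch above).

    # Hedged reconstruction of a Fig. 9-style query.
    EVENTS_FOR_EXECUTABLE = """
        PREFIX ord2i: <http://example.org/ord2i#>
        SELECT ?pseudo ?type ?subtype ?datetime
        WHERE {
            ?event ord2i:isInvolved   ?user ;      # assumed link direction
                   ord2i:refersTo     ?file ;      # assumed event->object link
                   ord2i:eventType    ?type ;
                   ord2i:eventSubtype ?subtype ;
                   ord2i:occursAt     ?datetime .
            ?user  ord2i:name     ?pseudo .
            ?file  ord2i:fileName "contagiomalware.exe" .
        }
        ORDER BY ?datetime
    """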



Fig. 9. SPARQL query retrieving information about events interacting with contagiomalware.exe.

Table 7. Results of the query retrieving information about events related to contagiomalware.exe.

Pseudo   Type              Subtype         Datetime
Andrew   Chrome history    File download   2016-01-03T07:21:16.000+01:00
Bobby    Chrome history    File download   2016-01-03T08:12:09.000+01:00
Bobby    Recycle bin       File deletion   2016-01-03T08:12:43.000+01:00
Bobby    Windows prefetch  App. prefetch   2016-01-03T08:12:25.000+01:00


To get more details about an event, he can also use the SPARQL interface. The use of the correlation graph viewer and the SPARQL interface, two complementary tools, allows him to identify the following information. Regarding Andrew, we can see that he viewed a website entitled “Download Malware Sources and Malware Binaries” at http://malwarewebsite.com. He then downloaded the file contagiomalware.exe from this website. He next went to his personal website, named “Andrew Portfolio”. By clicking on the corresponding node, we obtain the URL of Andrew's website, http://andrew-and-co.com. Regarding Bobby, we can see that he visited the website named “Andrew Portfolio”. On this website, he consulted a folder called “private”, from which he downloaded the file contagiomalware.exe. The graph also confirms that Bobby launched this executable and then deleted it.

In conclusion, we have seen that the proposed approach allows the knowledge related to a digital incident to be extracted quickly. This knowledge can then be enriched using the enhancement process, which deduces new knowledge from the existing one. After the analysis phase, the correlation graph viewer and the SPARQL interface allow the nature of the interactions between the two users to be understood. It appears clearly that Andrew and Bobby have strong interactions. In particular, we have seen that they both used the file contagiomalware.exe and the website http://andrew-and-co.com. The current tools of SADFC do not allow the nature of this interaction to be determined with certainty. However, the investigator may assume that Andrew downloaded the malware from the website http://malwarewebsite.com and then made it available on his personal website http://andrew-and-co.com. Bobby then downloaded and launched the executable. The effects on the machine remain unknown at this stage of the investigation. Therefore, the proposed tool only provides good research directions; the investigator's analytical faculties are still required to complete the investigation.

Conclusion and future works

In this paper, we proposed an approach called SADFC (Semantic Analysis of Digital Forensic Cases). This approach is based on an ontology to represent accurately a digital incident and the associated digital investigation. The ontology, named ORD2I, is associated with a set of tools for extracting information from disk images seized at crime scenes, instantiating the ontology, deducing new knowledge and analysing it. SADFC provides answers to the three issues identified in the state of the art: the cognitive overload due to the large volumes of data to be processed, the heterogeneity of the data and the legal requirements that every piece of evidence has to meet in order to be admissible in court. Regarding data volumes, our tools are automatic, which enables efficient data processing. In addition, we developed enhancement and analysis tools to perform reasoning tasks, allowing the investigators to focus on the most complex parts of the investigation, where their experience, expertise and skills are most needed. Visualisation tools have also been introduced to provide an intuitive and clear way to access the knowledge.

Regarding the heterogeneity of data, SADFC is based on ORD2I, which is composed of three layers representing knowledge common to all events, specialised information and knowledge about the investigation. The use of an ontology provides a unified model to represent knowledge, making it easy to build analysis processes. Finally, the proposed ontology meets the legal requirements, especially the need for reproducibility and traceability (Chabot et al., 2014).

Despite the aforementioned qualities, SADFC has several limitations that will be studied in future works. The performance of the tool can still be improved, especially the execution time of the analysis phase. The benchmarks show that it takes approximately three hours to handle a timeline with 20,000 entries, which is a small timeline compared to those handled in real cases. The optimisation of the analysis is, therefore, a priority for future versions of the tool.

Regarding the extraction and the instantiation of the ontology, the tool can presently handle fifteen sources of information, which is still not enough to obtain a complete view of an incident. In future work, new sources will therefore be integrated. In particular, information sources related to Android and iOS must be integrated to handle mobile devices. At the same time, the ontology ORD2I must be validated by experts.


Fig. 10. Correlation graph viewer: wide zoom on the events composing the case.

Fig. 11. Correlation graph viewer: close view on the chain of events related to contagiomalware.exe.



For the analysis, we plan to integrate new tools besides the event correlation algorithm. 1) We will develop a pattern matching algorithm to detect illegal actions by identifying specific event sequences (an illegal action can be composed of several authorised events, hence the need for a pattern-based system to correctly detect illicit actions). 2) Based on the work of Hargreaves and Patterson (2012), we will implement a tool to find event compositions in order to allow the summarisation of the timeline.

Concerning the interface, several visualisation tools will be implemented in future works. First, more features will be integrated into the correlation graph viewer, including search functions. Second, an advanced timeline viewer will be proposed. This viewer will allow, on the one hand, the timeline to be displayed conventionally and, on the other hand, this timeline to be enriched with knowledge from the ontology, including information about the objects and subjects interacting with each event and information produced by the enhancement and analysis phases. We also plan to allow the investigators to enrich the knowledge of the ontology with information they find on their own at the digital crime scene, in order to improve interactivity and collaboration between SADFC and the investigators.


The implementation of such functionality requires mechanisms to check the consistency of the ontology and to update it frequently. Indeed, adding new information may conflict with the existing knowledge about the incident or enable new deductions. Furthermore, if several investigators work on the same case, we must provide collaborative tools to manage the points of view of all the members of the team.

Acknowledgements

The above work is part of a collaborative research project between the CheckSem team (Le2i, UMR CNRS 6306, University of Burgundy) and the UCD School of Computer Science and Informatics. This project is supported by University College Dublin and the Burgundy region (France). The authors would like to thank Séverine Fock for her advice and the proofreading of this manuscript.

References

Abbott J, Bell J, Clark A, De Vel O, Mohay G. Automated recognition of event scenarios for digital forensics. In: Proceedings of the 2006 ACM symposium on applied computing. ACM; 2006. p. 293–300.

Allen J. Maintaining knowledge about temporal intervals. Commun ACM 1983;26:832–43.

Barnum S. Cyber observable expression (CybOX) use cases. 2011.

Baryamureeba V, Tushabe F. The enhanced digital investigation process model. In: Proceedings of the fourth digital forensic research workshop. Citeseer; 2004.

Buchholz F, Falk C. Design and implementation of Zeitline: a forensic timeline. In: Digital forensic research workshop; 2005.

Case A, Cristina A, Marziale L, Richard GG, Roussev V. FACE: automated digital evidence discovery and correlation. Digit Investig 2008;5:S65–75.

Casey E, Back G, Barnum S. Leveraging CybOX to standardize representation and exchange of digital forensic information. Digit Investig 2015;12(Suppl. 1):S102–10. DFRWS 2015 Europe, Proceedings of the Second Annual DFRWS Europe.

Chabot Y, Bertaux A, Nicolle C, Kechadi T. A complete formalized knowledge representation model for advanced digital forensics timeline analysis. Digit Investig 2014;11:S95–105. Fourteenth Annual DFRWS Conference.

Chen K, Clark A, De Vel O, Mohay G. ECF: event correlation for forensics. In: First Australian computer network and information forensics conference. Perth, Australia: Edith Cowan University; 2003. p. 1–10.

Farmer D, Venema W. The coroner's toolkit (TCT). 2004.

Gil Y, Miles S. PROV model primer. W3C Working Group Note; 2013.

Gladyshev P, Patel A. Finite state machine approach to digital event reconstruction. Digit Investig 2004;1:130–49.

Gruber TR. A translation approach to portable ontology specifications. Knowl Acquis 1993;5:199–220.

Gudhjonsson K. Mastering the super timeline with log2timeline. SANS Reading Room; 2010.

Hargreaves C, Patterson J. An automated timeline reconstruction approach for digital forensic investigations. Digit Investig 2012;9:69–79.

James J, Gladyshev P, Abdullah M, Zhu Y. Analysis of evidence using formal event reconstruction. Digital Forensics Cyber Crime 2010:85–98.

Khan M, Chatwin C, Young R. A framework for post-event timeline reconstruction using neural networks. Digit Investig 2007;4:146–57.

Lebo T, Sahoo S, McGuinness D, Belhajjame K, Cheney J, Corsar D, et al. PROV-O: the PROV ontology. W3C Recommendation, 30 April 2013.

Motik B, Grau BC, Horrocks I, Wu Z, Fokoue A, Lutz C. OWL 2 web ontology language: profiles. W3C Recommendation; 2009. p. 61.

Olsson J, Boldt M. Computer forensic timeline visualization tool. Digit Investig 2009;6:78–87.

Schatz B, Mohay G, Clark A. Rich event representation for computer forensics. In: Proceedings of the fifth Asia-Pacific industrial engineering and management systems conference (APIEMS 2004), vol. 2; 2004. p. 1–16.

Turnbull B, Randhawa S. Automated event and social network extraction from digital evidence sources with ontological mapping. Digit Investig 2015;13:94–106.