A complete formalized knowledge representation model for advanced digital forensics timeline...

11
A complete formalized knowledge representation model for advanced digital forensics timeline analysis Yoan Chabot a, b, * , Aur elie Bertaux a , Christophe Nicolle a , M-Tahar Kechadi b a CheckSem Team, Laboratoire Le2i, UMR CNRS 6306, Facult e des sciences Mirande, Universit e de Bourgogne, BP47870, 21078 Dijon, France b School of Computer Science & Informatics, University College Dublin, Beleld, Dublin 4, Ireland Keywords: Digital forensics Timeline analysis Event reconstruction Knowledge management Ontology abstract Having a clear view of events that occurred over time is a difcult objective to achieve in digital investigations (DI). Event reconstruction, which allows investigators to understand the timeline of a crime, is one of the most important step of a DI process. This complex task requires exploration of a large amount of events due to the pervasiveness of new tech- nologies nowadays. Any evidence produced at the end of the investigative process must also meet the requirements of the courts, such as reproducibility, veriability, validation, etc. For this purpose, we propose a new methodology, supported by theoretical concepts, that can assist investigators through the whole process including the construction and the interpretation of the events describing the case. The proposed approach is based on a model which integrates knowledge of experts from the elds of digital forensics and software development to allow a semantically rich representation of events related to the incident. The main purpose of this model is to allow the analysis of these events in an automatic and efcient way. This paper describes the approach and then focuses on the main conceptual and formal aspects: a formal incident modelization and operators for timeline reconstruction and analysis. © 2014 Digital Forensics Research Workshop. Published by Elsevier Limited. All rights reserved. Introduction Due to the rapid evolution of digital technologies and their pervasiveness in everyday life, the digital forensics eld is facing challenges that were anecdotal a few years ago. Existing digital forensics toolkits, such as EnCase or FTK, simplify and facilitate the work during an investiga- tion. However, the scope of these tools is limited to collection and examination of evidence (i.e. studying its properties), which are the rst two steps of the investiga- tion process, as dened in Palmer (2001). To extract acceptable evidence, it is also necessary to deduce new knowledge such as the causes of the current state of the evidence (Carrier and Spafford, 2004b). The eld of event reconstruction aims at solving this issue: event recon- struction can be seen as a process of taking as input a set of events and outputting a timeline of the events describing the case. Several approaches have been proposed to carry out event reconstruction, which try to extract events and then represent them in a single timeline (super-timeline (Gudhjonsson, 2010)). This timeline allows to have a global overview of the events occurring before, during and after a given incident. However, due to the number of events which can be very large, the produced timeline may be quite complicated to analyse. This makes the interpretation of the timeline and therefore the decision making very difcult. In addition, event reconstruction is a complex * Corresponding author. CheckSem Team, Laboratoire Le2i, UMR CNRS 6306, Facult e des sciences Mirande, Universit e de Bourgogne, BP47870, 21078 Dijon, France. E-mail address: [email protected] (Y. Chabot). Contents lists available at ScienceDirect Digital Investigation journal homepage: www.elsevier.com/locate/diin http://dx.doi.org/10.1016/j.diin.2014.05.009 1742-2876/© 2014 Digital Forensics Research Workshop. Published by Elsevier Limited. All rights reserved. Digital Investigation 11 (2014) S95eS105

Transcript of A complete formalized knowledge representation model for advanced digital forensics timeline...

Page 1: A complete formalized knowledge representation model for advanced digital forensics timeline analysis

ilable at ScienceDirect

Digital Investigation 11 (2014) S95eS105

Contents lists ava

Digital Investigation

journal homepage: www.elsevier .com/locate/d i in

A complete formalized knowledge representation model foradvanced digital forensics timeline analysis

Yoan Chabot a, b, *, Aur�elie Bertaux a, Christophe Nicolle a, M-Tahar Kechadi b

a CheckSem Team, Laboratoire Le2i, UMR CNRS 6306, Facult�e des sciences Mirande, Universit�e de Bourgogne, BP47870,21078 Dijon, Franceb School of Computer Science & Informatics, University College Dublin, Belfield, Dublin 4, Ireland

Keywords:Digital forensicsTimeline analysisEvent reconstructionKnowledge managementOntology

* Corresponding author. CheckSem Team, Laborat6306, Facult�e des sciences Mirande, Universit�e de21078 Dijon, France.

E-mail address: [email protected] (Y. Ch

http://dx.doi.org/10.1016/j.diin.2014.05.0091742-2876/© 2014 Digital Forensics Research Works

a b s t r a c t

Having a clear view of events that occurred over time is a difficult objective to achieve indigital investigations (DI). Event reconstruction, which allows investigators to understandthe timeline of a crime, is one of the most important step of a DI process. This complex taskrequires exploration of a large amount of events due to the pervasiveness of new tech-nologies nowadays. Any evidence produced at the end of the investigative process mustalso meet the requirements of the courts, such as reproducibility, verifiability, validation,etc. For this purpose, we propose a new methodology, supported by theoretical concepts,that can assist investigators through the whole process including the construction and theinterpretation of the events describing the case. The proposed approach is based on amodel which integrates knowledge of experts from the fields of digital forensics andsoftware development to allow a semantically rich representation of events related to theincident. The main purpose of this model is to allow the analysis of these events in anautomatic and efficient way. This paper describes the approach and then focuses on themain conceptual and formal aspects: a formal incident modelization and operators fortimeline reconstruction and analysis.

© 2014 Digital Forensics Research Workshop. Published by Elsevier Limited. All rightsreserved.

Introduction

Due to the rapid evolution of digital technologies andtheir pervasiveness in everyday life, the digital forensicsfield is facing challenges that were anecdotal a few yearsago. Existing digital forensics toolkits, such as EnCase orFTK, simplify and facilitate the work during an investiga-tion. However, the scope of these tools is limited tocollection and examination of evidence (i.e. studying itsproperties), which are the first two steps of the investiga-tion process, as defined in Palmer (2001). To extract

oire Le2i, UMR CNRSBourgogne, BP47870,

abot).

hop. Published by Elsevier L

acceptable evidence, it is also necessary to deduce newknowledge such as the causes of the current state of theevidence (Carrier and Spafford, 2004b). The field of eventreconstruction aims at solving this issue: event recon-struction can be seen as a process of taking as input a set ofevents and outputting a timeline of the events describingthe case. Several approaches have been proposed to carryout event reconstruction, which try to extract events andthen represent them in a single timeline (super-timeline(Gudhjonsson, 2010)). This timeline allows to have a globaloverview of the events occurring before, during and after agiven incident. However, due to the number of eventswhich can be very large, the produced timeline may bequite complicated to analyse. This makes the interpretationof the timeline and therefore the decision making verydifficult. In addition, event reconstruction is a complex

imited. All rights reserved.

Page 2: A complete formalized knowledge representation model for advanced digital forensics timeline analysis

Y. Chabot et al. / Digital Investigation 11 (2014) S95eS105S96

process where each conclusion must be supported by evi-dence rigorously collected, giving it full credibility.

In this paper, we first address these problems by pro-posing an approach to reconstruct scenarios from suspectdata and analyse them using semantic tools and knowledgefrom experts. Secondly, this paper answers the challenge ofcorrectness of the whole investigative process with aformal incident modelization and timeline reconstructionand analysis operators. The paper is organized as follows.Section Related Works reviews important issues of theevents reconstruction problem and the various approachesproposed so far. The SADFC (Semantic Analysis of DigitalForensic Cases) approach is described in Section SADFCapproach, and the formal advanced timeline reconstruc-tion and analysis model is presented in Section Advancedtimeline analysis model. Finally, a case study illustratingthe key characteristics of the proposed approach is given inSection Case study.

Related works

Events reconstruction has many issues, which aredirectly related to the size of the data, digital investigationprocess complexity, and IT infrastructures challenges. Forinstance, Table 1 compares some existing approaches (theirstrengths (✓), limitations (✖), partial or inadequate solu-tions (C)) with regard to some key issues, such as het-erogeneity, automatic knowledge extraction, the use ofproven theory as support, analysis capabilities, and pres-ervation of data integrity. While some of these challengeshave been a focus for many researchers and developers forthe last decade, the size of data volumes (Richard III &Roussev, 2006) and data heterogeneity are still very chal-lenging. The first (large data sizes) introduces many chal-lenges at every phase of the investigation process; from thedata collection to the interpretation of the results. Thesecond (data heterogeneity) is usually due to multiplefootprint sources such as log files, information contained infile systems, etc. We can classify events heterogeneity intothree categories:

� Format: The information encoding is not the sameamong sources due to the formatting. Therefore,depending on the source, footprint data may bedifferent.

� Temporal: The use of different sources from differentmachines may cause timing problems. First, there aresome issues due to the use of different time zones andunsynchronized clocks. Second, the temporal

Table 1Evaluation of approaches.

Approach/Criterion Auto extractio

ECF (Chen et al., 2003) ✓

FORE (Schatz et al., 2004b) ✓

Finite state machine(Gladyshev and Patel, 2004) ✖

Zeitline (Buchholz and Falk, 2005) C

Neural networks(Khan and Wakeman, 2006) C

CyberForensic TimeLab(Olsson and Boldt, 2009) ✓

log2timeline (Gudhjonsson, 2010) ✓

Timeline reconstruction (Hargreaves and Patterson, 2012) ✓

heterogeneity can be due to the use of different formatsor granularities (e.g. 2 s in FAT file systems, 100 ns inNTFS file systems).

� Semantic: The same event can be interpreted or repre-sented in different ways. For example, an eventdescribing the visit of awebpagemay appear in differentways in web browser logs and server logs.

In order to gather all the events found in footprintsources in a single timeline, a good handling of all theseforms of heterogeneity is required. This leads to thedevelopment of an automated information processingapproach which is able to extract knowledge from theseheterogeneous sources. In addition, once extracted, thisknowledge should be federated within the same model soas to facilitate their interpretation and future analysis. Theeffectiveness of a such approach can be assessed by thefollowing criteria:

� Efficient automated tools that can extract events andbuild a timeline (Criterion 1 in Table 1).

� The ability to process multiple and various footprintsources and to federate the information collected in acoherent and structured way (Criterion 2 in Table 1).

� The ability to assist users during the timeline analysis.This latter encompasses many aspects such as makingthe timeline easier to read, identifying correlations be-tween events or producing conclusions from knowledgecontained in the timeline (Criterion 3 in Table 1).

For the majority of existing approaches, solutions areprovided to automatically extract events and construct thetimeline. Chen et al. (2003) introduced a set of automatedextractors to collect events and store them in a canonicaldatabase, which allows to quickly generate a temporal or-dered sequence of events. These automatic extractors, awidely used concept, can also generate the timeline (Olssonand Boldt, 2009; Gudhjonsson, 2010; Hargreaves andPatterson, 2012). However, current tools extract data inits raw formwithout a good understanding of the meaningof footprints, which makes their analysis more difficult. Inorder to deal with the semantic heterogeneity, the FORE(Forensics of Rich Events) system stores the events in anontology (Schatz et al., 2004a,b). This ontology uses thenotions of entity and event to represent the state change ofan object over time. Nevertheless, the time model imple-mented in this ontology is not accurate enough (use ofinstant rather than interval) to represent events accurately.In addition, the semantic coverage of this ontology can be

n Heterogeneity Analysis Theory Data integrity

✓ ✖ ✖ ✖

✓ C ✖ ✖

C C ✓ ✖

✓ ✖ ✖ ✓

C ✖ C ✖

✓ ✖ ✖ ✖

✓ ✖ ✖ ✖

✓ C C ✖

Page 3: A complete formalized knowledge representation model for advanced digital forensics timeline analysis

Y. Chabot et al. / Digital Investigation 11 (2014) S95eS105 S97

improved. Another storage structure is also proposed inBuchholz and Falk (2005). They proposed a scalable struc-ture, a variant of a balanced binary search tree. However,one of the problems of this approach is that one has to buildthe case by manually selecting relevant events.

Some automatic events extraction approaches are alsoable to process heterogeneous sources. This usually con-sists of creating a set of extractors dedicated to each type offootprint sources. In Gudhjonsson (2010), the authorshighlight several limitations of the current event recon-struction systems such as the use of a small number ofevent sources, which makes the timeline vulnerable toanti-forensics techniques (e.g., alteration of timestamps).They proposed to use a large number of event sources toensure that the high quality of the timeline and the impactof anti-forensics techniques is minimized. log2timeline usesvarious sources including Windows logs, history of variousweb browsers, log files of other software resources, etc.Multiple heterogeneous sources require a consistent rep-resentation of the events in order to process them. Themajority of approaches propose their own model for eventrepresentation. The most commonly used event featuresinclude the date and time of the event (an instant or aninterval when the duration of events is included) and in-formation about the nature of the event.

Regarding the timeline analysis, few solutions havebeen proposed. The FORE approach attempts to identifyevent correlations by connecting events with causal or ef-fect links. In Gladyshev and Patel (2004), the authors try toperform the event reconstruction by representing the sys-tem behaviour as a finite state machine. The event recon-struction process can be seen as a search for sequences oftransitions that satisfy the constraints imposed by the ev-idence. Then, some scenarios are removed using the evi-dence collected (thus, there is no automation of theextraction process in this approach). In Hargreaves andPatterson (2012), they proposed a pattern-based processto produce high-level events (“human-understandable”events) from a timeline containing low-level events (eventsdirectly extracted from the sources). Although the pro-posed approach is relevant, it can handle only one of themany aspects of the analysis by helping the investigator toread the timeline much easier. Therefore, other aspectssuch as causality analysis between events are not coveredin this approach.

All approaches have to satisfy some key requirementssuch as credibility, integrity, and reproducibility of thedigital evidence (Baryamureeba and Tushabe, 2004). Inrecent years, the protagonists of digital forensics movedaway from investigative techniques that are based on theinvestigators experience, to techniques based on proventheories. It is also necessary to provide clear explanationabout the evidence found. In addition, one has to ensurethat the tools used do not modify the data collected oncrime scenes. Thus, it is necessary to develop tools thatextract evidence, while preserving the integrity of the data.Finally, a formal and standard definition of the recon-struction process is needed to ensure the reproducibility ofthe investigation process and credibility of the results. Aclearly-defined investigation model allows to explain theprocess used to get the results. To this end we believe that

the following criteria are crucial for such techniques: theuse of a theoretical model to support the proposedapproach (criterion 4 in Table 1) and the ability to maintainthe data integrity (criterion 5 in Table 1). As a prelude to hiswork, Gladyshev (Gladyshev and Patel, 2004) argued that aformalization of the event reconstruction problem isneeded to simplify the automation of the process and toensure the completeness of the reconstruction. In Khan andWakeman (2006), the authors proposed an event recon-struction system based on neural networks. The use of amachine learning technique appears to be a suitable solu-tion because it is possible to know the assumptions andreasoning used to obtain final results. However, neuralnetworks behaviour is not entirely clear (especially duringthe training step). Thus, the approach adopted by Khandoes not seem to fulfil the goal of making reasoningexplicit. In Hargreaves and Patterson (2012), the systemkeeps information about the analysis which leads to infereach high-level event. This gives the opportunity to providefurther information when needed. As for the preservationof the data integrity, the approach described in Buchholzand Falk (2005) uses a set of restrictions to prevent thealteration of the evidence. Regarding the process model,numerous investigation models have been proposed,however, none of them is designed for automated or semi-automated investigation. Existing models (DFWRS model(Palmer, 2001), End to End Digital Investigation(Stephenson, 2003), Event-Based Digital Forensics Investi-gation Framework (Carrier and Spafford, 2004a), EnhancedDigital Investigation Process Model (Baryamureeba andTushabe, 2004), Extended Model of Cybecrime Investiga-tion (Ciardhu�ain, 2004), Framework for a Digital ForensicsInvestigation (Kohn et al., 2006)) are designed to guidehumans investigators by providing a list of tasks toperform. Thus, the proposed models are not accurateenough to provide a framework for the development ofautomated investigation tools. Among the limitations ofthe existing frameworks, the characterization of the dataflow through the model is absent and the meaning of thesteps is not clear. The creation of awell-defined and explicitframework is needed to allow an easy translation of theinvestigation process into algorithms.

We can see in Table 1 that existing approaches are notable to fulfil all the criteria. Two limitations of the existingapproaches are the lack of automation of the timelineanalysis and the absence of theoretical foundations toexplain the conclusions produced by the tools. The assis-tance provided to the user should not be limited only to theconstruction of the timeline. It is necessary to deal with alarge amount of events (criterion 3), the heterogeneity ofevent sources, the need to federate events in a suitablemodel (see criterion 1 and criterion 2), data integrity, andthe whole investigative process correctness and validation.The purpose of this study is to propose an approach thatsatisfies all the criteria presented in Table 1 while produc-ing all the necessary evidence in an efficient way.

SADFC approach

To reach these objectives, we propose the followingapproach:

Page 4: A complete formalized knowledge representation model for advanced digital forensics timeline analysis

Y. Chabot et al. / Digital Investigation 11 (2014) S95eS105S98

� Identify andmodel the knowledge related to an incidentand the knowledge related to the investigation processused to determine the circumstances of the incident:- Introduce a formalization of the event reconstructionproblem by formally defining entities involved in anincident.

- Define a knowledge representation model based onthe previous definitions used to store knowledgeabout an incident and knowledge used to solve thecase (to give credibility to the results and to ensurethe reproducibility of the process).

� Provide extraction methods of the knowledge containedin heterogeneous sources to populate the knowledgemodel.

� Provide tools to assist investigators in the analysis of theknowledge extracted from the incident.

Unlike conventional approaches, the SADFC approachuses techniques from knowledge management, semanticweb and data mining at various phases of the investigationprocess. Issues related to knowledge management andontology are two central elements of this approach as theydeal with a large part of the mentioned challenges. Ac-cording to Gruber (1993), “an ontology is an explicit, formalspecification of a shared conceptualization”. Thus, ontologyallows to represent knowledge generated during aninvestigation (knowledge about footprints, events, objectsetc.). Ontology provides several advantages such as thepossibility to use automatic processes to reason onknowledge thanks to its formal and explicit nature, theavailability of rich semantics to represent knowledge(richer than databases due to its sophisticated semanticconcepts (Martinez-Cruz et al., 2012)) and the possibility ofbuilding a common vision of a topic that can be shared byinvestigators and software developers. Ontology hasalready proved its relevance in computer forensics (Schatzet al., 2004a,b). The use of ontology is also motivated by itssuccessful use in other fields such as biology (Schulze-Kremer, 1998) and life-cycle management (Vanlandeet al., 2008).

SADFC is a synergy of the three elements presentedbelow, which, once assembled, constitute a coherentpackage describing methods, processes and technologicalsolutions needed for events reconstruction:

� Knowledge Model for Advanced Digital Forensics TimelineAnalysis which is presented in the following part of thepaper.

� Investigation Process Model: The aim of this model is todefine the various phases of the event reconstructionprocess: their types, the order between them and thedata flow through the whole process.

� Ontology-centred architecture: This architecture consistsof several modules which implement some of the keyfunctions such as footprint extraction, knowledgemanagement, ontology and the visualization of the finaltimeline. This architecture is based on an ontologywhich implements the knowledge model proposed inour approach. As the proposed ontologies in Schatz et al.(2004a,b) are not advanced enough, we have designedand developed a new ontology. For instance, some

relations included in our knowledgemodel are absent inthe ontology of the FORE system.

In the following sections, we present a formalizedknowledge model for advanced forensics timeline analysis,while the investigation process model and the architecturewill be discussed in details in other documents.

Advanced timeline analysis model

This section describes the knowledge model used in theSADFC approach to allow to perform an in-depth analysis oftimeline while fulfilling the law requirements. Modelsproposed in the literature (Schatz et al., 2004a,b) are stilllimited in terms of the amount of knowledge they can storeabout events and therefore, analysis capabilities. Temporalcharacteristics predominate over other aspects such asinteraction of events with objects, processes or people.Knowledge representing the relations between events andothers entities is not sufficiently diversified for the subse-quent analysis required. Thus, we propose a rich knowledgerepresentation containing a large set of entities and re-lations and allowing to build automated analysis processes.In addition, the proposed model is designed to meet thelegal requirements and contains knowledge allowing toreproduce the investigative process and to give full credi-bility to the results. It should be noted that the assumptionthat the data has been processed upstream to ensure ac-curacy and correctness is used. A digital forensic investi-gation process model including processes for consistencycheck of data and filtering will be proposed in future works.

We will first formally define the entities and relationscomposing themodel and then, introduce a set of operatorsto manipulate the knowledge. An overview of the proposedknowledge model is given in Fig. 1.

Formal modelling of incident

We first define entities of our knowledge model andthen we detail the four composed relations using thisknowledge.

Subject, object, event and footprintA crime scene is a space where a set of events

E ¼ {e1,e2,…,ei} takes place. An event is a single actionoccurring at a given time and lasting a certain duration. Anevent may be the drafting of a document, the reading of awebpage or a conversation via instant messaging software.

Subject. During its life cycle, an event involves subjects. Let Sbe the set containing subjects covering human actors andprocesses (e.g. Firefox web browser, Windows operatingsystem, etc.), a subject x2S corresponds to an entityinvolved in one or more events e2E and is defined byx ¼ {a2Asjxas a} where:

� As is a set containing all the attributes which can be usedto describe a subject. A subject attribute may be the firstname and the last name of a person, the identifier of aweb session, the name of a Windows session, etc.

Page 5: A complete formalized knowledge representation model for advanced digital forensics timeline analysis

Fig. 1. Knowledge model.

Table 2Allen algebra.

Y. Chabot et al. / Digital Investigation 11 (2014) S95eS105 S99

� as is the relation used to link a subject with the attri-butes of As describing it.

Object. During its life cycle, an event can also interact withobjects. An object may be a webpage, a file or a registry keyfor example. An object x2O is defined by x¼ {a2Ao j xao a}where:

� Ao is a set containing all the attributes which can be usedto describe an object. An object attribute may be forexample a filename, the location of an object on a harddisk, etc.

� ao is the relation used to link an object with the attri-butes of Ao describing it.

� O4§ðAoÞ is the set of the objects, meaning that o2Obelongs to the power set of Ao. An object is a composi-tion of one or several attributes of Ao.

Note, that for easier human understanding, an object canalso be seen as a composition of attributes and objects(becauseobjects are sets of attributes), e.g. a registrykey is anobject composedof several attributes such as its value and itskey name. This registry key is also an attribute of the objectrepresenting the database containing all keys of the system.

Event. Each event takes place in a time-interval defined bya start time and an end time. These boundaries define thelife cycle of the event. The use of a time interval allows torepresent the notion of uncertainty (Liebig et al., 1999). Forexample, when the start time of an event cannot bedetermined accurately, the use of a time-interval allows toapproximate it. The use of intervals requires the introduc-tion of a specific algebra (e.g. to order events). In our work,the Allen algebra, illustrated in the Table 2, is used (Allen,1983). The authors of this paper are aware of the prob-lems caused by temporal heterogeneity and anti-forensicstechniques on the quality and the accuracy of timestamps(granularity, timestamps offset and alteration, timezone,etc). Several works try to characterize the phenomenarelated to the use of time information in digital forensicinvestigations and to propose techniques to solve theseproblems (Schatz et al., 2006; Gladyshev and Patel, 2005;Forte, 2004). In this paper, we assume that all timestampsused in our model are adjusted and normalized beforehandby a process that will be the subject of future work (theproposed model is generic enough to incorporate futuresolutions). We consider that each function returns thevalue 1 if events meet the constraints and 0 otherwise.

An event e2E is defined by e ¼ {tstart,tend,l,Se,Oe,Ee}where:

� tstart is the start time of the event, tend is the end-time ofthe event and l is the location where the event tookplace. This location may be a machine (represented byan IP) and its geolocalisation.

� Se is a set containing all subjects involved in the event.Se ¼ {s2S j e2E, s ss e} where ss is a composed relationused to link an event e2E with a subject s2S. Therelation ss is defined below.

� Oe is a set containing all objects related to the event e.Oe ¼ {o2O j e2E, eso o} where so is a composed relationused to link an event e2E with an object o2O. Therelation so is defined below.

� Ee is the set containing all events with which the event iscorrelated. Ee ¼ {x2E j e2E, ese x} where se is acomposed relation used to link an event e2E with anevent x2E. The relation se is defined below.

Footprint. According to Ribaux (2013), a footprint is the signof a past activity and a piece of information allowing toreconstruct past events. A footprint may be a log entry or aweb history for example as a log entry gives informationabout software activities and web histories provide infor-mation about user's behaviour on the Web. Let F be a setcontaining all footprints related to a case, a footprint x2F isdefined by x ¼ {a2Af j xaf a} where:

Page 6: A complete formalized knowledge representation model for advanced digital forensics timeline analysis

Y. Chabot et al. / Digital Investigation 11 (2014) S95eS105S100

� Af is a set containing all the attributes which can be usedto describe a footprint. For example, a bookmark con-tains attributes such as the title of the bookmark, thedate of creation and the webpage pointed by thebookmark.

� af is the relation used to link a footprint with the attri-butes of Af used to described it.

Footprints are the only available information to definepast events and can be used by investigators to reconstructthe events which happened during an incident. However,the imperfect and incomplete nature of the footprints canlead to produce approximate results. It is therefore not al-ways possible to determine which event is associated witha given footprint. In addition, it is not always possible tofully reconstruct an event from a footprint. Thus, a footprintcan be used to identify one or several features:

� The temporal features of an event. For example, a foot-print extracted from the table moz_formhistory (data-base FormHistoryof thewebbrowser Firefox) canbeusedto establish the time at which a form field has been filled.

� A relation between an event and an object. For example,a footprint extracted from the table moz_historyvisits(database Places of the web browser Firefox) can beused to link an event representing the visit of a webpageto this webpage.

� A relation between an event and a subject. For example,all footprints produced by the web browser Firefox arestored in a folder named with the profile name of theuser. This allows to link each Firefox event to the userdesignated by this name.

� The features of an object. For example, a footprintextracted fromthe tablemoz_places (databasePlaces) canbe used to determine the URL and the title of a webpage.

� The features of a subject. For example, a footprint extrac-ted fromthe tablemoz_historyvisits (database Places) canbe used to determine the session identifier of a user.

RelationsTo link the entities presented before, four composed

relations which were introduced in previous part aredetailed here.

Subjects relations. ss is composed of two types of relationsto link an event e2Ewith a subject s2S and can be definedin the following way ss ¼ s isInvolved e∨s undergoes e:

� Relation of participation: s isInvolved e means that sinitiated or was involved in e. For example, the user of acomputer is involved in an event representing the loginto the session, etc.

� Relation of repercussion: s undergoes e means that s isaffected by the execution of e. For example, a user isaffected by the removal of one of his files, etc.

Objects relations. so is composed of four types of relations tolink an event e2E with an object o2O and can be definedin the following way so ¼ e creates o∨e removes o∨emodifieso∨e uses o:

� Relation of creation: e creates o means that o does notexist before the execution of e and that o is created by e.

� Relation of suppression: e removes o means that o doesnot exist anymore after the execution of e and that o isdeleted by e.

� Relation of modification: emodifies o means that one ormore attributes of o are modified during the executionof e.

� Relation of usage: e uses o means that one or more at-tributes of o are used by e to carry out its task.

Events relations. se is composed of relations used to link twoevents x,e2E and can be defined in the following wayse ¼ x composes e∨e composes x∨x causes e∨e causes x. In ourworks, x is Correlated e means that x is linked to e on thebasis of multiple criteria: use of common resources,participation of a common person or process, temporalposition of events. We distinguish two special cases of therelation of correlation:

� Relation of composition: x composes e means that x is anevent composing e. For example, an event representingaWindows session is composed of all events initiated bythe user during this session. Let x ¼ {txstart,txend,Sx,Ox,Ex}be an event composing e ¼ {testart,teend,Se,Oe,Ee}, therelation of composition implies a set of constraints. First,a temporal constraint requiring that sub-events takeplace during the parent event. Using Allen relations, ifx composes e then equal(x,e) or during(x,e) or starts(x,e)or finishes(x,e). Sub-events have also constraints onparticipating subjects as well as the objects with whichthe event interacts. If x composes e then Sx4Se andOx4Oe. Thus, x composes e ¼ [equal(x,e)∨during(x,e)∨starts(x,e)∨finishes(x,e)]∧(Sx4Se)∧(Ox4Oe).

� Relation of causality: x causes e means that x has tohappen to allow e to happen. For example, an eventdescribing the download of a file from a server is causedby the event describing the connection to this server. Anevent can have several causes and can be the cause ofseveral events. Let e ¼ {testart,teend,Se,Oe,Ee} be an eventcaused by x ¼ {txstart,txend,Sx,Ox,Ex}, the relation of cau-sality implies a temporal constraint requiring that thecause must happen before the consequence. Using Allenalgebra, x causes e ¼ [before(x,e)∨meets(x,e)∨overlaps(x,e)∨starts(x,e)]∨(Sx∩Se)∨(Ox∩Oe).

Footprints relation. sf is a relation used to link a footprintf2F with an entity en2{E�O�S}. This relation is calledRelation of support: f supports en means that f is used todeduce one or more attributes of en. We define a functionsupport which can be used to know the footprintsused to deduce a given entity: support(en2{E�O�S}) ¼ {f2F j f sf en}. For example, an entry of a webhistory can be used to reconstruct an event representingthe visit of a webpage by a user.

Crime sceneIn our works, we define a crime scene CS as an envi-

ronment in which an incident takes place and byCS ¼ {PCS,DCS}. PCS is a set containing the physical crime

Page 7: A complete formalized knowledge representation model for advanced digital forensics timeline analysis

Y. Chabot et al. / Digital Investigation 11 (2014) S95eS105 S101

scenes. At the beginning of an investigation, PCS is initial-ized with the localization where the incident takes place.However, in an investigation, the crime scene is not limitedto only one building. Due to network communication forexample, the initial physical crime scene may be extendedto a set of new physical crime scenes if one of the pro-tagonists communicated with an other person through thenetwork, downloaded a file from a remote server, etc. Inthese cases, the seizure of the remote machines (and byextension, the creation of new physical crime scenes anddigital crime scenes) should be taken into account as it maybe relevant for the investigation. DCS is a set containing thedigital crime scenes. Unlike Carrier (Carrier and Spafford,2003), there is no distinction between primary and sec-ondary digital crime scene in our works to simplify themodelization.

A crime scene is also related to events that take place onit ECS ¼ {EiCS∪EcCS∪EnCS}. For easier notation, we write hereE ¼ {Ei∪Ec∪En} where:

� Ei is a set containing illicit events as Ei ¼ {e2E j esi i,i2I}with I a set containing all actions considered as in-fractions by the laws and si the relation linking an eventwith an action. For example, an event representing theupload of defamatory documents to a website is anevent of Ei.

� Ec is a set containing the correlated events asEc ¼ {e2E j e si l,l2L,ese x,x2Ei}. This set contains alllegal events which are linkedwith a set of illicit events x.

� En ¼ {En(Ei∪Ec)} is a set containing the events which arenot relevant for the investigation.

Event reconstruction and analysis operators

Since investigators have ensure the preservation of thecrime scene, it becomes a protected static environmentcontaining a set of footprints. After the collection of all thefootprints of the crime scene, the goal of the eventreconstruction consists in moving from the static crimescene to a timeline describing the dynamics of an incidentwhich happened in the past. Describing an incident meansidentify all the events Einc ¼ Ei∪Ec using footprints of F.Events ordered chronologically describing an incident arecalled the scenario of the incident. In the SADFC approach,four operators are defined to carry out the event recon-struction. These operators are illustrated in the SectionCase study.

The aim of extraction operators is to identify and extractrelevant information contained in digital footprints fromvarious sources. Sources are chosen according to thedefinition of the perimeter of the crime scene. The rele-vancy or the irrelevancy of information contained infootprints is determined by the investigators according tothe goals of the investigation. For example, in a caseinvolving illegal downloads of files, the investigator willpay more attention to information related to the userbehaviour on the web while logs of word processingsoftware will be ignored. The mapping operators createentities (events, objects and subjects) associated to theextracted footprints. These operators take the form of

mapping rules allowing to connect attributes extractedfrom footprints and attributes of events, objects and sub-jects. A large part of the features of an event can bedetermined by extraction operators from footprintscollected in the crime scene. However, the identification ofsome kinds of features requires the use of advancedtechniques such as inference. The inference operators allowto deduce new knowledge about entities from existingknowledge. Unlike extraction operators which useknowledge of footprints, inference operators use theknowledge about events, objects and subjects (knowledgegenerated by mapping operators). The analysis operatorsare used to help the investigators during the interpretationof the timeline. These operators can be used to identifyrelations between events or to highlight the relevant in-formation of the timeline. In our works, we introduce anoperator dedicated to the identification of event correla-tions. The correlation between two events e,x2E ismeasured by the following function:

Correlationðe;xÞ¼CorrelationTðe;xÞþCorrelationSðe;xÞþCorrelationOðe;xÞþCorrelationKBRðe;xÞ

(1)

Correlation(e,x) can be weighted to allow to give moreimportance to one of the correlation functions. Correlatio-n(e,x) can also be ordered and thresholded to deal with datavolume constraints by selecting the most significant cor-relations. Those four correlations are described in thefollowing way:

Temporal Correlation, CorrelationT(e,x). First of all, a set ofassumptions about the temporal aspect is defined (ac-cording to the Allen algebra given in Table 2):

� The greater the relative difference between the twoevents before(e,x) is, the lower the temporal relatednessis and reciprocally.

� The temporal relatedness is high for functions meet-s(e,x), overlaps(e,x), during(e,x) and finishes(e,x) andmaximal for starts(e,x) and equals(e,x).

Thus;CorrelationT ðe; xÞ ¼ a*startsðe; xÞ þ a*equalsðe; xÞþmeetsðe; xÞ þ overlapsðe; xÞþ duringðe; xÞ þ finishesðe; xÞþ beforeðe; xÞ

(2)

where starts(e,x), equals(e,x), meets(e,x), overlaps(e,x), dur-ing(e,x), finishes(e,x) are binary functions andbefore(e,x) ¼ 1/(xtstart�etend). Previous assumptions statethat the more two events are close in time, the more it islikely that these events are correlated. Because of timegranularities, and multi-tasks computers, if two eventsstart at the same time, the relatedness is more important.That is why this importance can be highlighted with an a

factor.Subject Correlation, CorrelationS(e,x) and Object Correla-

tion, CorrelationO(e,x). These two correlations respectivelyquantify correlations regarding subjects involved in each

Page 8: A complete formalized knowledge representation model for advanced digital forensics timeline analysis

Y. Chabot et al. / Digital Investigation 11 (2014) S95eS105S102

event and objects used, generated, modified or removed byevents. The following hypothesis are defined according tothe core idea of several domains such as data mining, forexample in formal concept analysis (Ganter et al., 1997)which groups objects regarding to the attributes they shareor in statistics such as principal component analysis, whichgroups observations measuring how far they are spread(variance).

� The relatedness between e and x increases proportion-ally to the number of common subjects they shareregarding to relations of participation and repercussion.CorrelationS(e,x) ¼ jSe∩Sxj /max(jSej,jSxj).

� The relatedness between e and x increases proportion-ally to the number of objects they share regarding torelations of creation, suppression,modification and usage.CorrelationO(e,x) ¼ jOe∩Oxj /max(jOej,jOxj).

Rule-based Correlation, CorrelationKBR(e,x). In addition tothe previous factors (time, subject, object), rules based onexpert knowledge can be used to correlate events.CorrelationKBRðe; xÞ ¼

Pnr¼1rulerðe; xÞ with ruler(e,x) ¼ 1 if

the rule is satisfied and 0 otherwise. Therefore, the greaterthe number of satisfied rules is, the greater the value ofCorrelationKBR is. All correlation functions return a scorebetween 0 and 1 except CorrelationKBR. This specificity al-lows to give more importance to expert knowledge syn-thesized in the rules.

Case study

The aim of this section is to illustrate with an examplehow this knowledge model can be used to formalize acomputer forensics case and reconstruct the incident byusing the proposed operators. To illustrate the capabilitiesof our model, we designed a fictitious investigation con-cerning a company manager who contacted a privateinvestigator after suspicions about one of its employees.The manager believes that this employee uses the internetconnection of the company to illegally download files forpersonal purposes. As the investigation process model isnot presented yet, we used in this case study a processdesigned for the need of this paper and composed of fivesteps: the definition of the crime scene, the collection offootprints, the creation of entities (event, object, subject),the enrichment of the extracted knowledge and finally theconstruction and the analysis of the timeline.

Fig. 2. Overview of the knowledge generated during the case.

Crime scene definition

As a start of the investigation, the investigator evaluatesthe size of the crime scene. As suspicions concern theworkstation of the employee, the open-space of the com-pany where this computer is located is designated as thephysical crime scene. Based on the testimony and to reducethe complexity of the investigation, only the workstation ofthe suspected employee is added to theDCS set. To establishwhether or not the doubts on the employee are founded, theinvestigator in charge of the case seizes the machine used

by the employee and starts using the theoretical tools pro-posed in this paper to track the user's past activities.

Footprints collection

After defining the machine used by the employee ascrime scene, the investigator collects footprints on thislatter. To carry out this step, the extraction operators areused. According to the objectives of the investigation, theinvestigator chooses to collect footprints left by the webbrowser used by the employee as he knows that he can findinformation about downloads performed by the user. Theoutput of the extraction is a set containing several types ofweb browser footprints (id is a unique identifier for eachelement populating our model):

� fWebpage(id,webpageID,pageTitle,URL,hostname): foot-print giving information about a webpage.

� fVisit(id,date,session,pageID): footprint of a visit of awebpage.

� fBookmark(id,date,bookmarkTitle,pageID): footprint of acreation of a bookmark.

� fDownload(id,start,end,filename,pageSourceID,URLTarget):footprint of a file download where start and end are thedates on which the download starts and ends, page-SourceID is the webpage from where the download islaunched and URLTarget is the destination used to storethe file.

The result of the extraction is given in Fig. 3 and isillustrated in the left part of Fig. 2.

Entities creation

The output of this step is a set containing the entities(event, object, subject) which can be recovered using the

Page 9: A complete formalized knowledge representation model for advanced digital forensics timeline analysis

Fig. 3. Output of footprints extraction.

Y. Chabot et al. / Digital Investigation 11 (2014) S95eS105 S103

information left in the digital crime scene. To carry out thisstep, the mapping operators are used. The output of themapping is the set of entities given in Fig. 4 and illustrated inthe right part of Fig. 2. Entities are linked together by re-lations defined in Section Relations. support(x,y) means thatthe footprint identified by the id x has been used to createthe entity identify by the id y. participation(x,y) means thatthe subject x is involved in the event y.usage(x,y)means thatthe event x used the object y. A subject is defined by sub-ject(id,session)where id is a unique identifierwhile session isthe session number associated to the user. The hypothesisthat duration of bookmark creation and the visit of a web-page is null is used (start time equals to end time).

Knowledge enrichment

This step is particularly useful to improve the results ofthe analysis steps as it allows to discover new knowledgeabout entities. For example, the only available informationtodetermine the subject involved in an event generatedbyaweb browser is the session identifier found in some digitalfootprints extracted from Web browser. To identify the

Fig. 4. Entities created by t

subject involved in other events, we used an inferenceoperator based on the following assumption. Let ei be thefirst visit of awebpage for the session s, ej the last visit of s, tithe start date of ei and tj the enddate of ej, an event occurringon themachine in a datewithin the time interval defined byti and tj involves the person who owns the session s. In thiscase study, the person who is involved in the event even-tid¼19 (j) in Fig. 4 (creation of a bookmark for the webpage“http://www.warez.com”) is unknown. However, using theprevious inference rule, it is possible to infer that the user isthe subject subjectid¼13 (d) using the fact that this userperform a visit of a webpage before and after the eventeventid¼19 (j) (respectively eventid¼15 (f) and eventid¼16 (g)).Thus, the fact participation(13,19) can be added.

Timeline construction and analysis

After collecting knowledge about events and relatedentities, the last step is to build the associated timeline andanalyse it. A graphical representation of the timeline isgiven in Fig. 5. To find correlations between events, theoperator introduce in Section Event reconstruction and

he mapping process.

Page 10: A complete formalized knowledge representation model for advanced digital forensics timeline analysis

Fig. 5. Timeline of the incident.

Y. Chabot et al. / Digital Investigation 11 (2014) S95eS105S104

analysis operators is used. To illustrate the computation ofcorrelations, three events defined in Fig. 4 are used. In thisexample, the correlations between the events eventid¼15 (f)and eventid¼17 (h) (called “pair A”) and between the eventseventid¼15 (f) and eventid¼19 (j) (called “pair B”) arecomputed. First, the subject correlation and the objectcorrelation are computed. Regarding the pair A, the twoevents share neither subjects nor objects. Regarding thepair B, the two events use a common resource which is thewebpage “http://www.warez.com” (objectid¼11 (b)). Acommon subject is also involved in both events: the subjectidentified by the session number 14 (subjectid¼13 (d)). Sec-ond, the temporal correlation is computed. For both pairs,the case before(x,e) where an event occurs before the otheris used. Thus, the temporal correlation is greater for the pairB than the pair A as the relative difference between even-tid¼15 (f) and eventid¼19 (j) is lower than the difference be-tween eventid¼15 (f) and eventid¼17 (h). To conclude, the pairB is more correlated than the pair A due to the use of acommon object (objectid¼11 (b)), the participation of acommon subject (subjectid¼13 (d)) and the temporal prox-imity between these two events. At the end of the analysisstep, the investigator gets a timeline enriched with infor-mation about correlations between events. The investigatorstarts the interpretation of the results by identifyingeventid¼15 (f) which is a visit of the website “warez.com”

that appears to be a web platform providing links to illegalfiles. Then, correlation results show that eventid¼15 (f) andeventid¼19 (j) are highly correlated. This correlation allowsto conclude that the visit of the website is voluntary andthat this website has some relevance for the user (indeed,he uses a bookmark to facilitate a subsequent visit). Theanalysis of others events and others possible correlationscan lead to the conclusion than the user has visited inten-tionally the website “warez.com” and used this website todownload illegal files.

Conclusion and future works

In digital investigations, one of the most importantchallenge is the reconstruction and the analysis of pastevents, because of the heterogeneity of data and the vol-ume of data to process. To answer this challenge we pre-sent the SADFC approach. It allows to reduce the tediouscharacter of the analysis thanks to automatic analysisprocesses helpful for the investigators. Thus, investigators

can focus on the tasks for which their expertise andexperience are most needed such as interpretation of re-sults, validation of hypotheses, etc. Moreover, SADFC im-plements mechanisms allowing to satisfy lawrequirements. These mechanisms rely on formal defini-tions on which this paper focuses: a formalization of theevent reconstruction containing formal definitions of en-tities involved in an incident and four sets of operatorsallowing to extract, manipulate and analyse the knowledgecontained in the model. The case study presented in theprevious section has shown the relevance of this model.Indeed, the use of a semantically rich representation usingnew semantic aspects (in addition to temporal aspects)allows to build advanced analysis processes. In particular,we have presented the possibilities introduced by ourmodel to correlate events. The identification of correlatedevents enables to highlight valuable information for theinvestigators.

Future works will be to develop the two other com-ponents of our approach. First, we will design an investi-gation process model providing a framework for thedevelopment of automated investigation tools. This pro-cess model will include a definition of the steps composinga digital investigation, from the preservation of the crimescene to the presentation of conclusions to Justice passingthrough the definition of the crime scene, the collection offootprints, the construction and the analysis of the time-line. During the development of this model, the focus is onthe precision and completeness of the definition of eachstep as well as its inputs and outputs. Second, we willimplement the theoretical model presented in this paper.This implementation will consist of a reference architec-ture based on an ontology derived from our knowledgemodel coupled with inference mechanisms and analysisalgorithms. This architecture will be used to validate ourmodel.

Acknowledgements

The above work is a part of a collaborative researchproject between the CheckSem team (University of Bur-gundy and the UCD School of Computer Science andInformatics) and is supported by a grant from UniversityCollege Dublin and the Burgundy region (France). The au-thors would like to thank Dr Florence Mendes for valuablecomments on the formal part of this paper.

Page 11: A complete formalized knowledge representation model for advanced digital forensics timeline analysis

Y. Chabot et al. / Digital Investigation 11 (2014) S95eS105 S105

References

Allen J. Maintaining knowledge about temporal intervals. Commun ACM1983;26:832e43.

Baryamureeba V, Tushabe F. The enhanced digital investigation processmodel. In: Proceedings of the Fourth digital forensic research Work-shop. Citeseer; 2004.

Buchholz F, Falk C. Design and implementation of zeitline: a forensictimeline editor. In: Digital forensic research workshop; 2005.

Carrier B, Spafford E. An event-based digital forensic investigationframework. In: Digital forensic research workshop; 2004.

Carrier B, Spafford E. Getting physical with the digital investigation pro-cess. Int J Digital Evid 2003;2:1e20.

Carrier BD, Spafford EH. Defining event reconstruction of digital crimescenes. J Forensic Sci 2004b;49:1291.

Chen K, Clark A, De Vel O, Mohay G. Ecf-event correlation for forensics. In:First Australian Computer Network and Information Forensics Con-ference. Perth, Australia: Edith Cowan University; 2003. pp. 1e10.

Ciardhu�ain S. An extended model of cybercrime investigations. Int JDigital Evid 2004;3:1e22.

Forte DV. The art of log correlation: tools and techniques for correlatingevents and log files. Comput Fraud Secur 2004;2004:15e7.

Ganter B, Wille R, Franzke C. Formal concept analysis: mathematicalfoundations. Springer-Verlag New York, Inc.; 1997.

Gladyshev P, Patel A. Finite state machine approach to digital eventreconstruction. Digit Investig 2004;1:130e49.

Gladyshev P, Patel A. Formalising event time bounding in digital in-vestigations. Int J Digital Evid 2005;4:1e14.

Gruber TR. A translation approach to portable ontology specifications.Knowl Acquisit 1993;5:199e220.

Gudhjonsson K. Mastering the super timeline with log2timeline. SANSReading Room; 2010.

Hargreaves C, Patterson J. An automated timeline reconstruction approachfor digital forensic investigations. Digit Investig 2012;9:69e79.

Khan M, Wakeman I. Machine learning for post-event timeline recon-struction. In: First Conference on Advances in Computer Security andForensics Liverpool, UK; 2006. pp. 112e21.

Kohn M, Eloff J, Olivier M. Framework for a digital forensic investigation.In: Proceedings of the ISSA 2006 from Insight to Foresight Confer-ence, Sandton, South Africa; 2006 (published electronically).

Liebig C, Cilia M, Buchmann A. Event composition in time-dependentdistributed systems. In: Proceedings of the Fourth IECIS Interna-tional Conference on Cooperative Information Systems. Washington,DC, USA: IEEE Computer Society; 1999. pp. 70e8.

Martinez-Cruz C, Blanco IJ, Vila MA. Ontologies versus relational data-bases: are they so different? a comparison. Artif Intell Rev 2012;38:271e90.

Olsson J, Boldt M. Computer forensic timeline visualization tool. DigitInvestig 2009;6:78e87.

Palmer G. A road map for digital forensic research. In: First DigitalForensic Research Workshop, Utica, New York; 2001. pp. 27e30.

Ribaux O. Science forensique Http://www.criminologie.com/article/science-forensique; 2013.

Richard III G, Roussev V. Digital forensics tools: the next generation.Digital crime and forensic science in cyberspace. Idea Group Pub-lishing; 2006. pp. 75e90.

Schatz B, Mohay G, Clark A. Rich event representation for computer fo-rensics'. In: Proceedings of the Fifth Asia-Pacific Industrial Engineer-ing and Management Systems Conference (APIEMS 2004), vol. 2;2004. pp. 1e16.

Schatz B, Mohay G, Clark A. A correlation method for establishing prov-enance of timestamps in digital evidence. Digit Invest 2006;3:98e107.

Schatz B, Mohay GM, Clark A. Generalising event forensics across multipledomains. In: Valli C, editor. 2nd Australian Computer Networks In-formation and Forensics Conference. Perth, Australia: School ofComputer Networks Information and Forensics Conference, EdithCowan University; 2004b. pp. 136e44.

Schulze-Kremer S. Ontologies for molecular biology. In: Proceedings ofthe Third Pacific Symposium on Biocomputing, vol. 3. AAAI Press;1998. pp. 695e706.

Stephenson P. A comprehensive approach to digital incident investiga-tion. Inf Secur Tech Rep 2003;8:42e54.

Vanlande R, Nicolle C, Cruz C. Ifc and building lifecycle management.Automation Constr 2008;18:70e8.