
6a Description of the proposed research

Overview

Vast amounts of data are being collected and logged in computer networks. These data are often monitored for intrusion detection, e.g. by matching known malicious IP-addresses or by searching for specific packet payloads. Suspicious traffic is then investigated manually by security analysts to find new intrusions or malicious activity. A security analyst can investigate only a limited amount of traffic, however, and much of the wealth of this data is therefore completely ignored. In LEMMA, we aim to remedy this situation by developing tools that will help to automatically analyze malware activity in all of this data. Our solution consists of three innovative contributions:

1. A theoretical study with the aim of finding a type of timed state machine model with parameters that is both learnable from data and able to succinctly represent many different malware behaviors.

2. A novel algorithm for learning these models from streaming network traffic instead of storing all observations in memory.

3. A new information fusion method that combines the knowledge contained in state machine models that have been learned from different sources. Currently, such fusion methods do not exist for state machine models.

Together, the learning and fusion methods will be used to discover unknown malware infections, analyze their behavior, and draw system-wide conclusions. In particular, we will apply them to two test-cases (see Valorization):

1. Localizing infected hosts within network architectures using distributed network monitors from SURFnet and the ministry of Security and Justice together with Madison-Gurkha.

2. Discovering suspicious network connections in nation-wide network streams monitored at the NCSC.

The time and technology are ripe for tackling the challenge of analyzing network traffic with automated learning techniques:

1. Experiments in automated analysis of network data have already demonstrated interesting results [38,39,40,41,42], despite using fairly basic learning techniques.

2. State machine learning has reached the stage where it is capable of tackling systems of this size and complexity [34,35,72].

3. First attempts at dealing with streaming data [78,79] are positive.

Learned state machines have already proved very useful in detailed reverse engineering of unknown malware behaviour [39]; using them to detect the presence and location of malware is a logical next step.

Background

Behavior-based malware analysis

Malware analysis is the task of analyzing the inner workings of malicious software. In addition to uncovering its true purpose, such an analysis is useful for fingerprinting malware programs [1,2], detecting new malware intrusions [3,4,5], infiltrating malware programs [6], developing targeted defenses [7], and fixing security leaks [8]. Malware analysis comes in two flavors: static and dynamic. Static analysis operates on the executable binary of a malware program using disassembly (see, e.g., [9]) and emulation tools (see, e.g., [10,11]). This task is steadily becoming more difficult due to the use of techniques that thwart analysis, such as code obfuscation, encryption, and polymorphic/self-modifying code [12]. In contrast, dynamic malware analysis monitors the run-time execution behavior of malware, which is much more difficult to conceal [13]. Consequently, a large amount of research has been devoted to the development of effective behavior-based malware analysis tools, see [14,15] for an overview.

Both static and dynamic malware analysis are laborious manual tasks. An important issue is therefore to direct malware investigation to the most suspicious or interesting programs, e.g., using techniques from machine learning [13,16,17,18]. Although these methods have been shown to lead to a small number of false positives and reasonable detection rates, a downside is that the resulting models (subspaces of a high-dimensional feature space) are not intuitive representations of software behavior and therefore of limited use in subsequent investigations. Instead, we propose to develop a novel tool for automated behavior-based analysis of malware using recently developed techniques for learning (timed) state machines.

Figure 1. State machine learning. From execution traces a state machine model is constructed.

State machine learning

State machines are key models for the design and analysis of computer systems [19]. Consequently, they are also used in the security domain to detect and analyze malicious software [20]. The problem of learning (reverse-engineering) finite state machines automatically from trace data [21] (see Figure 1) enjoys a lot of interest from the software engineering and formal methods communities. These communities use learned state machines to gain insight into complex software systems and to test their properties using model checking [22] and testing techniques [19]. In the literature, this approach has been used for learning and analyzing models for different types of complex software systems such as web-services [23], X11 windowing programs [24], communication protocols [25], the biometric passport [26] (see Figure 2), and Java programs [27]. Process mining [28] is also a form of state machine learning, which focuses on learning systems that display a large amount of concurrency such as business processes, represented as Petri-nets [29]. Due to the difficulty of learning Petri-nets [30], however, most work in process mining aims to learn restricted classes of Petri-nets, for instance nets in which every transition corresponds to a unique event.

Figure 2. A state machine model of the biometric passport, that we obtained fully automatically by exploring and observing its interactions [26]. The diagonal down from the Start state is a normal communication session, where the loops correspond to file reading operations. The arrows back to Start correspond to communication which cause the session to be aborted.

Formally, state machine learning can be seen as a grammatical inference [31] problem in which the traces are modeled as the words of a language, and the goal is to find a model for this language, such as a deterministic finite state automaton (DFA) [32]. By convincingly winning the recent 2010 international DFA learning competition [33], dr. Verwer (applicant) demonstrated that realistic DFA software models can be learned accurately from sparse data [34]. The competition organization called his winning algorithm "a major step forward in software model inference techniques" [33]. Furthermore, in the recently started STW ITALIA project [35], prof. Vaandrager (applicant) has shown how to scale active state machine learning (providing input and reading the corresponding output) up to learning models of real-world software systems with thousands of states and transitions, see, e.g., [36]. These algorithms provide important starting points for the passive approach adopted in LEMMA due to the close links between active and passive learning [37].
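As a rough illustration of this grammatical inference view, the following minimal Python sketch builds a prefix tree acceptor from a handful of traces. A prefix tree acceptor is a common starting point for state-merging DFA learners. The sketch is not the competition-winning algorithm of [34]; the function names and the event names (syn, ack, and so on) are hypothetical and chosen only for illustration.

```python
def build_prefix_tree(traces):
    """Build a prefix tree acceptor from a list of event sequences.

    Returns (transitions, accepting), where transitions maps
    state -> {symbol: next_state} and accepting is the set of
    states in which at least one observed trace ended.
    """
    transitions = {0: {}}  # state 0 is the root (start) state
    accepting = set()
    for trace in traces:
        state = 0
        for symbol in trace:
            if symbol not in transitions[state]:
                # Unseen continuation: allocate a fresh state for it.
                new_state = len(transitions)
                transitions[state][symbol] = new_state
                transitions[new_state] = {}
            state = transitions[state][symbol]
        accepting.add(state)
    return transitions, accepting


def accepts(transitions, accepting, trace):
    """Return True if the trace ends in an accepting state."""
    state = 0
    for symbol in trace:
        if symbol not in transitions[state]:
            return False
        state = transitions[state][symbol]
    return state in accepting


# Hypothetical traces over an already abstracted event alphabet.
traces = [["syn", "ack", "data"], ["syn", "ack", "fin"], ["syn", "rst"]]
pta, final_states = build_prefix_tree(traces)
```

A state-merging learner would subsequently merge compatible states of this tree in order to generalize beyond the observed traces; that step is omitted here.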

State machine learning for malware analysis

There exist some methods that use state machine learning of malware for analysis purposes [38,39,40,41,42]. Some work is necessary to abstract the network packets or system calls generated by malware into a finite set of symbols, for instance using tools for message format reverse engineering [43]. The resulting symbolic sequences are then provided to a state machine learner that will learn the behavioral structure underlying these sequences. A potential complication for learning state machines from network communication is that more and more malware communication is encrypted [44]. In this case, however, useful models can still be obtained using decryption techniques [45], or using the remaining data such as timing and packet sizes [46] and Netflow traces [47]. Although the state machine models learned using these tools are often small and very limited in functionality, they have been used successfully in several important analyses:

● visualizing malware protocols to aid reverse engineering and analysis [39, 42],

● emulating malware communication in order to infiltrate malware systems [41, 42],

● uncovering design flaws in malware, which simplifies nullifying them [38], and

● discovering backchannel communication that needs to be investigated [38].

Information fusion

The data under analysis in LEMMA is distributed in nature: a network contains many hosts within different network layers that all generate different kinds of network traffic. The field of research that studies techniques that can be used to combine such distributed data in order to reach a global conclusion/decision is known as information fusion [48]. Technically, data fusion techniques combine data from multiple information sources, such as (mobile) sensors, databases, and humans (soft fusion), in order to obtain more accurate probability estimates and models than could be achieved by considering a single information source.

In recent years, computer security experts have started to recognize the importance of information fusion in the cyber security domain, see, e.g., [49,50,51,52,53,54]. Since cyber-attacks are often coordinated, correlations between events occurring at different hosts can be exploited using information fusion techniques to improve, for example, intrusion detection performance. These methods use different fusion techniques such as Dempster-Shafer theory [55] or Bayesian networks [56]. Dr. Pavlin (applicant) and Thales research (project partner) have in recent years built up expertise in several such information fusion methods, for example in the FP7 DIADEM project [57,58,59,60,61,62,63], see Figure 3.

Figure 3. (a) An information fusion system developed to detect gas leaks in the Rotterdam harbor by researchers from Thales [60]. The fusion system is based on a Bayesian network describing physical properties of gas propagation, expert knowledge, and sensory information. (b) A more detailed view of the gas propagation model.

Limitations of current approaches

Current behavior-based malware analysis systems are not well-suited for our two test-cases (see Valorization): it is unclear how the resulting models can be used in subsequent analysis, and the learning algorithms require labeled execution trace data. In contrast, state machines are key models for analyzing software systems, and can be learned from unlabeled data. However, all current state machine learning and process mining tools share some properties that limit their applicability to our test-cases:

1. The state machine models used only model a language over event types. Execution traces, however, also contain timestamps and parameter values, and using these in behavioral models has a lot of potential due to the dependence of network traffic and malware behavior on timing and parameters [64,65,66,67]. For instance, human users generate traffic that is much slower and less regular than that of software systems [66].

2. They store all observations in memory. This creates important limitations since a malware program continuously generates new execution traces, i.e., malware data is streaming in nature [68]. Since important malware behavior can be spread over several days of data [69], it is impossible to learn this behavior using the current tools.

3. The models resulting from these tools are learned from a single source. In practice, however, multiple sources of execution (i.e., network) trace data are often available. It is a waste not to combine the information from these sources, especially since it is well-known that malware can behave differently in different contexts, see, e.g., [70,71].

The research in this proposal is geared towards addressing and removing these limitations of state machine learning. This constitutes a major step in the applicability of state machine learning tools in general and for malware analysis in particular.

Overall aim & Key Objectives

In contrast to the state machine learners currently used in malware analysis [38,39,40,41,42], in this project we will develop a novel state machine learner that is dedicated to malware and malware analysis. Our research team has a very strong background in state machine learning, demonstrated by the PhD thesis and award-winning algorithm of dr. Verwer [34,72], and by the STW ITALIA project [35]. In addition, it contains leading experts in information fusion and security, as shown by the EU FP7 projects DIADEM [57] and C-DAX (starting soon). Our team thus has all the expertise required to remove the aforementioned limitations, which we will address by focusing on the following two main research challenges:

1. (a) Develop an advanced learner for extended state machines that includes time and parameter values, and (b) that learns these models accurately from streaming data, as opposed to a fixed data set stored in memory.

2. Develop methods for fusing and sharing the information available in learned state machines with each other, information from the network, and expert knowledge.

Tackling these challenges will not only increase the range of learnable state machine models, but also provide an improved insight into complex malware behavior. We will demonstrate the power and uses of our method through the two test-cases at project partners (see Valorization).

Methodology & research plan

Challenge 1

Approach.

a. Dr. Verwer recently provided the first efficient algorithms for learning a type of timed state machine [72,73,74], see Figure 4. In addition, algorithms for learning parameter dependent state machines have been developed in the ITALIA project [35]. Our methods will be based on these recently developed learning methods. We have to make a trade-off between the capabilities of a model and the ability to learn it accurately (see, e.g., [74]). Through discussion with security experts and based on results from learning theory (see, e.g., [75,76]), a time and parameter dependent state machine will be selected that is both efficiently learnable and relevant for malware analysis.

b. Next, learning algorithms need to be developed that learn these models from streaming data. Such methods have already been proposed for malware analysis tools based on machine learning [17], and intrusion detection systems based on data mining [77] (Thuraisingham, advisor). Furthermore, very recently, some methods have been developed that make such an adaptation for probabilistic [78] (Gavaldà, advisor) and real-time state machines [79], based on counting frequently occurring prefixes from data streams [80]. Unfortunately, this is not a full solution for security since especially the infrequently occurring prefixes (anomalies) are indicative of malicious behavior. We will therefore investigate how to use state-of-the-art anomaly detection methods (see, e.g., [81]) in order to obtain interesting infrequent prefixes for state machine learning.
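To illustrate what counting frequently occurring prefixes in bounded memory can look like, the sketch below applies the Misra-Gries frequent-items summary to trace prefixes. This is a stand-in chosen for brevity: the methods in [78,79,80] may rely on a different sketching technique, and the function names as well as the capacity and max_len parameters are assumptions made for this example.

```python
def misra_gries_update(counters, item, capacity):
    """Process one item with a Misra-Gries summary of at most `capacity` counters."""
    if item in counters:
        counters[item] += 1
    elif len(counters) < capacity:
        counters[item] = 1
    else:
        # Summary is full: decrement every counter and drop those reaching zero.
        for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]


def frequent_prefixes(trace_stream, capacity, max_len):
    """Summarize prefixes (up to max_len events) of traces read one at a time.

    Memory use is bounded by `capacity`, independent of the stream length.
    The stored counts are lower bounds on the true prefix frequencies.
    """
    counters = {}
    for trace in trace_stream:
        for end in range(1, min(len(trace), max_len) + 1):
            misra_gries_update(counters, tuple(trace[:end]), capacity)
    return counters


def example_stream():
    """Hypothetical stream: five identical traces followed by one rare trace."""
    for _ in range(5):
        yield ["a", "b"]
    yield ["x", "y"]


summary = frequent_prefixes(example_stream(), capacity=2, max_len=2)
```

By design, such a summary discards rare prefixes, which is exactly why frequency counting alone does not suffice for security applications and needs to be complemented with anomaly detection.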

Method.

a. We aim to add the maximum number of relevant protocol values to a state machine such that it is still efficiently learnable. We expect it to be similar to those recently studied by dr. Verwer and prof. Vaandrager. The model will be selected based on this work and interviews with security experts. Its learning properties will be analyzed theoretically together with prof. Gavaldà.

b. We will then develop a streaming data algorithm for the chosen model based on recently proposed methods [35,72,78,79]. We will apply this algorithm in order to obtain two state machine models: one for frequent behavior and one for interesting infrequent behavior. The anomaly detection methods required to obtain the second model will be selected using the expertise of dr. Thuraisingham.

Figure 4. A simple timed state machine learned from artificial execution traces [72]. In addition to event types, the events contain time bounds that restrict the time delay between events.
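The role of time bounds on transitions can be made concrete with a small sketch. The code below is only an assumed representation of a timed state machine, in which each transition carries a lower and an upper bound on the delay since the previous event; the models of [72] may be defined differently, and the state names, event names, and delays (taken to be in milliseconds) are hypothetical.

```python
def run_timed(transitions, start, timed_trace):
    """Run a timed trace through a state machine with delay guards.

    transitions maps (state, symbol) -> list of (low, high, target) guards.
    timed_trace is a list of (symbol, delay) pairs.
    Returns the state reached, or None if no guard matches some event.
    """
    state = start
    for symbol, delay in timed_trace:
        for low, high, target in transitions.get((state, symbol), []):
            if low <= delay <= high:
                state = target
                break
        else:
            # No transition for this symbol whose time bounds admit the delay.
            return None
    return state


# Hypothetical model: the same event leads to different states
# depending on how quickly it follows the previous one.
transitions = {
    ("idle", "request"): [(0, 1000, "waiting")],
    ("waiting", "reply"): [(0, 50, "fast"), (51, 5000, "slow")],
}

fast_result = run_timed(transitions, "idle", [("request", 10), ("reply", 20)])
slow_result = run_timed(transitions, "idle", [("request", 10), ("reply", 300)])
rejected = run_timed(transitions, "idle", [("request", 10), ("reply", 9000)])
```

In this example the two replies consist of the same event type and are distinguished by timing alone, which a model over event types only cannot express.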

Challenge 2

Approach.

The absence of suspicious traces in one source and presence in another can provide important information on the existence of malware in a network. Such information is successfully being used in collaborative intrusion detection systems, see, e.g., [82]. These methods rely on information fusion techniques that combine signals from these different sources. In LEMMA, we will extend the information fusion methods currently used by Thales for combining information from different sensors [57,62,63] to the fusion of information from state machines and computer security experts. The main research question for this part of the project is thus how to map learned state machines to sensor values. A promising approach that we will pursue is to use the probability distribution induced by a state machine model (by counting transition occurrences) to compute distances/dissimilarities between models, e.g., their Kullback-Leibler divergence [83], which are then modeled as sensors.
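As a minimal sketch of such a dissimilarity, the code below estimates a distribution from occurrence counts and computes the Kullback-Leibler divergence between two such distributions. It makes two loud assumptions: counts of consecutive event pairs are used as a simple stand-in for the transition counts of a learned state machine (aligning the states of two different models is part of the open research question), and additive smoothing is applied so that the divergence stays finite. All function names are hypothetical.

```python
import math
from collections import Counter


def event_pair_counts(traces):
    """Count consecutive event pairs, a stand-in for transition occurrences."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    return counts


def smoothed_distribution(counts, support, smoothing=1.0):
    """Turn counts into probabilities over `support` with additive smoothing."""
    total = sum(counts[t] for t in support) + smoothing * len(support)
    return {t: (counts[t] + smoothing) / total for t in support}


def model_divergence(traces_p, traces_q, smoothing=1.0):
    """Kullback-Leibler divergence D(P || Q) between two count-based models."""
    counts_p = event_pair_counts(traces_p)
    counts_q = event_pair_counts(traces_q)
    support = set(counts_p) | set(counts_q)
    p = smoothed_distribution(counts_p, support, smoothing)
    q = smoothed_distribution(counts_q, support, smoothing)
    return sum(p[t] * math.log(p[t] / q[t]) for t in support)


# Hypothetical traffic observed at two different monitors.
monitor_a = [["syn", "ack", "data"], ["syn", "ack", "fin"]]
monitor_b = [["syn", "rst"], ["syn", "rst"]]

same = model_divergence(monitor_a, monitor_a)
different = model_divergence(monitor_a, monitor_b)
```

Note that the Kullback-Leibler divergence is not symmetric, so a fusion system using it as a sensor value would have to fix a direction or use a symmetrized variant.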

Method.

We will investigate the different existing information fusion methods and ways to compute dissimilarities between learned state machine models [84,85]. By analyzing network traffic from known malware, obtained from the NCSC and SURFnet (see Section 4), state machines will be learned that capture the behavior of malware traffic. The distances between these malicious models and learned models are then used as sensor values. The expert knowledge will be obtained by analyzing malicious state machine models together with security experts. This resulting fusion system will be used to determine: models that are dissimilar to all others and thus possibly an unknown intrusion, the similarity of learned models with models of known malicious behavior, and the unique parts of models learned from malware traffic, i.e., behavioral fingerprints.

Evaluation

The individual techniques of all challenges will be evaluated on real-world benchmarks from existing malware state machine learners [38,39,40,41,42]. The final combined algorithm will be evaluated using the two test cases from our project partners and compared with existing behavior-based malware analysis systems, e.g., [13,16,17].

Innovative elements of the topic and approach

The models, algorithms, and tools developed during this project are innovative contributions on multiple fronts:

1. Learning timed state machine models with parameters is a new direction in state machine learning that has a lot of potential for malware analysis.

2. Only very recently some initial studies that aim to adapt state machine learning to the setting of streaming data have been published [78,79]. Furthermore, learning state machines from infrequently occurring traces for security is entirely novel.

3. There currently exists no general information fusion method for learned state machines. Such a method will be very valuable in many application domains.

4. The tools are innovative contributions that enable security experts to analyze vast amounts of network traffic.

6b Multidisciplinarity

In LEMMA, we aim to develop new methods for learning finite state machines and the means to analyze them. In this multidisciplinary research project, we will combine techniques from Grammatical Inference, Formal Methods, Probabilistic Reasoning, and Security. Our research has a high potential to have a large impact in cyber security since we develop and combine state-of-the-art techniques and tools from these areas. In addition, to the best of our knowledge, the combination of learning and analysis methods proposed in LEMMA is unique, and therefore likely to lead to new developments in grammatical inference [31], and its applications in Software Engineering [20], Bioinformatics [86], Speech and Image Processing [87,88], Process Modeling [89], Robotics and Control [90], Behavioral Profiling [91], and Opponent Modeling [92].

3000 words

7. Valorisation and relevance

7a Valorisation

This proposal addresses key interests of the two commercial partners in the consortium, Madison Gurkha and Thales, hence their commitment to invest both cash and person months in the project. Both companies already have network monitoring tools that they develop and use in their everyday practice. The techniques developed and implemented in LEMMA can be directly incorporated into these tools. Thales and Madison-Gurkha are committing resources to the project for both implementation and subsequent experimentation, which will provide vital feedback for the further development of the techniques and makes commercialization of the LEMMA tools very likely.

Whereas Madison Gurkha will use the technology in services they supply to customers, and Thales additionally markets tools that will be sold to customers, the other partners in the project (WODC, NCSC, and SURFnet) are the ultimate end users of this type of technology. As such, they provide concrete settings and case studies for applying the techniques developed in the project. Moreover, NCSC and SURFnet have a lot more technical expertise than a typical end-user, so they are ideal partners to discuss their needs and requirements, and their experience in data and malware analysis will be a useful input to the project.

Cooperation:

The research in LEMMA will be carried out in close collaboration between the three main partners: RU, Madison-Gurkha, and Thales. The postdoc and PhD researcher employed on the project will be based in Nijmegen. Given that the PhD research is very closely linked to the research performed at D-CIS lab, both the postdoc and the PhD student will spend one day at D-CIS to ensure optimal collaboration. When working on the use-cases, both the postdoc and the PhD student will spend several days per week at the NCSC in The Hague and at Madison-Gurkha in Eindhoven. The cooperation between researchers and partners will be focused on successfully completing our two use-cases, displayed in Figures 5 and 6. We discuss these two cases below.

Use case 1:

One of the security services provided by Madison-Gurkha is to identify security issues in a customer’s network. One of the difficulties of this service is that the customer often does not know what problem (s)he has. The current method for determining such security issues is to manually analyze machines in the network until the problem is found. This results in a lot of network downtime, which, depending on the nature of the network, is often very costly. Instead, the tools developed in LEMMA will be used to passively listen to network communication in order to find likely locations of these security problems (Figure 5), greatly reducing both the downtime of the network and the manual analysis effort.

Madison-Gurkha, the NCSC, SURFnet, and the WODC can all provide access to very large industrial networks. With the help of experts from Madison-Gurkha, we will install monitors on at least one of these networks, which will then be used for malware detection and localization. The expertise of Madison-Gurkha in this area will be very useful when selecting interesting data sources to monitor within this network and reading, preprocessing, and converting this data such that it is usable by our tools. In addition, the obtained results will be analyzed together with their security experts.

Figure 5. Malware localization in a large network. Several monitors are placed at strategic locations within the network. These sniff packets, filter interesting ones, detect intrusions, and continuously learn state machines. These state machines represent the different kinds of traffic behavior on the network. After leaving these monitors alone for about a week, an analyst gathers all the information from these state machines in an information fusion system. This system highlights the similarities and differences between the patterns of traffic over these locations, which the analyst then uses to identify a pattern indicative of a malware infection.

Use case 2:

A particular problem the NCSC are facing is that more and more data is being collected, for instance from network monitors or honeypots, but that the full potential for analysis of this data is not utilized due to the lack of analysis tools. In particular, the NCSC currently only monitors such data for known malware and sporadically analyzes collected data in small research projects. LEMMA will provide the first learning approach that is able to continuously read network data for several weeks or months and compile them into state machines. The fusion tool developed in LEMMA will then help to analyze this malware and find new infections (Figure 6), which greatly reduces the effort needed by security analysts. The LEMMA tools will therefore be a valuable extension of the network monitors the NCSC are currently setting up at several high-risk locations.

We will use data collected from network monitors of the NCSC. These monitors are based on the SURFids framework developed by SURFnet. We will therefore cooperate with SURFnet in order to run our learning tool on streaming network trace data from different locations in the Netherlands. SURFnet will provide all the tooling required to set up the nationwide network sniffers that will provide input for our learning algorithm. Experts from the NCSC will determine interesting locations for these sniffers and help with the selection of suitable data sources. The results will be analyzed together with the same experts.

Figure 6. Behavioral fingerprinting of command and control traffic. Network monitors from the NCSC and lists of known malicious IP-addresses are used to monitor traffic to and from known botnet command and control servers. These monitors continuously learn state machine descriptions of this traffic and other traffic from the same hosts. Once the models are sufficiently detailed, they are collected and analyzed in the information fusion system in order to find behavior that is unique for the command and control traffic from this particular botnet, i.e. a behavioral fingerprint. This fingerprint is matched with state machines learned from other traffic sources. If a match is found, this indicates the existence of another command and control server and can be used to add another malicious IP-address to the list.

Apart from the two use cases above, there are many other possibilities for applying the technologies developed in the project that we may consider, depending on time and progress: in a current collaboration between SURFnet and RU we have obtained network traffic from the HLUX2 botnet, and ongoing research into smart grid security (through the ENCS and FP7 project C-DAX) may produce a case study from that application domain. In addition, our methods are useful to perform many other analyses on both malicious and benign software such as automatic fingerprinting [93], intrusion detection [94], model-based fuzz testing [95], and stateful traffic analysis [96].

Source code publication strategy

Although several of the existing tools built by the commercial partners in the project are proprietary, the researchers are allowed to make use of them for the duration of the project. The core algorithms developed in LEMMA will be published as open-source tools under the OpenBSD license. This allows open dissemination to the research community, and may spark research by others on improving them, while allowing the commercial partners to integrate these tools into their existing tool sets.

Research valorisation

The solid network and expertise of prof. Vaandrager, dr. Verwer, and prof. Gavaldà in formal methods, grammatical inference, and machine learning create a lot of potential for cooperation with these communities. In particular, since dr. Verwer will be involved in the organisation of state machine learning competitions, see, e.g., [98], we will use non-proprietary malware data in several of these, thereby stimulating the worldwide development of learning techniques for malware analysis.

The security research of dr. Poll and the security group at the RU provides good opportunities to ensure that the techniques pioneered in this project are disseminated to the wider security community.

Through contacts with prof. Thuraisingham we will explore wider international possibilities of applying the technology developed in LEMMA. With her scientific expertise and network, Thuraisingham may provide the project not only with technical improvements (see Sect 6), but also additional uses and contacts with interested users in the USA cyber security community and government.

7b Relevance

LEMMA addresses one of the core challenges in cyber security, namely understanding the behaviour of systems of ever-growing complexity, and effectively using the vast amounts of data that can be produced in monitoring these systems to extract useful information and insights.

Impact on NCSRA action lines

The research in LEMMA clearly falls under the research topic Malware in the NCSRA, but it is also relevant for 3 additional research topics listed in the NCSRA:

1. Forensics, where analyzing large data streams is an important bottleneck;

2. Secure design, Tooling and Engineering, where models learned from system behaviour can be used to look for design flaws and as a valuable starting point for fuzz-testing;

3. Operational cyber capacities, where techniques to detect and analyze new behaviour in data streams are very valuable.

Value for society and industry

Malware is becoming an increasing threat for organisations, both commercial and non-profit, and it might even be a threat to national security. The increasing use of techniques such as code obfuscation, polymorphic code, and encryption allows modern malware to easily bypass virus scanning and Intrusion Detection Systems and makes it very hard to find in systems and networks. Nowadays, international corporations should thus no longer expect their internal networks to be safe, but must assume their networks and systems to be compromised and take additional actions to ensure that their network and data are safe. One of the key challenges in cyber security is therefore to stay ahead in the capability to detect malicious behavior in cyberspace. By providing the ability to learn descriptions of the behavior in continuous network traffic, and the tooling necessary to analyze these descriptions, the LEMMA tools constitute a big improvement over existing network intrusion detection methods. LEMMA is therefore likely to result in new tools and services provided by our commercial partners and can in the long run help to prevent serious societal damage caused by malware intrusions. Our two use cases clearly show the added value of the research proposed in LEMMA: large amounts of data are currently being monitored, but no good tool exists to analyze this data.

Relevance for the cyber security research

The tools and theory developed in LEMMA help to make sense of the vast amounts of data that are generated when monitoring networks. By focusing on learning state machine models, the research in LEMMA provides important steps towards being able to perform highly detailed analyses, for instance using model checking and testing tools. Such analyses have already been applied to handmade state machine models in order to discover design flaws, see, e.g., [98]. Very recently, we successfully applied automated state machine learning to electronic banking transactions [99]. Such automated learning could even be used to directly reveal a security flaw in internet banking protocols such as the one we recently reported in [100].

Although the LEMMA tools are dedicated to malware, the developed methods will be general and applicable in other application areas of state machine learning, also within security. For instance, due to close links between active and passive learning algorithms [37], the results of the project - both the general learning techniques and the concrete models that these can produce - can also be used in more active approaches to test security, such as model-based testing, which is a logical next step in the evolution of fuzz testing approaches.

The relevance to cyber security is further exemplified by the facts that several state-of-the-art intrusion and malware detection systems use handmade state machines in order to detect malware behavior [21], and that research in behavioral malware analysis has for this reason recently begun to focus on learning state machine descriptions [42], albeit in an ad-hoc manner. The research performed in LEMMA is therefore very likely to be used in the future for advanced behavioral malware analysis.
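To make the notion of a state machine description of behavior concrete, the following minimal Python sketch builds a prefix tree from a few event sequences, which is a common starting point for state-merging approaches to passive state machine learning, and then checks whether a new sequence follows a path in that tree. The event names and function names are hypothetical and chosen for illustration only. This is not the LEMMA algorithm, which additionally targets timed models with parameters and streaming data.

```python
from typing import Dict, List, Tuple

# A transition table maps (state, event) to the next state. State 0 is the root.
Transitions = Dict[Tuple[int, str], int]


def build_prefix_tree(traces: List[List[str]]) -> Transitions:
    """Build a prefix tree in which every observed prefix gets its own state."""
    transitions: Transitions = {}
    next_state = 1
    for trace in traces:
        state = 0
        for event in trace:
            key = (state, event)
            if key not in transitions:
                # First time this event is seen in this state: add a new state.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
    return transitions


def follows_model(transitions: Transitions, trace: List[str]) -> bool:
    """Return True if the trace can be replayed from the root of the tree."""
    state = 0
    for event in trace:
        key = (state, event)
        if key not in transitions:
            return False
        state = transitions[key]
    return True


# Hypothetical event sequences, e.g. abstracted from network traffic of one host.
observed = [
    ["dns_query", "http_get", "http_post"],
    ["dns_query", "http_get", "idle"],
]
model = build_prefix_tree(observed)

print(follows_model(model, ["dns_query", "http_get"]))
print(follows_model(model, ["http_post", "dns_query"]))
```

A learning algorithm would go on to merge states of such a tree into a smaller machine that generalizes beyond the observed sequences. The sketch stops before that step.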

1997 words

8 Project planning

Phasing

The PhD Student and postdoc researcher will start the project at the same time. The postdoc will focus on Challenge 1 and the PhD Student will focus on Challenge 2. Both will work on the central use cases in the project, as a means to tie these two strands of work together. This also ensures optimal interaction with the industrial partners, as they provide these use cases.

The first method will be developed by a postdoc at RU, where some of the state-of-the-art state machine learning tools have been developed. A PhD candidate will perform research on the second part at Thales research (D-CIS lab), which is a research center with several years of experience in information fusion methods.

The detailed planning is as follows:

First year – first semester:

● Challenge 1 – Select a time-dependent state machine model with parameters that is sensible to security experts, and collect data sets from project partners.

● Challenge 2 – The PhD student will become familiar with the project.

First year – second semester:

● Challenge 1 – Develop theory for learning the chosen model, perform trial runs of existing algorithms on collected data.

● Challenge 2 – Collect domain knowledge through interviews and start construction of the information fusion methodology.

Second year:

● Challenge 1 – Research visit to prof. Gavaldà, finalize theory development, build initial algorithm, test and select preprocessing methods, and apply the algorithm to collected data.

● Challenge 2 – Build the fusion framework for combining knowledge from state machines and domain knowledge.

● Joint – Select data streams to use for learning and methods for filtering them together with project partners.

Third year:

● Challenge 1 – Finalize tool development and apply algorithm to real streaming data from the two use-cases.

● Challenge 2 – Finalize construction of fusion framework and apply to the learned state machines.

● Joint – Research visit to prof. Thuraisingham. Combine state machine learning and information fusion tools, run on real data, and analyze results with project partners.

Fourth year:

● Joint – Finish up.

● Challenge 2 – Write PhD thesis.

Education and training

In LEMMA, two positions will be filled: a postdoc (dr. Sicco Verwer) and a PhD. student.

The training of the PhD student will be provided through supervision by experts in state machine learning (Sicco Verwer and Frits Vaandrager), security (Erik Poll), and information fusion (Patrick de Oude), and through a research visit to a world-leading group on data mining in computer security (prof. Thuraisingham). In addition, the PhD student will receive training from the SIKS research school and visit two summer schools, e.g. Marktoberdorf, FOSAD, NIS (Network and Information Security), or the machine learning summer school. In the first year, some of the courses in the Master programme in Computer Security at the RU (joint with TUE and UT) may be interesting for the PhD student. In addition, the RU is one of the founders of the ENCS (European Network for Cyber Security) in the Hague. ENCS will provide additional opportunities for training in cyber security, especially in the domain of critical infrastructures.

By closely cooperating with security experts from public and private parties, it is expected that both the postdoc and PhD student will receive an excellent hands-on training in computer security. The two research visits to world experts in grammatical inference (Gavaldà) and data mining in security (Thuraisingham) will also further the expertise of the postdoc in these areas.

9a Literature (internal)

1. Fides Aarts, Julien Schmaltz, and Frits Vaandrager. Inference and Abstraction of the Biometric Passport. In Proceedings of the International Symposium on Leveraging Applications of Formal Methods, Verification and Validation 2010, pp. 673-686. (Google Scholar: 20 cites)

2. Sicco Verwer. Efficient Identification of Timed Automata: Theory and practice. PhD thesis, Delft University of Technology, 2010. (Google Scholar: 11 cites)

3. Marijn Heule and Sicco Verwer. Software model synthesis using satisfiability solvers. Empirical Software Engineering, in press, 2012. (winning algorithm Stamina competition)

4. Gregor Pavlin, Patrick de Oude, Marinus Maris, Jan Nunnink, and Thomas Hood. A multi-agent systems approach to distributed Bayesian information fusion. Information Fusion 11, 3, 267-282, 2010. (Google Scholar: 27 cites)

5. Arjan Blom, Gerhard de Koning Gans, Erik Poll, Joeri de Ruiter, and Roel Verdult. Designed to Fail: A USB-Connected Reader for Online Banking. In 17th Nordic Conference on Secure IT Systems (NordSec), LNCS vol. 7616, 1-16, 2012. (Google Scholar: 4 cites, widely reported by Dutch national news)

9b Literature (external)

[1] G. Conti and K. Abdullah. Passive visual fingerprinting of network attack tools. In Proceedings of the 2004 ACM workshop on visualization and data mining for computer security (VizSEC/DMSEC '04). ACM, 45-54, 2004.

[2] M. Bailey, J. Oberheide, J. Andersen, Z.M. Mao, F. Jahanian, and J. Nazario. Automated classification and analysis of internet malware. In Proceedings of the 10th international conference on recent advances in intrusion detection (RAID'07), Springer-Verlag, 178-197, 2007.

[3] A. Sperotto, G. Schaffrath, R. Sadre, C. Morariu, A. Pras, and B. Stiller. An Overview of IP Flow-Based Intrusion Detection. Commun. Surveys Tuts. 12, 3, 343-356, 2010.

[4] C. Kolbitsch, P.M. Comparetti, C. Kruegel, E. Kirda, X. Zhou, and X. Wang. Effective and efficient malware detection at the end host. In Proceedings of the 18th conference on USENIX security symposium (SSYM'09). USENIX Association, 351-366, 2009.

[5] H. Yin, D. Song, M. Egele, C. Kruegel, and E. Kirda. Panorama: capturing system-wide information flow for malware detection and analysis. In Proceedings of the 14th ACM conference on Computer and communications security (CCS '07). ACM, 116-127, 2007.

[6] M. Sharif, V. Yegneswaran, H. Saidi, P. Porras, and W. Lee. Eureka: A Framework for Enabling Static Malware Analysis. In Proceedings of the 13th European Symposium on Research in Computer Security: Computer Security (ESORICS '08), 2008.

[7] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski, R. Kemmerer, C. Kruegel, and G. Vigna. Your botnet is my botnet: analysis of a botnet takeover. In Proceedings of the 16th ACM conference on Computer and communications security (CCS '09). ACM, 635-647, 2009.

[8] D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M.G. Kang, Z. Liang, J. Newsome, P. Poosankam, and P. Saxena. BitBlaze: A New Approach to Computer Security via Binary Analysis. In Proceedings of the 4th International Conference on Information Systems Security (ICISS '08), Springer-Verlag, 1-25, 2008.

[9] C. Eagle. The IDA Pro Book: The Unofficial Guide to the World’s Most Popular Disassembler. No Starch Press Series, 2011.

[10] C. Willems, T. Holz, and F. Freiling. Toward Automated Dynamic Malware Analysis Using CWSandbox. IEEE Security and Privacy 5, 2, 32-39, 2007.

[11] A. Dinaburg, P. Royal, M. Sharif, and W. Lee. Ether: malware analysis via hardware virtualization extensions. In Proceedings of the 15th ACM conference on Computer and communications security (CCS '08). ACM, 51-62, 2008.

[12] X. Chen, J. Andersen, Z.M. Mao, M. Bailey, and J. Nazario. Towards an understanding of anti-virtualization and anti-debugging behavior in modern malware. In IEEE International Conference on Dependable Systems and Networks With FTCS and DCC, DSN 2008. 177-186, 2008.

[13] K. Rieck, T. Holz, C. Willems, P. Dussel, and P. Laskov. Learning and Classification of Malware Behavior. In Proceedings of the 5th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA '08). Springer-Verlag, 108-125, 2008.

[14] M. Egele, T. Scholte, E. Kirda, and C. Kruegel. A survey on automated dynamic malware-analysis techniques and tools. ACM Computing Surveys 44, 2, Article 6, 42 pages, 2012.

[15] G. Jacob, H. Debar, E. Filiol: Behavioral detection of malware: from a survey towards an established taxonomy. Journal in Computer Virology 4(3): 251-266, 2008.

[16] M. Christodorescu, S. Jha, and C. Kruegel. Mining specifications of malicious behavior. In Proceedings of the 1st India software engineering conference (ISEC '08). ACM, 5-14, 2008.

[17] K. Rieck, P. Trinius, C. Willems, T. Holz: Automatic analysis of malware behavior using machine learning. Journal of Computer Security 19(4): 639-668, 2011.

[18] U. Bayer, P.M. Comparetti, C. Hlauschek, C. Krügel, E. Kirda: Scalable, Behavior-Based Malware Clustering. In NDSS 2009.

[19] D. Lee and M. Yannakakis. Principles and methods of testing finite state machines - a survey, in Proceedings of the IEEE 84(8):1090–1123, 1996.

[20] J.E. Cook and A.L. Wolf. Discovering models of software processes from event-based data. ACM Transactions on Software Engineering Methodology, 7:215-249, July 1998.

[21] G. Jacob, E. Filiol and H. Debar. Malware as interaction machines: a new framework for behavior modelling. Journal in Computer Virology, Volume 4, Number 3 (2008), 235-250.

[22] E.M. Clarke. Model Checking. In Proceedings of the Conference on Foundations of Software Technology and Theoretical Computer Science, pp. 54-56, 1997.

[23] A. Bertolino, P. Inverardi, P. Pelliccione, and M. Tivoli. Automatic synthesis of behavior protocols for composable web-services. In Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering (ESEC/FSE '09). ACM, 141-150, 2009.

[24] G. Ammons, R. Bodik, and J.R. Larus. Mining specifications. In Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages (POPL '02), 2002.

[25] W. Cui, J. Kannan, and H.J. Wang. Discoverer: automatic protocol reverse engineering from network traces. In Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium (SS'07). USENIX Association, Article 14 , 14 pages, 2007.

[26] F. Aarts, J. Schmaltz, and F.W. Vaandrager. Inference and Abstraction of the Biometric Passport. In Proceedings of the International Symposium On Leveraging Applications of Formal Methods, Verification and Validation, pp. 673-686, 2010.

[27] N. Walkinshaw, K. Bogdanov, M. Holcombe, and S. Salahuddin. Reverse Engineering State Machines by Interactive Grammar Inference. In Proceedings of the 14th Working Conference on Reverse Engineering (WCRE '07). IEEE Computer Society, 209-218, 2007.

[28] Wil M. P. van der Aalst. 2011. Process Mining: Discovery, Conformance and Enhancement of Business Processes (1st ed.). Springer.

[29] W.M.P. Van der Aalst. The application of Petri nets to workflow management. Journal of Circuits, Systems and Computers 1998 08:01, 21-66.

[30] J. Esparza, M. Leucker, and M. Schlund. Learning workflow petri nets. In Proceedings of the 31st international conference on Applications and Theory of Petri Nets (PETRI NETS'10), Springer, 206-225.

[31] C. de la Higuera. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, 2010.

[32] T.A. Sudkamp. Languages and Machines: an introduction to the theory of computer science. Addison-Wesley, third edition, 2006.

[33] N. Walkinshaw, B. Lambeau, C. Damas, K. Bogdanov and P. Dupont. STAMINA: a competition to encourage the development and assessment of software model inference techniques. Empirical Software Engineering 2012 (in press).

[34] M.J.H. Heule and S. Verwer. Software Model Synthesis using Satisfiability Solvers. Empirical Software Engineering 2012 (in press, competition results are at: http://stamina.chefbe.net/).

[35] The ITALIA project, http://www.italia.cs.ru.nl/.

[36] W. Smeenk. Applying Automata Learning to Complex Industrial Software. Master's Thesis, Radboud University Nijmegen, September 2012.

[37] S.A. Goldman and H.D. Mathias. 1993. Teaching a smart learner. In Proceedings of the sixth annual conference on Computational learning theory (COLT '93), Lenny Pitt (Ed.). ACM, New York, NY, USA, 67-76.

[38] C.Y. Cho, D. Babic, E.C.R. Shin, and D. Song. Inference and analysis of formal models of botnet command and control protocols. In Proceedings of the ACM Conference on Computer and Communications Security, pp. 426-439, 2010.

[39] P.M. Comparetti, G. Wondracek, C. Krügel, and E. Kirda. Prospex: Protocol Specification Extraction. In Proceedings of the IEEE Symposium on Security and Privacy, pp. 110-125, 2009.

[40] Y. Wang, Z. Zhang, D. Yao, B. Qu, and L. Guo. Inferring Protocol State Machine from Network Traces: A Probabilistic Approach. In Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 1-18, 2011.

[41] C. Leita, K. Mermoud, M. Dacier. ScriptGen: an automated script generation tool for honeyd. In Proceedings of the Computer Security Applications Conference 2005, pp. 203-214.

[42] T. Krueger, H. Gascon, N. Krämer and K. Rieck. Learning Stateful Models for Network Honeypots. In 5th ACM Workshop on Artificial Intelligence and Security (AISEC), 2012.

[43] J. Caballero, P. Poosankam, C. Kreibich, and D.X. Song. Dispatcher: enabling active botnet infiltration using automatic protocol reverse-engineering. In Proceedings of the ACM Conference on Computer and Communications Security, pp. 621-634, 2009.

[44] Nationaal trendrapport cybercrime en digitale veiligheid 2010. GOVCERT.NL www.govcert.nl/trends.

[45] Z. Wang, X. Jiang, W. Cui, X. Wang, and M. Grace. ReFormat: Automatic Reverse Engineering of Encrypted Messages. In Proceedings of the European Symposium on Research in Computer Security, pp. 200-215, 2009.

[46] C.V. Wright, F. Monrose, and G.M. Masson. On Inferring Application Protocol Behaviors in Encrypted Network Traffic. Journal of Machine Learning Research 7:2745-2769, 2006.

[47] X. Yin, W. Yurcik, M. Treaster, Y. Li, and K. Lakkaraju. 2004. VisFlowConnect: netflow visualizations of link relationships for security situational awareness. In Proceedings of the 2004 ACM workshop on Visualization and data mining for computer security (VizSEC/DMSEC '04). ACM, New York, NY, USA, 26-34.

[48] D. L. Hall, J. Llinas. An introduction to multisensor data fusion. Proceedings of the IEEE, Vol. 85, No. 1. (06 Jan 1997), pp. 6-23

[49] I. Corona, G. Giacinto, C. Mazzariello, F. Roli, C. Sansone, Information fusion for computer security: State of the art and open issues, Information Fusion, Volume 10, Issue 4, October 2009, Pages 274-284.

[50] T. Bass. 2000. Intrusion detection systems and multisensor data fusion. Commun. ACM 43, 4 (April 2000), 99-105.

[51] S.J. Yang, A. Stotz, J. Holsopple, M. Sudit, M. Kuhl, High level information fusion for tracking and projection of multistage cyber attacks, Information Fusion, Volume 10, Issue 1, January 2009, Pages 107-121.

[52] N. Ye, J. Giordano, J. Feldman, Q. Zhong. Information fusion techniques for network intrusion detection. In IEEE Information Technology Conference, pp. 117-120, 1-3 Sep 1998.

[53] P. Ning, Y. Cui, D. Reeves. Constructing Attack Scenarios through Correlation of Intrusion Alerts. North Carolina State University at Raleigh, Raleigh, NC, 2002.

[54] C. Feng, J. Peng, H. Qiao, J.W. Rozenblit. Alert Fusion for a Computer Host Based Intrusion Detection System. In 14th Annual IEEE International Conference and Workshops on the Engineering of Computer-Based Systems (ECBS '07), pp. 433-440, 26-29 March 2007.

[55] G. Shafer. A Mathematical Theory of Evidence. 1976

[56] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988

[57] FP7 DIADEM project, http://www.ist-diadem.eu/

[58] P. de Oude and G. Pavlin. 2009. Efficient Distributed Bayesian Reasoning via Targeted Instantiation of Variables. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 02 (WI-IAT '09), Vol. 2. IEEE Computer Society, Washington, DC, USA, 323-330.

[59] P. de Oude, and G. Pavlin. An Information Theoretic Approach to Verification of Modular Bayesian Fusion Systems. In The 11th International Conference on Information Fusion, Cologne, Germany, 2008.

[60] G. Pavlin, F.C.A. Groen, P. de Oude, M. Kamermans. A Distributed Approach to Gas Detection and Source Localization Using Heterogeneous Information. Interactive Collaborative Information Systems 2010: 453-474

[61] G. Pavlin, P. de Oude, M. Kamermans, F. Groen. Dynamic process integration framework: A novel approach to efficient implementation of robust distributed information fusion systems. In Proceedings of the 14th International Conference on Information Fusion (FUSION), pp. 1-8, 5-8 July 2011.

[62] G. Pavlin, P. De Oude, and F. Mignet. Gas detection and source localization: A Bayesian approach, in: 14th International Conference on Information Fusion, Chicago, 2011.

[63] G. Pavlin, P. de Oude, M. Maris, J. Nunnink, and T. Hood. 2010. A multi-agent systems approach to distributed bayesian information fusion. Inf. Fusion 11, 3 (July 2010), 267-282.

[64] U. Bayer, I. Habibi, D. Balzarotti, E. Kirda, and C. Kruegel. A view on current malware behaviors. In Proceedings of the USENIX conference on Large-scale exploits and emergent threats: botnets, spyware, worms, and more, 2009, pp. 1-8.

[65] M. Jaber, R.G. Cascella, C. Barakat. Can We Trust the Inter-Packet Time for Traffic Classification? In 2011 IEEE International Conference on Communications (ICC), pp. 1-5, 5-9 June 2011.

[66] W. Strayer, D. Lapsley, R. Walsh, and C. Livadas. Botnet Detection Based on Network Behavior. Advances in Information Security, 2008, Volume 36, 1-24.

[67] F. Giroire, J. Chandrashekar, N. Taft, E. Schooler, and D. Papagiannaki. 2009. Exploiting Temporal Persistence to Detect Covert Botnet Channels. In Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection (RAID '09), Engin Kirda, Somesh Jha, and Davide Balzarotti (Eds.). Springer-Verlag, Berlin, Heidelberg, 326-345.

[68] C. Aggarwal. Data streams: Models and Algorithms. In Series: Advances in Database Systems, Springer, 2006.

[69] G. Gu, R. Perdisci, J. Zhang, and W. Lee. 2008. BotMiner: clustering analysis of network traffic for protocol- and structure-independent botnet detection. In Proceedings of the 17th conference on Security symposium (SS'08). USENIX Association, Berkeley, CA, USA, 139-154.

[70] D. Balzarotti, M. Cova, C. Karlberger, E. Kirda, Christopher Kruegel, Giovanni Vigna: Efficient Detection of Split Personalities in Malware. NDSS 2010

[71] D. Brumley, C. Hartwig, Z. Liang, J. Newsome, D.X. Song, H. Yin: Automatically Identifying Trigger-based Behavior in Malware. Botnet Detection 2008: 65-88

[72] S. Verwer. Efficient Identification of Timed Automata: Theory and Practice. PhD thesis, Delft University of Technology, 2010.

[73] S. Verwer, M. de Weerdt, and C. Witteveen. Efficiently identifying deterministic real-time automata from labeled data. Machine Learning 86(3): 295-333 (2012), Springer.

[74] S. Verwer, M. de Weerdt, C. Witteveen: The efficiency of identifying timed automata and the power of clocks. Inf. Comput. 209(3): 606-625 (2011)

[75] S. Jain, D.N. Osherson, J.S. Royer, A. Sharma. Systems That Learn, 2nd Edition: An Introduction to Learning Theory. MIT Press, 1999.

[76] M.J. Kearns, U.V. Vazirani. An introduction to computational learning theory. MIT Press, 1994.

[77] B. Thuraisingham, L. Khan, M.M. Masud, and K.W. Hamlen. 2008. Data Mining for Security Applications. In Proceedings of the 2008 IEEE/IFIP International Conference on Embedded and Ubiquitous Computing - Volume 02 (EUC '08), Vol. 2. IEEE Computer Society, Washington, DC, USA, 585-589.

[78] B. Balle, J. Castro and R. Gavaldà. Bootstrapping and Learning PDFA in Data Streams. JMLR Workshop and Conference Proceedings Volume 21: ICGI 2012, 21:34-48.

[79] J. Schmidt and S. Kramer. Online Induction of Probabilistic Real Time Automata. In Proceedings of the 2012 IEEE International Conference on Data Mining (ICDM 2012).

[80] J. Cheng, Y. Ke, and W. Ng. 2008. A survey on algorithms for mining frequent itemsets over data streams. Knowl. Inf. Syst. 16, 1 (July 2008), 1-27.

[81] M. Masud, J. Gao, L. Khan, J. Han, and B.M. Thuraisingham. 2011. Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints. IEEE Trans. on Knowl. and Data Eng. 23, 6 (June 2011), 859-874.

[82] C.V. Zhou, C. Leckie, and S. Karunasekera. A survey of coordinated attacks and collaborative intrusion detection. Computers and Security, 29 (1), pp. 124-140, 2010.

[83] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006

[84] R.C. Carrasco. Accurate Computation of the Relative Entropy Between Stochastic Regular Grammars. Theoretical Informatics and Applications, 1997.

[85] M. Mohri. Edit-distance of Weighted Automata: General Definitions and Algorithms. International Journal of Foundations of Computer Science, 14, 957, 2003.

[86] Y. Sakakibara. Grammatical Inference in Bioinformatics. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(7):1051-1062, 2005.

[87] K.S. Fu, Syntactic pattern recognition and applications, Prentice Hall, 1982.

[88] T.Y. Young. Handbook of Pattern Recognition and Image Processing (Vol. 2): Computer Vision. Academic Press, 1994.

[89] C.R. Shalizi and J.P. Crutchfield. Computational Mechanics: Pattern and Prediction, Structure and Simplicity, Journal of Statistical Physics 104(3-4):817-879, 2001.

[90] R.L. Rivest and R.E. Schapire. Inference of finite automata using homing sequences, Information and Computation 103:299–347, 1993.

[91] J. Borges and M. Levene. Data mining of user navigation patterns. In Proceedings of the International Workshop on Web Usage Mining and User Profiling 2000, pp. 92–111.

[92] D. Carmel and S. Markovitch. Opponent Modeling in Multi-Agent Systems. In Proceedings of the Workshop on Adaption and Learning in Multi-Agent Systems 1995, pp. 40-52.

[93] S. Venkataraman, J. Caballero, P. Poosankam, M.G. Kang, D.X. Song. Fig: Automatic Fingerprint Generation. In Proceedings of the Annual Network and Distributed System Security Conference, 2007.

[94] R. Sekar, A.K. Gupta, J. Frullo, T. Shanbhag, A. Tiwari, H. Yang, and S. Zhou. Specification-based anomaly detection: a new approach for detecting network intrusions. In Proceedings of the ACM Conference on Computer and Communications Security, pp. 265-274, 2002.

[95] Y. Hsu, G. Shu, and D. Lee. A model-based approach to security flaw detection of network protocol implementations. In IEEE International Conference on Network Protocols (ICNP 2008), pages 114–123.

[96] C. Kruegel, F. Valeur, G. Vigna, and R. Kemmerer. 2002. Stateful Intrusion Detection for High-Speed Networks. In Proceedings of the 2002 IEEE Symposium on Security and Privacy (SP '02). IEEE Computer Society, Washington, DC, USA, 285.

[97] S. Verwer, R. Eyraud, and C. de la Higuera. Results of the PAutomaC Probabilistic Automaton Learning Competition. JMLR Workshop and Conference Proceedings Volume 21: ICGI 2012, 21:243-248, 2012.

[98] J. Berendsen, B. Gebremichael, F.W. Vaandrager, and M. Zhang. 2011. Formal specification and analysis of zeroconf using Uppaal. ACM Trans. Embed. Comput. Syst. 10, 3, Article 34 (May 2011), 32 pages.

[99] F. Aarts, J. de Ruiter, and E. Poll. Formal models of bank cards for free. Manuscript, 2012.

[100] A. Blom, G. de Koning Gans, E. Poll, J. de Ruiter, and R. Verdult. Designed to Fail: A USB-Connected Reader for Online Banking. In 17th Nordic Conference on Secure IT Systems (NordSec), LNCS vol. 7616, 1-16, 2012.

10. Requested Budget

The total project budget is: 526236 euro

10a Requested personnel budget

A PhD student and a postdoc will be funded by LEMMA. The PhD student position requires 198115 euro. The three-year postdoc position requires 201121 euro.

In total: 399236 euro personnel budget

10b Requested additional budget

The project will incur additional travel costs for travelling between Nijmegen and Delft (Thales)/Den Haag (WODC, NCSC). Both the PhD student and the postdoc will incur these costs, which are estimated at 1275 euro per year, making 10200 euro in total.

Additional travel costs for research visits abroad are:

8000 euro for a visit to prof. Thuraisingham in Texas, USA, by both PhD student and postdoc

3000 euro for a visit to prof. Gavaldà in Barcelona, Spain, by the postdoc

4000 euro for visits to the summer schools by the PhD student

5000 euro budget for material to setup network monitors for the first case study

In total: 30200 euro

10c Co-funding of the consortium partners
