
PROTEUS
Scalable online machine learning for predictive analytics and real-time interactive visualization
687691

D5.1 Visualization Requirements for Massive Online Machine Learning Strategies

Lead Author: Abdelhamid Bouchachia
With contributions from: Rubén Casado
Reviewer: Marco Laucelli

Deliverable nature: Report (R)
Dissemination level (Confidentiality): Public (PU)
Contractual delivery date: M06 (05/2016)
Actual delivery date: M06 (05/2016)
Version: 4.0
Total number of pages: 35
Keywords: Visualisation, interactivity, online machine learning, requirements, architecture


Abstract

The present report discusses the visualisation of big data, which is characterised by volume, velocity and variety. Despite its importance, visual analytics is still in its infancy, especially for big data, and is scattered across various disciplines. Visualisation and visual analytics are highly relevant for both experts (data scientists) and end-users who need to understand the behaviour of machine learning algorithms and how the latter make decisions. Specifically, visualising the details, tuning the parameters, changing the training data, and tracking back the results (cause and effect) are aspects that appeal to every user of data. This is particularly desirable for achieving transparency and thus white-box machine learning.

In this report, we reflect on these requirements by discussing the value of visualisation and the existing visualisation techniques and tools described in the literature, related to selected computational models used for online and stream-based machine learning. For each model, we put forward the minimum visualisation requirements, from both the expert and the user perspectives, that such a model should be equipped with. These requirements mostly reflect the volume and velocity aspects that characterise big data. Towards the end of the report, a detailed description of the visualisation system that PROTEUS will develop is presented. The final aim of PROTEUS is to obtain a system that is interactive and fulfils the requirements associated with the various scalable online machine learning algorithms to be developed.


Executive summary

Visualisation is relevant to many broad areas of science, such as imaging, graphics, and man-machine interfaces for various scientific disciplines including machine learning and data mining. It is about representing information in an interpretable graphical form, taking into account the available plotting space. There are basic graphical elements that each representation uses, such as points, lines, shapes, images, text and areas, and there are attributes associated with these elements, such as colour, intensity, size, position, shape and motion [1]. The importance of visualisation stems from its relevance but also from its versatility in manipulating various types of information, such as raw and processed data, processes, relationships, concepts, bodies, etc. Interestingly, visualisation can be animated and interactive, allowing the user to control and manipulate the representations (including the elements and their attributes), often following the mantra "overview first, zoom and filter, then details-on-demand"1. This requires interaction, a basic modality in any analysis or illustration, which lets the user manipulate the representations at will.

As specified in [2], among the top 10 challenges in extreme-scale visual analytics, large data visualisation, scalable algorithms and data movement are very prominent. From a machine learning perspective, visualisation allows both the expert (data scientist) and the end-user to understand the data, the algorithm (model) being used and its individual components, its evolution, and its output upon presentation of input, thus allowing better insight into the decision-making process. The overall aim of visualisation combined with interaction is knowledge discovery, process explanation, and decision making [3]. These should allow for transparency and interpretability, which have been regarded as important research topics in statistics, machine learning and artificial intelligence for many years. In a nutshell, for the data scientist, the key questions that can be addressed through visualisation mostly relate to the verification and validation of theoretical models (often based on statistical assumptions, e.g., Gaussianity) on real-world data, to how the model gets updated upon seeing new data, and to why and how the results are generated by the model.

While building transparent machine learning tools that address these key questions is highly relevant for static data of moderate size, it is even more vital for big data because of its volume, velocity and variety. The dynamic and streaming aspect of big data is perhaps the one that requires most attention compared to existing techniques, since both the expert and the user need to understand the behaviour in (pseudo) real time as data streams flow into the model.

In this report, we reflect on these requirements by discussing the value of interactive visualisation and the existing visualisation techniques and tools described in the literature, related to selected computational models used for online and stream-based machine learning. For each model, we put forward the minimum visualisation requirements, from both the expert and the user perspectives, that such a model should be equipped with. The focus is placed on online analytics techniques such as sketches, clustering, classification and regression.

These requirements mostly reflect the volume and velocity aspects that characterise big data. Towards the end of the report, a detailed description of the visualisation system that PROTEUS will develop is presented. The final aim of PROTEUS is to obtain a system that is interactive and fulfils the requirements associated with the various scalable online machine learning algorithms to be developed.

1 https://www.recordedfuture.com/information-seeking-mantra/


Document information

IST Project Number: 687691    Acronym: PROTEUS
Full Title: Scalable online machine learning for predictive analytics and real-time interactive visualization
Project URL: http://www.proteus-bigdata.com/
EU Project Officer: Martina EYDNER

Deliverable: Number D5.1    Title: Visualization requirements for massive online machine learning strategies
Work Package: Number WP5    Title: Real-time interactive visualization
Date of Delivery: Contractual M06    Actual M01
Status: Version 4.0 (final)
Nature: Report
Dissemination level: Public

Responsible Author: Abdelhamid Bouchachia    E-mail: [email protected]
Partner: BU    Phone: 01202962401

Abstract (for dissemination)

The present report discusses the visualisation of big data, which is characterised by volume, velocity and variety. Despite its importance, visual analytics is still in its infancy, especially for big data, and is scattered across various disciplines. Visualisation and visual analytics are highly relevant for both experts (data scientists) and end-users who need to understand the behaviour of machine learning algorithms and how the latter make decisions. Specifically, visualising the details, tuning the parameters, changing the training data, and tracking back the results (cause and effect) are aspects that appeal to every user of data. This is particularly desirable for achieving transparency and thus white-box machine learning.

In this report, we reflect on these requirements by discussing the value of visualisation and the existing visualisation techniques and tools described in the literature, related to selected computational models used for online and stream-based machine learning. For each model, we put forward the minimum visualisation requirements, from both the expert and the user perspectives, that such a model should be equipped with. These requirements mostly reflect the volume and velocity aspects that characterise big data. Towards the end of the report, a detailed description of the visualisation system that PROTEUS will develop is presented. The final aim of PROTEUS is to obtain a system that is interactive and fulfils the requirements associated with the various scalable online machine learning algorithms to be developed.

Keywords Visualisation, interactivity, online machine learning, requirements, architecture

Version Log

Issue Date Rev. No. Author Change

01/04/2016 V1.0 Abdelhamid Bouchachia Table of contents

12/04/2016 V1.5 Abdelhamid Bouchachia Visualisation of classifiers

19/04/2016 V2 Abdelhamid Bouchachia Visualisation of clustering and regression

22/04/2016 V2.5 Rubén Casado Architectural requirements


25/04/2016 V3 Abdelhamid Bouchachia Visualisation of sketches

27/04/2016 V3.5 Rubén Casado Challenges of big data visualisation

17/05/2016 V4 Abdelhamid Bouchachia Introduction, conclusion and front matter


Table of contents

Executive summary ... 3
Document information ... 4
Table of contents ... 6
List of figures ... 7
1 Introduction ... 8
2 Challenges of visual analytics for data streams ... 10
2.1 Data presentation ... 10
2.2 Data analysis ... 11
2.3 Perceptual scalability ... 11
3 Visualisation requirements for online machine learning ... 12
3.1 Sketches ... 12
3.1.1 Moments ... 12
3.1.2 Sampling ... 12
3.1.3 Change detection ... 12
3.1.4 Feature selection ... 13
3.2 Online clustering ... 14
3.2.1 Partitional clustering ... 15
3.2.2 Hierarchical clustering ... 16
3.3 Online classification ... 16
3.3.1 Classification trees ... 17
3.3.2 Neural networks ... 17
3.3.3 Probabilistic classifiers ... 21
3.3.4 Ensemble learning ... 22
3.4 Online regression ... 24
4 Architectural requirements for scalable visual analytics ... 25
4.1 Data collector ... 25
4.2 Incremental analytics engine ... 26
4.3 Visualization layer ... 28
4.3.1 Websocket connector ... 28
4.3.2 Graph library ... 28
5 Conclusions ... 29
References ... 30


List of figures

Figure 1: Visualisation of streams as streamgraph and horizon lines [4] ... 8
Figure 2: INFUSE visualisation system ... 13
Figure 3: Design patterns for feature analysis ... 14
Figure 4: Examples of output by the different clustering algorithms ... 15
Figure 5: Visualisation of SOM using U-matrix [51] ... 15
Figure 6: Visualisation of mixture-based models ... 16
Figure 7: Hierarchical structures ... 17
Figure 8: Visualisation of trees [64] ... 18
Figure 9: A two-layer NN with its corresponding diagram [66] ... 18
Figure 10: Bond diagram for a network consisting of 6 input units, 2 hidden units and one output unit ... 19
Figure 11: Hyperplane diagram for the hidden nodes of the network shown in Figure 9 [66] ... 19
Figure 12: Response-function plots for the network shown in Figure 9 [66]. Leftmost and middle plots represent the hidden units, while the rightmost plot represents the output unit ... 20
Figure 13: Visualisation of convolution networks ... 20
Figure 14: Visualisation of class probabilities and decision boundary ... 21
Figure 15: EnsembleMatrix visualisation ... 23
Figure 16: Details of EnsembleMatrix ... 23
Figure 17: Output of visreg for a non-linear regression model ... 24
Figure 18: PROTEUS's Architecture ... 25
Figure 19: Data collector ... 25
Figure 20: Traditional AVG computation ... 26
Figure 21: Concept of incremental AVG computation ... 26
Figure 22: Flink generic workflow for incremental operations ... 27
Figure 23: Implementation of the apply method for IncrementalAverageOperation ... 27
Figure 24: Real-time and incremental communication ... 27


1 Introduction

Visualisation is central to many areas and is usually used to illustrate the architecture of the system under observation, its evolving behaviour and the final outcome of the processing. Applying visualisation techniques to machine learning can be extremely useful not only for the expert (the developer or, more generally, the data scientist), but also for the users. For the former, it is worth checking how the machine learning model performs; for the latter, we seek better insight into the decision-making process and interaction with the model to change the parameters, to closely examine a particular aspect of the algorithm, or to understand the results.

Advanced interactive visualisation of data (termed visual analytics) has become one of the trends in machine learning. The underlying observation is that machine learning (and data analytics) tools are seen by non-experts as black-box tools: they offer little understanding of their internals, little interaction, and mostly no explanation module or facility. The transparency and interpretability of the behaviour and the (intermediate) results have also been an important research topic for statistics, machine learning and artificial intelligence researchers. Key questions that can be addressed are mostly related to the verification and validation of theoretical models (often based on statistical assumptions, e.g., Gaussianity) on real-world data, how the model gets updated as new data arrive, and why and how the results are generated by the model.

These questions are mainly embedded in the type of data analysis that is used. Confirmatory analysis is about formulating assumptions about the data, developing models and establishing whether those assumptions are true or false; it is clearly most relevant to the developer of the model rather than to the user. Exploratory analysis, on the other hand, is about investigating and discovering, through the model and the data, and it is relevant to both the expert and the user. In this latter case, interaction is a key element. Thus, interactive machine learning should facilitate the exploration of the data as well as of the machine learning models.

While building transparent machine learning tools that address these key questions is highly relevant for static data of moderate size, it is more vital for big data because of its volume, velocity and variety. The dynamic and streaming aspect of big data is perhaps the one that requires most attention, since both the expert and the user need to understand the behaviour in (pseudo) real time as data streams flow into the model.

The visualisation techniques used for static data, such as response-function plots, scatterplots, parallel coordinates, heatmaps, parallel sets, and linear and non-linear projections, as well as those used for data streams, such as time-series graphs, temporal mosaics, streamgraphs (see Figure 1), line charts and horizon charts, are standard techniques and can be used for evolving data as well. However, they must be adapted to the dynamic nature of data streams and to the interactivity required by today's big data applications. In fact, the visualisation of data streams is strongly related to their temporal nature, but in many cases also to other important aspects such as data source, space, relevance, etc. What big data requires are rich and dynamic user interfaces, adapted for interacting with complex and possibly linked data, that let the user derive analytic insights by visualising the data and the models developed.

Figure 1: Visualisation of streams: (a) streamgraph; (b) multiple streams using horizon lines [4]


To reflect on current practices in visual analytics and so inspire PROTEUS, the rest of the document highlights some representative visualisation studies. We focus mostly on visualisation techniques used for online data analytics and consider in particular the following analytics techniques: sketches, clustering, classification and regression. We then specify the visualisation requirements that PROTEUS will fulfil for each of those techniques to meet the challenges of massive and/or streaming data, in terms of presenting information as well as exploring the data and the machine learning models. The document also highlights the architectural design of the visualisation tool that will be developed within PROTEUS.


2 Challenges of visual analytics for data streams

Advanced visualisation of data analytics in real time, together with user experience and usability, is still an open issue in the context of Big Data. How does Big Data change the nature of visual interaction? The interactivity requirement creates special challenges when it comes to Big Data. Interaction is a necessary condition for data analysis tasks, especially when using exploratory visual tools. However, most state-of-the-art tools or techniques do not properly accommodate Big Data.

Specifically, a key challenge of visual analytics is to meet the requirements of Big Data by supporting real-time interaction while addressing volume, velocity and variety. Despite emerging advances towards low latency for ad-hoc queries, it is still necessary to rethink efficient software architecture styles to enable real-time interaction. On the other hand, the visualisation of data streams is strongly related to their temporal context. Although the data generated and delivered in streams has a strong temporal component, in many cases the temporal component is not the only thing analysts are interested in. There are other data dimensions (e.g., source, space, relevance) that are equally important, and time might be just an additional aspect that they care about. Finally, the use of visualisation paradigms dedicated to machine learning and data analytics methods would help inspect the data as well as explain the behaviour of the algorithms.

2.1 Data presentation

The main objective of data visualisation [5][6] is to represent knowledge more intuitively and effectively by using different graphs. To convey easily the knowledge hidden in complex and large-scale data sets, both aesthetic form and functionality are necessary. Information that has been abstracted into some schematic form, including attributes or variables for the units of information, is also valuable for data analysis; such a representation is much more intuitive [5] than sophisticated approaches. For Big Data applications, it is particularly difficult to conduct data visualisation because of the large size and high dimensionality of Big Data. Moreover, current Big Data visualisation tools mostly perform poorly in terms of functionality, scalability and response time. It is necessary to rethink the way we visualise Big Data rather than relying on the approaches adopted before. For example, the history mechanisms for information visualisation [7] are also data-intensive and need more efficient approaches. Uncertainty poses a great challenge to effective uncertainty-aware visualisation and can arise at any stage of a visual analytics process [8]. New frameworks for modelling uncertainty and characterising the evolution of uncertainty information throughout analytical processes are highly necessary [9].

High-volume datasets are ubiquitous in many domains, such as finance, discrete manufacturing, or sports

analytics [10]. It is not uncommon that millions of readings from high-frequency sensors are subsequently

stored in relational database management systems (RDBMS), to be later accessed using visual data analysis

tools. Modern data analysis tools must support a fluent and flexible use of visualizations [11] and still be able

to squeeze a billion records into a million pixels [12]. In this regard, one open issue for the scientific

community is the development of compact data structures that support algorithms for rapid data filtering,

aggregation, and display rendering. Unfortunately, these issues are yet unsolved for existing RDBMS-based

visual data analysis tools, such as Tableau Desktop [13], SAP Lumira [14], QlikView [15], Tibco Spotfire

[16] or Datawatch Desktop [18].

While they provide flexible and direct access to relational data sources, they do not support automatic, visualisation-related data filtering or aggregation and are not able to quickly and easily visualise high-volume historical data of one million records or more. For example, they redundantly store copies of the raw data as tool-internal objects, requiring significant amounts of system memory per record. This causes long waiting times for users, leaving them with unresponsive tools or even impairing the user's operating system in case the system memory is exhausted. Apart from commercial solutions, a number of open-source visual toolkits exist, each covering a specific set of functionalities for visualisation, analysis and interaction, for example the InfoVis Toolkit [19], Prefuse [20], Improvise [21] and D3 [22]. Using existing toolkits for the required functionality instead of implementing it from scratch is much more efficient when developing new visual analytics solutions, although the level of maintenance, development and user community support of open-source toolkits can vary drastically.


2.2 Data analysis

A common gap for both commercial and open-source solutions is that all existing tools focus on batch data (data-at-rest), not on data streams (data-in-motion). There are some domain-specific tools that address this gap. ELVIS [23] is a highly interactive system to analyse system log data, but it cannot be applied to real-time streams. SnortView [24] focuses on the specific analysis of intrusion detection alerts. The focus of Event Visualizer [25] is to provide real-time visualisations for event data streams for real-time monitoring, with the possibility to smoothly switch to exploration mode. In contrast to this event-based approach, the authors in [26] propose another real-time system to enhance situational awareness using the analysis of network traffic, based on LiveRAC [27]. The analysed and aggregated time series are displayed in a zoomable tabular interface that provides the analyst with an interactive exploration interface for time-series data. Another tool that focuses on monitoring time-series data is VizTree [28], which provides visual real-time anomaly detection for time series. Its general approach is to transform the time-series data into a representation of symbols.

2.3 Perceptual scalability

From the perception point of view, we can identify two main issues [1]:

Human perception: Human eyes have difficulty extracting meaningful information when the data becomes extremely large. Few existing visualisation systems are designed to scale well while presenting meaningful, high-quality information for human perception.

Limited screen: Data is simply becoming larger and larger, and visualisation becomes challenging when too many data items or features must be displayed on a limited screen, especially for a dataset with a billion entries. When too much data is presented on a limited screen, the resulting visualisation is too dense to be useful to the users. The limitation of screen resolution forces us to explore novel ways to display and visualise information using various abstraction techniques.


3 Visualisation requirements for online machine learning

In the following we present the main online machine learning techniques that will be investigated in PROTEUS. We show some of the existing visualisation aspects of these techniques for the sake of illustration, and then highlight the main visualisation requirements that an online machine learning library should consider.

3.1 Sketches

A sketch is a compact representation of data, usually designed to cope efficiently with high-speed data streams. It can be used for various purposes by capturing key statistics about the stream. The best-known sketches include the count-min sketch, lossy counting and Bloom filters.
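To make the idea concrete, the following is a minimal Python sketch of a count-min structure (the class name and the width/depth parameters are illustrative choices, not taken from the PROTEUS codebase): it keeps a small matrix of counters and answers approximate frequency queries over a stream.

```python
import random

class CountMinSketch:
    """Approximate frequency counts over a stream using a width x depth counter matrix."""

    def __init__(self, width=1000, depth=5, seed=42):
        self.width = width
        self.depth = depth
        rng = random.Random(seed)
        # One random hash seed per row; hashing uses Python's built-in hash().
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]
        self.counts = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row, seed in enumerate(self.seeds):
            yield row, hash((seed, item)) % self.width

    def update(self, item, count=1):
        # Increment one counter per row for the item.
        for row, col in self._buckets(item):
            self.counts[row][col] += count

    def estimate(self, item):
        # The minimum over rows limits the error introduced by collisions.
        return min(self.counts[row][col] for row, col in self._buckets(item))

# Usage: feed a stream of items, then query approximate frequencies.
cms = CountMinSketch()
for token in ["a", "b", "a", "c", "a", "b"]:
    cms.update(token)
print(cms.estimate("a"))  # close to 3 (never an underestimate)
```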

3.1.1 Moments

As typical sketches, moments are usually used to summarise particular statistical characteristics of the data stream and to capture trends, anomalies, etc. Such sketches can relate either to the frequency of the stream items or to the items themselves. They are quite useful in many monitoring applications such as network traffic monitoring, network topology monitoring, sensor networks, financial market monitoring, and web-log monitoring [29]. Since moments are numbers, their visualisation has not been an issue in the relevant literature. It is, however, interesting to visualise such quantities as a stream to highlight their evolution over time. Illustrating different moments on the same screen can be extremely handy for human decision makers.
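As an illustration of how such quantities can be maintained and streamed to a visual front end, below is a small Python sketch (the class name is illustrative) that updates the first two moments incrementally with Welford's algorithm, so that mean and variance can be plotted over time without storing the stream.

```python
class RunningMoments:
    """Incrementally maintained count, mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the observations seen so far.
        return self.m2 / self.n if self.n > 1 else 0.0

# Emit one (mean, variance) point per observation; a dashboard could plot this series live.
moments = RunningMoments()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    moments.update(x)
    print(moments.n, round(moments.mean, 3), round(moments.variance, 3))
```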

3.1.2 Sampling

In contrast to sketches, which are computed from the whole data stream, sampling is another way of summarising a data stream: it computes a representative set of stream elements so that efficient processing becomes possible. In general, the sample is continuously maintained in order to accommodate anytime approximate query answering, selectivity estimation, query planning, or any other mining task. The sample fits in RAM, hence various standard offline algorithms can be applied. The challenge is to develop sampling techniques that provide an unbiased estimate of the underlying stream with provable error guarantees.

In one-pass stream sampling, the main challenge is to ensure that a sample is drawn uniformly across the union of the data while minimising the communication needed to run the protocol on the evolving data. At the same time, it is also necessary to keep the protocol lightweight by keeping the space and time costs low for each stream [30].
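A standard way to obtain such a uniform one-pass sample is reservoir sampling; the following Python sketch (illustrative, not tied to any particular PROTEUS component) keeps a fixed-size sample whose characteristics could then be visualised alongside the full stream.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Return a uniform random sample of size k from a one-pass stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Each item seen so far remains in the sample with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: sample 5 elements from a stream of 10,000 without storing it.
print(reservoir_sample(range(10_000), k=5))
```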

Depending on the approach taken for sampling, the visualisation system should fulfil the following requirements:

- Visual presentation of the selection criteria
- Details about the effect of adding new data points to the sample (e.g., accuracy)
- Visualisation of the characteristics of the sample (moments, distribution, density over time, etc.)

It is worth mentioning that little has been done in this context.

3.1.3 Change detection

The relevance of change detection in streaming applications is quite straightforward, as it aims at identifying any change that occurs. In general, change refers to several concepts such as concept drift [31], novelty detection [32] and anomaly detection [33], each of which has received great attention from different research communities. Generally speaking, a change corresponds to a change in the underlying probability distribution of the data. The goal is therefore to identify the deviation of the model at hand by monitoring its behaviour in real time. Thus, to deal with changes in the context of streaming, detection and verification mechanisms are needed that allow distinguishing noise from real change.
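One simple detection mechanism of this kind is the Page-Hinkley test, sketched below in Python (the threshold and tolerance values are illustrative); it monitors the cumulative deviation of a stream from its running mean and raises an alarm when the deviation exceeds a threshold, which is exactly the kind of signal a visualiser could track and annotate on the stream.

```python
import random

class PageHinkley:
    """Page-Hinkley test for detecting an upward shift in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # tolerance for small fluctuations
        self.threshold = threshold  # alarm threshold on the cumulative statistic
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0              # cumulative deviation statistic
        self.cum_min = 0.0

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold

# Usage: a stream whose mean jumps from ~0 to ~5 halfway through.
random.seed(1)
ph = PageHinkley(threshold=20.0)
for t in range(2000):
    x = random.gauss(0.0, 1.0) if t < 1000 else random.gauss(5.0, 1.0)
    if ph.update(x):
        print("change detected at t =", t)
        break
```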

Visualisation of changes in data streams is a very interesting avenue as it helps to explain the evolution of the

data over time and what factual characteristics/events emerge over time. Therefore the set of requirements

associated with change visualisation can be summarised in the following elements:

- Real-time tracking of the sketches used for change detection (illustrating the change)

- Identification of the change by indicating the location in the stream with the associated sketches

- Illustration of the change effect on the learning algorithm.

3.1.4 Feature selection

Streaming and online feature selection has been the centre of a number of investigations. There are two main variants: (i) horizontal, where the set of data points is fixed while new features become available over time, and (ii) vertical, where the features are fixed while new data arrive over time. A third variant (iii) would be the combination of (i) and (ii), where both new features and new data become available over time. So far, variant (i) has attracted some attention [34][35][36] and variant (ii) has also been investigated in a few works [37][38][39]. To the best of our knowledge, variant (iii) has not yet been discussed.

In terms of visualisation, there has not been much focus on feature selection. INFUSE [40] is an interesting visualisation tool that provides many functionalities (see Figure 2), but it works offline. It is interactive and supports investigation of the feature selection process for static data.

Figure 2: INFUSE visualisation system: (a) overview of INFUSE; (b) glyph representation of features ranked by four selection algorithms

The visualisation system proposed in [41] allows visualising, selecting and measuring the correlation between features in the context of space analysis. It is based on an interesting series of visual designs, as shown in Figure 3.

Mostly two aspects have been presented in these studies:

- The effect of adding new (batches of) data points
- The effect of the selected features on the accuracy of the predictive algorithms

Page 14: D5.1 Visualization Requirements for Massive Online Machine … · 2016. 11. 23. · statistics, machine learning and artificial intelligence over many years. In a nutshell, for the

PROTEUS Deliverable D5.1

687691 Page 14 of 35

Figure 3: Design patterns for feature analysis

For online feature selection, in addition to these aspects, the visualisation system should also fulfil the following (a small ranking sketch is given after this list):

- Interactive (manual) as well as automatic selection of features
- Visualisation of the selection criteria
- Visualisation of the ranking of the features in case of automatic selection
- Visualisation of the data in the space of the selected features (after mapping)
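As a hedged illustration of how such a ranking could be produced online (the scoring rule below is a simple choice for this sketch, not a PROTEUS requirement), the Python snippet maintains, per feature, a running Pearson correlation with the target and ranks features after every update; the ranking is the quantity a visualiser would redraw as the stream evolves.

```python
import math
import random

class OnlineFeatureRanker:
    """Rank features by the absolute running Pearson correlation with the target."""

    def __init__(self, num_features):
        self.n = 0
        self.mean_x = [0.0] * num_features
        self.mean_y = 0.0
        self.cov_xy = [0.0] * num_features   # running co-moments
        self.var_x = [0.0] * num_features
        self.var_y = 0.0

    def update(self, x, y):
        self.n += 1
        dy = y - self.mean_y
        self.mean_y += dy / self.n
        self.var_y += dy * (y - self.mean_y)
        for j, xj in enumerate(x):
            dx = xj - self.mean_x[j]
            self.mean_x[j] += dx / self.n
            self.var_x[j] += dx * (xj - self.mean_x[j])
            self.cov_xy[j] += dx * (y - self.mean_y)

    def ranking(self):
        scores = []
        for j in range(len(self.mean_x)):
            denom = math.sqrt(self.var_x[j] * self.var_y)
            scores.append(abs(self.cov_xy[j] / denom) if denom > 0 else 0.0)
        # Feature indices sorted from most to least correlated with the target.
        return sorted(range(len(scores)), key=lambda j: -scores[j])

# Usage: feature 0 drives the target, feature 1 is noise.
random.seed(0)
ranker = OnlineFeatureRanker(num_features=2)
for _ in range(500):
    x0, x1 = random.random(), random.random()
    ranker.update([x0, x1], y=3.0 * x0 + 0.1 * random.random())
print(ranker.ranking())  # expected: [0, 1]
```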

3.2 Online clustering

The goal of clustering is to uncover the hidden structure of data. In other terms, it aims at partitioning unlabelled data objects into groups, called clusters, in such a way that similar data points are assigned to the same group and distinct ones are assigned to different groups. Ideally, a good clustering algorithm should observe two main criteria: (i) minimisation of the intra-cluster distance, and (ii) maximisation of the inter-cluster distance. A sheer mass of offline clustering algorithms based (at least partly) on these two criteria have been proposed in the literature. We can distinguish two types of clustering algorithms: (i) hard partitioning, where each data object belongs to exactly one cluster, and (ii) soft partitioning, where objects belong to all clusters to a certain degree. In addition, there are two classes of algorithms: (i) partitional, where data are split into a predefined number of clusters according to some criteria, and (ii) hierarchical, where the output takes the form of a tree (dendrogram) and clustering can be performed either in a divisive (top-down) or agglomerative (bottom-up) manner [42].

In terms of visualisation, users are usually interested in the boundaries of clusters (low-density regions between clusters) and the clusters themselves (high-density regions populated by the data points). The cluster centres (prototypes), different statistics (e.g., clustering quality), as well as the membership of the data points to their respective clusters can also be of interest.


3.2.1 Partitional clustering

There are a number of models for online partitional clustering; among these we mention just a few:

Neural networks, such as ART networks [43], generalized fuzzy min-max neural networks (GFMMN) [44], MaxNet [45], evolving self-organizing maps (ESOM) [46][47][48], growing neural gas [49] and many others such as minimum allocation networks. All of these methods are based on the concepts of competitive learning and vector quantization. The visualisation of clustering depends mainly on the computational model used and on the model's output, as shown in Figure 4.

Figure 4: Examples of output by different clustering algorithms: Neural Gas, ART and GFMMN, (E)SOM

One of the neural networks best known for its visualisation capabilities is the self-organising map (SOM) [50]. SOMs are popular because they allow clusters to be shown visually, so that the estimation of clusters can be intuitive. There has been a lot of work on the visualisation of SOMs. The most widely used technique is the U-matrix, which represents the distance between each neuron (codebook vector) and its neighbouring neurons. The U-matrix is visualised as a 2-D image (see Figure 5) with different colourings between neighbouring neurons: a dark colouring indicates a large distance (between clusters), while a light colouring indicates that the neurons are similar (forming a cluster). It is quite interesting to note that the U-matrix can be used even if the input data is high-dimensional. Another technique used to visualise SOMs is the P-matrix; instead of distances, the P-matrix makes use of density values of the data space at the neurons. A combination of the U-matrix and the P-matrix was proposed in [51].

Figure 5: Visualisation of SOM using U-matrix [51]
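To show how little machinery the U-matrix itself needs, here is a minimal Python sketch (assuming a trained rectangular SOM codebook is already available as a grid of weight vectors) that computes, for each neuron, the average Euclidean distance to its grid neighbours; the resulting matrix is what gets rendered as the grey-scale image described above.

```python
import math

def u_matrix(codebook):
    """codebook: 2-D grid (list of rows) of weight vectors for a rectangular SOM.

    Returns a grid of the same shape where each cell holds the mean distance
    from that neuron's weight vector to its 4-connected grid neighbours.
    """
    rows, cols = len(codebook), len(codebook[0])
    umat = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(math.dist(codebook[r][c], codebook[rr][cc]))
            umat[r][c] = sum(dists) / len(dists)
    return umat

# Usage with a toy 2x2 codebook of 2-D weight vectors.
toy_codebook = [[(0.0, 0.0), (0.1, 0.0)],
                [(0.0, 0.1), (1.0, 1.0)]]
for row in u_matrix(toy_codebook):
    print([round(v, 2) for v in row])
```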

While Sammon's mapping [52] is a non-linear mapping from an input space onto an output space and can be considered a dimension reduction technique, it has also been used to visualise SOMs by mapping the codebook vectors onto a plane2. Sammon's mapping and other transformation techniques, such as multidimensional scaling and curvilinear component analysis, are general, but they have been combined with SOMs to enhance the visualisation of data [53]. Clearly these techniques can be applied to the online version of SOM as well.

2 http://users.ics.aalto.fi/jhollmen/dippa/node1.html

Mixture-based models: these can be either parametric (e.g., Gaussian mixture models) or non-parametric (e.g., DBSCAN and Dirichlet process-based clustering). Representative algorithms are the Growing Gaussian Mixture Model (2G2M) [54], the Incremental Gaussian Mixture Model (IGMM) [55] and LSEARCH [56]. The outcome of such algorithms is shown in Figure 6. The visualisation associated with these models would require showing how the clusters are generated and how the clusters themselves evolve as new data become available. Often an optimisation process is applied to control the complexity of the clustering (the number of clusters).

Objective function-based models rely on the optimisation of an objective function, often by introducing simplifications and assumptions to avoid iterating over the data. A set of algorithms based on K-means and Fuzzy C-means have been proposed, e.g. in [57][58], along with other general stream algorithms described in [59]. In general, the partition matrix, the prototypes or other statistics are recursively updated as new data become available (a minimal example of such an update is sketched after Figure 6). In terms of visualisation, there are no specific requirements for this class of online clustering algorithms, so they follow the previous classes.

Figure 6: Visualisation of mixture-based models: (a) Growing GMM, (b) LSEARCH, (c) Incremental GMM
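The following Python sketch (a generic sequential k-means update, not a specific PROTEUS algorithm) illustrates the kind of recursive prototype update mentioned above: each arriving point moves its nearest prototype slightly, and the prototypes are exactly the objects a clustering visualiser would redraw over time.

```python
import math
import random

class OnlineKMeans:
    """Sequential k-means: prototypes are updated one point at a time."""

    def __init__(self, init_prototypes):
        self.prototypes = [list(p) for p in init_prototypes]
        self.counts = [0] * len(self.prototypes)

    def update(self, x):
        # Assign the point to its nearest prototype.
        j = min(range(len(self.prototypes)),
                key=lambda i: math.dist(x, self.prototypes[i]))
        self.counts[j] += 1
        # Move the winning prototype towards the point with a decaying step size.
        eta = 1.0 / self.counts[j]
        self.prototypes[j] = [p + eta * (xi - p)
                              for p, xi in zip(self.prototypes[j], x)]
        return j  # cluster index, useful for colouring points in a live plot

# Usage: two well-separated 2-D blobs.
random.seed(0)
okm = OnlineKMeans(init_prototypes=[(0.0, 0.0), (5.0, 5.0)])
for _ in range(1000):
    centre = random.choice([(0.0, 0.0), (6.0, 6.0)])
    okm.update((random.gauss(centre[0], 0.5), random.gauss(centre[1], 0.5)))
print([[round(v, 2) for v in p] for p in okm.prototypes])
```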

3.2.2 Hierarchical clustering

Traditionally, hierarchical clustering is visualised using a dendrogram, but other representations such as the sunburst3,4 have been used [60] (see Figure 7). In a sunburst diagram, the root of the tree is represented as the centre of the diagram, while each concentric ring represents a subtree (sub-cluster). Online hierarchical clustering [61] requires adapted visualisation techniques. The dendrogram and sunburst representations still work for streams, but additionally the visualisation system should reflect the evolution and the size of the hierarchical clustering structure. It should show the updated leaves continuously or on demand, while accommodating zooming in and out to observe the overall evolution.

3.3 Online classification

Motivated by the requirements of transparency and of understanding the behaviour as well as the decision-making process of classifiers, visualisation is an important tool for gaining insight for both experts and users (non-experts), as explained in the following.

3 http://www.cc.gatech.edu/gvu/ii/sunburst/
4 http://vcg.informatik.uni-rostock.de/~hs162/treeposter/poster.html#Chen2015



Figure 7: Hierarchical structures: (a) dendrogram representation; (b) sunburst representation

3.3.1 Classification trees

Decision trees are among the earliest algorithms adapted for classifying data streams. In particular, the Hoeffding Tree, or Very Fast Decision Tree (VFDT) [62][63], is the standard decision tree algorithm for data stream classification. The Hoeffding tree induction algorithm induces a decision tree from a data stream incrementally, inspecting each example in the stream only once, without the need to store examples after they have been used to update the tree. The only information needed in memory is the tree itself, which stores sufficient statistics in its leaves in order to grow and can be used to form predictions at any time.
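The split decisions behind this growth rest on the Hoeffding bound; a small Python sketch of the bound (with an illustrative range R and confidence delta) is given below, since the bound value tracked per leaf is one of the quantities a tree visualiser can usefully display.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Radius within which the observed mean of n samples is, with probability
    1 - delta, close to the true mean of a variable with the given range."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# A leaf splits when the gap between the best and second-best attribute's
# information gain exceeds this bound (VFDT's splitting rule).
best_gain, second_gain = 0.25, 0.17
n_seen = 400
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=n_seen)
print(round(eps, 4), "split" if best_gain - second_gain > eps else "wait for more data")
```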

Visualising trees during the continuous learning process is, conceptually, generally straightforward. A tree is represented as a set of nodes: the top node (level 0) is the root, the nodes at level i are children of nodes at level i-1, and the lowest level contains the class nodes. Since the decision tree changes (or grows, if new attributes become available) continually in the streaming setting, an efficient visualiser should accommodate the following functionalities:

- During the evolution, show the updated path; this corresponds to a focused examination
- Enable expansion and contraction of the tree's parts around node(s) of interest
- Enable the visualisation of partial results from different workers

While the visualisation of a tree looks quite intuitive, as shown in Figure 8 [64], the space required for visualising the whole tree at once becomes challenging because of its increasing size over time. Therefore new and different visual formats are required. Some available libraries allow a compact visualisation such as the SunBurst5 style, in which the tree looks like a pie chart; other formats can also be used. There exist interesting libraries that can be of high relevance for tree visualisation6,7,8.

3.3.2 Neural networks

Neural networks (NNs) are not very popular for learning data streams, but a number of online NN algorithms were proposed many years ago, such as adaptive resonance theory networks, growing neural gas, incremental learning based on function decomposition, min-max neural networks, and incremental radial basis function networks such as minimal resource allocation networks. Neural networks have recently gained a lot of interest due to the advent of deep learning architectures, which show great potential.

5 http://www.cc.gatech.edu/gvu/ii/sunburst/
6 http://www.informatik.uni-rostock.de/~hs162/treeposter/poster.html
7 http://bl.ocks.org/
8 https://github.com/mbostock/d3

Figure 8: Visualisation of trees [64]

Visualisation of neural networks is often used to track their behaviour and performance. Because NNs are considered black-box tools (once trained, knowledge is encoded as a set of numerical weights), visualisation becomes even more relevant. Early work on NN visualisation goes back to the study in [65], where a diagram, called the Hinton diagram, was proposed to explain NNs (see Figure 9).

Figure 9: A two-layer NN with its corresponding diagram [66]

Hinton diagrams are used to visualise a 2-D array that represents a weight matrix. White and black squares indicate positive and negative values respectively, while the size of each square depicts the magnitude of the corresponding weight.
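Such a diagram is simple to produce; the Python/matplotlib sketch below (an illustrative rendering, not the original tool from [65]) draws one square per weight with the colour encoding the sign and the area encoding the magnitude, and could be refreshed periodically for an online network.

```python
import numpy as np
import matplotlib.pyplot as plt

def hinton(weights, ax=None):
    """Draw a Hinton diagram: white = positive, black = negative, size = magnitude."""
    ax = ax or plt.gca()
    ax.set_facecolor("gray")
    max_w = np.abs(weights).max() or 1.0
    for (row, col), w in np.ndenumerate(weights):
        colour = "white" if w > 0 else "black"
        side = np.sqrt(abs(w) / max_w)  # square area proportional to |w|
        ax.add_patch(plt.Rectangle((col - side / 2, row - side / 2),
                                   side, side, facecolor=colour))
    ax.set_xlim(-1, weights.shape[1])
    ax.set_ylim(-1, weights.shape[0])
    ax.invert_yaxis()
    ax.set_aspect("equal")

# Usage: a random 4x6 weight matrix standing in for one layer of a network.
rng = np.random.default_rng(0)
hinton(rng.normal(size=(4, 6)))
plt.show()
```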


A more intuitive visualisation technique, called the bond diagram, proposed in [67], depicts not only the weights but also the topology of the network, as shown in Figure 10. The weights are depicted as bonds, where positive and negative weights are given different colours or shapes (e.g., striped bonds represent negative weights and plain ones represent positive weights). The thickness indicates the magnitude of the weights, while the size of the units represents the magnitude of the unit's bias.

These diagrams do not refer to the training data or to how the decision boundary (i.e., the hyperplane) is determined. To overcome this problem, [68] proposed a visualisation tool that illustrates how the hyperplane changes during the learning process; variations of this representation have been used in different tools. A hyperplane diagram is used to depict how hidden units make decisions based on the input units, but more often it is applied to illustrate how output units make decisions using the hidden units' outputs, as shown in Figure 11. The hyperplanes show how the hidden units partition their input by approximating their transfer function with a threshold function. The movement of the hyperplanes can be animated along with the change of the weights during training to exhibit the network's behaviour.

Figure 10: Bond diagram for a network consisting of 6 input units, 2 hidden units and one output unit.

Figure 11: Hyperplane diagram for the hidden nodes of the network shown in Figure 9 [66].

An alternative technique for visualising the decision boundary is to use response-function plots, which show the decision surfaces formed by the individual hidden and output units. Figure 12 shows an example of response-function plots; the axes are the inputs of each unit and lighter shades indicate higher activation.


Figure 12: Response-function plots for the network shown in Figure 9 [66]. Leftmost and middle plots

represent the hidden units, while the rightmost plot represents the output unit.
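A minimal way to obtain such a plot, assuming a trained model exposes a per-unit activation as a function of two chosen inputs (the logistic unit below is purely a hypothetical example), is to evaluate it on a grid and render the result as an image:

```python
import numpy as np
import matplotlib.pyplot as plt

def response_surface(unit, x_range=(-3, 3), y_range=(-3, 3), resolution=200):
    """Evaluate a unit's activation over a 2-D input grid and plot it as an image."""
    xs = np.linspace(*x_range, resolution)
    ys = np.linspace(*y_range, resolution)
    xx, yy = np.meshgrid(xs, ys)
    zz = unit(xx, yy)  # activation at every grid point
    plt.imshow(zz, origin="lower", extent=(*x_range, *y_range), cmap="gray")
    plt.colorbar(label="activation")
    plt.xlabel("input 1")
    plt.ylabel("input 2")

# Example: a single logistic unit with weights (1.5, -2.0) and bias 0.5.
def logistic_unit(x1, x2, w1=1.5, w2=-2.0, b=0.5):
    return 1.0 / (1.0 + np.exp(-(w1 * x1 + w2 * x2 + b)))

response_surface(logistic_unit)
plt.show()
```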

However, visualising data can be challenging when it is high-dimensional (more than three dimensions). Hence, a good visualisation in such a case should be equipped either with an interactive interface that allows the user to dynamically and repeatedly decide which features (dimensions) to plot, or with an option for the tool to choose the most influential features using feature selection algorithms.

The authors in [66] developed a tool, called Lascaux, which offers the possibility to display (a) the architecture of the network and (b) the importance of the weights (solid lines indicate positive weights, dashed lines indicate negative weights; the thicker the line, the more important the weight).

With the advent of deep learning, the visualisation of neural networks has taken on more importance, and several recent visualisation studies have emerged [69][70]. The challenge of visualising deep architectures is well known [70], because (a) the complexity and the number of layers of a deep architecture are often not well understood, (b) each component of a network may have dozens of hyper-parameters, and (c) the complexity of neural networks has protected them from the rigorous formalism of other fields of machine learning, so practitioners can only rely on anecdotal results to guide design.

An interesting visualisation tool, called deepViz, was presented in [70]. It enables the users to view bitmap representations of filter banks, weight matrices, the output for a corresponding input image, the confusion matrix, images from various classes, etc. [71] suggested the use of a deconvolution method to visualise convolution networks by finding which neurons are activated by which parts of the image; the deconvolution consists of projecting the prediction from the output layer back to the input layer.

Similarly, the authors in [72] introduced a tool (see Figure 13) that provides an interactive visualisation of the neurons in a trained convolution network when presented with an image or video. The tool allows visualising forward activations of units, the top images for each unit from the training set, and deconvolution to understand the reaction of units to images as proposed in [71].

Figure 13: Visualisation of convolution networks

Other studies on the visualisation of deep networks [73][74] applied sensitivity analysis, resulting in heatmaps based on partial derivatives. Interestingly, this approach is general and can be applied to any classifier, whether linear or non-linear; it measures the relative importance of the input features to the classifier [75].


This section has focused mostly on offline neural networks, but the aim was to show the visualisation techniques that have been proposed for neural networks; all of these techniques fit the case of online neural networks for data streams as well. To summarise, the requirements for real-time informative visualisation when using neural networks are as follows:

- Insight into the behaviour of selected neurons of the network upon presentation of a new input or a batch of inputs, possibly at regular intervals or on demand
- Insight into the evolution of the decision boundary
- For networks with a dynamic structure that evolves over time, insight into the evolution of the architecture
- Insight into the evolution of the actual accuracy of the network

3.3.3 Probabilistic classifiers

Probabilistic classifiers (e.g., Bayesian networks) often seek to quantify the likelihood of a data point belonging to each of the classes. Such membership probabilities explain well the behaviour as well as the decision making of the classifier. It is therefore quite appealing to visualise the probabilistic landscape associated with the classifier's model and output.

For instance, the authors in [76] proposed a tool to visualise class probabilities. They used SOM-based projection techniques to visualise the data, and then considered various quantities such as the class probability at each point in the data space, the decision boundary indicating the classes, misclassification information, misclassification types (e.g., false positives and false negatives for binary classes) and the meta-attributes of the model (e.g., the distribution and density of the training data used to build the model, as well as the confidence assigned to each estimate). Figure 14 shows some of their heatmap-based visualisation output.

The authors in [77] apply Bayesian decision theory to compute the risks, as a function of the class probabilities, associated with misclassification and to visualise the class boundaries. They further developed an interactive optimisation system, called ManiMatrix, to steer the classification towards a target confusion-matrix configuration set by the user.

Figure 14: Visualisation of class probabilities and decision boundary

The authors in [78] describe visual methods for analysing the classification results of a probabilistic classifier. In particular, class probabilities are represented as coloured histograms, while features are ranked and visualised based on their discrimination power. Different interaction mechanisms are used to explore the classification outcome as well as the data.

The work in [79] proposed parametric embedding (PE), a projection method, to visualise the posteriors estimated by a mixture model. PE maps the class posteriors of data points onto an embedding space by minimising a sum of Kullback-Leibler divergences, assuming that the data in the embedding space is generated by a Gaussian mixture with equal covariances.
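As an illustration (up to notational differences with [79]), the PE objective can be sketched as follows: given the class posteriors $p(c_k \mid x_n)$ in the original space, PE finds embedding coordinates $r_n$ for the data points and $\phi_k$ for the class centres by minimising

\[ E = \sum_{n} \sum_{k} p(c_k \mid x_n) \, \log \frac{p(c_k \mid x_n)}{p(c_k \mid r_n)}, \qquad p(c_k \mid r_n) = \frac{\exp\left(-\tfrac{1}{2}\lVert r_n - \phi_k \rVert^2\right)}{\sum_{l} \exp\left(-\tfrac{1}{2}\lVert r_n - \phi_l \rVert^2\right)}, \]

where the unit-variance Gaussians encode the equal-covariance assumption in the embedding space.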


The authors in [80] proposed a visualisation system that shows the classification results at three levels: the classifier level, the class level, and the test item level. Classes are represented on the perimeter of a circle which is filled with the data. Any data point can be interactively activated so that its class probabilities are highlighted using lines whose thickness indicates the value of the class probability.

From these examples, it can easily be concluded that the visualisation of probabilistic classification makes use of the class distribution of data points to explain both the classification (class boundaries) and the membership of data points to classes. Therefore, online probabilistic classifiers for data streams, such as the online random Naïve Bayes classifier [81] and Bayesian online classification [82], should follow the same visualisation concept.

3.3.4 Ensemble learning

The idea of combining different algorithms, known as ensemble learning, has attracted a lot of attention. The main motivation is that even if the performance of one or a few predictors (learners, experts), called base learners, is not satisfactory, the ensemble of algorithms can still perform well. Usually, when the task is relatively hard, multiple predictors are used following the divide-and-conquer principle [83][84]. Ensemble learning also offers the advantage of limiting the effect of the parameter settings of each base learner.

We can distinguish two combination schemes. In the first, the base learners (based on the same model) are trained on different, randomly generated data sets (re-sampled from a larger training set) before they are combined; this includes stacking, bagging and boosting. The second scheme assumes that the ensemble contains several base learners trained on the same data but based on different models (neural networks, decision trees, etc.) with different parameters and trained using different initial conditions (e.g., weight initialisation in neural networks). Both schemes seek to ensure high diversity within the ensemble.

There have been very few studies on visualisation in the context of ensemble learning. Probably the most prominent one is the work presented in [85]. This study proposed an interactive visualisation system, called EnsembleMatrix, that graphically visualises the confusion matrices of the individual learners (classifiers) in order to understand the behaviour and performance of each learner. EnsembleMatrix enables users to interact with the confusion matrices to decide on the combination they want to inspect.

Figure 15 shows EnsembleMatrix. The left side shows the confusion matrix of the current ensemble classifier built by the user, while the bottom right side shows the confusion matrices of the individual base classifiers. Here the confusion matrix is encoded by colour (the darker, the higher). The user can select any part of the ensemble confusion matrix to examine how the base classifiers perform on that part. The top left side (polygon) is used by the user to set the weight of each classifier in the desired linear combination.
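For illustration (the exact formulation used in EnsembleMatrix may differ), such a linear combination of $M$ base classifiers with probabilistic outputs $p_m(c \mid x)$ and user-set weights $w_m \geq 0$, $\sum_m w_m = 1$, can be written as

\[ p_{\text{ens}}(c \mid x) = \sum_{m=1}^{M} w_m \, p_m(c \mid x), \qquad \hat{y}(x) = \arg\max_{c} \, p_{\text{ens}}(c \mid x), \]

so that adjusting the weights interactively immediately changes both the ensemble decision and its confusion matrix.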


Figure 15: EnsembleMatrix visualisation.

EnsembleMatrix was patented (http://www.google.co.uk/patents/US8306940), with further details, especially in terms of its workflow steps, as shown in Figure 16.

Figure 16: Details of EnsembleMatrix.

The authors in [86] used different visualisations to analyse tree forests generated using bagging. Such visualisations are, however, static and only display statistics about the trees and the variables involved. Similar work was presented in [64], where a visualisation tool, called VISE, was proposed for analysing small ensembles consisting of three decision trees obtained using bagging.

PROTEUS focuses on data streams and online learning, hence the relevance of online ensemble learning. There have been a number of studies reporting on online ensemble learning [87][88][84][89][90], but none of these studies looked at the visualisation aspect, except for performance curves.

It is however clear that the visualisation techniques for online ensemble methods should reflect the real-time updates, which may be of different natures [84]:

- Visualisation of the evolution of the weights associated with the ensemble when using a dynamic

combination of the learners to illustrate the importance of each base learner.

- Visualisation of new data, the classification boundary and the confusion matrix of both the ensemble and the base learners.

- Visualisation of the ensemble structure, possibly along with the evolution of the performance as in Figure 15. In case a dynamic structure is adopted, that is, if new learners are added or removed dynamically, the change should be tracked so that the evolution of the ensemble as well as that of the base learners can be followed.

While a learner here refers to a classifier, these visualisation requirements apply to any type of ensemble (e.g., clustering, mapping, regression).

3.4 Online regression

Regression is about investigating the relationship between one or several predictors (known as independent variables, input variables or explanatory variables) and the response (known as the dependent variable or output). Many models exist: linear, nonlinear, multiple linear and nonlinear, etc. Overall, there are two types of models: parametric and non-parametric regression. A number of studies also discuss incremental and online regression [91][92][93], often relying on mechanisms such as recursive least squares or online support vector machines [94][95].
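As an illustration only (the standard textbook form, not the specific formulation of [91]-[95]), recursive least squares updates the parameters $\theta$ of a linear model $y \approx x^{\top}\theta$ for each new pair $(x_t, y_t)$ as

\[ k_t = \frac{P_{t-1} x_t}{\lambda + x_t^{\top} P_{t-1} x_t}, \qquad \theta_t = \theta_{t-1} + k_t \left( y_t - x_t^{\top} \theta_{t-1} \right), \qquad P_t = \frac{1}{\lambda} \left( P_{t-1} - k_t x_t^{\top} P_{t-1} \right), \]

where $\lambda \in (0, 1]$ is a forgetting factor. Visualising $\theta_t$ and the residuals $y_t - x_t^{\top}\theta_{t-1}$ over time is a natural way to track such an online regression model.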

However, in terms of visualisation, authors often show just the outcome of the model fitting and compute different quantities to report the performance of the model. Assessing the relationship between the response and the predictors through visual inspection is nonetheless worthwhile.

The authors in [96] described a visualisation tool, called visreg, for different regression models. Visreg mainly illustrates the regression curve along with the corresponding region (see Figure 17). The work in [97] also discussed in detail visualisation techniques for various regression models using Stata. The study presented in [98] suggested an interactive approach to developing regression models. Two approaches are proposed: the first is based on multiple experts, while the second performs local regression relying on guidance from the user.

Figure 17: Output of visreg for a non-linear regression model

Visualisation of online regression models should allow for the following:

- The display of the model in 2D and 3D if the data is multivariate.

- The evolution of the regression model, for instance how the fit changes as new data becomes available.

- The evolution of the model’s parameters when new data arrives.

- The quantification of the fitness/shape of the model to the data.

- The indication of the regions where the fit is not good.

These requirements apply uniformly to all types of regression models (generalised linear models, nonparametric models, decision trees, etc.) and to all computational models (neural networks, probabilistic models, statistical learning, etc.) used to compute the regression models.

It is worth mentioning that regression models have also been used for the efficient visualisation of multi-dimensional data, for example in [99][91]. The work in [99] used an ensemble of regression models (i.e., neural networks) to reduce the dimensionality by mapping the features of the input space into a two-dimensional latent space. The authors in [91] developed a tool for multivariate data visualisation and exploration based on the integrated use of regression analysis and advanced parallel coordinates visualisation. Because of the difficulty of using parallel coordinates for presenting multivariate data on a 2D screen, the authors used a LASSO-based regression model to select, order, and group dimensions.


4 Architectural requirements for scalable visual analytics

In order to deal with the visualization issues raised by massive datasets and continuous, unbounded streams of data, PROTEUS proposes a novel architecture based on incremental methods. This approach will allow end users to explore both data-at-rest and data-in-motion efficiently in order to make well-informed decisions in real time.

The architecture we propose consists of three main layers (see Figure 18): Data Collector, Incremental Analytics Engine and Visualization Layer. The Data Collector is in charge of continuously getting new data from data sources and sending them to the next processing layer. The Incremental Analytics Engine processes data using the online incremental algorithms and outputs up-to-date results, which are then visualized by the third layer. The visualization of the results at various time points allows users to track and interact with those results in real time.

Figure 18: PROTEUS’s Architecture

4.1 Data collector

The data collector is in charge of continuously collecting new stream data points from the data sources. As soon as a data chunk (window) becomes available, it is sent to the next layer to cope with the high velocity of the stream. The process of data collection is done incrementally, in line with the requirements of stream processing. Figure 19 depicts how the data collector retrieves data chunks and sends them sequentially to the next layer:

Figure 19: Data collector


With this approach, the next layer does not need to wait until the data collection process ends, since it continuously receives chunks of data from the data collector.
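As an illustration of this chunk-wise behaviour, the following minimal Java sketch (hypothetical names, not the actual PROTEUS component) buffers incoming points and forwards a chunk to the next layer as soon as the configured chunk size is reached:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    /** Illustrative data collector that groups stream points into fixed-size chunks (sketch only). */
    public class ChunkingCollector<T> {

        private final int chunkSize;
        private final Consumer<List<T>> downstream; // callback forwarding a chunk to the next layer
        private List<T> buffer = new ArrayList<>();

        public ChunkingCollector(int chunkSize, Consumer<List<T>> downstream) {
            this.chunkSize = chunkSize;
            this.downstream = downstream;
        }

        /** Called for every new data point arriving from a data source. */
        public void onNewPoint(T point) {
            buffer.add(point);
            if (buffer.size() >= chunkSize) {
                downstream.accept(buffer);   // send the chunk as soon as it is full
                buffer = new ArrayList<>();  // start a new chunk without waiting for the stream to end
            }
        }
    }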

4.2 Incremental analytics engine

The incremental analytics engine processes data incrementally, mostly using the concepts of recursivity and approximation. The algorithmic processing of each batch leads to results that are communicated to the next layer for visualization. Online incremental analytics algorithms can range from simple statistics (e.g., average, median, sum, max, min) to advanced machine learning and data mining algorithms such as classifiers and clustering algorithms. For instance, Figure 20 depicts how the average is calculated with a traditional (offline) approach.

Figure 20: Traditional AVG computation

In order to send the final result to the visualization layer, the offline AVG computation needs to wait until all the data points are processed. Instead, Figure 21 summarizes a naïve online version of the AVG computation.

Figure 21: Concept of incremental AVG computation

This version depicts a poor implementation of the incremental average, since a new result is sent for each new element processed, leading to communication overhead. In addition, it assumes that only one partial result is generated (e.g., there is no average per key) and that the whole computation occurs on a single node (no distributed architecture). As explained in the Data Collector section, input data are received in parts and then processed chunk-wise.

To realize the incrementality in an efficient way, we make use of mechanisms offered by big data streaming engines. These engines introduce concepts such as windowing, which splits data streams into finite sequences of data points. By using windows, it is possible to execute aggregations on unbounded data streams. We process data in windows of size N (denoted as WINDOW_SIZE in Figure 22) to generate a partial result. When a window is filled, it automatically calls the apply(…) method, which initiates the computation for that window. Every incremental operation executed over a window needs the result of the previous one to successfully calculate the new average. Normally it is necessary to save not only the new average, but also the number of windows used for that partial result. That information is called the state.
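To make the role of this state concrete, the following minimal Java sketch (hypothetical names, not the actual PROTEUS code) shows how a stored pair (count, average) can be merged with the values of a new window to obtain the updated partial average:

    /** Minimal sketch (hypothetical) of the state kept for an incremental average. */
    public class AverageState {

        private long count = 0;      // number of elements summarised so far
        private double average = 0;  // running average over those elements

        /** Merges the values of a new window into the state and returns the updated partial average. */
        public double update(double[] windowValues) {
            double windowSum = 0;
            for (double v : windowValues) {
                windowSum += v;
            }
            long newCount = count + windowValues.length;
            if (newCount == 0) {
                return average; // nothing seen yet
            }
            // weighted combination of the previous average and the new window
            average = (average * count + windowSum) / newCount;
            count = newCount;
            return average;
        }
    }

With such a state, a single partial result per window, rather than one per element, is sent to the visualization layer.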

The code in Figure 22 summarizes the normal flow for stream processing in Apache Flink using the window concept. The IncOperation class provides the logic necessary to calculate approximate results, window by window. As an example of an incremental operation, Figure 23 shows the schema of the IncrementalAverageOperation implementation of its apply(…) function.

For reference, the pseudocode of Figures 20 and 21 is reproduced below. Traditional (offline) AVG computation (Figure 20):

    avg = 0;
    for value in values
        avg += value;
    avg = avg / values.length;
    send(avg);

Naïve incremental AVG computation (Figure 21), sending a partial result for every element processed:

    sum = 0; count = 0;
    for value in values
        sum += value;
        count += 1;
        send(sum / count);


Figure 22: Flink generic workflow for incremental operations

Figure 23: Implementation of the apply method for IncrementalAverageOperation

As incremental operations continuously send results to the visualization layer, a continuous and full-duplex communication channel between the incremental analytics engine and the visualization layer is required. For this reason, we have decided to use WebSockets. WebSocket is a protocol providing full-duplex communication channels over a single TCP connection. It is an independent TCP-based protocol designed to be implemented in web browsers and web servers, but it can be used by any client or server application. We use the WebSocket protocol instead of HTTP or other application protocols because these do not provide bidirectional communication (in a typical web scenario, architectures are based on request-response protocols such as HTTP).

When a window computation ends, the DataStream class automatically calls the invoke() method of the WebsocketSink class, which is in charge of sending results to the visualization layer. Figure 24 depicts how the incremental analytics engine and the visualization layer are connected. Once a partial result is computed, it is sent to the websocket server and then to the visualization layer.

Figure 24: Real-time and incremental communication

For reference, the code shown in Figures 22 and 23 is reproduced below. The generic Flink workflow of Figure 22:

    stream
        .keyBy("key")
        .countWindow(WINDOW_SIZE)
        .apply(new IncOperation())
        .addSink(new WebsocketSink());

The body of the apply(…) method of IncrementalAverageOperation (Figure 23):

    ValueState state = getRuntimeContext().getState();
    double[] values = getWindowValues();
    // calculate the new avg using the new elements and the previous result
    double actualAVG = calculate(state, values);
    // update the window state with the new value
    updateWindowState(state, actualAVG);
    // send the partial result to the WebsocketSink
    collector.collect(actualAVG);
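For illustration, a minimal sketch of such a sink is given below. It assumes Flink's SinkFunction interface and a hypothetical WebsocketClient helper that performs the actual transmission; it is not the actual PROTEUS implementation:

    import org.apache.flink.streaming.api.functions.sink.SinkFunction;

    /** Illustrative sink that forwards each partial result to the visualization layer (sketch only). */
    public class WebsocketSink implements SinkFunction<Double> {

        private final String serverUri; // address of the websocket server, e.g. "ws://localhost:8080/results" (example only)

        public WebsocketSink(String serverUri) {
            this.serverUri = serverUri;
        }

        @Override
        public void invoke(Double partialResult) throws Exception {
            // WebsocketClient is a hypothetical helper wrapping a websocket library;
            // it pushes the partial result (e.g. serialised as text or JSON) over the open connection.
            WebsocketClient.send(serverUri, String.valueOf(partialResult));
        }
    }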


4.3 Visualization layer

This layer is a web-based library that allows users to graphically visualize, in real time, the results of the incremental operations carried out by the incremental analytics engine. Visualizations are performed using a set of minimalist graphs provided by the layer: line, bar, pie and stream graphs are some of the basic visualization elements available in this library. All of the components of the visualization layer have been developed using JavaScript, since it is the de facto programming language for creating interactive applications on the web.

Like all modern programming languages, JavaScript is implemented following a language specification. ECMAScript 6 (commonly known as ES6) is the latest JavaScript specification, which we use for developing the visualization layer. ES6 is not yet fully supported by all modern browsers, but they tend to implement most of its features, such as arrow functions, classes and modules.

This layer continuously receives partial results from the incremental analytics engine layer. To implement this functionality, we need two core components: (i) a websocket connector that receives the results from the incremental analytics engine layer, and (ii) a graph library that contains the basic graphical elements used to visualize data.

4.3.1 Websocket connector

WebSocket is a technology, based on the WebSocket protocol, that makes it possible to establish a continuous, full-duplex connection stream between a client and a server. Although the ws protocol is platform independent, clients are typically web browsers. The visualization layer provides a JavaScript websocket connector that enables bidirectional communication between web browsers and the WebsocketSink component developed in the analytics engine layer. This connector is in charge of receiving and sending data and acts as a proxy between the visualization and analytics layers.

4.3.2 Graph library

Our graph library is a visualization tool that allows users to visualize data in real time. This library will be built using Scalable Vector Graphics (SVG) as the format to present data to users. SVG is a language for describing two-dimensional graphics in XML. It allows for three types of graphic objects: vector graphic shapes, images and text. Graphical objects can be grouped, transformed and composited onto previously rendered objects. SVG also includes nested transformations, clipping paths, alpha masks, filter effects and template objects.

After analyzing other graphic technologies such as Canvas and WebGL, we opted for SVG due to its simplicity and easy user-interaction API. To deal with the SVG API and facilitate graph creation, we have started building the library on top of D3.js. D3.js is a JavaScript library for manipulating documents based on data. D3.js exploits the full capabilities of modern browsers and follows a data-driven approach to DOM manipulation.


5 Conclusions

Interactive visualisation of big data is extremely valuable for both the expert (data scientist) and the end-user for understanding the behaviour of machine learning algorithms and their decision-making process. Tuning the parameters, changing the training data and tracking back the results (cause-effect) are some of the requirements that white-box machine learning should fulfil in order to provide transparency.

In this report, we reflect on these requirements by discussing the value of interactive visualisation, the

existing visualisation techniques and tools described in the literature and related to selected computational

models that are used for online and stream-based machine learning. For each model, we put forward the

minimum visualisation requirements, from both perspectives, the expert and user perspectives, that such

model should be equipped with. These requirements reflect mostly on the volume and velocity aspects that

characterise big data. Towards the end of the report, a quite detailed description of the visualisation system

that PROTEUS will develop is presented. The final aim of PROTEUS is to obtain a system that is interactive

and fulfils the requirements associated with the various scalable online machine learning algorithms to be

developed.


References

[1] B. Shneiderman and C. Plaisant. Designing the User Interface. Pearson Education, Inc., 2005.

[2] C. Chen, Top 10 unsolved information visualization problems, in IEEE Computer Graphics and

Applications, vol. 25, no. 4, pp. 12-16, July-Aug. 2005.

[3] D. A. Keim, Information visualization and visual data mining, in IEEE Transactions on Visualization

and Computer Graphics, vol. 8, no. 1, pp. 1-8, Jan/Mar 2002.

[4] M. Krstajić and D. A. Keim, "Visualization of streaming data: Observing change and context in

information visualization techniques," Big Data, 2013 IEEE International Conference on, Silicon

Valley, CA, 2013, pp. 41-47.

[5] D. Simeonidou, R. Nejabati, G. Zervas, D. Klonidis, A. Tzanakaki, and M. J. O’Mahony, “Dynamic

optical-network architectures and technologies for existing and emerging grid services,” J. Light.

Technol., vol. 23, no. 10, pp. 3347–3357, Oct. 2005.

[6] D. A. Keim, C. Panse, M. Sips, and S. C. North, “Visual Data Mining in Large Geospatial Point

Sets,” IEEE Comput. Graph. Appl., vol. 24, no. 5, pp. 36–44, Sep. 2004.

[7] J. Heer, J. Mackinlay, C. Stolte, and M. Agrawala, “Graphical histories for visualization: supporting

analysis, communication, and evaluation.,” IEEE Trans. Vis. Comput. Graph., vol. 14, no. 6, pp.

1189–96, Jan. 2008.

[8] N. D. Lane, Y. Xu, H. Lu, A. T. Campbell, T. Choudhury, and S. B. Eisenman, “Exploiting Social

Networks for Large-Scale Human Behavior Modeling,” IEEE Pervasive Comput., vol. 10, no. 4, pp.

45–53, Apr. 2011.

[9] C. L. Philip Chen and C.-Y. Zhang, “Data-intensive applications, challenges, techniques and

technologies: A survey on Big Data,” Inf. Sci. (Ny)., vol. 275, pp. 314–347, Aug. 2014.

[10] J. Davey, F. Mansmann, and D. Keim, “Visual Analytics: Towards Intelligent Interactive Internet and

Security Solutions,” pp. 93–104, 2012.

[11] J. Heer and B. Shneiderman, “Interactive dynamics for visual analysis,” Commun. ACM, vol. 55, no.

4, p. 45, Apr. 2012.

[12] B. Shneiderman, “Extreme visualization,” in Proceedings of the 2008 ACM SIGMOD international

conference on Management of data - SIGMOD ’08, 2008, pp. 3–12.

[13] “Tableau Desktop | Tableau Software.” [Online]. Available:

http://www.tableau.com/products/desktop. [Accessed: 31-Mar-2015].

[14] “SAP Lumira.” [Online]. Available: http://saplumira.com/. [Accessed: 31-Mar-2015].

[15] “Qlik - Business Intelligence and Data Visualization Software.” [Online]. Available:

http://www.qlik.com/uk. [Accessed: 31-Mar-2015].

[16] “TIBCO Spotfire - Business Intelligence Analytics Software & Data Visualization.” [Online].

Available: http://spotfire.tibco.com/. [Accessed: 01-Apr-2015].

[17] “Datawatch | Data Visualization and Big Data Analytics.” [Online]. Available:

http://www.datawatch.com/. [Accessed: 31-Mar-2015].

[18] J.-D. Fekete, “The InfoVis Toolkit,” pp. 167–174, Oct. 2004.

[19] J. Heer, S. K. Card, and J. A. Landay, “prefuse,” in Proceedings of the SIGCHI conference on Human

factors in computing systems - CHI ’05, 2005, p. 421.

[20] C. Weaver, “Building Highly-Coordinated Visualizations in Improvise,” in IEEE Symposium on

Information Visualization, pp. 159–166.

[21] “D3.js - Data-Driven Documents.” [Online]. Available: http://d3js.org/. [Accessed: 01-Apr-2015].


[22] C. Humphries, N. Prigent, C. Bidan, and F. Majorczyk, “ELVIS,” in Proceedings of the Tenth

Workshop on Visualization for Cyber Security - VizSec ’13, 2013, pp. 9–16.

[23] H. Koike and K. Ohno, “SnortView,” in Proceedings of the 2004 ACM workshop on Visualization

and data mining for computer security - VizSEC/DMSEC ’04, 2004, p. 143.

[24] F. Fischer, F. Mansmann, and D. A. Keim, “Real-time visual analytics for event data streams,” in

Proceedings of the 27th Annual ACM Symposium on Applied Computing - SAC ’12, 2012, p. 801.

[25] D. M. Best, S. Bohn, D. Love, A. Wynne, and W. A. Pike, “Real-time visualization of network

behaviors for situational awareness,” in Proceedings of the Seventh International Symposium on

Visualization for Cyber Security - VizSec ’10, 2010, pp. 79–90.

[26] P. McLachlan, T. Munzner, E. Koutsofios, and S. North, “LiveRAC,” in Proceeding of the twenty-

sixth annual CHI conference on Human factors in computing systems - CHI ’08, 2008, p. 1483.

[27] J. Lin, E. Keogh, S. Lonardi, J. P. Lankford, and D. M. Nystrom, “VizTree: a tool for visually

mining and monitoring massive time series databases,” pp. 1269–1272, Aug. 2004.

[28] R. Agrawal, A. Kadadi, X. Dai, and F. Andres, “Challenges and opportunities with big data

visualization,” in Proceedings of the 7th International Conference on Management of computational

and collective intElligence in Digital EcoSystems - MEDES ’15, 2015, pp. 169–173.

[29] S. Ganguly and G. Cormode. On estimating frequency moments of data streams. In Proceedings of

the 11th International Workshop on Randomization and Computation (RANDOM), pages 479-493,

2007.

[30] Cormode, G., Garofalakis, M., Haas, P., Jermaine, C.: Synopses for Massive Data: Samples,

Histograms, Wavelets, Sketches. Now Publishing: Foundations and Trends in Databases Series

(2011)

[31] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept drift

adaptation.In ACM Comput. Surv. 46, 4, 2014.

[32] M. Pimentel, D. Clifton, L. Clifton, and L. Tarassenko. Review: A review of novelty detection.

Signal Process. 99 (June 2014), 215-249, 2014.

[33] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv. 41, 3,

2009.

[34] J. Zhou, D. P. Foster, R. A. Stine, and L. H. Ungar. Streamwise feature selection. J. Mach. Learn.

Res., 7:1861–1885, 2006.

[35] X. Wu, K. Yu, W. Ding, H. Wang and X. Zhu. Online Feature Selection with Streaming Features. In

IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1178-1192, May

2013.

[36] S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient

descent in function space. The Journal of Machine Learning Research, vol. 3, pp. 1333–1356, 2003.

[37] A. Bouchachia. An evolving classification cascade with self-learning. Evolving Systems 1(3): 143-

160 (2010)

[38] J. He, L. Balzano, and J. Lui. Online robust subspace tracking from partial information. arXiv

preprint arXiv:1109.3827, 2011.


[39] C. Qiu, N. Vaswani, and L. Hogben. Recursive robust pca or recursive sparse recovery in large but

structured noise. arXiv preprint arXiv:1211.3754, 2012.

[40] J. Krause, A. Perer, E. Bertini, Infuse: interactive feature selection for predictive modeling of high

dimensional data, IEEE Trans. Visual. Comput. Graph. 20 (12) (2014) 1614–1623.

[41] S. Goodwin, Dykes, J., Slingsby, A. & Turkay, C. Visualizing Multiple Variables Across

Scale and Geography. IEEE Transactions on Visualization and Computer Graphics, 22(1), pp. 599-

608, 2015.

[42] A. Jain, M. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31:264–323,

1999.

[43] S. Grossberg. Nonlinear neural networks: principles, mechanism, and architectures. Neural

Networks, 1:17–61, 1988.

[44] B. Gabrys and A. Bargiela. General fuzzy min-max neural network for clustering and classification.

IEEE Trans. on Neural Networks, 11(3):769–783, 2000.

[45] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 2006.

[46] D. Da Deng and N. Kasabov. On-line pattern analysis by evolving self-organizing maps.

Neurocomputing, 51:87 – 103, 2003.

[47] N. Rougier and Y. Boniface. Dynamic self-organising map. Neurocomputing, 74(11):1840–1847, 2011.

[48] J. Rubio. Sofmls: Online self-organizing fuzzy modified least-squares network. IEEE T. Fuzzy

Systems, 17(6):1296–1309, 2009.

[49] B. Fritzke. A growing neural gas network learns topologies. In Advances in neural information

processing systems, pages 625–632, 1995.

[50] T. Kohonen, Self-Organization and Associative Memory. Springer, Berlin, 1984.

[51] A. Ultsch. Clustering with SOM: U*C. Proceedings Workshop on Self-Organizing Maps, pp. 75-82,

2005.

[52] J.W. Sammon, “A nonlinear mapping for data structure analysis,” IEEE Trans. Comput., vol. C-18,

pp. 401–409, 1969.

[53] S. Wu and T. Chow. PRSOM: A new visualization method by hybridizing multidimensional scaling

and self-organizing map. IEEE Trans. Neural Networks, 16: 1362–1380.

[54] A. Bouchachia and C. Vanaret: GT2FC: An Online Growing Interval Type-2 Self-Learning Fuzzy

Classifier. IEEE Trans. Fuzzy Systems 22(4): 999-1018 (2014)

[55] Engel, P. M. and Heinen, M. R. (2010b). Incremental learning of multivariate Gaussian mixture

models. In Proc. 20th Brazilian Symposium on AI (SBIA), volume 6404 of LNCS, pages 82–91, São

Bernardo do Campo, SP, Brazil. Springer-Verlag.

[56] S. Guha, A. Meyerson, N. Mishra, R. Motwani and L. O'Callaghan. Clustering data streams: Theory

and practice. In the IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 515-

528, May-June 2003.


[57] D. Dovzan and I. Skrjanc. Recursive clustering based on a Gustafson-Kessel algorithm. Evolving Systems, 2(1):15–24, 2011.

[58] O. Georgieva and F. Klawonn. Dynamic data assigning assessment clustering of streaming data.

Appl. Soft Comput., 8(4):1305–1313, 2008.

[59] J. A. Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. de Carvalho, and J. Gama. Data stream

clustering: A survey. ACM Computing Surveys (CSUR), vol. 46, no. 1, p. 13, 2013.

[60] R. Patton, J. Beaver, C. Steed, T. Potok and J. Treadwell. Hierarchical clustering and visualization of

aggregate cyber data. The 7th International Wireless Communications and Mobile Computing

Conference (IWCMC), 1287-1291, 2011.

[61] P. Rodrigues, Gama, J., and Pedroso, J. Hierarchical clustering of time-series data streams. IEEE

Transactions on Knowledge and Data Engineering, 20 (5), 615-627, 2008.

[62] G. Hulten, L. Spencer, and P. Domingos. “Mining time-changing data streams” in KDD ’01, 2001.

[63] P. Domingos and G. Hulten. “Mining high-speed data streams” In KDD ’00, 2000.

[64] G. Stiglic, N. Khan, M. Verlic, and P. Kokol. Gene expression analysis of Leukemia samples using

visual interpretation of small ensembles: a case study. In Proceedings of the 2nd IAPR international

conference on Pattern recognition in bioinformatics, 189-197, 2007.

[65] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press, 1986.

[66] W. Craven, J. Shavlik. Constructive Induction in Knowledge-Based Neural Networks Machine

learning: proceedings of the Eighth International Workshop (ML91) 8, 213, 1991

[67] J. Wejchert and G. Tesauro. Neural network visualization. In Advances in neural information

processing systems 2, David S. Touretzky (Ed.). Morgan Kaufmann Publishers Inc., San Francisco,

CA, USA 465-472, 1990.

[68] L. Pratt and S. Nicodemus. HYPERPLANE ANIMATOR: Graphical display of backpropagation

training data and weights. Department of Mathematical and Computer Sciences Colorado School of

Mines 402 Stratton Golden, CO 80401

[69] Zeiler, M. and Fergus, R. Visualizing and understanding convolutional neural networks. arXiv

preprint, arXiv:1311.2901, 2013

[70] Bruckner, D., Rosen, J. and Sparks, E. Deepviz: Visualizing convolutional neural networks for

image classification, 2013. URL http://vis.berkeley.edu/courses/cs294-10-fa13/wiki/images/f/fd/

DeepVizPaper.pdf.

[71] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. ECCV, volume

8689 of Lecture Notes in Computer Science, pages 818–833. Springer, 2014.

[72] J. Yosinski, J. Clune, A. M. Nguyen, T. Fuchs, and H. Lipson, “Understanding neural networks

through deep visualization,” CoRR, vol. abs/1506.06579, 2015.

[73] W. Samek, A. Binder, G. Montavon, S. Bach, and K. Mueller. Evaluating the visualization of what a

deep neural network has learned. CoRR, abs/1509.06321, 2015.

[74] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising

image classification models and saliency maps,” in ICLR Workshop, 2014.


[75] P. M. Rasmussen, T. Schmah, K. H. Madsen, T. E. Lund, G. Yourganov, S. C. Strother, and L. K.

Hansen, “Visualization of nonlinear classification models in neuroimaging - signed sensitivity

maps,” in BIOSIGNALS, pp. 254–263, 2012.

[76] P. Rheingans and M. desJardins Visualizing High-Dimensional Predictive Model Quality. Proc. VIS

2000, pp:493-496.

[77] A. Kapoor, B. Lee, D. Tan, and E. Horvitz. Interactive optimization for steering machine

classification. In Proceedings of the International Conference on Human Factors in Computing

Systems (CHI), pages 1343–1352. ACM, 2010.

[78] B. Alsallakh, A. Hanbury, H. Hauser, S. Miksch, and A. Rauber. "Visual Methods for Analyzing Probabilistic Classification Data", IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 1703–1712, 2014.

[79] T. Iwata, K. Saito, N. Ueda, S. Stromsten, T. L. Griffiths, and J. B. Tenenbaum. Parametric

embedding for class visualization. Neural Computation, 19(9):2536–2556, 2007.

[80] C. Seifert and E. Lex. A novel visualization approach for data-mining related classification. In the

13th International Conference on Information Visualisation , pages 490–495, 2009.

[81] M. Godec, C. Leistner, A. Saffari, and H. Bischof. Online Random Naive Bayes for tracking. In Int’l

Conf. on Pattern Recognition, pages 3545–3548, 2010.

[82] T. Minka, P., Xiang, R. and Qi, Y. Virtual vector machine for Bayesian online classification. The

25th Conference On Uncertainty In Artificial Intelligence, 2009.

[83] J. Kittler, M. Hatef, R. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern

Analysis and Machine Intelligence 20 (3) (1998) 226–239.

[84] L. Kuncheva, Classifier ensembles for changing environments, in: Proceedings of the Fifth

International Workshop on Multiple Classifier Systems, 2004, pp. 1–15.

[85] J. Talbot, B. Lee, A. Kapoor, and D. S. Tan. EnsembleMatrix: interactive visualization to support machine learning with multiple classifiers. ACM Conference on Human Factors in Computing Systems (CHI), Boston, MA, 2009, pp. 1283–1292.

[86] Urbanek, S. Exploring Statistical Forests. Proc. of the 2002 Joint Statistical Meeting, Springer

(2002).

[87] A. Bouchachia, E. Balaguer-Ballester. DELA: A Dynamic Online Ensemble Learning Algorithm. In

the 22th European Symposium on Artificial Neural Networks, Bruges, 2014

[88] A.Bouchachia. Incremental learning with multi-level adaptation. Neurocomputing 74(11): 1785-

1799 (2011)

[89] J. Kolter and M. Maloof, Dynamic weighted majority: a new ensemble method for tracking concept

drift, in: Proceedings of the Third International Conference on Data Mining ICDM’03, IEEE CS

Press, 2003, pp. 123–130

[90] W. Street, Y. Kim, A streaming ensemble algorithm (sea) for large-scale classification, in:

Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining

KDDM’01, 2001, pp. 377–382.


[91] S. Wang, Y. Yang, J. Chang, F. Lin. Using Penalized Regression with Parallel Coordinates for

Visualization of Significance in High Dimensional Data. International Journal of Advanced

Computer Science and Applications, Vol. 4, No. 10, pp:32-28, 2013

[92] S. Vijayakumar, S. Schaal, Locally weighted projection regression: An O(n) algorithm for

incremental real time learning in high dimensional space, in: Proceedings of the 17th International

Conference on Machine Learning, ICML'00, 2000.

[93] S. Vijayakumar, A. D'Souza and S. Schaal. Incremental Online Learning in High Dimensions.

Neural Computation, vol. 17, no. 12, pp. 2602-2634 (2005)

[94] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, volume 13, 2001.

[95] J. Ma, J. Theiler, and S. Perkins. Accurate Online Support Vector Regression. Neural Computation

15(11):2683-703 · November 2003

[96] P. Breheny and W. Burchett. Visualization of regression models using Visreg. Internal report,

University of Kentucky, 2012.

[97] M. Mitchell. Interpreting and Visualizing Regression Models Using Stata. Stata Press, 2012

[98] D. Maniyar, and I. Nabney. Data visualization with simultaneous feature selection. Proceedings of

the 2006 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational

Biology, CIBCB'06. 2006. p. 156-163.

[99] N. Gianniotis and C. Riggelsen, Visualisation of high-dimensional data using an ensemble of neural

networks., IEEE Symposium on Computational Intelligence and Ensemble Learning (CIEL),

Singapore, 2013, pp. 17-24.