BigData Stream Mining

Albert Bifet, André C. P. L. F. de Carvalho, João Gama (andre@icmc.usp.br)


Main Topics

- Learning from Data Streams
  - Motivation
  - Big Data Streams
  - Novelty Detection
  - Clustering Learning
  - Predictive Learning
  - Frequent Pattern Mining
- Counting Algorithms
  - Frequent Items
- Tools and Applications

Motivation

- Traditional datasets
- Data streams
- Novelty detection
- Algorithms
- Examples
- Challenges

Data Mining

- DM techniques were developed for, and are usually applied to, static datasets
  - All the data are available at once
  - A machine learning algorithm induces a static decision model
  - Small to medium datasets

Data production is changing

- Previous practice
  - Few companies generate data
  - All the rest consume data
- Current practice
  - Everybody produces data
  - Everybody consumes data

Data explosion

- Machines are continuously collecting data
  - And sending them to other machines

- Everybody is a movie maker
  - And wants a big audience
- Everybody has great taste in videos
  - And shares what they like
- Everybody is being watched
  - Everywhere, all the time

Data Mining

- Real-life problems are dynamic
- Data are generated continuously and at high speed
- Medium to large size
- Data streams
- New techniques, and modifications of existing techniques, are needed

Data never sleeps

Figures: "Data never sleeps" infographics (https://www.domo.com); a live air-traffic snapshot (http://www.flightradar24.com); the same view on Day 1 (afternoon) and Day 2 (morning).

Real data from smartphones

- Portugal: http://www.publico.pt/ciencia/noticia/telemoveis-fornecem-quase-em-tempo-real-mapas-da-densidade-populacional-portuguesa-1677020
- Population dynamics between the main holiday period (July and August) and working periods in France. Credit: Catherine Linard, http://phys.org/news/2014-10-cellphone-population-density.html

Real-time taxi demand prediction

- For each taxi in Porto, predict passenger demand
  - 30-minute horizon
- ECML/PKDD data science challenge

Data sources

- Walmart
  - Data center occupies 11,000 m²
  - More than 1 million transactions per hour
  - Processes 40 petabytes per day
  - More than 2,000 times the content of all books in the U.S. Library of Congress
    - The world's largest library in space and number of items (> 155 million)

Data sources

- YouTube
  - More than 1 billion users
  - Each day, billions of accesses and hundreds of millions of hours watched
  - The number of hours each person watches per month grows 50% each year
  - 300 hours of video uploaded each minute

Big Data relevance

Figure: the V's of Big Data (volume, variety, velocity, value): http://hadoopadmin.com/big-data-hadoop-what-it-is-why-it-matters/sas-volume-variety-verlocity-value/

Mismanaged data cost

Figure: the cost of mismanaged data.

A World in Movement

- The new characteristics of data:
  - Time and space:
    - The objects of analysis exist in time and space
    - Often they are able to move
  - Dynamic environment:
    - The objects exist in a dynamic and evolving environment
  - Information processing capability:
    - The objects have limited information processing capabilities

A World in Movement

- The new characteristics of data:
  - Locality:
    - The objects know only their local spatio-temporal environment
  - Distributed environment:
    - Objects are able to exchange information with other objects
- Main goal:
  - Real-time analysis:
    - Decision models must evolve in correspondence with the evolving environment

Challenges of Real-Time Stream Mining

- These characteristics imply:
  - A switch from one-shot learning to continuously learning dynamic models that evolve over time
  - In the perspective induced by ubiquitous environments, finite training sets, static models, and stationary distributions will have to be rethought from scratch
- The algorithms will have to use limited computational resources
  - In terms of processing, memory, and communication time

Time Series vs. Data Streams

Usual features of data streams (DS) and time series (TS):

                         DS               TS
Task                     Classification   Regression
Data generation          Asynchronous     Synchronous
Labelled observations?   No               Yes
Sequence dependence?     No               Yes

Time series sources

- Stock market
- Currency values
- Energy demand and consumption
- Hydro-electrical energy generation
- Weather forecasting

Data streams: main features

- Data arrive sequentially and, usually:
  - At high speed
  - In dynamic, time-changing environments
  - Without control over the arrival order
  - With different intervals between arrivals
- Streams usually have unbounded size
- The data distribution may change over time
- Arriving objects are unlabelled

Data stream solution requirements

- Data must be accessed only once
- Data cannot be stored in memory
  - After being processed, an object is discarded
- The decision model must be continuously updated
  - It must be able to detect novelties (novelty detection)
  - Model updates must be fast (concept drift)

Incremental Learning

- DS mining can use incremental learning algorithms
  - The model is adapted as new examples become available (see the sketch below)
  - Training never stops
- Alternative: wait and train again on the expanded training set (retraining)
  - Ignores the previous model
- Several incremental learning algorithms exist
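To make the incremental setting concrete, here is a minimal sketch using scikit-learn's partial_fit API on a synthetic stream; scikit-learn and the toy concept are assumptions of this sketch, not part of the original slides.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
model = SGDClassifier()                          # linear model trained by SGD
classes = np.array([0, 1])                       # all classes must be declared up front

for t in range(100):                             # 100 mini-batches arriving from the "stream"
    X = rng.normal(size=(32, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)      # a simple hidden concept
    model.partial_fit(X, y, classes=classes)     # adapt the model: training never stops

print(model.predict(rng.normal(size=(3, 5))))    # the model can predict at any time
```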

Novelty Detection

- The ability to identify new or unknown situations
  - Usually a classification task
- Novelty, anomaly, and outlier detection
  - Different definitions in statistics and machine learning
  - Find patterns that differ from the normal, usual patterns

Anomalies and Outliers

- Anomaly
  - A few unexpected examples that do not represent a new concept
  - An exception to what is known
- Novelty
  - A cohesive and representative group of examples representing a new concept
  - The decision model must be adapted to incorporate it
- Outlier
  - An abnormality or noise

Novelty Detection modalities

- Concept evolution
  - A new concept (class) emerges in the stream
- Concept drift
  - A change in the profile (data distribution) of an existing concept (class)
- Recurring concepts
  - Concepts that appeared in the past and disappeared may occur again in the future

Feature Drift

Figure: the same Variable 1 × Variable 2 space at time n and at time n + m, illustrating feature drift.

Adapted from Albert Bifet, Joao Gama, Ricard Gavalda, Georg Krempl, Mykola Pechenizkiy, Bernhard Pfahringer, Myra Spiliopoulou, Indre Zliobaite, Advanced Topics in Data Stream Mining, ECML PKDD 2012

Concept Drift

Figure: offline model on the first data (time n) vs. online model on new data (time n + m); as the distribution changes, the initial model is replaced by a new model.

Adapted from Albert Bifet, Joao Gama, Ricard Gavalda, Georg Krempl, Mykola Pechenizkiy, Bernhard Pfahringer, Myra Spiliopoulou, Indre Zliobaite, Advanced Topics in Data Stream Mining, ECML PKDD 2012

Concept Evolution

Figure: offline model on the first data (time n) vs. online model on new data (time n + m); a new class appears in the stream and a new model is learned for it.

Adapted from Albert Bifet, Joao Gama, Ricard Gavalda, Georg Krempl, Mykola Pechenizkiy, Bernhard Pfahringer, Myra Spiliopoulou, Indre Zliobaite, Advanced Topics in Data Stream Mining, ECML PKDD 2012

Concept Re-occurrence

Figure: a concept seen at time n disappears and re-appears at time n + m + k + l; the initial model becomes useful again.

Adapted from Albert Bifet, Joao Gama, Ricard Gavalda, Georg Krempl, Mykola Pechenizkiy, Bernhard Pfahringer, Myra Spiliopoulou, Indre Zliobaite, Advanced Topics in Data Stream Mining, ECML PKDD 2012

Profiles of changes over time

Figure: mean of the data over time under different change profiles: abrupt change, incremental change, gradual change, outliers, reoccurring concepts, and multiple streams.

Adapted from Albert Bifet, Joao Gama, Ricard Gavalda, Georg Krempl, Mykola Pechenizkiy, Bernhard Pfahringer, Myra Spiliopoulou, Indre Zliobaite, Advanced Topics in Data Stream Mining, ECML PKDD 2012

Forgetting mechanisms

- Data may become outdated and no longer useful
  - Outdated data should be discarded
- Several mechanisms exist; the choice depends on:
  - How we expect the changes in the data distribution to occur
  - The trade-off between reactivity and robustness to noise
    - Faster reactivity ⇒ more abrupt forgetting ⇒ higher risk of reacting to noise

Forgetting mechanisms

- Forgetting can be:
  - Abrupt (crisp forgetting)
    - At each time step, an observation is either kept in or removed from a learning window
  - Gradual (soft forgetting; see the sketch below)
    - All observations are kept in full memory
    - Observations are weighted to reflect their age (relevance)
    - The importance of an observation in the training set should decrease as it ages
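A minimal sketch of gradual forgetting with a fading factor α: statistics are exponentially down-weighted with age instead of being dropped from a window. The FadingMean class and the α values are illustrative choices, not from the slides.

```python
class FadingMean:
    """Recency-weighted mean via a fading factor alpha (close to 1 => slow forgetting)."""
    def __init__(self, alpha=0.99):
        self.alpha = alpha
        self.s = 0.0                  # faded sum of values
        self.n = 0.0                  # faded count

    def update(self, x):
        self.s = x + self.alpha * self.s
        self.n = 1.0 + self.alpha * self.n
        return self.s / self.n        # current, recency-weighted mean

fm = FadingMean(alpha=0.95)
for x in [1.0] * 50 + [10.0] * 50:    # the stream's mean shifts abruptly
    m = fm.update(x)
print(round(m, 2))                    # close to 10: the old regime was forgotten
```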

Open Set Recognition

- We do not know all the classes during training
  - Only the classes in the training set are known
  - Unknown classes can appear in the test set
- This does not mean that the unknown classes did not exist when the training set was obtained
  - Data come in a stream
  - The data distribution has changed

Data Science for Social Good

- Non-profit movements to bring social benefits to people and communities
  - Some of them adopted by companies
- How does it occur?
  - Meetings
  - Events
  - Academic internships
  - Social networks
- Current trend: data stream mining for social good

Data Science for Social Good

- Existing approaches
  - Using (open) data to solve civic problems
    - Usually aim at developing web/mobile apps
  - Using data science techniques to solve social problems
    - Mainly aim at insights from data scientists
- Data democratization
  - Allow anyone access to data
- The first U.S. Chief Data Scientist was named
  - Precision medicine, open data, data-driven decisions

Data Science for Social Good

- Different forms of engagement
  - Challenges and competitions
    - Predictive data analytics for preventing fires: http://ibmhadoop.devpost.com/
  - University internships
  - Volunteering
  - Part-time jobs
  - Full-time jobs

http://www.kdnuggets.com/2014/07/data-for-good-data-driven-projects-social-good.html

Data Science for Social Good

- Bring social benefits to people and communities
  - Good health care for all
  - Economic development of poor countries
  - Good education for all
  - Clean and cheap energy
  - Citizenship
  - Environmental protection
  - Better and cleaner transport

Data Science for Social Good: Education

- Monitor student performance
- Support the development of better teaching platforms
  - Dynamically adapted to students' performance and needs
- Evaluate teachers and schools
  - Replicate good experiences
  - Act before it is too late

Data Science for Social Good: Finance

- Improve the financial health of communities
- Support small business
- Direct social initiatives
- Fraud detection in the use of public resources

Data Science for Social Good: Environment

- Reduce global warming
- Decrease deforestation
- Reduce the effects of droughts
- Predict natural disasters
- Detect invasive species
- Increase species diversity

Data Science for Social Good: Health care

- Monitor patient status in intensive care units
- Accelerate medical research and make it cheaper
  - Look at millions of patient records arriving in streams
- Discover epidemics
- Elderly fall prevention

Data Science for Social Good: Relevant links

- Data Science for Social Good Fellowship
- DataLook
- civisanalytics.com
- digitalhumanitarians.com
- www.data4good.co
- http://www.meetup.com/DataKind-UK

Big Data Stream Mining

Albert Bifet, André Carvalho, João Gama (jgama@fep.up.pt)
LIAAD-INESC TEC, University of Porto, Portugal

Outline

- Learning from Data Streams
- Powerful Ideas
- Clustering Learning
- Predictive Learning
- Novelty Detection
- Frequent Pattern Mining

Data Streams

Data streams: a continuous flow of data generated at high speed in dynamic, time-changing environments. We need to maintain decision models in real time. Decision models must be capable of:

- incorporating new information at the speed the data arrive;
- detecting changes and adapting the decision models to the most recent information;
- forgetting outdated information.

Unbounded training sets, dynamic models.

Data Stream Processing

1. One example at a time, used at most once
2. Limited memory
3. Limited time
4. Anytime prediction

Approximate Algorithms

Powerful ideas

- Summarization: compact and fast summaries that store sufficient statistics
- Approximation: how much information do we need to learn, with high probability, a hypothesis Ĥ that is within a small error of the true hypothesis H?
  Pr(|Ĥ − H| < ε|H|) > 1 − δ
- Estimation: useful for change detection

Adaptive Learning Algorithms

A Survey on Concept Drift Adaptation, Gama, Zliobaite, Bifet et al., ACM CSUR 2014

Clustering Data Streams

- New requirements in stream clustering:
  - Generate high-quality clusters in one scan
  - High-quality, efficient incremental clustering
  - Analysis at different time granularities
  - Tracking the evolution of clusters
- Clustering: a stream data reduction technique

Cluster Feature Vector

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny, 1996

Cluster Feature Vector: CF = (N, LS, SS)

- N: number of data points
- LS = Σ_{i=1}^{N} x_i (linear sum)
- SS = Σ_{i=1}^{N} x_i² (sum of squares)

Constant space irrespective of the number of examples!

Micro clusters

The sufficient statistics of a cluster A are CF_A = (N, LS, SS):

- N, the number of data objects
- LS, the linear sum of the data objects
- SS, the sum of the squared data objects

Properties:

- Centroid = LS / N
- Radius = √(SS/N − (LS/N)²)
- Diameter = √((2·N·SS − 2·LS²) / (N·(N−1)))

Micro clusters

Given the sufficient statistics of a cluster A, CF_A = (N_A, LS_A, SS_A), the updates are:

- Incremental: a point x is added to the cluster:
  LS_A ← LS_A + x;  SS_A ← SS_A + x²;  N_A ← N_A + 1
- Additive: merging clusters A and B:
  LS_C ← LS_A + LS_B;  SS_C ← SS_A + SS_B;  N_C ← N_A + N_B
- Subtractive:
  CF(C1 − C2) = CF(C1) − CF(C2)

CluStream

CluStream: A Framework for Clustering Evolving Data Streams, Aggarwal, J. Han, J. Wang, P. Yu (VLDB 2003)

- Divides the clustering process into online and offline components
  - Online: periodically stores summary statistics about the stream data
    - Micro-clustering: better quality than k-means
    - Incremental, online processing and maintenance
  - Offline: answers various user queries based on the stored summary statistics
- Tilted time framework: registers dynamic changes
- With limited overhead, achieves high efficiency, scalability, quality of results, and power of evolution/change detection

CluStream: Online Phase

Input: maximum micro-cluster diameter D_max

For each x in the stream (see the sketch below):
- Find the nearest micro-cluster M_i
- If the diameter of (M_i ∪ {x}) < D_max:
  - Assign x to that micro-cluster: M_i ← M_i ∪ {x}
- Else:
  - Start a new micro-cluster based on x

Pyramidal Time Frame

- The micro-clusters are stored as snapshots
- The snapshots follow a pyramidal pattern
- The micro-clusters might be aggregated using tilted histograms

Anytime Stream Clustering

The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining, Kranen, Assent, Baldauf, Seidl, KAIS 2011

Properties of anytime algorithms:

- Deliver a model at any time
- Improve the model if more time is available
- Model adaptation whenever an instance arrives
- Model refinement whenever time permits
- An online component learns micro-clusters
  - Any variety of online components can be utilized
  - Micro-clusters are subject to exponential aging

Clustering Evaluation

An Effective Evaluation Measure for Clustering on Evolving Data Streams; Kremer, Kranen, Jansen, Seidl, Bifet, Holmes, Pfahringer, KDD 2011

- Clusters may appear, fade, move, or merge
  - Missed points (unassigned)
  - Misplaced points (assigned to a different cluster)
  - Noise
- Cluster Mapping Measure (CMM)
  - External (uses the ground truth)
  - Normalized sum of penalties for these errors

Cluster Evolution

Analysis:

- Find the cluster structure in the current window
- Find the cluster structure over time ranges, with granularity confined by the specification of window size and boundary
- Put different weights on different windows to mine various kinds of weighted cluster structures
- Mine the evolution of cluster structures based on the changes of their occurrences in a sequence of windows

Bibliography: Clustering data streams

- BIRCH: An Efficient Data Clustering Method for Very Large Databases. Zhang, T., Ramakrishnan, R., Livny, M. ACM SIGMOD 1996
- Clustering Data Streams: Theory and Practice. Guha, S., Meyerson, A., Mishra, N., Motwani, R., O'Callaghan, L. IEEE TKDE 2003
- CluStream: A Framework for Clustering Evolving Data Streams. Aggarwal, J. Han, J. Wang, P. Yu. VLDB 2003
- MONIC: Modeling and Monitoring Cluster Transitions. Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., Schult, R. ACM SIGKDD 2006
- The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining. Kranen, P., Assent, I., Baldauf, C., Seidl, T. KAIS 2011
- An Effective Evaluation Measure for Clustering on Evolving Data Streams. Kremer, Kranen, Jansen, Seidl, Bifet, Holmes, Pfahringer. KDD 2011
- Data Stream Clustering: A Survey. Silva, J. A., Faria, E., Barros, R., Hruschka, E., Carvalho, A., Gama, J. ACM Computing Surveys, 2013

Learning Decision Trees

The basic idea:

- Which attribute to choose at each splitting node?
- A small sample can often be enough to choose the optimal splitting attribute
  - Collect sufficient statistics from a small set of examples
  - Estimate the merit of each attribute

How large should the sample be?

- The wrong idea: a fixed size, defined a priori without looking at the data
- The right idea: choose the sample size that allows us to differentiate between the alternatives

Very Fast Decision Trees

Mining High-Speed Data Streams, P. Domingos, G. Hulten; KDD 2000

The basic idea: a small sample can often be enough to choose the optimal splitting attribute.

- Collect sufficient statistics from a small set of examples
- Estimate the merit of each attribute
- Use the Hoeffding bound to guarantee that the best attribute is really the best
  - Statistical evidence that it is better than the second best

Very Fast Decision Trees: Main Algorithm

- Input: δ, the desired probability level
- Output: T, a decision tree
- Init: T ← empty leaf (root)
- While (TRUE):
  - Read the next example
  - Propagate the example through the tree, from the root to a leaf
  - Update the sufficient statistics at the leaf
  - If leaf(#examples) > N_min:
    - Evaluate the merit of each attribute
    - Let A1 be the best attribute and A2 the second best
    - Let ε = √(R² ln(1/δ) / (2n))
    - If G(A1) − G(A2) > ε (this test is sketched in code below):
      - Install a splitting test based on A1
      - Expand the tree with two descendant leaves

VFDT
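The split test can be instantiated directly from the bound ε = √(R² ln(1/δ) / (2n)); a small sketch follows, where the merit values and parameters are made up for illustration.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta
    the observed mean of n samples is within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, value_range, delta, n):
    # Split when the merit gap exceeds epsilon: with probability 1 - delta,
    # the best attribute observed so far is truly the best one.
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)

# For information gain with 2 classes the range R is 1 (= log2 of #classes).
print(should_split(0.42, 0.30, value_range=1.0, delta=1e-6, n=300))
```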

Concept-adapting VFDT

G. Hulten, L. Spencer, P. Domingos: Mining Time-Changing Data Streams, KDD 2001

- Model consistent with a sliding window over the stream
- Keeps sufficient statistics also at internal nodes
- Periodically rechecks whether splits pass the Hoeffding test
  - If a test fails, grow an alternate subtree and swap it in when the accuracy of the alternate is better
- Processing updates in O(1) time, plus O(W) memory
  - Increase counters for the incoming instance, decrease counters for the instance leaving the window

Hoeffding Adaptive Tree

A. Bifet, R. Gavaldà: Adaptive Parameter-free Learning from Evolving Data Streams, IDA 2009

- Replaces frequency counters by estimators
  - No need for a window of examples
  - Sufficient statistics kept by estimators separately
- Parameter-free change detector + estimator with theoretical guarantees for subtree swap (ADWIN)
- Keeps a sliding window consistent with the no-change hypothesis

Hoeffding Algorithms

- Classification: Mining High-Speed Data Streams, P. Domingos, G. Hulten, KDD 2000
- Regression: Learning Model Trees from Evolving Data Streams; Ikonomovska, Gama, Dzeroski; Data Min. Knowl. Discov. 2011
- Rules: Learning Decision Rules from Data Streams, J. Gama, P. Kosina; IJCAI 2011
- Clustering: Hierarchical Clustering of Time-Series Data Streams. Rodrigues, Gama, IEEE TKDE 20(5): 615-627 (2008)
- Multiple models: Ensembles of Restricted Hoeffding Trees. Bifet, Frank, Holmes, Pfahringer; ACM TIST 2012. J. Duarte, J. Gama, Ensembles of Adaptive Model Rules from High-Speed Data Streams. BigMine 2014
- ...

Option Trees

Speeding-Up Hoeffding-Based Regression Trees With Options, Ikonomovska et al., ICML 2011

Use option nodes to resolve ties.

Rules

Problem: very large decision trees have a context that is complex and hard to understand.

- Rules: self-contained, modular, easier to interpret, no need to cover the whole universe
- Each rule keeps a learner L with sufficient statistics to:
  - make predictions
  - expand the rule
  - detect changes and anomalies

Adaptive Model Rules

Adaptive Model Rules from Data Streams, Almeida, Ferreira, Gama; ECML/PKDD 2013

- Ruleset: an ensemble of rules
- Rule prediction: mean or a linear model
- Ruleset prediction:
  - Ordered: only the first rule covers the instance
  - Unordered: weighted average of the predictions of the rules covering instance x
    - Weights inversely proportional to the error

AMRules Induction

- Rule creation: default rule expansion
- Rule expansion: split on the attribute maximizing the σ reduction
  - Hoeffding bound: ε = √(R² ln(1/δ) / (2n))
  - Expand when σ_1st / σ_2nd < 1 − ε
- Evict a rule when the Page-Hinkley (P-H) test signals an alarm (a sketch of P-H follows)
- Detect and explain local anomalies

Clustering Time Series

Hierarchical Clustering of Time-Series Data Streams. Rodrigues, Gama, TKDE 2008

Uses the Pearson correlation as the splitting criterion.

Hoeffding Algorithms: Analysis

The number of examples required to expand a node depends only on the Hoeffding bound: ε shrinks as 1/√n.

- Low-variance models: stable decisions with statistical support
- Low overfitting: examples are processed only once
- No need for pruning: decisions have statistical support
- Convergence: a Hoeffding algorithm becomes asymptotically close to a batch learner. The expected disagreement is δ/p, where p is the probability that an example falls into a leaf.

Bibliography on Predictive Learning

- Mining High-Speed Data Streams, Domingos, Hulten, SIGKDD 2000
- Mining Time-Changing Data Streams, Hulten, Spencer, Domingos, KDD 2001
- Efficient Decision Tree Construction on Streaming Data, R. Jin, G. Agrawal, SIGKDD 2003
- Accurate Decision Trees for Mining High-Speed Data Streams, J. Gama, R. Rocha, P. Medas, SIGKDD 2003
- Forest Trees for On-line Data; J. Gama, P. Medas, R. Rocha; SAC 2004
- Learning Decision Trees from Dynamic Data Streams, Gama, Medas, Rodrigues; SAC 2005
- Decision Trees for Mining Data Streams, Gama, Fernandes, Rocha, Intelligent Data Analysis, Vol. 10, 2006
- Handling Time-Changing Data with Adaptive Very Fast Decision Rules, Kosina, Gama; ECML-PKDD 2012
- Learning Model Trees from Evolving Data Streams, Ikonomovska, Gama, Dzeroski; Data Min. Knowl. Discov. 2011

Definition

- Novelty detection refers to the automatic identification of unforeseen phenomena embedded in a large amount of normal data.
- Novelty is a relative concept with regard to our current knowledge:
  - It must be defined in the context of a representation of our current knowledge.
- Especially useful when novel concepts represent abnormal or unexpected conditions
  - It is expensive to obtain abnormal examples
  - It is probably impossible to simulate all possible abnormal conditions

Context

- In real problems, as time goes by:
  - The distribution of known concepts may change
  - New concepts may appear
- By monitoring the data stream, emerging concepts may be discovered
- Emerging concepts may represent:
  - An extension to a known concept (extension)
  - A novel concept (novelty)
- Several interesting applications: early detection of faults in jet engines, intrusion detection in computer networks, breaking news in a flow of text documents (news articles), bursts of gamma rays (astronomical data), ...

One-Class Classification: Autoassociator Networks

Concept-Learning in the Absence of Counter-Examples: An Autoassociation-Based Approach, Nathalie Japkowicz, 1999

- A three-layer network
- The number of neurons in the output layer is equal to that of the input layer
- The network is trained so that the output y⃗ reproduces the input x⃗ at the output layer

Autoassociator Networks

To classify a test example x⃗ (see the sketch below):

- Propagate x⃗ through the network and let y⃗ be the corresponding output
- If Σ_{i=1}^{k} (x_i − y_i)² < threshold, the example is considered to be from the normal class
- Otherwise, x⃗ is a counter-example of the normal class
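A hedged sketch of this reconstruction-error test: scikit-learn's MLPRegressor, trained to reproduce its input, stands in for the three-layer autoassociator, and the data and threshold choice are invented for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8))
X_normal = rng.normal(size=(500, 2)) @ W           # "normal" class: 2 latent factors

net = MLPRegressor(hidden_layer_sizes=(3,), max_iter=3000, random_state=0)
net.fit(X_normal, X_normal)                        # target equals input (autoassociation)

def reconstruction_error(X):
    return ((net.predict(X) - X) ** 2).sum(axis=1) # sum_i (x_i - y_i)^2

threshold = np.percentile(reconstruction_error(X_normal), 99)
X_new = rng.normal(size=(10, 8)) + 5.0             # shifted data: should be flagged
print(reconstruction_error(X_new) > threshold)     # mostly True => not normal
```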

Novelty detection

- Training set (offline phase)
  - D_tr = {(X_1, y_1), (X_2, y_2), ..., (X_m, y_m)}
  - X_i: vector of input attributes of the i-th example; y_i: target attribute
  - y_i ∈ Y_tr, where Y_tr = {c_1, c_2, ..., c_L}
- When new data arrive (online phase)
  - Given a sequence of unlabelled examples X_new
  - Goal: classify X_new in Y_all, where Y_all = {c_1, c_2, ..., c_L, ..., c_K} and K > L

Novelty Detection Systems

- ECSMiner: assumes that the class label of new examples becomes known
- OLINDDA: unsupervised, but restricted to binary classification problems
- MINAS (MultI-class learNing Algorithm for data Streams)
  - Does not use the class labels of new examples
  - Can deal with novelty detection in multi-class data stream problems

OLINDDA algorithm

OnLIne Novelty and Drift Detection Algorithm
Spinosa, Carvalho, Gama: OLINDDA: A Cluster-Based Approach for Detecting Novelty and Concept Drift in Data Streams, SAC 2007

- Offline and online phases
- Models: normal, extension, and novelty
- Each model is represented by a set of clusters
- Not suitable for multi-class problems

ECSMiner algorithm

Masud, Gao, Khan, Han, Thuraisingham: Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints, TKDE 2011

A supervised algorithm integrating novel concepts and concept drift:

- Ensemble of classifiers
- Creates a new model when all examples in a chunk are labeled
- Assumes that all examples in the stream will be labeled (after a delay of T_l time units)
- An instance is classified within T_c time units of its arrival

MINAS algorithm

MINAS: Multiclass Learning Algorithm for Novelty Detection in Data Streams, E. Faria, J. Gama, A. Carvalho, DAMI (to appear)

- An unsupervised algorithm for novelty detection in multi-class data stream problems
  - Represents each known class by a set of hyperspheres
- Uses offline (training) and online phases
  - In each phase, learns one or more classes
- A cohesive set of examples is necessary to learn new concepts or extensions
  - Isolated examples are not considered as novelty

MINAS: Offline phase

- Learns a decision model based on the known concepts of the problem
  - KMeans or CluStream
- Runs only once
- Each class is represented by a set of clusters (hyperspheres)

MINAS: Online phase

- Receives new examples from the stream
- Classifies each new example
  - In one of the known classes, or
  - As unknown
- Cohesive groups of unknown examples are used to detect new classes or extensions

Novelty Detection Bibliography

- Masud, Gao, Khan, Han, Thuraisingham: Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints, TKDE 2011
- Spinosa, Carvalho, Gama: OLINDDA: A Cluster-Based Approach for Detecting Novelty and Concept Drift in Data Streams, SAC 2007
- MINAS: Multiclass Learning Algorithm for Novelty Detection in Data Streams, E. Faria, J. Gama, A. Carvalho, DAMI (to appear)
- P. Angelov and X. Zhou: Evolving Fuzzy-Rule-Based Classifiers from Data Streams, Trans. Fuzzy Syst. 2008
- D. Tax and R. Duin: Growing a Multi-Class Classifier with a Reject Option, Pattern Recognit. Lett. 2008
- F. Denis, R. Gilleron, F. Letouzey: Learning from Positive and Unlabeled Examples, Theoretical Comput. Sci. 2005
- D. Cardoso and F. França: A Bounded Neural Network for Open Set Recognition, IJCNN 2015

Introduction

- Frequent pattern mining refers to finding patterns that occur more often than a pre-specified threshold value.
- Patterns refer to items, itemsets, or sequences.
- The threshold refers to the percentage of the pattern's occurrences relative to the total number of transactions. It is termed the support.

Introduction

- Finding frequent patterns is the first step for the discovery of association rules of the form A → B.
- The Apriori algorithm represents a pioneering work in association rule discovery: R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, VLDB 1994.
- An important step towards improving the performance of association rule discovery was FP-Growth: J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate Generation, SIGMOD 2000.

Introduction

- Many measures have been proposed for finding the strength of rules.
- The most frequently used measure is support.
  - The support Supp(X) of an itemset X is defined as the proportion of transactions in the data set that contain the itemset.
- Another frequently used measure is confidence.
  - Confidence refers to the probability that set B exists given that A already exists in a transaction.
  - Confidence(A → B) = Supp(A ∪ B) / Supp(A) (a worked example follows)
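A toy illustration of both measures over a made-up transaction set:

```python
# Minimal sketch: support and confidence computed over four toy transactions.
transactions = [
    {"milk", "bread"}, {"milk", "cheese"},
    {"milk", "bread", "cheese"}, {"bread"},
]

def supp(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    # conf(A -> B) = supp(A u B) / supp(A)
    return supp(a | b) / supp(a)

print(supp({"milk"}))                     # 0.75
print(confidence({"milk"}, {"cheese"}))   # 0.5 / 0.75 = 2/3
```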

Frequent Pattern Mining in Data Streams

The process of frequent pattern mining over data streams differs from the conventional one as follows:

- The technique should be linear or sublinear: you have only one look.
- Targets: heavy hitters, top-k, frequent items, and itemsets.

Frequent Items (Heavy Hitters) in Data Streams

Manku and Motwani have two master algorithms in this area:

- Sticky Sampling
- Lossy Counting

G. S. Manku and R. Motwani: Approximate Frequency Counts over Data Streams, Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Hong Kong, China, August 2002

Sticky Sampling

Sticky sampling is a probabilistic technique.

- The user inputs three parameters:
  - Minimum support (s)
  - Admissible error (ε)
  - Probability of failure (δ)
- A simple data structure is maintained with entries of data elements and their associated frequencies (e, f).
- The sampling rate decreases gradually with the increase in the number of processed data elements: t = (1/ε) · log(1/(s·δ))

Sticky Sampling

- For each incoming element in the data stream, the data structure is checked for an entry
  - If an entry exists, increment its frequency
  - Otherwise, sample the element with the current sampling rate
    - If selected, add a new entry; else the element is ignored
- With every change in sampling rate, an unbiased coin is tossed for each entry, decreasing its frequency with every unsuccessful coin toss
  - If the frequency goes down to zero, the entry is released

Lossy Counting

- Lossy counting is a deterministic technique.
- The user inputs two parameters:
  - Minimum support (s)
  - Admissible error (ε)
- The data structure has entries of data elements and their associated frequencies (e, f, Δ), where Δ is the maximum possible error in f.
- The stream is conceptually divided into buckets of width w = 1/ε.
- Each bucket is labeled by a value ⌈N/w⌉, where N starts from 1 and increases by 1.

Lossy Counting

- For a new incoming element, the data structure is checked (see the sketch below):
  - If an entry exists, increment its frequency f
  - Otherwise, add a new entry with Δ = b_current − 1, where b_current is the current bucket label
- When switching to a new bucket, all entries with f + Δ ≤ b_current are deleted.
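A compact sketch of Lossy Counting as described above; the stream and the parameter values are illustrative.

```python
import math

def lossy_counting(stream, epsilon):
    """Entries are element -> (f, delta); entries with f + delta <= b_current
    are pruned at each bucket boundary."""
    w = math.ceil(1.0 / epsilon)          # bucket width
    counts, b_current, n = {}, 1, 0
    for e in stream:
        n += 1
        if e in counts:
            f, d = counts[e]; counts[e] = (f + 1, d)
        else:
            counts[e] = (1, b_current - 1)
        if n % w == 0:                    # end of bucket: prune low entries
            counts = {k: (f, d) for k, (f, d) in counts.items()
                      if f + d > b_current}
            b_current += 1
    return counts, n

counts, n = lossy_counting("abracadabra" * 100, epsilon=0.01)
s = 0.2                                   # report items with f >= (s - eps) * n
print(sorted(k for k, (f, d) in counts.items() if f >= (s - 0.01) * n))  # ['a']
```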

Error Analysis

Output:

- Elements with counter values exceeding s·N − ε·N

How much do we undercount?

- If the current size of the stream is N and the window size is 1/ε, then the frequency error is ≤ #windows = ε·N

Approximation guarantees:

- Frequencies are underestimated by at most ε·N
- No false negatives
- False positives have true frequency at least s·N − ε·N

How many counters do we need?

- Worst case: (1/ε)·log(ε·N) counters

Pattern mining: definitions

Patterns: sets with a subpattern relation ⊂

- {cheese, milk} ⊂ {milk, peanuts, cheese, butter}
- (search → buy) ⊂ (home → search → cart → buy → exit)

Applications: market basket analysis, intrusion detection, churn prediction, feature selection, XML query analysis, query and clickstream analysis, anomaly detection, ...

Pattern mining in streams: definitions

- The support of a pattern T in a stream S at time t is the probability that a pattern T′ drawn from S's distribution at time t is such that T ⊂ T′
- Typical task: given access to S, at all times t produce the set of patterns T with support at least ε at time t
- A pattern is closed if no superpattern has the same support
  - No information is lost if we focus only on closed patterns

Key data structure: the lattice of patterns, with counts

Fundamentals

- A priori property: t ⊆ t′ ⇒ support(t) ≥ support(t′)
- Closed: none of its supersets has the same support
  - Can generate all frequent itemsets and their support
- Maximal: none of its supersets is frequent
  - Can generate all frequent itemsets (without support)
- Maximal ⊆ Closed ⊆ Frequent ⊆ D

FP-Stream

C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: Mining Frequent Patterns in Data Streams at Multiple Time Granularities. NGDM 2003

- Multiple time granularities
- Based on FP-Growth (depth-first search over the itemset lattice)
- Pattern tree with tilted-time window
  - Tilted-time window: logarithmically aggregated time slots (log number of levels; aggregate when a level is full and push the aggregate one level up)
- Time-sensitive queries, emphasis on recent history
- High time and memory complexity

Moment

Y. Chi, H. Wang, P. Yu, R. Muntz: Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window. ICDM 2004

- Keeps track of the boundary below frequent itemsets
- Closed Enumeration Tree (CET) (≈ prefix tree)
  - Infrequent gateway nodes (infrequent)
  - Unpromising gateway nodes (infrequent, dominated)
  - Intermediate nodes (frequent, dominated)
  - Closed nodes (frequent)
- When adding/removing transactions, closed/infrequent nodes do not change

Itemset mining

- MOMENT (Chi et al. 2004): sliding window, frequent closed, exact
- CLOSTREAM (Yen et al. 2009): sliding window, all closed, exact
- MFI (Li et al. 2009): transaction-sensitive window, frequent closed, exact
- IncMine (Cheng et al. 2008): sliding window, frequent closed, approximate; faster for moderate approximation ratios

Sequence, tree, and graph mining

- Frequent subsequence mining: MILE (Chen et al. 2005), SMDS (Marascu-Masseglia 2006), SSBE (Koper-Nguyen 2011)
- Bifet et al. 2008: frequent closed unlabeled subtree mining
- Bifet et al. 2011: frequent closed labeled subtree mining; frequent closed labeled subgraph mining

Bibliography on Frequent Items

- What's Hot and What's Not: Tracking Most Frequent Items Dynamically, G. Cormode, S. Muthukrishnan, PODS 2003
- Dynamically Maintaining Frequent Items Over a Data Stream, C. Jin, W. Qian, C. Sha, J. Yu, A. Zhou; CIKM 2003
- Processing Frequent Itemset Discovery Queries by Division and Set Containment Join Operators, R. Rantzau, DMKD 2003
- Approximate Frequency Counts over Data Streams, G. Singh Manku, R. Motwani, VLDB 2002
- Finding Hierarchical Heavy Hitters in Data Streams, G. Cormode, F. Korn, S. Muthukrishnan, D. Srivastava, VLDB 2003
- J. Han, J. Pei, Y. Yin: Mining Frequent Patterns without Candidate Generation, SIGMOD 2000
- Metwally, D. Agrawal, A. Abbadi: Efficient Computation of Frequent and Top-k Elements in Data Streams, ICDT 2005
- Y. Chi, H. Wang, P. Yu, R. Muntz: Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window, ICDM 2004
- C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: Mining Frequent Patterns in Data Streams at Multiple Time Granularities, NGDM 2003

Outline

1. Evaluation
2. Non-Distributed Open Source Tools
3. Distributed Open Source Tools
4. Applications

Data stream classification cycle

1. Process an example at a time, and inspect it only once (at most)
2. Use a limited amount of memory
3. Work in a limited amount of time
4. Be ready to predict at any point

Evaluation

1. Error estimation: holdout or prequential
2. Evaluation performance measures: accuracy or the κ-statistic
3. Statistical significance validation: McNemar or Nemenyi test

Evaluation Framework

1. Error Estimation: Holdout Evaluation

Data available for testing:

- Hold out an independent test set
- Apply the current decision model to the test set at regular time intervals
- The loss estimated on the holdout set is an unbiased estimator

1. Error Estimation: Prequential or Interleaved Test-Then-Train

No data available for testing:

- The error of a model is computed from the sequence of examples
- For each example in the stream, the current model first makes a prediction; the example is then used to update the model (a minimal code sketch follows)
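A minimal prequential loop on a synthetic stream, reusing the scikit-learn sketch from the incremental-learning section; the stream and the concept are assumptions of the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
model = SGDClassifier()
correct = tested = 0

for t in range(5000):                            # one example at a time
    x = rng.normal(size=(1, 4))
    y = int(x[0, 0] > 0)                         # hidden concept
    if t > 0:                                    # 1) first test on the new example ...
        correct += int(model.predict(x)[0] == y)
        tested += 1
    model.partial_fit(x, [y], classes=[0, 1])    # 2) ... then use it for training

print(f"prequential accuracy: {correct / tested:.3f}")
```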

Holdout or Prequential?

- Holdout is more accurate, but needs data for testing
- Use prequential to approximate holdout
- Estimate accuracy using sliding windows or fading factors

2. Evaluation performance measures

                 Predicted Class+   Predicted Class-   Total
Correct Class+          75                  8            83
Correct Class-           7                 10            17
Total                   82                 18           100

Table: Simple confusion matrix example

Accuracy = (75 + 10) / 100 = 85%
Recall(Class+) = 75/83; Recall(Class-) = 10/17
Arithmetic mean = (75/83 + 10/17) / 2 = 74.59%
Geometric mean = √((75/83) × (10/17)) = 72.90%

2. Performance Measures with Unbalanced Classes

                 Predicted Class+   Predicted Class-   Total
Correct Class+          75                  8            83
Correct Class-           7                 10            17
Total                   82                 18           100

Table: Simple confusion matrix example

                 Predicted Class+   Predicted Class-   Total
Correct Class+        68.06              14.94           83
Correct Class-        13.94               3.06           17
Total                    82                 18          100

Table: Confusion matrix for a chance predictor

2. Performance Measures with Unbalanced Classes: Kappa Statistic

- p0: the classifier's prequential accuracy
- pc: the probability that a chance classifier makes a correct prediction

κ = (p0 − pc) / (1 − pc)

- κ = 1 if the classifier is always correct
- κ = 0 if its predictions coincide with the correct ones only as often as those of the chance classifier

Forgetting mechanism for estimating prequential κ: a sliding window of size w with the most recent observations. (A worked computation follows.)
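Computing κ for the confusion matrix of the previous slides, assuming numpy; for this table p0 = 0.85, pc ≈ 0.711, so κ ≈ 0.48.

```python
import numpy as np

cm = np.array([[75, 8],
               [7, 10]])                             # rows: true class, cols: predicted
n = cm.sum()
p0 = np.trace(cm) / n                                # observed accuracy: 0.85
pc = (cm.sum(axis=1) / n) @ (cm.sum(axis=0) / n)     # chance agreement from the marginals
kappa = (p0 - pc) / (1 - pc)
print(round(p0, 3), round(pc, 3), round(kappa, 3))   # 0.85 0.711 0.481
```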

Outline

1. Evaluation
2. Non-Distributed Open Source Tools
3. Distributed Open Source Tools
4. Applications

VFML

Very Fast Machine Learning

- Developed by Pedro Domingos and his team
- Contains the first implementation of the Hoeffding Tree
  - VFDT: Very Fast Decision Tree
  - CVFDT: Concept-adapting Very Fast Decision Tree
- Does not contain ensembles
- Implemented in C
- No longer maintained since 2003

VW

Vowpal Wabbit

- Developed by John Langford at Yahoo! Research and Microsoft Research
- Used in Microsoft Azure Machine Learning
- Single classifier until 2013
- Distributed using MPI
- Based on the hashing trick

Sofia-ML

- Developed by David Sculley at Google
- Good software design
- Contains:
  - Fast online learners
  - Fast k-means clustering

MOA: {M}assive {O}nline {A}nalysis (Bifet et al. 2010)

{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.

- It is closely related to WEKA
- It includes a collection of offline and online algorithms, as well as tools for evaluation:
  - classification, regression
  - clustering
  - frequent pattern mining
- Easy to extend
- Easy to design and run experiments

WEKA

Waikato Environment for Knowledge Analysis

- A collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java
- Released under the GPL
- Supports the whole process of experimental data mining:
  - Preparation of input data
  - Statistical evaluation of learning schemes
  - Visualization of input data and of the results of learning
- Used for education, research, and applications
- Complements "Data Mining" by Witten, Frank & Hall

MOA: the bird

The Moa (another native NZ bird) is not only flightless, like theWeka, but also extinct.

Classification Experimental Setting

Evaluation procedures for data streams:

- Holdout
- Interleaved Test-Then-Train or Prequential

Classification Experimental Setting: Data Sources

- Random Tree Generator
- Random RBF Generator
- LED Generator
- Waveform Generator
- Hyperplane
- SEA Generator
- STAGGER Generator

Classification Experimental Setting: Classifiers

- Naive Bayes
- Decision stumps
- Hoeffding Tree
- Hoeffding Option Tree
- Bagging and Boosting
- ADWIN Bagging and Leveraging Bagging

Clustering Experimental Setting

Internal measures                External measures
Gamma                            Rand statistic
C Index                          Jaccard coefficient
Point-Biserial                   Folkes and Mallow Index
Log Likelihood                   Hubert Γ statistics
Dunn's Index                     Minkowski score
Tau                              Purity
Tau A                            van Dongen criterion
Tau C                            V-measure
Somer's Gamma                    Completeness
Ratio of Repetition              Homogeneity
Modified Ratio of Repetition     Variation of information
Adjusted Ratio of Clustering     Mutual information
Fagan's Index                    Class-based entropy
Deviation Index                  Cluster-based entropy
Z-Score Index                    Precision
D Index                          Recall
Silhouette coefficient           F-measure

Table: Internal and external clustering evaluation measures.

Clustering Experimental Setting: Clusterers

- StreamKM++
- CluStream
- ClusTree
- Den-Stream
- D-Stream
- CobWeb

Web: http://www.moa.cms.waikato.ac.nz

Easy Design of a MOA classifier

- void resetLearningImpl()
- void trainOnInstanceImpl(Instance inst)
- double[] getVotesForInstance(Instance i)

Easy Design of a MOA clusterer

- void resetLearningImpl()
- void trainOnInstanceImpl(Instance inst)
- Clustering getClusteringResult()

Extensions of MOA

- Multi-label classification
- Active learning
- Regression
- Closed frequent graph mining
- Twitter sentiment analysis

streamDM C++: http://streamdm.noahlab.com.hk/

Outline

1. Evaluation
2. Non-Distributed Open Source Tools
3. Distributed Open Source Tools
4. Applications

streams Framework

- Developed by Christian Bockermann at the University of Dortmund
- Uses MOA for machine learning methods
- Integrates with Storm
- RapidMiner Streams plugin

Apache Mahout

- Scalable machine learning library
- The current version runs on Hadoop
- Some methods use streaming to scale
- New version in Scala, to run on Spark

Jubatus

- Developed by Nippon Telegraph and Telephone
- Open-source online machine learning and distributed computing framework
- Implemented in C++

Apache SAMOA (De Francisci Morales & Bifet 2015)

SAMOA is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms.

Architecture figure: the SAMOA algorithm and API layer sits on top of an SPE-adapter targeting different stream processing engines (S4, Storm, other SPEs, via samoa-S4, samoa-storm, samoa-other-SPEs) and an ML-adapter that plugs in MOA and other ML frameworks. The SAMOA ML Developer API is built from Processing Items, Processors, and Streams.

Web: http://samoa-project.net/

Apache Flink

streamDM

http://streamdm.noahlab.com.hk/

- New project specifically designed for Spark Streaming
  - Spark Streaming: latency in seconds
- Easy to integrate into Spark systems
- Designed in Scala
- Classification, regression, clustering, frequent pattern mining

Outline

1. Evaluation
2. Non-Distributed Open Source Tools
3. Distributed Open Source Tools
4. Applications

Twitter: A Massive Data Stream

- Web 2.0 micro-blogging service
- Built to discover what is happening at any moment in time, anywhere in the world
- 3 billion requests a day via its API

Twitter Streaming API

Twitter APIs:

- Streaming API
  - Real-time access to Tweets, in sampled or filtered form
  - HTTP-based: GET, POST, DELETE
- Two discrete REST APIs

Sentiment Analysis on Twitter

Sentiment analysis: classifying messages into two categories depending on whether they convey positive or negative feelings.

Emoticons are visual cues associated with emotional states, which can be used to define class labels for sentiment classification (see the sketch below).

Positive emoticons:  :)  :-)  : )  :D  =)
Negative emoticons:  :(  :-(  : (

Table: List of positive and negative emoticons.
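A toy sketch of using the emoticon table as a noisy labelling function; the example tweets are invented.

```python
# Emoticons as noisy class labels for tweet sentiment, per the table above.
POSITIVE = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE = (":(", ":-(", ": (")

def emoticon_label(tweet):
    if any(e in tweet for e in POSITIVE):
        return "positive"
    if any(e in tweet for e in NEGATIVE):
        return "negative"
    return None                      # unlabeled: not usable for training

print(emoticon_label("loving this tutorial :D"))   # positive
print(emoticon_label("my flight got delayed :("))  # negative
```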

Outline

- Final Comments
- Open Challenges

Open Challenges

- Structured input and output
- Multi-target, multi-task, and transfer learning
- Millions of classes
- Visualization
- Distributed streams
- Representation learning
- Ease of use

Lessons Learned

Learning from data streams:

- Learning is not one-shot: it is an evolving process
- We need to monitor the learning process
- This opens the possibility of reasoning about the learning process

Reasoning about the Learning Process

Intelligent systems must:

- be able to adapt continuously to changing environmental conditions and evolving user habits and needs;
- be capable of predictive self-diagnosis.

The development of such self-configuring, self-optimizing, and self-repairing systems is a major scientific and engineering challenge.