Download - MAIDS: Mining Alarming Incidents in Data Streams (PPT 156 KB)

MAIDS: Mining Alarming Incidents

in Data Streams

A discussion on the MAIDS project

May 20, 2003

04/13/23 Mining Dynamics in Data Streams 2

Outline

Characteristics of data streams

Mining dynamics in data streams

Multi-dimensional analysis of data streams

Stream query and stream cubing Querying, statistical summary, OLAP, regression, gradient

analysis, …

Mining frequent patterns in data streams

Clustering data streams

Classification of data streams

Planning for implementation and testing


Characteristics of Data Streams

Data Streams Data streams—continuous, ordered, changing, fast, huge amount

Traditional DBMS—data stored in finite, persistent data setsdata sets

Characteristics Huge volumes of continuous data, possibly infinite Fast changing and requires fast, real-time response Data stream captures nicely our data processing needs of today Random access is expensive—single linear scan algorithm (can

only have one look) Store only the summary of the data seen thus far Most stream data are at pretty low-level or multi-dimensional in

nature, needs multi-level and multi-dimensional processing


Stream Data Applications

Telecommunication calling records Business: credit card transaction flows Network monitoring and traffic engineering Financial market: stock exchange Engineering & industrial processes: power supply &

manufacturing Sensor, monitoring & surveillance: video streams Security monitoring Web logs and Web page click streams Massive data sets (even saved but random access is too

expensive)


Architecture: Stream Query Processing

Scratch SpaceScratch Space(Main memory and/or Disk)(Main memory and/or Disk)

User/ApplicationUser/ApplicationUser/ApplicationUser/Application

Continuous QueryContinuous Query

Stream QueryStream QueryProcessorProcessor

ResultsResults

Multiple streamsMultiple streams

SDMS (Stream Data Management System)


Challenges of Stream Data Processing

Multiple, continuous, rapid, time-varying, ordered streams Main memory computations Queries are often continuous

Evaluated continuously as stream data arrives Answer updated over time

Queries are often complex Beyond element-at-a-time processing Beyond stream-at-a-time processing Beyond relational queries (scientific, data mining, OLAP)

Multi-level/multi-dimensional processing and data mining Most stream data are at pretty low-level or multi-dimensional in

nature


Processing Stream Queries

Query types One-time query vs. continuous query (being evaluated

continuously as stream continues to arrive) Predefined query vs. ad-hoc query (issued on-line)

Unbounded memory requirements For real-time response, main memory algorithm should be used Memory requirement is unbounded if one will join future tuples

Approximate query answering With bounded memory, it is not always possible to produce exact

answers High-quality approximate answers are desired Data reduction and synopsis construction methods

Sketches, random sampling, histograms, wavelets, etc.


Projects on DSMS (Data Stream Management System)

Research projects and system prototypes STREAMSTREAM (Stanford): A general-purpose DSMS CougarCougar (Cornell): sensors AuroraAurora (Brown/MIT): sensor monitoring, dataflow Hancock Hancock (AT&T): telecom streams NiagaraNiagara (OGI/Wisconsin): Internet XML databases OpenCQOpenCQ (Georgia Tech): triggers, incr. view maintenance TapestryTapestry (Xerox): pub/sub content-based filtering TelegraphTelegraph (Berkeley): adaptive engine for sensors TradebotTradebot (www.tradebot.com): stock tickers & streams TribecaTribeca (Bellcore): network monitoring Streaminer & MAIDS Streaminer & MAIDS (UIUC & NCSA): new projects for stream

data mining


Stream Data Mining vs. Stream Querying

Stream mining—A more challenging task It shares most of the difficulties with stream querying Patterns are hidden and more general than querying It may require exploratory analysis

Not necessarily continuous queries

Stream data mining tasks Multi-dimensional on-line analysis of streams Mining outliers and unusual patterns in stream data Clustering data streams Classification of stream data


Stream Data Mining Tasks

Multi-dimensional (on-line) statistical analysis of streams Clustering data streams Classification of data streams Mining frequent patterns in data streams Mining sequential patterns in data streams Mining partial periodicity in data streams Mining notable gradients in data streams Mining outliers and unusual patterns in data streams ……, more?


Challenges for Mining Dynamics in Data Streams

Most stream data are at pretty low-level or multi-

dimensional in nature: needs ML/MD processing

Analysis requirements

Multi-dimensional trends and unusual patterns

Capturing important changes at multi-dimensions/levels

Fast, real-time detection and response

Comparing with data cube: Similarity and differences

Stream (data) cube or stream OLAP: Is this feasible?

Can we implement it efficiently?


Multi-Dimensional Stream Analysis: Examples

Analysis of Web click streams Raw data at low levels: seconds, web page addresses, user IP

addresses, … Analysts want: changes, trends, unusual patterns, at reasonable

levels of details E.g., Average clicking traffic in North America on sports in the last

15 minutes is 40% higher than that in the last 24 hours.”

Analysis of power consumption streams Raw data: power consumption flow for every household, every

minute Patterns one may find: average hourly power consumption

surges up 30% for manufacturing companies in Chicago in the last 2 hours today than that of the same day a week ago


A Key Step—Stream Data Reduction

Challenges of OLAPing stream data Raw data cannot be stored

Simple aggregates are not powerful enough

History shape and patterns at different levels are desirable:

multi-dimensional regression analysis

Proposal A scalable multi-dimensional stream “data cube” that can

aggregate regression model of stream data efficiently without

accessing the raw data

Stream data compression Compress the stream data to support memory- and time-

efficient multi-dimensional regression analysis


A Tilted Time-Frame Model

31 days 24 hours 4 qtrs12 months

Time Now

24hrs 4qtrs 15minutes7 days

Time Now

25sec.Up to 7 days:

Up to a year:

Logarithmic (exponential) scale:2t 1t

Time Now

4t8t16t


Benefits of Tilted Time-Frame Model

Each cell stores the measures according to tilt-time-frame

Limited memory space: Impossible to store the history in full scale

Emphasis more on recent data

Most applications emphasize on recent data (slide window)

Natural partition on different time granularities

Putting different weights on remote data

Useful even for uniform weight

Tilted time-frame forms a new time dimension

for mining changes and evolutions

Essential for mining dynamic patterns or outliers

Finding those with dramatic changes

E.g., exceptional stocks—not following the trends


A Stream Cube Architecture

A tilted time frame Different time granularities

second, minute, quarter, hour, day, week, …

Critical layers Minimum interest layer (m-layer) Observation layer (o-layer) User: watches at o-layer and occasionally needs to drill-down

down to m-layer

Partial materialization of stream cubes Full materialization: too space and time consuming No materialization: slow response at query time Partial materialization: what do we mean “partial”?


Two Critical Layers in the Stream Cube

(*, theme, quarter)

(user-group, URL-group, minute)

m-layer (minimal interest)

(individual-user, URL, second)

(primitive) stream data layer

o-layer (observation)


Stream Cube Structure: from m-layer to o-layer

(A1, *, C1)

(A1, *, C2) (A1, *, C2)

(A1, *, C2)

(A1, B1, C2) (A1, B2, C1)

(A2, *, C2)

(A2, B1, C1)

(A1, B2, C2)

(A2, B1, C2)

A2, B2, C1)

(A2, B2, C2)


An H-Tree Cubing Structure

root

entertainmentsports politics

uic uiuc uic uiuc

jeff

mary jeff Jim

Q.I.Q.I. Q.I.

Regression:

Sum: xxxxCnt: yyyy

Quant-Info

Observation layer

Minimal int. layer


Benefits of H-Tree and H-Cubing

H-tree and H-cubing Developed for computing data cubes and iceberg cubes

J. Han, J. Pei, G. Dong, and K. Wang, “Efficient Computation of Iceberg Cubes with Complex Measures”, SIGMOD'01

D. Xin, J. Han, X. Li, B. Wah, “Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration”, VLDB'03, Berlin, Germany, Sept. 2003.

Compressed database, fast cubing, space preserving

Using H-tree for stream cubing Space preserving: Intermediate aggregates can be computed

incrementally and saved in tree nodes Facilitate computing other cells and multi-dimensional analysis H-tree with computed cells can be viewed as stream cube


Use of Stream Cubing: Regression Analysis

Regression modeling of cells at all dimensions and levels

Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional

regression analysis of time-series data streams”, VLDB'02.

Efficient storage and scalable and fast aggregation

Lossless aggregation without accessing the raw data

Covered a large and the most popular class of regression

Including quadratic, polynomial, and nonlinear models

Methodology can be used for other statistical summary,

gradient analysis, and so on


Clustering Data Streams

Network intrusion detection: one example Detect bursts of activities or abrupt changes in real time—by on-

line clustering

Two major methodologies: Motwani et al. (Stanford and HP Lab)

S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan,

“Clustering data streams”, FOCS'00.

Merging and changing k-media cluster centers

Our approach (UIUC and IBM)

Tilted time frame to store historical data in compressed way

Mining evolving data streams


Clustering Evolving Data Streams

Why clustering evolving data streams? Finding evolutions of clusters not just current clusters

C. Aggarwal, J. Han, J. Wang, P. S. Yu, “A Framework for

Clustering Evolving Data Streams”, VLDB'03

Methodology Tilted time frame work: compression + mining changes

Micro-clustering: better quality than k-means/k-median

incremental, online processing and maintenance

Two stages: micro-clustering and macro-clustering

With limited “overhead” to achieve high efficiency, scalability,

quality of results and power of evolution/change detection


Decision Tree Induction with Stream Data

VFDT/CVFDT P. Domingos and G. Hulten, “Mining high-speed data streams”,

KDD'00

G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing

data streams”,KDD'01

VFDT (Very Fast Decision Tree) (Domingos and Hulten’00)

With high probability, constructs an identical model that a

traditional (greedy) method would learn

If it cannot be inserted into the same branch, construct “shadow

branches” as preparation for changes

If the shadow becomes dominant, switch of tree branches occurs

CVFDT: Extension to time changing data


Single-Pass Algorithm (An Example)

yes no

Packets > 10

Protocol = http

Protocol = ftp

yes

yes no

Packets > 10

Bytes > 60K

Protocol = http

SP(Bytes) - SP(Packets) >

Data Stream

Data Stream

From Gehrke’s SIGMOD tutorial slides


Our Approaches for Stream Classification

Why our approaches? Is decision-tree good for modeling fast changing data?—too fast

changes may make trees obsolete

May other models have better survivability (adaptability)?

Can we find and compare evolution behavior?

Our methodology Tilted time framework (compression and evolution)

Instead of decision-trees, consider other models that do not need

dramatic changes, e.g., Naïve Bayesian with boosting, K-NN?

Incremental updating, dynamic maintenance, and model

construction

Comparing of models to find changes


Stream Classification by Naïve Bayesian

Naïve Bayesian + boosting A working paper with Jiong Yang, Wei Wang and

Xifeng Yan A stable model that registers attribute-value pairs

(similar to Raiforest-based classification) History can be recorded using tilted time framework Using boosting to increase classification accuracy Adaptable to dramatic changes Incremental updating, dynamic maintenance, and fast

model construction Comparing of models to find changes


Stream Classification by K-Nearest Neighbor

Classification based on nearest neighbors C. Agarwal (IBM), J. Han, J. Wang, P. S. Yu (IBM), “A Framework

for Effective Classification of Evolving Data”, a working paper

Two kinds of stream classification tasks Type I: Prediction of peers in the current stream

Type II: Prediction of the behavior in the next window

Registration of basic statistics (clustering features) using

B-tree (Birch-like) structure

Different philosophies for training and model construction

for type I and II tasks


Mining Frequent Patterns for Stream Data

Frequent pattern mining is valuable in stream applications e.g., network intrusion mining (Dokas, et al’02)

Mining precise freq. patterns in stream data: unrealistic Even store them in a compressed form, such as FPtree

How to mine frequent patterns with good approximation? Approximate frequent patterns (Manku & Motwani, VLDB’02)

Major ideas: not tracing items until it becomes first frequent

Adv: guarantee error bound

Disadv: keep a large set of traces

Our comments

Keep only current frequent patterns? No changes can be detected


Our Approach on Frequent Stream Patterns

Approach 1: Mining with Multiple Time Granularities C. Giannella, J. Han, J. Pei, X. Yan and P.S. Yu, “Mining

Frequent Patterns in Data Streams at Multiple Time Granularities”, Next Gen. Data Mining, MIT Press, 2003

Keep pattern-trees at the tilted time window frame (using tree-sharing method)

Mining evolution and dramatic changes of frequent patterns

Approach 2: Mining only interested itemsets Identify interested items in stream environment Keep precise/compressed history in tilted time window Mining using FP-tree and related fast mining method


A Discussion of Our Work Plan

System architecture design Need preprocessing of stream data flow Two working modes

Disk-based and true stream processing Main memory algorithms

“cube” and tilted time frame structure Test data sets

Network flow (multi-dimensional data) Web click stream analysis Stock trading data? Multiple stream weather data?


Conclusions

Stream data mining: A rich and largely unexplored field Current research focus in database community:

DSMS system architecture, continuous query processing, supporting mechanisms

Stream data mining and stream OLAP analysis Powerful tools for finding general and unusual patterns Effectiveness, efficiency and scalability: lots of open problems

Our philosophy: A multi-dimensional stream analysis framework Time is a special dimension: tilted time frame What to compute and what to save?—Critical layers Very partial materialization/precomputation: popular path approach Mining dynamics of stream data


References

B. Babcock, S. Babu, M. Datar, R. Motawani, and J. Widom, “Models and issues in data

stream systems”, PODS'02 (tutorial).

S. Babu and J. Widom, “Continuous queries over data streams”, SIGMOD Record,

30:109--120, 2001.

Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang. “Online analytical processing

stream data: Is it feasible?”, DMKD'02.

Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression

analysis of time-series data streams”, VLDB'02.

P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00.

M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only

get one look”, SIGMOD'02 (tutorial).

J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over

continuous data streams”, SIGMOD'01.

S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”,

FOCS'00.

G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”, KDD'01.


www.cs.uiuc.edu/~hanj

Thank you !!!Thank you !!!