MAIDS: Mining Alarming Incidents
in Data Streams
A discussion on the MAIDS project
May 20, 2003
04/13/23 Mining Dynamics in Data Streams 2
Outline
Characteristics of data streams
Mining dynamics in data streams
Multi-dimensional analysis of data streams
Stream query and stream cubing Querying, statistical summary, OLAP, regression, gradient
analysis, …
Mining frequent patterns in data streams
Clustering data streams
Classification of data streams
Planning for implementation and testing
04/13/23 Mining Dynamics in Data Streams 3
Characteristics of Data Streams
Data Streams Data streams—continuous, ordered, changing, fast, huge amount
Traditional DBMS—data stored in finite, persistent data setsdata sets
Characteristics Huge volumes of continuous data, possibly infinite Fast changing and requires fast, real-time response Data stream captures nicely our data processing needs of today Random access is expensive—single linear scan algorithm (can
only have one look) Store only the summary of the data seen thus far Most stream data are at pretty low-level or multi-dimensional in
nature, needs multi-level and multi-dimensional processing
04/13/23 Mining Dynamics in Data Streams 4
Stream Data Applications
Telecommunication calling records Business: credit card transaction flows Network monitoring and traffic engineering Financial market: stock exchange Engineering & industrial processes: power supply &
manufacturing Sensor, monitoring & surveillance: video streams Security monitoring Web logs and Web page click streams Massive data sets (even saved but random access is too
expensive)
04/13/23 Mining Dynamics in Data Streams 5
Architecture: Stream Query Processing
Scratch SpaceScratch Space(Main memory and/or Disk)(Main memory and/or Disk)
User/ApplicationUser/ApplicationUser/ApplicationUser/Application
Continuous QueryContinuous Query
Stream QueryStream QueryProcessorProcessor
ResultsResults
Multiple streamsMultiple streams
SDMS (Stream Data Management System)
04/13/23 Mining Dynamics in Data Streams 6
Challenges of Stream Data Processing
Multiple, continuous, rapid, time-varying, ordered streams Main memory computations Queries are often continuous
Evaluated continuously as stream data arrives Answer updated over time
Queries are often complex Beyond element-at-a-time processing Beyond stream-at-a-time processing Beyond relational queries (scientific, data mining, OLAP)
Multi-level/multi-dimensional processing and data mining Most stream data are at pretty low-level or multi-dimensional in
nature
04/13/23 Mining Dynamics in Data Streams 7
Processing Stream Queries
Query types One-time query vs. continuous query (being evaluated
continuously as stream continues to arrive) Predefined query vs. ad-hoc query (issued on-line)
Unbounded memory requirements For real-time response, main memory algorithm should be used Memory requirement is unbounded if one will join future tuples
Approximate query answering With bounded memory, it is not always possible to produce exact
answers High-quality approximate answers are desired Data reduction and synopsis construction methods
Sketches, random sampling, histograms, wavelets, etc.
04/13/23 Mining Dynamics in Data Streams 8
Projects on DSMS (Data Stream Management System)
Research projects and system prototypes STREAMSTREAM (Stanford): A general-purpose DSMS CougarCougar (Cornell): sensors AuroraAurora (Brown/MIT): sensor monitoring, dataflow Hancock Hancock (AT&T): telecom streams NiagaraNiagara (OGI/Wisconsin): Internet XML databases OpenCQOpenCQ (Georgia Tech): triggers, incr. view maintenance TapestryTapestry (Xerox): pub/sub content-based filtering TelegraphTelegraph (Berkeley): adaptive engine for sensors TradebotTradebot (www.tradebot.com): stock tickers & streams TribecaTribeca (Bellcore): network monitoring Streaminer & MAIDS Streaminer & MAIDS (UIUC & NCSA): new projects for stream
data mining
04/13/23 Mining Dynamics in Data Streams 9
Stream Data Mining vs. Stream Querying
Stream mining—A more challenging task It shares most of the difficulties with stream querying Patterns are hidden and more general than querying It may require exploratory analysis
Not necessarily continuous queries
Stream data mining tasks Multi-dimensional on-line analysis of streams Mining outliers and unusual patterns in stream data Clustering data streams Classification of stream data
04/13/23 Mining Dynamics in Data Streams 10
Stream Data Mining Tasks
Multi-dimensional (on-line) statistical analysis of streams Clustering data streams Classification of data streams Mining frequent patterns in data streams Mining sequential patterns in data streams Mining partial periodicity in data streams Mining notable gradients in data streams Mining outliers and unusual patterns in data streams ……, more?
04/13/23 Mining Dynamics in Data Streams 11
Challenges for Mining Dynamics in Data Streams
Most stream data are at pretty low-level or multi-
dimensional in nature: needs ML/MD processing
Analysis requirements
Multi-dimensional trends and unusual patterns
Capturing important changes at multi-dimensions/levels
Fast, real-time detection and response
Comparing with data cube: Similarity and differences
Stream (data) cube or stream OLAP: Is this feasible?
Can we implement it efficiently?
04/13/23 Mining Dynamics in Data Streams 12
Multi-Dimensional Stream Analysis: Examples
Analysis of Web click streams Raw data at low levels: seconds, web page addresses, user IP
addresses, … Analysts want: changes, trends, unusual patterns, at reasonable
levels of details E.g., Average clicking traffic in North America on sports in the last
15 minutes is 40% higher than that in the last 24 hours.”
Analysis of power consumption streams Raw data: power consumption flow for every household, every
minute Patterns one may find: average hourly power consumption
surges up 30% for manufacturing companies in Chicago in the last 2 hours today than that of the same day a week ago
04/13/23 Mining Dynamics in Data Streams 13
A Key Step—Stream Data Reduction
Challenges of OLAPing stream data Raw data cannot be stored
Simple aggregates are not powerful enough
History shape and patterns at different levels are desirable:
multi-dimensional regression analysis
Proposal A scalable multi-dimensional stream “data cube” that can
aggregate regression model of stream data efficiently without
accessing the raw data
Stream data compression Compress the stream data to support memory- and time-
efficient multi-dimensional regression analysis
04/13/23 Mining Dynamics in Data Streams 14
A Tilted Time-Frame Model
31 days 24 hours 4 qtrs12 months
Time Now
24hrs 4qtrs 15minutes7 days
Time Now
25sec.Up to 7 days:
Up to a year:
Logarithmic (exponential) scale:2t 1t
Time Now
4t8t16t
04/13/23 Mining Dynamics in Data Streams 15
Benefits of Tilted Time-Frame Model
Each cell stores the measures according to tilt-time-frame
Limited memory space: Impossible to store the history in full scale
Emphasis more on recent data
Most applications emphasize on recent data (slide window)
Natural partition on different time granularities
Putting different weights on remote data
Useful even for uniform weight
Tilted time-frame forms a new time dimension
for mining changes and evolutions
Essential for mining dynamic patterns or outliers
Finding those with dramatic changes
E.g., exceptional stocks—not following the trends
04/13/23 Mining Dynamics in Data Streams 16
A Stream Cube Architecture
A tilted time frame Different time granularities
second, minute, quarter, hour, day, week, …
Critical layers Minimum interest layer (m-layer) Observation layer (o-layer) User: watches at o-layer and occasionally needs to drill-down
down to m-layer
Partial materialization of stream cubes Full materialization: too space and time consuming No materialization: slow response at query time Partial materialization: what do we mean “partial”?
04/13/23 Mining Dynamics in Data Streams 17
Two Critical Layers in the Stream Cube
(*, theme, quarter)
(user-group, URL-group, minute)
m-layer (minimal interest)
(individual-user, URL, second)
(primitive) stream data layer
o-layer (observation)
04/13/23 Mining Dynamics in Data Streams 18
Stream Cube Structure: from m-layer to o-layer
(A1, *, C1)
(A1, *, C2) (A1, *, C2)
(A1, *, C2)
(A1, B1, C2) (A1, B2, C1)
(A2, *, C2)
(A2, B1, C1)
(A1, B2, C2)
(A2, B1, C2)
A2, B2, C1)
(A2, B2, C2)
04/13/23 Mining Dynamics in Data Streams 19
An H-Tree Cubing Structure
root
entertainmentsports politics
uic uiuc uic uiuc
jeff
mary jeff Jim
Q.I.Q.I. Q.I.
Regression:
Sum: xxxxCnt: yyyy
Quant-Info
Observation layer
Minimal int. layer
04/13/23 Mining Dynamics in Data Streams 20
Benefits of H-Tree and H-Cubing
H-tree and H-cubing Developed for computing data cubes and iceberg cubes
J. Han, J. Pei, G. Dong, and K. Wang, “Efficient Computation of Iceberg Cubes with Complex Measures”, SIGMOD'01
D. Xin, J. Han, X. Li, B. Wah, “Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration”, VLDB'03, Berlin, Germany, Sept. 2003.
Compressed database, fast cubing, space preserving
Using H-tree for stream cubing Space preserving: Intermediate aggregates can be computed
incrementally and saved in tree nodes Facilitate computing other cells and multi-dimensional analysis H-tree with computed cells can be viewed as stream cube
04/13/23 Mining Dynamics in Data Streams 21
Use of Stream Cubing: Regression Analysis
Regression modeling of cells at all dimensions and levels
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional
regression analysis of time-series data streams”, VLDB'02.
Efficient storage and scalable and fast aggregation
Lossless aggregation without accessing the raw data
Covered a large and the most popular class of regression
Including quadratic, polynomial, and nonlinear models
Methodology can be used for other statistical summary,
gradient analysis, and so on
04/13/23 Mining Dynamics in Data Streams 22
Clustering Data Streams
Network intrusion detection: one example Detect bursts of activities or abrupt changes in real time—by on-
line clustering
Two major methodologies: Motwani et al. (Stanford and HP Lab)
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan,
“Clustering data streams”, FOCS'00.
Merging and changing k-media cluster centers
Our approach (UIUC and IBM)
Tilted time frame to store historical data in compressed way
Mining evolving data streams
04/13/23 Mining Dynamics in Data Streams 23
Clustering Evolving Data Streams
Why clustering evolving data streams? Finding evolutions of clusters not just current clusters
C. Aggarwal, J. Han, J. Wang, P. S. Yu, “A Framework for
Clustering Evolving Data Streams”, VLDB'03
Methodology Tilted time frame work: compression + mining changes
Micro-clustering: better quality than k-means/k-median
incremental, online processing and maintenance
Two stages: micro-clustering and macro-clustering
With limited “overhead” to achieve high efficiency, scalability,
quality of results and power of evolution/change detection
04/13/23 Mining Dynamics in Data Streams 24
Decision Tree Induction with Stream Data
VFDT/CVFDT P. Domingos and G. Hulten, “Mining high-speed data streams”,
KDD'00
G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing
data streams”,KDD'01
VFDT (Very Fast Decision Tree) (Domingos and Hulten’00)
With high probability, constructs an identical model that a
traditional (greedy) method would learn
If it cannot be inserted into the same branch, construct “shadow
branches” as preparation for changes
If the shadow becomes dominant, switch of tree branches occurs
CVFDT: Extension to time changing data
04/13/23 Mining Dynamics in Data Streams 25
Single-Pass Algorithm (An Example)
yes no
Packets > 10
Protocol = http
Protocol = ftp
yes
yes no
Packets > 10
Bytes > 60K
Protocol = http
SP(Bytes) - SP(Packets) >
Data Stream
Data Stream
From Gehrke’s SIGMOD tutorial slides
04/13/23 Mining Dynamics in Data Streams 26
Our Approaches for Stream Classification
Why our approaches? Is decision-tree good for modeling fast changing data?—too fast
changes may make trees obsolete
May other models have better survivability (adaptability)?
Can we find and compare evolution behavior?
Our methodology Tilted time framework (compression and evolution)
Instead of decision-trees, consider other models that do not need
dramatic changes, e.g., Naïve Bayesian with boosting, K-NN?
Incremental updating, dynamic maintenance, and model
construction
Comparing of models to find changes
04/13/23 Mining Dynamics in Data Streams 27
Stream Classification by Naïve Bayesian
Naïve Bayesian + boosting A working paper with Jiong Yang, Wei Wang and
Xifeng Yan A stable model that registers attribute-value pairs
(similar to Raiforest-based classification) History can be recorded using tilted time framework Using boosting to increase classification accuracy Adaptable to dramatic changes Incremental updating, dynamic maintenance, and fast
model construction Comparing of models to find changes
04/13/23 Mining Dynamics in Data Streams 28
Stream Classification by K-Nearest Neighbor
Classification based on nearest neighbors C. Agarwal (IBM), J. Han, J. Wang, P. S. Yu (IBM), “A Framework
for Effective Classification of Evolving Data”, a working paper
Two kinds of stream classification tasks Type I: Prediction of peers in the current stream
Type II: Prediction of the behavior in the next window
Registration of basic statistics (clustering features) using
B-tree (Birch-like) structure
Different philosophies for training and model construction
for type I and II tasks
04/13/23 Mining Dynamics in Data Streams 29
Mining Frequent Patterns for Stream Data
Frequent pattern mining is valuable in stream applications e.g., network intrusion mining (Dokas, et al’02)
Mining precise freq. patterns in stream data: unrealistic Even store them in a compressed form, such as FPtree
How to mine frequent patterns with good approximation? Approximate frequent patterns (Manku & Motwani, VLDB’02)
Major ideas: not tracing items until it becomes first frequent
Adv: guarantee error bound
Disadv: keep a large set of traces
Our comments
Keep only current frequent patterns? No changes can be detected
04/13/23 Mining Dynamics in Data Streams 30
Our Approach on Frequent Stream Patterns
Approach 1: Mining with Multiple Time Granularities C. Giannella, J. Han, J. Pei, X. Yan and P.S. Yu, “Mining
Frequent Patterns in Data Streams at Multiple Time Granularities”, Next Gen. Data Mining, MIT Press, 2003
Keep pattern-trees at the tilted time window frame (using tree-sharing method)
Mining evolution and dramatic changes of frequent patterns
Approach 2: Mining only interested itemsets Identify interested items in stream environment Keep precise/compressed history in tilted time window Mining using FP-tree and related fast mining method
04/13/23 Mining Dynamics in Data Streams 31
A Discussion of Our Work Plan
System architecture design Need preprocessing of stream data flow Two working modes
Disk-based and true stream processing Main memory algorithms
“cube” and tilted time frame structure Test data sets
Network flow (multi-dimensional data) Web click stream analysis Stock trading data? Multiple stream weather data?
04/13/23 Mining Dynamics in Data Streams 32
Conclusions
Stream data mining: A rich and largely unexplored field Current research focus in database community:
DSMS system architecture, continuous query processing, supporting mechanisms
Stream data mining and stream OLAP analysis Powerful tools for finding general and unusual patterns Effectiveness, efficiency and scalability: lots of open problems
Our philosophy: A multi-dimensional stream analysis framework Time is a special dimension: tilted time frame What to compute and what to save?—Critical layers Very partial materialization/precomputation: popular path approach Mining dynamics of stream data
04/13/23 Mining Dynamics in Data Streams 33
References
B. Babcock, S. Babu, M. Datar, R. Motawani, and J. Widom, “Models and issues in data
stream systems”, PODS'02 (tutorial).
S. Babu and J. Widom, “Continuous queries over data streams”, SIGMOD Record,
30:109--120, 2001.
Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang. “Online analytical processing
stream data: Is it feasible?”, DMKD'02.
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression
analysis of time-series data streams”, VLDB'02.
P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00.
M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only
get one look”, SIGMOD'02 (tutorial).
J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over
continuous data streams”, SIGMOD'01.
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”,
FOCS'00.
G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”, KDD'01.
04/13/23 Mining Dynamics in Data Streams 34
www.cs.uiuc.edu/~hanj
Thank you !!!Thank you !!!
Top Related