Real-Time Querying of Live and Historical Stream Data
Transcript of Real-Time Querying of Live and Historical Stream Data
Joe Hellerstein, UC Berkeley
Joint Work
• Fred Reiss, UC Berkeley (IBM Almaden)
• Kurt Stockinger, Kesheng Wu, Arie Shoshani, Lawrence Berkeley National Lab
Outline
• A challenging stream query problem
  – Real-world example: US DOE network monitoring
• Open-Source Components
  – Stream Query Engine: TelegraphCQ
  – Data Warehousing store: FastBit
• Performance Study
  – Stream Analysis, Load, Lookup
• Handling Bursts: Data Triage
Agenda
• Study a practical application of stream queries
  – High data rates
  – Data-rich: needs to consult “history”
• Obvious settings
  – Financial
  – System Monitoring
• Keep it real
DOE Network Monitoring
• U.S. Department of Energy (DOE) runs a nationwide network of laboratories
  – Including our colleagues at LBL
• Labs send data over a number of long-haul networks
• DOE is building a nationwide network operations center
  – Need software to help operators monitor network security and reliability
Monitoring infrastructure
Challenges
• Live Streams
  – Continuous queries over unpredictable streams
• Archival Streams
  – Load/index all data
  – Access on demand as part of continuous queries
Open Source
Outline
• A challenging stream query problem
  – Real-world example: US DOE network monitoring
• Open-Source Components
  – Stream Query Engine: TelegraphCQ
  – Data Warehousing store: FastBit
• Performance Study
  – Stream Analysis, Load, Lookup
• Handling Bursts: Data Triage
Telegraph Project
• 1999-2006, joint with Mike Franklin
  – An “adaptive dataflow” system
• v1 in 2000: Java-based
  – Deep-Web Bush/Gore demo presages live web mashups
• v2: TelegraphCQ (TCQ), a rewrite of PostgreSQL
  – External data & streams
  – Open source with active users, mostly in network monitoring
  – Commercialization at Truviso, Inc.
  – “Big” academic software: 2 faculty, 9 PhDs, 3 MS
Some Key Telegraph Features
• Eddies: Continuous Query Optimization
  – Reoptimize queries at any point in execution
• FLuX: Fault-Tolerant Load-balanced eXchange
  – Cluster Parallelism with High Availability
• Shared query processing
  – Query processing is a join of queries and data
• Data Triage
  – Robust statistical approximation under stress
• See http://telegraph.cs.berkeley.edu for more
FastBit
• Background
  – Vertically-partitioned relational store with bitmap indexes
  – Word-Aligned Hybrid (WAH) compression, tuned for CPU efficiency as well as disk bandwidth
• LBL internal project since 2004
• Open source (LGPL) in 2008
Introduction to Bitmap Indexing

Relational Table:

  Row | Column 1 | Column 2
  ----+----------+---------
   1  |    3     |    3
   2  |    2     |    4
   3  |    1     |    2
   4  |    5     |    5
   5  |    4     |    1
   6  |    3     |    2
  ...

Bitmap Index (one bitvector per value, one bit per row):

  Column 1:  value 1 → 001000…   value 2 → 010000…   value 3 → 100001…
             value 4 → 000010…   value 5 → 000100…
  Column 2:  value 1 → 000010…   value 2 → 001001…   value 3 → 100000…
             value 4 → 010000…   value 5 → 000100…
Why Bitmap Indexes?
• Fast incremental appending
  – One index per stream
  – No sorting or hashing of the input data
• Efficient multidimensional range lookups
  – Example: Find number of sessions from prefix 192.168/16 with size between 100 and 200 bytes
• Efficient batch lookups
  – Can retrieve entries for multiple keys in a single pass over the bitmap
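The append and range-lookup properties above can be sketched with plain uncompressed bitvectors (FastBit adds WAH compression on top of this idea; the class and method names here are illustrative, not FastBit's API):

```python
# Equality-encoded bitmap index: one bitvector per distinct key,
# one bit per row. Python ints serve as arbitrary-length bitvectors.

class BitmapIndex:
    def __init__(self):
        self.bitmaps = {}   # key -> bitvector (bit i set => row i has this key)
        self.nrows = 0

    def append(self, key):
        # Append-only: set one bit, no sorting/hashing of existing data.
        self.bitmaps[key] = self.bitmaps.get(key, 0) | (1 << self.nrows)
        self.nrows += 1

    def lookup_range(self, lo, hi):
        # Range query = OR of the bitvectors for keys in [lo, hi].
        result = 0
        for key, bits in self.bitmaps.items():
            if lo <= key <= hi:
                result |= bits
        return result

idx = BitmapIndex()
for v in [3, 2, 1, 5, 4, 3]:          # Column 1 of the example table
    idx.append(v)

matches = idx.lookup_range(2, 3)      # rows with value in [2, 3]
rows = [i for i in range(idx.nrows) if matches >> i & 1]
```

Batch lookups fall out of the same structure: ORing several key bitvectors answers many keys in a single pass over the bitmaps.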
Outline
• A challenging stream query problem
  – Real-world example: US DOE network monitoring
• Open-Source Components
  – Stream Query Engine: TelegraphCQ
  – Data Warehousing store: FastBit
• Performance Study
  – Stream Analysis, Load, Lookup
• Handling Bursts: Data Triage
The DOE dataset
• 42 weeks, 08/2004 – 06/2005
  – Est. 1/30 of DOE traffic
• Projection:
  – 15K records/sec typical
  – 1.7M records/sec peak
• TPC-C today: 4,092,799 tpmC (2/27/07)
  – i.e., 68.2K transactions per sec
• 1.5 orders of magnitude needed!
  – And please touch 2 orders of magnitude more random data, too
  – But… append-only updates + streaming queries (temporal locality)
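As a back-of-the-envelope check of the slide's arithmetic (my own calculation, not from the talk):

```python
import math

tpmc = 4_092_799              # TPC-C record cited on the slide (txns/minute)
tpc_per_sec = tpmc / 60       # ≈ 68.2K txns/sec
peak = 1_700_000              # projected peak stream rate, records/sec

gap = peak / tpc_per_sec      # how far the best OLTP number falls short
orders = math.log10(gap)      # ≈ 1.4, i.e. roughly "1.5 orders of magnitude"
```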
Our Focus: Flagging abnormal traffic
• When a network is behaving abnormally…
  – Notify network operators
  – Trigger in-depth analysis or countermeasures
• How to detect abnormal behavior?
  – Analyze multiple aspects of live monitoring data
  – Compute relevant information about “normal” behavior from historical monitoring data
  – Compare current behavior against this baseline
Example: “Elephants”
• The query:
  – Find the k most significant sources of traffic on the network over the past t seconds.
  – Alert the network operator if any of these sources is sending an unusually large amount of traffic for this time of the day/week, compared with its usual traffic patterns.
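The Elephants logic can be sketched in Python. This is illustrative only: the flow-tuple shape, the `baseline` dict (standing in for the FastBit historical lookup), and the 2x threshold are assumptions, not the paper's actual query:

```python
from collections import Counter

def elephants(flows, now, window_sec, k, baseline, threshold=2.0):
    """flows: iterable of (timestamp, src_ip, nbytes).
    baseline: src_ip -> usual bytes per window (from historical data)."""
    # Step 1: aggregate live traffic over the past window_sec seconds.
    counts = Counter()
    for ts, src, nbytes in flows:
        if now - ts <= window_sec:
            counts[src] += nbytes

    # Step 2: take the k most significant sources and compare to baseline.
    alerts = []
    for src, total in counts.most_common(k):
        usual = baseline.get(src, 0.0)
        if usual == 0.0 or total > threshold * usual:
            alerts.append((src, total, usual))   # unusually heavy sender
    return alerts

alerts = elephants(
    flows=[(0, "a", 100), (5, "a", 100), (5, "b", 10)],
    now=5, window_sec=10, k=2,
    baseline={"a": 50.0, "b": 50.0},
)
```

In the real system, step 1 runs as a TelegraphCQ windowed query and step 2 consults FastBit; the comparison logic is the glue between them.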
System Architecture
Query Workload
• Five monitoring queries, based on discussions with network researchers and practitioners
• Each query has three parts:
  – Analyze flow record stream (TelegraphCQ query)
  – Retrieve and analyze relevant historical monitoring data (FastBit query)
  – Compare current behavior against baseline
Query Workload Summary
• Elephants
  – Find heavy sources of network traffic that are not normally heavy network users
• Mice
  – Examine the current behavior of hosts that normally send very little traffic
• Portscans
  – Find hosts that appear to be probing for vulnerable network ports
  – Filter out “suspicious” behavior that is actually normal
• Anomaly Detection
  – Compare the current traffic matrix (all source, destination pairs) against past traffic patterns
• Dispersion
  – Retrieve historical traffic data for sub-networks that exhibit suspicious timing patterns
• Full queries are in the paper
Best-Case Numbers
• Single PC, dual 2.8GHz single-core Pentium 4, 2GB RAM, IDE RAID (60 MB/sec throughput)
• TCQ performance up to 25K records/sec
  – Depends heavily on query, esp. window size
• FastBit can load 213K tuples/sec
  – Network packet trace schema
  – Depends on batch size: 10M tuples per batch
• FastBit can “fetch” 5M records/sec
  – 8 bytes of output per record only! 40MB/sec, near RAID I/O throughput
• Best end-to-end: 20K tuples/sec
  – Recall desire of 15K tuples/sec steady state, 1.7M tuples/sec burst
[Performance charts omitted: Streaming Query Processing, Index Insertion, Index Lookup, End-to-End Throughput]
Summary of DOE Results
• With sufficiently large load window, FastBit can handle expected peak data rates
• Streaming query processing becomes the bottleneck– Next step: Data Triage
Outline
• A challenging stream query problem
  – Real-world example: US DOE network monitoring
• Open-Source Components
  – Stream Query Engine: TelegraphCQ
  – Data Warehousing store: FastBit
• Performance Study
  – Stream Analysis, Load, Lookup
• Handling Bursts: Data Triage
ICDE 2006, Feb 21, 2006
Data Triage
• Provision for the typical data rate
• Fall back on approximation during bursts
  – But always do as much “exact” work as you can!
• Benefits:
  – Monitor fast links with cheap hardware
  – Focus on query processing features, not speed
  – Graceful degradation during bursts
  – 100% result accuracy most of the time
Data Triage
• Bursty data goes to the triage process first
• Place a triage queue in front of each data source
• Summarize excess tuples to prevent missing deadlines
[Figure: packets undergo initial parsing and filtering into relational tuples, which enter a triage queue; tuples that can be processed in time go to the query engine, while triaged tuples are routed to the triage process, whose summarizer sends summaries of triaged tuples to the query engine.]
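A minimal sketch of the triage queue just described, using a toy count-plus-sum summary (Data Triage supports pluggable summaries such as samples and sketches; the class and field names here are invented for illustration):

```python
from collections import deque

class TriageQueue:
    """Tuples that fit go to the engine exactly; overflow is summarized,
    not dropped, so deadlines are met without losing all information."""

    def __init__(self, capacity):
        self.queue = deque()
        self.capacity = capacity
        self.summary = {"count": 0, "sum": 0}   # summary of triaged tuples

    def push(self, tup):
        if len(self.queue) < self.capacity:
            self.queue.append(tup)              # exact processing path
        else:
            self.summary["count"] += 1          # triage path: summarize
            self.summary["sum"] += tup["length"]

    def drain(self):
        # Main query consumes the exact tuples; the shadow query
        # consumes the summary of whatever was triaged.
        exact, self.queue = list(self.queue), deque()
        summary, self.summary = self.summary, {"count": 0, "sum": 0}
        return exact, summary

q = TriageQueue(capacity=2)
for length in (10, 20, 30):                     # burst of 3 into capacity 2
    q.push({"length": length})
exact, summary = q.drain()
```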
Data Triage
• Query engine receives tuples and summaries
• Use a shadow query to compute approximation of missing results
[Figure: relational tuples feed the main query, and summaries of triaged tuples feed a shadow query; the shadow query produces summaries of missing results, which are merged with the main query’s output before reaching the user.]
Read the paper for…
• Provisioning
  – Where are the performance bottlenecks in this pipeline?
  – How do we mitigate those bottlenecks?
• Implementation
  – How do we “plug in” different approximation schemes without modifying the query engine?
  – How do we build shadow queries?
• Interface
  – How do we present the merged query results to the user?
Delay Constraints
[Figure: a timeline showing Window 1 and Window 2; all results from a window must be delivered within the delay constraint after the window closes.]
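One way to read the delay constraint operationally: if processing everything still queued would push a window's results past its deadline, the excess must be triaged. A hypothetical sketch, assuming a per-tuple cost estimate (this is my reading, not the paper's mechanism; integer microseconds avoid float rounding):

```python
def tuples_to_triage(queue_len, cost_per_tuple_us, time_left_us):
    """How many queued tuples must be summarized to meet the deadline.
    All times are integer microseconds."""
    budget = time_left_us // cost_per_tuple_us   # tuples we can process exactly
    return max(0, queue_len - budget)            # the rest go to the summarizer
```

For example, with 100 queued tuples, 10 µs per tuple, and 500 µs left before the deadline, 50 tuples can be processed exactly and 50 are triaged.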
Experiments
• System
  – Data Triage implemented on TelegraphCQ
  – Pentium 3 server, 1.5 GB of memory
• Data stream
  – Timing-accurate playback of real network traffic from the www.lbl.gov web server
  – Trace sped up 10x to simulate an embedded CPU
select W.adminContact,
       avg(P.length) as avgLength,
       stdev(P.length) as stdevLength,
       wtime(*) as windowTime
from Packet P [range '1 min' slide '1 min'], WHOIS W
where P.srcIP > W.minIP
  and P.srcIP < W.maxIP
group by W.adminContact
limit delay to '10 seconds';
Experimental Results: Latency
Experimental Results: Accuracy
• Compare accuracy of Data Triage with previous work
• Comparison 1: Drop excess tuples
  – Both methods using 5-second delay constraint
  – No summarization
Experimental Results: Accuracy
• Comparison 2: Summarize all tuples
  – Reservoir sample
  – 5-second delay constraint
  – Size of reservoir = number of tuples the query engine can process in 5 sec
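The baseline summary in Comparison 2, reservoir sampling, keeps a uniform random sample of fixed size over a stream of unknown length. A sketch of the standard algorithm (Vitter's Algorithm R; this is textbook code, not the paper's implementation):

```python
import random

def reservoir_sample(stream, size, rng=None):
    """Uniform random sample of `size` items from an arbitrary-length stream."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < size:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # item i survives with prob size/(i+1)
            if j < size:
                reservoir[j] = item
    return reservoir
```

Note the contrast with Data Triage: reservoir sampling summarizes *all* tuples, whereas triage processes as many tuples exactly as the delay constraint allows and only summarizes the overflow.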
Conclusions & Issues
• Stream Query + Append-Only Warehouse
  – Good match, can scale a long way at modest $$
• Data Triage combats the stream-query bottleneck
  – Provision for the common case
  – Approximate on excess load during bursts
  – Keep approximation limited, extensible
• Parallelism needed
  – See FLuX work for High Availability
    • Shah et al, ICDE ’03 and SIGMOD ’04
    • vs. Google’s MapReduce
• Query Optimization for streams? Adaptivity!
  – Eddies: Avnur & Hellerstein, SIGMOD ’00
  – SteMs: Raman, Deshpande, Hellerstein, ICDE ’03
  – STAIRs: Deshpande & Hellerstein, VLDB ’04
  – Deshpande/Ives/Raman survey, F&T-DB ’07
More?
• http://telegraph.cs.berkeley.edu
• Frederick Reiss and Joseph M. Hellerstein. “Declarative Network Monitoring with an Underprovisioned Query Processor”. ICDE 2006.
• F. Reiss, K. Stockinger, K. Wu, A. Shoshani, J. M. Hellerstein. “Enabling Real-Time Querying of Live and Historical Stream Data”. SSDBM 2007.
• Frederick Reiss. “Data Triage”. Ph.D. thesis, UC Berkeley, 2007.