A Research Sampler shasha@cs.nyu.edu dex.html.
![Page 2: A Research Sampler shasha@cs.nyu.edu dex.html.](https://reader038.fdocuments.in/reader038/viewer/2022103122/56649f505503460f94c72c03/html5/thumbnails/2.jpg)
Philosophy
• Research should be fun -- good puzzles, interesting algorithms.
• Research should be useful -- work with real users whenever possible.
• Implementation should be fast (I use a very powerful programming environment that I expect my students to learn)
Thesis Philosophy
• Ideal thesis should have an interesting algorithm with analysis, an implementation, and users.
• Of the 15 theses I have supervised, 13 follow this model. The other two were pure systems theses.
Current Research Topics
• Time series analysis: finding correlation/bursts. Query by humming.
• AQuery: Database for ordered data (like time series)
• Computational biology: data analysis, visualization, proteomics
Online Pattern Discovery
• Sensor-less: Pairs-trading in stock trading (find highly correlated pairs in n log n time)
• Sensor-full: Gamma Ray Detection in astrophysics (burst detection over a large number of window sizes in almost linear time)
• Dennis Shasha (joint work with Yunyue Zhu, Xiaojian Zhao, Zhihua Wang, Tyler Neylon, Xin Zhang and Prof Richard Cole)
Goal of this work
• Time series are important in so many applications: biology, medicine, finance, music, physics, …
• A few fundamental operations occur all the time: burst detection, correlation, pattern matching.
• Extend functionality for music and science.
StatStream (VLDB,2002): Example
• Stock price streams
– The New York Stock Exchange (NYSE)
– 50,000 securities (streams); 100,000 ticks (trade and quote)
• Pairs Trading, a.k.a. Correlation Trading
• Query: “Which pairs of stocks were correlated with a value of over 0.9 for the last three hours?”
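The query above can be answered directly by computing the Pearson correlation of every pair over the most recent window. The following is a minimal sketch of that exact (naive) approach, not the StatStream implementation; the function names, the dictionary layout of `streams`, and the sample prices are assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(streams, window, threshold=0.9):
    """Return the pairs whose correlation over the last `window` points exceeds `threshold`.

    `streams` maps a (hypothetical) ticker name to its list of prices.
    Every pair is examined, so the cost grows quadratically with the number of streams.
    """
    recent = {name: values[-window:] for name, values in streams.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(recent), 2)
        if pearson(recent[a], recent[b]) > threshold
    ]


# Made-up prices for three hypothetical tickers.
streams = {
    "ABC": [1, 2, 3, 4, 5],
    "XYZ": [2, 4, 6, 8, 11],
    "QRS": [5, 3, 4, 1, 2],
}
print(correlated_pairs(streams, window=4))
```

Because every pair is recomputed from scratch, this approach only scales to a small number of streams.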
StatStream (VLDB,2002): Example
XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC …
Online Detection of High Correlation
• Given tens of thousands of high-speed time series data streams, detect high-value correlations, both synchronized and time-lagged, over sliding windows in real time.
• Real time
– high update frequency of the data stream
– fixed response time, online
Online Detection of High Correlation
Correlated!
StatStream: Algorithm
• Naive algorithm
– N : number of streams
– w : size of sliding window
– space O(N) and time O(N^2 w) vs. space O(N^2) and time O(N^2)
• Suppose that the streams are updated every second.
– With a Pentium 4 PC, the exact computing method can only monitor 700 streams with a delay of 2 minutes.
• Our approach
– Use Discrete Fourier Transform to approximate correlation
– Use grid structure to filter out unlikely pairs
– Our approach can monitor 10,000 streams with a delay of 2 minutes.
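One way to see why a few Fourier coefficients can stand in for the full window: after each window is shifted to zero mean and scaled to unit norm, the correlation equals 1 − d²/2, where d is the Euclidean distance between the normalized windows. A unitary DFT preserves that distance, so dropping coefficients can only shrink it. The sketch below illustrates this bound only. It is not the StatStream code, it omits the grid structure, and all function names and the sample data are assumptions.

```python
import cmath
import math


def normalize(window):
    """Shift to zero mean and scale to unit Euclidean norm (window must not be constant)."""
    mean = sum(window) / len(window)
    centered = [v - mean for v in window]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def dft_prefix(x, k):
    """First k coefficients of the unitary DFT of x, computed directly."""
    w = len(x)
    scale = 1 / math.sqrt(w)
    return [
        scale * sum(x[t] * cmath.exp(-2j * math.pi * f * t / w) for t in range(w))
        for f in range(k)
    ]


def exact_correlation(x, y):
    """Pearson correlation as the dot product of the normalized windows."""
    return sum(a * b for a, b in zip(normalize(x), normalize(y)))


def correlation_upper_bound(x, y, k):
    """Upper bound on the correlation using only k DFT coefficients per window.

    The truncated distance never exceeds the true distance, so
    1 - d_k^2 / 2 never falls below the true correlation.
    """
    fx, fy = dft_prefix(normalize(x), k), dft_prefix(normalize(y), k)
    d_squared = sum(abs(a - b) ** 2 for a, b in zip(fx, fy))
    return 1 - d_squared / 2


x = [1.0, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 1, 4, 3, 6, 5, 8, 7]
print(exact_correlation(x, y), correlation_upper_bound(x, y, 3))
```

A pair whose upper bound is already below the query threshold can be discarded without computing the exact correlation, so the filter produces no false negatives.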
Empirical Study : Speed
[Figure: Comparison of processing time. x-axis: Number of Streams (200 to 1600); y-axis: Wall Clock Time (seconds, 0 to 800); series: Exact, DFT]
Our algorithm is parallelizable.
Sketches : Random Projection
• Correlation between time series of stock returns
– Since most stock price time series are close to random walks, their return time series are close to white noise
– DFT/DWT can’t capture approximate white noise series because there is no clear trend (too many frequency components).
• Solution: Sketches (a form of random landmark)
– Sketches pool: matrix of random variables drawn from a stable distribution
– Sketches: the random projection of all time series to lower dimensions by multiplication with the same matrix
– The Euclidean distance (correlation) between time series is approximated by the distance between their sketches with a probabilistic guarantee.
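The random projection idea can be sketched in a few lines. This example assumes Gaussian entries (the Gaussian is a stable distribution), a scaling of 1/√k so that squared distances are preserved in expectation, and arbitrary sizes and seeds; none of these specifics come from the slides.

```python
import math
import random


def make_sketch_matrix(k, w, seed=0):
    """k x w matrix of independent standard Gaussian entries (the 'sketch pool')."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(w)] for _ in range(k)]


def sketch(series, matrix):
    """Project a length-w series to k dimensions using the shared matrix."""
    scale = 1 / math.sqrt(len(matrix))
    return [scale * sum(r * v for r, v in zip(row, series)) for row in matrix]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two white-noise-like series of length 1000, reduced to 200 dimensions.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(1000)]
y = [rng.gauss(0.0, 1.0) for _ in range(1000)]
matrix = make_sketch_matrix(200, 1000)

true_distance = euclidean(x, y)
estimated_distance = euclidean(sketch(x, matrix), sketch(y, matrix))
print(true_distance, estimated_distance)
```

Every series must be multiplied by the same matrix; otherwise distances between sketches are not comparable.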
Burst Detection
Burst Detection: Applications
• Discovering intervals with unusually large numbers of events.
– In astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. A burst might last milliseconds or days.
– In telecommunications, if the number of packets lost within a certain time period exceeds some threshold, it might indicate a network anomaly. The exact duration is unknown.
– In finance, stocks with unusually high trading volumes should attract the notice of traders (or perhaps regulators).
Bursts across different window sizes in Gamma Rays
• Challenge : to discover not only the time of the burst, but also the duration of the burst.
Elastic Burst Detection: Problem Statement
• Problem: Given a time series of positive numbers $x_1, x_2, \dots, x_n$ and a threshold function $f(w)$, $w = 1, 2, \dots, n$, find the subsequences of any size such that their sums are above the thresholds:
– all $0 < w < n$, $0 < m < n-w$, such that $x_m + x_{m+1} + \dots + x_{m+w-1} \ge f(w)$
• Brute force search : O(n^2) time
• Our shifted wavelet tree (SWT): O(n+k) time.
– k is the size of the output, i.e. the number of windows with bursts
Burst Detection: Data Structure and Algorithm
– Define the threshold for a node of size $2^k$ to be the threshold for a window of size $1 + 2^{k-1}$
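The following sketch shows one way the pruning idea can work; it is an illustration, not the authors' code. It assumes that level-k nodes have size 2^k and start at every multiple of 2^(k-1), so that every window of size at most 1 + 2^(k-1) lies inside some level-k node. Each level is made responsible for the window sizes not covered by the level below, and a node triggers a detailed search when its sum reaches the smallest threshold among those sizes. That node threshold is a conservative choice made for this sketch and may differ from the rule on the slide. Indices are 0-based, and a brute-force search is included as a reference.

```python
from itertools import accumulate


def brute_force_bursts(x, f):
    """All (start, size) windows whose sum is at least f(size); quadratic time."""
    n = len(x)
    prefix = [0] + list(accumulate(x))
    return {
        (m, w)
        for w in range(1, n + 1)
        for m in range(n - w + 1)
        if prefix[m + w] - prefix[m] >= f(w)
    }


def swt_bursts(x, f):
    """Same output as brute_force_bursts, but skips regions whose node sum is too small.

    Relies on the values being positive: a window's sum never exceeds
    the sum of a node that contains it.
    """
    n = len(x)
    prefix = [0] + list(accumulate(x))
    bursts = set()
    level, lo = 1, 1  # lo: smallest window size this level is responsible for
    while lo <= n:
        node_size = 2 ** level
        shift = node_size // 2
        hi = min(shift + 1, n)  # largest window size guaranteed to fit in a node
        sizes = range(lo, hi + 1)
        node_threshold = min(f(w) for w in sizes)
        for start in range(0, n, shift):
            end = min(start + node_size, n)  # nodes are clipped at the series end
            if prefix[end] - prefix[start] < node_threshold:
                continue  # no window inside this node can be a burst
            for w in sizes:  # detailed search inside the alarming node
                for m in range(start, end - w + 1):
                    if prefix[m + w] - prefix[m] >= f(w):
                        bursts.add((m, w))
        lo, level = hi + 1, level + 1
    return bursts


x = [1, 2, 10, 1, 1, 3, 9, 8, 1, 2, 1]
print(sorted(swt_bursts(x, lambda w: 6 * w)))
```

When bursts are rare, most nodes fall below their threshold and the detailed search is skipped, which is where the savings over the brute-force search come from.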
Empirical Study : Stock Price Spread Burst
[Figure: Processing time vs. Number of Windows. x-axis: Number of Windows (0 to 50); y-axis: Processing time (ms), ticks from 1 to 1000000 in powers of ten; series: SWT Algorithm, Direct Algorithm]
Elastic Burst in two dimensions
• Population Distribution in the US
Summary
• Able to detect bursts of many different durations in essentially linear time.
• Can be used both for time series and for spatial searching.
• Can specify thresholds either with absolute numbers or with probability of hit.
• Algorithm is simple to implement and has low constants (code is available).
• Ok, it’s embarrassingly simple.
With a Little Help From My Warped Correlation
• Karen’s humming Match:
• Dennis’s humming Match:
• “What would you do if I sang out of tune?”
• Yunyue’s humming Match:
Related Work in Query by Humming
• Traditional method: String Matching [Ghias et al. 95, McNab et al. 97, Uitdenbogerd and Zobel 99]
– Music represented by string of pitch directions: U, D, S (degenerated interval)
– Hum query is segmented into discrete notes, then converted to a string of pitch directions
– Edit distance between hum query and music score
• Problem
– Very hard to segment the hum query
– Partial solution: users are asked to hum articulately
• New method: matching directly from audio [Mazzoni and Dannenberg 00]
• Problem
– Slowed down by DTW
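The traditional string-matching method can be sketched as follows, assuming the hum has already been segmented into notes (the hard part noted above). The pitch values, the reading of U, D, S as up, down, and same, and the function names are illustrative assumptions; the edit distance is the standard Levenshtein distance.

```python
def pitch_contour(notes):
    """Encode a note sequence as pitch directions: U (up), D (down), S (same)."""
    steps = []
    for previous, current in zip(notes, notes[1:]):
        if current > previous:
            steps.append("U")
        elif current < previous:
            steps.append("D")
        else:
            steps.append("S")
    return "".join(steps)


def edit_distance(a, b):
    """Levenshtein distance between two strings, one row at a time."""
    previous_row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current_row = [i]
        for j, char_b in enumerate(b, start=1):
            current_row.append(
                min(
                    previous_row[j] + 1,  # delete char_a
                    current_row[j - 1] + 1,  # insert char_b
                    previous_row[j - 1] + (char_a != char_b),  # substitute or match
                )
            )
        previous_row = current_row
    return previous_row[-1]


score = pitch_contour([60, 62, 62, 59])  # hypothetical melody
hum = pitch_contour([55, 57, 58, 54])    # hypothetical segmented hum
print(score, hum, edit_distance(score, hum))
```

Because only the direction of each interval is kept, the encoding ignores absolute pitch, but it depends entirely on a correct note segmentation.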
Time Series Representation of Query
• An example hum query
• Note segmentation is hard!
[Figure: pitch values (0 to 70) vs. time (0 to 11 seconds) for the example hum query, annotated “Segment this!”]
How to deal with poor hum queries?
• No absolute pitch
– Solution: the average pitch is subtracted
• Incorrect tempo
– Solution: Uniform Time Warping
• Inaccurate pitch intervals
– Solution: return the k-nearest neighbors
• Local timing variations
– Solution: Dynamic Time Warping
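The first two fixes are simple preprocessing steps. The sketch below assumes pitch sequences sampled at regular intervals and uses nearest-index resampling as one simple form of uniform time warping; the function names and the sample values are hypothetical.

```python
def subtract_average_pitch(series):
    """Remove the dependence on absolute pitch (key) by centering on the mean."""
    mean = sum(series) / len(series)
    return [value - mean for value in series]


def uniform_time_warp(series, target_length):
    """Stretch or squeeze the whole series to target_length by nearest-index resampling."""
    n = len(series)
    return [series[i * n // target_length] for i in range(target_length)]


melody = [60, 62, 64]          # hypothetical stored melody
hum = [50, 50, 52, 52, 54, 54]  # same tune, lower key, half tempo

normalized_melody = subtract_average_pitch(uniform_time_warp(melody, len(hum)))
normalized_hum = subtract_average_pitch(hum)
print(normalized_melody == normalized_hum)
```

After both steps the two sequences have the same length and the same average, so they can be compared point by point or passed to a warping distance.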
Dynamic Time Warping
• Euclidean distance: sum of point-by-point distance
• DTW distance: allowing stretching or squeezing the time axis locally
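The textbook dynamic program for the DTW distance is sketched below. It uses squared pointwise differences and takes a square root at the end, which is one common convention and an assumption here; it applies no constraint on how far the warping path may stray, and it is not tied to any particular system.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences.

    cell (i, j) holds the cheapest cost of aligning a[:i] with b[:j];
    each step may advance in a, in b, or in both.
    """
    infinity = float("inf")
    previous_row = [0.0] + [infinity] * len(b)
    for value_a in a:
        current_row = [infinity]
        for j, value_b in enumerate(b, start=1):
            cost = (value_a - value_b) ** 2
            current_row.append(
                cost + min(previous_row[j], current_row[j - 1], previous_row[j - 1])
            )
        previous_row = current_row
    return math.sqrt(previous_row[-1])


steady = [0, 1, 2, 3, 2, 1, 0]
hesitant = [0, 0, 1, 2, 3, 2, 1, 0]  # same shape with the first point held longer
print(dtw_distance(steady, hesitant))
```

The running time is proportional to the product of the two lengths, which is why DTW-based matching is slow without further indexing.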
Dynamic Time Warping
[Figure: Time Series 1 and Time Series 2]
AQuery: A Database System for Order
Dennis Shasha, joint work with Alberto Lerner
{lerner,shasha}@cs.nyu.edu
Idea
• Whatever can be done on a table can be done on an ordered table (arrable). Not vice-versa.
• AQuery – query language on arrables
• Expresses many queries easily
• Elegant new optimizations.
And Streams?
• AQuery has no special facilities for streaming data, but it is expressive enough.
• Idea for streaming data is to split each table into an indexed table holding old data and a buffer table holding recent data.
• Optimizer works over both transparently.
Computational Biology
• Collaborations with several groups at NYU (plant and worm), Duke, Yale.
• Growth area: biologists need us, but we have a lot to learn.
• Big issues: control experimental space, evaluate data, infer an active (rather than just paper) model – combinatorial design.
• Visualization.
Sungear Design
• Generalizes Venn diagrams to more than three factors
• Visual outline is an ellipse having anchors on borders and vessels in the interior.
• Each vessel points to associated anchors.
• Linked views to hierarchies, lists, and graphs, so can simultaneously update data depending on user queries (selection events).
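The data model behind such a display can be sketched as a grouping step. This example assumes that each anchor stands for one set (factor) and that a vessel collects the items belonging to exactly the same combination of anchors. That reading, the function name, and the sample gene names are assumptions for illustration, not Sungear's actual code.

```python
from collections import defaultdict


def build_vessels(membership):
    """Group items by the exact combination of anchors they belong to.

    `membership` maps an item to the set of anchors (sets) containing it.
    Returns a dict from a frozenset of anchors to the sorted items in that vessel.
    """
    vessels = defaultdict(list)
    for item, anchors in membership.items():
        vessels[frozenset(anchors)].append(item)
    return {anchors: sorted(items) for anchors, items in vessels.items()}


# Hypothetical genes and four hypothetical experiments (anchors A to D).
membership = {
    "gene1": {"A", "B"},
    "gene2": {"A"},
    "gene3": {"A", "B"},
    "gene4": {"B", "C", "D"},
}
for anchors, items in build_vessels(membership).items():
    print(sorted(anchors), items)
```

Only combinations that actually occur produce a vessel, so the number of vessels is bounded by the number of items rather than by the number of possible anchor subsets.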
Venn Diagram: great for three factors
Sungear Principle
• “Sungear is stupid”
• Doesn’t care which kind of data it is representing, though there is built-in support for genes (because of links to GO and to Cytoscape).
• Basic Sungear representation could be used to describe anything from yachting gear to demographics.
Summary
• Hard problems with practical motivation.
• Fun algorithms – not afraid of heuristics.
• Fast, maintainable, portable applications.