New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
![Page 1: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/1.jpg)
PySpark for Time Series Analysis
David Palaitis Two Sigma Investments
![Page 2: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/2.jpg)
About Me
![Page 3: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/3.jpg)
Important Legal Information
The information presented here is offered for recruiting purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the solicitation of any offer to buy any security or other interest. We consider this information to be confidential and not for redistribution or dissemination. Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
![Page 4: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/4.jpg)
![Page 5: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/5.jpg)
Time Series
IoT feeds
sensor data
economic data
An ordered sequence of values of a variable
![Page 6: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/6.jpg)
Time Series Analysis
![Page 7: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/7.jpg)
Time Series Analysis
![Page 8: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/8.jpg)
Time Series Analysis
![Page 9: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/9.jpg)
Time Series at Two Sigma
Millions of Time Series
Big and Small (1GB – 1PB)
Narrow (10 columns) and Wide (1MM columns)
Evenly and Unevenly Spaced Observations
![Page 10: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/10.jpg)
![Page 11: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/11.jpg)
Let’s start from the beginning …
![Page 12: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/12.jpg)
![Page 13: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/13.jpg)
![Page 14: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/14.jpg)
![Page 15: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/15.jpg)
![Page 16: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/16.jpg)
![Page 17: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/17.jpg)
![Page 18: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/18.jpg)
![Page 19: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/19.jpg)
![Page 20: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/20.jpg)
![Page 21: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/21.jpg)
Examples!
![Page 22: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/22.jpg)
What’s Missing?
![Page 23: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/23.jpg)
You can’t even do “Word Count”
![Page 24: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/24.jpg)
![Page 25: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/25.jpg)
![Page 26: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/26.jpg)
![Page 27: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/27.jpg)
![Page 28: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/28.jpg)
![Page 29: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/29.jpg)
![Page 30: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/30.jpg)
![Page 31: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/31.jpg)
![Page 32: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/32.jpg)
“Word Count”!
![Page 33: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/33.jpg)
What’s missing? Time.
![Page 34: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/34.jpg)
Windowed Aggregations
![Page 35: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/35.jpg)
Temporal Joins
(figure label: window)
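To make the idea of a temporal join concrete, here is a minimal pure-Python sketch. It assumes the join attaches to each left-hand row the most recent right-hand row whose timestamp falls within a look-back window; that reading of the slide is an assumption. The function name `temporal_left_join`, the `tolerance` parameter, and the sample data are hypothetical, and none of this is Flint's or Spark's API.

```python
from bisect import bisect_right


def temporal_left_join(left, right, tolerance):
    """Join two time-sorted lists of (timestamp, value) rows.

    For each left row, attach the value of the latest right row whose
    timestamp is <= the left timestamp and no more than `tolerance` older.
    Left rows without such a match get None.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, left_value in left:
        # Index of the last right row with timestamp <= t (or -1 if none).
        i = bisect_right(right_times, t) - 1
        if i >= 0 and t - right_times[i] <= tolerance:
            joined.append((t, left_value, right[i][1]))
        else:
            joined.append((t, left_value, None))
    return joined


# Hypothetical data: timestamps are in milliseconds.
trades = [(1000, "buy"), (2500, "sell"), (9000, "buy")]
quotes = [(900, 10.0), (2400, 10.5), (3000, 11.0)]
result = temporal_left_join(trades, quotes, tolerance=500)
```

The last trade stays unmatched because the nearest earlier quote is 6000 ms old, well outside the 500 ms window. An exact-match join on the timestamp would have matched nothing at all here, which is why a window is needed.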
![Page 36: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/36.jpg)
![Page 37: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/37.jpg)
![Page 38: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/38.jpg)
![Page 39: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/39.jpg)
![Page 40: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/40.jpg)
w is a window specification, e.g. 500ms, 5s, or 3 business days
RDD[(K, V)] => RDD[(K, Seq[V])]
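The signature above can be illustrated with a small pure-Python sketch. It assumes that the key `K` is a timestamp in milliseconds, that rows arrive sorted by time, and that each row's window covers the preceding `w` milliseconds up to and including the row itself. The name `group_by_window` and the sample data are hypothetical; this is not Flint's implementation.

```python
def group_by_window(rows, w):
    """Map sorted (timestamp_ms, value) rows to (timestamp_ms, [values]).

    Each output row carries the values of all input rows, up to and
    including the current one, whose timestamp lies in [t - w, t].
    """
    out = []
    start = 0  # index of the oldest row still inside the current window
    for i, (t, _) in enumerate(rows):
        # Slide the left edge forward past rows that fell out of the window.
        while rows[start][0] < t - w:
            start += 1
        out.append((t, [v for _, v in rows[start:i + 1]]))
    return out


# Hypothetical ticks with a 500 ms look-back window.
ticks = [(0, 1.0), (200, 2.0), (600, 3.0), (700, 4.0)]
grouped = group_by_window(ticks, w=500)
```

Because the input is sorted, the left edge of the window only ever moves forward, so a single pass finds every window's boundaries.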
![Page 41: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/41.jpg)
reduceByWindow(f: (V, V) => V, w):
RDD[(K, V)] => RDD[(K, V)]
![Page 42: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/42.jpg)
reduceByWindow(f: (V, V) => V, w):
RDD[(K, V)] => RDD[(K, V)]
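A windowed reduction with this shape could look roughly like the pure-Python sketch below. It assumes timestamps in milliseconds as keys, time-sorted input, and a look-back window of length `w` that includes the current row. The name `reduce_by_window` mirrors the slide, but the body is an illustrative assumption rather than the Flint implementation.

```python
from functools import reduce


def reduce_by_window(rows, f, w):
    """Fold the values in each row's look-back window with f: (V, V) -> V.

    `rows` is a list of (timestamp_ms, value) pairs sorted by timestamp.
    Returns one (timestamp_ms, reduced_value) pair per input row.
    """
    out = []
    start = 0  # index of the oldest row still inside the current window
    for i, (t, _) in enumerate(rows):
        while rows[start][0] < t - w:
            start += 1
        values = [v for _, v in rows[start:i + 1]]
        # The window always contains the current row, so it is never empty.
        out.append((t, reduce(f, values)))
    return out


# Hypothetical ticks with a 500 ms look-back window.
ticks = [(0, 1.0), (200, 2.0), (600, 3.0), (700, 4.0)]
rolling_sum = reduce_by_window(ticks, lambda a, b: a + b, w=500)
rolling_max = reduce_by_window(ticks, max, w=500)
```

The output keeps one row per input row, so a reduction over windows preserves the series' timestamps while replacing each value with its windowed aggregate.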
![Page 43: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/43.jpg)
https://github.com/twosigma/flint
![Page 44: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/44.jpg)
Getting Started …
![Page 45: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/45.jpg)
![Page 46: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/46.jpg)
![Page 47: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/47.jpg)
![Page 48: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/48.jpg)
![Page 49: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/49.jpg)
![Page 50: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/50.jpg)
![Page 51: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/51.jpg)
![Page 52: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/52.jpg)
Looking ahead.
![Page 53: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis](https://reader031.fdocuments.in/reader031/viewer/2022021814/58abf07e1a28ab504e8b6495/html5/thumbnails/53.jpg)
Thank You.
Find me after the talk to see Flint in action.