Spark Summit EU talk by Larisa Sawyer

Post on 16-Apr-2017


Transcript of Spark Summit EU talk by Larisa Sawyer

SPARK SUMMIT EUROPE 2016

Distributed Time Series Analysis Framework For Spark

Larisa Sawyer, Two Sigma

November 1, 2016

[Figure: S&P 500, 1/3/1950 to 1/3/2016, with x-axis ticks every three years; y-axis from $0.0 to $2,500.0]

Time series examples

• Stock market prices
• Temperatures
• Height
• …

[Figure: temperature for New York and Brussels; y-axis from 18°C to 34°C]

[Figure: height against age (5 to 15 years) for Avg US female and Larisa; y-axis from 100cm to 180cm]

What do we do with time series data?

• Forecast future values given past observations

[Figure: corn price from 10/1 to 10/10; values shown are $8.90, $8.95, $8.90, $9.06, $9.10, and the next value is marked "?"]

Univariate time series

Multivariate time series

• We can forecast better by joining multiple time series
• Our framework enables fast distributed temporal join of large-scale unaligned time series
• Temporal join is a fundamental operation for time series analysis

[Figure: corn price and temperature from 10/1 to 10/10; corn price values shown are $8.90, $8.95, $8.90, $9.06, $9.10; temperature values shown are 75°F, 72°F, 71°F, 72°F, 68°F, 67°F, 65°F]

[Figure: the same chart in European units; corn price values shown are €7.94, €7.98, €7.94, €8.08, €8.12; temperature values shown are 23°C, 22°C, 21°C, 22°C, 20°C, 19°C, 18°C]

What is a left join?

[Figure: left join of time series 1 with time series 2]

What is temporal join?

• A particular join function defined by a matching criterion over time
• Examples of criteria
  • look-backward
  • look-forward

[Figure: look-forward join of time series 1 with time series 2]

[Figure: look-backward join of time series 1 with time series 2, with one observation labeled]

Temporal join with look-backward criteria

[Figure: a "tweets" table with rows at 08:00 AM, 10:00 AM, and 12:00 PM, and a "BRK.A" table with rows at 08:00 AM and 11:00 AM]

Important Legal Information


The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the solicitation of any offer to buy any security or other interest. We consider this information to be confidential and not for redistribution or dissemination.

Temporal join with look-backward criteria

[Figure, built up over four slides: the "tweets" table (rows at 08:00 AM, 10:00 AM, 12:00 PM) and the "BRK.A" table (rows at 08:00 AM, 11:00 AM) are joined into a table with columns time, tweets, and BRK.A and rows at 08:00 AM, 10:00 AM, and 12:00 PM]

Temporal joins in practice

[Figure: the "tweets" table (rows at 08:00 AM, 10:00 AM, 12:00 PM) and the "BRK.A" table (rows at 08:00 AM, 11:00 AM)]

Time Series Scale

[Figure: the "tweets" table (rows at 08:00 AM, 10:00 AM, 12:00 PM) and the "BRK.A" table (rows at 08:00 AM, 11:00 AM)]

We need fast and scalable distributed temporal join

Existing solutions

• Existing packages don’t support temporal join or can’t handle large time series
• Pandas / R / Matlab
  • Limited to a single machine
• Spark
  • Does scale, but all data is unordered
• spark-ts
  • Expects a univariate time series to fit on a single machine
  • Splits by column
  • Supports only snapshot data

Flint: A new time series library for Spark

• Goal
  • Provide a collection of functions to manipulate and analyze time series at scale
  • Group, temporal join, summarize, aggregate …
• How
  • Build a time series aware data structure
    • TimeSeriesRDD extends RDD
  • Optimize using temporal locality
    • Reduce shuffling
    • Reduce memory pressure by streaming

What is a TimeSeriesRDD?

• TimeSeriesRDD vs RDD
  • Associate a time range with each partition
  • Track partition time ranges
  • Preserve temporal order

RDD

Raw Data

time      temperature
6:00 AM   60°F
6:01 AM   61°F
…         …
7:00 AM   70°F
7:01 AM   71°F
…         …
8:00 AM   80°F
8:01 AM   81°F
…         …

RDD

(6:00 AM, 60°F) (6:01 AM, 61°F)
… (7:00 AM, 70°F) (7:01 AM, 71°F)
(8:00 AM, 80°F) (8:01 AM, 81°F)
(6:58 AM, 64°F) (6:59 AM, 65°F)
… (7:34 AM, 74°F) (7:35 AM, 74°F)
… (7:58 AM, 76°F) (7:59 AM, 77°F)

TimeSeriesRDD

[Figure: the same raw data and RDD as on the RDD slide, alongside a TSRDD in which each partition covers a time range]

TSRDD

[06:00 AM, 07:00 AM)   (6:00 AM, 60°F) (6:01 AM, 61°F)
[07:00 AM, 8:00 AM)    (7:00 AM, 70°F) (7:01 AM, 71°F)
[8:00 AM, ∞)           (8:00 AM, 80°F) (8:01 AM, 81°F)
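The idea of attaching a time range to each partition can be sketched in plain Python. This is only an illustration of the concept, not Flint's implementation: `hm` and `partition_by_time` are hypothetical helpers, and times are represented as minutes since midnight.

```python
from bisect import bisect_right

def hm(hour, minute=0):
    """Minutes since midnight, a stand-in for a real timestamp."""
    return hour * 60 + minute

def partition_by_time(rows, boundaries):
    """Split (time, value) rows into one partition per boundary.

    Partition i covers [boundaries[i], boundaries[i + 1]); the last
    partition is open-ended. Rows inside each partition are sorted by time.
    """
    partitions = [[] for _ in boundaries]
    for row in rows:
        i = bisect_right(boundaries, row[0]) - 1
        if i < 0:
            raise ValueError(f"row at time {row[0]} precedes the first boundary")
        partitions[i].append(row)
    for part in partitions:
        part.sort(key=lambda r: r[0])
    return partitions

# Unordered rows, as they might sit in a plain RDD.
rows = [
    (hm(7, 1), 71), (hm(6), 60), (hm(8, 1), 81),
    (hm(6, 1), 61), (hm(8), 80), (hm(7), 70),
]
# Ranges [6:00, 7:00), [7:00, 8:00), [8:00, infinity), as on the slide.
boundaries = [hm(6), hm(7), hm(8)]
partitions = partition_by_time(rows, boundaries)
```

Once every partition knows its range and keeps its rows in order, operations that only need nearby rows can work partition by partition instead of shuffling the whole data set.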

Group function

• A group function groups rows with exactly the same timestamps

time      city       temperature   group
1:00 PM   New York   70°F          group 1
1:00 PM   Brussels   60°F          group 1
2:00 PM   New York   71°F          group 2
2:00 PM   Brussels   61°F          group 2
3:00 PM   New York   72°F          group 3
3:00 PM   Brussels   62°F          group 3
4:00 PM   New York   73°F          group 4
4:00 PM   Brussels   63°F          group 4

Group function

• A group function groups rows with nearby timestamps

time      city       temperature
1:00 PM   New York   70°F
1:00 PM   Brussels   60°F
2:00 PM   New York   71°F
2:00 PM   Brussels   61°F
3:00 PM   New York   72°F
3:00 PM   Brussels   62°F
4:00 PM   New York   73°F
4:00 PM   Brussels   63°F

[The rows are divided into two groups: group 1 and group 2]
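The two grouping behaviors differ only in the key used to bucket rows. A minimal sketch, assuming hours are plain integers (13 for 1:00 PM) and that "nearby" means fixed two-hour buckets starting at the first timestamp; the slide does not state which rows fall into which of its two groups, so that bucketing is an assumption, and `group_rows` is a hypothetical helper rather than a Flint function.

```python
def group_rows(rows, key):
    """Group rows by key(row), keeping groups in first-seen order."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    return list(groups.values())

rows = [
    (13, "New York", 70), (13, "Brussels", 60),
    (14, "New York", 71), (14, "Brussels", 61),
    (15, "New York", 72), (15, "Brussels", 62),
    (16, "New York", 73), (16, "Brussels", 63),
]

# Exactly the same timestamps: one group per distinct hour.
exact = group_rows(rows, key=lambda r: r[0])

# Nearby timestamps: two-hour buckets measured from the first row (assumed).
start, width = rows[0][0], 2
nearby = group_rows(rows, key=lambda r: (r[0] - start) // width)
```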

Group in Spark

• Groups rows with exactly the same timestamps

[Figure: an RDD whose rows appear in the order 1:00PM, 2:00PM, 2:00PM, 1:00PM, 3:00PM, 3:00PM, 4:00PM, 4:00PM]

Group in Spark

• Data is shuffled and materialized on the workers
• Temporal order is not preserved
• Back to temporal order

[Figure: RDD → groupBy → RDD → sortBy → RDD, with rows at 1:00PM, 2:00PM, 2:00PM, 1:00PM, 3:00PM, 3:00PM, 4:00PM, 4:00PM]

Group in TimeSeriesRDD

• Data is grouped per partition locally as streams

[Figure: a TimeSeriesRDD whose partitions hold rows at 1:00PM, 2:00PM, 3:00PM, and 4:00PM; rows with the same timestamp are grouped within each partition]
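Grouping "locally as streams" can be pictured as a single pass over each time-ordered partition that holds only the current group in memory. The sketch below is an illustration under that assumption, not Flint code; `stream_groups` and `group_partitions` are hypothetical names, and it relies on rows with equal timestamps never being split across partitions.

```python
def stream_groups(sorted_rows):
    """Yield (time, rows) groups from rows already sorted by time.

    Only the group currently being built is kept in memory.
    """
    current_time, bucket = None, []
    for row in sorted_rows:
        if bucket and row[0] != current_time:
            yield current_time, bucket
            bucket = []
        current_time = row[0]
        bucket.append(row)
    if bucket:
        yield current_time, bucket

def group_partitions(partitions):
    """Group each partition independently; no data moves between partitions."""
    for part in partitions:
        yield from stream_groups(iter(part))

partitions = [
    [(13, "New York"), (13, "Brussels"), (14, "New York"), (14, "Brussels")],
    [(15, "New York"), (15, "Brussels"), (16, "New York"), (16, "Brussels")],
]
groups = list(group_partitions(partitions))
```

Because the partitions are already ordered and cover disjoint time ranges, the output comes out in temporal order without a shuffle or a later sort.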

Benchmark for group + count

• Running time of count after group
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS

[Figure: bar chart of running time (0s to 100s) at sizes 20M, 40M, 60M, 80M, and 100M for TimeseriesRDD, DataFrame, and RDD; annotated "50 - 100X" and "5 - 10X"]

Temporal join

• A temporal join function is defined by a matching criterion over time
• A typical matching criterion has two parameters
  • direction: look-backward or look-forward
  • window: how far to look backward or look forward

[Figure: look-backward temporal join, with the window marked]
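The two parameters can be made concrete with a small single-machine sketch. This is not Flint's API: `temporal_join` is a hypothetical function, times are plain numbers (hours), both inputs are assumed sorted by time, and the window is treated as inclusive.

```python
from bisect import bisect_left, bisect_right

def temporal_join(left, right, direction="backward", window=None):
    """Left-join (time, value) rows of `left` with the nearest `right` row.

    direction="backward" picks the latest right row at or before each left
    time; "forward" picks the earliest right row at or after it. A match
    further away than `window` is dropped (None). window=None means no limit.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, left_value in left:
        match = None
        if direction == "backward":
            i = bisect_right(right_times, t) - 1
            if i >= 0 and (window is None or t - right_times[i] <= window):
                match = right[i][1]
        elif direction == "forward":
            i = bisect_left(right_times, t)
            if i < len(right) and (window is None or right_times[i] - t <= window):
                match = right[i][1]
        else:
            raise ValueError(f"unknown direction: {direction!r}")
        joined.append((t, left_value, match))
    return joined

# Illustrative data: hours as numbers, values invented for the example.
left = [(1, "a"), (2, "b"), (4, "c"), (5, "d")]
right = [(1, "x"), (3, "y"), (5, "z")]

backward_1h = temporal_join(left, right, direction="backward", window=1)
backward_30m = temporal_join(left, right, direction="backward", window=0.5)
forward_1h = temporal_join(left, right, direction="forward", window=1)
```

Shrinking the window turns some matches into misses: with a half-hour look-back, the 2:00 and 4:00 rows no longer find a partner.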

Temporal join

• Temporal join with look-back criterion and a window of 1 hour

[Figure: time series 1 with rows at 1:00AM, 2:00AM, 4:00AM, and 5:00AM; time series 2 with rows at 1:00AM, 3:00AM, and 5:00AM]

Temporal join

• Temporal join with look-back criterion and a window of 1 hour
• How do we do temporal join in TimeSeriesRDD?

[Figure: two TimeSeriesRDDs, one with rows at 1:00AM, 2:00AM, 4:00AM, and 5:00AM, the other with rows at 1:00AM, 3:00AM, and 5:00AM]

Temporal join in TimeSeriesRDD

• Temporal join with look-back criterion and a window of 1 hour
• Partition time space into disjoint intervals

[Figure: the two input TimeSeriesRDDs (rows at 1:00AM, 2:00AM, 4:00AM, 5:00AM and at 1:00AM, 3:00AM, 5:00AM) next to the joined TimeSeriesRDD]

Temporal join in TimeSeriesRDD

• Temporal join with look-back criterion and a window of 1 hour
• Build dependency graph for the joined TimeSeriesRDD

[Figure: the two input TimeSeriesRDDs (rows at 1:00AM, 2:00AM, 4:00AM, 5:00AM and at 1:00AM, 3:00AM, 5:00AM) next to the joined TimeSeriesRDD; partition ranges shown are [1:00 AM, 4:00 AM) and [4:00 AM, 6:00 AM)]
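One way to picture the dependency graph: a joined partition covering [begin, end) with a look-back window needs right-hand rows from [begin - window, end), so it depends on every right-hand partition that overlaps that interval. The sketch below assumes exactly this rule and half-open ranges in hours; `partition_dependencies` is a hypothetical helper, not a Flint function.

```python
def partition_dependencies(left_ranges, right_ranges, window):
    """For each left range [begin, end), list the indices of right ranges
    that a look-backward join with the given window has to read."""
    deps = []
    for begin, end in left_ranges:
        lo, hi = begin - window, end  # right rows needed lie in [lo, hi)
        deps.append([
            j for j, (r_begin, r_end) in enumerate(right_ranges)
            if r_begin < hi and r_end > lo
        ])
    return deps

# Ranges from the slide, in hours: [1:00 AM, 4:00 AM) and [4:00 AM, 6:00 AM).
ranges = [(1, 4), (4, 6)]
deps = partition_dependencies(ranges, ranges, window=1)
```

With a one-hour window, the second joined partition also reads the first right-hand partition, because a row at 4:00 AM may match a right-hand row at 3:00 AM.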

Temporal join in TimeSeriesRDD

• Temporal join with look-back criterion and a window of 1 hour
• Join data as streams per partition

[Figure: the 1:00AM rows from both input TimeSeriesRDDs are streamed into the joined TimeSeriesRDD; the remaining rows (2:00AM, 4:00AM, 5:00AM and 3:00AM, 5:00AM) are still in the inputs]

Temporal join in TimeSeriesRDD

• Temporal join with look-back criterion and a window of 1 hour
• Join data as streams

[Figure, built up over three slides: rows are streamed from the two input TimeSeriesRDDs into the joined TimeSeriesRDD. Final joined rows: (1:00AM, 1:00AM), (2:00AM, 1:00AM), (4:00AM, 3:00AM), (5:00AM, 5:00AM)]
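Joining "as streams" can be sketched as a two-pointer merge over two time-ordered iterators that keeps only one right-hand row in memory at a time. This is an illustration under stated assumptions, not Flint's implementation: `stream_look_backward_join` is a hypothetical name, times are hours as numbers, the window is inclusive, and the row values are invented.

```python
def stream_look_backward_join(left, right, window):
    """Lazily join two time-sorted streams of (time, value) rows.

    Each left row is paired with the latest right row at or before its time,
    provided that row is no more than `window` earlier; otherwise None.
    """
    right = iter(right)
    last = None                   # latest right row with time <= current left time
    pending = next(right, None)   # next right row not yet passed
    for t, left_value in left:
        while pending is not None and pending[0] <= t:
            last = pending
            pending = next(right, None)
        if last is not None and t - last[0] <= window:
            yield (t, left_value, last[1])
        else:
            yield (t, left_value, None)

# Timestamps from the slides; the values are placeholders.
series_1 = [(1, "a"), (2, "b"), (4, "c"), (5, "d")]
series_2 = [(1, "x"), (3, "y"), (5, "z")]
joined = list(stream_look_backward_join(iter(series_1), iter(series_2), window=1))
```

The pairing of timestamps matches the final slide of the walkthrough: 1:00AM and 2:00AM pick up the 1:00AM row, 4:00AM picks up 3:00AM, and 5:00AM picks up 5:00AM.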

Benchmark for temporal join + count

• Running time of count after temporal join
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS

[Figure: bar chart of running time (0s to 100s) at sizes 20M, 40M, 60M, 80M, and 100M for TimeseriesRDD, DataFrame, and RDD; annotated "20 - 50X" and "5 - 10X"]

Functions over TimeSeriesRDD

• Grouping functions
• Temporal joins such as look-forward, look-backward, etc.
• Summarizers such as average, variance, z-score, etc. over grouping functions
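A summarizer applied over groups can be thought of as a reduction run once per group. The sketch below assumes groups arrive as (time, rows) pairs and uses population variance; `summarize` and the dictionary layout are hypothetical and do not reflect Flint's summarizer API.

```python
from statistics import mean, pvariance

def summarize(groups, column, summarizers):
    """Apply each named summarizer to one column of every (time, rows) group."""
    out = []
    for time, rows in groups:
        values = [row[column] for row in rows]
        out.append((time, {name: fn(values) for name, fn in summarizers.items()}))
    return out

groups = [
    (13, [{"city": "New York", "temperature": 70.0},
          {"city": "Brussels", "temperature": 60.0}]),
    (14, [{"city": "New York", "temperature": 71.0},
          {"city": "Brussels", "temperature": 61.0}]),
]
stats = summarize(groups, "temperature", {"average": mean, "variance": pvariance})
```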

Open Source

• True!
• https://github.com/twosigma/flint

What’s next?

• TimeSeriesDataframe / TimeSeriesDataset
• Speed up
• Richer APIs
• Python bindings
• Additional summarizers

Key contributors

• Christopher Aycock
• Yuri Bogomolov
• Jonathan Coveney
• Li Jin
• David Medina
• Julia Meinwald
• David Palaitis
• Larisa Sawyer
• Leif Walsh
• Wenbo Zhao

Flint: Time Series For Spark

A library for general time series analysis operations at massive scale

Anne Hathaway has nothing to do with Berkshire Hathaway

Check it out in open source, and contribute

https://github.com/twosigma/flint

SPARK SUMMIT EUROPE 2016

THANK YOU.

Larisa.Sawyer@twosigma.com