Classical Distributed Computing Studies. Washington DC Apache Spark Interactive Meetup 2015-09-22


Transcript of Classical Distributed Computing Studies. Washington DC Apache Spark Interactive Meetup 2015-09-22

Page 1:

Page 2:

Classical Distributed Computing Studies

title inspired by http://prog21.dadgum.com/210.html

Page 3:

Can Catalyst save us from Amdahl's Law?

(Sorry, no.)

Page 4:

Gene Amdahl, born 1922 in South Dakota

(CC BY 2.0) https://www.flickr.com/photos/mwichary/

Page 5:

A WWII Navy veteran, he went to South Dakota State, then got into Wisconsin for theoretical physics.

Page 6:

While working with slide rules on physics calculations, he thought the whole thing could be faster if he made a computer to do it.

Page 7:

So he did.

WISC: the Wisconsin Integrally Synchronized Computer

(CC BY 2.0) https://www.flickr.com/photos/pargon/

Page 8:

6J6 and 12AU7 vacuum tubes

Magnetic drum memory

(CC BY 2.0) https://www.flickr.com/photos/mwichary/

Page 9:

The first non-government-sponsored computer.

CC BY 2.0 https://www.flickr.com/photos/mwichary/

Page 10:

Invented floating point

CC BY 2.0 https://www.flickr.com/photos/mwichary/

Page 11:

When he filed a patent on floating point, he found out that von Neumann had already done so.

http://pages.cs.wisc.edu/~bezenek/Stuff/amdahl_thesis.pdf

Page 12:

http://pages.cs.wisc.edu/~bezenek/Stuff/amdahl_thesis.pdf

Page 13:

Hired immediately by IBM, where he worked on the arithmetic unit for the IBM 704; he later became chief architect of the IBM System/360.

Page 14:

Worked on STRETCH, IBM's first transistorized supercomputer.

via https://en.wikipedia.org/wiki/IBM_7030_Stretch

Page 15:

photo CC BY https://www.flickr.com/photos/jurvetson/

Then he founded Amdahl Corporation in partnership with Fujitsu.

The air-cooled Amdahl 470: the first clone of the IBM S/370!

Page 16:

Memo while still at IBM: "Validity of the single processor approach to achieving large scale computing capabilities."

This creates what is known as Amdahl's Law.

Page 17:

There is no equation in the memo, which has led to the law being written many different ways. But it's easiest to understand graphically.

Page 18:

[Diagram: a bar split into a serial portion and a parallelizable portion, comparing total run time on 1 processor with total run time under infinite parallelization, where only the serial portion remains.]
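In symbols, one common formulation (the memo itself never wrote one down; here p is the parallelizable fraction of the work and N the number of processors):

    S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}

For example, if 5% of a job is serial (p = 0.95), even infinite parallelization caps the speedup at 20x.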

Page 19:

If you're familiar with the Critical Path Method from business or operations research, or if you've ever worked in a restaurant or on an assembly line, Amdahl's Law should be common sense.

Page 20:

Now some other historical notes, eventually tying back to Spark. :)

Page 21:

Rear Admiral Grace Hopper (1906-1992)

https://www.youtube.com/watch?v=JEpsKnWZrJ8

Page 22:

Rear Admiral Grace Hopper (1906-1992)

https://www.youtube.com/watch?v=JEpsKnWZrJ8

What do nanoseconds look like?

Page 23:

Table from Amdahl's PhD thesis (1952)

Page 24:

https://gist.github.com/jboner/2841832

Latency Comparison Numbers
--------------------------
L1 cache reference                         0.5 ns
Branch mispredict                            5 ns
L2 cache reference                           7 ns              14x L1 cache
Mutex lock/unlock                           25 ns
Main memory reference                      100 ns              20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000 ns
Send 1K bytes over 1 Gbps network       10,000 ns    0.01 ms
Read 4K randomly from SSD*             150,000 ns    0.15 ms
Read 1 MB sequentially from memory     250,000 ns    0.25 ms
Round trip within same datacenter      500,000 ns    0.5  ms
Read 1 MB sequentially from SSD*     1,000,000 ns    1    ms   4x memory
Disk seek                           10,000,000 ns   10    ms   20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000 ns   20    ms   80x memory, 20x SSD
Send packet CA->Netherlands->CA    150,000,000 ns  150    ms

Page 25:

L1 cache reference : 0:00:01
Branch mispredict : 0:00:10
L2 cache reference : 0:00:14
Mutex lock/unlock : 0:00:50
Main memory reference : 0:03:20
Compress 1K bytes with Zippy : 1:40:00
Send 1K bytes over 1 Gbps network : 5:33:20
Read 4K randomly from SSD : 3 days, 11:20:00
Read 1 MB sequentially from memory : 5 days, 18:53:20
Round trip within same datacenter : 11 days, 13:46:40
Read 1 MB sequentially from SSD : 23 days, 3:33:20
Disk seek : 231 days, 11:33:20
Read 1 MB sequentially from disk : 462 days, 23:06:40
Send packet CA->Netherlands->CA : 3472 days, 5:20:00

comment from https://gist.github.com/kofemann: a "humanized scale" in which the 0.5 ns L1 cache reference becomes 1 second
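The arithmetic is easy to reproduce. A minimal Python sketch (my own, not from the talk), assuming the scale above, which maps 0.5 ns to one second, i.e. two seconds per nanosecond:

from datetime import timedelta

# Latencies in nanoseconds, from https://gist.github.com/jboner/2841832
LATENCIES_NS = [
    ("L1 cache reference", 0.5),
    ("Branch mispredict", 5),
    ("L2 cache reference", 7),
    ("Mutex lock/unlock", 25),
    ("Main memory reference", 100),
    ("Compress 1K bytes with Zippy", 3000),
    ("Send 1K bytes over 1 Gbps network", 10000),
    ("Read 4K randomly from SSD", 150000),
    ("Read 1 MB sequentially from memory", 250000),
    ("Round trip within same datacenter", 500000),
    ("Read 1 MB sequentially from SSD", 1000000),
    ("Disk seek", 10000000),
    ("Read 1 MB sequentially from disk", 20000000),
    ("Send packet CA->Netherlands->CA", 150000000),
]

# Humanized scale: 0.5 ns (one L1 cache reference) becomes 1 second,
# so every nanosecond becomes 2 seconds.
SECONDS_PER_NS = 2

for name, ns in LATENCIES_NS:
    print(f"{name} : {timedelta(seconds=ns * SECONDS_PER_NS)}")

Running it reproduces the table above, down to "3472 days, 5:20:00" for the transatlantic round trip.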

Page 26:

American Documentation, Volume 20, Issue 1, pages 21–26, January 1969

Page 27:

What computerization and statistics can add...

Page 28:

Karen Spärck Jones FBA (1935-2007)

Page 29:

Karen Spärck Jones FBA (1935-2007)

Invented inverse document frequency.

http://nlp.cs.swarthmore.edu/~richardw/papers/sparckjones1972-statistical.pdf

“The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.”
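As a minimal sketch of the idea (my own toy example with made-up documents, using one standard modern form of the weight, not the paper's exact formula):

import math

def idf(term, documents):
    # Inverse document frequency: the fewer documents a term occurs in,
    # the more specific it is, and the more weight it gets.
    df = sum(1 for doc in documents if term in doc)  # document frequency
    return math.log(len(documents) / df) if df else 0.0

docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "ran"},
    {"the", "cat", "ran"},
]
print(idf("the", docs))  # 0.0   -- occurs everywhere, carries no signal
print(idf("dog", docs))  # ~1.1  -- rare, highly specific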

Page 30:

SparkSQL

Page 31:

The Promise of SparkSQL (the Catalyst planner)

Page 32:

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers
ON Orders.CustomerID = Customers.CustomerID;

Page 33:

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
JOIN Customers
ON Orders.CustomerID = Customers.CustomerID;

an imaginary SQL statement that could be parallelized
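Why could it be parallelized? Both tables can be hash-partitioned on the join key, after which each pair of partitions joins independently. A toy, single-process Python sketch of that idea (hypothetical data; not Spark's actual implementation):

def partition(rows, key, n):
    # Hash-partition rows on the join key; rows that can ever match
    # always land in the same bucket.
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[key]) % n].append(row)
    return buckets

orders = [{"OrderID": 1, "CustomerID": "c1", "OrderDate": "2015-09-01"},
          {"OrderID": 2, "CustomerID": "c2", "OrderDate": "2015-09-02"}]
customers = [{"CustomerID": "c1", "CustomerName": "Acme"},
             {"CustomerID": "c2", "CustomerName": "Bravo"}]

N = 4  # pretend each bucket pair lives on its own worker
for o_part, c_part in zip(partition(orders, "CustomerID", N),
                          partition(customers, "CustomerID", N)):
    # each of these bucket-pair joins could run on a separate machine
    for o in o_part:
        for c in c_part:
            if o["CustomerID"] == c["CustomerID"]:
                print(o["OrderID"], c["CustomerName"], o["OrderDate"])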

Page 34:

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
JOIN Customers
ON Orders.CustomerID = Customers.CustomerID;

Page 35:

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
JOIN Customers
ON Orders.CustomerID = Customers.CustomerID;

But what if Customers is on your local HDFS and Orders is in a data center at your warehouse?

Page 36:

Computerized query planning is the future, but for the time being you, the user, are going to have to recognize your latency issues.

Page 37:

Quick fix

Page 38:

(The latency comparison numbers from https://gist.github.com/jboner/2841832 again; see Page 24.)

Page 39:

Quick fix

CACHE [LAZY] TABLE [AS SELECT]
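For instance, a minimal PySpark sketch against the Spark 1.x SQLContext (table and application names here are hypothetical): caching the remote table once means repeated joins stop paying the cross-datacenter round trip.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cache-before-join")  # hypothetical app name
sqlContext = SQLContext(sc)

# LAZY defers materialization until the first query touches the table.
sqlContext.sql("CACHE LAZY TABLE Customers")

# AS SELECT eagerly materializes just the columns you actually need.
sqlContext.sql("""
  CACHE TABLE warehouse_orders AS
  SELECT OrderID, CustomerID, OrderDate FROM Orders
""")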

Page 40:

"Premature optimization is the root of all evil."

- Donald Knuth (misquoted)

Page 41:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified."

Donald Knuth, "Structured Programming with go to Statements," ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974

Page 42:

Thank You

Page 43: