Classical Distributed Computing Studies. Washington DC Apache Spark Interactive Meetup 2015-09-22
Classical Distributed
Computing Studies
title inspired by http://prog21.dadgum.com/210.html
Can Catalyst save us
from Amdahl's Law?
(Sorry, no.)
Gene Amdahl
born 1922
in South Dakota
(CC BY 2.0) https://www.flickr.com/photos/mwichary/
A WWII Navy veteran,
he went to SD State,
then only got into Wisconsin, for theoretical physics.
While working through physics calculations with slide rules, he
thought the whole thing could be faster if he built a computer
to do it.
So he did.
WISC
Wisconsin
Integrally
Synchronized
Computer
(CC BY 2.0) https://www.flickr.com/photos/pargon/
6J6 and 12AU7 Vacuum Tubes
Magnetic Drum Memory
CC BY 2.0 https://www.flickr.com/photos/mwichary/
The first non-government-sponsored computer.
CC BY 2.0 https://www.flickr.com/photos/mwichary/
Invented floating point
CC BY 2.0 https://www.flickr.com/photos/mwichary/
When he filed a patent on floating point, he found out that
von Neumann had already done so.
http://pages.cs.wisc.edu/~bezenek/Stuff/amdahl_thesis
http://pages.cs.wisc.edu/~bezenek/Stuff/amdahl_thesis.pdf
Hired immediately by IBM and worked on
the arithmetic unit for the IBM 360
Worked on STRETCH
the first transistorized IBM computer
via https://en.wikipedia.org/wiki/IBM_7030_Stretch
photo CC by https://www.flickr.com/photos/jurvetson/
Then he founded Amdahl Corporation in partnership with Fujitsu.
The air-cooled Amdahl 470:
the first plug-compatible clone of the IBM S/370!
Memo while still at IBM:
Validity of the single processor approach to achieving large
scale computing capabilities
It created what is known as Amdahl’s Law.
No equation in the memo, which has led to it
being written many different ways.
But it’s easiest to understand graphically.
[diagram: a bar of total run time split into a serial part and a parallelizable part, comparing total run time on 1 processor with total run time under infinite parallelization]
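One common way to write the law (the memo itself contains no equation): with a fraction p of the work parallelizable and the rest serial, n processors give a speedup of 1 / ((1 − p) + p/n). A quick sketch:

```python
# Amdahl's law: with a fraction p of the work parallelizable and the
# rest serial, n processors give a speedup of 1 / ((1 - p) + p / n).
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Even at 95% parallelizable, the serial 5% caps the speedup at 20x:
print(round(amdahl_speedup(0.95, 10), 2))      # → 6.9
print(round(amdahl_speedup(0.95, 10**9), 2))   # → 20.0
```

As n goes to infinity the p/n term vanishes, leaving 1 / (1 − p): the serial fraction alone sets the ceiling.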
If you’re familiar with the Critical Path Method from
business or operations research,
or if you’ve ever worked in a restaurant
or on an assembly line,
Amdahl’s Law should be common sense.
Now some other
historical notes
eventually tying to
Spark. :)
Rear Admiral Grace Hopper
1906-1992
https://www.youtube.com/watch?v=JEpsKnWZrJ8
what do nanoseconds look like?
Table from Amdahl’s PhD Thesis
(1952)
https://gist.github.com/jboner/2841832
```
Latency Comparison Numbers
--------------------------
L1 cache reference                         0.5 ns
Branch mispredict                            5 ns
L2 cache reference                           7 ns             14x L1 cache
Mutex lock/unlock                           25 ns
Main memory reference                      100 ns             20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000 ns
Send 1K bytes over 1 Gbps network       10,000 ns    0.01 ms
Read 4K randomly from SSD*             150,000 ns    0.15 ms
Read 1 MB sequentially from memory     250,000 ns    0.25 ms
Round trip within same datacenter      500,000 ns    0.5  ms
Read 1 MB sequentially from SSD*     1,000,000 ns    1    ms  4x memory
Disk seek                           10,000,000 ns   10    ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000 ns   20    ms  80x memory, 20x SSD
Send packet CA->Netherlands->CA    150,000,000 ns  150    ms
```
```
L1 cache reference                 : 0:00:01
Branch mispredict                  : 0:00:10
L2 cache reference                 : 0:00:14
Mutex lock/unlock                  : 0:00:50
Main memory reference              : 0:03:20
Compress 1K bytes with Zippy       : 1:40:00
Send 1K bytes over 1 Gbps network  : 5:33:20
Read 4K randomly from SSD          : 3 days, 11:20:00
Read 1 MB sequentially from memory : 5 days, 18:53:20
Round trip within same datacenter  : 11 days, 13:46:40
Read 1 MB sequentially from SSD    : 23 days, 3:33:20
Disk seek                          : 231 days, 11:33:20
Read 1 MB sequentially from disk   : 462 days, 23:06:40
Send packet CA->Netherlands->CA    : 3472 days, 5:20:00
```
comment from https://gist.github.com/kofemann
“humanized scale” where 1ns = 1s
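The conversion takes a couple of lines. (Note that reproducing the numbers on this slide actually requires mapping 0.5 ns to 1 s, a factor of two on top of the stated "1 ns = 1 s".)

```python
from datetime import timedelta

# Humanized scale as used above: 0.5 ns becomes 1 s, i.e. multiply
# each latency (in ns) by 2 and read the result as seconds.
def humanize(ns: float) -> str:
    return str(timedelta(seconds=ns * 2))

print(humanize(100))         # main memory reference → 0:03:20
print(humanize(10_000_000))  # disk seek             → 231 days, 11:33:20
```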
American Documentation
Volume 20, Issue 1, pages 21–26, January 1969
What computerization and statistics
can add...
Karen Spärck Jones FBA
(1935-2007)
Invented Inverse Document Frequency
http://nlp.cs.swarthmore.edu/~richardw/papers/sparckjones1972-statistical.pdf
“The specificity of a term can be
quantified as an inverse function of
the number of documents in which it
occurs.”
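Her insight as a formula (one standard formulation; the exact weighting in the 1972 paper differs slightly): idf(t) = log2(N / df(t)), where N is the number of documents and df(t) is the number containing term t. A minimal sketch:

```python
import math

# Inverse document frequency: the rarer a term, the more specific it
# is, so it scores higher (one standard formulation of Spärck Jones's idea).
def idf(num_docs: int, doc_freq: int) -> float:
    return math.log2(num_docs / doc_freq)

print(idf(16, 1))   # a term in 1 of 16 documents → 4.0
print(idf(16, 16))  # a term in every document    → 0.0
```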
SparkSQL
The Promise of SparkSQL
(the Catalyst planner)
```sql
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers
ON Orders.CustomerID = Customers.CustomerID;
```
```sql
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
JOIN Customers
ON Orders.CustomerID = Customers.CustomerID;
```
an imaginary SQL statement that could be parallelized
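What makes it parallelizable: both tables can be hash-partitioned on CustomerID and each partition joined independently. A toy sketch in plain Python (not Spark code; the rows are invented for illustration):

```python
from collections import defaultdict

# Toy hash-partitioned join: shard both tables on the join key, then
# join each shard independently -- each iteration of the partition loop
# below could run on a separate worker.
def partitioned_join(orders, customers, n_partitions=4):
    order_shards = defaultdict(list)
    customer_shards = defaultdict(list)
    for row in orders:
        order_shards[hash(row["CustomerID"]) % n_partitions].append(row)
    for row in customers:
        customer_shards[hash(row["CustomerID"]) % n_partitions].append(row)

    results = []
    for p in range(n_partitions):
        by_id = {c["CustomerID"]: c["CustomerName"] for c in customer_shards[p]}
        for o in order_shards[p]:
            if o["CustomerID"] in by_id:
                results.append((o["OrderID"], by_id[o["CustomerID"]], o["OrderDate"]))
    return results

orders = [
    {"OrderID": 1, "CustomerID": "alice", "OrderDate": "2015-09-01"},
    {"OrderID": 2, "CustomerID": "bob", "OrderDate": "2015-09-02"},
    {"OrderID": 3, "CustomerID": "carol", "OrderDate": "2015-09-03"},
]
customers = [
    {"CustomerID": "alice", "CustomerName": "Alice"},
    {"CustomerID": "bob", "CustomerName": "Bob"},
]
print(sorted(partitioned_join(orders, customers)))
# → [(1, 'Alice', '2015-09-01'), (2, 'Bob', '2015-09-02')]
```

Because matching keys always hash to the same partition, no partition ever needs another partition's rows: the join itself is embarrassingly parallel, and the serial cost is the shuffle that builds the shards.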
```sql
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
JOIN Customers
ON Orders.CustomerID = Customers.CustomerID;
```
But what if Customers is on your local HDFS and Orders is in
a data center at your warehouse?
Computerized query planning is the future, but for the time
being you, the user, are going to have to recognize your
latency issues.
Quick fix
`CACHE [LAZY] TABLE [AS SELECT]`
“Premature optimization is the root of all evil.”
- Donald Knuth (misquoted)
We should forget about small efficiencies, say about
97% of the time: premature optimization is the root of
all evil.
Yet we should not pass up our opportunities in that
critical 3%.
A good programmer will not be lulled into
complacency by such reasoning, he will be wise to
look carefully at the critical code; but only after that
code has been identified.
Donald Knuth
ACM Computing Surveys, Vol 6, No. 4, Dec. 1974
Structured Programming with go to Statements
Thank You