Distributed Computing Economics
Jim Gray, Microsoft Research
Slides at: http://research.microsoft.com/~gray/talks
Talk at IEEE Computer Society: 11 December 2003, Palo Alto, CA.



Two (?) Talks

• Distributed Computing Economics
• What I’m doing
– Online Science – World Wide Telescope (with Alex Szalay, JHU)
– TerraServer Brick Design/Deploy/Operate (with Tom Barclay)
– Paxos Commit (with Leslie Lamport)
– Spatial Data done relationally (with Alex Szalay, JHU)


Distributed Computing Economics

• Why is Seti@Home a great idea?

• Why is Napster a great deal?

• Why is the Computational Grid uneconomic?

• When does Computing on Demand work?

• What is the “right” level of abstraction?

• Is the Access Grid the real killer app?

Based on: Distributed Computing Economics, Jim Gray, Microsoft Tech report, March 2003, MSR-TR-2003-24

http://research.microsoft.com/research/pubs/view.aspx?tr_id=655


Computing is Free

• Computers cost 1k$ (if you shop) (yes, there are 1μ$ to 1M$ computers, but..)

• So 1 cpu day == 1$ (computers last 3 years)

• If you pay the phone bill, Internet bandwidth costs 50 … 500$/Mbps/month (not including routers and management).

• So 1GB costs 1$ to send and 1$ to receive

Caveat: All numbers rounded to nearest factor of 3.
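The arithmetic behind these bullets can be sketched as follows. This is a back-of-envelope check, not the author's calculation: it assumes "/m" means per month, a 30-day month, and a fully utilized link, and the function names are illustrative.

```python
# Inputs are the slide's rounded figures; helper names are hypothetical.
COMPUTER_PRICE_USD = 1_000          # "Computers cost 1k$"
LIFETIME_DAYS = 3 * 365             # "computers last 3 years" (about 1k days)
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def cpu_day_cost(price_usd=COMPUTER_PRICE_USD, lifetime_days=LIFETIME_DAYS):
    """Dollars per CPU-day when the purchase price is spread over the lifetime."""
    return price_usd / lifetime_days


def gb_transfer_cost(usd_per_mbps_month):
    """Dollars to move 1 GB (1e9 bytes) over a link rented per Mbps per month.

    Assumes the link is kept 100% busy, which understates the real cost.
    """
    bits_per_month = 1e6 * SECONDS_PER_MONTH   # bits carried by 1 Mbps in a month
    gb_per_month = bits_per_month / 8 / 1e9
    return usd_per_mbps_month / gb_per_month


if __name__ == "__main__":
    print(f"cpu day: {cpu_day_cost():.2f} $")
    print(f"1 GB at  50 $/Mbps/month: {gb_transfer_cost(50):.2f} $")
    print(f"1 GB at 500 $/Mbps/month: {gb_transfer_cost(500):.2f} $")
```

Under these assumptions a CPU-day comes out near 0.9$ and a gigabyte between roughly 0.15$ and 1.5$, which is consistent with "1$" once everything is rounded to the nearest factor of 3.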


Why is Seti@Home a Good Deal?

• Send 300 KB costs 3e-4$

• User computes for ½ day: benefit 5e-1$ (half of a 1$ cpu day)

• ROI: 1500:1

• Finance guys will tell you that is a good Return On Investment (ROI)
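A minimal sketch of the ROI estimate, using the deck's unit costs (1$ per GB sent, 1$ per cpu day). The function name and defaults are illustrative.

```python
def seti_roi(message_bytes=300e3, usd_per_gb=1.0, cpu_days=0.5, usd_per_cpu_day=1.0):
    """Return (cost to send the work unit, value of donated cpu time, ratio)."""
    cost = message_bytes / 1e9 * usd_per_gb     # 300 KB at 1 $/GB
    benefit = cpu_days * usd_per_cpu_day        # half a cpu day at 1 $/day
    return cost, benefit, benefit / cost


if __name__ == "__main__":
    cost, benefit, roi = seti_roi()
    print(f"cost {cost:.0e} $, benefit {benefit} $, ROI about {roi:.0f}:1")
```

The unrounded ratio is about 1,700:1, which agrees with the slide's 1500:1 within its stated rounding.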


Seti@Home: The world’s most powerful computer
• 61 TF is sum of top 4 of Top 500.
• 61 TF is 9x the number 2 system.
• 61 TF more than the sum of systems 2..10

Seti@Home statistics, 20 May 2003 (http://setiathome.ssl.berkeley.edu/totals.html)

                             Total                       Last 24 Hours
Users                        4,493,731                   1,900
Results received             886 M                       1.4 M
Total CPU time               1.5 M years                 1,514 years
Floating Point Operations    3 E+21 ops (3 zetta ops)    5 E+18 ops/day (61.3 TeraFLOPS)


Why was Napster a Good Deal?

• Send 5 MB costs 5e-3$: ½ a penny per song
• Both sender and receiver can afford it.
• Same logic powers web sites (Yahoo!...):
– 1e-3$/page view advertising revenue
– 1e-5$/page view cost of serving web page
– 100:1 ROI


Computing is Free!!! This is not a Surprise

Everywhere I go I see Beowulfs
• Clusters of PCs (or high-slice-price micros)
• True: I have not visited Earth Simulator, but… Google, MSN, Hotmail, Yahoo, NCBI, FNAL, Los Alamos, Cal Tech, MIT, Berkeley, NRAO, Smithsonian, Wisconsin, eBay, Amazon.com, Schwab, Citicorp, Beijing, Cern, BaBar, NCSA, Cornell, UCSD, and of course NASA and Cal Tech


The Cost of Computing: Computers are NOT free!
• IBM, HP, Dell make billions
• Capital Cost of a TpcC system is mostly storage and storage software (database)
• IBM 32 cpu, 512 GB ram, 2,500 disks, 43 TB (680,613 tpmC @ 11.13 $/tpmC, available 11/08/03)
http://www.tpc.org/results/individual_results/IBM/IBMp690es_05092003.pdf
• A 7.5M$ super-computer
• Total Data Center Cost: 40% capital & facilities, 60% staff (includes app development)

TpcC Cost Components, DB2/AIX (from the TPC report above):
cpu/mem   29%
storage   61%
software  10%


Computing Equivalents: 1 $ buys
• 1 day of cpu time
• 4 GB (fast) ram for a day
• 1 GB of network bandwidth
• 1 GB of disk storage for 3 years
• 10 M database accesses
• 10 TB of disk access (sequential)
• 10 TB of LAN bandwidth (bulk)
• 10 KWhrs == 4 days of computer time

Depreciating over 3 years, and there are about 1k days in 3 years.


Some consequences
• Beowulf networking is 10,000x cheaper than WAN networking; factors of 10^5 matter.
• The cheapest and fastest way to move Terabytes cross country is sneakernet: 24 hours ~ 92 Mbps ~ 12 MB/s; 50$ shipping vs 1,000$ WAN cost.
• Sending 10PB CERN data via network is silly: buy disk bricks in Geneva, fill them, ship them.

TeraScale SneakerNet: Using Inexpensive Disks for Backup, Archiving, and Data Exchange. Jim Gray; Wyman Chong; Tom Barclay; Alex Szalay; Jan vandenBerg. Microsoft Technical Report, May 2002, MSR-TR-2002-54. http://research.microsoft.com/research/pubs/view.aspx?tr_id=569
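The sneakernet figures can be checked with a few lines. This sketch assumes 1 TB = 1e12 bytes and reuses the deck's 1$/GB WAN price; the function names are illustrative.

```python
def effective_mbps(bytes_moved, hours):
    """Average bandwidth, in megabits per second, of moving data in a given time."""
    return bytes_moved * 8 / (hours * 3600) / 1e6


def wan_cost_usd(bytes_moved, usd_per_gb=1.0):
    """WAN cost at the deck's rule of thumb of about 1 $ per GB sent."""
    return bytes_moved / 1e9 * usd_per_gb


if __name__ == "__main__":
    terabyte = 1e12
    mbps = effective_mbps(terabyte, hours=24)
    print(f"1 TB shipped overnight ~ {mbps:.1f} Mbps ~ {mbps / 8:.1f} MB/s")
    print(f"WAN cost for 1 TB ~ {wan_cost_usd(terabyte):.0f} $ (vs about 50 $ shipping)")
```

A courier that takes 24 hours to deliver one terabyte is equivalent to a link of roughly 92 Mbps, and the same terabyte costs about 1,000$ to send at 1$/GB.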


How Do You Move A Terabyte?

Context       Speed (Mbps)   Rent ($/month)   $/Mbps   $/TB sent   Time/TB
Home phone    0.04           40               1,000    3,086       6 years
Home DSL      0.6            50               117      360         5 months
T1            1.5            1,200            800      2,469       2 months
T3            43             28,000           651      2,010       2 days
OC3           155            49,000           316      976         14 hours
OC 192        9600           1,920,000        200      617         14 minutes
100 Mbps      100                                                  1 day
Gbps          1000                                                 2.2 hours

Source: TeraScale Sneakernet, Microsoft Research, Gray et al.
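The last two columns of the table follow from the first two. The sketch below recomputes them, assuming 1 TB = 1e12 bytes, a 30-day month, and a link that is kept full for the whole transfer; the function names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def seconds_per_terabyte(speed_mbps):
    """Time to push 1e12 bytes through a link of the given speed."""
    return 1e12 * 8 / (speed_mbps * 1e6)


def dollars_per_terabyte(rent_per_month, speed_mbps):
    """Rent paid while the link is busy sending one terabyte."""
    months = seconds_per_terabyte(speed_mbps) / SECONDS_PER_MONTH
    return rent_per_month * months


if __name__ == "__main__":
    links = [("Home phone", 0.04, 40), ("T1", 1.5, 1_200), ("OC 192", 9600, 1_920_000)]
    for name, mbps, rent in links:
        days = seconds_per_terabyte(mbps) / 86_400
        print(f"{name:10s} {days:10.2f} days/TB  {dollars_per_terabyte(rent, mbps):8.0f} $/TB")
```

For the home phone, T1, and OC 192 rows this reproduces the tabulated $/TB values (3,086, 2,469, and 617).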


Computational Grid Economics
• To the extent that the computational grid is like Seti@Home or ZetaNet or Folding@home or… it is a great thing
• To the extent that the computational grid is MPI or data analysis, it fails on economic grounds: move the programs to the data, not the data to the programs.
• The Internet is NOT the cpu backplane.
• An alternate reality: Nearly free networking
– Telcos go bankrupt and price=cost=0
– Taxpayers pay your phone bill so price=0 and telcos get BIG government subsidy


When to Export a Task

IF instruction density > 100,000 instructions/byte
AND remote computer is free (costs you nothing)
THEN ROI > 0
ELSE ROI < 0

Finance guys will tell you negative ROI is bad
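The rule can be written as a small predicate, and the 100,000 instructions/byte threshold can be roughly recovered from the deck's unit costs. The processor speed of 1e9 instructions per second is an assumption made here for illustration, not a figure from the slides; function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def break_even_density(instr_per_second=1e9, usd_per_cpu_day=1.0, usd_per_gb=1.0):
    """Instructions per byte at which 1 $ of cpu time equals 1 $ of network traffic.

    instr_per_second is an assumed processor speed, not a number from the talk.
    """
    instr_per_dollar = instr_per_second * SECONDS_PER_DAY / usd_per_cpu_day
    bytes_per_dollar = 1e9 / usd_per_gb
    return instr_per_dollar / bytes_per_dollar


def should_export(instructions, bytes_moved, remote_is_free, threshold=100_000):
    """The slide's rule: export only compute-dense tasks to free remote machines."""
    return remote_is_free and instructions / bytes_moved > threshold


if __name__ == "__main__":
    print(f"break-even ~ {break_even_density():,.0f} instructions/byte")
    # Seti@Home-like task: half a cpu day of work on a 300 KB work unit.
    print(should_export(0.5 * SECONDS_PER_DAY * 1e9, 300e3, remote_is_free=True))
    # Scan-like task: about one instruction per byte moved.
    print(should_export(1e9, 1e9, remote_is_free=True))
```

With these inputs the break-even density is about 86,000 instructions per byte, close to the slide's 100,000; the Seti@Home-like task clears it by three orders of magnitude, while a scan-style task does not.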


Computing on Demand
• Was called outsourcing or service bureaus in my youth. CSC and IBM did it.
• It is not a new way of doing things: think payroll. Payroll is standard outsource.
• Now Hotmail, Salesforce.com, Oracle.com,….
• Works for standard apps.
• COD works for commoditized services.
• Airlines outsource reservations. Banks outsource ATMs.
• But Amazon, Amex, Wal-Mart, eTrade, eBay... can’t outsource their core competence.


What’s the right abstraction level for Internet Scale Distributed Computing?
• Disk block? No, too low (Ø).
• File? No, too low (XDrive)
• Database? No, too low (SkyServer).
• RPC? Yes,
– TerraService, MapQuest,….
– Blast search
– Google search
• Application? Yes, even better.
– Send/Get eMail
– Expedia
– Amazon
– Portals that federate astronomy archives (http://skyQuery.Net/)
– Web Services (.NET, EJB, OGSA) give plumbing for rpc/App abstraction level.


Access Grid

• Q: What comes after the telephone?

• A: eMail?

• A: Instant messaging?

• Both seem retro: text & emoticons.

• Access Grid could revolutionize human communication.

• But, it needs a new idea.

• Q: What comes after the telephone?


Distributed Computing Economics

• Why is Seti@Home a great idea?

• Why is Napster a great deal?

• Why is the Computational Grid uneconomic?

• When does computing on demand work?

• What is the “right” level of abstraction?

• Is the Access Grid the real killer app?

Based on: Distributed Computing Economics, Jim Gray, Microsoft Tech report, March 2003, MSR-TR-2003-24

http://research.microsoft.com/research/pubs/view.aspx?tr_id=655


Two (?) Talks

• Distributed Computing Economics
• What I’m doing
– Online Science – World Wide Telescope (with Alex Szalay, JHU)
– TerraServer Brick Design/Deploy/Operate (with Tom Barclay)
– Paxos Commit (with Leslie Lamport)
– Spatial Data done relationally (with Alex Szalay, JHU)


Online Science: The World Wide Telescope

• I have been looking for a distributed DB for most of my career.

• I think I found one! (sort of).


The Evolution of Science

• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes Information
• Analytical Science
– Scientist builds analytical model
– Makes predictions.
• Computational Science
– Simulates analytical model
– Validates model and makes predictions


Computational Science Evolves
• Historically, Computational Science = simulation.
• Science - Informatics: Information Exploration Science. Information captured by instruments, or Information generated by simulator
– Processed by software
– Placed in a database / files
– Scientist analyzes database / files
• New emphasis on informatics:
– Capturing,
– Organizing,
– Summarizing,
– Analyzing,
– Visualizing
• Largely driven by observational science, but also needed by simulations.
• Too soon to say if comp-X and X-info will unify or compete.

[Images: BaBar, Stanford; Space Telescope; P&E Gene Sequencer, from http://www.genome.uci.edu/]


Both comp-X and X-info Generating Petabytes
• Comp-Science generating an Information avalanche: comp-chem, comp-physics, comp-bio, comp-astro, comp-linguistics, comp-music, comp-entertainment, comp-warfare
• Science-Info dealing with Information avalanche: bio-info, astro-info, text-info,


Information Avalanche Stories
• Turbulence: 100 TB simulation, then mine the Information
• BaBar: Grows 1TB/day; 2/3 simulation Information, 1/3 observational Information
• CERN: LHC will generate 1GB/s, 10 PB/y
• VLBA (NRAO) generates 1GB/s today
• NCBI: “only ½ TB” but doubling each year; very rich dataset.
• Pixar: 100 TB/Movie


Astro-Info: World Wide Telescope
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org/
• Premise: Most data is (or could be) online
• Internet is the world’s best telescope:
– It has data on every part of the sky
– In every measured spectral band: optical, x-ray, radio..
– As deep as the best instruments (2 years ago).
– It is up when you are up. The “seeing” is always great (no working at night, no clouds no moons no..).
– It’s a smart telescope: links objects and data to literature on them.


Why Astronomy Data?
• It has no commercial value
– No privacy concerns
– Can freely share results with others
– Great for experimenting with algorithms
• It is real and well documented
– High-dimensional data (with confidence intervals)
– Spatial data
– Temporal data
• Many different instruments from many different places and many different times
• But, it’s the same universe, so comparisons make sense & are interesting.
• Federation is a goal
• There is a lot of it (petabytes)
• Great sandbox for data mining algorithms
– Can share cross company
– University researchers
• Great way to teach both Astronomy and Computational Science

[Sky images: IRAS 100, ROSAT ~keV, DSS Optical, 2MASS 2, IRAS 25, NVSS 20cm, WENSS 92cm, GB 6cm]


What X-info Needs from us (cs) (not drawn to scale)

[Diagram: Scientists (science data & questions); Plumbers (database to store data and execute queries); Miners (data mining algorithms); Tools (question & answer, visualization)]


Show Maria’s 5-minute PPT

SDSS Image Cutout slide show by Maria A. Nieto-Santisteban of JHU

http://www.research.microsoft.com/~Gray/talks/FDIS_ImgCutoutPresentation.ppt


Data Access is hitting a wall: FTP and GREP are not adequate
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years.
• Oh!, and 1PB ~5,000 disks
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (= 1 $/GB)
• … 2 days and 1K$
• … 3 years and 1M$
• At some point you need indices to limit search, parallel data search and analysis
• This is where databases can help
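The scaling above is just size divided by scan rate. The sketch below takes the "1 GB in a minute" figure as the rate and, as an assumption for illustration, supposes that a parallel scan splits the data evenly over the ~5,000 disks with each disk scanned at that same rate.

```python
GB, TB, PB = 1e9, 1e12, 1e15
SCAN_RATE = GB / 60   # bytes per second, from "GREP 1 GB in a minute"


def scan_seconds(n_bytes, rate=SCAN_RATE, parallel_streams=1):
    """Time to read n_bytes sequentially, optionally split over parallel streams."""
    return n_bytes / (rate * parallel_streams)


if __name__ == "__main__":
    print(f"1 TB, one stream   : {scan_seconds(TB) / 86_400:.2f} days")
    print(f"1 PB, one stream   : {scan_seconds(PB) / (365 * 86_400):.2f} years")
    print(f"1 PB, 5,000 streams: {scan_seconds(PB, parallel_streams=5_000) / 3600:.1f} hours")
```

The single-stream results (under a day per TB, about two years per PB) match the slide within its factor-of-3 rounding, and the parallel case shows why indices and parallel search become necessary at petabyte scale.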


Next-Generation Data Analysis
• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
– Correlation functions are N^2, likelihood techniques N^3
• As data and processing grow at same rate, we can only keep up with N log N
• A way out?
– Discard notion of optimal (data is fuzzy, answers are approximate)
– Don’t assume infinite computational resources or memory
• Requires combination of statistics & computer science


Analysis and Databases
• Statistical analysis deals with
– Creating uniform samples
– Data filtering & censoring bad data
– Assembling subsets
– Estimating completeness
– Counting and building histograms
– Generating Monte-Carlo subsets
– Likelihood calculations
– Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a database, close to the data.
• Move Mohamed to the mountain, not the mountain to Mohamed.


Goal: Easy Data Publication & Access

• Augment FTP with data query: Return intelligent data subsets
• Make it easy to
– Publish: Record structured data
– Find: Find data anywhere in the network; get the subset you need
– Explore datasets interactively
• Realistic goal:
– Make it as easy as publishing/reading web sites today.


Data Federations of Web Services
• Massive datasets live near their owners:
– Near the instrument’s software pipeline
– Near the applications
– Near data knowledge and curation
– Super Computer centers become Super Data Centers
• Each Archive publishes a web service
– Schema: documents the data
– Methods on objects (queries)
• Scientists get “personalized” extracts
• Uniform access to multiple Archives
– A common global schema

[Diagram label: Federation]


Web Services: The Key?
• Web SERVER:
– Given a url + parameters
– Returns a web page (often dynamic)
• Web SERVICE:
– Given an XML document (soap msg)
– Returns an XML document
– Tools make this look like an RPC: F(x,y,z) returns (u, v, w)
– Distributed objects for the web.
– + naming, discovery, security,..
• Internet-scale distributed computing

[Diagram: your program sends http to a Web Server and gets back a web page; your program sends soap to a Web Service and gets back an object in xml, which becomes data in your address space]


The Challenge
• This has failed several times before – understand why.
• Develop
– Common data models (schemas),
– Common interfaces (class/method)
• Build useful prototypes (nodes and portals)
• Create a community that uses the prototypes and evolves the prototypes.


Grid and Web Services Synergy
• I believe the Grid will be many web services
• IETF standards provide
– Naming
– Authorization / Security / Privacy
– Distributed Objects: Discovery, Definition, Invocation, Object Model
– Higher level services: workflow, transactions, DB,..
• Synergy: commercial Internet & Grid tools


Some Interesting Things We are Doing in SDSS (what’s new)
• SkyServer is “done.” Now it is 99% perspiration to load 25 TB (many times) and manage it.
• I’m using it as a research vehicle to explore new DB ideas.
• Others are cloning it for other surveys. Some doing DB2 & Oracle variants.


SkyServer Overview (10 min)
• 10 minute SkyServer tour
– Pixel space: http://skyserver.sdss.org/en/
– Record space: http://skyserver.sdss.org/en/tools/explore/obj.asp?id=2255030989160697
– Doc space: Ned
– Set space:
– Web & Query Logs:
– Dr1 WebService
• You can download (thanks to Cathan Cook)
– Data + Database code:
– Website:
• Data Mining the SDSS SkyServer Database, MSR-TR-2002-01

select top 10 * from weblog..weblog where yy = 2003 and mm = 7 and dd = 25 order by seq desc

select top 10 * from weblog..sqlLog order by theTime desc

http://skyserver.pha.jhu.edu/dr1/en/tools/chart/navi.asp

http://research.microsoft.com/~gray/SDSS/personal_skyserver.htm


Cutout Service (10 min): A typical web service

• Show it

• Show WSDL

• Show fixing a bug

• Rush through code.

• You can download it.

Maria A. Nieto-Santisteban did most of this (Alex and I started it)

http://research.microsoft.com/~gray/SDSS/personal_skyserver.htm


SkyQuery: http://skyquery.net/
• Distributed Query tool using a set of web services
• Four astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England).
• Feasibility study, built in 6 weeks
– Tanu Malik (JHU CS grad student)
– Tamas Budavari (JHU astro postdoc)
– With help from Szalay, Thakar, Gray
• Implemented in C# and .NET
• Allows queries like:

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
  AND AREA(181.3,-0.76,6.5)
  AND o.type=3 and (o.I - t.m_j)>2


SkyQuery Structure
• Each SkyNode publishes
– Schema Web Service
– Database Web Service
• Portal
– Plans Query (2 phase)
– Integrates answers
– Is itself a web service

[Diagram: SkyQuery Portal connected to 2MASS, INT, SDSS, FIRST, and Image Cutout]


SkyQuery and The Grid
• This is a DataGrid
• It works today
• It is challenging for OGSA-DAIS (hello world in OGSI-DAI is complex)
• SkyQuery is being used as a vehicle to explore OGSA and DAIS requirements.

[Diagram: SkyQuery Portal connected to 2MASS, INT, SDSS, FIRST, and Image Cutout]


MyDB added to SkyQuery
• Let users add personal DB, 1GB for now.
• Use it as a workbook.
• Online and batch queries.
• Moves analysis to the data
• Users can cooperate (share MyDB)
• Still exploring this

[Diagram: SkyQuery Portal connected to 2MASS, INT, SDSS, FIRST, Image Cutout, and MyDB]


Two (?) Talks

• Distributed Computing Economics
• What I’m doing
– Online Science – World Wide Telescope (with Alex Szalay, JHU)
– TerraServer Brick Design/Deploy/Operate (with Tom Barclay)
– Paxos Commit (with Leslie Lamport)
– Spatial Data done relationally (with Alex Szalay, JHU)


TerraServer V4
• 8 web front end
• 4x8cpu+4GB DB
• 18TB triplicate disks, Classic SAN (tape not shown)
• ~2M$ capital expense
• Works GREAT!
• 2000…2004
• Now replaced by..

[Diagram: WEB x8, SQL x4, SAN]


TerraServer V5
• Storage Bricks
– “White-box commodity servers”
– 4 TB raw / 2 TB RAID1 SATA storage
– Dual Hyper-threaded Xeon 2.4 GHz, 4 GB RAM
• Partitioned Databases (PACS – partitioned array)
– 3 Storage Bricks = 1 TerraServer data
– Data partitioned across 20 databases
– More data & partitions coming
• Low Cost Availability
– 4 copies of the data: RAID1 SATA mirroring, 2 redundant “Bunches”
– Spare brick to repair failed brick: 2N+1 design
– Web Application “bunch aware”: load balances between redundant databases, fails over to surviving database on failure
• ~100K$ capital expense.

[Diagram label: KVM / IP]


Two (?) Talks

• Distributed Computing Economics
• What I’m doing
– Online Science – World Wide Telescope (with Alex Szalay, JHU)
– TerraServer Brick Design/Deploy/Operate (with Tom Barclay)
– Paxos Commit (with Leslie Lamport)
– Spatial Data done relationally (with Alex Szalay, JHU)


Two Phase Commit
• N Resource Managers (RMs)
• Want all RMs to commit or all abort.
• Coordinated by Transaction Manager (TM): TM sends Prepare, Commit-Abort
• RM responds Prepared, Aborted
• 3N+1 messages
• N+1 stable writes
• Delay
– 3 message
– 2 stable write
• Blocking: if TM fails, Commit-Abort stalls

[Diagram: Transaction Manager states: working, then committed or aborted. Resource Manager states: working, prepared, then committed or aborted. Messages: Request Commit, Prepare, Prepared, Commit.]
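The message and stable-write counts on this slide can be reproduced with a toy, single-process simulation. This is only a counting sketch with made-up class and function names: it assumes one RM sends the initial Request Commit, that each RM forces one log record when it prepares, and that the TM forces one record for the decision. It models no failures or timeouts.

```python
from dataclasses import dataclass


@dataclass
class ResourceManager:
    name: str
    can_commit: bool = True
    state: str = "working"

    def on_prepare(self, counters):
        """Vote on the transaction; a yes vote is forced to stable storage first."""
        if self.can_commit:
            counters["stable_writes"] += 1
            self.state = "prepared"
            return "Prepared"
        self.state = "aborted"
        return "Aborted"


def two_phase_commit(rms):
    counters = {"messages": 0, "stable_writes": 0}
    counters["messages"] += 1                 # Request Commit: one RM -> TM
    votes = []
    for rm in rms:
        counters["messages"] += 1             # Prepare: TM -> RM
        votes.append(rm.on_prepare(counters))
        counters["messages"] += 1             # Prepared / Aborted: RM -> TM
    decision = "committed" if all(v == "Prepared" for v in votes) else "aborted"
    counters["stable_writes"] += 1            # TM logs the decision
    for rm in rms:
        counters["messages"] += 1             # Commit / Abort: TM -> RM
        rm.state = decision
    return decision, counters


if __name__ == "__main__":
    n = 5
    decision, counters = two_phase_commit([ResourceManager(f"rm{i}") for i in range(n)])
    print(decision, counters, "expected", 3 * n + 1, "messages and", n + 1, "writes")
```

With N = 5 and every RM voting yes, the run uses 16 messages and 6 stable writes, matching 3N+1 and N+1.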


Two Phase Commit: 2PC

• Atomicity – all or nothing

• Consistency/Reliability – does right thing

• Isolation – no concurrency anomalies

• Durability – state survives failures

• Availability: always up

• ACID-A


I can do better

• Those 2PC wimps are
– Stupid – they do not understand my app
– Fascists – they force me to send messages
• I can do better
– I can write async code
– I can keep logs
– I can deal with failures and complexities
– Indeed, this is my destiny: a full employment act


Commit

• KISS

• Simple fault / failure model

• It is hard to get these “optimizations” right.

• But you want availability…

• OK…

• No 2PC just C


2PC Commit

• Availability: always up
• Atomicity – all or nothing
• Consistency/Reliability – does right thing
• Isolation – no concurrency anomalies
• Durability – state survives failures
• => 2PC++ = 3PC = Non Blocking Commit. Solves the availability problem

• AACID


Consensus
• N processes want to agree on a value
• Want to tolerate F faults
– Tolerate F processes stopping
– Tolerate F messages delayed or lost
• If there are at most F faults in a window, then consensus is achieved.
• Byzantine faults need 3F+1 “acceptors”
• Benign faults need 2F+1 “acceptors”; stalls but safe if more than F faults
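For benign (stopping) faults, the 2F+1 figure comes from requiring that any two majorities share at least one acceptor. The brute-force check below illustrates that property for small F; it is not a consensus implementation, and the function names are illustrative.

```python
from itertools import combinations


def acceptors_needed(f):
    """Acceptors required to tolerate f benign faults."""
    return 2 * f + 1


def quorum_size(f):
    """A majority of 2f+1 acceptors."""
    return f + 1


def quorums_always_intersect(f):
    """Check exhaustively that every pair of majorities shares an acceptor."""
    acceptors = range(acceptors_needed(f))
    quorums = [set(q) for q in combinations(acceptors, quorum_size(f))]
    return all(a & b for a, b in combinations(quorums, 2))


def can_make_progress(f, stopped):
    """True while enough acceptors remain to form a quorum."""
    return acceptors_needed(f) - stopped >= quorum_size(f)


if __name__ == "__main__":
    for f in range(4):
        print(f, acceptors_needed(f), quorums_always_intersect(f))
    print(can_make_progress(2, stopped=2), can_make_progress(2, stopped=3))
```

With F = 2 there are 5 acceptors: the system keeps deciding with 2 of them stopped, and with 3 stopped it can no longer form a quorum, which is the "stalls but safe" case.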


Paxos Consensus

• Group has a leader known to all
– leader election is a subroutine
• Process proposes a value v to leader.
• Leader sends proposal (phase 2) (ballot, value) to all acceptors
• Acceptors respond with: max(ballot, value) they have seen
• If leader gets no higher ballot, and gets at least F+1 responses, then leader can announce (ballot, value)
• Protocol is 3-phase
• Phase 1:
– Leader starts new ballot
• Phase 2:
– Leader proposes value
• Phase 3:
– If value accepted by F+1 then value is accepted.
– If not, leader tries to get majority value accepted.

6F+4 messages, F+1 stable writes; 4 message delays and 2 stable writes


Paxos Commit
• Obvious idea: Have TM use Paxos consensus of RMs prepared
• More efficient idea:
• 2F+1 acceptors (~2F+1 TMs)
• Each RM leads a Paxos on: I’m Prepared.
• If F+1 acceptors see all RMs prepared, then transaction committed.
• 2F(N+1) + 3N + 1 messages; 5 message delays (one extra delay); 2 stable write delays.
• == 2PC when F=0

[Diagram: RM0, Commit Leader, RM0…N, Acceptors 0…2F; messages: request commit, prepare, prepared, all prepared, commit]
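The "== 2PC when F=0" claim can be checked directly against the message counts quoted in the slides. The formulas below are taken from the slides as written; the function names are illustrative.

```python
def two_pc_messages(n):
    """Messages for Two Phase Commit with n resource managers (from the slides)."""
    return 3 * n + 1


def paxos_commit_messages(n, f):
    """Messages for Paxos Commit with n RMs tolerating f faults (from the slides)."""
    return 2 * f * (n + 1) + 3 * n + 1


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(n, two_pc_messages(n), [paxos_commit_messages(n, f) for f in range(3)])
```

Each tolerated fault adds 2(N+1) messages on top of the 2PC cost, so fault tolerance costs a number of messages linear in both N and F.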


Paxos Commit (success case)

[Diagram: state machines for the Resource Managers, Acceptors, and Commit Leader (states shown: working, prepared, AllPrepared, committed, aborted) and the success-case message flow: Request Commit, Prepare, Prepared, All Prepared, Commit]


Two (?) Talks

• Distributed Computing Economics
• What I’m doing
– Online Science – World Wide Telescope (with Alex Szalay, JHU)
– TerraServer Brick Design/Deploy/Operate (with Tom Barclay)
– Paxos Commit (with Leslie Lamport)
– Spatial Data done relationally (with Alex Szalay, JHU)


There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World

Jim Gray, Microsoft

Alex Szalay, Johns Hopkins U.


Background
• I have been working with the Astronomy community to build the World Wide Telescope: all telescope data federated in one internet-scale DB
• A great Web Services app
• The work here is joint with Alex Szalay
• SkyServer.Sdss.Org is first installment,
• SkyQuery.Net is second installment (federated web services)


Outline

• How to do spatial lookup:
– The old way: HTM
– The new way: zoned lookup


Spatial Data Access – SQL extension

Szalay, Kunszt, Brunner http://www.sdss.jhu.edu/htm

• Added Hierarchical Triangular Mesh (HTM) table-valued function for spatial joins
• Every object has a 20-deep Mesh ID
• Given a spatial definition, routine returns up to 10 covering triangles
• Spatial query is then up to 10 range queries
• Fast: 1,000 triangles / second / GHz

[Figure: HTM triangle subdivision; triangles labeled 2,0 2,1 2,2 2,3, with 2,3 subdivided into 2,3,0 2,3,1 2,3,2 2,3,3]


A typical call

-- find objects within 1 arcminute of (60,20)
select objID, ra, dec
from PhotoObj as p, fHtmCover(60,20,1) as triangle
where p.htmID between triangle.startHtmID and triangle.endHtmID
  and <geometry test on (ra,dec) – (60,20) < 1 arcmin>

-- or better yet
select objID, ra, dec, distance
from dbo.fGetNearbyObjEq(60,20,1)

[Figure: coarse filter (coarse distance test) next to correct filter; the careful distance test rejects false positives]
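The two-stage pattern in the query (a cheap coarse filter, then a careful distance test that rejects false positives) can be sketched outside the database. Note the substitution: this sketch uses a simple ra/dec bounding box as the coarse filter instead of HTM range lookups, ignores wraparound at ra = 0/360, and uses made-up function names and sample objects.

```python
import math


def angular_distance_arcmin(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcminutes; inputs in degrees."""
    def unit(ra, dec):
        ra, dec = math.radians(ra), math.radians(dec)
        return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))
    a, b = unit(ra1, dec1), unit(ra2, dec2)
    chord = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return math.degrees(2 * math.asin(min(1.0, chord / 2))) * 60


def coarse_filter(objects, ra, dec, radius_arcmin):
    """Bounding-box prefilter (stand-in for the HTM range queries)."""
    r = radius_arcmin / 60
    ra_pad = r / max(math.cos(math.radians(abs(dec) + r)), 1e-9)  # widen box in ra
    return [o for o in objects if abs(o[2] - dec) <= r and abs(o[1] - ra) <= ra_pad]


def nearby(objects, ra, dec, radius_arcmin):
    """Coarse filter followed by the careful distance test."""
    hits = []
    for obj_id, o_ra, o_dec in coarse_filter(objects, ra, dec, radius_arcmin):
        d = angular_distance_arcmin(ra, dec, o_ra, o_dec)
        if d <= radius_arcmin:
            hits.append((obj_id, o_ra, o_dec, d))
    return hits


if __name__ == "__main__":
    # Hypothetical objects as (objID, ra, dec) in degrees.
    objs = [(1, 60.0, 20.0), (2, 60.01, 20.0), (3, 60.0, 20.02), (4, 60.0172, 20.0165)]
    print([o[0] for o in coarse_filter(objs, 60, 20, 1)])
    print([h[0] for h in nearby(objs, 60, 20, 1)])
```

Object 4 sits in a corner of the bounding box, so it passes the coarse filter but is about 1.4 arcminutes away and is rejected by the careful test, which is the false-positive case the figure describes.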


Integration with CLR Makes it Nicer

• Peter Kukol converted 500 lines of external stored procedure “glue code” to 50 lines of C# code.

• Now we are converting library to C#

• Also, Cross Apply is VERY useful:

select p.objID, count(*)
from PhotoObj p cross apply dbo.fGetNearbyObjEq(p.ra, p.dec, 1) as n
group by p.objID


Object Relational Has Arrived
• VMs are moving inside the DB
• Yukon includes Common Language Runtime (Oracle & DB2 have similar mechanisms).
• So, C++, VB, C# and Java are co-equal with TransactSQL.
• You can define classes and methods; SQL will store the instances; access them via methods
• You can put your analysis code INSIDE the database.
• Minimizes data movement. You can’t move petabytes to the client, but we will soon have petabyte databases.

[Diagram: data + code]


Spatial Data Search: The Pre CLR design

[Diagram: Transact SQL sp_HTM (20 lines); 469 lines of “glue”; the HTM code body]

The “glue” looks like:

// Get Coordinates param datatype, and param length information
if (srv_paraminfo(pSrvProc, 1, &bType1, &cbMaxLen1, &cbActualLen1, NULL, &fNull1) == FAIL)
    ErrorExit("srv_paraminfo failed...");

// Is Coordinate param a character string
if (bType1 != SRVBIGVARCHAR && bType1 != SRVBIGCHAR &&
    bType1 != SRVVARCHAR && bType1 != SRVCHAR)
    ErrorExit("Coordinate param should be a string.");

// Is Coordinate param non-null
if (fNull1 || cbActualLen1 < 1 || cbMaxLen1 <= cbActualLen1)
    ErrorExit("Coordinate param is null.");

// Get pointer to Coordinate param
pzCoordinateSpec = (char *) srv_paramdata(pSrvProc, 1);
if (pzCoordinateSpec == NULL)
    ErrorExit("Coordinate param is null.");
pzCoordinateSpec[cbActualLen1] = 0;

// Get OutputVector datatype, and param length information
if (srv_paraminfo(pSrvProc, 2, &bType2, &cbMaxLen2, &cbActualLen2, NULL, &fNull2) == FAIL)
    ErrorExit("Failed to get type info on HTM Vector param...");


The “glue” CLR design: Discard 450 lines of UGLY code

[Diagram: C# SQL sp_HTM (50 lines); the HTM code body]

using System;
using System.Data;
using System.Data.SqlServer;
using System.Data.SqlTypes;
using System.Runtime.InteropServices;

namespace HTM {
  public class HTM_wrapper {
    [DllImport("SQL_HTM.dll")]
    static extern unsafe void * xp_HTM_Cover_get(byte *str);

    public static unsafe void HTM_cover_RS(string input) {
      // convert the input from Unicode (array of 2 bytes) to an array of bytes (not shown)
      byte * inputBytes;
      byte * output;

      // invoke the HTM routine
      output = (byte *)xp_HTM_Cover_get(inputBytes);

      // Convert the array to a table
      SqlResultSet outputTable = SqlContext.GetReturnResultSet();
      if (output[0] == 'O') {                 // if Output is “OK”
        uint c = *(UInt32 *)(output + 4);     // cast results as dataset
        Int64 * r = (Int64 *)(output + 8);    // Int64 r[c-1,2]
        for (int i = 0; i < c; ++i) {
          SqlDataRecord newRecord = outputTable.CreateRecord();
          newRecord.SetSqlInt64(0, r[0]);
          newRecord.SetSqlInt64(1, r[1]);
          r++; r++;
          outputTable.Insert(newRecord);
        }
      }
      // return outputTable;
    }
  }
}

Thanks!!! To Peter Kukol (who wrote this)


The Clean CLR design

Discard all glue code; return array cast as table

    CREATE ASSEMBLY HTM_A
    FROM '\\localhost\HTM\HTM.dll'

    CREATE FUNCTION HTM_cover( @input NVARCHAR(100) )
    RETURNS @t TABLE (
        HTM_ID_START BIGINT NOT NULL PRIMARY KEY,
        HTM_ID_END   BIGINT NOT NULL
    )
    AS EXTERNAL NAME HTM_A:HTM_NS.HTM_C::HTM_cover

    using System;
    using System.Data;
    using System.Data.Sql;
    using System.Data.SqlServer;
    using System.Data.SqlTypes;
    using System.Runtime.InteropServices;

    namespace HTM_NS {
      public class HTM_C {
        public static Int64[,2] HTM_cover(string input) {
          // invoke the HTM routine
          return (Int64[,2]) xp_HTM_Cover(input);
          // the actual HTM C# or C++ or Java or VB code goes here.
        }
      }
    }

(Your/My code goes here.)


Performance (Beta1) On a 2.2 GHz Xeon

• Call a Transact SQL function 33 μs

• Call a C# function 50 μs

• Table valued function not good in β1

• Array (== table) valued function 200 μs + 27 μs per row
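The array-valued function figures above read as a fixed call cost plus a per-row cost. A rough Python sketch of that cost model, assuming the two costs simply add; the function name is illustrative and not from the talk:

```python
# Rough cost model from the Beta1 measurements quoted above (microseconds).
CALL_OVERHEAD_US = 200   # one array (== table) valued function call
PER_ROW_US = 27          # additional cost for each returned row


def array_function_cost_us(rows: int) -> int:
    """Estimated cost of one array-valued call returning `rows` rows (assumed additive)."""
    return CALL_OVERHEAD_US + PER_ROW_US * rows


if __name__ == "__main__":
    for rows in (1, 10, 100):
        print(rows, "rows:", array_function_cost_us(rows), "microseconds")
```

Under this model the fixed 200 μs dominates until a call returns more than about 7 rows.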


The Code

Function written in C# inside the DB:

    using System;
    using System.Data;
    using System.Data.SqlServer;
    using System.Data.SqlTypes;
    using System.Runtime.InteropServices;

    namespace ReturnOneNS {
      public class ReturnOneC {
        public static int ReturnOne_Int(int input) {
          return input;
        }
      }
    }

Program in DB in different language (Tsql) calling function:

    CREATE ASSEMBLY ReturnOneA
    FROM '\\localhost\C:\ReturnOne.dll'
    GO
    CREATE FUNCTION ReturnOne_Int( @input INT) RETURNS INT
    AS EXTERNAL NAME ReturnOneA:ReturnOneNS.ReturnOneC::ReturnOne_Int
    GO
    -- time echo an integer
    declare @i int, @j int, @cpu_seconds float, @null_loop float
    declare @start datetime, @end datetime

    -- time the empty loop
    set @j = 0
    set @i = 10000
    set @start = current_Timestamp
    while(@i > 0) begin
        set @j = @j + 1
        set @i = @i -1
    end
    set @end = current_Timestamp
    -- ms for 10,000 iterations, divided by 10, is microseconds per iteration
    set @null_loop = datediff(ms, @start,@end) / 10.0

    -- time the loop that calls the function
    set @i = 10000
    set @start = current_Timestamp
    while(@i > 0) begin
        select @j = dbo.ReturnOne_Int(@i)
        set @j = @j + 1
        set @i = @i -1
    end
    set @end = current_Timestamp
    set @cpu_seconds = datediff(ms, @start,@end) / 10.0 - @null_loop
    print 'average time per call to ReturnOne_Int over 10,000 calls was '
          + str(@cpu_seconds,8,2) + ' micro seconds'


What Is the Significance?

• No more inside/outside DB dichotomy.

• You can put your code near the data.

• Indeed, we are letting users put personal databases near the data archive.

• This avoids moving large datasets.

• Just move questions and answers.


Meta-Message

• Trying to fit science data into databases

• When it does not fit, something is wrong.

• Look for solutions
  – Many solutions come from OR extensions
  – Some are fundamental engine changes

• More structure in DB

• Richer operator sets

• Better statistics


But…

• Wanted a faster way to do this: some computations were taking toooooo long (see below).

• Wanted to define areas in relational form.

• Wanted a portable way that works on any relational system.

• So, developed a “constraint database” approach – see below.


The Idea: Equations Define Subspaces

• For (x,y) above the line: ax + by > c

• Reverse the space by: -ax + -by > -c

• Intersect 3 half-spaces:
    a1x + b1y > c1
    a2x + b2y > c2
    a3x + b3y > c3

[Figures: the line ax + by = c on x-y axes, crossing the axes at x=c/a and y=c/b; a second x-y plot for the three half-spaces]


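The Idea: Equations Define Subspaces

    a1x + b1y > c1
    a2x + b2y > c2
    a3x + b3y > c3

    select count(*) from convex where a*@x + b*@y < c

    select count(*) from convex where a*@x + b*@y > c

[Figures: the three lines split the x-y plane into seven regions, each labeled with a query count. Counting satisfied constraints (a*@x + b*@y > c) gives 3 inside the triangle, 2 in each region across one edge, and 1 in each region beyond a vertex; counting violated constraints (a*@x + b*@y < c) gives 0, 1, and 2 for the same regions.]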
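The two counting queries above can be mimicked in a few lines of Python. This is only a sketch: the triangle and its coefficients are made-up illustration data, and a half-space (a, b, c) is taken to hold the points with a*x + b*y > c, as on the slide.

```python
# Each half-space is (a, b, c) and holds the points with a*x + b*y > c.
# Hypothetical triangle with vertices (0,0), (4,0), (0,4).
TRIANGLE = [
    (1.0, 0.0, 0.0),      # x > 0
    (0.0, 1.0, 0.0),      # y > 0
    (-1.0, -1.0, -4.0),   # x + y < 4, written as -x - y > -4
]


def count_inside(half_spaces, x, y):
    """Number of half-spaces that contain the point (the '>' query)."""
    return sum(1 for a, b, c in half_spaces if a * x + b * y > c)


def count_outside(half_spaces, x, y):
    """Number of half-spaces that exclude the point (the '<' query)."""
    return sum(1 for a, b, c in half_spaces if a * x + b * y < c)


def in_convex(half_spaces, x, y):
    """A point is in the convex if it is not outside any half-space."""
    return count_outside(half_spaces, x, y) == 0


if __name__ == "__main__":
    for point in [(1, 1), (5, 1), (-1, 6)]:
        print(point, count_inside(TRIANGLE, *point), count_outside(TRIANGLE, *point))
```

Note that `in_convex` treats points exactly on a line as inside, because they are not strictly outside it.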


Domain is Union of Convex Hulls

• Simple volumes are unions of convex hulls.

• Higher order curves also work

• Complex volumes have holes and their holes have holes. (that is harder).

[Figure: example region labeled “Not a convex hull”]


Now in Relational Terms

    create table HalfSpace (
        domainID    int not null            -- domain name
            foreign key references Domain(domainID),
        convexID    int not null,           -- grouping a set of ½ spaces
        halfSpaceID int identity(),         -- a particular ½ space
        a           float not null,         -- the (a,b,..) parameters
        b           float not null,         -- defining the ½ space
        c           float not null,         -- the constraint (“c” above)
        primary key (domainID, convexID, halfSpaceID)
    )

(x,y) inside a convex if it is inside all lines of the convex
(x,y) inside a convex if it is NOT OUTSIDE ANY line of the convex

Convexes containing point (@x,@y):

    select convexID                     -- return the convex hulls
    from HalfSpace                      -- from the constraints
    where (@x * a + @y * b) < c         -- point outside the line?
    group by all convexID               -- insist no line of convex
    having count(*) = 0                 -- is outside (count outside == 0)
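The containment query above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite has no `group by all`, so a conditional sum per convex is swapped in to count the "outside" half-spaces, and the table contents (a triangle and a rectangle) are made-up illustration data.

```python
import sqlite3


def build_demo():
    """Create an in-memory HalfSpace table holding two hypothetical convexes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        create table HalfSpace (
            domainID    integer not null,
            convexID    integer not null,
            halfSpaceID integer not null,
            a real not null, b real not null, c real not null,
            primary key (domainID, convexID, halfSpaceID)
        )
        """
    )
    rows = [
        # convex 1: triangle with vertices (0,0), (4,0), (0,4)
        (1, 1, 1, 1.0, 0.0, 0.0),
        (1, 1, 2, 0.0, 1.0, 0.0),
        (1, 1, 3, -1.0, -1.0, -4.0),
        # convex 2: rectangle 3 < x < 5, 0 < y < 2
        (1, 2, 1, 1.0, 0.0, 3.0),
        (1, 2, 2, -1.0, 0.0, -5.0),
        (1, 2, 3, 0.0, 1.0, 0.0),
        (1, 2, 4, 0.0, -1.0, -2.0),
    ]
    conn.executemany("insert into HalfSpace values (?, ?, ?, ?, ?, ?)", rows)
    return conn


def convexes_containing(conn, x, y):
    """Return the convexIDs for which no half-space excludes the point (x, y)."""
    rows = conn.execute(
        """
        select convexID
        from HalfSpace
        group by convexID
        having sum(case when (? * a + ? * b) < c then 1 else 0 end) = 0
        order by convexID
        """,
        (x, y),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo()
    for point in [(1, 1), (3.5, 0.25), (10, 10)]:
        print(point, convexes_containing(conn, *point))
```

The conditional sum visits every half-space row, which matches the slide's intent of a cheap sequential pass over the constraints.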


All Domains Containing this Point

• The group by is supported by the domain/convex index, so it’s a sequential scan (pre-sorted!).

    select distinct domainID            -- return domains
    from HalfSpace                      -- from constraints
    where (@x * a + @y * b) < c         -- point outside
    group by all domainID, convexID     -- never happens
    having count(*) = 0                 -- count outside == 0


The Algebra is Simple (Boolean)

    @domainID    = spDomainNew (@type varchar(16), @comment varchar(8000))
    @convexID    = spDomainNewConvex (@domainID int)
    @halfSpaceID = spDomainNewConvexConstraint (@domainID int, @convexID int, @a float, @b float, @c float)
    @returnCode  = spDomainDrop(@domainID)

    select * from fDomainsContainPoint(@x float, @y float)

Once constructed they can be manipulated with the Boolean operations.

    @domainID = spDomainOr  (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000))
    @domainID = spDomainAnd (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000))
    @domainID = spDomainNot (@domainID1 int, @type varchar(16), @comment varchar(8000))
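Why the Boolean algebra is simple can be seen in a small in-memory Python sketch. It is not the implementation behind the stored procedures above; it only assumes that a domain is a union of convexes and a convex is an intersection of half-spaces (a, b, c) holding a*x + b*y > c. All function names here are hypothetical.

```python
from itertools import product

# A half-space (a, b, c) holds the points with a*x + b*y > c.
# A convex is a list of half-spaces (their intersection).
# A domain is a list of convexes (their union).


def contains(domain, x, y):
    """True if some convex of the domain has every half-space holding (x, y)."""
    return any(all(a * x + b * y > c for a, b, c in convex) for convex in domain)


def domain_or(d1, d2):
    """Union: pool the convexes of both domains."""
    return d1 + d2


def domain_and(d1, d2):
    """Intersection: intersect every convex of d1 with every convex of d2."""
    return [c1 + c2 for c1, c2 in product(d1, d2)]


def domain_not(d):
    """Complement via De Morgan: AND together the complement of each convex.

    The complement of one convex is the union of its reversed half-spaces.
    Points exactly on a boundary line belong to neither side in this sketch.
    """
    result = [[]]  # one convex with no constraints is the whole plane
    for convex in d:
        flipped = [[(-a, -b, -c)] for a, b, c in convex]
        result = domain_and(result, flipped)
    return result


if __name__ == "__main__":
    left = [[(1, 0, 0), (-1, 0, -2), (0, 1, 0), (0, -1, -2)]]    # 0<x<2, 0<y<2
    right = [[(1, 0, 1), (-1, 0, -3), (0, 1, 0), (0, -1, -2)]]   # 1<x<3, 0<y<2
    print(contains(domain_and(left, right), 1.5, 1))
    print(contains(domain_not(left), 5, 5))
```

The complement can grow quickly (one convex per combination of reversed half-spaces), so a real system would likely prune empty convexes.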


What! No Bounding Box?

• Bounding box limits search to a subset of the convex hulls.

• If query runs at 3M half-space/sec then no need for bounding box, unless you have more than 10,000 lines.

• But, if you have a lot of half-spaces then bounding box is good.


OK: “solved” Areas Contain Point? What about: Points near point?

• Table-valued function finds points near a point
  – Select * from fGetNearbyEq(ra,dec,r)
• Use Hierarchical Triangular Mesh www.sdss.jhu.edu/htm/
  – Space filling curve, bounding triangles…
  – Standard approach
• 13 ms/call… So 70 objects/second.
• Too slow, so pre-compute neighbors: Materialized view.
• At 70 objects/sec: takes 6 months to compute materialized view on billion objects.
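The "6 months" figure follows from simple arithmetic, sketched here in Python. The 30-day month is an assumption made only for the conversion.

```python
MS_PER_CALL = 13                 # measured cost of one neighbor lookup
OBJECTS = 1_000_000_000          # a billion objects
RATE_FROM_SLIDE = 70             # objects/second, the rounded figure quoted above

calls_per_second = 1000 / MS_PER_CALL      # about 77 per second before rounding down
seconds = OBJECTS / RATE_FROM_SLIDE        # about 14.3 million seconds
days = seconds / 86_400                    # about 165 days
months = days / 30                         # about 5.5 months (30-day months assumed)

if __name__ == "__main__":
    print(f"{calls_per_second:.0f} calls/s, {days:.0f} days, {months:.1f} months")
```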


Zone Based Spatial Join

• Divide space into zones
• Key points by Zone, offset (on the sphere this needs a wrap-around margin.)
• Point search: look in a few zones at a limited offset: ra ± r, a bounding box that has 1-π/4 false positives
• All inside the relational engine
• Avoids “impedance mismatch”
• Can “batch” all-all comparisons
• 33x faster and parallel: 6 days, not 6 months!

[Figure: a search circle of radius r near the zone boundary zoneMax, searched over ra ± x; labels: r, ra-zoneMax, √(r²+(ra-zoneMax)²), cos(radians(zoneMax)), zoneMax, x]


In SQL: points near point

    select o1.objID                                   -- find objects
    from zone o1                                      -- in the zoned table
    where o1.zoneID between                           -- where zone #
              floor((@dec-@r)/@zoneHeight) and        -- overlaps the circle
              ceiling((@dec+@r)/@zoneHeight)
      and o1.ra  between @ra - @r and @ra + @r        -- quick filter on ra
      and o1.dec between @dec-@r and @dec+@r          -- quick filter on dec
      and sqrt(power(o1.cx-@cx,2)+power(o1.cy-@cy,2)+power(o1.cz-@cz,2))
              < @r                                    -- careful filter on distance

The quick filters form the bounding box; the careful filter eliminates the ~21% = 1-π/4 false positives.
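The zone query above can be exercised with Python's sqlite3 module. This is a simplified sketch: it works in a flat plane (no spherical geometry, no wrap-around), uses ra/dec directly instead of the cx, cy, cz unit vectors, computes the zone bounds in Python, and the zone height and sample points are made up.

```python
import math
import sqlite3

ZONE_HEIGHT = 1.0  # hypothetical zone height, in the same units as dec


def build_zone_table(points):
    """points: iterable of (objID, ra, dec). Flat-plane sketch, no wrap-around."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table zone (zoneID integer, ra real, dec real, objID integer,"
        " primary key (zoneID, ra, objID))"
    )
    conn.executemany(
        "insert into zone values (?, ?, ?, ?)",
        [(math.floor(dec / ZONE_HEIGHT), ra, dec, obj) for obj, ra, dec in points],
    )
    return conn


def nearby(conn, ra, dec, r):
    """Return the objIDs strictly closer than r to the point (ra, dec)."""
    rows = conn.execute(
        """
        select objID from zone
        where zoneID between ? and ?       -- zones overlapping the circle
          and ra  between ? and ?          -- quick filter on ra
          and dec between ? and ?          -- quick filter on dec
          and (ra - ?) * (ra - ?) + (dec - ?) * (dec - ?) < ? * ?  -- careful filter
        order by objID
        """,
        (
            math.floor((dec - r) / ZONE_HEIGHT), math.ceil((dec + r) / ZONE_HEIGHT),
            ra - r, ra + r,
            dec - r, dec + r,
            ra, ra, dec, dec, r, r,
        ),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    pts = [(1, 0.0, 0.0), (2, 0.5, 0.5), (3, 0.9, 0.9), (4, 3.0, 0.0), (5, 0.0, -0.8)]
    print(nearby(build_zone_table(pts), 0.0, 0.0, 1.0))
```

Object 3 in the sample data sits in a corner of the bounding box: it passes both quick filters and is rejected only by the careful distance filter, which is the false-positive case the slide describes.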


Quantitative Evaluation: 7x faster than external stored proc (linkage is expensive)

• Time vs. radius for neighbors function @ various zone heights. Any small zone height is adequate.
• Time vs. best time @ various radius. A zoneHeight of 4′ is near-optimal.

[Figure: “Rows vs elapsed time”, fit is 1.46+2.2e-4*r^2 ms/asec. Log-log plot of time (sec) vs. r (asec), one curve per zone height: 7.5 asec, 15 asec, 30 asec, 1 amin, 2 amin, 4 amin, 64 amin, plus the r^2 fit.]

[Figure: “Relative time vs zone height (asec)”: 4 minute zone is near optimal, 2 & 8 minute are slower. Plot of time vs. best (1.00 to 2.00) against r (asec), one curve per zone height: 7 asec, 15 asec, 30 asec, 1 amin, 2 amin, 4 amin, 16 amin, 1 degree.]


All Neighbors of All points (can Batch Process the Joins)

• A 5x additional speedup (35x in total)

    for @deltaZone in {-1, 0, 1}    -- example ignores some spherical geometry details in paper
      insert neighbors                                   -- insert one zone's neighbors
      select o1.objID as objID,                          -- object pairs
             o2.objID as NeighborObjID,
             .. other fields elided
      from zone o1 join zone o2                          -- join 2 zones
        on o1.zoneID-@deltaZone = o2.zoneID              -- using zone number and ra
       and o2.ra between o1.ra - @r and o1.ra + @r       -- points near ra
      where -- elided margin logic, see paper.
        and o2.dec between o1.dec-@r and o1.dec+@r       -- quick filter on dec
        and sqrt(power(o1.x-o2.x,2)+power(o1.y-o2.y,2)+power(o1.z-o2.z,2))
              < @r                                       -- careful filter on distance
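The batch join above can be sketched with Python's sqlite3 module. As with the slide's own example, the margin logic and the spherical geometry are left out: this version works in a flat plane, assumes the zone height is at least r so that only adjacent zones need joining, and excludes self-pairs. The function name and sample points are hypothetical.

```python
import math
import sqlite3


def all_neighbors(points, r, zone_height):
    """Return sorted (objID, neighborObjID) pairs closer than r.

    points: iterable of (objID, ra, dec). Flat plane; requires zone_height >= r.
    """
    if zone_height < r:
        raise ValueError("this sketch only joins adjacent zones; need zone_height >= r")
    conn = sqlite3.connect(":memory:")
    conn.execute("create table zone (zoneID integer, ra real, dec real, objID integer)")
    conn.execute("create index zone_ra on zone (zoneID, ra)")
    conn.execute("create table neighbors (objID integer, neighborObjID integer)")
    conn.executemany(
        "insert into zone values (?, ?, ?, ?)",
        [(math.floor(dec / zone_height), ra, dec, obj) for obj, ra, dec in points],
    )
    for delta_zone in (-1, 0, 1):          # the zone below, the same zone, the zone above
        conn.execute(
            """
            insert into neighbors
            select o1.objID, o2.objID
            from zone o1 join zone o2
              on o1.zoneID - ? = o2.zoneID
             and o2.ra between o1.ra - ? and o1.ra + ?
            where o1.objID <> o2.objID
              and o2.dec between o1.dec - ? and o1.dec + ?
              and (o1.ra - o2.ra) * (o1.ra - o2.ra)
                + (o1.dec - o2.dec) * (o1.dec - o2.dec) < ? * ?
            """,
            (delta_zone, r, r, r, r, r, r),
        )
    return sorted(conn.execute("select objID, neighborObjID from neighbors").fetchall())


if __name__ == "__main__":
    pts = [(1, 0.0, 0.0), (2, 0.3, 0.9), (3, 0.3, 1.2), (4, 5.0, 5.0)]
    print(all_neighbors(pts, r=0.5, zone_height=1.0))
```

Objects 2 and 3 in the sample data are neighbors that fall in different zones, so they are found only by the ±1 passes of the loop.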


Spatial Stuff Summary

• Easy
  – Point in polygon
  – Polygons containing points
  – (instance and batch)

• Works in higher dimensions

• Side note: Spherical polygons are
  – hard in 2-space
  – Easy in 3-space


Spatial Stuff Summary

• Constraint databases are in
  – Streams (data is query, query is in DB)
  – Notification: subscription in DB, data is query
  – Spatial: constraints in DB, data is query

• You can express constraints as rows

• Then You
  – Can evaluate LOTS of predicates per second
  – Can do set algebra on the predicates.

• Benefits from SQL parallelism

• SQL == Prolog // DataLog?


References

• Representing Polygon Areas and Testing Point-in-Polygon Containment in a Relational Database, http://research.microsoft.com/~Gray/papers/Polygon.doc

• A Purely Relational Way of Computing Neighbors on a Sphere, http://research.microsoft.com/~Gray/papers/Neighbors.doc


Some Database Topics

• Sparse tables:
  – column vs row store
  – tag and index tables
  – pivot

• Maplist (cross apply)

• Dealing with bad statistics


Column Store Pyramid

• Users see fat base tables (universal relation)
• Define popular columns index (tag table): 10% ~ 100 columns
• Make many skinny indices: 1% ~ 10 columns
• Query optimizer picks right plan
• Automate definition & use
• Fast read, slow insert/update
• Data warehouse
• Note: prior to Yukon, index had 16 column limit. A bane of my existence.

[Figure: pyramid of BASE, INDICES and TAG layers, annotated with query classes: Simple, Typical Semi-join, Fat query, Obese query]


Examples

create table base (id bigint, f1 int primary key,f2 int, …,f1000 int)

create index tag on base (id) include (f1, …, f100)

create index skinny on base(f2,…f17)
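The effect of a tag index can be observed with Python's sqlite3 module. This is a sketch under assumptions: SQLite has no `include` clause, so the popular columns are listed as extra key columns instead, and the table is cut down to a few hypothetical columns. The query plan reports a covering index when the base table is never touched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A cut-down "fat" base table: a few popular columns plus one wide column.
conn.execute("create table base (id integer, f1 integer, f2 integer, f3 integer, fat text)")
# SQLite has no INCLUDE clause, so the tag index carries the popular
# columns as trailing key columns instead.
conn.execute("create index tag on base (id, f1, f2)")


def plan(sql):
    """Return the detail text of SQLite's query plan for `sql`."""
    return " ".join(row[3] for row in conn.execute("explain query plan " + sql))


if __name__ == "__main__":
    # Answered from the tag index alone: the plan mentions a COVERING INDEX.
    print(plan("select f1, f2 from base where id = 42"))
    # Needs the wide column, so the base table must also be read.
    print(plan("select fat from base where id = 42"))
```

Queries that touch only popular columns never read the fat rows, which is the point of the pyramid: fast reads at the price of maintaining extra indices on insert and update.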

[Figure: pyramid of BASE, INDICES and TAG layers, annotated with query classes: Simple, Typical Semi-join, Fat query, Obese query]


A Semi-Join Example

    create table fat(a int primary key, b int, c int, fat char (988))
    declare @i int, @j int; set @i = 0
    again:
        insert fat values(@i, cast(100*rand() as int), cast(100*rand() as int), ' ')
        set @i = @i + 1;
        if (@i < 1000000) goto again
    create index ab on fat(a,b)
    create index ac on fat(a,c)

    dbcc dropcleanbuffers with no_infomsgs
    select count(*) from fat with(index (0)) where c = b
    -- Table 'fat'. Scan 3, reads 137,230, CPU : 1.3 s, elapsed 31.1s.

    dbcc dropcleanbuffers with no_infomsgs
    select count(*) from fat where b=c
    -- Table 'fat'. Scan 2, reads: 3,482 CPU 1.1 s, elapsed: 1.4 s.

[Figure: evaluating b=c against the 1GB fat table costs 137 K IO, 31 sec; evaluating b=c by joining the 8MB ab and 8MB ac indices costs 3.4K IO, 1.4 sec]
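The measurements above can be turned into data volumes with a little Python arithmetic. The sketch assumes each reported read is one 8 KB SQL Server page.

```python
PAGE_BYTES = 8 * 1024        # SQL Server page size; assumes one page per read
scan_reads = 137_230         # forced base-table scan, from the slide
index_reads = 3_482          # plan that uses the ab and ac indices, from the slide

scan_mb = scan_reads * PAGE_BYTES / 2**20      # about 1,072 MB: the whole fat table
index_mb = index_reads * PAGE_BYTES / 2**20    # about 27 MB: just the two indices
io_ratio = scan_reads / index_reads            # about 39x fewer reads
time_ratio = 31.1 / 1.4                        # about 22x less elapsed time

if __name__ == "__main__":
    print(f"scan {scan_mb:.0f} MB, indices {index_mb:.0f} MB")
    print(f"{io_ratio:.0f}x fewer reads, {time_ratio:.0f}x faster")
```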


Moving From Rows to Columns: Pivot & UnPivot

What if the table is sparse? LDAP has 7 mandatory and 1,000 optional attributes

Store row, col, value

    create table Features (
        object    varchar,
        attribute varchar,
        value     varchar,
        primary key (object, attribute))

    select *
    from (features pivot value on attribute in (year, color)) as T
    where object = '4PNC450'

Features
    object    attribute  value
    ●●●●
    4PNC450   year       2000
    4PNC450   color      white
    4PNC450   make       Ford
    4PNC450   model      Taurus
    ●●●●

T
    Object    year   color
    4PNC450   2000   white
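The pivot above can be reproduced with Python's sqlite3 module. SQLite has no pivot operator, so this sketch swaps in conditional aggregation (one `max(case ...)` per output column), a common way to express the same result in plain SQL. The function name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Features (object text, attribute text, value text,"
    " primary key (object, attribute))"
)
conn.executemany(
    "insert into Features values (?, ?, ?)",
    [
        ("4PNC450", "year", "2000"),
        ("4PNC450", "color", "white"),
        ("4PNC450", "make", "Ford"),
        ("4PNC450", "model", "Taurus"),
    ],
)


def pivot_year_color(obj):
    """Turn the (attribute, value) rows of one object into year and color columns."""
    return conn.execute(
        """
        select object,
               max(case when attribute = 'year'  then value end) as year,
               max(case when attribute = 'color' then value end) as color
        from Features
        where object = ?
        group by object
        """,
        (obj,),
    ).fetchall()


if __name__ == "__main__":
    print(pivot_year_color("4PNC450"))
```

An attribute that is missing for an object comes back as NULL (None in Python), which is how a sparse row looks after pivoting.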


Maplist Meets SQL – cross apply

• Your table-valued function F(a,b,c) returns all objects related to a,b,c.

• spatial neighbors,

• sub-assemblies,

• members of a group,

• items in a folder,…

• Apply this function to each row

• Classic drill-down

use outer apply if f() may be null

    select p.*, q.*
    from parent as p cross apply f(p.a, p.b, p.c) as q
    where p.type = 1

[Figure: parent rows p1, p2, …, pn, each expanded into its result set f(p1), f(p2), …, f(pn)]
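The maplist behavior of cross apply and outer apply can be sketched in plain Python. The helper names and the folder data are illustrative only; the semantics follow the description above.

```python
def cross_apply(rows, f):
    """Yield (row, result) for every result f returns; rows with no results vanish."""
    for row in rows:
        for result in f(row):
            yield row, result


def outer_apply(rows, f):
    """Like cross_apply, but keep rows for which f returns nothing, paired with None."""
    for row in rows:
        results = list(f(row))
        if not results:
            yield row, None
        for result in results:
            yield row, result


if __name__ == "__main__":
    folders = {"a": ["x", "y"], "b": []}      # hypothetical "items in a folder"
    items_in = lambda name: folders[name]
    print(list(cross_apply(["a", "b"], items_in)))
    print(list(outer_apply(["a", "b"], items_in)))
```

The difference mirrors inner versus outer join: folder "b" is empty, so it disappears under cross apply and survives with a null under outer apply.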


When SQL Optimizer Guesses Wrong, Life is DREADFUL

• SQL is a non-procedural language.

• The compiler/optimizer picks the procedure based on statistics.

• If the stats are wrong or missing…. Bad things happen. Queries can run VERY slowly.

• Strategy 1: allow users to specify plan.

• Strategy 2: make the optimizer smarter (and accept hints from the user.)


An Example of the Problem

• A query selects some fields of an index and of huge table.

• Bookmark plan:
  – look in index for a subset.
  – Lookup subset in Fat table.

• This is
  – great if subset << table.
  – terrible if subset ~ table.

• If statistics are wrong, or if predicates not independent, you get the wrong plan.

• How to fix the statistics?

[Figure: Index and Huge table]


A Fix: Let user ask for stats

• Create Statistics on View(f1,..,fn)

• Then the optimizer has the right data. Picks the right plan.
  Statistics on Views, C. Galindo-Legaria, M. Joshi, F. Waas, M. Wu, VLDB 2003

• Q3: Select count(*) from Galaxy where r < 22 and r_extinction > 0.120
    Bookmark: 34 M random IO, 520 minutes
    Create Statistics on Galaxy(objID)
    Scan: 5 M sequential IO, 18 minutes

• Ultimately this should be automated, but for now,… it’s a step in the right direction.
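The Q3 numbers above show why the bookmark plan loses when the subset is large. A back-of-envelope Python sketch, using only the figures quoted on the slide:

```python
bookmark_ios, bookmark_minutes = 34_000_000, 520   # random IO plan
scan_ios, scan_minutes = 5_000_000, 18             # sequential scan plan

bookmark_rate = bookmark_ios / (bookmark_minutes * 60)   # about 1,090 random IO/s
scan_rate = scan_ios / (scan_minutes * 60)               # about 4,630 sequential IO/s
speedup = bookmark_minutes / scan_minutes                # about 29x faster elapsed

if __name__ == "__main__":
    print(f"bookmark: {bookmark_rate:.0f} IO/s, scan: {scan_rate:.0f} IO/s")
    print(f"scan is {speedup:.0f}x faster")
```

The scan wins on both counts: it issues about 7x fewer IOs and completes each one about 4x faster because the IOs are sequential.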


Two (?) Talks

• Distributed Computing Economics

• Online Science (what I have been doing).