CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures ›...

77
CSC 261/461 Database Systems Lecture 19 Fall 2017 CSC 261, Fall 2017, UR

Transcript of CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures ›...

Page 1: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

CSC 261/461 – Database SystemsLecture 19

Fall 2017

CSC261,Fall2017,UR

Page 2: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Announcements

• CIRC:– CIRC is down!!!–MongoDB and Spark (mini) projects are at stake. L

• Project 1 Milestone 4– is out– Due date: Last date of class• We will check your website after that date• But, finish early

• Due Dates:– Suggestions:

CSC261,Fall2017,UR

Page 3: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Due Dates

• 11/12 to 11/18

• 11/19 to 11/25 (Thanksgiving Week)

• 11/26 to 12/02

• 12/03 to 12/09: – Term Paper Due: 12/08

• 12/10 to 12/13 (Last Class): – Poster Session on: 12/11 – Project 1 Milestone 4 is due on 12/13

• Final: December 18, 2017 at 7:15 pm

CSC261,Fall2017,UR

MongoDB

Spark

TermPaper

PosterSession

Page 4: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Topics for Today

• Query Processing (Chapter 18) • Query Optimization (Chapter 19) on Wednesday

CSC261,Fall2017,UR

Page 5: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

QUERY PROCESSING

CSC261,Fall2017,UR

Page 6: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Steps in Query Processing

• Scanning

• Parsing

• Validation

• Query Tree Creation

• Query Optimization (Query planning)

• Code generation (to execute the plan)

• Running the query code

CSC261,Fall2017,UR

Page 7: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Steps in Query Processing

CSC261,Fall2017,UR

Page 8: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

SQL Queries

• SQL Queries are decomposed into Query blocks:– Select…From…Where…Group By…Having

• Translate Query blocks into Relational Algebraic expression

• Remember, SQL includes aggregate operators:–MIN, MAX, SUM, COUNT etc.– Part of the extended algebra– Let’s go back to Chapter 8 (Section 8.4.2)

CSC261,Fall2017,UR

Page 9: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Aggregate Functions and Grouping (Relational Algebra)

• Aggregate function: ℑ

•< 𝑔𝑟𝑜𝑢𝑝𝑖𝑛𝑔𝑎𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝑠 >

ℑ< 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛𝑙𝑖𝑠𝑡 >

(R)

CSC261,Fall2017,UR

Dno ℑ COUNT Ssn, AVERAGE Salary(EMPLOYEE).

Page 10: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Semijoin (⋉)

• R ⋉ S = P A1,…,An (R ⋈ S)• Where A1, …, An are the

attributes in R• Example:– Employee ⋉Dependents

SELECT DISTINCTsid,sname,gpa

FROM Students,People

WHEREsname = pname;

SQL:

RA:𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋉ 𝑃𝑒𝑜𝑝𝑙𝑒

Students(sid,sname,gpa)People(ssn,pname,address)

SELECT DISTINCTsid,sname,gpa

FROM Students

WHEREsname IN

(SELECT pname FROM People);

OR

CSC261,Fall2017,UR

Page 11: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

EXTERNAL SORTING

CSC261,Fall2017,UR

Page 12: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Sort

Page 13: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Why are Sort Algorithms Important?

• Data requested from DB in sorted order is extremely common– e.g., find students in increasing GPA order

• Why not just use quicksort in main memory??–What about if we need to sort 1TB of data with 1GB of

RAM…

Aclassicproblemincomputerscience!

Page 14: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

So how do we sort big files?

1. Split into chunks small enough to sort in memory (“runs”)

2. Merge pairs (or groups) of runs using the external merge algorithm

3. Keep merging the resulting runs (each time = a “pass”) until left with one sorted file!

Page 15: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

2. EXTERNAL MERGE & SORT

Page 16: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Challenge: Merging Big Files with Small Memory

How do we efficiently merge two sorted files when both are much larger than our main memory buffer?

Page 17: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Algorithm

• Input: 2 sorted lists of length M and N

• Output: 1 sorted list of length M + N

• Required: At least 3 Buffer Pages

• IOs: 2(M+N)

Page 18: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Key (Simple) Idea

To find an element that is no larger than all elements in two lists, one only needs to compare minimum elements from each list.

If:𝐴: ≤ 𝐴< ≤ ⋯ ≤ 𝐴>𝐵: ≤ 𝐵< ≤ ⋯ ≤ 𝐵@

Then:𝑀𝑖𝑛(𝐴:, 𝐵:) ≤ 𝐴E𝑀𝑖𝑛(𝐴:, 𝐵:) ≤ 𝐵F

fori=1….Nandj=1….M

Page 19: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Algorithm

7,11 20,31

23,24 25,30

Input:Twosortedfiles

Output:Onemergedsortedfile

Disk

MainMemory

Buffer1,5

2,22

F1

F2

Page 20: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Algorithm

7,11 20,31

23,24 25,30

Disk

MainMemory

Buffer

1,5 2,22Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F2

Page 21: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Algorithm

7,11 20,31

23,24 25,30

Disk

MainMemory

Buffer

5 22 1,2Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F2

Page 22: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Algorithm

7,11 20,31

23,24 25,30

Disk

MainMemory

Buffer

5 22

1,2

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F2

Page 23: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Algorithm

20,31

23,24 25,30

Disk

MainMemory

Buffer

522

1,2

Thisisallthealgorithm“sees”…Whichfiletoloadapagefromnext?

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F2

7,11

Page 24: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Algorithm

20,31

23,24 25,30

Disk

MainMemory

Buffer

522

1,2

WeknowthatF2 onlycontainsvalues≥ 22…soweshouldloadfromF1!

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F2

7,11

Page 25: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Algorithm

20,31

23,24 25,30

Disk

MainMemory

Buffer

522

1,2

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F27,11

Page 26: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Algorithm

20,31

23,24 25,30

Disk

MainMemory

Buffer

5,722

1,2

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F211

Page 27: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Algorithm

20,31

23,24 25,30

Disk

MainMemory

Buffer

5,7

22

1,2

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F211

Page 28: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Algorithm

23,24 25,30

Disk

MainMemory

Buffer

5,7

22

1,2

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F211

20,31

Andsoon…

Page 29: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

We can merge lists of arbitrary length with only 3 buffer pages.

IflistsofsizeMandN,thenCost: 2(M+N)IOs

Eachpageisreadonce,writtenonce

Page 30: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Sort Algorithm

27,24 3,1

Example:• 3Buffer

pages• 6-pagefile

Disk MainMemory

Buffer

18,22

F1

F2

33,12 55,3144,10

1. Split into chunks small enough to sort in memory

Orangefile=unsorted

Page 31: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

EXTERNAL MERGE SORT (BEFORE MERGE)

CSC261,Spring2017,UR

Page 32: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Sort Algorithm

27,24 3,1

Disk MainMemory

Buffer

18,22

F1

F2

33,12 55,3144,10

1. Split into chunks small enough to sort in memory

Example:• 3Buffer

pages• 6-pagefile

Orangefile=unsorted

Page 33: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Sort Algorithm

27,24 3,1

Disk MainMemory

Buffer

18,22

F1

F233,12 55,3144,10

1. Split into chunks small enough to sort in memory

Example:• 3Buffer

pages• 6-pagefile

Orangefile=unsorted

Page 34: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Sort Algorithm

27,24 3,1

Disk MainMemory

Buffer

18,22

F1

F231,33 44,5510,12

Example:• 3Buffer

pages• 6-pagefile

1. Split into chunks small enough to sort in memory

Orangefile=unsorted

Page 35: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Sort Algorithm

Disk MainMemory

BufferF1

F2

31,33 44,5510,12

AndsimilarlyforF2

27,24 3,118,2218,22 24,271,3

1. Splitintochunkssmallenoughtosortinmemory

Example:• 3Buffer

pages• 6-pagefileEachsortedfileisacalledarun

Page 36: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

External Merge Sort Algorithm

Disk MainMemory

BufferF1

F2

2. Now just run the external merge algorithm & we’re done!

31,33 44,5510,12

18,22 24,271,3

Example:• 3Buffer

pages• 6-pagefile

Page 37: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Calculating IO Cost

For 3 buffer pages, 6 page file:

1. Split into two 3-page files and sort in memory = 1 R + 1 W for each file = 2*(3 + 3) = 12 IO operations

2. Merge each pair of sorted chunks using the external merge algorithm = 2*(3 + 3) = 12 IO operations

3. Total cost = 24 IO

Page 38: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Running External Merge Sort on Larger Files

Disk

31,33 44,5510,12

18,43 24,2745,38

Assumewestillonlyhave3 bufferpages(Buffernotpictured)

31,33 47,5510,12

18,22 23,2041,3

31,33 39,5542,46

18,23 24,271,3

48,33 44,4010,12

18,22 24,2716,31

Page 39: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Running External Merge Sort on Larger Files

Disk

31,33 44,5510,12

18,43 24,2745,38

31,33 47,5510,12

18,22 23,2041,3

31,33 39,5542,46

18,23 24,271,3

48,33 44,4010,12

18,22 24,2716,31

1.Splitintofilessmallenoughtosortinbuffer…

Assumewestillonlyhave3 bufferpages(Buffernotpictured)

Page 40: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Running External Merge Sort on Larger Files

Disk

31,33 44,5510,12

27,38 43,4518,24

31,33 47,5510,12

20,22 23,413,18

39,42 46,5531,33

18,23 24,271,3

33,40 44,4810,12

22,24 27,3116,18

1.Splitintofilessmallenoughtosortinbuffer…andsort

Assumewestillonlyhave3 bufferpages(Buffernotpictured)

Calleachofthesesortedfilesarun

Page 41: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Running External Merge Sort on Larger Files

Disk

31,33 44,5510,12

27,38 43,4518,24

31,33 47,5510,12

20,22 23,413,18

39,42 46,5531,33

18,23 24,271,3

33,40 44,4810,12

22,24 27,3116,18

2.Nowmergepairsof(sorted)files…theresultingfileswillbesorted!

Disk

18,24 27,3110,12

43,44 45,5533,38

12,18 20,223,10

33,41 47,5523,31

18,23 24,271,3

39,42 46,5531,33

16,18 22,2410,12

33,40 44,4827,31

Assumewestillonlyhave3 bufferpages(Buffernotpictured)

Page 42: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Running External Merge Sort on Larger Files

Disk

31,33 44,5510,12

27,38 43,4518,24

31,33 47,5510,12

20,22 23,413,18

39,42 46,5531,33

18,23 24,271,3

33,40 44,4810,12

22,24 27,3116,18

3.Andrepeat…

Disk

18,24 27,3110,12

43,44 45,5533,38

12,18 20,223,10

33,41 47,5523,31

18,23 24,271,3

39,42 46,5531,33

16,18 22,2410,12

33,40 44,4827,31

Disk

10,12 12,183,10

22,23 24,2718,20

33,33 38,4131,31

45,47 55,5543,44

10,12 16,181,3

23,24 24,2718,22

31,33 33,3927,31

44,46 48,5540,42

Assumewestillonlyhave3 bufferpages(Buffernotpictured)

Calleachofthesestepsapass

Page 43: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Running External Merge Sort on Larger Files

Disk

31,33 44,5510,12

27,38 43,4518,24

31,33 47,5510,12

20,22 23,413,18

39,42 46,5531,33

18,23 24,271,3

33,40 44,4810,12

22,24 27,3116,18

4.Andrepeat!

Disk

18,24 27,3110,12

43,44 45,5533,38

12,18 20,223,10

33,41 47,5523,31

18,23 24,271,3

39,42 46,5531,33

16,18 22,2410,12

33,40 44,4827,31

Disk

10,12 12,183,10

22,23 24,2718,20

33,33 38,4131,31

45,47 55,5543,44

10,12 16,181,3

23,24 24,2718,22

31,33 33,3927,31

44,46 48,5540,42

Disk

3,10 10,101,3

12,16 18,1812,12

20,22 22,2318,18

24,24 27,2723,24

31,31 31,3327,31

33,38 39,4033,33

43,44 44,4541,42

48,55 55,5546,47

Page 44: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Simplified 3-page Buffer Version

Assume for simplicity that we split an N-page file into N single-page runs and sort these; then:

• First pass: Merge N/2 pairs of runs each of length 1 page

• Second pass: Merge N/4 pairs of runs each of length 2 pages

• In general, for N pages, we do 𝒍𝒐𝒈𝟐 𝑵 passes– +1 for the initial split & sort

• Each pass involves reading in & writing out all the pages = 2N IO

Unsortedinputfile

Split&sort

Merge

Merge

Sorted!

à 2N*( 𝒍𝒐𝒈𝟐 𝑵 +1)totalIOcost!

Page 45: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Using B+1 buffer pages to reduce # of passes

Suppose we have B+1 buffer pages now; we can:

1. Increase length of initial runs. Sort B+1 at a time!At the beginning, we can split the N pages into runs of length B+1 and sort these in memory

2𝑁( log< 𝑁 + 1)

IOCost:

Startingwithrunsoflength1

2𝑁( log<𝑵

𝑩 + 𝟏 + 1)

StartingwithrunsoflengthB+1

Page 46: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Using B+1 buffer pages to reduce # of passes

Suppose we have B+1 buffer pages now; we can:

2. Perform a B-way merge. On each pass, we can merge groups of B runs at a time (vs. merging pairs of runs)!

IOCost:

2𝑁( log< 𝑁 + 1) 2𝑁( log<𝑵

𝑩 + 𝟏 + 1)

Startingwithrunsoflength1

StartingwithrunsoflengthB+1

2𝑁( logV𝑵

𝑩 + 𝟏 + 1)

PerformingB-waymerges

Page 47: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Algorithm fro Select Operation

• Read Section 18.3 (18.3.1 , 18.3.2, 18.3.3, 18.3.4)• Mostly covers searching:

• 1. Linear Search• 2. Binary Search• 3. Indexing• 4. Hashing• 5. B+ Tree

• (Skip bitmap index and functional index)

CSC261,Fall2017,UR

Page 48: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Algorithm for Join Operation

• The most time consuming operation

CSC261,Fall2017,UR

Page 49: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

What you will learn about in this section

1. NestedLoopJoin(NLJ)

2. BlockNestedLoopJoin(BNLJ)

3. IndexNestedLoopJoin(INLJ)

4. Sorted-MergeJoin

5. HashJoin

CSC261,Fall2017,UR

Page 50: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

RECAP: Joins

CSC261,Fall2017,UR

Page 51: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Joins: Example

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R SA B C D

2 3 4 2

𝐑 ⋈ 𝑺 SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

CSC261,Fall2017,UR

Page 52: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Joins: Example

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R SA B C D

2 3 4 2

2 3 4 3

𝐑 ⋈ 𝑺 SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

CSC261,Fall2017,UR

Page 53: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Joins: Example

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R SA B C D

2 3 4 2

2 3 4 3

2 5 2 2

𝐑 ⋈ 𝑺 SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

CSC261,Fall2017,UR

Page 54: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Joins: Example

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R SA B C D

2 3 4 2

2 3 4 3

2 5 2 2

2 5 2 3

𝐑 ⋈ 𝑺 SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

CSC261,Fall2017,UR

Page 55: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Joins: Example

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R SA B C D

2 3 4 2

2 3 4 3

2 5 2 2

2 5 2 3

3 1 1 7

𝐑 ⋈ 𝑺 SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

CSC261,Fall2017,UR

Page 56: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Semantically: A Subset of the Cross Product

SELECT R.A,B,C,DFROM R, SWHERE R.A = S.A

Example: Returnsallpairsoftuplesr ∈ 𝑅, 𝑠 ∈ 𝑆suchthat𝑟. 𝐴 = 𝑠. 𝐴

A D

3 7

2 2

2 3

A B C

1 0 1

2 3 4

2 5 2

3 1 1

R S A B C D

2 3 4 2

2 3 4 3

2 5 2 2

2 5 2 3

3 1 1 7

×CrossProduct

Filterbyconditions(r.A =s.A)

… Canweactuallyimplementajoininthisway?

𝐑 ⋈ 𝑺

CSC261,Fall2017,UR

Page 57: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Notes

• We write 𝐑 ⋈ 𝑺 to mean join R and S by returning all tuple pairs where all shared attributes are equal

• We write 𝐑 ⋈ 𝑺 on A to mean join R and S by returning all tuple pairs where attribute(s) A are equal

• For simplicity, we’ll consider joins on two tables and with equality constraints (“equijoins”)

Howeverjoinscanmerge>2tables,andsomealgorithmsdosupportnon-equalityconstraints!

CSC261,Fall2017,UR

Page 58: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Nested Loop Joins

CSC261,Fall2017,UR

Page 59: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Notes

• We are again considering “IO aware” algorithms: care about disk IO

• Given a relation R, let:– T(R) = # of tuples in R– P(R) = # of pages in R

• Note also that we omit ceilings in calculations… good exercise to put back in!

Recallthatweread/writeentirepageswithdiskIO

CSC261,Fall2017,UR

Page 60: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Nested Loop Join (NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:

for s in S:if r[A] == s[A]:

yield (r,s)

CSC261,Fall2017,UR

Page 61: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Nested Loop Join (NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:

for s in S:if r[A] == s[A]:

yield (r,s)

P(R)

1. LoopoverthetuplesinR

NotethatourIOcostisbasedonthenumberofpages loaded,notthenumberoftuples!

Cost:

CSC261,Fall2017,UR

Page 62: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Nested Loop Join (NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:

for s in S:if r[A] == s[A]:

yield (r,s)

P(R)+T(R)*P(S)

HavetoreadallofSfromdiskforeverytupleinR!

1. LoopoverthetuplesinR

2. ForeverytupleinR,loopoverallthetuplesinS

Cost:

CSC261,Fall2017,UR

Page 63: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Nested Loop Join (NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:

for s in S:if r[A] == s[A]:

yield (r,s)

P(R)+T(R)*P(S)

NotethatNLJcanhandlethingsotherthanequalityconstraints…justcheckintheifstatement!

1. LoopoverthetuplesinR

2. ForeverytupleinR,loopoverallthetuplesinS

3. Checkagainstjoinconditions

Cost:

CSC261,Fall2017,UR

Page 64: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Nested Loop Join (NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:

for s in S:if r[A] == s[A]:

yield (r,s)

P(R)+T(R)*P(S)+OUT

1. LoopoverthetuplesinR

2. ForeverytupleinR,loopoverallthetuplesinS

3. Checkagainstjoinconditions

4. Writeout(topage,thenwhenpagefull,todisk)

Cost:

CSC261,Fall2017,UR

Page 65: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Nested Loop Join (NLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for r in R:

for s in S:if r[A] == s[A]:

yield (r,s)

P(R)+T(R)*P(S)+OUT

WhatifR(“outer”)andS(“inner”)switched?

Cost:

P(S)+T(S)*P(R)+OUT

Outervs.innerselectionmakesahugedifference-DBMSneedstoknowwhichrelationissmaller!

CSC261,Fall2017,UR

Page 66: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Block Nested Loop Join (BNLJ)

CSC261,Fall2017,UR

Page 67: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Block Nested Loop Join (BNLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each page pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:if r[A] == s[A]:

yield (r,s)

P(𝑅)

Given3pagesofmemory

1. Loadin1pageofRatatime(leaving1pageeachfreeforS&output)

Cost:

Note:Therecouldbesomespeeduphereduetothefactthatwe’rereadinginmultiplepagessequentiallyhoweverwe’llignorethishere!

CSC261,Fall2017,UR

Page 68: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each page pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:if r[A] == s[A]:

yield (r,s)

Block Nested Loop Join (BNLJ)

P 𝑅 + 𝑃 𝑅 . 𝑃(𝑆)

Given3pagesofmemory

Note:Fastertoiterateoverthesmaller relationfirst!

1. Loadin1pageofRatatime(leaving1pageeachfreeforS&output)

2. ForeachpagesegmentofR,loadeachpageofS

Cost:

CSC261,Fall2017,UR

Page 69: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each page pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:if r[A] == s[A]:

yield (r,s)

Block Nested Loop Join (BNLJ)

Given3 pagesofmemory

1. Loadin1pageofRatatime(leaving1pageeachfreeforS&output)

2. ForeachpagesegmentofR,loadeachpageofS

3. Checkagainstthejoinconditions

BNLJcanalsohandlenon-equalityconstraints

Cost:

CSC261,Fall2017,UR

P 𝑅 + 𝑃 𝑅 . 𝑃(𝑆)

Page 70: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each page pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:if r[A] == s[A]:

yield (r,s)

Block Nested Loop Join (BNLJ)

Given3pagesofmemory

1. Load1pageofRatatime(leaving1pageeachfreeforS&output)

2. ForeachpagesegmentofR,loadeachpageofS

3. Checkagainstthejoinconditions

4. Writeout

Cost:

CSC261,Fall2017,UR

P 𝑅 + 𝑃 𝑅 . 𝑃(𝑆)

Page 71: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Block Nested Loop Join (BNLJ) (B+1 pages of Memory)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each B-1 pages pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:if r[A] == s[A]:

yield (r,s)

P(𝑅)

GivenB+1 pagesofmemory

1. LoadinB-1pagesofRatatime(leaving1pageeachfreeforS&output)

Cost:

Note:Therecouldbesomespeeduphereduetothefactthatwe’rereadinginmultiplepagessequentiallyhoweverwe’llignorethishere!

CSC261,Fall2017,UR

Page 72: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Block Nested Loop Join (BNLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each B-1 pages pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:

if r[A] == s[A]:yield (r,s)

P 𝑅 +𝑃 𝑅𝐵 − 1𝑃(𝑆)

GivenB+1pagesofmemory

Note:Fastertoiterateoverthesmaller relationfirst!

1. LoadinB-1pagesofRatatime(leaving1pageeachfreeforS&output)

2. Foreach(B-1)-pagesegmentofR,loadeachpageofS

Cost:

CSC261,Fall2017,UR

Page 73: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Block Nested Loop Join (BNLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each B-1 pages pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:

if r[A] == s[A]:yield (r,s)

GivenB+1pagesofmemory

1. LoadinB-1pagesofRatatime(leaving1pageeachfreeforS&output)

2. Foreach(B-1)-pagesegmentofR,loadeachpageofS

3. Checkagainstthejoinconditions

BNLJcanalsohandlenon-equalityconstraints

Cost:

P 𝑅 +𝑃 𝑅𝐵 − 1𝑃(𝑆)

CSC261,Fall2017,UR

Page 74: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Block Nested Loop Join (BNLJ)

Compute R ⋈ 𝑆𝑜𝑛𝐴:for each B-1 pages pr of R:

for page ps of S:for each tuple r in pr:

for each tuple s in ps:

if r[A] == s[A]:yield (r,s)

P 𝑅 +b cVd:

𝑃(𝑆) +OUT

GivenB+1pagesofmemory

1. LoadinB-1pagesofRatatime(leaving1pageeachfreeforS&output)

2. Foreach(B-1)-pagesegmentofR,loadeachpageofS

3. Checkagainstthejoinconditions

4. Writeout

Cost:

CSC261,Fall2017,UR

Page 75: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

BNLJ vs. NLJ: Benefits of IO Aware

• In BNLJ, by loading larger chunks of R, we minimize the number of full disk reads of S–We only read all of S from disk for every (B-1)-page segment of R!– Still the full cross-product, but more done only in memory

P 𝑅 +b cVd:

𝑃(𝑆) +OUTP(R)+T(R)*P(S)+OUTNLJ BNLJ

BNLJisfasterbyroughly(Vd:)e(c)b(c)

CSC261,Fall2017,UR

Page 76: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

BNLJ vs. NLJ: Benefits of IO Aware

• Example:– R: 500 pages– S: 1000 pages– 100 tuples / page– We have 12 pages of memory (B = 11)

• NLJ: Cost = 500 + 50,000*1000 = 50 Million IOs ~= 140 hours

• BNLJ: Cost = 500 + fgg∗:ggg:g

= 50 Thousand IOs ~= 0.14 hours

Averyrealdifferencefromasmallchangeinthealgorithm!

IgnoringOUThere…

CSC261,Fall2017,UR

Page 77: CSC 261/461 –Database Systems Lecture 19 › courses › 261 › fall2017 › lectures › l19.pdf · MongoDB Spark Term Paper Poster Session. Topics for Today •Query Processing

Acknowledgement

• Some of the slides in this presentation are taken from the slides provided by the authors.

• Many of these slides are taken from cs145 course offered byStanford University.

CSC261,Fall2017,UR