CSE 326: Data Structures Lecture #11 B-Trees Alon Halevy Spring Quarter 2001.


Page 1

CSE 326: Data Structures Lecture #11

B-Trees

Alon Halevy

Spring Quarter 2001

Page 2

B-Tree Properties

• Properties
  – maximum branching factor of M
  – the root has between 2 and M children
  – other internal nodes have between M/2 and M children
  – internal nodes contain only search keys (no data)
  – smallest datum between search keys x and y equals x
  – each (non-root) leaf contains between L/2 and L keys
  – all leaves are at the same depth

• Result
  – tree is log_{M/2}( n/(L/2) ) ± 1 deep, i.e., O(log n)
  – all operations run in time proportional to depth
  – operations pull in at least M/2 or L/2 items at a time

Page 3

When Big-O is Not Enough

B-Tree is about log_{M/2}( n/(L/2) ) deep

  = log_{M/2}(n) − log_{M/2}(L/2)

  = O(log_{M/2} n)

  = O(log n) steps per operation (same as BST!)

Where’s the beef?!

log_2( 10,000,000 ) ≈ 24 disk accesses

log_{200/2}( 10,000,000 ) < 4 disk accesses
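
A quick check of those two counts (just arithmetic, not from the slides), in C++:

    #include <cmath>
    #include <iostream>

    int main() {
        double n = 10000000.0;                               // 10,000,000 items
        std::cout << std::log2(n) << "\n";                   // ~23.3, so about 24 levels for a BST
        std::cout << std::log(n) / std::log(100.0) << "\n";  // ~3.5, so fewer than 4 when M/2 = 100
        return 0;
    }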

Page 4

B-Tree Nodes

• Internal node
  – i search keys; i + 1 subtrees; M - i - 1 inactive entries

• Leaf
  – j data keys; L - j inactive entries

[Diagram: an internal node with keys k1 k2 … ki in slots 1 … M - 1, and a leaf with keys k1 k2 … kj in slots 1 … L]
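
One plausible in-memory layout for these nodes, sketched in C++ (the struct names and the fixed M and L are illustrative; the slide only specifies the counts):

    const int M = 4;   // illustrative branching factor
    const int L = 4;   // illustrative leaf capacity

    struct BTreeNode { bool isLeaf; };

    // Internal node: i search keys, i + 1 subtrees, M - 1 - i inactive key slots.
    struct InternalNode : BTreeNode {
        int         numKeys;          // i
        int         keys[M - 1];      // search keys only, no data
        BTreeNode * children[M];      // subtree pointers
    };

    // Leaf: j data keys, L - j inactive slots.
    struct LeafNode : BTreeNode {
        int numKeys;                  // j
        int keys[L];                  // the data keys themselves
    };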

Page 5

Example

B-Tree with M = 4 and L = 4

[Diagram: root with keys 10 40; below it, an internal node with key 3 over leaves [1 2] and [3 5 6 9], an internal node with keys 15 20 30 over leaves [10 11 12], [15 17], [20 25 26], [30 32 33 36], and an internal node with key 50 over leaves [40 42] and [50 60 70]]

Page 6

Making a B-Tree

The empty B-Tree (M = 3, L = 2)

Insert(3):   [3]
Insert(14):  [3 14]

Now, Insert(1)?

Page 7

Splitting the Root

Insert(1): the leaf [1 3 14] now has too many keys!

So, split the leaf into [1 3] and [14], and create a new root.

[Diagram: new root with key 14 over leaves [1 3] and [14]]

Page 8

Insertions and Split Ends

Insert(59): 59 goes into the second leaf, giving [14 59].

Insert(26): the leaf [14 26 59] has too many keys! So, split the leaf into [14 26] and [59], and add a new child to the parent.

[Diagram: root with keys 14 59 over leaves [1 3], [14 26], [59]]

Page 9

Propagating Splits

Insert(5): the leaf [1 3 5] has too many keys! Split it into [1 3] and [5], and add the new child to the parent. But now the parent (keys 5 14 59, four subtrees) has too many keys in an internal node!

So, split the node and create a new root.

[Diagram: new root with key 14; left node with key 5 over leaves [1 3], [5]; right node with key 59 over leaves [14 26], [59]]

Page 10

Insertion in Boring Text

• Insert the key in its leaf

• If the leaf ends up with L+1 items, overflow!
  – Split the leaf into two nodes:
    • original with ⌈(L+1)/2⌉ items
    • new one with ⌊(L+1)/2⌋ items
  – Add the new child to the parent
  – If the parent ends up with M+1 items, overflow!

• If an internal node ends up with M+1 items, overflow!
  – Split the node into two nodes:
    • original with ⌈(M+1)/2⌉ items
    • new one with ⌊(M+1)/2⌋ items
  – Add the new child to the parent
  – If the parent ends up with M+1 items, overflow!

• Split an overflowed root in two and hang the new nodes under a new root (the leaf split is sketched in code below)

This makes the tree deeper!
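
A minimal sketch of the leaf-split step in C++ (the names BLeaf and splitLeaf and the fixed L are illustrative, not from the lecture): the overflowing leaf keeps the first ⌈(L+1)/2⌉ keys, a new sibling takes the rest, and the sibling’s smallest key is what gets added to the parent.

    #include <utility>
    #include <vector>

    const int L = 4;                                // illustrative leaf capacity

    struct BLeaf {
        std::vector<int> keys;                      // sorted; at most L keys when not overflowing
    };

    // Split a leaf that has overflowed to L + 1 keys.
    // Returns the new right sibling and the search key to add to the parent.
    std::pair<BLeaf, int> splitLeaf(BLeaf & leaf) {
        int keep = (L + 2) / 2;                     // ceil((L+1)/2) keys stay in the original
        BLeaf sibling;
        sibling.keys.assign(leaf.keys.begin() + keep, leaf.keys.end());
        leaf.keys.resize(keep);                     // original keeps the smaller half
        return { sibling, sibling.keys.front() };   // parent's new key = sibling's smallest
    }

If the parent then holds too many items, the same idea applies one level up, with subtrees moving instead of data keys.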

Page 11

Deletion in B-trees

• Come to section tomorrow. Slides follow.

Page 12

After More Routine Inserts

Insert(89): 89 fits in a leaf, giving [59 89].

Insert(79): the leaf [59 79 89] overflows, so split it into [59 79] and [89] and add the new child to the parent.

[Diagram: root 14; left node with key 5 over leaves [1 3], [5]; right node with keys 59 89 over leaves [14 26], [59 79], [89]]

Page 13

Deletion

Delete(59): remove 59 from its leaf, leaving [79]; the parent’s search key 59 becomes 79.

[Diagram: root 14; left node with key 5 over leaves [1 3], [5]; right node with keys 79 89 over leaves [14 26], [79], [89]]

Page 14

Deletion and Adoption

Delete(5)? The leaf [5] ends up empty: a leaf has too few keys!

So, borrow from a neighbor: the leaf [1 3] gives up its 3, and the parent’s search key 5 becomes 3.

[Diagram: root 14; left node with key 3 over leaves [1], [3]; right node with keys 79 89 over leaves [14 26], [79], [89]]

Page 15

Deletion with Propagation

Delete(3)? The leaf ends up empty: a leaf has too few keys! And no neighbor with surplus!

So, delete the leaf. But now a node has too few subtrees!

[Diagram: root 14; the left internal node is left with only the leaf [1]; right node with keys 79 89 over leaves [14 26], [79], [89]]

Page 16

Finishing the Propagation (More Adoption)

Adopt a neighbor: the underfull node adopts the subtree [14 26] from its neighbor, and the search keys are updated.

[Diagram: root with key 79; left node with key 14 over leaves [1], [14 26]; right node with key 89 over leaves [79], [89]]

Page 17

A Bit More Adoption

Delete(1) (adopt from a neighbor): the empty leaf borrows 14 from [14 26], and the parent’s search key becomes 26.

[Diagram: root with key 79; left node with key 26 over leaves [14], [26]; right node with key 89 over leaves [79], [89]]

Page 18

Pulling out the Root

Delete(26): a leaf has too few keys! And no neighbor with surplus! So, delete the leaf.

Now a node has too few subtrees and no neighbor with surplus! So, delete the node and hand its leaf [14] to the neighbor.

[Diagram: the remaining node has keys 79 89 over leaves [14], [79], [89]]

But now the root has just one subtree!

Page 19

Pulling out the Root (continued)

The root has just one subtree! But that’s silly!

Just make the one child the new root.

[Diagram: new root with keys 79 89 over leaves [14], [79], [89]]

Page 20

Deletion in Two Boring Slides of Text

• Remove the key from its leaf

• If the leaf ends up with fewer than L/2 items, underflow!
  – Adopt data from a neighbor; update the parent
  – If borrowing won’t work, delete the node and divide its keys between the neighbors
  – If the parent ends up with fewer than M/2 items, underflow!

Why will dumping keys always work if borrowing doesn’t?
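
(One way to see it: if no neighbor can lend a key, every neighbor is at the minimum of ⌈L/2⌉ keys, and the underflowing leaf has ⌈L/2⌉ − 1; together that is at most L keys, so dumping them into a single leaf always fits.)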

Page 21

Deletion Slide Two

• If a node ends up with fewer than M/2 items, underflow!
  – Adopt subtrees from a neighbor; update the parent
  – If borrowing won’t work, delete the node and divide its subtrees between the neighbors (the leaf case is sketched in code below)
  – If the parent ends up with fewer than M/2 items, underflow!

• If the root ends up with only one child, make the child the new root of the tree

This reduces the height of the tree!
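
A minimal sketch of the leaf underflow step in C++ (the names tryBorrow and mergeInto are illustrative, not from the lecture): first try to adopt a key from a sibling; if the sibling is already at the minimum, dump the keys into it and delete the leaf, leaving the parent to handle its own possible underflow.

    #include <vector>

    const int L = 4;                        // illustrative leaf capacity
    const int MIN_KEYS = (L + 1) / 2;       // ceil(L/2), the minimum leaf occupancy

    struct BLeaf {
        std::vector<int> keys;              // sorted
    };

    // Try to adopt one key from a left sibling; fails if the sibling has no surplus.
    bool tryBorrow(BLeaf & leaf, BLeaf & leftSibling) {
        if ((int) leftSibling.keys.size() <= MIN_KEYS) return false;
        leaf.keys.insert(leaf.keys.begin(), leftSibling.keys.back());
        leftSibling.keys.pop_back();        // the parent's separating key must also be updated
        return true;
    }

    // Otherwise dump the underflowing leaf's keys into the sibling and delete the leaf.
    // ceil(L/2) - 1 + ceil(L/2) <= L, so the merged leaf never overflows.
    void mergeInto(BLeaf & leftSibling, const BLeaf & leaf) {
        leftSibling.keys.insert(leftSibling.keys.end(), leaf.keys.begin(), leaf.keys.end());
    }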

Page 22

Thinking about B-Trees

• B-Tree insertion can cause (expensive) splitting and propagation

• B-Tree deletion can cause (cheap) borrowing or (expensive) deletion and propagation

• Propagation is rare if M and L are large (Why?)

• Repeated insertions and deletions can cause thrashing

• If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items

– height 5: 2,000,000,000!
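
A rough check of that last claim (my arithmetic, assuming “height 4” means four levels of branching below the root): every internal node has at least M/2 = 64 children, every leaf holds at least L/2 = 64 keys, and the root has at least 2 children, so the tree stores at least 2 · 64 · 64 · 64 · 64 ≈ 33 million items; one more level multiplies that by 64, giving roughly 2 billion.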

Page 23

Tree Summary

• BST: fast finds, inserts, and deletes: O(log n) on average (if data is random!)

• AVL trees: guaranteed O(log n) operations

• B-Trees: also guaranteed O(log n), but shallower depth makes them better for disk-based databases

• What would be even better?
  – How about: O(1) finds and inserts?

Page 24

Hash Table Approach

[Diagram: keys Zasha, Steve, Nic, Brad, Ed fed through a hash function f(x) into slots of a table]

But… is there a problem in this pipe-dream?

Page 25

Hash Table Dictionary Data Structure

• Hash function: maps keys to integers
  – result: can quickly find the right spot for a given entry

• Unordered and sparse table
  – result: cannot efficiently list all entries
  – cannot find min and max efficiently
  – cannot find all items within a specified range efficiently

[Diagram: keys Zasha, Steve, Nic, Brad, Ed hashed by f(x) into table slots]

Page 26

Hash Table Terminology

[Diagram: keys Zasha, Steve, Nic, Brad, Ed hashed by f(x) into a table; labels mark the hash function, the keys, and a collision]

load factor = (# of entries in table) / tableSize
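
(Illustrative numbers, not from the slide: 5 entries in a table with tableSize = 10 give a load factor of 5/10 = 0.5.)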

Page 27

Hash Table Code: First Pass

Value & find(Key & key) {
    int index = hash(key) % tableSize;
    return Table[index];
}

What should the hash function be?

What should the table size be?

How should we resolve collisions?

Page 28

A Good Hash Function…

…is easy (fast) to compute (O(1) and practically fast).
…distributes the data evenly (hash(a) ≠ hash(b) when a ≠ b, ideally).
…uses the whole hash table (for all 0 ≤ k < size, there’s an i such that hash(i) % size = k).

Page 29

Good Hash Function for Integers

• Choose
  – tableSize is prime
  – hash(n) = n % tableSize

• Example:
  – tableSize = 7
  – insert(4), insert(17), find(12), insert(9), delete(17) (traced below)

[Diagram: a hash table with slots 0 through 6]
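
A short trace of that example in C++ (the program only prints the slot each key maps to; collision handling and the mechanics of find/delete come later):

    #include <iostream>

    int main() {
        const int tableSize = 7;                           // prime
        int keys[] = { 4, 17, 12, 9 };
        for (int k : keys) {
            std::cout << k << " maps to slot " << k % tableSize << "\n";
        }
        // 4 -> slot 4, 17 -> slot 3, 12 -> slot 5 (where find(12) would look),
        // 9 -> slot 2; delete(17) goes back to slot 3.
        return 0;
    }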

Page 30

Good Hash Function for Strings?

• Ideas?

Page 31

Good Hash Function for Strings?

• Sum the ASCII values of the characters.

• Consider only the first 3 characters.
  – Uses only 2,871 out of 17,576 entries in the table on English words.

• Let s = s1s2s3s4…sn: choose
  – hash(s) = s1 + s2·128 + s3·128^2 + s4·128^3 + … + sn·128^(n-1)

• Problems:
  – hash(“really, really big”) = well… something really, really big
  – hash(“one thing”) % 128 = hash(“other thing”) % 128

Think of the string as a base 128 number.

Page 32

Making the String Hash Easy to Compute

• Use Horner’s Rule

int hash(String s) {
    int h = 0;
    for (int i = s.length() - 1; i >= 0; i--) {
        h = (s[i] + 128 * h) % tableSize;
    }
    return h;
}
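
Unrolling the loop for a 3-character string gives h = (s[0] + 128·s[1] + 128^2·s[2]) % tableSize, which is exactly the base-128 value from the previous slide; taking % tableSize at every step also keeps h from overflowing.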

Page 33

Universal Hashing

• For any fixed hash function, there will be some pathological sets of inputs
  – everything hashes to the same cell!

• Solution: Universal Hashing
  – Start with a large (parameterized) class of hash functions
    • No sequence of inputs is bad for all of them!
  – When your program starts up, pick one of the hash functions to use at random (for the entire time)
  – Now: no bad inputs, only unlucky choices!
    • If the universal class is large, the odds of making a bad choice are very low
    • If you do find you are in trouble, just pick a different hash function and re-hash the previous inputs

Page 34

Universal Hash Function: “Random” Vector Approach

• Parameterized by prime size and vector a = <a0 a1 … ar> where 0 ≤ ai < size

• Represent each key as r + 1 integers with each ki < size
  – size = 11, key = 39752 ==> <3,9,7,5,2>
  – size = 29, key = “hello world” ==> <8,5,12,12,15,23,15,18,12,4>

h_a(k) = ( a0·k0 + a1·k1 + … + ar·kr ) mod size

dot product with a “random” vector!
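
A minimal sketch of this scheme in C++ (the function and variable names are illustrative; the slide only gives the formula): take the dot product of the key’s digits with the random vector and reduce mod size.

    #include <cstdlib>
    #include <vector>

    // Dot product of the key's base-`size` digits k_0..k_r with the random vector a, mod size.
    int universalHash(const std::vector<int> & k, const std::vector<int> & a, int size) {
        long long sum = 0;
        for (std::size_t i = 0; i < k.size(); ++i) {
            sum += (long long) a[i] * k[i];
        }
        return (int) (sum % size);
    }

    // One random coefficient 0 <= a_i < size per digit, chosen once when the program starts.
    std::vector<int> pickRandomVector(int r, int size) {
        std::vector<int> a(r + 1);
        for (int & ai : a) ai = std::rand() % size;
        return a;
    }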

Page 35

Universal Hash Function

• Strengths:
  – works on any type as long as you can form the ki’s
  – if we’re building a static table, we can try many a’s
  – a random a has guaranteed good properties no matter what we’re hashing

• Weaknesses:
  – must choose a prime table size larger than any ki

Page 36

Hash Function Summary

• Goals of a hash function
  – reproducible mapping from key to table entry

– evenly distribute keys across the table

– separate commonly occurring keys (neighboring keys?)

– complete quickly

• Hash functions
  – h(n) = n % size

– h(n) = string as base 128 number % size

– Universal hash function #1: dot product with random vector

Page 37

How to Design a Hash Function

• Know what your keys are
• Study how your keys are distributed
• Try to include all important information in a key in the construction of its hash
• Try to make “neighboring” keys hash to very different places
• Prune the features used to create the hash until it runs “fast enough” (very application dependent)

Page 38

Collisions

• Pigeonhole principle says we can’t avoid all collisions
  – try to hash without collision m keys into n slots with m > n
  – try to put 6 pigeons into 5 holes

• What do we do when two keys hash to the same entry?
  – open hashing: put little dictionaries in each entry (sketched below)
  – closed hashing: pick a next entry to try
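
A minimal sketch of the open-hashing (chaining) idea in C++, just to make “little dictionaries in each entry” concrete (the names and int keys are illustrative):

    #include <list>
    #include <vector>

    // One "little dictionary" (here, a list) per table entry; colliding keys share a bucket.
    struct ChainedHashTable {
        std::vector<std::list<int>> table;

        explicit ChainedHashTable(int size) : table(size) {}

        void insert(int key) {
            table[key % (int) table.size()].push_back(key);
        }

        bool find(int key) const {
            for (int k : table[key % (int) table.size()])
                if (k == key) return true;
            return false;
        }
    };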