Finishing Up from Tuesday - Harvard...

Post on 09-Mar-2018


2/4/2010 11

Finishing Up from Tuesday
• Topics
  • Cursors
  • Hashing


Cursors

Keys in the tree: cat dog elephant mouse (a cursor marks one position)

Sequence of operations:
    c_get elephant
    delete current
    where is cursor?
    insert kangaroo
    insert eagle

• They mark a position in the tree (used in iterating over a file).
• Cannot lose the position in the face of a delete.
  • Subsequent inserts must happen in the right spot.
  • Requires retaining the key value.
  • With multiple deletes and multiple cursors, you have to maintain positioning between cursors.
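A minimal sketch of why retaining the key value is enough to keep a cursor's position, using a sorted Tcl list as a stand-in for the tree. The procs delete_key, insert_key, and cursor_next are hypothetical helpers written for this illustration; they are not Berkeley DB calls.

```tcl
# Sorted key list standing in for the leaf level of the tree.
set keys [list cat dog elephant mouse]

# "c_get elephant": the cursor remembers the key value, not a slot number.
set cursorKey elephant

proc delete_key {listVar key} {
    upvar 1 $listVar keys
    set idx [lsearch -exact $keys $key]
    if {$idx >= 0} { set keys [lreplace $keys $idx $idx] }
}

proc insert_key {listVar key} {
    upvar 1 $listVar keys
    lappend keys $key
    set keys [lsort $keys]   ;# keep the list in key order
}

# The next item is the first key that sorts after the remembered key,
# whether or not the remembered key is still present.
proc cursor_next {keys cursorKey} {
    foreach k $keys {
        if {[string compare $k $cursorKey] > 0} { return $k }
    }
    return ""
}

delete_key keys $cursorKey    ;# "delete current"
insert_key keys kangaroo
insert_key keys eagle
puts $keys
puts [cursor_next $keys $cursorKey]
```

Because the cursor holds "elephant" rather than an index, eagle lands before the cursor position and kangaroo after it, so the next item seen is kangaroo.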


Hashing
• Your index is a collection of buckets (bucket = page).
• Define a hash function, h, that maps a key to a bucket.
• Store the corresponding data in that bucket.
• Collisions
  • Multiple keys hash to the same bucket.
  • Store multiple keys in the same bucket.
• What do you do when buckets fill?
  • Chaining: link new pages (overflow pages) off the bucket.
  • Open-hashing (more commonly called open addressing): look in the next bucket.
• Chaining versus open-hashing
  • Open-hashing does not support deletion well.


Hash Example

• Assume:
  • H(cat) = 0
  • H(dog) = 1
  • H(mouse) = 0
• Operations
  1. Insert cat
  2. Insert dog
  3. Insert mouse
  4. Delete dog
  5. Lookup mouse


[Figure: hash buckets holding cat and dog, with mouse drawn twice]
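The five operations can be replayed in a small sketch of the "look in the next bucket" scheme to see why deletion is awkward. The hash values come from the slide; the table size of 4 and the procs ht_insert, ht_lookup, and ht_remove are assumptions made for this illustration.

```tcl
# Hash values from the slide; other keys are outside this sketch.
array set H {cat 0 dog 1 mouse 0}
set nbuckets 4          ;# assumed table size
array set table {}      ;# bucket number -> key; a missing entry is an empty bucket

proc ht_insert {key} {
    global H nbuckets table
    set b $H($key)
    # Probe forward until an empty bucket is found.
    while {[info exists table($b)]} {
        set b [expr {($b + 1) % $nbuckets}]
    }
    set table($b) $key
}

proc ht_lookup {key} {
    global H nbuckets table
    set b $H($key)
    # An empty bucket ends the probe sequence.
    while {[info exists table($b)]} {
        if {$table($b) eq $key} { return $b }
        set b [expr {($b + 1) % $nbuckets}]
    }
    return -1
}

proc ht_remove {key} {
    global table
    set b [ht_lookup $key]
    if {$b >= 0} { unset table($b) }   ;# naive delete: just empty the bucket
}

ht_insert cat      ;# goes to bucket 0
ht_insert dog      ;# goes to bucket 1
ht_insert mouse    ;# bucket 0 and 1 are full, so it lands in bucket 2
ht_remove dog      ;# bucket 1 is now empty
puts [ht_lookup mouse]
```

The final lookup starts at bucket 0, sees cat, then hits the empty bucket 1 and gives up, even though mouse is still sitting in bucket 2. With chaining, removing dog would not disturb the chain that holds cat and mouse.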


Static vs Dynamic Hashing

• Static: number of buckets predefined; never changes.
  • Either, overflow chains grow very long, OR
  • A lot of wasted space in unused buckets.
• Dynamic: number of buckets changes over time.
  • Hash function must adapt.
  • Usually, start revealing more bits of the hash value as the table grows.
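One way to picture "revealing more bits" is to mask the same hash value with a wider and wider mask as the table doubles. This is only a sketch: the proc name bucket_of and the sample hash value are made up for the illustration.

```tcl
# Use the low-order nbits bits of the hash value as the bucket number.
proc bucket_of {hash nbits} {
    expr {$hash & ((1 << $nbits) - 1)}
}

set h 0xB5    ;# binary 10110101, an arbitrary sample hash value
foreach nbits {1 2 3} {
    puts "$nbits bit(s): [expr {1 << $nbits}] buckets, key lives in bucket [bucket_of $h $nbits]"
}
```

The key's hash value never changes; only the number of bits consulted does, so a key either stays put or moves to exactly one new bucket each time a bit is revealed.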


Practical Hashing (1)

• Buckets map to pages.
  • Must be able to directly translate from a bucket number to a page number.
• Where do you store overflow pages?
  • If number of buckets is fixed (static hashing), store overflow buckets after regular buckets.
  • Use free list to manage overflow buckets.
• Static hashing isn’t very practical for databases.
  • Databases change in size fairly substantially.
  • If you have to preallocate, often waste space.


Practical Hashing (2)

• Dynamic hash implementation.
  • Periodically double the size of the database.
  • Rehash every key into new table.
• Dynamic Linear Hashing (Litwin)
  • Grow table one bucket at a time.
  • Split buckets sequentially; rehash just the splitting bucket.
  • Maintain overflow buckets as necessary.
  • Keep track of max bucket to identify the correct number of bits to consider in the hash value.
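A sketch of how the max bucket can determine how many bits of the hash value to use in linear hashing. This is one common formulation, not necessarily the exact code in any particular library; the proc name bucket_for and the variables highMask and lowMask are assumptions for the illustration.

```tcl
# Map a hash value to a bucket when buckets 0..maxBucket exist.
proc bucket_for {hash maxBucket} {
    # Find the smallest power of two that is greater than maxBucket.
    set size 1
    while {$size <= $maxBucket} { set size [expr {$size << 1}] }
    set highMask [expr {$size - 1}]       ;# mask with the newest bit revealed
    set lowMask  [expr {$highMask >> 1}]  ;# mask from before the current round of splits

    set b [expr {$hash & $highMask}]
    # If that bucket has not been created yet, its keys still live in
    # the bucket it will eventually be split from.
    if {$b > $maxBucket} { set b [expr {$hash & $lowMask}] }
    return $b
}

# With buckets 0..4 in existence:
puts [bucket_for 12 4]
puts [bucket_for 13 4]
```

With max bucket 4, hash value 12 uses three bits and lands in bucket 4, while hash value 13 would want bucket 5, which does not exist yet, so it falls back to two bits and lands in bucket 1.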


Using BDB from Tcl

• Topics
  • An Introduction to Tcl
  • The Berkeley DB Tcl API
  • Tools for performance tuning and analysis
• Learning Objectives
  • Write simple programs in Tcl
  • Create environments and databases in Tcl
  • Perform get, put, cursor, del operations
  • Use timing and statistics to analyze the behavior of Berkeley DB databases.


What is Tcl?
• Tool Command Language -- a scripting language
• Designed to be embedded easily into other systems.
• Berkeley DB provides Tcl extensions (new commands) that let you access BDB functionality from a Tcl-based shell.
• Logistics:
  • I will use the Tcl installation on FAS/NICE.
  • You can do assignment 1 on nice or on your own machine.
  • NOTE: if you intend to use your own machine, install Tcl and BDB; do NOT wait until the night before assignment 1 is due. We will not answer build/install questions in the 24 hours before the assignment is due.


Getting Started (on FAS)

• You need to know where to find the appropriate executables and where to find the appropriate shared libraries.
• Edit your .cshrc file and add the following two lines:
    setenv PREPATH /nfs/home/c/s/cs165/bin
    setenv LD_LIBRARY_PATH /nfs/home/c/s/cs165/lib
• Log out
• Log back in


Getting Started (with Tcl)
• Start up Tcl interpreter:
    ice% tclsh
• Variables:
  • Untyped
  • Variables need not be declared; created as you need them
  • Variable names are alphanumeric strings that begin with a letter: foo, a, dog, b4
  • You assign values to variables using set, e.g.,
      % set foo 4
      % set bar "cat"
  • You access the value of a variable using the $ symbol:
      % puts $foo
      % puts $bar


Calculating

• Numerical evaluation is accomplished via the expr command:
    % expr 1 + 3
    % set foo 4
    % set bar 5
    % expr $foo + $bar
    % puts [expr 3 + 4]


Control Flow
• All you should need are if statements and for loops.
• You need two additional pieces of syntax:
  • Tcl uses {} for grouping.
  • Tcl uses [] for evaluation.
• By evaluation, we mean how you tell Tcl to evaluate an expression so that you can assign it to a variable.
• For example:
    set i [expr $foo + $bar]
• Sets i to the result of evaluating [expr $foo + $bar]
• With all this in hand, if statements should look pretty natural:
    if { boolean expression } {
        do stuff here
    } else {
        do other stuff here
    }


Boolean Statements
    if { $foo == 4 } {
        # This is a comment character
        # You can now do stuff conditionally
    }

    if { $foo < 10 } {
        # Do something
    } else {
        # Do something else
    }
    # Tcl is whitespace sensitive, so positioning
    # your {} actually matters!
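To see why brace placement matters, here is a small sketch that runs the same test twice: once with the opening brace of the body on the same line as the if, and once with it moved to the next line. The variable names r, script, and failed are chosen just for this example.

```tcl
set foo 4

# Correct: each opening brace is on the same line as the word before it,
# so if receives its condition, body, else, and else-body as arguments.
if { $foo == 4 } {
    set r yes
} else {
    set r no
}
puts $r

# Incorrect: the newline ends the if command before it has a body,
# so Tcl reports an error instead of running the block.
set script {if { $foo == 4 }
{ set r maybe }}
set failed [catch {eval $script} msg]
puts "failed=$failed"
```

A newline ends a Tcl command unless it falls inside braces, brackets, or quotes, which is why the opening brace has to stay on the line with the if.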


FOR Loops
    for { init } { condition } { loop increment } {
        # do stuff
    }

• So, to loop from 0-9:
    for { set i 0 } { $i < 10 } { incr i } {
        # do stuff
    }

• Note: incr i is shorthand for incr i 1 which is shorthand for set i [expr $i + 1]
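A short runnable version of the loop, written both ways to show that incr i and the longer set form behave the same. The variable names sum1 and sum2 are just for this example.

```tcl
# Sum the numbers 0 through 9 using incr.
set sum1 0
for { set i 0 } { $i < 10 } { incr i } {
    set sum1 [expr {$sum1 + $i}]
}

# The same loop with the increment written out longhand.
set sum2 0
for { set i 0 } { $i < 10 } { set i [expr {$i + 1}] } {
    set sum2 [expr {$sum2 + $i}]
}

puts "$sum1 $sum2"
```

Both loops run the body ten times and leave i equal to 10 when they finish.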


BDB + Tcl
• The Berkeley DB library and its interface to the Tcl language are dynamically loaded using a series of commands that can be found in ~cs165/tools/loadme.tcl.
• You type:
    ice% tclsh
    % source ~cs165/tools/loadme.tcl
• That file contains:
    lappend auto_path /nfs/home/c/s/cs165/lib
    pkg_mkIndex /nfs/home/c/s/cs165/lib libdb_tcl-4.8.so
    load /nfs/home/c/s/cs165/lib/libdb_tcl-4.8.so
• Now you can access Berkeley DB commands.
• In general, you’ll use the berkdb command to create handles and then you’ll use those handles to execute methods.


The berkdb Command

• Used to create/open environments and databases.
• Environments: make sure the directory exists.
• Let’s call our home directory work.
    % set e [berkdb env -create -home work]
• Now, examine e:
    % puts $e
• The variable e contains a new command that represents the environment.
• That command implements other commands that are the methods off of the environment.


Creating/Opening Databases

• If you do not specify an environment, then you can open a database, but you have opened it OUTSIDE the environment.
  • Note: if you want to examine things like the memory pool statistics, you need an environment (more on this later).
• Compare the following two commands:
    % set dba [berkdb open -create -btree mybtree1.db]
    % set dbb [berkdb open -create -env $e -btree mybtree2.db]
• How are they different?
  • dba is NOT in an environment (db created in current directory)
  • dbb IS in an environment and will be created in work
• Guess how to create a hash table?
    % set db [berkdb open -create -env $e -hash myhash.db]


Adding Data to a Database

• Both environment and database handles/commands take “methods” to perform operations.
• The put method adds data to a database.
    % $db put dog fido
    % $db put cat fluffy
• How do you suppose you get data out of the database?
    % $db get dog
    % $db get cat
    % $db get elephant
• Just like with other Tcl commands, you can assign the results of these calls:
    % set dogval [$db get dog]


Putting it all Together

• Let’s put this all together and write a loop that adds 10 items to a database:
    % for {set i 0} {$i < 10} {incr i} {
        $db put key$i data$i
    }
• This inserts 10 key/data pairs that look like:
    {key0 data0} {key1 data1} ... {key9 data9}
• We can retrieve those values:
    % $db get key3
    % $db get key8


Cursors
• The last handle/command you’ll need is a cursor, which is used to iterate over a collection of data.
• Cursors are associated with databases, so we create a cursor using a database method:
    % set c [$db cursor]
• You can perform the same operations with cursors that you do with databases, plus you can use a cursor for iteration.
    % $db put dog fido
• is the same as:
    % $c put -keyfirst dog fido
• and
    % $db get dog
• is the same as
    % $c get -set dog
• except that the cursor version leaves the cursor referencing the item, so you can also issue get methods relative to that position (e.g., current, next, prev)


More Cursors
• Cursor get takes an option, -set, before specifying the key, because the cursor get method supports more operations than the database get method. In particular, it supports options like:
  • -first
  • -next
  • -last
  • -prev
• What do you suppose the following does?
    % for { set pair [$c get -first] } \
          { $pair != "" } \
          { set pair [$c get -next] } {
        puts $pair
    }
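The pieces so far can be collected into one script. This is a sketch that assumes loadme.tcl has already been sourced so that the berkdb command exists; the proc name dump_db and the file name demo.db are made up for the example, and the close methods on the cursor, database, and environment handles are part of the Berkeley DB Tcl API but have not been covered above.

```tcl
# Create an environment and a btree in it, add ten pairs, then walk
# the database with a cursor. Returns the number of pairs visited.
proc dump_db {home file} {
    file mkdir $home                     ;# the environment directory must exist
    set e  [berkdb env -create -home $home]
    set db [berkdb open -create -env $e -btree $file]

    for {set i 0} {$i < 10} {incr i} {
        $db put key$i data$i
    }

    set c [$db cursor]
    set n 0
    for {set pair [$c get -first]} \
        {[llength $pair] != 0} \
        {set pair [$c get -next]} {
        puts $pair
        incr n
    }

    # Release handles in the reverse order they were created.
    $c close
    $db close
    $e close
    return $n
}

# Only run if the Berkeley DB commands have been loaded.
if {[llength [info commands berkdb]] > 0} {
    puts "visited [dump_db work demo.db] pairs"
}
```

The loop condition here checks for an empty list with llength, which plays the same role as comparing against the empty string.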


Performance Analysis
• Why?
  • Nearly every hard database problem boils down to performance.
  • It is useful to learn what performance analysis tools are available for any given data management technology.
• Our toolkit:
  • In Tcl:
    • time command: measures time to execute a Tcl command (in microseconds).
        % time { command_goes_here }
    • Caveats: includes Tcl parsing time and such (sometimes significant)
  • In Berkeley DB:
    • db_stat: produces statistics about individual databases as well as about Berkeley DB’s own subsystems.
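A small sketch of the time command on plain Tcl code. Passing a repeat count as a second argument makes time run the script that many times and report the average, which helps smooth out noise for very short scripts; the count of 100 and the variable names are arbitrary choices for this example.

```tcl
set sum 0

# Run the script 100 times; the result is a string of the form
# "N microseconds per iteration".
set t [time {
    for {set i 0} {$i < 1000} {incr i} {
        incr sum $i
    }
} 100]

puts $t
puts "average: [lindex $t 0] microseconds"
```

The same pattern works around a Berkeley DB call, for example timing a put loop, keeping in mind the caveat that the figure includes Tcl's own overhead.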


db_stat

• For now, we’ll focus on only two uses of db_stat.
  • Individual database statistics
  • Memory Pool Statistics
• db_stat for databases
  • Usage:
      ice% db_stat -d database
  • OR
      ice% db_stat -h HOME -d database


Memory Pools

• What is a memory pool?
  • Recall that memory is fast and disk is slow.
  • Goal: grab data from memory whenever possible.
• How?
  • Keep recently used data in memory in the hope that you’ll use it again real soon.
• Berkeley DB is not the only one who maintains a memory pool; the operating system does as well (frequently called the buffer cache).

[Figure: layers from top to bottom: Tcl w/Berkeley DB; Berkeley DB memory pool (mpool); Operating System File System Buffer Cache; to disk]


Data Movement

• When you try to read a key, what really happens is:
  • Berkeley DB figures out on what page that key lives.
  • Berkeley DB looks in its mpool.
  • If the page is there, you get your key (quickly).
  • If the page isn’t there, Berkeley DB makes space in its mpool and then requests the page from the file system.
  • If the page is in the file system buffer cache, it is given to Berkeley DB (relatively quickly).
  • If the page is not in the buffer cache, then it is requested from disk.
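The lookup order can be mimicked with two Tcl arrays standing in for the two caches. This is a toy model only: the proc name read_page and the page contents are invented, and eviction ("makes space in its mpool") is left out to keep it short.

```tcl
array set mpool {}      ;# stand-in for the Berkeley DB memory pool
array set bufcache {}   ;# stand-in for the file system buffer cache

# Returns the name of the layer that satisfied the request.
proc read_page {pgno} {
    global mpool bufcache
    if {[info exists mpool($pgno)]} {
        return mpool                          ;# fastest: already in the mpool
    }
    if {[info exists bufcache($pgno)]} {
        set mpool($pgno) $bufcache($pgno)     ;# copy up into the mpool
        return bufcache
    }
    # Slowest: pretend to read from disk, filling both caches on the way up.
    set bufcache($pgno) "contents of page $pgno"
    set mpool($pgno) $bufcache($pgno)
    return disk
}

puts [read_page 7]   ;# first touch comes from disk
puts [read_page 7]   ;# second touch is an mpool hit
unset mpool(7)       ;# simulate the page being pushed out of the mpool
puts [read_page 7]   ;# now served by the buffer cache
```

The three calls report disk, then mpool, then bufcache, matching the fast, relatively fast, and slow paths in the list above.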


db_stat for mpool

• You must be using an environment in order to examine the memory pool statistics.
• Summary statistics
    ice% db_stat -h HOME -m
• Detailed statistics
    ice% db_stat -h HOME -M A
• Resetting statistics
    ice% db_stat -Z -M A -h HOME