DFS Lecture 12
Transcript of DFS Lecture 12
8/10/2019

March 23 & 28, 2000
Csci 2111: Data and File Structures
Week 10, Lectures 1 & 2

Hashing
-
Motivation
Sequential searching can be done in O(N) access time, meaning that the number of seeks grows in proportion to the size of the file.
B-Trees improve on this greatly, providing O(log_k N) access, where k is a measure of the leaf size (i.e., the number of records that can be stored in a leaf).
What we would like to achieve, however, is O(1) access, which means that no matter how big a file grows, access to a record always takes the same small number of seeks.
Static hashing techniques can achieve such performance, provided that the file does not grow over time.
-
What is Hashing?
A hash function is a function h(K) which transforms a key K into an address.
Hashing is like indexing in that it involves associating a key with a relative record address.
Hashing, however, is different from indexing in two important ways:
With hashing, there is no obvious connection between the key and the location.
With hashing, two different keys may be transformed to the same address.
-
Collisions
When two different keys produce the same address, there is a collision. The keys involved are called synonyms.
Coming up with a hashing function that avoids collisions is extremely difficult. It is best to simply find ways to deal with them.
Possible solutions:
Spread out the records.
Use extra memory.
Put more than one record at a single address.
-
A Simple Hashing Algorithm
Step 1: Represent the key in numerical form.
Step 2: Fold and add.
Step 3: Divide by a prime number and use the remainder as the address.
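The three steps can be sketched in code. This is a minimal illustration, not the lecture's prescribed implementation: the chunk width (4 digits) and table size (101, a prime) are illustrative choices.

```python
def hash_key(key: str, table_size: int = 101) -> int:
    """Fold-and-add hashing sketch: key -> address in [0, table_size)."""
    # Step 1: represent the key in numerical form (3-digit ASCII codes).
    digits = "".join(f"{ord(c):03d}" for c in key)
    # Step 2: fold the digit string into 4-digit chunks and add them.
    total = sum(int(digits[i:i + 4]) for i in range(0, len(digits), 4))
    # Step 3: divide by a prime (the table size); the remainder is the address.
    return total % table_size
```

For example, `hash_key("LOWELL")` always yields the same address in the range 0..100, while a different key may or may not collide with it.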
-
Hashing Functions and Record
Distributions
Records can be distributed among addresses in different ways: there may be (a) no synonyms (uniform distribution); (b) only synonyms (worst case); (c) a few synonyms (happens with random distributions).
Purely uniform distributions are difficult to obtain and may not be worth searching for.
Random distributions can be easily derived, but they are not perfect since they may generate a fair number of synonyms.
We want better hashing methods.
-
Some Other Hashing Methods
Though there is no hash function that guarantees better-than-random distributions in all cases, by taking into consideration the keys that are being hashed, certain improvements are possible.
Here are some methods that are potentially better than random:
Examine keys for a pattern.
Fold parts of the key.
Divide the key by a number.
Square the key and take the middle.
Radix transformation.
-
Predicting the Distribution of
Records
When using a random distribution, we can use a number of mathematical tools to obtain conservative estimates of how our hashing function is likely to behave.
Using the Poisson function

    p(x) = ((r/N)^x * e^(-(r/N))) / x!

applied to hashing, where r is the number of records and N the number of addresses, we can conclude that, in general, if there are N addresses, then the expected number of addresses with x records assigned to them is N p(x).
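The Poisson estimate is easy to evaluate directly. A minimal sketch, with function names of my own choosing:

```python
from math import exp, factorial

def poisson(r: int, N: int, x: int) -> float:
    """p(x): probability that a given address receives exactly x records
    when r records are hashed randomly into N addresses."""
    t = r / N
    return (t ** x) * exp(-t) / factorial(x)

def expected_addresses(r: int, N: int, x: int) -> float:
    """Expected number of addresses holding exactly x records: N * p(x)."""
    return N * poisson(r, N, x)
```

With r = N (one record per address on average), p(0) = e^(-1) ≈ 0.368, so roughly 37% of addresses are expected to stay empty.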
-
Predicting Collisions for a Full
File
Suppose you have a hashing function that you believe will distribute records randomly, and you want to store 10,000 records in 10,000 addresses.
How many addresses do you expect to have no records assigned to them?
How many addresses should have one, two, and three records assigned, respectively?
How can we reduce the number of overflow records?
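The first two questions can be answered numerically with the Poisson estimate N p(x); the figures below follow directly from the formula with r = N = 10,000 (so r/N = 1):

```python
from math import exp, factorial

def expected(r: int, N: int, x: int) -> float:
    # Expected number of addresses holding exactly x records: N * p(x).
    t = r / N
    return N * (t ** x) * exp(-t) / factorial(x)

r = N = 10_000
for x in range(4):
    print(x, round(expected(r, N, x)))
# With r = N: about 3679 addresses get no record, about 3679 get exactly
# one, about 1839 get two, and about 613 get three.
```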
-
Increasing Memory Space I
Reducing collisions can be done by choosing a good hashing function or by using extra memory.
The question asked here is: how much extra memory should be used to obtain a given rate of collision reduction?
Definition: Packing density refers to the ratio of the number of records to be stored (r) to the number of available spaces (N): packing density = r / N.
The packing density gives a measure of the amount of space in a file that is used.
-
Increasing Memory Space II
The Poisson distribution allows us to predict the number of collisions that are likely to occur given a certain packing density. We use the Poisson distribution to answer the following questions:
How many addresses should have no records assigned to them?
How many addresses should have exactly one record assigned (no synonym)?
How many addresses should have one record plus one or more synonyms?
Assuming that only one record can be assigned to each home address, how many overflow records can be expected?
What percentage of records should be overflow records?
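As a sketch of the last two questions, assume one record per home address and a packing density of 0.5 (r = 500, N = 1000; the numbers are an illustrative choice). Since an address receiving x records stores only one of them, the expected number of stored records is N(1 - p(0)), and everything else overflows:

```python
from math import exp

r, N = 500, 1000             # packing density r/N = 0.5 (illustrative)
p0 = exp(-r / N)             # Poisson probability that an address is empty
overflow = r - N * (1 - p0)  # records that cannot fit in their home address
print(round(overflow, 1), "records,", round(100 * overflow / r, 1), "%")
# At packing density 0.5, roughly 21% of the records overflow.
```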
-
Collision Resolution by
Progressive Overflow
How do we deal with records that cannot fit into their home address? A simple approach: progressive overflow, or linear probing.
If a key, k1, hashes to the same address, a1, as another key, k2, then look for the first available address, a2, following a1, and place k1 in a2. If the end of the address space is reached, then wrap around.
When searching for a key that is not in the file, if the address space is not full, then either an empty address will be reached or the search will come back to where it began.
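The insert and search rules above can be sketched with a small in-memory table. The table size and hash function are illustrative choices, and the sketch assumes the table is never completely full:

```python
class ProbingTable:
    """Progressive overflow (linear probing) sketch."""
    def __init__(self, size: int = 11):
        self.slots = [None] * size

    def _home(self, key: str) -> int:
        return sum(ord(c) for c in key) % len(self.slots)

    def insert(self, key: str) -> int:
        a = self._home(key)
        while self.slots[a] is not None:   # on collision, probe forward,
            a = (a + 1) % len(self.slots)  # wrapping at the end
        self.slots[a] = key
        return a

    def search(self, key: str):
        a = start = self._home(key)
        while self.slots[a] is not None:
            if self.slots[a] == key:
                return a
            a = (a + 1) % len(self.slots)
            if a == start:                 # came back to where we began
                break
        return None                        # empty slot reached: not in file
```

Note that an unsuccessful search stops exactly as the slide describes: at an empty address, or after wrapping all the way around.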
-
Search Length when using
Progressive Overflow
Progressive overflow causes extra searches and thus extra disk accesses. If there are many collisions, then many records will be far from home.
Definitions: Search length refers to the number of accesses required to retrieve a record from secondary memory. The average search length is the average number of times you can expect to have to access the disk to retrieve a record.
Average search length = (total search length) / (total number of records)
-
Storing More than One Record
per Address: Buckets
Definition: A bucket describes a block of records sharing the same address that is retrieved in one disk access.
When a record is to be stored or retrieved, its home bucket address is determined by hashing.
When a bucket is filled, we still have to worry about the record overflow problem, but this occurs much less often than when each address can hold only one record.
-
Effect of Buckets on Performance
To compute how densely packed a file is, we need to consider: 1) the number of addresses, N (the number of buckets); 2) the number of records we can put at each address, b (the bucket size); and 3) the number of records, r. Then:

    Packing density = r / bN

Though the packing density does not change when halving the number of addresses and doubling the size of the buckets, the expected number of overflows decreases dramatically.
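The claim can be checked numerically with the Poisson estimate. A bucket of capacity b that receives x > b records overflows by x - b, so the expected overflow is N times the tail sum below. The numbers (500 records at packing density 0.5) are an illustrative choice:

```python
from math import exp, factorial

def expected_overflow(r: int, N: int, b: int, terms: int = 50) -> float:
    """Expected number of overflow records: r records hashed randomly
    into N buckets of capacity b (Poisson sketch; tail truncated)."""
    t = r / N
    p = lambda x: (t ** x) * exp(-t) / factorial(x)
    return N * sum((x - b) * p(x) for x in range(b + 1, terms))

r = 500
print(round(expected_overflow(r, 1000, 1), 1))  # b = 1, N = 1000
print(round(expected_overflow(r, 500, 2), 1))   # b = 2, N = 500
```

Both configurations have packing density 0.5, yet halving the addresses while doubling the bucket size roughly halves the expected overflow (about 106.5 records versus about 51.8).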
-
Making Deletions
Deleting a record from a hashed file is more complicated than adding a record, for two reasons:
The slot freed by the deletion must not be allowed to hinder later searches.
It should be possible to reuse the freed slot for later additions.
In order to deal with deletions we use tombstones, i.e., a marker indicating that a record once lived there but no longer does. Tombstones solve both of the problems caused by deletion.
Insertion of records is slightly different when using tombstones.
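Both properties can be seen in a small sketch extending linear probing with tombstones (table size and hash function are illustrative choices; the table is assumed never to fill up):

```python
TOMBSTONE = object()  # marker: a record once lived here but no longer does

class ProbingTableWithDeletes:
    """Linear probing with tombstone deletion (sketch)."""
    def __init__(self, size: int = 11):
        self.slots = [None] * size

    def _home(self, key: str) -> int:
        return sum(ord(c) for c in key) % len(self.slots)

    def insert(self, key: str) -> None:
        a = self._home(key)
        # Insertion may reuse a tombstone slot: the "slight difference".
        while self.slots[a] is not None and self.slots[a] is not TOMBSTONE:
            a = (a + 1) % len(self.slots)
        self.slots[a] = key

    def search(self, key: str):
        a = start = self._home(key)
        while self.slots[a] is not None:  # a tombstone does NOT stop the probe
            if self.slots[a] == key:
                return a
            a = (a + 1) % len(self.slots)
            if a == start:
                break
        return None

    def delete(self, key: str) -> None:
        a = self.search(key)
        if a is not None:
            self.slots[a] = TOMBSTONE
```

Because a tombstone is not `None`, a search probes past it, so deleting a record never hides a synonym stored beyond it; and because `insert` treats tombstone slots as available, the freed space is reused.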
-
Effects of Deletions and
Additions on Performance
After a large number of deletions and additions have taken place, one can expect to find many tombstones occupying places that could be occupied by records whose home address precedes them but that are stored after them. This degrades average search lengths.
There are three types of solutions for dealing with this problem: a) local reorganization during deletions; b) global reorganization when the average search length is too large; c) use of a different collision resolution algorithm.
-
Other Collision Resolution
Techniques
There are a few variations on random hashing that may improve performance:
Double hashing: When an overflow occurs, use a second hashing function to map the record to its overflow location.
Chained progressive overflow: Like progressive overflow, except that synonyms are linked together with pointers.
Chaining with a separate overflow area: Like chained progressive overflow, except that overflow addresses do not occupy home addresses.
Scatter tables: The hash file contains no records, but only pointers to records; i.e., it is an index.
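The first variation can be sketched briefly. Both hash functions below are illustrative choices; the key points are that the second function supplies the probe step, that this step must be nonzero, and that with a prime table size every slot is eventually visited:

```python
TABLE_SIZE = 11  # prime, so any nonzero step cycles through all slots

def h1(key: str) -> int:
    """Primary hash: the home address."""
    return sum(ord(c) for c in key) % TABLE_SIZE

def h2(key: str) -> int:
    """Secondary hash: the probe step; never zero by construction."""
    return 1 + (sum(ord(c) for c in key) % (TABLE_SIZE - 1))

def probe_sequence(key: str, n: int):
    """First n addresses examined for `key` under double hashing."""
    return [(h1(key) + i * h2(key)) % TABLE_SIZE for i in range(n)]
```

Unlike progressive overflow, synonyms of one key follow different probe sequences when their secondary hashes differ, which reduces the clustering that inflates search lengths.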
-
Pattern of Record Access
If we have some information about which records get accessed most often, we can optimize their location so that these records will have short search lengths.
By doing this, we try to decrease the effective average search length even if the nominal average search length remains the same.
This principle is related to the one used in
Huffman encoding.