
1

Indexing Structures for Files

2

Basic Concepts

Indexing mechanisms are used to speed up access to desired data, based on a search key, without having to scan the entire table.

Search key: an attribute used to look up records in a file.

3

Index Structure

An index file consists of records (called index entries) of the form (search-key, pointer).

• Index entries: a search-key value and a pointer to a row having that value.
• The values in the index are ordered.

Index files are typically much smaller than the original file.

When a file is modified, every index on the file must be updated. Updating indices imposes overhead on database modification.

4

Index Evaluation Metrics

Indexing techniques are evaluated on the basis of:
• Access types (queries) supported efficiently:
– records with a specified value in the attribute, or
– records with an attribute value falling in a specified range of values.
• Access/search time
• Insertion time
• Deletion time
• Space overhead

5

Index Classification

Primary index: specified on the ordering key field of an ordered file, where every record has a unique value for that field. The index has the same ordering as the file.

Clustering index: specified on the ordering field of an ordered file when that field is not a key. The index has the same ordering as the file. An ordered file can have at most one primary index or one clustering index, but not both.

Secondary index: specified on any nonordering field of the file. The index has a different ordering than the file. A file can have several secondary indices in addition to its primary/clustering index.

6

Primary Indices

A primary index is specified on the ordering key field of an ordered file.

There is one index entry (or index record) in the index file for each block in the data file. Each index entry has the value of the primary key field for the first record in a block.

The total number of entries in the index file is the same as the number of disk blocks in the data file.

The index file for a primary index needs fewer blocks than the data file does.

7

Primary Indices

8

Primary Indices

Finding a record is efficient: do a binary search on the index.

Record insertion and deletion are a major problem. We can avoid the problem by:
• using an unordered overflow file, or
• using a linked list of overflow records.

9

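Primary Indices

[Figure: a sequential index over a continuous (ordered) area holding blocks (10, 20, 30), (40, 50, 60), (70, 80, 90), followed by free space. A separate overflow area (not sequential) holds records 39, 31, 35, 36, 32, 38, 34, 33.]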

10

Sparse Vs. Dense Indices

A dense index has an index entry for every record in the file.

A sparse (nondense) index has index entries for only some of the search-key values.

A primary index is a sparse (nondense) index.

11


[Figure: an ordered file with fields Id, Name, Dept, sorted on Id, together with a sparse primary index sorted on Id and a dense secondary index sorted on Name.]

12


[Figure: an ordered file on Name with a sparse primary index on Name and a dense secondary index on Age.]

Sparse primary index on Name: Ashby, Cass, Smith

Ordered file on Name:
Ashby, 25, 3000
Basu, 33, 4003
Bristow, 30, 2007
Cass, 50, 5004
Daniels, 22, 6003
Jones, 40, 6003
Smith, 44, 3000
Tracy, 44, 5004

Dense secondary index on Age: 22, 25, 30, 33, 40, 44, 44, 50

13

Dense Indices

Pro:
• Very efficient in locating a record given a key, if the index fits in memory.
• Can tell whether a record exists without accessing the file.

Con:
• If the index is too big to fit in memory, it is expensive to use for finding a record given its key.

14

Sparse Indices

A sparse index contains index records for only some search-key values.
• Some keys in the data file will not have an entry in the index file.
• Applicable when records are sequentially ordered on the search key (ordered files).
• Normally keeps only one key per data block.

To locate a record with search-key value K we:
• Find the index record with the largest search-key value ≤ K.
• Search the file sequentially, starting at the record to which the index record points.
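The two lookup steps can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: it assumes the file is a list of sorted blocks held in memory and that the index keeps the first key of each block. The names `sparse_lookup`, `index_keys` and `blocks` are hypothetical.

```python
from bisect import bisect_right


def sparse_lookup(index_keys, blocks, k):
    """Return the number of the block holding key k, or None if k is absent.

    index_keys[i] is the first (smallest) key in blocks[i]; both the index
    and the file are assumed to be sorted on the search key.
    """
    # Index entry with the largest search-key value <= k.
    pos = bisect_right(index_keys, k) - 1
    if pos < 0:
        return None  # k is smaller than every key in the file
    # Scan the file sequentially from the block that entry points to.
    for block_no in range(pos, len(blocks)):
        for key in blocks[block_no]:
            if key == k:
                return block_no
            if key > k:
                return None  # passed the place where k would be stored
    return None


# Ordered file with two records per block and one index entry per block.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]
index_keys = [block[0] for block in blocks]

print(sparse_lookup(index_keys, blocks, 40))  # 1
print(sparse_lookup(index_keys, blocks, 45))  # None
```

The binary search touches only the small index; the sequential scan then reads at most a block or two of the data file.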

15

Ordered File

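Sparse Indices

[Figure: an ordered file with two records per block, (10, 20), (30, 40), (50, 60), (70, 80), (90, 100), and a sparse/primary index with one entry per data block. Index blocks: (10, 30, 50, 70), (90, 110, 130, 150), (170, 190, 210, 230).]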

16

Sparse Indices

• Less space (can keep more of the index in memory).
• Support a multi-level indexing structure.
• Less maintenance overhead for insertions and deletions.

17

Index Update: Deletion

If the deleted record was the only record in the file with its particular search-key value, the search key is deleted from the index also.

Single-level index deletion:
• Dense indices: deletion of the search key is similar to file record deletion.
• Sparse indices:
– If an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order).
– If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
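The sparse-index rules can be tried out with a small Python sketch. It is only an illustration under simplifying assumptions: keys are distinct, the file and the index are plain sorted lists, and pointers are omitted. The function name `sparse_delete` is hypothetical.

```python
def sparse_delete(file_keys, index_keys, k):
    """Delete key k from a sorted file and update its sparse index in place.

    file_keys and index_keys are sorted lists of distinct keys; pointers
    are left out to keep the sketch short.
    """
    pos = file_keys.index(k)  # raises ValueError if k is not in the file
    next_key = file_keys[pos + 1] if pos + 1 < len(file_keys) else None
    del file_keys[pos]
    if k in index_keys:
        i = index_keys.index(k)
        if next_key is None or next_key in index_keys:
            del index_keys[i]  # next key is already indexed: drop the entry
        else:
            index_keys[i] = next_key  # replace with the next key in the file


file_keys = [10, 20, 30, 40, 50, 60, 70, 80]
index_keys = [10, 30, 50, 70]

sparse_delete(file_keys, index_keys, 30)
print(index_keys)  # [10, 40, 50, 70]
sparse_delete(file_keys, index_keys, 40)
print(index_keys)  # [10, 50, 70]
```

Deleting a key that has no index entry (for example 20 or 60 here) leaves the index untouched.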

18

Dense Index: Deletion

[Figure: a data file with blocks (10, 20), (30, 40), (50, 60), (70, 80) and a dense index with entries 10, 20, 30, 40, 50, 60, 70, 80.]

19

Dense Index: Deletion

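[Figure: a data file with blocks (10, 20), (30, 40), (50, 60), (70, 80) and a dense index with entries 10, 20, 30, 40, 50, 60, 70, 80. Delete record 30: the record and its index entry are removed, and 40 moves up to fill the gap in both the data block and the index.]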

20

Sparse Index: Deletion

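[Figure: a data file with blocks (10, 20), (30, 40), (50, 60), (70, 80) and a sparse index whose entries 10, 30, 50, 70 point to them (further index entries: 90, 110, 130, 150).]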

21


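[Figure: a data file with blocks (10, 20), (30, 40), (50, 60), (70, 80) and sparse index entries 10, 30, 50, 70. Delete record 40: the index is unchanged, since 40 has no index entry.]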

22


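[Figure: a data file with blocks (10, 20), (30, 40), (50, 60), (70, 80) and sparse index entries 10, 30, 50, 70. Delete record 30: 40 becomes the first record of its block and the index entry 30 is replaced by 40.]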

23


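[Figure: a data file with blocks (10, 20), (30, 40), (50, 60), (70, 80) and sparse index entries 10, 30, 50, 70. Delete records 30 and 40: the block becomes empty, its index entry is removed, and the entries 50 and 70 move up.]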

24

Index Update: Insertion

Single-level index insertion:
• Perform a lookup using the search-key value appearing in the record to be inserted.
• Dense indices: if the search-key value does not appear in the index, insert it.
• Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. In this case, the first search-key value appearing in the new block is inserted into the index.
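A minimal Python sketch of the sparse case follows. It assumes in-memory blocks of a fixed capacity and handles overflow by creating a new block right after the full one; other overflow strategies exist and are not covered. `sparse_insert` and its `capacity` parameter are hypothetical names.

```python
from bisect import bisect_right, insort


def sparse_insert(blocks, index_keys, k, capacity=2):
    """Insert key k into a sorted, blocked file and keep its sparse index current.

    blocks[i] is a sorted list of keys and index_keys[i] is its first key.
    `capacity` (records per block) is a hypothetical parameter.
    """
    # Lookup: the block whose index entry is the largest one <= k.
    pos = max(bisect_right(index_keys, k) - 1, 0)
    block = blocks[pos]
    insort(block, k)
    index_keys[pos] = block[0]  # only changes if k became the first key
    if len(block) > capacity:
        # Overflow: move the upper records into a new block and index
        # the first search-key value of that new block.
        new_block = block[capacity:]
        del block[capacity:]
        blocks.insert(pos + 1, new_block)
        index_keys.insert(pos + 1, new_block[0])


blocks = [[10, 20], [30], [40, 50], [60]]
index_keys = [10, 30, 40, 60]

sparse_insert(blocks, index_keys, 34)  # fits in the block starting at 30
print(index_keys)  # [10, 30, 40, 60]
sparse_insert(blocks, index_keys, 15)  # overflows the first block
print(index_keys)  # [10, 20, 30, 40, 60]
```

Only the second insertion changes the index, because only it creates a new block.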

25

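Sparse Index: Insertion

[Figure: a data file with blocks (10, 20), (30, free), (40, 50), (60, free) and a sparse index with entries 10, 30, 40, 60.]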

26

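[Figure: a data file with blocks (10, 20), (30, free), (40, 50), (60, free) and sparse index entries 10, 30, 40, 60. Insert record 34: it is placed in the free slot of the block that starts with 30; the index is unchanged.]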


27

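[Figure: a data file with blocks (10, 20), (30, free), (40, 50), (60, free) and sparse index entries 10, 30, 40, 60. Insert record 15: 15 goes into the first block after 10, 20 moves into the next block ahead of 30, and the index entry 30 becomes 20.]

• Illustrated: immediate reorganization
• Variation:
– insert new block (chained file)
– update index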


28

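[Figure: a data file with blocks (10, 20), (30, free), (40, 50), (60, free) and sparse index entries 10, 30, 40, 60. Insert record 25: it is placed in an overflow block (reorganize later...).]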


29

Dense Index: Insertion

Similar. Often more expensive . . .

30

Duplicate keys

[Figure: a data file with duplicate keys, blocks (10, 10), (10, 20), (20, 30), (30, 30), (40, 45).]

31

Duplicate keys

Dense index

[Figure: a data file with blocks (10, 10), (10, 20), (20, 30), (30, 30), (40, 45) and a dense index with one entry per record, duplicates included: 10, 10, 10, 20, 20, 30, 30, 30, ...]

32

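Duplicate keys

Sparse index, one way?

[Figure: a data file with blocks (10, 10), (10, 20), (20, 30), (30, 30), (40, 45) and a sparse index holding the first key of each block: 10, 10, 20, 30, 40. Careful if looking for 20 or 30!]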

33

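Duplicate keys

Sparse index, another way? (clustering index)

[Figure: a data file with blocks (10, 10), (10, 20), (20, 30), (30, 30), (40, 45) and a sparse index that places the first new key from each block; index entries shown: 10, 20, 30, 30. Annotation: should this be 40?]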

34

Clustering Indices

A clustering index can be used when the field (the clustering field) is non-key and the data file is sorted by the clustering field.

A file can have at most one primary index or one clustering index, but not both.

A clustering index is also an ordered file with two fields:
• the clustering field value, and
• a pointer to the first block that has a record with that value for its clustering field.

There is one entry in the clustering index for each distinct value of the clustering field (rather than for every record), so it is a sparse (nondense) index.
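As a rough illustration, a clustering index with one entry per distinct value can be built in a single pass over the sorted file. The sketch below assumes the file is a list of in-memory blocks holding only the clustering-field values; `build_clustering_index` is a hypothetical name.

```python
def build_clustering_index(blocks):
    """Return (value, block_no) entries, one per distinct clustering value.

    blocks is a list of blocks, each a list of clustering-field values,
    with the whole file sorted on that field.
    """
    index = []
    for block_no, block in enumerate(blocks):
        for value in block:
            if not index or index[-1][0] != value:
                # First time this value appears: point at this block.
                index.append((value, block_no))
    return index


blocks = [[10, 10], [10, 20], [20, 30], [30, 30], [40, 45]]
print(build_clustering_index(blocks))
# [(10, 0), (20, 1), (30, 2), (40, 4), (45, 4)]
```

Eight of the ten records share a value with a neighbour, so the index has five entries instead of ten.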

35

Clustering Indices

A clustering index on the DEPNo ordering nonkey field of an EMPLOYEE file.

36

Clustering Indices

Record insertion and deletion still cause problems. A solution: reserve a cluster of contiguous blocks.

Good for range searches:
• Use the location mechanism to locate the index entry at the start of the range. This locates the first data record.
• Subsequent data records are contiguous if the index is clustered (not so if unclustered).

37

Clustering Indices

Clustering index with a separate block cluster for each group of records that share the same value for the clustering field.

38

Secondary Indices

A secondary index is specified on any nonordering field of the file.

The non-ordering field can be a key (unique) or a non-key (duplicates).

The index has a different ordering than the file.

A file can have several secondary indices in addition to its primary index.

• There is one index entry for each data record.
• Each index record points either to the block in which the record is stored, or to the record itself.
• Hence, such an index is dense.
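A dense secondary index can be sketched in Python as a sorted list of (field value, record number) pairs. The sample records reuse the names and ages from the earlier ordered-file figure; the function names and the use of list positions as record pointers are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def build_secondary_index(records, field):
    """Dense index: one (field value, record number) entry per record."""
    return sorted((rec[field], rid) for rid, rec in enumerate(records))


def secondary_lookup(index, value):
    """Return the record numbers of all records with the given field value."""
    keys = [key for key, _ in index]
    lo, hi = bisect_left(keys, value), bisect_right(keys, value)
    return [rid for _, rid in index[lo:hi]]


# File ordered on name (field 0); the secondary index is on age (field 1).
records = [("Ashby", 25), ("Basu", 33), ("Bristow", 30), ("Cass", 50),
           ("Daniels", 22), ("Jones", 40), ("Smith", 44), ("Tracy", 44)]
age_index = build_secondary_index(records, field=1)

print([records[rid][0] for rid in secondary_lookup(age_index, 44)])
# ['Smith', 'Tracy']
```

The index is ordered by age while the file stays ordered by name, which is why consecutive index entries can point to records far apart in the file.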

39

Secondary Indices

A secondary index usually needs more storage space and longer search time than a primary index, because it has a larger number of entries.

A sequential scan using the primary index is efficient, but a sequential scan using a secondary index is expensive: each record access may fetch a new block from disk.

40

Secondary Indices

A dense secondary index (with block pointers) on a nonordering KEY field.

41

Secondary Indices

A dense secondary index (with record pointers) on a non-ordering non-key field.

42

Index Types and Indexing Fields

Also, review Table 14.2.

                            Data file ordered by      Data file not ordered by
                            indexing field            indexing field
Indexing field is key       Primary index             Secondary index (key)
Indexing field is nonkey    Clustering index          Secondary index (nonkey)

43

Multilevel Indices

To search the index faster, we can create an index for the index.

A multilevel index treats the index file as an ordered file and creates a primary index for the first level:
• outer index: a sparse index of the primary index
• inner index: the primary index file

The above process can be repeated for a higher level if the previous level needs more than one block of disk storage.
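The repetition can be sketched as a loop that keeps adding levels until the top level fits in one block. This is a simplified in-memory model: each level is a sorted list of keys, and `fan_out` (index entries per block) is a hypothetical parameter name.

```python
def build_multilevel_index(first_level_keys, fan_out):
    """Return the index levels, first (inner) level first.

    fan_out is the number of index entries that fit in one block
    (a hypothetical parameter name).
    """
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    levels = [list(first_level_keys)]
    # Keep adding levels while the top level needs more than one block.
    while len(levels[-1]) > fan_out:
        # Sparse index on the level below: first key of each index block.
        levels.append(levels[-1][::fan_out])
    return levels


first_level = list(range(10, 250, 20))  # 10, 30, ..., 230 (12 entries)
levels = build_multilevel_index(first_level, fan_out=4)
print(len(levels))  # 2
print(levels[1])    # [10, 90, 170]
```

With 12 first-level entries and 4 entries per block, the first level occupies three blocks, so one outer level with three entries is enough.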

Read EXAMPLE 3

44

Multilevel Indices

45

B+-Tree Index

A B+-tree of order f (fan-out, the maximum node capacity) is a rooted tree satisfying the following:

• All paths from root to leaf are of the same length (balanced tree).
• Each non-leaf node (except the root) has between ⌈f/2⌉ and f tree pointers (and so at most f-1 key values).
• A leaf node has between ⌈f/2⌉ and f-1 data pointers (plus a pointer to its sibling node).
• If the root is not a leaf, it has at least 2 children.
• If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and f-1 values.

46

B+-Tree Non-leaf Node Structure

Ki are the search-key values, with K1 < K2 < K3 < … < Kf-1.

Pi are pointers to children nodes (tree nodes):
• All keys in the subtree to which P1 points are ≤ K1.
• All keys in the subtree to which Pf points are > Kf-1.
• For 2 ≤ i ≤ f-1, all keys in the subtree to which Pi points have values > Ki-1 and ≤ Ki.

47

B+-Tree Leaf Node Structure

For i = 1, 2, …, f-1, pointer Pri is a data pointer that points either to a file record with search-key value Ki, or to a block of record pointers that point to records having search-key value Ki (if the search key is not a key).

Pnext points to the next leaf node in search-key order.

Within each leaf node, K1 < K2 < K3 < … < Kf-1.

If Li and Lj are leaf nodes and i < j, then Li's search-key values are less than Lj's search-key values.

48

Sample Leaf Node

[Figure: a sample leaf node with keys 57, 81, 95. A pointer arrives from a non-leaf node; each key has a pointer to the record with that key (57, 81, 95); a final pointer leads to the next leaf in sequence.]

49

Sample Non-Leaf Node

[Figure: a sample non-leaf node with keys 57, 81, 95. Its four pointers lead to keys k ≤ 57, to keys 57 < k ≤ 81, to keys 81 < k ≤ 95, and to keys k > 95.]

50

Example of a B+-Tree

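[Figure: a B+-tree with f = 4. Keys in the root and other non-leaf nodes: 11, 35, 110, 130, 179. Leaf nodes, left to right: (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200).]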

51

Number of pointers/keys for B+-Tree

f=4

            Full node        Min. node
Non-leaf    120, 150, 180    30
Leaf        3, 5, 11         30, 35

52

Observations about B+-Trees

In a B+-tree, data pointers are stored only at the leaf nodes of the tree; hence, the structure of leaf nodes differs from the structure of internal nodes.

The leaf nodes have an entry for every value of the search field, along with a data pointer to the record.

Some search field values from the leaf nodes are repeated in the internal nodes.

53

B+-Trees: Search

Let a be a search-key value and T a pointer to the root of a tree whose nodes hold up to f pointers.

Search(a, T)
  If T is a non-leaf node:
    for the first i that satisfies a ≤ Ki, 1 ≤ i ≤ f-1, call Search(a, Pi);
    else (no such i) call Search(a, Pf).
  Else (T is a leaf node):
    if no value in T equals a, report not found;
    else if Ki in T equals a, follow pointer Pri to read the record/block.
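The search procedure might be written in Python as below. This is a sketch under assumptions: the `Node` class and the `leaf` helper are hypothetical, nodes live in memory, and a subtree reached through pointer Pi is taken to hold keys no larger than Ki, matching the largest-key-of-the-left-node convention used by the insertion rule. The small tree is modelled on part of the example figure.

```python
from bisect import bisect_left


class Node:
    """B+-tree node: non-leaf nodes have children, leaf nodes have records."""

    def __init__(self, keys, children=None, records=None):
        self.keys = keys
        self.children = children  # list of Node, or None for a leaf
        self.records = records    # list of records, or None for a non-leaf


def search(a, t):
    """Return the record with search-key value a, or None if not found."""
    # bisect_left gives the first i with a <= keys[i]; if there is none it
    # returns len(keys), which selects the last pointer.
    i = bisect_left(t.keys, a)
    if t.children is not None:  # non-leaf node
        return search(a, t.children[i])
    if i < len(t.keys) and t.keys[i] == a:  # leaf node
        return t.records[i]
    return None


def leaf(*keys):
    """Hypothetical helper: a leaf whose records are just labelled strings."""
    return Node(list(keys), records=[f"record {k}" for k in keys])


root = Node([11, 35], children=[leaf(3, 5, 11), leaf(30, 35), leaf(100, 101, 110)])
print(search(35, root))  # record 35
print(search(36, root))  # None
```

Each call descends one level, so the number of nodes visited equals the height of the tree.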

54

B+-Trees: Search

In processing a query, a path is traversed in the tree from the root to some leaf node.

If there are n search-key values in the file, the path is no longer than ⌈log_⌈f/2⌉(n)⌉ (worst case).

With 1 million search-key values and f = 100, at most ⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a lookup.

Contrast this with a balanced binary tree with 1 million search-key values: around 20 nodes are accessed in a lookup.
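The two figures can be checked with a few lines of Python. The helper name `max_path_length` is hypothetical, and the formula assumes every node is at least half full, with f/2 rounded up.

```python
import math


def max_path_length(n, f):
    """Worst-case number of nodes on a root-to-leaf path (requires f >= 3)."""
    return math.ceil(math.log(n) / math.log(math.ceil(f / 2)))


print(max_path_length(1_000_000, 100))  # 4
print(math.ceil(math.log2(1_000_000)))  # 20
```

Since log_50(1,000,000) is about 3.53 and log_2(1,000,000) is about 19.93, rounding up gives 4 and 20.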

55

B+-Trees: Insertion

Find the leaf node in which the search-key value would appear.

If the search-key value is found in the leaf node, add the record to the main file and, if necessary, add a pointer to the record to the block of record pointers.

If the search-key value is not there, add the record to the main file and then:
• If there is room in the leaf node, insert the (key-value, pointer) pair in the leaf node.
• Otherwise, split the node along with the new (key-value, pointer) entry.

56

B+-Trees: Insertion

Splitting a node:
• Take the f (search-key value, pointer) pairs (including the one being inserted) in sorted order.
• Place the first ⌊(f+1)/2⌋ in the original node x, and the rest in a new node y.
• Let k be the largest key value in x.
• Insert (k, y) in the parent node in its correct sequence.

If the parent is full, it is split as well:
• The entries in the parent node up to Pj, where j = ⌊(f+1)/2⌋, are kept, while the jth search value is moved up to the parent and not replicated.
• A new internal node holds the entries from Pj+1 to the end of the entries in the node.
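The leaf-splitting steps can be illustrated with a short Python sketch. It assumes (f+1)/2 is rounded down, which is the reading that reproduces the f = 4 examples in these slides; `split_leaf` and the string "pointers" are hypothetical.

```python
def split_leaf(pairs, new_pair, f):
    """Split a full leaf that must absorb one more (key, pointer) pair.

    pairs holds the f - 1 pairs already in the leaf. Returns (x, y, k):
    the pairs kept in the original node, the pairs for the new node, and
    the key k to insert in the parent together with a pointer to y.
    """
    entries = sorted(pairs + [new_pair], key=lambda pair: pair[0])
    j = (f + 1) // 2  # integer division: assumed rounding of (f+1)/2
    x, y = entries[:j], entries[j:]
    k = x[-1][0]  # largest key value left in x
    return x, y, k


full_leaf = [(3, "r3"), (5, "r5"), (11, "r11")]
x, y, k = split_leaf(full_leaf, (7, "r7"), f=4)
print([key for key, _ in x], [key for key, _ in y], k)
# [3, 5] [7, 11] 5
```

The key k stays in the leaf x as well as going to the parent; only splits of non-leaf nodes move a key up without keeping a copy.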

57

B+-Trees: Insertion

The splitting of nodes proceeds upwards till a node that is not full is found.

In the worst case the root node may be split increasing the height of the tree by 1.

58

Insertion – Example 3

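[Figure: f = 4. Parent node with keys 11, 32; leaves (3, 5, 11) and (30, 32). Insert key 31: the leaf (30, 32) has room, so 31 is placed in it.]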

59


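[Figure: f = 4. Parent node with keys 11, 31; leaves (3, 5, 11) and (30, 31). Insert key 7: the full leaf (3, 5, 11) splits into (3, 5) and (7, 11), and key 5 is inserted in the parent.]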

60


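[Figure: f = 4. Root key 100; non-leaf node with keys 120, 140, 179; leaves (150, 156, 179) and (180, 200). Insert key 160: the full leaf splits into (150, 156) and (160, 179), and the split propagates to the full non-leaf node.]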

61


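[Figure: new root, f = 4. Root with keys 3, 12, 25; leaves (1, 2, 3), (10, 12), (20, 25), (30, 32, 40). Insert key 45: the full leaf splits into (30, 32) and (40, 45), key 32 moves up, the full root splits, and a new root with key 25 is created.]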

63

B+-Trees: Deletion

Find the record to be deleted, and remove it from the main file and from the bucket (if present).

Remove the (search-key value, pointer) pair from the leaf node.

If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then insert all the search-key values of the two nodes into a single node (the one on the left) and delete the other node.

64

B+-Trees: Deletion

Delete the pair (Ki-1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above procedure.

Otherwise, if the node has too few entries due to the removal, and the entries in the node and a sibling do not fit into a single node, then:
• Redistribute the pointers between the node and a sibling such that both have at least the minimum number of entries.
• Update the corresponding search-key value in the parent of the node.
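The choice between merging and redistributing can be sketched for leaf nodes as follows. This is a simplified illustration, not a full deletion routine: it works on bare key lists, always uses the left sibling, and leaves the parent update to the caller. `rebalance` and `max_entries` are hypothetical names, and the key values are chosen for illustration.

```python
def rebalance(node, sibling, max_entries):
    """Fix an under-full leaf using its left sibling.

    node and sibling are sorted key lists, with every key in sibling
    smaller than every key in node. Returns (left, right); right is None
    when the two nodes were merged into one.
    """
    entries = sibling + node
    if len(entries) <= max_entries:
        return entries, None  # merge into the left node, drop the other
    mid = len(entries) // 2   # redistribute as evenly as possible
    return entries[:mid], entries[mid:]


# f = 4, so a leaf holds at most 3 keys.
print(rebalance([40], [10, 20], max_entries=3))      # ([10, 20, 40], None)
print(rebalance([35], [10, 15, 30], max_entries=3))  # ([10, 15], [30, 35])
```

After a merge the caller removes the pointer to the dropped node from the parent; after a redistribution it updates the separating key in the parent.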

65

B+-Trees: Deletion

The node deletions may cascade upwards until a node that has ⌈f/2⌉ or more pointers is found.

If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.

66

Merge with Sibling

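[Figure: f = 4. Delete key 45: the leaf that held 45 is left with too few entries and is merged with its sibling. Keys shown in the tree: 10, 20, 40, 45, 50.]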

67

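Redistribute Keys

[Figure: f = 4. Delete key 40: the leaf that held 40 is left with too few entries, a key is moved over from its sibling, and the key in the parent is updated. Keys shown in the tree: 10, 15, 30, 35, 40, 50.]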

68

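Non-leaf Merging

[Figure: f = 4. Leaves (1, 3), (10, 14), (20, 22), (25, 26), (30, 37), (40, 45). Delete key 37: the leaf merge leaves a non-leaf node with too few pointers, the non-leaf nodes are merged in turn, and a new root results.]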

70

Extra Reading

Read Examples 1 to 7.
