Select Operation - disk access and Indexing

*Some info on slides from Dr. S. Son, U. Va

Disk access

• DBs traditionally stored on disk

• Cheaper to store on disk than in memory

• Costs for: seek time, latency, data transfer time

• Disk access is page (block) oriented

• 2 - 4 KB page size

Access time

• Access time is the time to randomly access a page

• System initially determines if page in memory buffer (page tables, etc.)

• Large disparity between disk access and memory access

Select operation using table scan

• Reading the entire table to answer a select is a table scan

• Improvements to table scan of disk:
– Parallel access
– Sequential prefetch

Parallel access

• Linear search - all data rows read in from disk
– I/O parallelism can be used (RAID)
  • multiple I/O read requests satisfied at the same time
  • stripe the data across different disks
– Problems with parallelism?
  • must balance disk arm load to gain maximum parallelism
  • requires the same total number of random I/O's, but using devices for a shorter time

Sequential prefetch I/O

• Retrieve one disk page after another (on same track) – 32 pages at a time in DB2, varies in Oracle

• Seek time no longer a problem

• Must know in advance to read 32 successive pages

• Speed up of I/O by a factor of ≈10 (500 I/O's per second vs. 70)

Access time

• Seek time – as low as 4 ms (server disks)

• Latency time – as low as 1 ms or less

• Data transfer time – 0.4-2 ms

• Solid state disks up to 100,000 I/Os per sec. – still expensive

Access time for fast I/O

Times in seconds:

                                RIO       Seq. Prefetch
Seek - disk arm to cylinder     .004      .004
Latency - platter to sector     .001      .001
Data transfer - page(s)         .0005     .016
Total                           .0055     .021            (1 page vs. 32 pages)
Total for 32 pages              .176*     .021            (32 pages for both)

* .0055 x 32 = .176 for 32 pages of RIO vs. .021 for 32 pages of Seq. Prefetch
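The arithmetic in the table above can be reproduced with a short sketch. The function and constant names are illustrative only; the component times are the ones given on the slide, and the model assumes sequential prefetch pays a single seek and latency for the whole run of pages.

```python
# Per-component times in seconds, taken from the slide's table.
SEEK = 0.004
LATENCY = 0.001
TRANSFER_PER_PAGE = 0.0005


def random_io_time(pages: int) -> float:
    """Random I/O: every page pays its own seek, latency and transfer."""
    return pages * (SEEK + LATENCY + TRANSFER_PER_PAGE)


def sequential_prefetch_time(pages: int) -> float:
    """Sequential prefetch: one seek and one latency, then back-to-back transfers."""
    return SEEK + LATENCY + pages * TRANSFER_PER_PAGE


rio = random_io_time(32)
sp = sequential_prefetch_time(32)
print(f"32 pages, random I/O:          {rio:.3f} s")
print(f"32 pages, sequential prefetch: {sp:.3f} s")
```

The gap comes almost entirely from paying the seek and rotational latency once instead of 32 times.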

Organizing disk space

• How should data be stored to minimize access time when the entire table is read?

Disk allocation

• Disk Resource Allocation for Databases (DBA has control)

• Goal – contiguous sectors on disk - want data as close together as possible to minimize seek time

• No standard SQL approach, but general way to deal with allocation

• Some OS allow specification of size of file and disk device

Types of Files
• Heap files (unordered – sequential)
• Sorted files (ordered – sort key)
• Hash files (hash key, hash function)
• B+-trees

• Storage Area Networks (SAN) – ERP (enterprise resource planning) and DW (data warehouses)
– Storage devices configured as nodes in network – can attach/detach

Tablespace

Tablespace is:
• Allocation medium for tables and indexes for ORACLE, DB2, etc.
• Can put >1 table in a tablespace if accessed together
• Tablespace corresponds to 1 or more OS files and can span disk devices
• Usually relations cannot span disk devices

DB storage structures

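[Figure: storage hierarchy for the Company database. Levels shown, top to bottom: DB (Company Database); tablespaces (tspace 1, system space); OS files (fname1, fname2, fname3); tables and index (Empl, Dept, Proj, Dep, EmpIndx); segments (data, data, data, data, index); extents.]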

Tablespace

• ORACLE DB's contain several tablespaces, including one called system - data description + indexes + user-defined tables

• default tablespace given to each user

• if multiple tablespaces - better control over load balancing

• can take some disk space off-line

Extent

• Relation composed of 1 or more extents

• Extent - contiguous storage on disk

• when data segment or index segment first created, given an initial extent from tablespace, e.g. 10KB (5 pages)

• if need more space, given next contiguous extent


Extent

• Can increase the size by a positive % (cannot decrease)
– initial n - size of initial extent
– next n - size of next extent
– max extents - maximum number of extents
– min extents - number of extents initially allocated
– pct increase n - % by which next extent grows over previous one
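As a rough illustration of how these storage parameters interact, the sketch below lists successive extent sizes. It assumes the simple rule stated on the slide (each extent after the "next" one grows by pct increase over the previous one); the function name and the sample values are hypothetical, and a real DBMS would also round sizes to whole pages.

```python
def extent_sizes(initial_kb, next_kb, pct_increase, count):
    """Return the sizes (in KB) of the first `count` extents of a segment."""
    sizes = [initial_kb]          # the initial extent
    size = next_kb                # size of the first "next" extent
    while len(sizes) < count:
        sizes.append(size)
        size = size * (1 + pct_increase / 100)  # grow by pct increase
    return sizes


# Hypothetical settings: initial 10 KB, next 10 KB, pct increase 50
sizes = extent_sizes(initial_kb=10, next_kb=10, pct_increase=50, count=4)
print(sizes)
```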

Oracle create tablespace

• http://www.adp-gmbh.ch/ora/sql/create_tablespace.html


Create table

• Create table statement - can specify tablespace, no. of extents
– When initial extent full, new extent allocated
– pctfree - determines how much space in a page can be used for inserts of new rows
  • if pctfree = 10%, inserts stop when page is 90% full
    » Uses another page
– pctused – determines when new inserts start again
  • if space used falls below a certain percentage of total, default pctused = 40%
– pctfree + pctused < 100

http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_7002.htm

Rows

• Row layout on each disk page

| Header info | Row directory: 1 2 3 … N | free space | Data rows: Row N, Row N-1, …, Row 1 |

• Header

• Row directory – row number and page byte offset
– Row number is row number in page, also called slot#
– Page byte offset – with varchar, row size not constant

• To identify a particular row use RID (RowID) – page #, slot # [file#]
– slot# is number in row directory (logical #)

Differences in DBMSs re: rows

• ROWID can be retrieved in ORACLE but not DB2 (violates relational model rule)

• ORACLE
– rows can be split between pages (row record fragmentation)
– can have rows from multiple tables on same page, more info

• DB2 - no splitting, entire row moved to new page, need forwarding pointer

Select operation using Indexes

• Alternative to table scan


Why use an index?

• If use a select (or join) on the same attribute frequently

• want a way to improve performance - use indexes
– For example:

Select * from Employee
where ssn = 333445555

B+-tree

• Most commonly used index structure type in DBs today

• Based on B-tree

• Good for equality and range searches

• B+ tree: dynamic, adjusts gracefully under inserts and deletes

• Used to minimize disk I/O

• Available in DB2; ORACLE also has hash cluster; Ingres has heap structure, B-tree, isam (chain together new nodes)

Structure of B+ Trees

• leaf level pointers to data (RIDs)

• the remaining are directory (index) nodes that point to other index nodes

[Figure: B+ tree with index entries in the upper levels (direct search) and data entries at the leaf level (the "sequence set")]

http://www.cs.ua.edu/457/Figures/fig6.11.pdf

Example of B+Tree

Index node: 10 20 40

Leaf nodes: [1 2 3] [10 12] [20 35] [40 42 50]

Leaf entries point to data

Characteristics of B+ Tree

• Order of tree (fan out) – max number of child nodes

• Minimum 50% occupancy (except for root). Each node contains d/2 <= m <= d-1 entries.
– Where the parameter d is the order of the tree.

• Insert/delete at log_F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages)

• Supports equality and range-searches efficiently

Cost of I/O for B+-tree

• One index node is one page

• If tree has depth of 3, 3 I/Os to get pointer to data

• An index node that has been read in can remain in memory
– likely, since upper-level nodes of actively used B+-trees are accessed frequently

B+ Trees in Practice

• Typical order: between 100-200 children

• Typical fill-factor: 2/3 full (66.6%)
– average fanout = 133 (if 200 children)

• Typical capacities:
– Height 4: 133^4 = 312,900,721 records
– Height 3: 133^3 = 2,352,637 records

• Can often hold top levels in buffer pool:
– Level 1 = 1 page = 8 Kbytes
– Level 2 = 133 pages = 1 Mbyte
– Level 3 = 17,689 pages = 133 MBytes
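The capacity figures follow directly from the average fanout. A minimal sketch, assuming 200 children per node, a 2/3 fill factor and 8 KB pages as on the slide (names are illustrative):

```python
MAX_CHILDREN = 200   # order of the tree (slide's upper figure)
FILL_FACTOR = 2 / 3  # typical fill factor
PAGE_KB = 8          # one index node = one 8 KB page

fanout = int(MAX_CHILDREN * FILL_FACTOR)  # average fanout


def capacity(height: int) -> int:
    """Number of records reachable through a tree of the given height."""
    return fanout ** height


def pages_at_level(level: int) -> int:
    """Number of index pages at a level (level 1 is the root)."""
    return fanout ** (level - 1)


for height in (3, 4):
    print(f"height {height}: {capacity(height):,} records")
for level in (1, 2, 3):
    pages = pages_at_level(level)
    print(f"level {level}: {pages:,} pages = {pages * PAGE_KB:,} KB")
```

The slide's "1 Mbyte" and "133 MBytes" are rounded versions of the KB totals this prints.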

Why B+-tree

• Directory structure - retrieve range of values efficiently
– search for leftmost index entry Si such that X <= Si

• Index entries always in sequence by value - can use sequential prefetch on index

• Index entries shorter than data rows - less I/O

B+-tree

• Balancing of B+-trees - insert, delete

• Nodes usually not full

• Utilities to reorganize to lower disk I/O

• Most systems allow nodes to become depopulated - no automatic algorithm to balance

• Average node below root level 71% full in active growing B+-trees

Duplicate key values

• Duplicate key values in index

• leaf nodes have sibling pointers

• but a delete of a row that has a heavily duplicated key entails a long search through the leaf level of the B+-tree

• Index compression - with multiple duplicates

| header info | PrX keyval RID RID ... RID | PrX keyval RID … RID |

where PrX is count of RID values
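The compressed leaf layout can be sketched by grouping sorted (key, RID) entries so each key value is stored once, preceded by its RID count. This is only an illustration of the idea; the sample entries and the tuple layout are made up, and RIDs are shown as (page, slot) pairs.

```python
from itertools import groupby

# Hypothetical leaf entries, already sorted by key value: (keyval, RID)
entries = [
    (10, (3, 1)), (10, (3, 4)), (10, (7, 2)),
    (20, (1, 5)), (20, (9, 9)),
]


def compress(entries):
    """Turn (keyval, RID) pairs into (PrX, keyval, [RID, ...]) groups."""
    compressed = []
    for keyval, group in groupby(entries, key=lambda entry: entry[0]):
        rids = [rid for _, rid in group]
        compressed.append((len(rids), keyval, rids))  # PrX = count of RIDs
    return compressed


leaf = compress(entries)
for prx, keyval, rids in leaf:
    print(prx, keyval, rids)
```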

Create Index

Options:
• multiple columns
• tablespace
• storage - initial extents, etc.
• percent free (default = 10) - % of each page left unfilled (creation)
• free page (1 free page for every n index pages during creation)


Types of indexes (textbook)

• Primary index - key field is a candidate key (must be unique)
– data file ordered by key field

• Clustering index - key field is not unique, data file is ordered
– all records with same values on same pages

• Secondary index - non-clustering index
– data file not ordered
– First record in the data page (or block) is called the anchor record

• Non-dense index - pointer in index entry points to anchor

• Dense index - pointer to every record in the file

Clustering

• Efficiency advantage - read in a page, get all of the rows with the same value

• clustering is useful for range queries, e.g. between keyval1 and keyval2

Clustering

• Can only cluster table by 1 clustering index at a time

• In SQL Server
– creates clustered index on PK automatically if no other clustered index on table and PK nonclustered index not specified

• In DB2
– if the table is empty, rows sorted as placed on disk
– subsequent insertions not clustered, must use REORG

• In Oracle
– Cluster index – now available for PK in 10g
– Define a cluster to create cluster index for 2 tables


Indexes vs. table scan

• To illustrate the difference between table scan, secondary index (non-clustered) and clustered index

Assume 10 M customers, 200 cities, 2KB/page, row = 100 bytes, 20 rows/page

Select *
From Customers
Where city = 'Birmingham'

If assume selectivity = 1/200: 1/200 * 10M = 50,000 customers in a city

Rules of Thumb for I/O

• Assume slightly slower times than before:
– Random I/O – 160 pages/second, .00625 seconds per page
– Sequential prefetch I/O – 1600 pages/second, .000625 seconds per page

Will discuss later:
– List prefetch I/O – 400 pages/second, .0025 seconds per page

Table Scan

Table Scan - read entire table

If random I/O (RIO) were used – WHICH ONE WOULD NEVER DO

10,000,000/20 = 500,000 pages

500,000 * RIO = 3125 seconds

Instead, it makes more sense to use sequential prefetch (SP), reading 32 pages at a time

500,000 * SP = 312.5 seconds

Clustering Index

• All entries for B'ham clustered on same pages

• 50,000/20 = 2500 data pages (with 20 rows per page)

• Assume 3 upper nodes of the tree

• Assume 1000 index entries per leaf node, read 50,000/1000 = 50 index pages

3 + 50,000/1000 + 50,000/20 = 2,553 pages to access

• If top 3 levels of tree in memory, count access time as 0

• Access time:

(3*0) + (50*SP) + (2500*SP) = 2,550 * .000625 = 1.6 seconds

Secondary Index

• In the worst case 1 entry for B'ham per page

• 50,000 data pages (10M/200)

3 + 50 + 50,000 = 50,053 accesses

(3*0) + (50*SP) + (50,000*RIO) = 312.53 seconds access time

REALLY slow – see next slide for a better solution!

Use List Prefetch instead of RIO

List Prefetch – Better solution

Create list of data pages to access

Pages not necessarily in contiguous sequential order

System orders pages to minimize disk I/O

E.g. elevator algorithm for disk request scheduling

Using list prefetch (LP)

(3*0) + (50*SP) + (50,000*LP) = 125.03 seconds access time
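The access-time comparisons for the Customers example can be checked with a short script. It uses the slides' rule-of-thumb rates and assumptions (top three index levels in memory, 1000 index entries per leaf page, worst case of one matching row per data page for the secondary index); the variable names are illustrative.

```python
# Seconds per page under the slides' rules of thumb
RIO = 1 / 160    # random I/O
SP = 1 / 1600    # sequential prefetch
LP = 1 / 400     # list prefetch

ROWS = 10_000_000
ROWS_PER_PAGE = 20
CITIES = 200
ENTRIES_PER_LEAF = 1000

matching_rows = ROWS // CITIES                      # 50,000 customers in one city
table_pages = ROWS // ROWS_PER_PAGE                 # 500,000 data pages
leaf_pages = matching_rows // ENTRIES_PER_LEAF      # 50 index leaf pages
clustered_pages = matching_rows // ROWS_PER_PAGE    # 2,500 data pages

costs = {
    "table scan, random I/O": table_pages * RIO,
    "table scan, sequential prefetch": table_pages * SP,
    "clustering index": leaf_pages * SP + clustered_pages * SP,
    "secondary index, random I/O": leaf_pages * SP + matching_rows * RIO,
    "secondary index, list prefetch": leaf_pages * SP + matching_rows * LP,
}

for plan, seconds in costs.items():
    print(f"{plan:35s} {seconds:10.2f} s")
```

Under these assumptions the clustering index is about two orders of magnitude faster than any other plan, and a secondary index read with random I/O is no better than a sequential-prefetch table scan.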

% Free

• Redo the previous calculations assuming relations created with 50% free option specified.

Creating Indexes

• When determining what indexes to create consider:
– workload - mix of queries and frequencies of requests
  • 20% of requests are updates, etc.
– can create lots of indexes but:
  • cost to create
  • insertions
  • initial load time high if a large table
  • index entries can become longer and longer as multiple columns included

Multiple Indexes

• More than one index on a relation
– e.g. age - one index, class - one index, gender - one index

Composite Index

• One index based on more than one attribute

Create Index index_name on Table (col1, col2, ... coln)

• Composite index entry - values for each attribute, e.g. for age, class, gender the entry in the index is: C1, C2, C3, RID

Using Indexes

• System must decide whether to use an index

• What if more than one index, which one?

• What if composite index?

Plans using Indexes

Can use an index if index matches select condition in where clause:

1. A matching index scan - only have to access a limited number of contiguous leaf entries to access data

2. Predicate screening with matching index scan – use index entries to eliminate RIDs

3. Non-matching index scan – use index to identify RIDs

4. Index-only retrieval – don't access data, RIDs only

5. Multiple index retrieval – use >1 index to identify RIDs

Matching index scan

Definition of a matching index scan - only have to access contiguous leaf nodes

1) Single where clause and index matches

Create index Idx1 on T1 (C1)

Select * from T1
where C1=10

search B+-tree to leaf level for leftmost entry having specified values

useful for =, between

Matching Index Scan

2) If multiple where clauses and all '='

Select * from T1 where C1=10 and C2=5

i) if there is a composite index and select columns match all index columns, e.g.

Create index Idx2 on T1 (C1, C2)

only have to read contiguous leaf pages

ii) if there is a separate index for each clause, e.g.

Create index idx3 on T1(C1);
Create index idx4 on T1(C2);

must choose one or more of the indexes (later)

Matching Index Scan - Rules

A matching scan can be used ONLY IF one of the columns in select is the first column of index

Decide how many attributes to match in a composite index after the first column, so can read in a small contiguous range of leaf entries in B+-tree to get RIDs

• Match first column of composite index then:
– look at index columns from left to right
– Match ends when no predicate found
– If range (<=, like, between) for a column, match terminates thereafter
  • easier to scan all entries for range – process rest of entries using predicate screening
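The left-to-right matching rule can be written down as a small function. This is a sketch of the rule as stated on the slide, not any particular optimizer's implementation; the function name, the operator set and the dictionary representation of predicates are assumptions made for illustration.

```python
RANGE_OPS = {"<", "<=", ">", ">=", "between", "like"}


def split_predicates(index_columns, predicates):
    """Split predicate columns into matching columns and screening columns.

    index_columns: columns of the composite index, in index order.
    predicates: dict mapping column name -> operator in the where clause.
    """
    matched = []
    for column in index_columns:
        op = predicates.get(column)
        if op is None:          # match ends when no predicate is found
            break
        matched.append(column)
        if op in RANGE_OPS:     # a range predicate terminates the match
            break
    # Remaining index columns that have predicates are used for screening
    screened = [c for c in index_columns if c in predicates and c not in matched]
    return matched, screened


# Index on (C1, C2, C3, C4); where C1=10 and C2=3 and C4=20
print(split_predicates(["C1", "C2", "C3", "C4"], {"C1": "=", "C2": "=", "C4": "="}))
# Index on (C1, C2, C3); where C1=10 and C2 between 1 and 5 and C3='F'
print(split_predicates(["C1", "C2", "C3"], {"C1": "=", "C2": "between", "C3": "="}))
# Index on (C1, C2, C3); where C2=2 and C3='M' (no match on the first column)
print(split_predicates(["C1", "C2", "C3"], {"C2": "=", "C3": "="}))
```

An empty matched list corresponds to the case where only a non-matching index scan is possible.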

Matching Index Scan with Predicate screening

1) If select conditions match some index columns of composite index

Create index idx6 on T1(C1, C2, C3, C4);

Select * from T1
where C1=10 and C2=3 and C4=20

• Access contiguous leaf pages, but not all results on contiguous leaf pages

• Must examine index entries to determine if in the result - called predicate screening

Matching Index Scan with Predicate screening

Another example:

2) If all select conditions match composite index columns and some selects are a range

Create index idx7 on T1(C1, C2, C3);

Select * from T1
where C1=10 and C2 between 1 and 5 and C3 = 'F'

Advantages to Predicate screening

• discard RIDs based on values (for index)

• will access fewer tuples because RIDs used to eliminate potential tuples

Non-matching index scan

• Not always used by DBMSs

• attributes in where clause don't include initial attribute of index

Create index idx3 on T1(C1, C2, C3);

Select * from T1 where C2=2 and C3='M'

• Search leaf entries of index and compare values for entries

• must read in all index leaf pages to find C2, C3 values (so why do it?)
– 50 index pages vs 500,000 data pages

Index only retrieval

• Elements retrieved in select clause are attributes of composite index

• Don't need to access rows (actual data)

Create index idx5 on T1(C1, C3);

Select C1, C3 from T1 where C1=5 and C3 between 2 and 5

Select sum(C3) from T1

Multiple Index Access

• If conjunctive conditions (and) in where clause, can use >1 index
– Extract RIDs from each index satisfying matching predicate
– Intersect lists of RIDs (and them) from each index
– Final list - satisfies all predicates indexed

• If disjunctive conditions (or)
– Union the two lists of RIDs

Some Query optimizer rules for using RID-lists (then use list prefetch)

1. predicted active resulting RIDs must not be > 50% of RID pool

2. Limit to any single RID list the size of the RID memory pool (16M RIDs)

3. RID list cannot be generated by screening predicates

Rules for multiple index Access

Optimizer determines diminishing returns using multiple index access

1. List indexes with matching predicates in where clause

2. Place indexes in order by increasing filter factor

3. For successive indexes, extract RID list only if it reduces cost for the final rows returned, e.g. no sense reading 100's of pages of a new index to get number of rows to only 1 tuple

Example: Using RID lists with Multiple Indexes

Prospects table: 50M rows - 10 rows per page

Pages in table: 5,000,000

There are 4 indexes:
• age – 50 values (1000 entries per page)
• zipcode – 100,000 values (100 entries per page)
• hobby – 100 values (1000 entries per page)
• incomeclass – 10 values (1000 entries per page)

Problem cont’d

Select name, straddr from prospects
where zipcode between 02159 and 02658
and age = 40 and hobby = 'chess' and incomeclass = 10;

Compute FF (filter factor); make sure they are in ascending order:
• FF(zipcode) = 500/100,000 = 1/200
• FF(hobby) = 1/100
• FF(age) = 1/50
• FF(incomeclass) = 1/10

Problem cont’d

Data rows read if use indexes:
(1) 50,000,000/200 = 250,000
(1,2) 250,000/100 = 2500
(1,2,3) 2500/50 = 50
(1,2,3,4) 50/10 = 5

How much time will this take? Is it cost effective to use all of these indexes?

Problem cont’d I/O costs

• Cost:
– Random IO: RIO = 1/160 = .00625
– Sequential Prefetch: SP = 1/1600 = .000625
– List Prefetch: LP = 1/400 = .0025

• Note:
– Some textbooks assume if read <= 3 pages use RIO
– They also assume non-leaf nodes use RIO; we assume they are in memory, so it takes 0 disk access time

Problem cont’d

Table scan:

50M/10 per page * SP

Total time: 5,000,000 * 0.000625 = 3125

Using index 1: (100 entries per page)

data: 50M*FF*LP

250,000 * 0.0025 = 625

index:

(non-leaf pages * 0) + (#leaf entries * FF / entries per page) * SP

(3*0) + (50,000,000/200/100) * 0.000625 = 1.56

Total time: 1.56 + 625 = 626.56

Problem cont’d

Using indexes 1&2:

data: 250,000/100 * LP

2500 * 0.0025 = 6.25

index 2: (1000 entries per page)

(3*0) + (50,000,000/100/1000)* 0.000625 = 0.3125

To use both indexes: 1.56 + 0.3125 = 1.8725

Total time: 1.8725 + 6.25 = 8.1225

Problem cont’d

Using indexes 1,2,3:

data: 50 * 0.0025 = 0.125

index 3: (1000 entries per page)

(3*0) + (50,000,000/50/1000) * 0.000625 = 0.625

To use 3 indexes: 1.56 + 0.3125 + 0.625 = 2.4975

Total time: 2.4975 + 0.125 = 2.6225

Using indexes 1,2,3,4:

data: 5 * 0.0025 = 0.0125

index 4: (1000 entries per page)

(3*0) + (50,000,000/10/1000) * 0.000625 = 3.125

To use 4 indexes: 1.56 + 0.3125 + 0.625 + 3.125 = 5.6225

Total time: 5.6225 + 0.0125 = 5.635

Problem cont’d

Index used   Data rows   I/O cost      Index I/O cost                       Trade off if use index
None         50M         3125 sec
1            250,000     625 sec       1.56 sec                             Decrease 3125 to 625 sec with 1.56 additional sec
1,2          2500        6.25 sec      1.56 + 0.3125 sec                    Decrease 625 to 6.25 sec
1,2,3        50          0.125 sec     1.56 + 0.3125 + 0.625 sec            Decrease 6.25 to 0.125 sec
1,2,3,4      5           0.0125 sec    1.56 + 0.3125 + 0.625 + 3.125 sec    Decrease 0.125 to 0.0125 sec
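The diminishing returns can be seen by accumulating index and data costs as each index is added. This is a sketch under the slides' assumptions (non-leaf pages in memory, leaf pages read with sequential prefetch, one data page per surviving row read with list prefetch); the function name is illustrative. The slides round the first index cost to 1.56, so their totals differ from these in the third decimal place.

```python
SP = 1 / 1600   # sequential prefetch, seconds per page
LP = 1 / 400    # list prefetch, seconds per page
ROWS = 50_000_000

# (index, filter factor, index entries per leaf page), in ascending FF order
INDEXES = [
    ("zipcode", 1 / 200, 100),
    ("hobby", 1 / 100, 1000),
    ("age", 1 / 50, 1000),
    ("incomeclass", 1 / 10, 1000),
]


def plan_costs(rows, indexes):
    """Cumulative cost of using the first 1, 2, ... indexes in the list."""
    plans = []
    surviving = rows      # rows still qualifying after RID intersection
    index_cost = 0.0      # cumulative cost of scanning index leaf pages
    for name, ff, entries_per_page in indexes:
        leaf_pages = rows * ff / entries_per_page
        index_cost += leaf_pages * SP
        surviving *= ff
        data_cost = surviving * LP
        plans.append((name, surviving, index_cost, data_cost, index_cost + data_cost))
    return plans


plans = plan_costs(ROWS, INDEXES)
for name, surviving, index_cost, data_cost, total in plans:
    print(f"+{name:12s} rows={surviving:>10,.0f} index={index_cost:7.4f} "
          f"data={data_cost:9.4f} total={total:9.4f}")

best = min(plans, key=lambda plan: plan[4])
print("cheapest plan stops after adding:", best[0])
```

Adding the fourth index costs more in index I/O than it saves in data I/O, which is the point of the optimizer's diminishing-returns rule.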


Indexes and Information Retrieval

Some information on slides taken from CS245 – Stanford Univ.

Query: Get employees in (Toy Dept) ^ (2nd floor)

[Figure: a Dept. index entry for Toy and a Floor index entry for 2nd each point to records in EMP]

Intersect Toy RIDs and 2nd-floor RIDs to get the set of matching EMPs

This idea used in text information retrieval

Documents

...the cat is fat ...

...was raining cats and dogs...

...Fido the dog ...


Inverted lists

cat

dog

IR QUERIES

• Find articles with “cat” and “dog”

• Find articles with “cat” or “dog”

• Find articles with “cat” and not “dog”


• Find articles with “cat” in title

• Find articles with “cat” and “dog” within 5 words
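The Boolean queries map onto set operations over inverted lists. A minimal sketch using the three sample documents from the earlier slide; the `normalize` helper is a deliberately crude stand-in for real stemming (it only lowercases and strips a trailing "s"), and all names are illustrative.

```python
from collections import defaultdict

docs = {
    1: "the cat is fat",
    2: "was raining cats and dogs",
    3: "Fido the dog",
}


def normalize(word):
    """Crude normalization: lowercase and strip a plural 's' from longer words."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    return word


def build_index(docs):
    """Build inverted lists: term -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[normalize(word)].add(doc_id)
    return index


index = build_index(docs)
cat, dog = index["cat"], index["dog"]

print("cat and dog:    ", sorted(cat & dog))
print("cat or dog:     ", sorted(cat | dog))
print("cat and not dog:", sorted(cat - dog))
```

Queries such as "in title" or "within 5 words" need more than document ids in each posting, for example field or position information.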

IR – Web search problems

– Crawling and indexing share similar characteristics and requirements
  • Both are offline problems, no need for real-time
  • A few minutes' delay before content is searchable is tolerable
  • OK to run smaller-scale index updates frequently

– Querying is an online problem
  • Demands sub-second response time
  • Low latency, high throughput
  • Loads can vary greatly

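Architecture of IR Systems

[Figure: Documents (offline) and the Query (online) each pass through a Representation Function, giving a Document Representation and a Query Representation. The Document Representation feeds the Index; a Comparison Function matches the Query Representation against the Index and returns Hits.]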

How do we represent text?

• “Bag of words”
– Treat all the words in a document as index terms for that document
– Assign a “weight” to each term based on “importance”
– Disregard order, structure, meaning, etc. of the words
– Simple, yet effective!

• Assumptions
– Term occurrence is independent
– Document relevance is independent
– “Words” are well-defined

Stop Word List

• Words filtered out

• Common words

• Match on a common word is not as useful as a match on rare words

• Not one definite list

http://www.ranks.nl/resources/stopwords.html

Representing Documents

Document 1: The quick brown fox jumped over the lazy dog’s back.

Document 2: Now is the time for all good men to come to the aid of their party.

Stopword list: the, is, for, to, of

Term     Document 1   Document 2
aid      0            1
all      0            1
back     1            0
brown    1            0
come     0            1
dog      1            0
fox      1            0
good     0            1
jump     1            0
lazy     1            0
men      0            1
now      0            1
over     1            0
party    0            1
quick    1            0
their    0            1
time     0            1

Inverted Index

• Inverted indexing is fundamental to all IR models

• Consists of postings lists, one for each term in the collection

• Posting list – document id and payload
– Payload can be term frequency (number of times the term occurs in the document), position of occurrence, properties, etc.
– Can be ordered by document id, page rank, etc.
– Data structure necessary to map from document id to e.g. URL

Inverted Index

Term     Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid      0      0      0      1      0      0      0      1
all      0      1      0      1      0      1      0      0
back     1      0      1      0      0      0      1      0
brown    1      0      1      0      1      0      1      0
come     0      1      0      1      0      1      0      1
dog      0      0      1      0      1      0      0      0
fox      0      0      1      0      1      0      1      0
good     0      1      0      1      0      1      0      1
jump     0      0      1      0      0      0      0      0
lazy     1      0      1      0      1      0      1      0
men      0      1      0      1      0      0      0      1
now      0      1      0      0      0      1      0      1
over     1      0      1      0      1      0      1      1
party    0      0      0      0      0      1      0      1
quick    1      0      1      0      0      0      0      0
their    1      0      0      0      1      0      1      0
time     0      1      0      1      0      1      0      0

Term     Postings
aid      4 8
all      2 4 6
back     1 3 7
brown    1 3 5 7
come     2 4 6 8
dog      3 5
fox      3 5 7
good     2 4 6 8
jump     3
lazy     1 3 5 7
men      2 4 8
now      2 6 8
over     1 3 5 7 8
party    6 8
quick    1 3
their    1 5 7
time     2 4 6


Posting: an entry in an inverted list. Represents occurrence of a term in an article.

Size of a list (in postings):
– 1: rare words or misspellings
– 10^6: common words

Size of a posting: 10-15 bits (compressed)

Process query

• Given a query, fetch posting lists associated with query, traverse postings to compute result set

• Query document scores must be computed

• Partial scores stored in accumulators

• Top k documents extracted

• Optimization strategies to reduce # postings that must be examined
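Query processing with accumulators can be sketched as follows. The postings and their term-frequency payloads are made up, and summing raw term frequencies is only a placeholder for a real scoring function; the names are illustrative.

```python
import heapq
from collections import defaultdict

# Hypothetical postings lists: term -> [(doc_id, term_frequency), ...]
postings = {
    "cat": [(1, 2), (2, 1), (5, 3)],
    "dog": [(2, 2), (3, 1), (5, 1)],
}


def top_k(query_terms, postings, k):
    """Traverse the postings of each query term and return the k best documents."""
    accumulators = defaultdict(int)   # doc_id -> partial score
    for term in query_terms:
        for doc_id, tf in postings.get(term, []):
            accumulators[doc_id] += tf
    # Highest score first; ties broken by lower document id
    return heapq.nlargest(k, accumulators.items(),
                          key=lambda item: (item[1], -item[0]))


print(top_k(["cat", "dog"], postings, k=2))
```

Using a heap to extract the top k avoids sorting every accumulator when only a few results are needed.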

Indexing: Performance Analysis

• The indexing problem
– Must be relatively fast, but need not be real time
– For Web, incremental updates are important

• How large is the inverted index?
– Size of vocabulary
– Size of postings

• Fundamentally, a large sorting problem
– Terms usually fit in memory
– Postings usually don’t

Index

• Size of index depends on payload

• Well-optimized inverted index can be 1/10 of size of original document collection

• If store position info, could be several times larger

• Usually can hold entire vocabulary in memory (using front-coding)

• Postings lists usually too large to store in memory

• Query evaluation involves random disk access and decoding postings
– Try to minimize random seeks
