Select Operation- disk access and Indexing *Some info on slides from Dr. S. Son, U. Va


Page 1

Select Operation- disk access

and Indexing

*Some info on slides from Dr. S. Son, U. Va

Page 2

Disk access

• DBs traditionally stored on disk

• Cheaper to store on disk than in memory

• Costs for: seek time, latency, data transfer time

•  Disk access is page (block) oriented

• 2 - 4 KB page size

Page 3

Access time

• Access time is the time to randomly access a page

• System initially determines if page in memory buffer (page tables, etc.)

• Large disparity between disk access and memory access

Page 4

Select operation using table scan

• Reading the entire table to answer a select is a table scan

• Improvements to table scan of disk:
– Parallel access
– Sequential prefetch

Page 6

Parallel access

• Linear search - all data rows read in from disk
– I/O parallelism can be used (RAID)

• Multiple I/O read requests satisfied at the same time

• Stripe the data across different disks

– Problems with parallelism?
• Must balance disk arm load to gain maximum parallelism
• Requires the same total number of random I/Os, but uses each device for a shorter time

Page 7

Sequential prefetch I/O

• Retrieve one disk page after another (on the same track)
– 32 pages at a time in DB2; varies in Oracle

• Seek time is no longer a problem

• Must know in advance to read 32 successive pages

• Speeds up I/O by close to a factor of 10 (about 500 I/Os per second vs. 70, roughly 7x)

Page 8

Access time

• Seek time – as low as 4 ms on server disks

• Latency time – as low as 1 ms or less

• Data transfer time – 0.4 to 2 ms

• Solid state disks – up to 100,000 I/Os per second, but still expensive

Page 9

Access time for fast I/O

                                 RIO       Seq. Prefetch
Seek (disk arm to cylinder)      .004      .004
Latency (platter to sector)      .001      .001
Data transfer (page)             .0005     .016
Total                            .0055     .021            (1 page vs. 32 pages)

Time for 32 pages                .176* s   .021 s

* .0055 x 32 = .176 seconds for 32 pages of RIO vs. .021 seconds for 32 pages of Seq. Prefetch
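The arithmetic in this table can be reproduced with a short sketch. The constant and function names are illustrative only; the timing values are the ones listed on the slide, and the model assumes sequential prefetch pays for one seek and one rotational delay per 32-page request.

```python
# Timing components in seconds, taken from the slide's table.
SEEK = 0.004       # move the disk arm to the cylinder
LATENCY = 0.001    # rotate the platter to the sector
TRANSFER = 0.0005  # transfer one page


def random_io_time(pages: int) -> float:
    """Random I/O: every page pays seek + latency + transfer."""
    return pages * (SEEK + LATENCY + TRANSFER)


def sequential_prefetch_time(pages: int) -> float:
    """Sequential prefetch: one seek and one latency, then pages stream in."""
    return SEEK + LATENCY + pages * TRANSFER


rio = random_io_time(32)
sp = sequential_prefetch_time(32)
print(f"RIO: {rio:.3f} s  SP: {sp:.3f} s  ratio: {rio / sp:.1f}x")
```

Under these numbers the 32-page read is a little over 8 times faster with sequential prefetch.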

Page 10

Organizing disk space

• How should data be stored to minimize access time when reading the entire table?

Page 11

Disk allocation

• Disk Resource Allocation for Databases (DBA has control)

• Goal – contiguous sectors on disk - want data as close together as possible  to minimize seek time

• No standard SQL approach, but general way to deal with allocation

• Some OS allow specification of size of file and disk device

Page 12

Types of Files

• Heap files (unordered – sequential)
• Sorted files (ordered – sort key)
• Hash files (hash key, hash function)
• B+-trees

• Storage Area Networks (SAN)
– Used for ERP (enterprise resource planning) and DW (data warehouses)
– Storage devices configured as nodes in a network; can attach/detach

Page 13

Tablespace

Tablespace is:
• Allocation medium for tables and indexes for ORACLE, DB2, etc.
• Can put >1 table in a tablespace if accessed together
• Tablespace corresponds to 1 or more OS files and can span disk devices
• Usually relations cannot span disk devices

Page 14

DB storage structures

[Figure: storage hierarchy of the Company database]
Database: Company
Tablespaces: tspace 1, system
OS files: fname1, fname2, fname3
Tables: Empl, Dept, Proj, Dep, EmpIndx
Segments: data, data, data, data, index
Extents

Page 15

Tablespace

• ORACLE DBs contain several tablespaces, including one called system
– holds data description + indexes + user-defined tables
• A default tablespace is given to each user
• Multiple tablespaces give better control over load balancing
• Can take some disk space off-line

Page 16

Extent

• Relation composed of 1 or more extents
• Extent - contiguous storage on disk
• When a data segment or index segment is first created, it is given an initial extent from the tablespace, e.g. 10 KB (5 pages)
• If more space is needed, it is given the next contiguous extent

Page 18

Extent

• Can increase the size by a positive % (cannot decrease)
– initial n - size of initial extent
– next n - size of next extent
– max extents - maximum number of extents
– min extents - number of extents initially allocated
– pct increase n - % by which next extent grows over previous one
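As a rough illustration of how these storage parameters interact, the sketch below lists successive extent sizes. It assumes the first extent uses initial, the second uses next, and each later extent grows by pct increase over the previous one. The function name and the sample values (10 KB, 50%) are hypothetical, and a real DBMS typically rounds sizes to whole blocks.

```python
def extent_sizes(initial_kb: float, next_kb: float,
                 pct_increase: float, count: int) -> list[float]:
    """Sizes in KB of the first `count` extents (simplified model, no rounding)."""
    sizes = []
    for i in range(count):
        if i == 0:
            sizes.append(float(initial_kb))   # initial extent
        else:
            # second extent is `next`; later ones grow by pct_increase each time
            sizes.append(next_kb * (1 + pct_increase / 100) ** (i - 1))
    return sizes


print(extent_sizes(initial_kb=10, next_kb=10, pct_increase=50, count=4))
```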

Page 19

Oracle create tablespace

• http://www.adp-gmbh.ch/ora/sql/create_tablespace.html

Page 20

Create table

• Create table statement - can specify tablespace, no. of extents
– When initial extent is full, a new extent is allocated
– pctfree - determines how much space in a page is held back from inserts of new rows
• if pctfree = 10%, inserts stop when page is 90% full
» Uses another page
– pctused – determines when new inserts start again
• inserts resume when used space falls below this percentage of the page; default pctused = 40%
• pctfree + pctused < 100

Page 21

Rows

• Row layout on each disk page

| Header info | Row directory: 1 2 3 … N | Free space | Data rows: Row N, Row N-1, …, Row 1 |

• Header info
• Row directory – row number and page byte offset
– Row number is the row number in the page, also called slot#
– Page byte offset – needed because with varchar, row size is not constant

• To identify a particular row use RID (RowID) – page #, slot # [file#]
– slot# is the number in the row directory (logical #)
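The row directory idea can be sketched with a toy page class. This is a simplification, not any particular DBMS layout: the class and method names are hypothetical, the directory is kept as a Python list rather than inside the page bytes, and header and directory space are not counted against the page size.

```python
class SlottedPage:
    """Toy page: data rows are packed from the end of the page toward the front."""

    def __init__(self, page_no: int, size: int = 2048):
        self.page_no = page_no
        self.data = bytearray(size)
        self.directory = []      # index slot# - 1 -> (byte offset, row length)
        self.free_end = size     # first byte already used by data rows

    def insert(self, row: bytes) -> tuple[int, int]:
        if len(row) > self.free_end:
            raise ValueError("page full")
        self.free_end -= len(row)
        self.data[self.free_end:self.free_end + len(row)] = row
        self.directory.append((self.free_end, len(row)))
        return (self.page_no, len(self.directory))   # RID = (page #, slot #)

    def fetch(self, rid: tuple[int, int]) -> bytes:
        page_no, slot = rid
        if page_no != self.page_no:
            raise KeyError("RID refers to a different page")
        offset, length = self.directory[slot - 1]
        return bytes(self.data[offset:offset + length])


page = SlottedPage(page_no=7)
rid_a = page.insert(b"Smith,Research")
rid_b = page.insert(b"Wong,Administration,Houston")
print(rid_a, page.fetch(rid_a))
print(rid_b, page.fetch(rid_b))
```

Because a RID names a slot rather than a byte offset, a variable-length row can move within its page without invalidating index entries that point to it.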

Page 22

Differences in DBMSs re: rows

• ROWID can be retrieved in ORACLE but not DB2 (violates relational model rule)

• ORACLE
– Rows can be split between pages (row record fragmentation)
– Can have rows from multiple tables on same page, more info

• DB2
– No splitting; entire row moved to new page, need forwarding pointer

Page 23

Select operation using Indexes

• Alternative to table scan

Page 24

Why use an index?

• If a select (or join) is used on the same attribute frequently

• Want a way to improve performance - use indexes
– For example:

Select * from Employee
where ssn = 333445555

Page 25

B+-tree

• Most commonly used index structure type in DBs today
• Based on B-tree
• Good for equality and range searches
• B+ tree: dynamic, adjusts gracefully under inserts and deletes
• Used to minimize disk I/O
• Available in DB2; ORACLE also has hash cluster; Ingres has heap structure, B-tree, ISAM (chain together new nodes)

Page 26

Structure of B+ Trees

• Leaf level holds pointers to data (RIDs)

• The remaining nodes are directory (index) nodes that point to other index nodes

[Figure: index entries at the upper levels (direct search); data entries at the leaf level ("sequence set")]

Page 27

Example of B+Tree

Index node: 10 20 40

Leaf level: 1 2 3 | 10 12 | 20 35 | 40 42 50

Leaf entries point to data

Page 28

Characteristics of B+ Tree

• Order of tree (fan out) – max number of child nodes

• Minimum 50% occupancy (except for root). Each node contains m entries, where d/2 <= m <= d-1
– The parameter d is the order of the tree

• Insert/delete at log_F(N) cost; keep tree height-balanced (F = fanout, N = # leaf pages)

• Supports equality and range-searches efficiently

Page 29

Cost of I/O for B+-tree

• One index node is one page
• If tree has a depth of 3, 3 I/Os to get pointer to data
• An index node that has been read in can remain in memory
– likely, since upper-level nodes of actively used B+-trees are accessed frequently

Page 30

B+ Trees in Practice

• Typical order: between 100-200 children
• Typical fill-factor: 2/3 full (66.6%)
– average fanout = 133 (if 200 children)

• Typical capacities:
– Height 4: 133^4 = 312,900,721 records
– Height 3: 133^3 = 2,352,637 records

• Can often hold top levels in buffer pool:
– Level 1 = 1 page = 8 Kbytes
– Level 2 = 133 pages ≈ 1 Mbyte
– Level 3 = 17,689 pages ≈ 138 MBytes
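These capacity figures follow directly from the fanout, as the sketch below shows. It assumes the slide's numbers (order 200, 2/3 fill factor, 8 KB pages); the names are illustrative.

```python
ORDER = 200               # maximum number of children per node
FANOUT = ORDER * 2 // 3   # 2/3 fill factor gives 133 children on average
PAGE_KB = 8               # one index node per 8 KB page


def records_reachable(height: int) -> int:
    """Number of entries reachable through a tree of the given height."""
    return FANOUT ** height


def pages_at_level(level: int) -> int:
    """Number of index pages at a level, with the root as level 1."""
    return FANOUT ** (level - 1)


for height in (3, 4):
    print(f"height {height}: {records_reachable(height):,} records")
for level in (1, 2, 3):
    pages = pages_at_level(level)
    print(f"level {level}: {pages:,} pages, {pages * PAGE_KB / 1024:.1f} MB")
```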

Page 31

Why B+-tree

• Directory structure - retrieve range of values efficiently
– search for leftmost index entry Si such that X <= Si

• Index entries always in sequence by value - can use sequential prefetch on index

• Index entries shorter than data rows - less I/O

Page 32

B+-tree

• Balancing of B+-trees - insert, delete

• Nodes usually not full

• Utilities to reorganize to lower disk I/O

• Most systems allow nodes to become depopulated- no automatic algorithm to balance

• Average node below root level 71% full in active growing B+-trees

Page 33

Duplicate key values

• Duplicate key values in index
• Leaf nodes have sibling pointers
• But a delete of a row that has a heavily duplicated key entails a long search through the leaf level of the B+-tree

• Index compression - with multiple duplicates

| header info | PrX keyval RID RID ... RID | PrX keyval RID ... RID |

where PrX is count of RID values

Page 34

Create Index

Options:
• multiple columns
• tablespace
• storage - initial extents, etc.
• percent free - % of each page left unfilled at creation (default = 10)
• free page - 1 free page for every n index pages during creation

Page 35

Types of indexes (textbook)

• Primary index - key field is a candidate key (must be unique)
– data file ordered by key field

• Clustering index - key field is not unique, data file is ordered
– all records with same values on same pages

• Secondary index - non-clustering index
– data file not ordered

– First record in the data page (or block) is called the anchor record
• Non-dense index - pointer in index entry points to anchor
• Dense index - pointer to every record in the file

Page 36

Clustering

• Efficiency advantage: read in a page, get all of the rows with the same value
• Clustering is useful for range queries, e.g. between keyval1 and keyval2

Page 37

Clustering

• Can only cluster table by 1 clustering index at a time

• In SQL Server
– creates clustered index on PK automatically if no other clustered index on table and PK nonclustered index not specified

• In DB2
– if the table is empty, rows sorted as placed on disk
– subsequent insertions not clustered, must use REORG

• In Oracle
– Cluster index – now available for PK in 10g
– Define a cluster to create cluster index for 2 tables

Page 39

Indexes vs. table scan

• To illustrate the difference between table scan, secondary index (non-clustered) and clustered index

Assume 10M customers, 200 cities, 2 KB/page, row = 100 bytes, 20 rows/page

Select *
From Customers
Where city = 'Birmingham'

If we assume selectivity = 1/200: 1/200 * 10M = 50,000 customers in a city

Page 40

Rules of Thumb for I/O

• Assume slightly slower times than before:
– Random I/O – 160 pages/second, .00625 seconds per page
– Sequential prefetch I/O – 1600 pages/second, .000625 seconds per page

Will discuss later:
– List prefetch I/O – 400 pages/second, .0025 seconds per page

Page 41

Table Scan

Table Scan - read entire table

If we used random I/O (RIO) – WHICH ONE WOULD NEVER DO

10,000,000/20 = 500,000 pages
500,000 * RIO = 3125 seconds

Instead, it makes more sense to use sequential prefetch (SP), reading 32 pages at a time

500,000 * SP = 312.5 seconds

Page 42

Clustering Index

• All entries for B'ham clustered on same pages

• 50,000/20 = 2500 data pages (with 20 rows per page)

• Assume 3 upper nodes of the tree

• Assume 1000 index entries per leaf node, read 50,000/1000 = 50 index pages

3 + 50,000/1000 + 50,000/20 = 2,553 pages to access

• If top 3 levels of tree in memory, count access time as 0

• Access time:

(3*0) + (50*SP) + (2500*SP) = 2,550 * .000625 ≈ 1.6 seconds

Page 43

Secondary Index

• In the worst case 1 entry for B'ham per page

• 50,000 data pages (10M/200)

3 + 50 + 50,000 = 50,053 page accesses

(3*0) + (50*SP) + (50,000*RIO) ≈ 312.5 seconds access time

REALLY slow – see next slide for a better solution!

Use List Prefetch instead of RIO

Page 44

List Prefetch – Better solution

Create list of data pages to access

Pages not necessarily in contiguous sequential order

System orders pages to minimize disk I/O

E.g. elevator algorithm for disk request scheduling

Using list prefetch (LP)

(3*0) + (50*SP) + (50,000*LP) = 125.03 seconds access time
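The access-time estimates for the Birmingham query can be checked side by side with a short sketch. It uses the slides' rules of thumb and assumptions (upper index levels in memory at zero cost, worst case of one matching row per data page for the secondary index); the variable names are illustrative.

```python
# Rules of thumb from the slides: seconds per page.
RIO = 1 / 160    # random I/O
SP = 1 / 1600    # sequential prefetch
LP = 1 / 400     # list prefetch

ROWS = 10_000_000
ROWS_PER_PAGE = 20
CITIES = 200
ENTRIES_PER_LEAF = 1000

matching_rows = ROWS // CITIES                  # 50,000 customers in the city
table_pages = ROWS // ROWS_PER_PAGE             # 500,000 pages
leaf_pages = matching_rows // ENTRIES_PER_LEAF  # 50 index leaf pages
clustered_data_pages = matching_rows // ROWS_PER_PAGE  # 2,500 pages

costs = {
    "table scan, random I/O": table_pages * RIO,
    "table scan, sequential prefetch": table_pages * SP,
    "clustering index": (leaf_pages + clustered_data_pages) * SP,
    "secondary index, random I/O": leaf_pages * SP + matching_rows * RIO,
    "secondary index, list prefetch": leaf_pages * SP + matching_rows * LP,
}

for plan, seconds in costs.items():
    print(f"{plan:35s} {seconds:10.2f} s")
```

Note that under these assumptions a secondary index read with random I/O is no faster than a full table scan with sequential prefetch.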

Page 45

% Free

• Redo the previous calculations assuming relations created with 50% free option specified.

Page 46

Creating Indexes

• When determining what indexes to create consider:
– workload - mix of queries and frequencies of requests
• 20% of requests are updates, etc.
– can create lots of indexes but:
• cost to create
• insertions
• initial load time high if a large table
• index entries can become longer and longer as multiple columns included

Page 47

Multiple Indexes

• More than one index on a relation
– e.g. age - one index, class - one index, gender - one index

Page 49

Composite Index

• One index based on more than one attribute

Create Index index_name on Table (col1, col2, ... coln)

• Composite index entry - values for each attribute
– e.g. for age, class, gender the entry in the index is: C1, C2, C3, RID

Page 50

Using Indexes

• System must decide whether to use an index

• What if more than one index, which one?

• What if composite index?

Page 51

Plans using Indexes

Can use an index if index matches select condition in where clause:

1. A matching index scan - only have to access a limited number of contiguous leaf entries to access data

2. Predicate screening with matching index scan – use index entries to eliminate RIDs

3. Non-matching index scan – use index to identify RIDs

4. Index-only retrieval – don’t access data, RIDs only

5. Multiple index retrieval – use >1 index to identify RIDs

Page 52

Matching index scan

Definition of a matching index scan - only have to access contiguous leaf nodes

1) Single where clause and index matches

Create index Idx1 on T1 (C1)

Select * from T1
where C1 = 10

Search B+-tree to leaf level for leftmost entry having specified values

Useful for =, between

Page 53

Matching Index Scan

2) If multiple where clauses and all '='

Select * from T1 where C1=10 and C2=5

i) if there is a composite index and select columns match all index columns, e.g.

Create index Idx2 on T1 (C1, C2)

only have to read contiguous leaf pages

ii) if there is a separate index for each clause, e.g.

Create index idx3 on T1(C1);
Create index idx4 on T1(C2);

must choose one or more of the indexes (later)

Page 54

Matching Index Scan - Rules

A matching scan can be used ONLY IF one of the columns in the select is the first column of the index

Decide how many attributes to match in a composite index after the first column, so can read in a small contiguous range of leaf entries in B+-tree to get RIDs

• Match first column of composite index then:
– look at index columns from left to right
– Match ends when no predicate found
– If range (<=, like, between) for a column, match terminates thereafter
• easier to scan all entries for range
– process rest of entries using predicate screening
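The left-to-right matching rule above can be sketched as a small function. This is only an illustration of the rule as stated on the slide, not any optimizer's actual algorithm; the function name, the operator set, and the dictionary representation of the where clause are assumptions.

```python
RANGE_OPS = {"<", "<=", ">", ">=", "between", "like"}


def plan_composite_index(index_columns, predicates):
    """Split predicate columns into matching and screening columns.

    index_columns: column names in index order, e.g. ["C1", "C2", "C3"]
    predicates:    dict mapping column name -> operator in the where clause
    """
    matching = []
    for column in index_columns:             # look at index columns left to right
        operator = predicates.get(column)
        if operator is None:                 # match ends when no predicate found
            break
        matching.append(column)
        if operator in RANGE_OPS:            # a range predicate ends the match
            break
    screening = [c for c in index_columns
                 if c in predicates and c not in matching]
    return matching, screening


# Index (C1, C2, C3, C4) with C1 = .. and C2 = .. and C4 = ..
print(plan_composite_index(["C1", "C2", "C3", "C4"],
                           {"C1": "=", "C2": "=", "C4": "="}))
# Index (C1, C2, C3) with C1 = .. and C2 between .. and C3 = ..
print(plan_composite_index(["C1", "C2", "C3"],
                           {"C1": "=", "C2": "between", "C3": "="}))
# Index (C1, C2, C3) with predicates only on C2 and C3: non-matching scan
print(plan_composite_index(["C1", "C2", "C3"], {"C2": "=", "C3": "="}))
```

An empty matching list corresponds to a non-matching index scan: the index can still be used, but every leaf page has to be read and screened.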

Page 55

Matching Index Scan with Predicate screening

1) If select conditions match some index columns of composite index

Create index idx6 on T1(C1, C2, C3, C4);

Select * from T1
where C1=10 and C2=3 and C4=20

• Access contiguous leaf pages, but not all results on contiguous leaf pages

• Must examine index entries to determine if in the result - called predicate screening

Page 56

Matching Index Scan with Predicate screening

Another example:

2) If all select conditions match composite index columns and some selects are a range

Create index idx7 on T1(C1, C2, C3);

Select * from T1
where C1=10 and C2 between 1 and 5 and C3 = 'F'

Page 57

Advantages to Predicate screening

• Discard RIDs based on values in the index entries
• Will access fewer tuples because index values are used to eliminate potential tuples

Page 58

Non-matching index scan

• Not always used by DBMSs
• Attributes in where clause don't include initial attribute of index

Create index idx3 on T1(C1, C2, C3);

Select * from T1 where C2=2 and C3='M'

• Search leaf entries of index and compare values for entries
• Must read in all index leaf pages to find C2, C3 values (so why do it?)
– 50 index pages vs 500,000 data pages

Page 59

Index only retrieval

• Elements retrieved in select clause are attributes of composite index

• Don't need to access rows (actual data)

Create index idx5 on T1(C1, C3);

Select C1, C3 from T1 where C1=5 and C3 between 2 and 5

Select sum(C3) from T1

Page 60

Multiple Index Access

• If conjunctive conditions (and) in where clause, can use >1 index
– Extract RIDs from each index satisfying matching predicate
– Intersect lists of RIDs (and them) from each index
– Final list - satisfies all predicates indexed

• If disjunctive conditions (or)
– Union the two lists of RIDs
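A minimal sketch of combining RID lists is shown below. RIDs are modelled as (page #, slot #) tuples and the two lists are made-up sample data; sorting the result by page number is what would let the data pages be fetched in one sweep with list prefetch.

```python
def rids_and(*rid_lists):
    """RIDs present in every list (conjunctive predicates), sorted by page."""
    return sorted(set.intersection(*map(set, rid_lists)))


def rids_or(*rid_lists):
    """RIDs present in at least one list (disjunctive predicates), sorted by page."""
    return sorted(set.union(*map(set, rid_lists)))


# Hypothetical RID lists extracted from two different indexes.
from_index_a = [(12, 3), (4, 1), (9, 2), (4, 7)]
from_index_b = [(9, 2), (30, 1), (4, 1)]

print(rids_and(from_index_a, from_index_b))
print(rids_or(from_index_a, from_index_b))
```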

Page 61

Some Query optimizer rules for using RID-lists (then use list prefetch)

1.  Predicted number of active resulting RIDs must not be > 50% of RID pool

2.  No single RID list may exceed the size of the RID memory pool (16M RIDs)

3.  RID list cannot be generated by screening predicates

Page 62

Rules for multiple index Access

Optimizer determines diminishing returns using multiple index access

1.  List indexes with matching predicates in where clause

2.  Place indexes in order by increasing filter factor

3.  For successive indexes, extract RID list only if it reduces the cost of returning the final rows
– e.g. no sense reading 100's of pages of a new index to get number of rows to only 1 tuple

Page 63

Example: Using RID lists with Multiple Indexes

Prospects table: 50M rows, 10 rows per page
Pages in table: 5,000,000

There are 4 indexes:
• age – 50 values (1000 entries per page)
• zipcode – 100,000 values (100 entries per page)
• hobby – 100 values (1000 entries per page)
• incomeclass – 10 values (1000 entries per page)

Page 64

Problem cont’d

Select name, straddr from prospects
where zipcode between 02159 and 02658
and age = 40 and hobby = 'chess' and incomeclass = 10;

Compute FF (filter factor) for each predicate and list in ascending order:
• FF(zipcode) = 500/100,000 = 1/200
• FF(hobby) = 1/100
• FF(age) = 1/50
• FF(incomeclass) = 1/10

Page 65

Problem cont’d

Data rows read if use indexes:
(1) 50,000,000/200 = 250,000
(1,2) 250,000/100 = 2500
(1,2,3) 2500/50 = 50
(1,2,3,4) 50/10 = 5

How much time will this take? Is it cost effective to use all of these indexes?

Page 66

Problem cont’d I/O costs

• Cost:
– Random I/O: RIO = 1/160 = .00625
– Sequential Prefetch: SP = 1/1600 = .000625
– List Prefetch: LP = 1/400 = .0025

• Note:
– Some textbooks assume if read <= 3 pages use RIO
– They also assume non-leaf nodes use RIO; we assume they are in memory, so it takes 0 disk access time

Page 67

Problem cont’d

Table scan:

50M/10 per page * SP

Total time: 5,000,000 * 0.000625 = 3125

Using index 1: (100 entries per page)

data: 50M * FF * LP

250,000 * 0.0025 = 625

index:

(non-leaf pages * 0) + (# leaf entries * FF / entries per page) * SP

(3*0) + (50,000,000/200/100) * 0.000625 = 1.56

Total time: 1.56 + 625 = 626.56

Page 68

Problem cont’d

Using indexes 1&2:

data: 250,000/100 * LP

2500 * 0.0025 = 6.25

index 2: (1000 entries per page)

(3*0) + (50,000,000/100/1000)* 0.000625 = 0.3125

To use both indexes: 1.56 + 0.3125 = 1.8725

Total time: 1.8725 + 6.25 = 8.1225

Page 69

Problem cont’d

Using indexes 1,2,3:

data: 50 * 0.0025 = 0.125

index 3: (1000 entries per page)

(3*0) + (50,000,000/50/1000) * .000625= .625

To use 3 indexes: 1.56 + 0.3125 + 0.625 = 2.4975

Total time: 2.4975 + 0.125 = 2.6225

Using indexes 1,2,3,4:

data: 5 * 0.0025 = 0.0125

index 4: (1000 entries per page)

(3*0)+ (50,000,000/10/1000)*.000625 = 3.125

To use 4 indexes: 1.56+0.3125+0.625+3.125=5.6225

Total time: 5.6225 + 0.0125 = 5.635

Page 70

Problem cont’d

Index used   Data rows   Data I/O cost   Index I/O cost                      Trade-off if use index
None         50M         3125 sec
1            250,000     625 sec         1.56 sec                            Decrease 3125 to 625 sec with 1.56 additional sec
1,2          2500        6.25 sec        1.56 + 0.3125 sec                   Decrease 625 to 6.25 sec with 0.3125 additional sec
1,2,3        50          0.125 sec       1.56 + 0.3125 + 0.625 sec           Decrease 6.25 to 0.125 sec with 0.625 additional sec
1,2,3,4      5           0.0125 sec      1.56 + 0.3125 + 0.625 + 3.125 sec   Decrease 0.125 to 0.0125 sec with 3.125 additional sec
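The whole worked problem can be recomputed with a short sketch, which makes the diminishing returns easy to see. It uses the slides' assumptions (non-leaf index pages in memory, index leaves read with sequential prefetch, data pages read with list prefetch, one data page per qualifying row); the names are illustrative. Totals differ slightly from the slides because the slides round 1.5625 to 1.56.

```python
SP = 1 / 1600   # sequential prefetch, seconds per page
LP = 1 / 400    # list prefetch, seconds per page

ROWS = 50_000_000
ROWS_PER_PAGE = 10

# (index name, 1 / filter factor, entries per leaf page), by increasing filter factor
INDEXES = [
    ("zipcode", 200, 100),
    ("hobby", 100, 1000),
    ("age", 50, 1000),
    ("incomeclass", 10, 1000),
]


def plan_costs():
    """Return (indexes used, data rows, index I/O seconds, data I/O seconds) per plan."""
    plans = [("none", ROWS, 0.0, ROWS // ROWS_PER_PAGE * SP)]   # table scan
    rows = ROWS
    index_cost = 0.0
    used = []
    for name, inverse_ff, entries_per_page in INDEXES:
        leaf_pages = ROWS // inverse_ff // entries_per_page
        index_cost += leaf_pages * SP      # read the qualifying leaf pages
        rows //= inverse_ff                # rows left after intersecting RID lists
        used.append(name)
        plans.append((", ".join(used), rows, index_cost, rows * LP))
    return plans


plans = plan_costs()
for label, rows, index_cost, data_cost in plans:
    print(f"{label:32s} rows={rows:>10,} index={index_cost:7.4f} s "
          f"data={data_cost:9.4f} s total={index_cost + data_cost:9.4f} s")

best = min(plans, key=lambda plan: plan[2] + plan[3])
print("cheapest plan uses:", best[0])
```

Under these assumptions the fourth index costs more to read than it saves, so stopping after three indexes is the cheapest plan.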

Page 71

Indexes and Information Retrieval

Some information on slides taken from CS245 – Stanford Univ.

Page 72

Query: Get employees in (Toy Dept) ^ (2nd floor)

[Figure: a Dept. index and a Floor index both point into the EMP table; the Toy entry and the 2nd entry each lead to a list of RIDs]

Intersect Toy RIDs and 2nd floor RIDs to get set of matching EMPs

Page 74

This idea used in text information retrieval

Documents

...the cat is fat ...

...was raining cats and dogs...

...Fido the dog ...

Inverted lists

cat

dog

Page 76

IR QUERIES

• Find articles with “cat” and “dog”

• Find articles with “cat” or “dog”

• Find articles with “cat” and not “dog”

• Find articles with “cat” in title

• Find articles with “cat” and “dog” within 5 words
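Most of these query types map onto set operations over inverted lists, as the sketch below suggests using the three example documents from the earlier slide. The NORMALIZE table stands in for real stemming and is an assumption, as are the function names; word positions are stored so that the "within 5 words" query can be answered. Title search would need an extra field in each posting and is not shown.

```python
from collections import defaultdict

DOCS = {
    1: "the cat is fat",
    2: "was raining cats and dogs",
    3: "Fido the dog",
}
NORMALIZE = {"cats": "cat", "dogs": "dog"}   # hypothetical stand-in for stemming

# term -> {doc id: [word positions]}
index = defaultdict(dict)
for doc_id, text in DOCS.items():
    for position, word in enumerate(text.lower().split()):
        term = NORMALIZE.get(word, word)
        index[term].setdefault(doc_id, []).append(position)


def docs(term):
    """Set of document ids whose inverted list contains the term."""
    return set(index.get(term, {}))


def within(term_a, term_b, k):
    """Documents where the two terms occur at most k words apart."""
    hits = set()
    for doc_id in docs(term_a) & docs(term_b):
        if any(abs(p - q) <= k
               for p in index[term_a][doc_id]
               for q in index[term_b][doc_id]):
            hits.add(doc_id)
    return hits


print("cat AND dog:    ", docs("cat") & docs("dog"))
print("cat OR dog:     ", docs("cat") | docs("dog"))
print("cat AND NOT dog:", docs("cat") - docs("dog"))
print("within 5 words: ", within("cat", "dog", 5))
```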

Page 77

IR – Web search problems

– Crawling and indexing share similar characteristics and requirements
• Both are offline problems, no need for real-time
• Tolerable for a few minutes delay before content searchable
• OK to run smaller-scale index updates frequently

– Querying is an online problem
• Demands sub-second response time
• Low latency, high throughput
• Loads can vary greatly

Page 78

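Architecture of IR Systems

[Figure: offline, documents pass through a representation function to give the document representation, which is stored in the index; online, the query passes through a representation function to give the query representation; a comparison function matches the query representation against the index and returns hits]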

Page 79

How do we represent text?

• “Bag of words”
– Treat all the words in a document as index terms for that document
– Assign a “weight” to each term based on “importance”
– Disregard order, structure, meaning, etc. of the words
– Simple, yet effective!

• Assumptions
– Term occurrence is independent
– Document relevance is independent
– “Words” are well-defined

Page 80

Stop Word List

• Words filtered out

• Common words

• Match on common words not as useful as match on rare words...

• Not one definite list

Page 81

Representing Documents

Document 1: The quick brown fox jumped over the lazy dog’s back.

Document 2: Now is the time for all good men to come to the aid of their party.

Stopword list: the, is, for, to, of

Term     Document 1   Document 2
aid      0            1
all      0            1
back     1            0
brown    1            0
come     0            1
dog      1            0
fox      1            0
good     0            1
jump     1            0
lazy     1            0
men      0            1
now      0            1
over     1            0
party    0            1
quick    1            0
their    0            1
time     0            1

Page 82

Inverted Index

• Inverted indexing is fundamental to all IR models
• Consists of postings lists, one for each term in the collection
• Posting list – document id and payload
– Payload can be term frequency (number of times the term occurs in the document), position of occurrence, properties, etc.
– Can be ordered by document id, page rank, etc.
– Data structure necessary to map from document id to e.g. URL
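As a small sketch of this structure, the code below builds postings lists with term frequency as the payload from the two example documents and the stopword list shown earlier. The STEMS table is a hypothetical stand-in for a real stemmer, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter, defaultdict

DOCS = {
    1: "The quick brown fox jumped over the lazy dog's back.",
    2: "Now is the time for all good men to come to the aid of their party.",
}
STOPWORDS = {"the", "is", "for", "to", "of"}
STEMS = {"jumped": "jump", "dog's": "dog"}   # hypothetical stand-in for stemming


def terms(text):
    """Lowercase, tokenize, drop stopwords, and apply the toy stem table."""
    words = re.findall(r"[a-z']+", text.lower())
    return [STEMS.get(w, w) for w in words if w not in STOPWORDS]


# term -> list of (document id, term frequency), ordered by document id
postings = defaultdict(list)
for doc_id in sorted(DOCS):
    for term, frequency in sorted(Counter(terms(DOCS[doc_id])).items()):
        postings[term].append((doc_id, frequency))

for term in sorted(postings):
    print(f"{term:6s} {postings[term]}")
```

Keeping each list ordered by document id is what makes merging two lists for an AND or OR query a single linear pass.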

Page 84

Inverted Index

Term     Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8   Postings
aid      0      0      0      1      0      0      0      1       4 8
all      0      1      0      1      0      1      0      0       2 4 6
back     1      0      1      0      0      0      1      0       1 3 7
brown    1      0      1      0      1      0      1      0       1 3 5 7
come     0      1      0      1      0      1      0      1       2 4 6 8
dog      0      0      1      0      1      0      0      0       3 5
fox      0      0      1      0      1      0      1      0       3 5 7
good     0      1      0      1      0      1      0      1       2 4 6 8
jump     0      0      1      0      0      0      0      0       3
lazy     1      0      1      0      1      0      1      0       1 3 5 7
men      0      1      0      1      0      0      0      1       2 4 8
now      0      1      0      0      0      1      0      1       2 6 8
over     1      0      1      0      1      0      1      1       1 3 5 7 8
party    0      0      0      0      0      1      0      1       6 8
quick    1      0      1      0      0      0      0      0       1 3
their    1      0      0      0      1      0      1      0       1 5 7
time     0      1      0      1      0      1      0      0       2 4 6

Page 85

Posting: an entry in an inverted list. Represents an occurrence of a term in an article.

Size of a list (in postings): from 1 (rare words or misspellings) to 10^6 (common words)

Size of a posting: 10-15 bits (compressed)

Page 86

Process query

• Given a query, fetch posting lists associated with query, traverse postings to compute result set

• Query document scores must be computed
• Partial scores stored in accumulators
• Top k documents extracted
• Optimization strategies to reduce # postings that must be examined
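A minimal sketch of accumulator-based scoring follows. The postings and their weights are made-up sample data, and summing per-term weights is only one simple scoring choice; the function name is hypothetical.

```python
import heapq
from collections import defaultdict


def top_k(query_terms, postings, k):
    """Sum per-term weights into one accumulator per document, return the k best."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):   # traverse the postings list
            accumulators[doc_id] += weight               # partial score
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])


# Hypothetical postings: term -> [(document id, weight)]
sample_postings = {
    "cat": [(1, 0.8), (2, 0.4)],
    "dog": [(2, 0.5), (3, 0.6)],
}

ranked = top_k(["cat", "dog"], sample_postings, k=2)
print(ranked)
```

Using a heap to extract the top k avoids sorting every accumulator when only a few results are shown.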

Page 87

Indexing: Performance Analysis

• The indexing problem
– Must be relatively fast, but need not be real time
– For Web, incremental updates are important

• How large is the inverted index?
– Size of vocabulary
– Size of postings

• Fundamentally, a large sorting problem
– Terms usually fit in memory
– Postings usually don’t

Page 88

Index

• Size of index depends on payload
• Well-optimized inverted index can be 1/10 of size of original document collection
• If store position info, could be several times larger
• Usually can hold entire vocabulary in memory (using front-coding)
• Postings lists usually too large to store in memory
• Query evaluation involves random disk access and decoding postings
– Try to minimize random seeks