Post on 01-Jan-2016
Why am I here?
Tal Olier – tal.olier@hp.com
~15 years in various software development positions. All of them involved database practice.
I work in HP Software, I love working there and I came to tell you this; the lecture is just an excuse getting me into the building :)
Agenda
• RDBMS in short (basic terms)• SQL reminder• A bit about (RDBMS) architecture• Performance - access paths • What is table join• VLDB - the size factor• VLDB - industry practice• How joins are executed• Summary
A little history
• Was invented in 1970– By Edgar Frank "Ted" Codd– In IBM labs– Oracle emerged first to the market
Basics – a table
• Rows • Columns• Primary key
Emp_id Emp_name Salary
1 Dany 10,000
2 Yosi 20,000
3 Moshe 30,000
4 Eli 40,000
Basics – a relation
• A foreign key (constraint)• A reference
– Source table– Source column/s– Target table– Target column/s
People example
• People: name, height, smoking, father• Books read: title, author• Schedule details: from, to, activity• Resume details: from, to, salary
Query language
• SQL – Structured Query Language• Declarative (vs. procedural)• Requires Internal optimization
SQL modules
• DML (+Select) – Data manipulation language• DDL – Data definition language• TC – Transaction controls (commit/rollback)• DCL – Data control language (grant/revoke)• PE – Procedural extensions
Database server
MemoryProcess
I/O System
Client Process
Data Files Log Files
Server ProcessBuffer cache Log
cacheOther cache
Everything is blocks
FTS – full table scan
• Scan the whole table – from top to bottom200720082009201020072008200920102007200820092010
B+ tree
• The leaves are organized in a doubly linked list• B+ tree – allows searching through all values by
searching the leaf level only
Database index
• Data is sorted according to the index columns• The leaf contain pointers to rows in the table• Search of 1 value in a tree - o (log n)• Smaller index height in B+ trees• Index (database) operations:
– Add/remove values– Index seek– Index scan
Inner join
• Use join predicate to match rows from 2 table: A and B
• Each row in table A is compared to each row in table B to find the pairs of rows that satisfy the join predicate
• Than column values for each matched pairs are combined into a result row
dept_id Dept_name
1 Sales
2 Engineering
3 Marketing
department
employee
Emp_name Dept_id
Rina 1
Moshe 2
Shira 2
Yossi null
emp_dept_id
emp_name dept_dept_id
dept_name
1 Rina 1 Sales
1 Rina 2 Engineering
1 Rina 3 Marketing
2 Moshe 1 Sales
2 Moshe 2 Engineering
2 Moshe 3 Marketing
2 Shira 1 Sales
2 Shira 2 Engineering
2 Shira 3 Marketing
null Yossi 1 Sales
null Yossi 2 Engineering
null Yossi 3 Marketing
Cartesian product
Equi join
• A inner join that uses equality comparison in the join predicate
• Example:select * from employee emp join department dept on emp.dept_id = dept.dept_id
Equi join
OK
OK
OK
emp_dept_id
emp_name dept_dept_id
dept_name
1 Rina 1 Sales
1 Rina 2 Engineering
1 Rina 3 Marketing
2 Moshe 1 Sales
2 Moshe 2 Engineering
2 Moshe 3 Marketing
2 Shira 1 Sales
2 Shira 2 Engineering
2 Shira 3 Marketing
null Yossi 1 Sales
null Yossi 2 Engineering
null Yossi 3 Marketing
Use case: Sales Information
• Table:– Customer name– Order number– Order date and time– List of items, amount and prices
Union view
Select * from t2007Union all Select * from t2008Union all Select * from t2009Union all Select * from t2010
Impact on index behavior
200720082009201020072008200920102007200820092010…
20072007…
20082008…
20092009…
20102010…
Partitioned index (local index)
200720082009201020072008200920102007200820092010…
20072007…
20082008…
20092009…
20102010…
Local indexes
• Index is bound to it’s partition• Drop partition derives drop index• Smaller index heights• Index is always usable• Harder to maintain uniqueness with it
Partitioned table - concepts
• Partition column is the key for dividing the data
• Performance – only relevant partitions used• Add/drop partition – DDL• Local index – index is bound to a partition
Data tables
block blockblockblockblockblockblockblock blockblockblockblockblockblockblock blockblockblockblockblockblockblock blockblockblockblockblockblockblock blockblockblockblockblockblockblock blockblockblockblockblockblockblock blockblockblockblockblockblockblock blockblockblockblockblockblock
T a b l e - A T
a b
l e -
B
T a b l e - C
Dimension referencing
Year Cust_id …
…
2007 1715
…
Cust_id Name Serial signature …
…
1715 Bank of America Ltd/ FFAA23472394- …
…
…
…
Making fact tables thin
20072007…
20082008…
20092009…
20102010…
Dimension
Dimension
Dimension
Dimension
Dimension
Dimension
Dimension
Dimension
Join
To perform a join the optimizer need make the following decisions:• Access path
how to access each table
• Join orderif more than 2 tables/views are joined, which join to do first
• Join methodfor each pair of row resource how to perform the join
Join methods– nested loop
• One input is the outer loop, the other input is the inner loop
• The inner loop is executed for each row in the outer loop
• Effective when – The outer loop is small– The inner loop is pre indexed
Join methods– hash
• The smaller of the 2 inputs is named the build input
• The second is probe input• Hash table is build from build input• Each row in the build input is put in the
appropriate bucket• The entire probe input is scanned
Join methods– hash cont’
• For each row the hash value is calculated• The corresponding hash bucket is scanned to
find matched rows in the build input• Good for joining large amount
of data
Join methods– merge
• There is no concept of driving table• Both input sources are sorted according the
join key ( or use sorted source such as index)• The sorted lists are merged together • The merge itself is very fast, but it can be
expensive to sort the sources
Summary
• It’s all about I/O• Star schema – facts and dimensions• Partitions + local indexes• SQL joins (probably hash)