The Oracle9i Multi-Terabyte Data Warehouse


Transcript of The Oracle9i Multi-Terabyte Data Warehouse

Page 1: The Oracle9i Multi-Terabyte  Data Warehouse
Page 2: The Oracle9i Multi-Terabyte  Data Warehouse

The Oracle9i Multi-Terabyte Data Warehouse

Jeff Parker
Manager, Data Warehouse Development

Amazon.com

Session id:

Page 3: The Oracle9i Multi-Terabyte  Data Warehouse

The Challenges

• Rapidly evolving business

• Growing data volumes

• Do more with less

Page 4: The Oracle9i Multi-Terabyte  Data Warehouse

The Challenges

• Rapidly evolving business

– New international markets

– Continual innovation of features on Amazon
• Buy it used

• Magazine subscriptions

– Marketplace Partnerships – Toys R Us, Target

• Growing data volumes

• Do more with less

Page 5: The Oracle9i Multi-Terabyte  Data Warehouse

The Challenges

• Rapidly evolving business

• Growing data volumes
– 2X growth yearly over the past 5 years

– Currently 10 Terabytes of raw data

• Do more with less

[Chart: Data Growth – terabytes of raw data by year, 1999–2003]

Page 6: The Oracle9i Multi-Terabyte  Data Warehouse

The Challenges

• Rapidly evolving business

• Growing data volumes

• Do more with less
– Innovative use of technology and resources

– Throwing money and people at the problem is not an option

– Leverage existing investment in Oracle

Page 7: The Oracle9i Multi-Terabyte  Data Warehouse

Addressing the issues

• Rapidly evolving business
– Denormalize only for performance reasons
– Create a solution that allows new datasets to be brought into the DW rapidly, but without high maintenance costs

• Growing data volumes

• Do more with less

Page 8: The Oracle9i Multi-Terabyte  Data Warehouse

Addressing the issues

• Rapidly evolving business

• Growing data volumes
– Dual database approach to ETL
• Staging database for efficient transformation of large datasets. SQL and hash-joins allow transforms to scale in a non-linear fashion
• Second database optimized for analytics
– Oracle as an API
• Simplifies ETL architecture

• Better scalability than traditional ETL tools

• Do more with less

Page 9: The Oracle9i Multi-Terabyte  Data Warehouse

Addressing the issues

• Rapidly evolving business

• Growing data volumes

• Do more with less
– One DW schema supports all countries
– Cut costs by eliminating unneeded software
– Data-driven Load functionality

Page 10: The Oracle9i Multi-Terabyte  Data Warehouse

The ETL Process

• Extract data from source

• The Load process

• Dimensional Transforms

[Diagram: ETL flow – data is extracted from the OLTP source to flat files, passes through row-level transforms (RT) into the staging database, then through dimensional transforms (DT) into the data warehouse. RT = row-level data transform, DT = dimensional transform]

Page 11: The Oracle9i Multi-Terabyte  Data Warehouse

The ETL Process

• Extract data from source
– Can create one or more files to be loaded
– Must produce metadata upon which the Load process can depend

• The Load Process

• Dimensional Transforms

Page 12: The Oracle9i Multi-Terabyte  Data Warehouse

Extract-produced Metadata

• Describes each field in database type terms

• Changes as the dataset changes

• Can reference multiple files

• Very reliable

• No additional overhead

Page 13: The Oracle9i Multi-Terabyte  Data Warehouse

XML Based Metadata

<DATA CHARSET="UTF8" DELIMITER="\t" ROWS="1325987">
  <COLUMNS>
    <COLUMN ID="dataset_id" DATA_TYPE="NUMBER" DATA_PRECISION="38" DATA_SCALE="0"/>
    <COLUMN ID="dataset_name" DATA_TYPE="VARCHAR2" DATA_LENGTH="80"/>
    <COLUMN ID="CREATION_DATE" DATA_TYPE="DATE" DATE_MASK="YYYY/MM/DD.HH24:MI:SS"/>
    <COLUMN ID="CREATED_BY" DATA_TYPE="VARCHAR2" DATA_LENGTH="8"/>
  </COLUMNS>
  <FILES>
    <FILE PATHNAME="/flat/datasets_20020923_US.txt.1"/>
    <FILE PATHNAME="/flat/datasets_20020923_US.txt.2"/>
  </FILES>
</DATA>

Page 14: The Oracle9i Multi-Terabyte  Data Warehouse

The ETL Process

• Extract data from source

• The Load Process
– Makes extensive use of External Tables
– MERGE and Bulk Insert
– Contains integrated DBA tasks
– Every load is tracked in an operational database

• Dimensional Transforms

Page 15: The Oracle9i Multi-Terabyte  Data Warehouse

The Load Process

[Diagram: The Load Process – extract metadata and data files define an external table (XT); row-level transforms and SQL INSERT/MERGE move the data into the data warehouse; load statistics are logged and cleanup is performed as integrated load/DBA tasks against the OPSDW operational database]

Page 16: The Oracle9i Multi-Terabyte  Data Warehouse

The Load Process

• External Tables
– Access to files on the operating system
– A building block in a broader ETL process

• MERGE & Bulk Insert

• Integrated DBA tasks

Page 17: The Oracle9i Multi-Terabyte  Data Warehouse

The External Table

• Created by using Metadata from the Extract process

• Data is read-only

• No indexes

• Use DBMS_STATS to set number of rows

[Diagram: External table DATA_SETS (dataset_id NUMBER, dataset_name VARCHAR(80), creation_date DATE, created_by VARCHAR(8)) layered over the data files]
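Because an external table has no stored statistics, the row count from the extract metadata can be handed to the optimizer directly. A minimal sketch, assuming the XT_datasets_77909 table and the ROWS value from the XML metadata shown earlier:

-- Tell the optimizer how many rows the external table exposes
-- (1325987 comes from the ROWS attribute in the extract metadata)
BEGIN
  DBMS_STATS.SET_TABLE_STATS(
      ownname => USER,
      tabname => 'XT_DATASETS_77909',
      numrows => 1325987 );
END;
/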

Page 18: The Oracle9i Multi-Terabyte  Data Warehouse

Example External Table

1. Copy the data to the database server. Data must reside in a file system location specified by the DBAs.

create directory DAT_DIR as '/stage/flat';

Page 19: The Oracle9i Multi-Terabyte  Data Warehouse

Example External Table

2. Create the external table using the DDL generated from the extract metadata.

CREATE TABLE XT_datasets_77909
( dataset_id    NUMBER
, dataset_name  VARCHAR2(80)
, creation_date DATE
, created_by    VARCHAR2(8) )
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY dat_dir
  ACCESS PARAMETERS
  ( records delimited by newline
    characterset UTF8
    fields terminated by '\t' )
  LOCATION ('datasets_20020923_US.txt') );
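Once created, the external table behaves like a read-only view over the flat file and can be queried immediately; a quick sanity check of the load source might be:

-- The external table can be queried like any other table
SELECT COUNT(*) FROM XT_datasets_77909;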

Page 20: The Oracle9i Multi-Terabyte  Data Warehouse

The External Table

• No pre-staging of data

• Ability to describe a flat file to Oracle

• Handles horizontally partitioned files

• Good error messaging

Page 21: The Oracle9i Multi-Terabyte  Data Warehouse

The Load Process

• External Tables

• MERGE
– Can be run in parallel
– Combined with external table provides a powerful set of ETL tools

• Integrated DBA tasks

Page 22: The Oracle9i Multi-Terabyte  Data Warehouse

MERGE

• Allows for update or insert in a single statement
– If key value already exists

• Yes, update row

• No, insert row

• MERGE statement is auto-generated

• Row level column transforms are supported

Page 23: The Oracle9i Multi-Terabyte  Data Warehouse

MERGE

[Diagram: MERGE – the external table columns (dataset_id, dataset_name, created_by) and their metadata are merged, with sysdate supplied for last_updated, into the permanent table (dataset_id, dataset_name, created_by, last_updated)]

Page 24: The Oracle9i Multi-Terabyte  Data Warehouse

MERGE example

MERGE INTO DATASETS ds
USING ( SELECT xt.dataset_id
             , xt.dataset_name
             , xt.creation_date
             , nvl(xt.created_by, 'nobody') AS created_by
          FROM XT_datasets_77909 xt ) src
ON ( src.dataset_id = ds.dataset_id )
WHEN MATCHED THEN UPDATE SET
       ds.dataset_name  = src.dataset_name
     , ds.creation_date = src.creation_date
     , ds.created_by    = src.created_by
     , ds.last_updated  = sysdate
WHEN NOT MATCHED THEN INSERT
     ( dataset_id, dataset_name, creation_date, created_by, last_updated )
     VALUES
     ( src.dataset_id, src.dataset_name, src.creation_date, src.created_by, sysdate );

Page 25: The Oracle9i Multi-Terabyte  Data Warehouse

MERGE

• Issues we faced
– Duplicate records in the dataset
– NESTED-LOOPS from the external table
– Parallelism is not enabled by default
– Bulk Load partition determination
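A rough sketch of how the parallelism point plays out, reusing the DATASETS example above (the degree of 8 is illustrative, not Amazon's actual setting): parallel DML must be enabled for the session, and the MERGE itself needs an explicit PARALLEL hint. Setting row counts with DBMS_STATS, as shown earlier, also helps steer the optimizer away from NESTED-LOOPS plans driven by the external table.

ALTER SESSION ENABLE PARALLEL DML;

MERGE /*+ PARALLEL(ds, 8) */ INTO DATASETS ds
USING ( SELECT xt.dataset_id, xt.dataset_name, xt.creation_date
             , nvl(xt.created_by, 'nobody') AS created_by
          FROM XT_datasets_77909 xt ) src
ON ( src.dataset_id = ds.dataset_id )
WHEN MATCHED THEN UPDATE SET
       ds.dataset_name = src.dataset_name, ds.last_updated = sysdate
WHEN NOT MATCHED THEN INSERT
     ( dataset_id, dataset_name, creation_date, created_by, last_updated )
     VALUES
     ( src.dataset_id, src.dataset_name, src.creation_date, src.created_by, sysdate );

COMMIT;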

Page 26: The Oracle9i Multi-Terabyte  Data Warehouse

The Load Process

• External Tables

• MERGE

• Integrated DBA tasks
– Reduces workload required by the DBA team
– Streamlines the load process
– Eliminates human error

Page 27: The Oracle9i Multi-Terabyte  Data Warehouse

Integrated DBA Tasks

• Provided by the DBA team
– Managed by the DBA team
– ETL team does not need special knowledge of table layout

Page 28: The Oracle9i Multi-Terabyte  Data Warehouse

Integrated DBA Tasks

• Truncate Partition

The developer makes the call:
truncate_partition( 'TABLE-NAME', partition-key1, partition-key2, partition-key3 )

The DBA utility translates this and executes:
alter table TABLE-NAME drop partition dbi20020930_101;
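A minimal sketch of what such a wrapper could look like (the procedure body, and the use of get_partition_name to resolve the DBA team's naming convention, are assumptions for illustration, not the actual Amazon utility; the DDL mirrors the slide above):

-- Hypothetical wrapper: resolves the partition name from the load keys,
-- then issues the DDL on the ETL developer's behalf
CREATE OR REPLACE PROCEDURE truncate_partition(
    p_table_name IN VARCHAR2,
    p_key1       IN VARCHAR2,
    p_key2       IN VARCHAR2 DEFAULT NULL,
    p_key3       IN VARCHAR2 DEFAULT NULL )
AS
  l_partition VARCHAR2(30);
BEGIN
  -- get_partition_name encapsulates the partition naming scheme (see Page 30)
  l_partition := get_partition_name(p_table_name, p_key1, p_key2, p_key3);
  EXECUTE IMMEDIATE 'alter table ' || p_table_name ||
                    ' drop partition ' || l_partition;
END;
/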

Page 29: The Oracle9i Multi-Terabyte  Data Warehouse

Integrated DBA Tasks

• Analyze Partition

The developer makes the call:
analyze_partition( 'TABLE-NAME', partition-key1, partition-key2, partition-key3 )

The DBA utility translates this and executes:
dbms_stats.gather_table_stats( ownname, tabname, partname, cascade, estimate_percent, granularity );
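For concreteness, the translated call might end up looking like the following (schema, table, sample size, and partition name are illustrative values, not the production settings):

-- Gather statistics for just the partition that was loaded
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
      ownname          => 'DW',
      tabname          => 'DATASETS',
      partname         => 'DBI20020930_101',
      cascade          => TRUE,
      estimate_percent => 5,
      granularity      => 'PARTITION' );
END;
/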

Page 30: The Oracle9i Multi-Terabyte  Data Warehouse

Integrated DBA Tasks

• Return Partition Name

The developer makes the call:
get_partition_name( 'TABLE-NAME', partition-key1, partition-key2, partition-key3 )

The DBA utility translates this and returns the appropriate partition name. This is very useful when bulk loading tables.
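A sketch of why the returned name matters for bulk loads (the table, partition, and source names reuse the running example, and the exact loading SQL is an assumption): with the partition name in hand, the load can use a direct-path insert aimed at only that partition.

-- Direct-path (APPEND) insert into the single partition being loaded
INSERT /*+ APPEND */ INTO DATASETS PARTITION (DBI20020930_101)
SELECT xt.dataset_id, xt.dataset_name, xt.creation_date,
       nvl(xt.created_by, 'nobody'), sysdate
  FROM XT_datasets_77909 xt;

COMMIT;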

Page 31: The Oracle9i Multi-Terabyte  Data Warehouse

Integrated DBA Tasks

• Partitioning utilities
– Helps to streamline the process

–Reduces workload of DBA team

–Helps to eliminate the problem of double loads for Snapshot tables and partitions

Page 32: The Oracle9i Multi-Terabyte  Data Warehouse

The Load Process

• External Tables
– Provides access to flat files outside the database

• MERGE
– Parallel "upsert" simplifies ETL
– Row level transforms can be performed in SQL

• Integrated DBA tasks
– Reduces workload required by the DBA team

– Streamlines the load process

– Eliminates human error

• Loads are repeatable processes

Page 33: The Oracle9i Multi-Terabyte  Data Warehouse

Summary

• Reduction in time to integrate new subject areas

• Oracle parallelism scales well

• Eliminated unneeded software

Page 34: The Oracle9i Multi-Terabyte  Data Warehouse

Summary

• Oracle has delivered on the DW promise
– Oracle External Table combined with MERGE is a viable alternative to other ETL tools
– ETL tools are ready today

Page 35: The Oracle9i Multi-Terabyte  Data Warehouse

Questions & Answers

Page 36: The Oracle9i Multi-Terabyte  Data Warehouse

Reminder – please complete the OracleWorld session survey

Thank you.