ETL Review


7/28/2019 ETL Review

http://slidepdf.com/reader/full/etl-review 1/30

ETL Software

Joanna Frazier, Abhishek Sengupta, Chris Kadlec, Erik Shepard, Susan Kost, Brian Strok, Ivan Vasquez

What is ETL?

Short for extract, transform, and load: three database functions that are combined into one tool to pull data out of one database and place it into another database.

ETL is used to migrate data from one database to another, to form data marts and data warehouses, and also to convert databases from one format or type to another.

http://www.pcwebopedia.com/TERM/E/ETL.html

Side Note:

It should be noted that ETL is not 3 well-defined steps.

We are breaking them up and presenting a theoretical view for ease of understanding before bringing them together and showing you how this method actually works in the real business world.

Extraction

Data needs to be taken from some data source so that it can be put into the data warehouse. This is done in one of two ways:

1. Some code at the data source exports the data to be used.

2. Some external program takes the data from the source.

Extraction (cont)

If the data is exported, it is typically exported into a text file that can then be brought into an intermediary database.

If the data is extracted from the source, it is typically transferred directly into an intermediary database.
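
The two extraction paths can be sketched as follows. This is a minimal illustration, not a description of any particular ETL product: SQLite stands in for both the source and the intermediary database, and the table names `customers` and `staging_customers` and all function names are hypothetical.

```python
import csv
import io
import sqlite3

STAGING_DDL = "CREATE TABLE IF NOT EXISTS staging_customers (id TEXT, name TEXT)"

def export_to_text(rows, header):
    """Path 1, step 1: code at the data source exports the data as delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

def load_text_into_staging(text, conn):
    """Path 1, step 2: bring the exported text into the intermediary database."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    conn.execute(STAGING_DDL)
    conn.executemany("INSERT INTO staging_customers VALUES (?, ?)", reader)
    conn.commit()

def extract_directly(source_conn, staging_conn):
    """Path 2: an external program reads the source and writes straight to staging."""
    rows = source_conn.execute("SELECT id, name FROM customers").fetchall()
    staging_conn.execute(STAGING_DDL)
    staging_conn.executemany("INSERT INTO staging_customers VALUES (?, ?)", rows)
    staging_conn.commit()
```

In both paths the data ends up in a staging table rather than in the warehouse itself, which leaves room for the transformation work before loading.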

Data Transformation

Locates, extracts, conditions, scrubs and loads data onto the data warehouse platform

Physical database design must be available before loading can be performed

“Designs the process and develops the utilities and programming that allow the data warehouse to be initially loaded and maintained”

Data Transformation

3 major steps

- Data Cleansing

- Data Integration

- Other Transformations (includes replacement of codes, derived values, calculating aggregates)
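
The "other transformations" named above (code replacement, derived values, aggregates) might look like this in a small sketch. The code table, field names, and functions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical code table: source systems often store short codes.
STATUS_CODES = {"A": "Active", "I": "Inactive"}

def transform(rows):
    """Replace codes and add a derived value to each order row."""
    out = []
    for row in rows:
        new = dict(row)
        # Replacement of codes: map the source code to a readable value.
        new["status"] = STATUS_CODES.get(row["status"], "Unknown")
        # Derived value: computed from other fields in the same row.
        new["total"] = row["quantity"] * row["unit_price"]
        out.append(new)
    return out

def aggregate_by_region(rows):
    """Calculate an aggregate: total sales per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["total"]
    return dict(totals)
```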

Data Cleansing

Dirty Data

Dummy Values

Absence of Data

Cryptic Data

Contradicting Data

Inappropriate Use of Address Lines

Reused Primary Keys

 Non-unique Identifiers
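
Three of the dirty-data categories listed above (dummy values, absence of data, non-unique identifiers) lend themselves to simple automated checks. The sketch below is one possible approach; the set of dummy values is an assumption, since real placeholder values depend on the source system, and the function name is hypothetical.

```python
from collections import Counter

# Assumed placeholder values; real dummy values depend on the source system.
DUMMY_VALUES = {"999-99-9999", "N/A", "UNKNOWN", "00000"}

def find_dirty_rows(rows, key_field, required_fields):
    """Return (row_index, problem) pairs for three kinds of dirty data."""
    problems = []
    key_counts = Counter(row.get(key_field) for row in rows)
    for i, row in enumerate(rows):
        for field in required_fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                problems.append((i, f"absent: {field}"))
            elif str(value).strip().upper() in DUMMY_VALUES:
                problems.append((i, f"dummy: {field}"))
        # The same identifier appearing on more than one row is non-unique.
        if key_counts[row.get(key_field)] > 1:
            problems.append((i, f"non-unique: {key_field}"))
    return problems
```

Cryptic or contradicting data usually needs business knowledge to detect, so it is left out of this sketch.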

Data Integration

2 Major Problems

- Data that should be related but cannot be (may arise due to non-unique primary keys or, more often, the absence of primary keys)

- Data that is inadvertently related but should not be (occurs when fields or records are reused for multiple purposes)
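
The first problem, data that should be related but cannot be, can at least be made visible by separating the records that find a partner from those that do not. A minimal sketch, assuming hypothetical `orders` and `customers` records that share a `customer_id` field:

```python
def split_by_match(orders, customers, key="customer_id"):
    """Separate orders that can be related to a customer from those that cannot."""
    known = {c[key] for c in customers if c.get(key) is not None}
    matched, unmatched = [], []
    for order in orders:
        # An order with a missing key can never be related to a customer.
        if order.get(key) in known:
            matched.append(order)
        else:
            unmatched.append(order)
    return matched, unmatched
```

The unmatched list is the work queue for integration: each entry needs either a corrected key or a decision to drop it.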

Loading

The populating of tables that presentation applications will use to make data available to users

One of the most critical operations in any warehouse, yet often neglected

Loading (cont)

The LOADING process can be broken down into 2 different types:

 – Initial Load

 – Continuous Load (loading over time)

Initial Load

Consists of populating tables in warehouse schema and verifying data readiness

Examples:

 – DTS – Data Transformation Services

 – bcp utility – bulk copy program

 – SQL*Loader

 – Native database languages (T-SQL, PL/SQL, etc.)
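
The tools listed above are vendor utilities; the same two ideas (populate the tables, then verify readiness) can be sketched with SQLite as a stand-in. The table `sales_fact` and the function name are hypothetical, and a row-count comparison is only one possible readiness check.

```python
import sqlite3

def initial_load(conn, rows):
    """Populate a warehouse table once, then verify the data is ready for use."""
    conn.execute(
        "CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, "
        "region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    with conn:  # one transaction: the load either fully succeeds or rolls back
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    # Readiness check: every row handed to the loader must be in the table.
    loaded = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
    if loaded != len(rows):
        raise RuntimeError(f"expected {len(rows)} rows, loaded {loaded}")
    return loaded
```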

Continuous Loads

Must be scheduled and processed in a specific order to maintain integrity, completeness, and a satisfactory level of trust

Should be the most carefully planned step in data warehousing; otherwise it can lead to:

 – Error duplication

 – Exaggeration of inconsistencies in data

Continuous Loads (cont)

Must run during a fixed batch window (usually overnight)

Must maximize system resources to load data efficiently in allotted time

 – Ex. Red Brick Loader can validate, load, and index up to 12GB of data per hour on an SMP system
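
The batch-window constraint comes down to simple arithmetic. The sketch below uses the 12GB per hour figure quoted above as a default rate; the data volumes and window length in the usage note are made-up numbers for illustration.

```python
def load_hours(volume_gb, rate_gb_per_hour=12.0):
    """Hours needed to load volume_gb at the given sustained rate."""
    return volume_gb / rate_gb_per_hour

def fits_in_window(volume_gb, window_hours, rate_gb_per_hour=12.0):
    """True if the load finishes within the batch window."""
    return load_hours(volume_gb, rate_gb_per_hour) <= window_hours
```

For example, a hypothetical 60GB nightly load would take 5 hours at that rate and fit a 6-hour window, while 90GB would need 7.5 hours and would not.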

Additional Aspects of Loader 

Should be able to handle:

 – Aggregations – build on past data (SUM, MODIFY, APPEND, UPDATE, etc.)

 – Filtering – additional cleaning and filtering based on user instructions

 – Integrity – ensure data to be loaded meets integrity constraints previously established

 – Index Building – create indexes associated with the data being loaded
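
The four loader capabilities above can be combined in one small routine. This is a sketch under stated assumptions, not how any commercial loader works: SQLite stands in for the warehouse, the tables `sales` and `region_totals` are hypothetical, and the filter rule (a minimum amount) is an invented example of a user instruction.

```python
import sqlite3

def continuous_load(conn, batch, min_amount=0.0):
    """Load one batch of (region, amount) rows: filter, check, append, aggregate, index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS region_totals "
        "(region TEXT PRIMARY KEY, total REAL NOT NULL)"
    )
    # Filtering: drop rows excluded by the (assumed) user rule.
    kept = [(r, a) for r, a in batch if a is not None and a >= min_amount]
    # Integrity: reject the whole batch if any remaining row lacks a region.
    if any(r is None or r == "" for r, _ in kept):
        raise ValueError("batch violates NOT NULL constraint on region")
    with conn:  # commit everything together, or roll back on error
        conn.executemany("INSERT INTO sales VALUES (?, ?)", kept)
        # Aggregation: build on past data by adding to the running totals.
        for region, amount in kept:
            conn.execute("INSERT OR IGNORE INTO region_totals VALUES (?, 0)", (region,))
            conn.execute(
                "UPDATE region_totals SET total = total + ? WHERE region = ?",
                (amount, region),
            )
        # Index building for the data being loaded.
        conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_region ON sales (region)")
    return len(kept)
```

Calling the function once per batch window gives the "loading over time" behaviour: each call appends new rows and updates the totals built by earlier calls.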

Questions to Ask 

Data Source Connectivity: Oracle, Sybase, Informix, mainframe (CICS), flat files.

Functionality: Pre-built transformations

Metadata: Open Architecture, Reporting Capability, Extensibility

Performance: Engine Driven, Code Generator, Bulk Loading, "Data never touches the ground", Multi-threaded processes.

Administration: Versioning, Debugging, Auditing

More Questions to Ask 

Backup and Disaster Recovery: Restart logic, Error detection

Modeling Tool Connectivity: ERwin, PowerDesigner

Ease of Use: GUI, Intuitive design, Integrated toolset

Programming Languages Supported: VB, C, C++, COBOL

Support: 24x7, Devoted Staff levels

ETL Vendors

Ascential

SAS

NCR Teradata

IBM

Oracle

Vality

Firstlogic

ETL Tool Set

Purchase or Grow Your Own

100s of Vendors - www.dwinfocenter.org/clean.html

Pricing Varies Widely

Trend – Included as part of other initiatives

 – CRMs

• NCR’s Teradata

 – Data Warehouses

• Oracle, Red Brick, DB2, Prism, Sybase, Teradata, Informix, Microsoft SQL Server

Pricing Trends

Costs

 – FireSpout, ETL Engine

• Start at $150K 

 – MetaRecon Enterprise

• Server Package $250K, Client Package $50K 

Pricing Trends

IBM, DB2, 7

 – ASPs and other partners – pay with a percentage of revenue received from customers once the solution is running, on a per-subscriber or per-transaction basis.

 – Still offer per-user base pricing model. The majority of database purchases are sold with an accompanying application and will still be done this way.

Formation 1.4

 – Informix databases, Red Brick Warehouse, Oracle8 Server, Microsoft SQL Server databases.

 – $7500 per processor for the Formation Flow Engine

 No use of ETL Tools

Start Immediately

Any logic set can be programmed

Disadvantages:

 – Many programs to build

 – Transformation logic is complex

 – Lengthy program build process

 – No automatic metadata generation

 – Maintenance – constant changes

 – Infrastructure is very expensive

www.nyoug.org/dwetl_ny.pdf 

Use ETL Tools

Enables rapid application development (RAD)

Allows easy maintenance

Generates metadata automatically

Reduces development costs

Disadvantages:

 – Learning curve

 – Some limits to logic capabilities

www.nyoug.org/dwetl_ny.pdf 

ETL for Spatial Data Warehousing

What is spatial data warehousing?

Spatial data warehousing is the aggregation of discrete spatial databases together in a single repository, along with associated value-added tabular datasets.

The data often come from disparate sources, e.g. roads from the Department of Transportation, rivers and lakes from the Department of Natural Resources, etc.

Spatial Data Types

Functionally, there are two principal spatial data types

 – Vector – Geometric data such as points, lines, and polygons. Examples would be roads, contour lines, schools, etc.

 – Raster – Continuous or image data. An example would be aerial photography.
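
The difference between the two types shows up clearly in how they might be stored. The classes below are a hypothetical, simplified representation for illustration only; real GIS formats carry much more (coordinate system, attributes, multiple bands).

```python
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[float, float]

@dataclass
class VectorFeature:
    kind: str            # "point", "line", or "polygon"
    coords: List[Coord]  # ordered vertices
    name: str = ""

@dataclass
class Raster:
    origin: Coord           # map coordinates of the upper-left corner
    cell_size: float        # ground distance covered by one cell
    cells: List[List[int]]  # rows of cell values, top row first

    def value_at(self, x: float, y: float) -> int:
        """Look up the cell containing map coordinate (x, y); no bounds checking."""
        col = int((x - self.origin[0]) / self.cell_size)
        row = int((self.origin[1] - y) / self.cell_size)
        return self.cells[row][col]
```

A road would be a `VectorFeature` of kind "line" with a handful of vertices, while an aerial photograph would be a `Raster` with a value for every cell in its extent.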

Demonstration

Georgia 2000 Information System

The Georgia 2000 Information System aggregates spatial and tabular data from a wide variety of sources.

The foundation of Georgia 2000 is the map data: for example, political boundaries, roads, water features, facilities, locations, etc.

Tabular data adds value to the map data with information such as spending patterns per county, etc.

Problems unique to spatial data warehousing

Coordinate System (Projection)

Geometric Errors – Misalignment of geometric features

Geometric Errors – Distortions of photography due to camera angle, height displacement, etc.

Topological Errors – Little pieces of unidentified areas called slivers. These can account in total for large areas.

ETL for spatial data warehousing involves systematic corrections of geometric, topological or coordinate system problems.

Another type of spatial data can be produced from a process called “geocoding”, in which points are located along a network (for example, a street network).

The quality of the underlying tabular data used as input affects the quality of geocoding. Correcting this tabular data for good results from geocoding requires the same types of ETL as does traditional data warehousing.
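
One common way to locate a point along a street network is to interpolate a house number within the address range of a street segment. The slides do not say which geocoding method was used, so the sketch below is an assumption; the segment fields and the function name are hypothetical.

```python
def geocode(house_number, segment):
    """Place an address along a street segment by linear interpolation."""
    lo, hi = segment["from_number"], segment["to_number"]
    if not lo <= house_number <= hi:
        raise ValueError("house number outside the segment's address range")
    # Fraction of the way along the segment, from its start to its end.
    fraction = 0.0 if hi == lo else (house_number - lo) / (hi - lo)
    (x1, y1), (x2, y2) = segment["start"], segment["end"]
    return (x1 + fraction * (x2 - x1), y1 + fraction * (y2 - y1))
```

The dependence on tabular data quality is visible here: a house number that is missing, mistyped, or stored as free text has to be cleansed before the function can place it at all.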

Demonstrations