ETL ( Extract Transform Load ) - Wikispaces ( Extract Transform Load ) ... Data Transformation •...
11/27/2010
Data Warehousing
ETL (Extract, Transform, Load)
Acknowledgement
Data Warehousing (Fall 2010), Saleha Raza
• Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals
By: Paulraj Ponniah
• The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition
By: Ralph Kimball, Margy Ross
ETL
• Extract
• Transform
• Load
• It is not uncommon for a project team to spend 50-70% of the project effort on ETL tasks.
Data Extraction
Difficulties in Source System
Major Steps in ETL Process
Data Extraction
Data Extraction
Current vs Periodic attributes in operational system
Data Extraction
• Immediate Data Extraction
– Through transaction logs
– Through database triggers
– Capture in source system
• Deferred Data Extraction
– Capture based on datetime stamp
• What if a source record gets deleted?
– Capture by comparing files (also called snapshot differential)
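The two deferred techniques above can be sketched side by side. This is a minimal illustration, assuming each snapshot is held in memory as a dict keyed by primary key and that every row carries an `updated_at` column; the function names and sample rows are hypothetical. It shows why the deletion question matters: a timestamp filter only sees rows that still exist, while comparing snapshots also reveals rows that disappeared.

```python
from datetime import datetime

def extract_by_timestamp(rows, last_extract):
    """Deferred capture by timestamp: rows modified after the previous run."""
    return [r for r in rows if r["updated_at"] > last_extract]

def snapshot_differential(previous, current):
    """Compare two snapshots keyed by primary key."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserted, updated, deleted

# Hypothetical snapshots of a source table (yesterday vs today).
previous = {
    1: {"id": 1, "city": "Karachi", "updated_at": datetime(2010, 11, 1)},
    2: {"id": 2, "city": "Lahore", "updated_at": datetime(2010, 11, 2)},
    3: {"id": 3, "city": "Quetta", "updated_at": datetime(2010, 11, 3)},
}
current = {
    1: {"id": 1, "city": "Karachi", "updated_at": datetime(2010, 11, 1)},
    2: {"id": 2, "city": "Islamabad", "updated_at": datetime(2010, 11, 26)},
    4: {"id": 4, "city": "Multan", "updated_at": datetime(2010, 11, 27)},
}

# The timestamp filter finds the changed and new rows but cannot see row 3.
by_timestamp = extract_by_timestamp(current.values(), datetime(2010, 11, 20))

# The snapshot comparison also reports row 3 as deleted.
inserted, updated, deleted = snapshot_differential(previous, current)
```

The trade-off is cost: the snapshot comparison needs a full copy of the previous extract, whereas the timestamp filter only needs the time of the last run.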
Immediate Data Extraction
Deferred Data Extraction
Data Extraction techniques - Summary
Data Transformation
• Format Revision
– Type/length conversions, datetime formatting, etc.
• Decoding of fields
– Cryptic fields, Boolean values
• Calculated and Derived values
• Splitting of single field
– E.g. Address, FullName, etc.
• Merging of information
– Different attributes coming from different sources
• Character Set conversion
– EBCDIC, ASCII, Unicode, etc.
• Conversion of units of measurement
– Amounts in different currencies across different global branches, quantities in different units
• Date/time conversions
– Different date formats (American/British date formats)
• Summarization
– Generation of summary tables
• Key restructuring
– Generation of surrogate keys to avoid business keys
• Deduplication
– Resolution among different records coming from different sources pointing to the same object
Data Integration & Consolidation
• Entity Identification problem
• Multiple sources problem
• Transformation of dimension attributes
– Incorporating dimension changes (Type 1/Type 2/Type 3 changes)
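A few of the transformations listed above (decoding of fields, splitting of a single field, date format conversion, and key restructuring) can be sketched in one small function. This is only an illustration under stated assumptions: the field names, the `GENDER_CODES` lookup, and the two source systems `us_crm` and `uk_crm` are all hypothetical.

```python
from datetime import datetime
from itertools import count

# Hypothetical decode table for a cryptic source field.
GENDER_CODES = {"M": "Male", "F": "Female"}

_next_key = count(1)
_surrogate_keys = {}

def surrogate_key(source_system, business_key):
    """Key restructuring: one warehouse key per (source, business key) pair."""
    natural = (source_system, business_key)
    if natural not in _surrogate_keys:
        _surrogate_keys[natural] = next(_next_key)
    return _surrogate_keys[natural]

def transform(record, source_system, date_format):
    # Splitting of a single field: FullName into first and last name.
    first, _, last = record["full_name"].strip().partition(" ")
    return {
        "customer_key": surrogate_key(source_system, record["customer_id"]),
        "first_name": first,
        "last_name": last,
        # Decoding of fields: cryptic code into a readable value.
        "gender": GENDER_CODES.get(record["gender"], "Unknown"),
        # Date conversion: source-specific format into ISO 8601.
        "signup_date": datetime.strptime(
            record["signup_date"], date_format).date().isoformat(),
    }

# The same business key and the same calendar day, written two ways.
us_row = {"customer_id": "C-17", "full_name": "Jane Doe",
          "gender": "F", "signup_date": "11/27/2010"}
uk_row = {"customer_id": "C-17", "full_name": "John Smith",
          "gender": "M", "signup_date": "27/11/2010"}

a = transform(us_row, "us_crm", "%m/%d/%Y")  # American format
b = transform(uk_row, "uk_crm", "%d/%m/%Y")  # British format
```

Both rows end up with the same ISO date, and the clashing business key `C-17` gets a distinct surrogate key per source system.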
Data Loading
Data Loading Techniques
• Load
• Append
• Destructive Merge
• Constructive Merge
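The four modes can be contrasted with a small in-memory sketch. It assumes the usual textbook reading of the terms: Load replaces the target contents, Append adds to them, Destructive Merge overwrites rows with a matching key, and Constructive Merge keeps the old row and marks the incoming row as superseding it. The row layout (`key`, `value`, `current`) is hypothetical, and duplicate handling in Append is deliberately left out.

```python
def load(target, incoming):
    """Load: discard whatever the target holds, then apply the incoming rows."""
    return [dict(row, current=True) for row in incoming]

def append(target, incoming):
    """Append: keep existing rows and add the incoming ones."""
    return target + [dict(row, current=True) for row in incoming]

def destructive_merge(target, incoming):
    """Overwrite rows whose key matches; add rows whose key is new."""
    merged = {row["key"]: row for row in target}
    for row in incoming:
        merged[row["key"]] = dict(row, current=True)
    return list(merged.values())

def constructive_merge(target, incoming):
    """Keep the matching old row; add the incoming row as the current one."""
    incoming_keys = {row["key"] for row in incoming}
    kept = [dict(row, current=row["current"] and row["key"] not in incoming_keys)
            for row in target]
    return kept + [dict(row, current=True) for row in incoming]

target = [
    {"key": 1, "value": "A", "current": True},
    {"key": 2, "value": "B", "current": True},
]
incoming = [{"key": 2, "value": "B2"}, {"key": 3, "value": "C"}]

after_load = load(target, incoming)
after_append = append(target, incoming)
after_destructive = destructive_merge(target, incoming)
after_constructive = constructive_merge(target, incoming)
```

Only the constructive merge preserves history: the old row for key 2 stays in the table but is no longer flagged as current.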
Before loading data into the data warehouse, indexes are usually dropped from the tables and recreated after loading.
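The drop-and-recreate pattern might look like the following, sketched with Python's built-in `sqlite3` module and an in-memory database. The table `sales_fact` and index `idx_sales_product` are hypothetical names; a real warehouse would use its own bulk loader, but the sequence of steps is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (sale_id INTEGER, product_key INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product_key)")

# Hypothetical batch of rows to load.
rows = [(i, i % 10, i * 1.5) for i in range(1000)]

with conn:  # commits on success, rolls back on error
    # 1. Drop the index so each insert does not have to maintain it.
    conn.execute("DROP INDEX idx_sales_product")
    # 2. Bulk load the rows.
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    # 3. Rebuild the index once, over the full table.
    conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product_key)")

loaded = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
index_names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
```

Building the index once over the finished table is generally cheaper than updating it row by row during the load.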
Data Loading Techniques
Loading changes in dimension tables
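The three kinds of dimension change (Type 1, Type 2, Type 3) differ in how much history they keep. The sketch below is an illustration only, assuming a customer dimension held as a list of dicts; the column names (`customer_key`, `customer_id`, `start_date`, `end_date`) are hypothetical.

```python
from datetime import date

def apply_type1(dimension, business_key, changes):
    """Type 1: overwrite the attribute; no history is kept."""
    for row in dimension:
        if row["customer_id"] == business_key:
            row.update(changes)

def apply_type2(dimension, business_key, changes, effective):
    """Type 2: close the current row and add a new row with a new surrogate key."""
    current = next(r for r in dimension
                   if r["customer_id"] == business_key and r["end_date"] is None)
    current["end_date"] = effective
    new_row = dict(current, **changes)
    new_row.update(
        customer_key=max(r["customer_key"] for r in dimension) + 1,
        start_date=effective,
        end_date=None,
    )
    dimension.append(new_row)

def apply_type3(dimension, business_key, column, new_value):
    """Type 3: keep the previous value in an extra column on the same row."""
    for row in dimension:
        if row["customer_id"] == business_key:
            row["previous_" + column] = row[column]
            row[column] = new_value

def new_dimension():
    return [{"customer_key": 1, "customer_id": "C-17", "city": "Karachi",
             "start_date": date(2009, 1, 1), "end_date": None}]

t1 = new_dimension()
apply_type1(t1, "C-17", {"city": "Lahore"})

t2 = new_dimension()
apply_type2(t2, "C-17", {"city": "Lahore"}, date(2010, 11, 27))

t3 = new_dimension()
apply_type3(t3, "C-17", "city", "Lahore")
```

After the same change, `t1` still has one row with the old city gone, `t2` has two rows with date ranges, and `t3` has one row carrying both the new and the previous city.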
Data Quality
The three critical aspects of data in a data warehouse are quality, quality, and quality.
Data Quality
• Data quality implies that data is fit for the purpose for which it is intended.
• Data quality vs Data accuracy
Some explicit data quality problems
• Dummy values in fields
– E.g. 11111111111 in zip code, spaces in mandatory fields
• Absence of data values
– Data is not important for the operational system and hence is not mandatory, but is crucial in analysis.
• Unofficial use of fields
– E.g. phone no./fax in address line 3, customer comments in Contact field, product features in handling instructions
• Cryptic fields
– Cryptic codes / magic numbers
• Contradicting values
– Home address vs Home phone, State vs Zip code, DOB vs Age
• Violation of business rules
– E.g. rules such as Sell price > Cost price, Profit percent between 1 and 100, Probability between 0 and 1, Qty Produced = Qty Accepted + Qty Rejected
• Reused primary keys
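Several of the problems above can be detected mechanically. The sketch below expresses the business rules from the list, plus a simple dummy-value check, as named predicates over a record. The field names and the two sample records are hypothetical, and the dummy-value test (a zip code made of one repeated character) is an assumed heuristic, not a complete check.

```python
RULES = {
    "sell price > cost price":
        lambda r: r["sell_price"] > r["cost_price"],
    "profit percent between 1 and 100":
        lambda r: 1 <= r["profit_percent"] <= 100,
    "probability between 0 and 1":
        lambda r: 0 <= r["probability"] <= 1,
    "qty produced = qty accepted + qty rejected":
        lambda r: r["qty_produced"] == r["qty_accepted"] + r["qty_rejected"],
    # Assumed heuristic: a zip code of one repeated character is a dummy value.
    "zip code is not a dummy value":
        lambda r: len(set(r["zip_code"])) > 1,
}

def violations(record):
    """Return the names of all rules the record breaks."""
    return [name for name, rule in RULES.items() if not rule(record)]

good = {"sell_price": 120.0, "cost_price": 100.0, "profit_percent": 20,
        "probability": 0.4, "qty_produced": 50, "qty_accepted": 48,
        "qty_rejected": 2, "zip_code": "74200"}
bad = {"sell_price": 90.0, "cost_price": 100.0, "profit_percent": 20,
       "probability": 1.3, "qty_produced": 50, "qty_accepted": 48,
       "qty_rejected": 1, "zip_code": "11111111111"}

good_report = violations(good)
bad_report = violations(bad)
```

Keeping the rules in a table of named predicates makes the violation report readable and lets new rules be added without touching the checking loop.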
Some explicit data quality problems
• Non-unique identifiers
– Entity Identification problem: product code 366 points to different records in the inventory system and the POS system
• Inconsistent values
– Student vs Faculty record for students who teach as well
• Incorrect values
– E.g. @. in email
• Multipurpose fields
– FacultyID / StudentID in LoginID
• Erroneous Integration
– Auction example
Data Quality
• Incorrect codes, states, status, etc.
• int value stored in string format, datetime in string
• Null values, empty strings not allowed in DW
• From date < To date, Sell price > Cost price, Loan balance >= 0
• Logical parts of attributes
• Address line 3 for phone/email
• Res phone, Home phone, Cell
• E.g. PK must not be null, FK must be properly referenced
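The integrity checks mentioned above (PK must not be null, FK must be properly referenced, int values stored as strings) can be sketched as three small functions. This is an illustration under assumptions: tables are plain lists of dicts, and the `products` and `sales` rows are hypothetical.

```python
def null_primary_keys(rows, pk):
    """Rows whose primary key is missing or null."""
    return [r for r in rows if r.get(pk) is None]

def orphaned_rows(child_rows, fk, parent_rows, pk):
    """Child rows whose foreign key matches no non-null parent key."""
    parent_keys = {p[pk] for p in parent_rows if p.get(pk) is not None}
    return [c for c in child_rows if c[fk] not in parent_keys]

def unparseable_ints(rows, column):
    """Rows where a numeric value stored as a string cannot be read as an int."""
    bad = []
    for r in rows:
        try:
            int(r[column])
        except (TypeError, ValueError):
            bad.append(r)
    return bad

products = [
    {"product_id": 1, "name": "Widget"},
    {"product_id": None, "name": "Gadget"},   # null primary key
]
sales = [
    {"sale_id": 10, "product_id": 1, "qty": "3"},
    {"sale_id": 11, "product_id": 99, "qty": "three"},  # orphan, bad qty
]

bad_products = null_primary_keys(products, "product_id")
orphans = orphaned_rows(sales, "product_id", products, "product_id")
bad_quantities = unparseable_ints(sales, "qty")
```

Running such checks before the load keeps rows that would break joins or aggregations out of the warehouse, or at least routes them to a reject file for review.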
Take a break!
Sources of Data Pollution
• System conversions
• Data aging
• Heterogeneous system integration
• Poor database design
• Incomplete information at data entry
• Input errors
• Internationalization/localization
• Fraud
• Lack of policies
Validation of Names and Addresses