Pentaho etl-tool

Transcript of Pentaho etl-tool

Page 1: Pentaho etl-tool

Kettle – ETL Tool

Sreenivas K

Page 2: Pentaho etl-tool

Agenda

Introduction
ETL Process
Pentaho's Kettle
Data Integration Challenges
Prerequisites and Recent Releases
Pentaho DI Components
Spoon
Transformations
Jobs

Page 3: Pentaho etl-tool

Introduction – ETL Process

Major Components

Extracting
Gathering raw data from source systems and storing it in the ETL staging environment
Data profiling
Identifying data that changed since the last load

Transforming – Cleaning and Conforming
Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
Data cleansing
Recording error events
Audit dimensions
Creating and maintaining conformed dimensions and facts

Page 4: Pentaho etl-tool

Introduction – ETL Process

Loading
Loading data into data warehouse tables
Managing hierarchies in dimensions
Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
Fact table loading
Building and maintaining bridge dimension tables
Handling late-arriving data
Management of conformed dimensions
Administration of fact tables
Building aggregations
Building OLAP cubes
Transferring DW data to other environments for specific purposes
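As a rough picture of how the extracting, transforming and loading phases above fit together, here is a minimal, illustrative Java sketch. It is not Kettle code; every class, method and table name in it is hypothetical and only stands in for the corresponding phase.

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: a hand-rolled ETL pipeline, not the Kettle engine.
// All names (MiniEtlPipeline, fact_sales, ...) are made up for this sketch.
public class MiniEtlPipeline {

    // Extract: gather raw rows from a source system into a staging list.
    static List<String[]> extract() {
        List<String[]> staging = new ArrayList<String[]>();
        staging.add(new String[] {" alice ", "42"});
        staging.add(new String[] {"BOB", "17"});
        return staging;
    }

    // Transform: clean and conform each row (trim, normalize case, parse numbers).
    static List<Object[]> transform(List<String[]> rows) {
        List<Object[]> conformed = new ArrayList<Object[]>();
        for (String[] row : rows) {
            String name = row[0].trim().toUpperCase(Locale.ROOT);
            int amount = Integer.parseInt(row[1].trim());
            conformed.add(new Object[] {name, amount});
        }
        return conformed;
    }

    // Load: write the conformed rows into the warehouse (here just printed as SQL).
    static void load(List<Object[]> rows) {
        for (Object[] row : rows) {
            System.out.printf("INSERT INTO fact_sales(name, amount) VALUES ('%s', %d)%n",
                    row[0], row[1]);
        }
    }

    public static void main(String[] args) {
        load(transform(extract()));
    }
}

In Kettle terms, each of these three methods would correspond to one or more steps in a transformation.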

Page 5: Pentaho etl-tool

Data Transformation and Integration Examples

Data filtering: is not null, greater than, less than, includes

Field manipulation: trimming, padding, upper- and lowercase conversion

Data calculations: + - × /, average, absolute value, arctangent, natural logarithm

Date manipulation: first day of month, last day of month, add months, week of year, day of year

Data type conversion: string to number, number to string, date to number

Merging fields & splitting fields

Looking up data: look up in a database, in a text file, an Excel sheet, …
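The row-level operations listed above map onto ordinary language features. The Java sketch below is illustrative only: it does not use the Kettle step API, and it relies on java.time, which requires Java 8 rather than the JRE 1.5 minimum listed on the prerequisites slide. It walks through filtering, field manipulation, a calculation, date manipulation and type conversion.

import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

// Illustrative only: the kinds of operations Kettle steps perform, written by hand.
public class TransformationExamples {
    public static void main(String[] args) {
        String raw = "  42.5  ";

        // Data filtering: "is not null".
        if (raw == null) {
            return;
        }

        // Field manipulation: trimming and uppercase conversion.
        String trimmed = raw.trim();
        String label = "product code".toUpperCase();

        // Data type conversion: string to number.
        double amount = Double.parseDouble(trimmed);

        // Data calculations: absolute value and natural logarithm.
        double logValue = Math.abs(Math.log(amount));

        // Date manipulation: first/last day of month, day of year.
        LocalDate date = LocalDate.of(2024, 5, 17);
        LocalDate firstOfMonth = date.with(TemporalAdjusters.firstDayOfMonth());
        LocalDate lastOfMonth = date.with(TemporalAdjusters.lastDayOfMonth());
        int dayOfYear = date.getDayOfYear();

        System.out.println(label + " " + amount + " " + logValue + " "
                + firstOfMonth + " " + lastOfMonth + " " + dayOfYear);
    }
}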

Page 6: Pentaho etl-tool

Introduction – Pentaho Kettle

Kettle – Kettle Extraction, Transformation, Transportation & Loading tool

It is part of the open source business intelligence suite from Pentaho, a company founded in 2004, and is used for powerful data integration.

Products of Pentaho
Mondrian – OLAP server written in Java
Kettle – ETL tool

Page 7: Pentaho etl-tool

Data Integration - Challenges

Data is everywhere
Data is inconsistent
Records are different in each system

Performance issues
Running queries that summarize data over a long period places a heavy load on the operational system

Data is never all in the data warehouse
Excel sheets, acquisitions, new applications

Page 8: Pentaho etl-tool

Prerequisites

Java Runtime Environment 1.5 and above

Compatible with almost any platform

Compatible with a wide range of database technologies

Recent Releases

4/25 Data Integration 3.0.3 GA

4/18 Data Integration 3.1 Milestone

2/8 Data Integration 3.0.2 GA

12/12 Data Integration 3.0.1 GA

11/15 Data Integration 3.0 GA

10/31 Data Integration 3.0 RC2

10/24 Data Integration 2.5.2 GA

10/08 Data Integration 3.0 RC1

08/24 Data Integration 2.5.1 GA

Page 9: Pentaho etl-tool

Pentaho Components

Spoon
A GUI that allows you to design transformations and jobs that can be run with the Kettle tools Pan and Kitchen.

Transformations and jobs can describe themselves using an XML file or can be put in a Kettle database repository.

Spoon is available as an executable script and a batch file, so the tool can be used in heterogeneous environments.

Pan
A program to execute transformations designed in Spoon, stored as XML or in a database repository.

Transformations are usually scheduled in batch mode to be run automatically at regular intervals.

Kitchen
A program to execute jobs designed in Spoon, stored as XML or in a database repository.
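As a sketch of how Pan is typically driven from a scheduler, the Java snippet below launches pan.sh with a transformation file. The install path and .ktr file name are made-up placeholders, and the exact script names and options should be checked against the PDI release in use; Kitchen is invoked the same way with kitchen.sh and a .kjb job file.

import java.io.IOException;

// Illustrative only: launching Pan from a scheduling process.
// The install path and transformation file below are hypothetical placeholders.
public class RunPan {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "/opt/pentaho/data-integration/pan.sh",      // assumed install location
                "-file=/etl/transformations/load_sales.ktr", // transformation designed in Spoon
                "-level=Basic");                             // logging level
        pb.inheritIO();                                      // forward Pan's console output
        int exitCode = pb.start().waitFor();
        System.out.println("Pan finished with exit code " + exitCode);
    }
}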

Page 10: Pentaho etl-tool

Repository Connection Establishment

Auto login
By manually setting the KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environment variables.

Login
By default PDI provides the login username and password as admin.
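A minimal sketch of what auto login relies on: the three environment variables named above have to be present in the environment of the process that starts the Kettle tool. Here they are injected via ProcessBuilder before launching spoon.sh; the install path and repository name are placeholders, and admin/admin is the default login mentioned above.

import java.io.IOException;
import java.util.Map;

// Illustrative only: setting the auto-login variables before launching a Kettle tool.
// The repository name is a placeholder; admin/admin is the PDI default login.
public class AutoLoginLauncher {
    public static void main(String[] args) throws IOException {
        ProcessBuilder pb = new ProcessBuilder("/opt/pentaho/data-integration/spoon.sh");
        Map<String, String> env = pb.environment();
        env.put("KETTLE_REPOSITORY", "my_repository");
        env.put("KETTLE_USER", "admin");
        env.put("KETTLE_PASSWORD", "admin");
        pb.inheritIO();   // forward Spoon's console output
        pb.start();
    }
}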

Page 11: Pentaho etl-tool
Page 12: Pentaho etl-tool
Page 13: Pentaho etl-tool
Page 14: Pentaho etl-tool

Transformation

Value: Values are part of a row and can contain any type of data.
Row: A row consists of 0 or more values.
Output stream: An output stream is a stack of rows that leaves a step.
Input stream: An input stream is a stack of rows that enters a step.
Hop: A hop is a graphical representation of one or more data streams between 2 steps.
Note: A note is a piece of information that can be added to a transformation.

A transformation is an engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources.
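To make the glossary concrete, here is a toy model in Java (illustrative only, not the actual Kettle classes) of a step that pops rows off its input stream, adds a value to each row, and pushes them onto its output stream.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative only: a toy model of values, rows and streams; not the Kettle API.
public class ToyStep {

    // A row is an ordered collection of values, each of which can hold any type of data.
    static List<Object> row(Object... values) {
        return new ArrayList<Object>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        // Input stream: the stack of rows entering the step.
        Deque<List<Object>> inputStream = new ArrayDeque<List<Object>>();
        inputStream.push(row("alice", 42));
        inputStream.push(row("bob", 17));

        // Output stream: the stack of rows leaving the step.
        Deque<List<Object>> outputStream = new ArrayDeque<List<Object>>();

        // The "step": move each row from input to output, appending a computed value.
        while (!inputStream.isEmpty()) {
            List<Object> r = inputStream.pop();
            r.add((Integer) r.get(1) * 2);   // new value derived from an existing one
            outputStream.push(r);
        }

        for (List<Object> r : outputStream) {
            System.out.println(r);
        }
    }
}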

Page 15: Pentaho etl-tool

Jobs

Job Entry: A job entry is one part of a job and performs a certain task.
Hop: A hop is a graphical representation of one or more data streams between 2 job entries.
Note: A note is a piece of information that can be added to a job.

A job is a way of calling transformations and controlling the sequence of their execution. Jobs are usually scheduled in batch mode to be run automatically at regular intervals.

Page 16: Pentaho etl-tool

Input Steps
Output Steps
Lookup Steps
Transformation Steps
Join Steps
DW Steps
Mapping Steps
Job Steps

Page 17: Pentaho etl-tool
Page 18: Pentaho etl-tool
Page 19: Pentaho etl-tool