PDI-Labguide ETL Using Pentaho Data Integration
Infosys Technologies Limited
Version No: 3.0
Lab Guide
For Pentaho Data Integration 4.0.1
(also known as Kettle)
Table of Contents
Assignment 0: Installing PDI 4.0.1 and opening the PDI IDE
Assignment 1: The Kettle Repository
Assignment 2: My first Data transfer using Kettle
Assignment 3: Using the ‘Add constants’, ‘Calculator’ and ‘Select Values’ transformations
Assignment 4: Creating an ODBC data source
Assignment 5: Using the ‘Database Lookup’ transformation
Assignment 0: Installing PDI 4.0.1 and opening the PDI IDE
Learning Objective: To download and install Pentaho Data Integration 4.0.1, and open the PDI
interface.
Step 1: Install the Java Runtime Environment (version 1.4 or higher) on your system.
Step 2: Go to the http://www.pentaho.com site and download Pentaho Data Integration 4.0.1.
Step 3: Unzip the downloaded PDI zip file. Open the ‘data-integration’ folder, and double-click on the
spoon.bat file to open the PDI IDE.
Assignment 1: The Kettle Repository
Learning Objective: To learn the concept of a repository in PDI (Kettle) and learn how to create,
connect or disconnect from a repository.
Concept of Repository: The Kettle repository is a workspace that the data integrator works on. This
workspace is a physical region of the hard-drive that is designated exclusively for Kettle. In the
repository, all information about transformations, jobs, schedules, etc. is stored. The repository concept
promotes re-usability, which in turn saves time and effort.
A repository may be created in two ways:
1) Kettle database repository
2) Kettle file repository
When Kettle is started, the ‘Repository Connection’ dialog box appears, asking you to select a repository
from the list of existing repositories, or create a new one.
To create a file repository:
Step 1: In the ‘Repository Connection’ dialog box, click on the ‘+’ button. The ‘Select the repository type’
dialog box will appear.
Step 2: Select ‘Kettle file repository’ and click on ‘OK’.
Step 3: In the ‘File repository settings’ dialog box, click on the ‘Browse’ button and select a folder that
shall exclusively be your file repository space. Fill in the ID and Name fields and click on the ‘OK’ button.
Then click on the ‘OK’ button of the ‘Repository Connection’ dialog box to select the newly-created repository.
You are now ready to create transformations and jobs on this workspace.
To disconnect from the current working repository, go to Tools menu:
Tools -> Repository -> Disconnect repository
…or alternatively, press Ctrl+D.
NOTE: In the course of working with Kettle, if you want to change your repository or create a new one,
then you can do so by first disconnecting from the current working repository. Then, open the
‘Repository Connection’ dialog box from:
Tools -> Repository -> Connect
…or alternatively, press Ctrl+R. The ‘Repository Connection’ dialog box appears.
Assignment 2: My first Data transfer using Kettle
Learning Objective: To create a simple transformation that involves data transfer from a flat file to an
Access database destination.
Step 1: In the Kettle IDE file menu, open File -> New -> Transformation, or alternatively, press Ctrl+N.
Step 2: To save your transformation file with a name of your choice, press Ctrl+S. The ‘Transformation
properties’ dialog box opens up. Give the transformation a name of your choice, and then click on ‘OK’.
Step 3: On the ‘Design’ pane on the left of the IDE, expand the ‘Input’ group. Drag and drop the ‘Text file
input’ on the transformation design surface.
Step 4: Double-click on the ‘Text file input’. The text file input properties dialog box opens up. Click on
‘Browse’ to select the flat file to be used as an input.
Select the ‘Products.txt’ flat file that will be used as input for the transformation. After clicking on
‘Open’, click on the button ‘Add’ to add the file to the list of selected files.
Step 5: Go to the ‘Content’ tab. Since this is a ‘Comma separated values (CSV)’ flat file, specify the
separator as comma (,).
Step 6: Open the ‘Fields’ tab and click on ‘Get fields’. When prompted for the number of lines to sample,
enter 0. Review the scan results of the flat file, and then click on the ‘Close’ button.
You can also see the contents of the text file by clicking on the ‘Preview rows’ button.
Step 7: Once done, click on ‘OK’ to complete the process of defining a flat file input.
Step 8: Expand the ‘Output’ group on the design pane, and drag and drop ‘Access output’ on the
transformation surface.
A ‘Hop’ is used to determine the sequence of data flow from one transformation item to another.
To create the hop, do one of the following:
a) Click on the Text file input, then press the <SHIFT> key and draw a line to the Access output.
b) Place the mouse pointer on the Text file input until the hover menu appears, and then drag the
hop Output connector to the Access output.
c) Place the mouse pointer on the Text file input, press the middle button of the mouse, then drag
the hop pointer and release it on the Access output.
Step 9: Double-click on the ‘Access output’ to open its properties dialog box. Since the Access database
does not currently exist, enter the file name along with the full path in ‘The database filename’ field.
Also enter the name of the target table in the ‘Target table’ field. Keep the ‘Create database’ and
‘Create table’ checkboxes selected, so that the database and the table are created if they do not
already exist.
After this is done, click on ‘OK’.
Step 10: To run the transformation, click on the green-coloured triangular button.
The ‘Execute a transformation’ dialog box opens up. Click on ‘Launch’ to execute the transformation.
The ‘Execution Results’ pane appears.
In the ‘Step Metrics’ tab, the column ‘Active’ shows ‘Finished’ if the transformation was executed
successfully.
Open the ‘Northwind’ Access database file. You will see that the data has been successfully populated in
the ‘Products’ table.
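For readers who find it helpful to see the same data flow as code, the sketch below mirrors this
transformation in Python: it reads comma-separated rows (the role of ‘Text file input’) and writes them to a
table that is created only if it is missing (the role of ‘Access output’ with ‘Create table’ selected). It
is an illustration only, not a description of how Kettle works internally. SQLite stands in for Access so
that the example runs with the standard library alone, and the sample rows and column names are
assumptions, since the contents of ‘Products.txt’ are not shown in this guide.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for Products.txt (its real contents are not shown here).
SAMPLE_CSV = """ProductID,ProductName,UnitPrice
1,Chai,18.00
2,Chang,19.00
3,Aniseed Syrup,10.00
"""


def load_products(csv_text, connection, table="Products"):
    """Read comma-separated rows and insert them into `table`, creating it if needed."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=",")
    rows = list(reader)
    columns = reader.fieldnames  # taken from the header line of the file

    # Comparable to keeping 'Create table' selected: create the target only if missing.
    column_list = ", ".join(f'"{name}"' for name in columns)
    connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_list})')

    placeholders = ", ".join("?" for _ in columns)
    connection.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})',
        [tuple(row[name] for name in columns) for row in rows],
    )
    connection.commit()
    return len(rows)


connection = sqlite3.connect(":memory:")  # in-memory stand-in for the Access file
rows_written = load_products(SAMPLE_CSV, connection)
stored = connection.execute(
    'SELECT "ProductName" FROM "Products" ORDER BY "ProductID"'
).fetchall()
```

Here `rows_written` plays the part of the ‘Written’ count in the ‘Step Metrics’ tab, and querying `stored`
corresponds to opening the database afterwards to confirm that the rows arrived.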
Assignment 3: Using the ‘Add constants’, ‘Calculator’ and ‘Select Values’ transformations
Learning Objective: To learn how to use the ‘Calculator’ to calculate a new column using existing
column values, and select specific fields to be populated in the destination using the ‘Select Values’
transformation.
Requirements:
i. The columns from the ‘employee’ Excel sheet that are required to be sent to an Excel worksheet
are: EmployeeID, LastName, FirstName, Title, TitleOfCourtesy, HireDate, City, Country,
HomePhone, Extension and ReportsTo.
ii. In the ‘Employee’ table, the ‘FirstName’ and ‘LastName’ columns should be stored as a single
column in the destination.
Step 1: Create a new transformation called ‘Employee’. Drag and drop ‘Excel input’ on the
transformation surface. Double-click the ‘Excel input’ to open its properties dialog box. Click on
‘Browse’.
Select the Excel workbook that contains the source data for the ‘Employee’ table, and click on the ‘Add’
button to add it to the list of selected files.
Step 2: Go to the ‘Sheets’ tab, and click on ‘Get sheetnames’ to get the list of the names of the sheets
that you wish to include in the data flow. A dialog appears, that asks you to select the sheets you want.
Select the sheet named ‘employee’ and click on the ‘>’ button to include it in the list of selected sheets.
Then click on ‘OK’.
Step 3: Next, go to the ‘Fields’ tab and click on the ‘Get fields from header row’ button to get a list of
the field names from the first row of the Excel sheet ‘employee’.
Click on ‘Preview rows’ and enter the number of rows that you would like to preview (this facility lets
the developer confirm that the connection fetches the data from the Excel sheet correctly).
Step 4: Click on ‘OK’ to complete the task of defining a connection to the Excel sheet data source.
Step 5: From the ‘Transform’ group in the Design pane of Kettle, drag and drop ‘Add constants’
transformation on the transformation surface. Double-click on it to open its properties dialog box.
Name the new field as ‘space’, specify data-type as ‘String’ and length as 1. The value should be given as
a space.
After this is done, click on ‘OK’. The ‘Add constants’ will now add a new field called ‘space’ in the data
flow.
Step 6: From the ‘Transform’ group in the Design pane of Kettle, drag and drop ‘Calculator’
transformation on the transformation surface. Create a hop from ‘Add constants’ to ‘Calculator’.
Step 7: Double-click on the ‘Calculator’ to open its properties dialog box.
i. Specify the new field name as ‘FullName’.
ii. Select the calculation type as ‘A+B+C’.
iii. Specify ‘Field A’ as ‘FirstName’, ‘Field B’ as ‘space’, ‘Field C’ as ‘LastName’, ‘Value type’ as ‘String’
and ‘Length’ as 70. Click on ‘OK’.
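Steps 5 to 7 amount to a string concatenation: ‘Add constants’ attaches a one-character field to every
row, and ‘Calculator’ joins three String fields in order. A minimal Python sketch of that logic follows.
The sample rows are made up for illustration, and the function names are hypothetical helpers, not
anything that Kettle exposes.

```python
def add_constant(rows, field_name, value):
    """Mimic 'Add constants': attach the same value to every row under field_name."""
    return [{**row, field_name: value} for row in rows]


def calculator_a_plus_b_plus_c(rows, new_field, field_a, field_b, field_c):
    """Mimic 'A+B+C' as this lab uses it: concatenate three String fields."""
    return [
        {**row, new_field: row[field_a] + row[field_b] + row[field_c]}
        for row in rows
    ]


# Made-up sample rows; the real data comes from the 'employee' sheet.
employees = [
    {"FirstName": "Nancy", "LastName": "Davolio"},
    {"FirstName": "Andrew", "LastName": "Fuller"},
]

with_space = add_constant(employees, "space", " ")
with_full_name = calculator_a_plus_b_plus_c(
    with_space, "FullName", "FirstName", "space", "LastName"
)
full_names = [row["FullName"] for row in with_full_name]
```

Note that the helper field ‘space’ is still present in every row after the calculation, which is why it
has to be removed later in the data flow.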
Step 8: From the ‘Transform’ group in the Design pane of Kettle, drag and drop ‘Select values’
transformation on the transformation surface. Create a hop from ‘Calculator’ to ‘Select values’.
[NOTE: The ‘Select values’ transformation is used to remove the columns that are not required
further in the data flow. The existing columns that are required may also be renamed and cast to
another data type, if needed.]
Step 9: Double-click on the ‘Select values’ transformation to open its properties dialog box. Click on the
‘Get fields to select’ button to fetch the fields that are presently in the data flow.
Step 10: Go to the ‘Remove’ tab. This is where the columns that have to be excluded from the data flow
are specified.
Under the ‘Fieldname’ column, click on the drop-down. It will show a list of the available fields in the
data flow. Click on the name of the column you wish to exclude. For example, click on ‘Address’, since it
is not required further in the data flow.
Do the same for all other fields that have to be excluded.
Step 11: Under the ‘Metadata’ tab, click on the ‘Get fields to change’ button. Remove the fields that are
not required in the data flow. Specify the alternative name, data-type, length, precision, etc. for each of
the input fields (if required).
Once done, click on ‘OK’.
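The effect of the ‘Remove’ and ‘Metadata’ settings on each row can be pictured with the short Python
sketch below. It is a simplified model written for this guide, not Kettle's implementation: the function
`select_values` and its parameters are hypothetical, and the sample row (including the address and the
renamed field ‘Name’) is invented.

```python
def select_values(rows, remove=(), rename=None, cast=None):
    """Drop unwanted fields, then optionally cast and rename the remaining ones."""
    rename = rename or {}
    cast = cast or {}
    result = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if name in remove:
                continue  # 'Remove' tab: the field leaves the data flow
            if name in cast:
                value = cast[name](value)  # 'Metadata' tab: change the data type
            new_row[rename.get(name, name)] = value  # 'Metadata' tab: optional new name
        result.append(new_row)
    return result


# Invented sample row for illustration only.
rows = [
    {
        "EmployeeID": "1",
        "FullName": "Nancy Davolio",
        "Address": "1 Example Street",
        "space": " ",
    }
]

cleaned = select_values(
    rows,
    remove={"Address", "space"},
    rename={"FullName": "Name"},
    cast={"EmployeeID": int},
)
```

Fields that are neither removed, renamed nor cast pass through unchanged, which matches the way the lab
only lists the columns that need attention.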
Step 12: From the ‘Output’ group in the Design pane of Kettle, drag and drop ‘Excel output’ on the
transformation surface.
Create a hop from ‘Select values’ to ‘Excel output’. Double-click on ‘Excel output’ to open its properties
dialog box.
Click on the ‘Browse’ button.
Step 13: Select the folder where you want to save the destination Excel workbook. Specify the name of
the file, and click on ‘Save’.
Step 14: In the ‘Content’ tab, specify the sheet name as ‘Employee’.
Step 15: In the ‘Fields’ tab, click on the ‘Get Fields’ button to fetch the fields that have to be included in
the ‘Employee’ worksheet. Specify ‘#’ as format for integer fields. Once done, click on ‘OK’.
Step 16: Your transformation is now complete and ready to be executed. Run the transformation by
clicking on the green triangular button, and then clicking on the ‘Launch’ button after that.
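After execution, open the destination Excel workbook to check the result: the ‘Employee’ sheet should
contain the columns that were kept in the data flow, including the new ‘FullName’ field.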
Assignment 4: Creating an ODBC data source
Step 1: Click on Start->Control Panel->Administrative Tools->Data Sources (ODBC), then in ODBC Data
Source Administrator dialog box select User DSN tab. Click on ‘Add’.
Step 2: Select ‘Microsoft Access Driver (*.mdb, *.accdb)’ and click on ‘Finish’.
Step 3: Specify the data source name and description, and then click on ‘Select’ to select the Access
database to be used.
Step 4: Select ‘Northwind.accdb’ from its location and click on ‘OK’.
Step 5: Click on ‘OK’ again.
Step 6: Click on ‘OK’ again.
The ODBC data source has now been created.
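If you would like to confirm from outside Kettle that the new data source can be opened, a small Python
sketch such as the one below may help. It rests on several assumptions: the data source name
‘NorthwindDSN’ is hypothetical (use the name you entered in Step 3), the third-party `pyodbc` package
must be installed separately, and the Python interpreter must match the bitness (32-bit or 64-bit) of
the Access ODBC driver.

```python
def dsn_connection_string(dsn_name):
    """Build the ODBC connection string that refers to a data source by its name."""
    return f"DSN={dsn_name}"


def connect_via_dsn(dsn_name):
    """Open a connection through the named ODBC data source.

    pyodbc is imported here, not at module level, so that the string helper
    above stays usable on machines where pyodbc is not installed.
    """
    import pyodbc  # third-party package; an assumption, not part of this lab

    return pyodbc.connect(dsn_connection_string(dsn_name))


# 'NorthwindDSN' is a hypothetical name standing in for the one chosen in Step 3.
connection_string = dsn_connection_string("NorthwindDSN")
```

Calling `connect_via_dsn("NorthwindDSN")` on the machine where the data source was created should
return a connection object; an error at that point usually means the name or the driver does not match.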
Assignment 5: Using the ‘Database Lookup’ transformation
Learning Objective: To learn how to look up values from a referenced table using key-value pairs, and
include the value field(s) in the data flow.
Requirements:
i. The ‘OrderDetails’ sheet from the Excel workbook ‘Northwind’ contains product-wise data
about orders. Replace the ‘ProductID’ field by the ‘ProductName’ and populate the data into
the Northwind.accdb Access database, into a table named ‘OrderDetails’.
Step 1: Create a new transformation file, and save it as ‘OrderDetails’.
Step 2: Drag and drop an ‘Excel input’ on the transformation surface. Edit the properties of the Excel
input.
i. Select the data source as ‘Northwind.xls’.
ii. Select the source sheet as ‘orderdetails’.
iii. Click on ‘Get fields from header row’ to fetch the fields for the data flow. Click on ‘OK’, once
done.
Step 3: Drag and drop ‘Database lookup’ on the transformation surface. Create a hop from ‘Excel input’
to the ‘Database lookup’.
Step 4: Double-click on ‘Database lookup’ to open its properties dialog box. For creating a new
connection to the Access database table ‘Products’ that belongs to the ‘Northwind.accdb’ database,
click on ‘New’.
Step 5: Give the connection a name. Select connection type as ‘MS Access’. Specify the name of the
ODBC connection to the Northwind.accdb database. Click on ‘Test’ to test the connection.
If the connection is successful, a confirmation message is displayed. Click on ‘OK’.
Step 6: Click on ‘Browse’ to select the lookup table.
Step 7: Select the ‘Products’ table as the table to be looked up for value fields.
Step 8: To equate the key values between the source table and the lookup table, specify ‘Table field’ as
‘ProductID’, comparator as ‘=’ and ‘Field1’ as ‘ProductID’. Select the ‘Values to return from the lookup
table’ as ‘ProductName’.
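The key comparison configured in Step 8 can be sketched in Python as one parameterised query per incoming
row. This is an illustration under stated assumptions, not Kettle's implementation: SQLite stands in
for the Access database so that the example is self-contained, the sample products and order rows are
invented, and returning `None` for an unmatched key is a choice made for this sketch (the behaviour of
the actual step depends on how it is configured).

```python
import sqlite3

# In-memory stand-in for the 'Products' table of Northwind.accdb (invented sample data).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT)"
)
connection.executemany(
    "INSERT INTO Products VALUES (?, ?)", [(1, "Chai"), (2, "Chang")]
)

# Invented rows standing in for the 'orderdetails' sheet.
order_details = [
    {"OrderID": 10248, "ProductID": 1, "Quantity": 12},
    {"OrderID": 10248, "ProductID": 2, "Quantity": 10},
    {"OrderID": 10249, "ProductID": 99, "Quantity": 5},  # no matching product
]


def database_lookup(rows, connection):
    """Add ProductName to each row by matching the stream field to the table field."""
    enriched = []
    for row in rows:
        match = connection.execute(
            "SELECT ProductName FROM Products WHERE ProductID = ?",
            (row["ProductID"],),
        ).fetchone()
        # The table field 'ProductID' is compared with '=' to the stream field 'ProductID'.
        enriched.append({**row, "ProductName": match[0] if match else None})
    return enriched


looked_up = database_lookup(order_details, connection)
product_names = [row["ProductName"] for row in looked_up]
```

After the lookup, every row carries both ‘ProductID’ and ‘ProductName’, which is why the next step
removes ‘ProductID’ from the data flow.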
Step 9:
i. Drag and drop ‘Select Values’ on the transformation surface. Create a hop from ‘Database
lookup’ to ‘Select values’.
ii. In the ‘Remove’ tab, select the field ‘ProductID’ to be removed.
iii. In the ‘Metadata’ tab, specify the data types of the fields that are included in the data flow.
Step 10: Drag and drop ‘Access output’ on the transformation surface. Create a hop from ‘Select values’
to the ‘Access output’.
i. Specify the database as the existing ‘Northwind.accdb’ database.
ii. Give the table name as ‘OrderDetails’.
iii. Click on ‘OK’.
Step 11: Your transformation is now complete and ready to be executed. Run the transformation by
clicking on the green triangular button, and then clicking on the ‘Launch’ button after that.
After execution, open the ‘OrderDetails’ table in the Northwind.accdb database to check the result: it
should contain the ‘ProductName’ field in place of ‘ProductID’.
<EOF>