IDE 5.0_Basics_20061025


1

Using Informatica Data Explorer 5

Informatica Corporation, 2005-2006. All rights reserved.

Education Services

Version IDE-25102006

2

Agenda

• Overview of Informatica Data Explorer

• Importing Metadata and Accessing Source Data

• Column Profiling

• Data Rules

• Single Table Structural Analysis

• Cross Table Profiling

• Validating Table and Cross Table Analysis

• Normalization

• Repository

• Using the Repository Navigator

• Using Repository Reports

• Integration with PowerCenter

3

Introduction

4

Introduction Objectives

• Identify the components of the Informatica Data Explorer product suite

• Describe the Informatica Data Explorer process flow

Informatica Data Explorer Product Suite

[Diagram: the IDE Client (Windows XP, 2000) and IDE Server (UNIX or Windows) work against an IDE Project and the IDE Repository, which is browsed with the Repository Navigator. Importers bring in metadata and data: IDE Import Relational for UDB, Informix, Sybase, Oracle, and MS SQL Server (via ODBC); IDE Import Flat File for flat files and DDL; and IDE Import IMS / IDE Import VSAM, which use COBOL programs, command files, and JCL to extract VSAM, IMS, sequential, PDS, and OS/390 DB2 Unload data. IDE FTM / XML produces DDL, XML, and DTDs.]

6

Product Architecture

[Diagram: sources (RDBMS via ODBC, flat files, mainframe*) feed Data Tables 1..n through connectors; IDE Profiling performs Content Profiling, Single Table Profiling, and Cross Table Profiling to build a source data knowledge base covering structure, content, and quality; IDE Design produces a Consolidated Schema and Fixed Target Mapping, with ports to Informatica CWM; results are browsed with the Repository Navigator.]

* The IMS and VSAM importers actually use a GUI (Source Profiler) to read a Copybook and generate a program to extract mainframe data into Flat Files for use by IDE

7

Technical Diagram

• IDE Server – performs the actual profiling and is the initial store for profiling results
• Platforms: Windows (2000/2003/XP), Sun Solaris (7, 8, 9), HP-UX (11 or later), AIX (4.3, 5L)
• IDE Client, Repository Navigator, and FTM/XML – Windows 2000/XP workstations
• The client connects to the server via TCP/IP; FTM/XML works with Project files, data/header files, and XML format files
• Repository DBMS – holds completed profiling results; the Repository does not need to be on the same server as IDE
• Supported: IBM DB2 UDB 7.2/8.1, Informix 7.31/9.2/9.3, Microsoft SQL Server 7 and 2000, Oracle 8i/9i, Sybase 12 and 12.5
• Accessed via ODBC/JDBC; requires an ODBC driver (API 3.x, conformance level 2)
• Relational Importers for: IBM DB2 UDB 5.2/6.1/7.1/7.2/8.1, Informix 7.24/7.31/9.1/9.2/9.3, Oracle 7.3/8/8i/9i, Sybase 10/11/12, ODBC (SQL Server, etc.)
• Flat File Importer for: Fixed Length, Delimited, and DB2 Unload formats

IDE Process Flow

[Diagram: data (relational, flat files, VSAM, IMS, ODBC) and documented metadata enter through IDE Data Prep / Import; IDE Data Profiling analyzes the sources; IDE Schema Development produces a target design; specifications for data extraction, cleansing, and transformation feed FTM / XML metadata mapping, whose output drives a DB load, messaging, or a target database; the IDE Repository and Navigator provide metadata management throughout.]

9

Introduction Review

• The Informatica Data Explorer product line:
• Informatica Data Explorer
• Importer for Flat Files
• Import for Relational Databases
• Import for VSAM
• Import for IMS
• DDL Generators
• Source Profiler
• FTM
• Repository Navigator
• Repository

10

Introduction Review (cont.)

• The IDE Process Flow consists of five major processes:
• Data Preparation and Import

• Data Profiling

• Schema Validation and Development

• Metadata Development

• Metadata Management

11

Lesson 1

Importing Metadata and Accessing Source Data

12

Lesson 1 Objectives

• Explain what an Informatica Data Explorer Project is, and how it is used

• Create and setup Informatica Data Explorer Projects

• Define the term “metadata” as used by Informatica Software

• Explain the importance of metadata in Data Profiling using Informatica Data Explorer

13

Lesson 1 Objectives (cont.)

• Explain what source data are, and the ways in which they may be imported into Informatica Data Explorer

• Explain what the Informatica Data Explorer Flat File Importer does

• Describe the format of an Informatica Data Explorer Flat File, including the minimum requirements for Informatica Data Explorer to use it to access source data

14

Case Study Description

• The Customer Order system is a mainframe application accessed through a CICS user interface

• It was developed 10 years ago

• The Employee Identification system is an Oracle database created 2 years ago

• Business users are sure they know the data

• Senior executives suspect the quality of the data is bad

Informatica Data Explorer Project

[Diagram: the IDE product suite (as shown in the Introduction), highlighting the IDE Project as the container that the IDE Client and Server work against.]

16

Informatica Data Explorer Projects

• The persistent data store used by IDE

• A Project is a UNIX or Windows NT container (directory, folder etc.)

• Projects contain:
• Metadata
• Data
• Profiling as well as Mapping information

• Projects are opened and closed by the IDE Server

17

What is Metadata?

• Informatica Data Explorer defines metadata as:
• Data that describes data
• Information about the characteristics of source data
• In Informatica Data Explorer, metadata is information that will create:
• Schemas
• Tables
• Columns
• Other objects

18

Why Import Metadata?

• Metadata must be imported into an Informatica Data Explorer Project before any subsequent tasks or activities can be started
• Informatica Data Explorer needs to know the names of the Columns in order to store Data Profiling results
• Informatica Data Explorer needs to know how to interpret the source data (Fixed vs. Delimited)
• Provides the basis for automated quality assessments in data profiling

IDE Data Sources

[Diagram: the IDE product suite (as shown in the Introduction), highlighting the supported data sources – relational databases, flat files, DDL, and mainframe VSAM, IMS, sequential, PDS, and OS/390 DB2 Unload data.]

20

IDE Flat Files

• Consist of two components:
• Header File – contains metadata describing the contents of a data file
• Data File – data in delimited or fixed column format, as well as DB2 Load format

21

IDE Flat Files (cont.)

• Header and Data files may be:
• Separate files, or
• Combined into one file
• A header file should not contain duplicate column names (IDE will automatically rename them)
• IDE Flat Files may not contain Arrays (repeating groups or occurs)

Informatica Data Explorer Flat File Components

Header File (shows some of the documented information that can be loaded into Informatica Data Explorer):

header:file=empinfo.dat

attribute:EMPID
data_type=INTEGER
null_rule=NOT NULL
min_value=1000
max_value=9999

attribute:LAST_NAME
data_type=CHAR(20)
null_rule=NOT NULL

attribute:FIRST_NAME
data_type=CHAR(20)
null_rule=NOT NULL

attribute:GENDER
data_type=CHAR(1)
null_rule=NOT NULL

attribute:DEPTID
data_type=CHAR(4)
null_rule=NOT NULL
min_value=100

Data File (the associated comma-delimited file to which this header file refers; the address values are truncated in this excerpt):

149,Francis,Lynn,3,200,MIS,Database Administrator,"120 Co
249,Venkatachalam,Nagarajan,3,200,MIS,Project Leader,"300
289,Kim,Suk,3,200,MIS,Staff Consultant,"4040 N Fairfax Dr
216,Masood,Airaj,,200,MIS,MIS Analyst,"300 N Wakefield Dr
134,Swenson,Allison,F,200,MIS,Database Administrator,"900
164,Park,Allison,F,200,MIS,Database Analyst,"PO BOX 1471"
323,Blaskiewicz,Allison,F,200,MIS,Technical Specialist,"3
255,Barbles,Amy,F,100,Sales,Sales Executive,"4019 Rice Bl
273,Karneh,Anna,1,200,MIS,Sr Prog Analyst,"12601 Fair Lak

23

Header and Data Files

• Header and data can be in one file

• We recommend creating two separate files when they are created manually

• The more information that is provided in the header file, the more automatic comparisons Informatica Data Explorer can make

24

Login

25

Open Project

26

Import Metadata

27

Lab Exercises 1.1–1.6

28

Lesson 1 Review

• An Informatica Data Explorer Project is:
• The persistent data store used by Informatica Data Explorer
• Used to organize and partition the work effort
• A structure that contains:
• Metadata
• Data
• Profiling and Mapping information
• Metadata describes the data source and is used by Informatica Data Explorer to access the source data

29

Lesson 1 Review (cont.)

• Informatica Data Explorer can import data from:
• Relational Databases
• Oracle 7.3, 8, 8i, 9i or 10g
• Informix 7, 9.1, or 9.2
• Sybase 10, 11, 12 or 12.5
• IBM DB2 UDB 5.2, 6.1, 7.1 or 7.2
• Microsoft SQL Server 7 and 2000 (using an ODBC driver)
• Flat Files
• Delimited Format
• Fixed Length Format
• DB2 Load Format

30

Lesson 1 Review (cont.)

• Informatica Data Explorer Flat Files must be:
• ASCII or EBCDIC character format (no binary data)
• Binary data is supported via the DB2 Load Utility format
• Informatica Data Explorer Flat Files may not contain:
• Arrays (repeating groups or occurs)
• Duplicate column names

31

Lesson 1 Review (cont.)

• Informatica Data Explorer Flat Files must have a header file along with the data file

• Additional information on data preparation is available in the Using Informatica Data Explorer Source Profiler course and the documentation

32

Lesson 2

Column Profiling

33

Lesson 2 Objectives

• Explain what Column Profiling is, and why it should be performed.

• Execute the Column Profile function of Informatica Data Explorer.

• Navigate and review the results of Column Profiling.

• Explain informational Tags.

• Describe when and how to apply informational Tags to Informatica Data Explorer objects.

34

What is Column Profiling?

• A process of discovering the physical characteristics of each column in a file (sketched in SQL below)
• Comparing documented Metadata against Metadata inferred from the data source
• Column Profiling is done against data in the form of:
• ASCII flat files
• DB2 Load Utility files
• RDBMS tables
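The gist of what Column Profiling infers can be sketched in plain SQL. This is only an illustration of the idea, not how IDE implements it; the empinfo table and DEPTID column come from the course example:

-- Inferring one column's characteristics from the data itself
SELECT COUNT(*)               AS total_rows,
       COUNT(DEPTID)          AS non_null_rows,    -- basis for the inferred null rule
       COUNT(DISTINCT DEPTID) AS distinct_values,  -- basis for uniqueness/constant analysis
       MIN(DEPTID)            AS min_value,
       MAX(DEPTID)            AS max_value
FROM empinfo;

Comparing these observed figures against the documented metadata (for example, an observed null in a column documented NOT NULL) is exactly the comparison the Column Profile reports.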

35

Why Profile Columns?

• Not all database metadata and documentation are accurate pictures of the data source

• Documented descriptions of data elements may be inconsistent with the way the element is actually used

• Informatica Data Explorer Column Profiling builds a description of a column (its metadata) based on the data it contains

36

Column Lists

• The results of Column Profiling are stored with the Columns in a Table

• Column List viewers can be opened from the Navigation Tree

• Column List viewers provide information about Documented and Inferred Metadata:
• Documented Metadata are supplied from the header file or source table
• Inferred Metadata are those that Informatica Data Explorer determined from examining the data

37

Column Profiling

38

Column Viewer

39

Lab Exercises 2.1–2.4

40

Drill Down

• Allows you to perform ad hoc drill downs through data presented in the Informatica Data Explorer viewers.

• Used to interrogate any data sources that can be accessed via an ODBC connection or Informatica Data Explorer Importer.

• Searches are issued against the selected data, and rows are returned for the specified search.
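Conceptually, a drill down issues a query like the following against the source; the table, column, and predicate here are illustrative, based on the course's empinfo data:

-- Return the rows behind a profiling observation, e.g. every row
-- whose GENDER value is outside the expected domain
SELECT *
FROM empinfo
WHERE GENDER NOT IN ('F', 'M')
   OR GENDER IS NULL;   -- nulls need an explicit test; NOT IN alone skips them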

Drill Downs

41

Column Details

• Lists of properties about a Column that have been inferred by Informatica Data Explorer

• Columns can have several potential sets of characteristics

• The potential sets of characteristics are dependent on the physical view that is chosen

42

Drill Down

43

Drill Down Results

44

Lab Exercises 2.5–2.7

45

Column Value Pairs

• Informatica Data Explorer will store, per Column:
• Up to 16,000 distinct values
• These are the most frequently occurring values from the set of all values that were observed during the Column Profile execution
• The frequency with which each value was observed
• Informatica Data Explorer will calculate:
• % Distribution for each distinct value, based on the frequency divided by the total rows profiled
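In SQL terms, the value pairs and their % Distribution for one column amount to the following sketch (illustrative; IDE builds this during the profiling scan and keeps only the most frequent 16,000 values):

-- Value frequency list with % distribution for DEPTID
SELECT DEPTID,
       COUNT(*) AS frequency,
       100.0 * COUNT(*) / (SELECT COUNT(*) FROM empinfo) AS pct_distribution
FROM empinfo
GROUP BY DEPTID
ORDER BY frequency DESC;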

46

Value Pair Review

• Issues to evaluate during Column Value Pair analysis:
• Are the values/range of values correct?
• Is the data type correct?
• Is there a pattern or format to the data for this Column? Do all of the values match this pattern/format?
• Is there a difference in case for alpha characters? Are some values mixed case while others are all upper or lower case? Is this an issue?

• Are there different representations (different abbreviations/misspellings) of the same data?

• Are there duplicate values in a field that should be unique?

47

Sorting Viewers

• It is possible to sort any of the tables displayed in Informatica Data Explorer

• Clicking on a column header sorts the results in ascending order; clicking it again sorts the list in descending order

Value Pair Review

48

Sort Order

• Sorting is based on the character codes of the values in the data:
• Spaces sort to the top of an ascending sort. When the caret (^) symbol is displayed, the sort is based on the actual “space” character, not the caret (^)
• Special characters (e.g. #, &, ‘)
• Nulls
• Numbers
• Alpha characters

49

Lab Exercises 2.8–2.10

50

Tags

• Informatica Data Explorer Tags come in various forms, depending on the type of information you want to convey:
• Notes – general text
• Action Items – things that need to be done
• Rules – business rules defining the nature of an object

• Transformations – requirements to change the data to fit the object

51

Tags (cont.)

• Think of Tags as high-tech Post-Its™ that you can attach to many types of objects in an Informatica Data Explorer Project

• Note: All of the pull down menu items in Tags can be configured through server configuration files

52

Action Tag

53

Note Tag

54

Rule Tag

55

Lab Exercises 2.11–2.14

56

Content Presentation

• Constant Analysis

• Empty Column Analysis

• Inferred Data Type Analysis

• Null Rule Analysis

• Source Data Type Analysis

• Unique Analysis

• Frequency Analysis

• Pattern Analysis

• Domain Analysis
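Several of these analyses reduce to simple aggregate checks. A sketch of Constant and Empty Column Analysis, using a case-study column for illustration only:

-- Constant analysis: exactly one distinct value suggests a constant column;
-- empty column analysis: zero non-null values means the column is empty
SELECT COUNT(DISTINCT BILL_CODE) AS distinct_values,
       COUNT(BILL_CODE)          AS non_null_count
FROM custord;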

57

Content Presentations

58

Content Presentation (Continued)

59

Constant Analysis

60

Lesson 2 Review

• Column Profiling is about the analysis of column content and format

• Column Profiling scans data files and stores the resulting profile information in an Informatica Data Explorer Project

• Column Profiling information can be viewed by opening a Column List for a Table

61

Lesson 2 Review

• The results of Column Profiling are stored with a Column
• The results of Column Profiling include:
• Primary and Alternate Data Types
• Null Rules
• Minimum/Maximum Value ranges
• Value Pairs
• Patterns

• Tags can be added to Columns or Tables to convey additional information or instructions about the Column

62

Lesson 3

Data Rules

63

Data Rules - Objectives

• What is a Data Rule?

• Using Data Rules in Informatica Data Explorer

• How to test for Data Rules

• Execute Data Rules tasks

• When to apply Data Rules in the data discovery process.

64

Define Data Rules

• What is a Business Rule?
• A Business Rule describes the main characteristics of the data
• What is a Data Rule?
• A Data Rule is a constraint written against one or more Tables that is used to find incorrect data
• Data Rules can be viewed as business rules for data

65

Define Data Rules (cont.)

• Data Rules are often embedded in application programs

• The Informatica Data Explorer Practitioner can discover, document and test Data Rules against the initial source.

66

Using Data Rules in Informatica Data Explorer

• Applying Data Rules is the process of using Informatica Data Explorer to determine whether externally proposed data relationships are fully supported by the source data.

• Discover if the source data supports the relationships and business needs.

• Data Rules are tested against the initial source, stored and then can be re-run after the data has been cleansed or moved.

67

Business Rules and Data Rules

• Employees with 2 or more years of service are paid 3 weeks vacation.

• Fulltime employees are assigned to a salary band.

• Employees in Dept C – salaries cannot be greater than $40,000.

• Department number contained in the employee record must correspond to an existing Department number.

• Does the Column contain a particular string of characters?

68

Business Rules and Data Rules (cont.)

• Does one Column include the full contents of another Column?

• In an address, is there a line of blanks followed by a line of non-blanks?

• Are all three fields of a key null?

• Is the date Column in the wrong format?

• Does the Column contain the right type of data for this type of record?
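Each business rule above can be recast as a Data Rule: a query that returns the rows violating it. For instance, the Dept C salary rule might look like the following sketch (the payroll table and its columns are hypothetical):

-- Rows that violate "Employees in Dept C - salaries cannot exceed $40,000"
SELECT EMPID, DEPT, SALARY
FROM payroll                        -- hypothetical table
WHERE DEPT = 'C'
  AND (SALARY > 40000 OR SALARY IS NULL);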

69

Create and Execute Data Rules

• Data Rules can be created from two locations:
• Rules Tag

• Drill down

• Execute Data Rules from the Rules Tag viewer or Data Rules Management.

70

Drill Down

71

New Rule Tag

72

Lab Exercises 3.1–3.6

73

When to Apply Data Rules

• Tightly coupled to Drill Down

• Data Rules can be executed against different sources.

• Data Rules can be applied at any point in time during the data discovery process.

• Data Rules can be saved and re-run:
• After a data load has occurred, or
• A feed is supplied, or
• Data has changed for any reason

74

Complex Data Rules

RULE LoanTypeAmtTerm
SELECT "Loan_ID", "Loan_Type", "Loan_Amt", "Loan_Term"
FROM <Use Table in Data Source>
WHERE (UPPER(LOAN_TYPE) = 'AUTO' and
       (LOAN_AMT not between 3000 and 50000 or
        LOAN_TERM not between 12 and 60)) or
      (UPPER(LOAN_TYPE) = 'REAL' and
       (LOAN_AMT not between 10000 and 500000 or
        LOAN_TERM not between 36 and 360)) or
      LOAN_TYPE is null or LOAN_AMT is null or LOAN_TERM is null

75

Data Rule Management

76

Lesson 3 Review

• Data Rules can be created on Columns that we think are volatile.

• Data Rules can be created, saved, and run on different data sources.
• Data Rules can be created from two locations:
• Rules Tag
• Drill down

• Execute Data Rules from the Rules Tag viewer or Data Rules Management.

77

Lesson 4

Single Table Structural Analysis

78

Lesson 4 Objectives

• Explain what Table Structural Profiling is, and why it should be performed

• Define the term “Functional Dependency” as used by Informatica Data Explorer, and explain the significance

• Contrast a Single-Column Determinant to a Multiple-Column (or compound) Determinant as used by Informatica Data Explorer

79

Lesson 4 Objectives (cont.)

• Define the terms “Inferred Dependencies” and “Model Dependencies” as used by Informatica Data Explorer

• Explain why and when an Inferred Dependency should be added to the set of Model Dependencies

• Define the term “Sample Data” as used by Informatica Data Explorer, and explain the use of Sample Data in Dependency Profiling

• Understand when and how to apply Informational Tags in Dependency Profiling

80

What is Table Structural Profiling?

• A process that discovers the interrelationships between columns in your source data

• Is performed against samples of data that you have imported into Informatica Data Explorer

• It identifies Columns that determine the value of other Columns

81

Why Profile Table Structure?

• Functional Dependencies determine the structure of a data model and/or database design

• Functional Dependencies can be equated to an elementary form of Business Rule

• Dependencies between data items suggest organization of data storage that is both natural and efficient

82

Why Profile Table Structure? (cont.)

• Quickly validate expected Dependencies (Keys)

• If data does not conform to expected or required dependency rules, you most likely have a data integrity problem

83


What is Sample Data?

• Sample Data is actual data that you import into an Informatica Data Explorer Table, either from:
• Downloaded flat files, or
• Directly from a relational database
• Sample Data is a subset of the data in the source database:
• Multiple data samples can be loaded into Informatica Data Explorer
• Each data sample is stored in the Project

• Sample Data is associated with a particular Table

84

Why Import Sample Data?

• Sample data is used in Table Structural Profiling to examine relationships of all columns of a given record

[Diagram: Column Profiling reads the source data directly and stores only its results; Import Sample Data copies source data into Data Samples (#1, #2, …) inside the Project, and Table Structural Profiling examines those samples as entire records.]

85

Functional Dependencies

• A Column is functionally dependent on other Columns that determine its value

EMPNO   ENAME
123     John Doe
456     Jane Smith
789     Eduardo Sanchez
012     Jane Smith
345     John Doe
789     Eduardo Sanchez

A value of EMPNO always determines the same value of ENAME throughout the sample data

86

Functional Dependencies (cont.)

• A Functional Dependency is written as:
• A → B
• ‘A’ is the Determinant Column
• ‘B’ is the Dependent Column
• The statement is ALWAYS read left to right:
• ‘A functionally determines B’, or
• “If I know a value for A, I can determine the value for B”, or
• For each distinct value of ‘A’ there can only be one value of ‘B’

87

Functional Dependencies (cont.)

• The determinant side can be compound:
• A + B → C
• ‘A’ and ‘B’ together are the Determinant Column
• ‘C’ is the Dependent Column
• The determinant side can be Null:
• Ø → C
• Nothing is the Determinant Column
• ‘C’ is the Dependent Column
• ‘C’ has only one value, or one value and nulls, in the whole sample
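A Functional Dependency can be checked directly in SQL: A → B fails wherever one value of A maps to more than one value of B. Sketches for a simple and a compound determinant, using columns from the course examples (the emp table name is illustrative):

-- Violations of EMPNO -> ENAME: EMPNO values with more than one name
SELECT EMPNO
FROM emp
GROUP BY EMPNO
HAVING COUNT(DISTINCT ENAME) > 1;

-- Violations of ITEM_NO + ORDER_DATE -> UNIT_COST
SELECT ITEM_NO, ORDER_DATE
FROM custord
GROUP BY ITEM_NO, ORDER_DATE
HAVING COUNT(DISTINCT UNIT_COST) > 1;

An empty result means the sample supports the dependency; a single violating determinant value corresponds to a Gray dependency, and more than one to an Unsupported one (see the dependency types later in this lesson).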

88

Reviewing Inferred Dependencies

• You must review the set of Inferred Dependencies

• The Dependencies inferred by Informatica Data Explorer exist implicitly in the data

• You must make decisions as to which of the Inferred Dependencies explicitly represent the current use of the data

• The review process is to determine which dependencies should be included in the set of dependencies from which the Normalized Schema will be generated

89

Sample Data

90

Exercises 4.1 - 4.4

91

Adding an Inferred Dependency to the Model

• Inferred Dependencies added to the model establish the Tables (tables) that will be created in Normalization
• Normalization breaks a single Table into multiple Tables
• For example, the “Employee” Table in the source system represents two Tables (Employee and Department) once the dependencies are created and the model is normalized

92

Adding an Inferred Dependency to the Model (cont.)

• Columns that do not participate as a Dependent are automatically included in the Primary Key
• Informatica Data Explorer considers all Columns as part of the key until a relationship is established

• Dependency Profiling is an iterative process

93

Exercises 4.5 - 4.6

94

Dependency Subject Area

• Inferred Dependencies
• The set of dependencies that are inferred from a sample of data for a Table
• Table Dependencies
• A subset of the Model Dependencies that are wholly contained in a Table

95

Dependency Subject Area (cont.)

• Model Dependencies
• The set of dependencies that you determine fit into your design and are supported by the data
• Model Dependencies are associated at the schema level
• Model Dependencies are the set of all dependencies across all Tables
• Model Dependencies are used to create the normalized schema

96

Dependencies

97

Inferred Dependencies

98

Key Dependencies

99

Model Dependencies

100

Filter Dependencies

101

Add Dependencies to Model or Filter

102

When to Add an Inferred Dependency

• Review each Inferred Dependency and add to the model only those that have an explicit reason for existing:
• Is the application enforcing the dependency?
• Is the user/business enforcing the dependency?
• Is some outside source enforcing the dependency?

103

Types of Dependencies

• True
• The dependency is true for 100% of the data analyzed
• Example: Every time a unique value is known for EMPID, additional information is available (e.g. Employee Name, Address, Phone, etc.)
• Gray
• The dependency is almost, but not quite, 100% true for the data analyzed
• One row causes the violation

104

Types of Dependencies (cont.)

• Unsupported

• Two or more rows in the sample data do not support the dependency

• Unknown

• The dependency has not yet been validated against the sample data (Basis dependencies appear as Unknown until validated)

105

When to Add an Inferred Gray Dependency

• Questions to Ask:
• What caused the dependency to be gray?
• Should another sample be imported for verification?
• Review each Inferred Gray Dependency and add to the model only those that have an explicit reason for existing:
• Is the application supposed to be enforcing the dependency?
• Is the user/business supposed to be enforcing the dependency?
• Is some outside source supposed to be enforcing the dependency?

106

Lab Exercise 4.7

107

Tagging Dependencies

• You cannot tag an Inferred or Model Dependency

• You add Tags to the Column that is causing the problem

108

Compound Determinants

• Two or more Columns that uniquely identify the Dependent Column

• This often represents an M-to-1 relationship in the data

• This happens quite often in older file-based systems

109

Lab Exercise 4.8 – 4.9

110

Lesson 4 Review

• Importing Sample Data stores the data inside an Informatica Data Explorer Project

• Sample Data is used as input to Dependency Profiling

• You must import Sample Data before you can perform the Profile Dependencies task using Informatica Data Explorer
• Data samples are imported using the Import Sample Data feature
• Data samples can be retained from doing a Drill Down or executing a Data Rule

111

Lesson 4 Review (cont.)

• Dependency Profiling finds the relationships between Columns in the same source file or table

• All Inferred Dependencies are associated with sets of Sample Data

• Table Dependencies are dependencies that have been added to the model, and are associated with a specific Table

112

Lesson 4 Review (cont.)

• Model Dependencies are the set of dependencies from all Tables in the schema

• Only Model Dependencies are used as input to the generation of a Normalized Schema

• All Dependencies inferred by Informatica Data Explorer exist implicitly in the data

113

Lesson 4 Review (cont.)

• You will find many Inferred Dependencies that have no meaning in context of the application or business use of the data

• These are Implicit Dependencies that have no explicit meaning

• Dependency Profiling is an iterative process

114

Lesson 5

Cross Table Profiling

115

Lesson 5 Objectives

• Explain what Cross Table Profiling is, and why it should be performed

• Execute the Cross Table Profiling function in Informatica Data Explorer

• Navigate and review the results of Cross Table Profiling

• Define the terms “Synonym” and “Homonym” as used in Informatica Data Explorer

116

Lesson 5 Objectives (cont.)

• Understand what data is used for Cross Table Profiling, and how potential Synonyms are identified

• Describe why and when a Synonym should be created

• Create a Synonym

• Understand the significance of creating Synonyms

117

Cross Table Profiling

• The process that identifies similarity between the values in Columns of different Tables
• Performed using the value sets associated with the Column objects inside Informatica Data Explorer
• These are the Value Frequency Lists that were created by Column Profiling

118

Why Profile Redundancies?

• To uncover Columns that actually represent the same business facts

• Informatica Data Explorer can uncover two types of redundancies:
• Synonyms
• Redundant data that you would like to eliminate through the creation of Synonyms
• Redundant data that is intended to improve database performance
• Homonyms
• Data that looks redundant but actually represents quite different business facts

119

Comparing Value Sets

[Diagram: Value Set 1 (A, B, C) and Value Set 2 (B, C, D, E), with the value overlap (B, C) shown as the intersection of the two sets.]
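The overlap that Cross Table Profiling estimates can be expressed exactly in SQL; a sketch comparing two value sets from the course tables (illustrative only):

-- Share of SP_NO's distinct values that also occur as EMPID values
SELECT 100.0 * COUNT(*) /
       (SELECT COUNT(DISTINCT SP_NO) FROM custord) AS pct_overlap
FROM (SELECT DISTINCT SP_NO FROM custord) s
WHERE s.SP_NO IN (SELECT EMPID FROM empinfo);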

120

Inferred Redundancies

121

Exercise 5.1 - 5.2

122

Synonyms

• Two or more Columns having the same business meaning
• Comparing common values between columns can identify candidate Synonyms

[Diagram: the SP_NO value set and the EMPID value set share a 28% overlap.]

123

Effect of Synonyms

• If the Primary Keys of two Tables are synonyms, they will collapse into a single Table in the Normalized Schema

[Diagram: two source Tables keyed by TransactionID – one with ProductID, ProductName, and InventorySupplier, the other with SupplierName and SupplierAddress – collapse into a single Table: TransactionID (PK), ProductID, ProductName, SupplierName, SupplierAddr.]

124

Effect of Synonyms (cont.)

• If two Columns that are synonyms represent a parent-child relationship, they will result in two Columns in two Tables with one Column participating in a Primary Key and the other in the corresponding Foreign Key

[Diagram: an Order Table (OrderNumber, ProductID, ProductName) and a Payment Table (PaymentID, OrderID, CheckNumber); with the synonym made, the Normalized Schema keeps OrderNumber (PK) in the Order Table and adds OrderNumber (FK) to the Payment Table.]

125

Homonyms Defined

• Two or more Columns having the same name yet different business meanings

[Diagram: the STATE value set and the SHIPPING_STATE value set share a 70% overlap yet describe different business facts.]

126

Making Synonyms

127

Synonyms

128

Exercise 5.3 - 5.4

129

Lesson 5 Review

• Cross Table Profiling is about data integration between sets of data

• Cross Table Profiling comprises 2 activities:
• Comparing value lists
• Use Foreign Key or Join analysis to compare value lists greater than 16,000
• Assigning Synonyms
• Rule of Thumb:
• Be conservative about making Synonyms
• You can always come back after you’ve normalized the schema and make more

130

Lesson 5 Review

• You cannot make intra-table Synonyms, only inter-table

• You must have built Value Lists either during the Profile Columns task, or during the Import Sample Data task, before you can perform Cross Table Profiling

• Creation of Synonyms participates in Normalization

131

Lesson 6

Validating Table and Cross Table Analysis

132

Lesson 6 Objectives

• Understand how Validation differs from Cross Table Profiling
• Define and discuss the term Referential Integrity
• Explain various methods of validation and how each can be used
• Execute Validation tasks
• View Validation results

133

Validation

• Validation can be used to:
• Define the exact overlap characteristics of two redundant Columns
• Validate a single or multi-Column foreign key
• Validate that the keys of two tables do not overlap (Vertical Merge)
• Validate single or multiple Column keys (Validate Keys)
• Validate a Join
• Validate against a reference table
• Validate against Domain values
• Execute Validation from Single Table Structural Analysis and Cross Table Structural Analysis

134

Referential Integrity

• Example A: An Order File contains an OrderID that uniquely identifies each customer order. There should be no OrderID values in the Order or Detail file that do not exist in the other.

Example A

135

Referential Integrity (cont.)

• Example B: An Order file may have OrderID values that do not exist in the Payment file (outstanding payments or unbilled customers). The Payment file should not have any OrderID values that do not occur in the Order file.

Example B

136

Validation and Cross Table Profiling

• Validation compares sets of Columns between two relations to discover the quality of the overlap.
• Validation exhaustively tests all the data.
• Cross Table Profiling discovers potential overlap between Columns.
• Cross Table Profiling estimates overlap.
• Results of Validation – sets of statistics about the overlapping and non-overlapping values

137

Profile Redundant Columns

• To understand the exact overlap:
• Execute Validation from Cross Table Profiling, or
• Create a relationship (Primary Key / Foreign Key, Join, …) between the two Columns and choose Validate

138

Exercise 6.1-6.2

139

Foreign Key Analysis

• Validate a Single or Multi-Column Foreign Key
• Primary use – test the Referential Integrity of primary and foreign key relationships.
• Each row in a child table must reference a row in the parent table.
• Every order detail record must reference an order.
• Information discovered can be used to help write logic to perform the data integration.
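In SQL, Foreign Key Analysis amounts to two anti-joins; the orders and order_detail table names here are illustrative:

-- Orphan children: detail rows whose ORDER_NO has no parent order
SELECT d.*
FROM order_detail d
WHERE NOT EXISTS
      (SELECT 1 FROM orders o WHERE o.ORDER_NO = d.ORDER_NO);

-- Parents without children: orders that no detail row references
SELECT o.*
FROM orders o
WHERE NOT EXISTS
      (SELECT 1 FROM order_detail d WHERE d.ORDER_NO = o.ORDER_NO);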

140

Foreign Key Analysis Results

141

Parents Without Children

142

Exercise 6.3-6.5

143

Vertical Merge Analysis

• Primary use – when two similar systems are merged together.
• Company A merges with Company B: payroll master records are merged.
• It is expected that all rows in the parent and child tables are orphans:
• Employees of Company A are not on Company B’s payroll master file.
• Employees of Company B are not on Company A’s payroll master file.
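A vertical merge check reduces to looking for keys that appear in both tables; a sketch with hypothetical payroll tables:

-- Key collisions that must be resolved before merging the two files
SELECT a.EMPID
FROM payroll_a a
WHERE a.EMPID IN (SELECT b.EMPID FROM payroll_b b);

An empty result confirms the expectation that every row is an orphan with respect to the other file.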

144

Vertical Merge

145

Vertical Merge Analysis Results

146

Exercise 6.6-6.8

147

Validate Key Analysis

• Primary use – validate keys in a single Table
• Validation looks at the table and checks to make sure that every row is unique.
• Use this feature to find any duplicate rows for keys discovered in Single Table Structural Analysis.
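Validating a key in SQL is a duplicate check; here sketched for the compound custord key from the case study:

-- Candidate key values that occur more than once
SELECT ORDER_NO, ITEM_NO, COUNT(*) AS occurrences
FROM custord
GROUP BY ORDER_NO, ITEM_NO
HAVING COUNT(*) > 1;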

148

Validate Key

149

Validate Key Analysis Results

150

New Alternate Key

151

Validate Alternate Key

152

Exercise 6.9 - 6.12

153

Lesson 7

Normalization

154

Lesson 7 Objectives

• Explain what Normalization is and when it should be performed

• Execute the Normalization function of Informatica Data Explorer

• Navigate and review the results of Normalization

155

Lesson 7 Objectives (cont.)

• Describe what a Column Trace is, and how it is used

• Understand how to modify the Normalized Schema by making changes to the Source Schema

• Explain the iterative nature of Normalization

156

Normalization

• A process that transforms an initial schema into a schema with greater integrity
• A process of transforming the Source Schema into a:
• Non-redundant
• Anomaly-free
• Third Normal Form model
• Normalization is based upon:
• Dependencies added to the model in Single Table Structural Analysis, and
• Synonyms made in Cross Table Structural Analysis

157

Why Normalize?

• A Third Normal Form (3NF) schema has no:
• Redundant Columns other than Foreign Keys
• Columns that are only partially dependent on the key
• Transitive Dependencies
• The Normalized Schema provides a checkpoint for the completeness and accuracy of the decisions you made during the profiling tasks

158

Exercise 7.1

Normalized Schema v. Source Schema

Source Schema:

custord – ORDER_NO: char(4), ITEM_NO: char(6), ORDER_DATE: datetime, SHIPDT: datetime, PO_NUM: smallint, LAST_NAME: varchar(10), FIRST_NAME: varchar(11), CNAME: varchar(36), CON_TTL: varchar(27), SHIPPING_STREET: varchar(40), SHIPPING_CITY: varchar(20), SHIPPING_STATE: char(2), SHIPPING_ZIP: varchar(10), PHONENUM: varchar(12), SP_NO: smallint, QUANTITY: smallint, ITEM_DSC: varchar(25), SUPID: smallint, UNIT_COST: money, TAX_RATE: decimal(5,4), BILL_CODE: char(10)

empinfo – EMPID: smallint, LAST_NAME: varchar(17), FIRST_NAME: varchar(12), GENDER: char(1), DEPTID: smallint, DEPTNM: varchar(14), TITLE: varchar(30), STREET: varchar(40), CITY: varchar(15), STATE: varchar(3), ZIP: varchar(10), PHONE: varchar(14)

Normalized Schema:

All_Constant_Attributes – BILL_CODE: char(10), TAX_RATE: decimal(5,4)

ITEM_NO – ITEM_NO: char(6), SUPID: smallint, ITEM_DSC: varchar(25)

ITEM_NO_ORDER_DATE – ITEM_NO: char(6), ORDER_DATE: datetime, UNIT_COST: money

ORDER_NO – ORDER_NO: char(4), PHONENUM: varchar(12), SHIPPING_ZIP: varchar(10), SHIPPING_STATE: char(2), SHIPPING_CITY: varchar(20), SHIPPING_STREET: varchar(40), CON_TTL: varchar(27), CNAME: varchar(36), FIRST_NAME: varchar(11), LAST_NAME: varchar(10), PO_NUM: smallint

ITEM_NO_ORDER_NO – ITEM_NO: char(6), ORDER_NO: char(4), ORDER_DATE: datetime, EmployeeID: smallint, DEPTID: smallint, QUANTITY: smallint, SHIPDT: datetime

DEPTID – DEPTID: smallint, DEPTNM: varchar(14)

EmployeeID – EmployeeID: smallint, DEPTID: smallint, PHONE: varchar(14), ZIP: varchar(10), STATE: varchar(3), CITY: varchar(15), STREET: varchar(40), TITLE: varchar(30), GENDER: char(1), FIRST_NAME: varchar(12), LAST_NAME: varchar(17)

160

Normalized Schema Anomalies

• Observable normalization anomalies may include:
• Unexpected Tables
• Duplicate Tables
• Tables with strange/unexpected keys
• Columns in the wrong locations

161

Column Tracing

• Allows you to find the origin of a Column in another schema
• Used to determine the Source Model Dependencies and Synonyms (or the lack thereof) which may be causing the anomaly

162

Schema Locking

• The existence of a Normalized Schema causes Informatica Data Explorer to lock various objects in the Source Schema

• In order to modify Dependencies in the Source Schema, you must remove the Normalized Schema

163

Re-Normalizing

• In order to change the Normalized Schema, you must:
• Remove the Normalized Schema
• Modify the Source Schema
• Re-run Normalization
• The next exercises:
• Remove a dependency
• Add another Table
• Renormalize the schema
• Review the new Normalized Schema

164

Lab Exercises 7.2 – 7.3

165

New Normalized Schema

All_Constant_Attributes – BILL_CODE: char(10), TAX_RATE: decimal(5,4)

ITEM_NO – ITEM_NO: char(6), SUPID: smallint, ITEM_DSC: varchar(25)

SHIPPING_ZIP – SHIPPING_ZIP: varchar(10), SHIPPING_STATE: char(2), SHIPPING_CITY: varchar(20), SHIPPING_STREET: varchar(40)

ORDER_NO – ORDER_NO: char(4), PHONENUM: varchar(12), SHIPPING_ZIP: varchar(10), CON_TTL: varchar(27), CNAME: varchar(36), FIRST_NAME: varchar(11), LAST_NAME: varchar(10), PO_NUM: smallint

ITEM_NO_ORDER_NO – ITEM_NO: char(6), ORDER_NO: char(4), UNIT_COST: money, QUANTITY: smallint, EmployeeID: smallint, SHIPDT: datetime, ORDER_DATE: datetime

DEPTID – DEPTID: smallint, DEPTNM: varchar(14)

EmployeeID – EmployeeID: smallint, PHONE: varchar(14), ZIP: varchar(10), STATE: varchar(3), CITY: varchar(15), STREET: varchar(40), TITLE: varchar(30), DEPTID: smallint, GENDER: char(1), FIRST_NAME: varchar(12), LAST_NAME: varchar(17)

166

Lesson 7 Review

• Normalization is a 100% automated process

• The only inputs to the normalization process are:
• Dependencies added to the Model
• Column Synonyms

• Refinement of the Normalized Schema is an iterative process

167

Lesson 7 Review (cont.)

• The Normalized Schema is most often used as a basis for:
• Baseline view
• Review for anomalies
• Comparison to business requirements
• Staging Area

• The Normalized Schema is not a business model

168

Lesson 7 Review (cont.)

• Normalized Schema anomalies stem from either:
• Dependencies added to the model
• Dependencies not added to the model
• Incorrect (or unmade) Synonyms
• You can Normalize the Source Schema as soon as you have added dependencies to the model during Single Table Structural Analysis
• Actually, you can do it any time, but it will just make a copy of your existing schema if you have not added any dependencies

169

Lesson 7 Review (cont.)

• If you have not established inter-relational Synonyms, you will get duplicate Tables and/or Columns in the Normalized Schema
• Duplicate Tables will appear in the Normalized Model with an extension, such as:
• EmployeeID
• EmployeeID_1
• Suggestions:
• Make only one change at a time and then renormalize
• Often making one change in the Source Schema can result in several changes in the Normalized Schema

170

Lesson 8

Exporting to the Repository

171

Lesson 8 Objective

• Export Projects to the IDE Repository

172

What is the Repository?

• A series of relational database tables that store the results from the Informatica Data Explorer Product Suite

173

Repository Export

• The Repository Export dialog box enables you to export an IDE catalog to the Repository

• The Repository Export dialog box provides the ability to limit some of the data that is exported to the Repository

• Once in the Repository, the Catalog becomes available to a variety of DBMS tools, such as SQL, report generators, and so on

• All schemas in the Catalog will be exported to the Repository

IDE Repository Architecture

[Diagram: the IDE Client (Windows XP, 2000) works against a Project on the IDE Server (UNIX or Windows NT); the server exports through ODBC drivers to the Repository RDBMS, which may run on a separate UNIX or Windows NT machine.]

Exporting to Repository

176

Lab Exercise 8.1

177

Lesson 8 Review

• You control what information from a Project is included in the Export process

• The more you export, the longer the process will take

• Information exported to the IDE Repository becomes available to:
• Informatica Data Explorer Repository Navigator
• Report Writing tools
• SQL tools

178

Lesson 9

Using the Repository Navigator

179

Lesson 9 Objectives

• Understand use of the Repository Navigator

• Access the IDE Repository and browse its contents using the Navigator

• Explain Tags

• Understand how to share information among departments

180

IDE Repository Navigator

• A browser for the contents of the IDE Repository

• Can be used by anyone in your enterprise

[Diagram: the Repository holds knowledge about corporate systems – their structure, content, and quality.]

181

IDE Repository Architecture

[Diagram: the full architecture – the IDE Client, Source Profiler, and FTM/XML run on client workstations (Windows XP, 2000); the IDE Server and Project run on UNIX or Windows NT; the IDE Repository RDBMS, reached through ODBC drivers, can run on its own UNIX or Windows NT server; the Repository Navigator browses the IDE Repository directly.]

Schema Viewer

• The Schema Viewer functions similarly to the Navigation Tree in Informatica Data Explorer:
• You expand/contract objects
• You right-click to view properties
• The Schema Viewer provides users with the ability to query profiling information for Tables and Columns (Properties, Tags, Sample Data, Value Frequency Lists) within each schema

183

Exercise 9.1-9.3

184

The Link Viewer

• The Link Viewer shows links between any two schemas in the current project.

• Link Viewer uses:

• View Links between Columns

• Find information on compatibility problems

• Access Tags associated with Links

Link Viewer

185

Table Viewer

• Provides SQL access to the IDE Repository

• Has several pre-built SQL queries

• Allows you to run your own custom queries
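A custom query might, for example, list columns whose inferred type disagrees with the documented one. The repository table and column names below are invented for illustration; the real names are documented in the IDE Repository schema reference:

-- Hypothetical query against the IDE Repository
SELECT ProjectName, TableName, ColumnName,
       DocumentedType, InferredType
FROM REPO_COLUMN_PROFILE          -- hypothetical table name
WHERE DocumentedType <> InferredType;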

186

Exercise 9.4-9.6

187

Lesson 9 Review

• The IDE Repository provides:
• Rapid access to source data knowledge

• Team collaboration

• Enhanced communication

• Flexible ad hoc reporting

188

Lesson 10

Repository Reports

189

Lesson 10 Objectives

• Understand what IDE Repository Reports are

• Demonstrate how to use Repository Reports

• Create a report using a Crystal Reports template

• Export a report using Crystal Reports

190

What are Repository Reports

• IDE Repository Reports are a series of reports that provide specific management information from the IDE Repository.

• Reports are written with Crystal Reports.

191

Why use Crystal Reports?

• Provides a user interface to guide the design of reports against data stored in a relational database

• Can export data to other programs such as Excel, Word or HTML pages

• Provides the flexibility to create custom or ad hoc reports. The user is not limited to the reports provided in the Informatica Data Explorer Product Suite

• Accesses the IDE Repository through an ODBC connection

192

Report Templates

• A series of reports are provided as an easy means of obtaining documentation from the IDE Repository

• The Report Templates can be modified to meet individual needs

193

List of Reports

Column Profile - By File – Column Profiling results sorted by File
Column Profile - By Field – Column Profiling results sorted by Field
Null Rule Exceptions – List of Attributes with Null, Zero or Blanks
Value Frequency – Value Frequency Lists for Attributes
Supported Relationships – Inferred Dependencies for each Data Sample
Model Relationships – Dependencies that have been added to the Model
Overlapping Data – Redundancy Profiling Overlap Report
Notes – Note Tag Report
Action Items – Action Item Tag Report
Rules – Rule Tag Report
Transformations – Transformation Tag Report
Attribute Links – Reports Links between Attributes

194

Selection Criteria

• Allow users to select values for certain fields within the templates

• Limit the amount of data reported from the IDE Repository

• Each Template provides selection on ProjectName and SchemaName at a minimum

195

Exercises 10.1 – 10.4

196

Exporting Reports

• Crystal Reports provides an option to export report data into other file formats

• Useful for sharing data with individuals that do not have access to Crystal Reports

197

Exercises 10.5

198

Lesson 10 Review

• IDE Repository Reports provide reporting capability from the IDE Repository

• Additional reports can be created to meet business needs

199

Lesson 11

Integration with PowerCenter

200

PowerCenter Integration

• Informatica Data Explorer has the ability to share metadata with PowerCenter. This allows the business users to share knowledge that was found during the data discovery process with the PowerCenter developers.

• Objects that can be shared are:
• Source and target schemas

• Filters

• Expressions (transformation tags in IDE).

201

Create a Transformation Tag

202

Transformation Tag

203

Set Physical Properties

204

Export to Repository

205

Open Fixed Target Mapping (FTM)

206

Open Your Project

207

Export to PowerCenter

208

Import Object into PowerCenter

209

Open Customer in Source Analyzer

210

Open a new Transform

211

Open Ports Tab

212

Informatica Resources

213

Informatica – The Data Integration Company

Informatica provides data integration tools for both batch and real-time applications:

• Data Migration
• Data Synchronization
• Data Warehousing
• Data Hubs
• Business Activity Monitoring

214

Informatica – Company Information

• Founded in 1993
• Leader in enterprise solution products
• Headquarters in Redwood City, CA
• Public company since April 1999 (INFA)
• 2000+ customers, including over 80% of Fortune 100
• Strategic partnerships with IBM Global Services, HP, Accenture, SAP, and many others
• Technology partnership with Composite Software for Enterprise Information Integration (EII) – real-time federated views and reporting across multiple data sources
• Worldwide distribution

215

Informatica Affiliations

216

Informatica Resources

www.informatica.com – provides information (under Services) on:
• Professional Services
• Education Services

my.informatica.com – customers and contractual partners can sign up to access:
• Technical Support
• Product documentation (under Tools – online documentation)
• Velocity Methodology (under Services)
• Knowledgebase
• Mapping templates

devnet.informatica.com – sign up for the Informatica Developers Network:
• Discussion forums
• Web seminars
• Technical papers

217