Unstructured Data Transformation Overview


By PenchalaRaju.Yanamala

Transformation type: Active/Passive, Connected

The Unstructured Data transformation processes unstructured and semi-structured file formats, such as messaging formats, HTML pages, and PDF documents. It also transforms structured formats such as ACORD, HIPAA, HL7, EDI-X12, EDIFACT, AFP, and SWIFT.

The Unstructured Data transformation calls a Data Transformation service from a PowerCenter session. Data Transformation is the application that transforms the unstructured and semi-structured file formats. You can pass data from the Unstructured Data transformation to a Data Transformation service, transform the data, and return the transformed data to the pipeline.

Data Transformation has the following components:

- Data Transformation Studio. A visual editor to design and configure transformation projects.
- Data Transformation Service. A Data Transformation project that is deployed to the Data Transformation Repository and is ready to run.
- Data Transformation repository. A directory that stores executable services that you create in Data Transformation Studio. You can deploy projects to different repositories, such as repositories for test and production services.
- Data Transformation Engine. A processor that runs the services that you deploy to the repository.

When Data Transformation Engine runs a service, it writes the output data, or it returns output data to the Integration Service. When Data Transformation Engine returns output to the Integration Service, it returns XML data. You can configure the Unstructured Data transformation to return the XML in an output port, or you can configure output groups to return row data.


Configuring the Unstructured Data Option

The Unstructured Data transformation is installed with PowerCenter. Data Transformation has a separate installer. Install the Data Transformation Server and Client components after you install PowerCenter.

To install the Unstructured Data option, complete the following steps:

1. Install PowerCenter.

2. Install Data Transformation. For information about installing Data Transformation, see the Data Transformation Administrator Guide.

3. Configure the Data Transformation repository folder.

Configuring the Data Transformation Repository Directory

The Data Transformation repository contains executable Data Transformation services. When you install Data Transformation, the installation creates the following folder:

<Data_Transformation_install_dir>\ServiceDB

To configure a different repository folder location, open Data Transformation Configuration from the Windows Start menu. The repository location is in the following path in the Data Transformation Configuration:

CM Configuration > CM Repository > File System > Base Path

If Data Transformation Studio can access the remote file system, you can change the Data Transformation repository to a remote location and deploy services directly from Data Transformation Studio to the system that runs the Integration Service. For more information about deploying services to remote machines, see the Data Transformation Studio User Guide.

Copy custom files from the Data Transformation autoInclude\user or the externLibs\user directory to the autoInclude\user or externLibs\user directory on the machine that runs the Integration Service. For more information about these directories, see the Data Transformation Engine Developer Guide.
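The copy step can be scripted. The following Python sketch is only an illustration, not part of the product: the function name and both root paths are hypothetical, and it assumes the two user directories sit directly under each installation root.

```python
import shutil
from pathlib import Path


def copy_user_dirs(dt_home: Path, target_home: Path) -> list[Path]:
    """Copy autoInclude/user and externLibs/user from dt_home to target_home.

    Both roots are hypothetical placeholders for the local Data Transformation
    installation and the installation on the Integration Service machine.
    """
    copied = []
    for rel in (Path("autoInclude") / "user", Path("externLibs") / "user"):
        src = dt_home / rel
        if not src.is_dir():
            continue  # no custom files in this directory
        dst = target_home / rel
        # dirs_exist_ok lets the copy merge into an existing user directory.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        copied.append(dst)
    return copied
```

The function returns the destination directories that it wrote, so a deployment script can log what was copied.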

Data Transformation Service Types

When you create a project in Data Transformation Studio, you choose a Data Transformation service type to define the project. Data Transformation has the following types of services that transform data:

- Parser. Converts source documents to XML. The output of a parser is always XML. The input can have any format, such as text, HTML, Word, PDF, or HL7.
- Serializer. Converts an XML file to an output document of any format. The output of a serializer can be any format, such as a text document, an HTML document, or a PDF.
- Mapper. Converts an XML source document to another XML structure or schema. A mapper processes the XML input similarly to a serializer. It generates XML output similarly to a parser. The input and the output are fully structured XML.
- Transformer. Modifies the data in any format. Adds, removes, converts, or changes text. Use transformers with a parser, mapper, or serializer. You can also run a transformer as a stand-alone component.
- Streamer. Splits large input documents, such as multi-gigabyte data streams, into segments. The streamer processes documents that have multiple messages or records in them, such as HIPAA or EDI files.

For more information about creating projects with Data Transformation, see Getting Started with Data Transformation.

Unstructured Data Transformation Components

The Unstructured Data transformation contains the following tabs:

- Transformation. Enter the name and description of the transformation. The naming convention for an Unstructured Data transformation is UD_TransformationName. You can also make the Unstructured Data transformation reusable.
- Properties. Configure the Unstructured Data transformation general properties such as IsPartitionable and Output is Repeatable.
- UDT Settings. Modify Unstructured Data transformation settings such as input type, output type, and service name.
- UDT Ports. Configure Unstructured Data transformation ports and attributes.
- Relational Hierarchy. Define a hierarchy of output groups and ports to enable the Unstructured Data transformation to write rows to relational targets.

Properties Tab


Configure the Unstructured Data transformation general properties on the Properties tab.

The following table describes properties on the Properties tab that you can configure:

Property | Description

Tracing Level | The amount of detail included in the session log when you run a session containing this transformation. Default is Normal.

IsPartitionable | The transformation can run in more than one partition. Select one of the following options:
- No. The transformation cannot be partitioned.
- Locally. The transformation can be partitioned, but the Integration Service must run all partitions in the pipeline on the same node.
- Across Grid. The transformation can be partitioned, and the Integration Service can distribute each partition to different nodes.
Default is Across Grid.

Output is Repeatable | The order of the output data is consistent between session runs.
- Never. The order of the output data is inconsistent between session runs.
- Based On Input Order. The output order is consistent between session runs when the input data order is consistent between session runs.
- Always. The order of the output data is consistent between session runs even if the order of the input data is inconsistent between session runs.
Default is Never for active transformations. Default is Based On Input Order for passive transformations.

Output is Deterministic | Indicates whether the transformation generates consistent output data between session runs. Enable this property to perform recovery on sessions that use this transformation.

Warning: If you configure a transformation as repeatable and deterministic, it is your responsibility to ensure that the data is repeatable and deterministic. If you try to recover a session with transformations that do not produce the same data between the session and the recovery, the recovery process can result in corrupted data.

UDT Settings Tab

Configure the Unstructured Data transformation attributes on the UDT Settings tab.

The following table describes the attributes on the UDT settings tab:

Attribute | Description

InputType | Type of input data that the Unstructured Data transformation passes to Data Transformation Engine. Choose one of the following input types:
- Buffer. The Unstructured Data transformation receives source data in the InputBuffer port and passes data from the port to Data Transformation Engine.
- File. The Unstructured Data transformation receives a source file path in the InputBuffer port and passes the source file path to Data Transformation Engine. Data Transformation Engine opens the source file.

OutputType | Type of output data that the Unstructured Data transformation or Data Transformation Engine returns. Choose one of the following output types:
- Buffer. The Unstructured Data transformation returns XML data through the OutputBuffer port unless you configure a relational hierarchy of output ports. If you configure a relational hierarchy of ports, the Unstructured Data transformation does not write to the OutputBuffer port.
- File. Data Transformation Engine writes the output to a file. It does not return the data to the Unstructured Data transformation unless you configure a relational hierarchy of ports in the Unstructured Data transformation.
- Splitting. The Unstructured Data transformation splits a large XML output file into smaller files that can fit in the OutputBuffer port. You must pass the split XML files to the XML Parser transformation.

ServiceName | Name of the Data Transformation service to run. The service must be present in the local Data Transformation repository.

Streamer Chunk Size | Buffer size of the data that the Unstructured Data transformation passes to Data Transformation Engine when the Data Transformation service runs a streamer. Valid values are 1-1,000,000 KB. Default is 256 KB.

Dynamic Service Name | Run a different Data Transformation service for each input row. When Dynamic Service Name is enabled, the Unstructured Data transformation receives the service name in the Service Name input port. When Dynamic Service Name is disabled, the Unstructured Data transformation runs the same service for each input row. The Service Name attribute in the UDT Settings must contain a service name. Default is disabled.

Status Tracing Level | Set the level of status messages from the Data Transformation service.
- Description Only. Return a status code and a short description to indicate if the Data Transformation service was successful or if it failed.
- Full Status. Return a status code and a status message from the Data Transformation service in XML.
- None. Do not return status from the Data Transformation service.
Default is None.
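The constraints in the table above can be expressed as a small pre-flight check. The Python sketch below is an illustration only: the class, its field names, and the wording of the messages are hypothetical, and PowerCenter itself enforces these rules in the Designer. Only the allowed values, ranges, and defaults come from this document.

```python
from dataclasses import dataclass

INPUT_TYPES = {"Buffer", "File"}
OUTPUT_TYPES = {"Buffer", "File", "Splitting"}
STATUS_LEVELS = {"None", "Description Only", "Full Status"}


@dataclass
class UdtSettings:
    """Hypothetical model of the UDT Settings tab, with the documented defaults."""
    input_type: str = "Buffer"
    output_type: str = "Buffer"
    service_name: str = ""
    streamer_chunk_size_kb: int = 256
    dynamic_service_name: bool = False
    status_tracing_level: str = "None"

    def problems(self) -> list[str]:
        """Return a list of rule violations; an empty list means the settings pass."""
        found = []
        if self.input_type not in INPUT_TYPES:
            found.append(f"unknown input type: {self.input_type}")
        if self.output_type not in OUTPUT_TYPES:
            found.append(f"unknown output type: {self.output_type}")
        if not 1 <= self.streamer_chunk_size_kb <= 1_000_000:
            found.append("streamer chunk size must be 1-1,000,000 KB")
        if self.status_tracing_level not in STATUS_LEVELS:
            found.append(f"unknown status tracing level: {self.status_tracing_level}")
        if not self.dynamic_service_name and not self.service_name:
            found.append("service name is required when Dynamic Service Name is disabled")
        return found
```

For example, `UdtSettings(service_name="OrderParser").problems()` returns an empty list, where `OrderParser` is a made-up service name.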

Viewing Status Tracing Messages

You can view status messages from the Data Transformation service. Set the status tracing level to Description Only or Full Status. The Designer creates the UDT_Status_Code and UDT_Status_Message output ports in the Unstructured Data transformation.


When you choose Description Only, Data Transformation Engine returns a status code and one of the following status messages:

Status Code | Status Message
1 | Success
2 | Warning
3 | Failure
4 | Error
5 | Fatal Error

When you choose Full Status, Data Transformation Engine returns a status code and the error message from the Data Transformation service. The message is in XML format.
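A downstream script that reads the UDT_Status_Code values written by a session could translate them with a lookup like the one below. This is a minimal Python sketch: the code-to-message pairs are the Description Only values listed above, while the function name and the fallback text for unlisted codes are assumptions.

```python
# Status codes and messages returned for the Description Only tracing level.
STATUS_MESSAGES = {
    1: "Success",
    2: "Warning",
    3: "Failure",
    4: "Error",
    5: "Fatal Error",
}


def describe_status(code: int) -> str:
    """Return the documented message, or a fallback for codes not listed here."""
    return STATUS_MESSAGES.get(code, f"Unknown status code {code}")
```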

Unstructured Data Transformation Ports

When you create an Unstructured Data transformation, the Designer creates default ports. It creates other ports based on how you configure the transformation. The Unstructured Data transformation input and output types determine how the Unstructured Data transformation passes data to and receives data from Data Transformation Engine.

Table 27-1 describes the Unstructured Data transformation default ports:

Table 27-1. Unstructured Data Transformation Default Ports

Port | Input/Output | Description
InputBuffer | Input | Receives source data when the input type is buffer. Receives a source file name and path when the input type is file.
OutputBuffer | Output | Returns XML data when the output type is buffer. Returns the output file name when the output type is file. Returns no data when you configure hierarchical output groups of ports.

Table 27-2 describes other Unstructured Data transformation ports that the Designer creates when you configure the transformation:

Table 27-2. Unstructured Data Transformation Other Ports

Port | Input/Output | Description
OutputFileName | Input | Receives a name for an output file when the output type is file.
ServiceName | Input | Receives the name of a Data Transformation service when you enable Dynamic Service Name.
UDT_Status_Code | Output | Returns a status code from Data Transformation Engine when the status tracing level is Description Only or Full Status.
UDT_Status_Message | Output | Returns a status message from Data Transformation Engine when the status tracing level is Description Only or Full Status.

Note: You can add groups of output ports for relational targets on the Relational Hierarchy tab. When you configure groups of ports, a message appears on the UDT Ports tab that says hierarchical groups and ports are defined on another tab.

Ports by Input and Output Type

The input type determines whether the Integration Service passes source data or a source file path to Data Transformation Engine.

Configure one of the following input types:

- Buffer. The Unstructured Data transformation receives source data in the InputBuffer port. The Integration Service passes source rows from the InputBuffer port to Data Transformation Engine.
- File. The Unstructured Data transformation receives the source file path in the InputBuffer port. The Integration Service passes the source file path to Data Transformation Engine. Data Transformation Engine opens the source file. Use the file input type to parse binary files such as Microsoft Excel or Microsoft Word files.

If you do not define output groups and ports, the Unstructured Data transformation returns data based on the output type.

Configure one of the following output types:

- Buffer. The Unstructured Data transformation returns XML through the OutputBuffer port. You must connect an XML Parser transformation to the OutputBuffer port.
- File. Data Transformation Engine writes the output file instead of passing data to the Integration Service. Data Transformation Engine names the output file based on the file name from the OutputFileName port. Choose the File output type to transform XML to binary data such as a PDF file or a Microsoft Excel file. The Integration Service returns the output file name in the OutputBuffer port for each source row. If the output file name is blank, the Integration Service returns a row error. When an error occurs, the Integration Service writes a null value to the OutputBuffer and returns a row error.
- Splitting. The Unstructured Data transformation splits XML data from Data Transformation Engine into multiple segments. Choose the Splitting output type when the Unstructured Data transformation returns XML files that are too large for the OutputBuffer port. When you configure Splitting output, pass the XML data to the XML Parser transformation. Configure the XML Parser transformation to process the multiple XML rows as one XML file.
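The Buffer and File behavior described above can be summarized in a few lines of Python. This is a sketch of the documented rules only, not Integration Service code: the function, its parameters, and the error text are hypothetical, and the Splitting type is left out because it returns several segments per source row.

```python
from typing import Optional, Tuple


def output_buffer_value(
    output_type: str,
    xml: str,
    output_file_name: str,
    has_output_groups: bool = False,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (OutputBuffer value, row error) for one source row."""
    if has_output_groups:
        # Rows go to the group ports; OutputBuffer carries no data.
        return None, None
    if output_type == "Buffer":
        return xml, None
    if output_type == "File":
        if not output_file_name:
            # Blank file name: null in OutputBuffer plus a row error.
            return None, "row error: output file name is blank"
        return output_file_name, None
    raise ValueError(f"output type not covered by this sketch: {output_type}")
```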

Adding Ports

A Data Transformation service might require multiple input files, file names, and parameters. It can return multiple output files. When you create an Unstructured Data transformation, the Designer creates one InputBuffer port and one OutputBuffer port. If you need to pass additional files or file names between the Unstructured Data transformation and Data Transformation Engine, add the input or output ports. You can add ports manually or from the Data Transformation service.

The following table describes the ports you can create on the UDT Ports tab:

Port Type | Input/Output | Description
Additional Input (buffer) | Input | Receives input data to pass to Data Transformation Engine.
Additional Input (file) | Input | Receives the file name and path for Data Transformation Engine to open.
Service Parameter | Input | Receives an input parameter for a Data Transformation service.
Additional Output (buffer) | Output | Receives XML data from Data Transformation Engine.
Additional Output (file) | Output | Receives an output file name from Data Transformation Engine.
Pass-through | Input/Output | Passes data through the Unstructured Data transformation without changing it.

Creating Ports From a Data Transformation Service

A Data Transformation service can require input parameters, additional input files, or user-defined variables. The service might return more than one output file to the Unstructured Data transformation. You can add ports that pass parameters, additional input files, and additional output files. The Designer creates ports that correspond to the ports in the Data Transformation service.

Note: You must configure a service name to populate ports from a service.

To create ports based on a Data Transformation service:

1. Click the Ports tab on the Unstructured Data transformation.

2. Click Populate From Service. The Designer displays the service parameters, additional input, and additional output port requirements from the Data Transformation service. Service parameters include Data Transformation system variables and user-defined variables.

3. Select the ports to create and configure each port as a buffer port or file port.

4. Click Populate to create the ports that you select. You can select all ports that appear

Defining a Service Name

When you create an Unstructured Data transformation, the Designer displays a list of the Data Transformation services that are in the Data Transformation repository. Choose the name of a Data Transformation service that you want to call from the Unstructured Data transformation. You can change the service name after you create the transformation. The service name appears on the UDT Settings tab.

To run a different Data Transformation service for each source row, enable the Dynamic Service Name attribute. Pass the service name with each source row.


The Designer creates the ServiceName input port when you enable dynamic service names.

When you enable dynamic service names, you cannot create ports from a Data Transformation service.

Relational Hierarchies

To pass row data to relational tables or other targets, configure output ports on the Relational Hierarchy tab. You can define groups of ports and define a relational structure for the groups.

When you configure output groups, the output groups represent the relational tables or the targets that you want to pass the output data to. Data Transformation Engine returns rows to the group ports instead of writing an XML file to the OutputBuffer port. The transformation writes rows based on the output type.

Create a hierarchy of groups in the left pane of the Relational Hierarchy tab. All groups are under the root group called PC_XSD_ROOT. You cannot delete the root. Each group can contain ports and other groups. The group structure represents the relationship between target tables. When you define a group within a group, you define a parent-child relationship between the groups. The Designer defines a primary key-foreign key relationship between the groups with a generated key.

Select a group to display the ports for the group. You can add or delete ports in the group. When you add a port, the Designer creates a default port configuration. Change the port name, datatype, and precision. If the port must contain data, select Not Null. Otherwise, the output data is optional.

When you view the Unstructured Data transformation in the workspace, each port in a transformation group has a prefix that contains the group name.

When you delete a group, you delete the ports in the group and the child groups.
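The parent-child relationship between groups can be pictured as flattening nested XML into two row sets that share a generated key. The Python sketch below is illustrative only: the element names, the key column names, and the sample data are invented, and the Designer's actual key generation is not described in this document.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical XML with an order header group that contains order detail groups.
SAMPLE = """<PC_XSD_ROOT>
  <OrderHeader><OrderId>A1</OrderId>
    <OrderDetail><Item>pen</Item></OrderDetail>
    <OrderDetail><Item>ink</Item></OrderDetail>
  </OrderHeader>
  <OrderHeader><OrderId>B2</OrderId>
    <OrderDetail><Item>pad</Item></OrderDetail>
  </OrderHeader>
</PC_XSD_ROOT>"""


def flatten(xml_text: str):
    """Return (header_rows, detail_rows) linked by a generated key."""
    keys = itertools.count(1)  # stand-in for the generated primary key
    headers, details = [], []
    for header in ET.fromstring(xml_text).findall("OrderHeader"):
        pk = next(keys)
        headers.append({"PK_OrderHeader": pk, "OrderId": header.findtext("OrderId")})
        for detail in header.findall("OrderDetail"):
            # Each child row carries the parent's key as its foreign key.
            details.append({"FK_OrderHeader": pk, "Item": detail.findtext("Item")})
    return headers, details
```

Calling `flatten(SAMPLE)` yields two header rows and three detail rows, with each detail row pointing at the header it was nested under.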

Exporting the Hierarchy Schema

When you define hierarchical output groups in the Unstructured Data transformation, you must define the same structure in the Data Transformation project that you create to transform the data. Export the hierarchy structure as an XML schema file from the Unstructured Data transformation. Import the schema to your Data Transformation project. You can then map the content of a source document to the XML elements and attributes in the Data Transformation project.

To export the group hierarchy from the Relational Hierarchy tab, click Export to XML Schema. Choose a name and a location for the .xsd file. Choose a location that you can access when you import the schema with Data Transformation Studio.

The Designer creates an XML schema file with the following namespace:

"www.informatica.com/UDT/XSD/<mappingName_<Transformation_Name>>"


The schema includes the following comment:

<!-- ===== AUTO-GENERATED FILE - DO NOT EDIT ===== -->

<!-- ===== This file has been generated by Informatica PowerCenter ===== -->

If you modify the schema, the Data Transformation Engine might return data that is not in the same format as the output ports in the Unstructured Data transformation.

The XML elements in the schema represent the output ports in the hierarchy. Columns that can contain null values have the XML attributes minOccurs=0 and maxOccurs=1.
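To show what the nullable-column rule means in XML Schema terms, the Python sketch below builds a single element declaration. It does not reproduce the Designer's exported file: the function name, column names, and types are hypothetical. In XML Schema, an element with no occurrence attributes defaults to minOccurs=1 and maxOccurs=1, so it is required.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def column_element(name: str, xsd_type: str, nullable: bool) -> ET.Element:
    """Build an xs:element declaration for one output port (illustrative only)."""
    attrs = {"name": name, "type": xsd_type}
    if nullable:
        # Optional element: may be absent, and appears at most once.
        attrs["minOccurs"] = "0"
        attrs["maxOccurs"] = "1"
    return ET.Element(f"{{{XS}}}element", attrs)
```

A port marked Not Null would map to a declaration without these attributes under this sketch's assumptions.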

Mappings

When you create a mapping, design it according to the type of Data Transformation project you are going to run. For example, the Data Transformation Parser and Mapper generate XML data. You can configure the Unstructured Data transformation to return rows from the XML data or you can configure it to return an XML file.

The Data Transformation Serializer component can generate any output from XML. It can generate HTML or binary files such as Microsoft Word or Microsoft Excel. When the output is binary data, Data Transformation Engine writes the output to a file instead of passing it back to the Unstructured Data transformation.

The following examples show how to configure mappings with an Unstructured Data transformation.

Parsing Word Documents for Relational Tables

You can extract order information from a Microsoft Word document and write the order information to an order header table and an order detail table. Configure an Unstructured Data transformation to call a Data Transformation parser service and pass the name of each Word document to parse. The Data Transformation Engine opens the Word document, parses it, and returns the rows to the Unstructured Data transformation. The Unstructured Data transformation passes the order header and order details to the relational targets.

The mapping has the following objects:

- Source Qualifier transformation. Passes each Microsoft Word file name to the Unstructured Data transformation. The source file name contains the complete path to the file that contains order information.
- Unstructured Data transformation. The input type is file. The output type is buffer. The transformation contains an order header output group and an order detail output group. The groups have a primary key-foreign key relationship. The Unstructured Data transformation receives the source file name in the InputBuffer port. It passes the name to Data Transformation Engine. Data Transformation Engine runs a parser service to extract the order header and order detail rows from the Word document. Data Transformation Engine returns the data to the Unstructured Data transformation. The Unstructured Data transformation passes data from the order header group and order detail group to the relational targets.
- Relational targets. Receive the rows from the Unstructured Data transformation.

Creating an Excel Sheet from XML

You can extract employee names and addresses from an XML file and create a Microsoft Excel sheet with the list of names.

The mapping has the following components:

- XML source file. Contains employee names and addresses.
- Source Qualifier transformation. Passes XML data and an output file name to the Unstructured Data transformation. The XML file contains employee names.
- Unstructured Data transformation. The input type is buffer and the output type is file. The Unstructured Data transformation receives the XML data in the InputBuffer port and the file name in the OutputFileName port. It passes the XML data and the file name to Data Transformation Engine. Data Transformation Engine runs a serializer service to transform the XML data to a Microsoft Excel file. It writes the Excel file with a file name based on the value of OutputFileName. The Unstructured Data transformation receives only the output file name from Data Transformation Engine. The Unstructured Data transformation OutputBuffer port returns the value of OutputFileName.
- Flat file target. Receives the output file name.

Parsing Word Documents and Returning a Split XML File

The Data Transformation Parser and Mapper components can transform data from any format and generate XML data. When the XML data is large, you can split the XML into segments and pass the segments to an XML Parser transformation. The XML Parser transformation receives the segments and processes the XML data as one document.

When you configure the Unstructured Data transformation to split XML output, the Unstructured Data transformation returns XML based on the OutputBuffer port size. If the XML file size is greater than the output port precision, the Integration Service divides the XML into files equal to or less than the port size. The XML Parser transformation parses the XML and passes the rows to relational tables or other targets.
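The splitting behavior amounts to cutting one XML document into consecutive segments no larger than the port size, then treating the segments as one document downstream. The Python sketch below illustrates that idea under simplifying assumptions: it measures size in characters, whereas the product's unit for port precision is not stated here, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def split_segments(xml_text: str, port_size: int) -> list[str]:
    """Cut xml_text into consecutive pieces of at most port_size characters."""
    if port_size <= 0:
        raise ValueError("port_size must be positive")
    return [xml_text[i:i + port_size] for i in range(0, len(xml_text), port_size)]


def reassemble(segments: list[str]) -> ET.Element:
    """Join the segments in order and parse them as a single XML document."""
    return ET.fromstring("".join(segments))
```

Individual segments are usually not well-formed XML on their own, which is why the receiving side must process them together as one document.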

For example, you can extract the order header and detail information from Microsoft Word documents with a Data Transformation parser service.

The mapping has the following components:

- Source Qualifier transformation. Passes the Word document file name to the Unstructured Data transformation. The source file name contains the complete path to the file that contains order information.
- Unstructured Data transformation. The input type is file. The output type is splitting. The Unstructured Data transformation receives the source file name in the InputBuffer port. It passes the file name to Data Transformation Engine. Data Transformation Engine opens the source file, parses it, and returns XML data to the Unstructured Data transformation. The Unstructured Data transformation receives the XML data, splits the XML file into smaller files, and passes the segments to an XML Parser transformation. The Unstructured Data transformation returns data in segments less than the OutputBuffer port size. When the transformation returns XML data in multiple segments, it generates the same pass-through data for each row. The Unstructured Data transformation returns data in pass-through ports when a row is successful or not successful.
- XML Parser transformation. The Enable Input Streaming session property is enabled. The XML Parser transformation receives the XML data in the DataInput port. The input data is split into segments. The XML Parser transformation parses the XML data into order header and detail rows. It passes order header and detail rows to relational targets. It returns the pass-through data to a Filter transformation.
- Filter transformation. Removes the duplicate pass-through data before passing it to the relational targets.
- Relational targets. Receive data from each group in the XML Parser transformation and the Filter transformation.
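Because every segment of a split document carries the same pass-through values, the Filter step keeps one copy per source document. The Python sketch below shows that idea only. How the actual Filter condition identifies duplicates is not described in this document, so the `doc_key` field and the function name are assumptions.

```python
from typing import Iterable, Iterator, Tuple


def first_pass_through(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield the first (doc_key, pass_through) pair seen for each doc_key."""
    seen = set()
    for doc_key, pass_through in rows:
        if doc_key in seen:
            continue  # duplicate generated for a later segment of the same document
        seen.add(doc_key)
        yield doc_key, pass_through
```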

Rules and Guidelines

Use the following rules and guidelines when you create an unstructured data mapping:

- When you configure hierarchical groups of output ports, the Integration Service writes to the groups of ports instead of writing to the OutputBuffer port. The Integration Service writes to the groups of ports regardless of the output type you define for the transformation.
- If an Unstructured Data transformation has the File output type, and you have not defined group output ports, you must link the OutputBuffer port to a downstream transformation. Otherwise, the mapping is invalid. The OutputBuffer port contains the output file name when the Data Transformation service writes the output file.
- Enable Dynamic Service Name to pass a service name to the Unstructured Data transformation in the Service Name input port. When you enable Dynamic Service Name, the Designer creates the Service Name input port.
- You must configure a service name with the Unstructured Data transformation or enable the Dynamic Service Name option. Otherwise the mapping is invalid.
- Link XML output from the Unstructured Data transformation to an XML Parser transformation.
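These guidelines can be read as a checklist. The Python sketch below encodes them as a function that returns the rules a planned mapping would break. It is an illustration, not the Designer's validation logic: the function, its parameters, and the message text are hypothetical.

```python
from typing import List, Optional


def mapping_problems(
    output_type: str,
    has_output_groups: bool,
    output_buffer_linked_to: Optional[str],
    service_name: str,
    dynamic_service_name: bool,
) -> List[str]:
    """Return the guidelines that a planned mapping violates."""
    problems = []
    if not service_name and not dynamic_service_name:
        problems.append("configure a service name or enable Dynamic Service Name")
    if not has_output_groups:
        # With output groups, rows go to the group ports and these checks do not apply.
        if output_type == "File" and output_buffer_linked_to is None:
            problems.append("link OutputBuffer to a downstream transformation")
        if output_type in ("Buffer", "Splitting") and output_buffer_linked_to != "XML Parser":
            problems.append("link XML output to an XML Parser transformation")
    return problems
```

An empty list means the mapping satisfies the guidelines as modeled here.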

Steps to Create an Unstructured Data Transformation

Create an Unstructured Data transformation in the PowerCenter Transformation Developer or the Mapping Designer.

To create an Unstructured Data transformation:

1. In the Mapping Designer or Transformation Developer, click Transformation > Create.

2. Select Unstructured Data Transformation as the transformation type.

3. Enter a name for the transformation.

4. Click Create. The Unstructured Data Transformation dialog box appears.

5. Configure the following properties:

Property | Description
Service Name | Name of the Data Transformation service you want to use. The Designer displays the Data Transformation services in the Data Transformation repository folder. Do not choose a name if you plan to enable dynamic service names. You can add a service name on the UDT Settings tab after you create the transformation.
Input Type | Describes how Data Transformation Engine receives input data. Default is Buffer.
Output Type | Describes how Data Transformation Engine returns output data. Default is Buffer.

6. Click OK.

7. You can change the service name, input, and output type on the UDT Settings tab.

8. Configure the Unstructured Data transformation properties on the Properties tab.

9. If the Data Transformation service has more than one input or output file, or if it requires input parameters, you can add ports on the UDT Ports tab. You can also add pass-through ports on the Ports tab.

10. If you want to return row data from the Unstructured Data transformation instead of XML data, create groups of output ports on the Relational Hierarchy tab.

11. If you create groups of ports, export the schema that describes them from the Relational Hierarchy tab.

12. Import the schema to the Data Transformation project to define the project output