Talend Presentation · 3 Fast Facts •Founded in 2006 •Open source •Almost 500 employees in 8...

Post on 06-Jul-2020

1

©2015 Talend Inc

Talend Presentation

2

Connecting the Data-Driven Enterprise

Data-Driven companies…

• 23 times greater customer acquisition

• 6 times greater customer retention

• 19 times more profitability

3

Fast Facts

• Founded in 2006

• Open source

• Almost 500 employees in 8 countries (50+ new in Q2’15)

• 1,700+ customers

• Raised over $100M from tier-one funds (incl. Silver Lake & Balderton Capital)

Talend Overview

Revenue

Growth

2007 2008 2009 2010 2011 2012 2013 2014

4

Unified Integration Platform

• Lowest Cost of Ownership

• Open, Standards Based

• Run in the Cloud, On-premises or Hybrid

• Big Data Leadership

5

Financial Services

IT Services

Media

Manufacturing

Retail and Consumer Goods

Travel and Transportation

Technology

A broad customer base across industries and segments

6

Market trends

Digital is transforming the way IT is consumed and designed

• Open and connected apps rather than centralized and siloed

• Information architecture has to be revisited for real time, big data, agility, etc.

• Cloud as a must, not an option

The integration market is hot

• Equipment rate in mid-market still low, demand is rising

• Disruptive trends like Big Data and Cloud drive new opportunities at large accounts

• Expertise matters, knowledge can be valued at high rates

Project revenues are shrinking; new growth drivers are needed

• From Big Bang to Land and Expand

• From Capex to Opex

• From implementation to build, run and extend

7

• Innovate with the Cloud and Big Data

• Roll-out new projects with MDM, BI/DI, and Application Integration

• Industrialize and standardize with shared services for Data Quality, Data and Application Integration

• Refresh your legacy systems

Why Change?

Market Start

Zenith of Industrialization

Commoditization

Talend enables you to position yourself at every step of your customer's cycle

Data Quality

Custom Coding to ETL

Custom solution to MDM

Ent. Standards for integration

iPaaS

Big Data

BI Relaunch

ETL

Offloading

Migrations

New Apps project With ESB

Master Data Mgmt

New BI project with ETL

Cloud

Re-platforming

8

Why now?

$2.1B

$3.9B

$5.5 B

Source: Gartner, IDC, Data Warehousing Institute

Data Integration and Quality

App. Integration

MDM

Big Data

REVENUE (BILLIONS)

iPaaS

• Market Growth: 13%

• Talend Growth: 33%

Talend MDM grew 86% in 2014

Talend in Gartner’s Magic Quadrant for MDM and in the Forrester Wave

Talend as a visionary in Gartner’s Magic Quadrant for On Premises Application Integration Suite

Talend Big Data grew 126% in 2014

Only native integration platform on Hadoop

31% market growth between 2012 and 2018

Launch of Talend Integration Cloud starting in March 2015

$0.4B

$0.3B

9

Why Talend?

• Ramp up and fuel your volume business

• Low learning curve, familiar to your teams

• Flexible pricing models across your different markets

• Open platform allows you to adapt to any customer context

• Nurture your customers for life

• Subscription-based licensing helps you build a sustainable relationship with your customer

• Large portfolio of products to expand over time across your customers’ processes and applications

• Innovate

• Get ready for Big Data

• Turn existing processes into real-time, data-driven ones

• Connecting The Enterprise

10

Talend’s proposition for VARs

[Diagram: Upsell to existing accounts, Open new accounts, Expand accounts, and Nurture accounts and drive recurring revenue, shown on the Talend Platform; product labels in the diagram: DI, BI, AI, CRM, Apps, DM, MDM, ESB, Big Data]

• Open new accounts by leveraging Talend’s brand and open source model

• Grow your project margin by embedding the integration layer as a building block of your projects

• Drive recurring revenue at your accounts through Talend’s subscription-based model

• Expand your customer’s share of wallet by leveraging Talend unified platform capabilities across integration domains

11

Evaluating the business benefits

12

About important Components

• Logs & Errors Components

• The Logs & Errors family of components allows you to log information about the execution of your Job. With the exception of tDie, these components play no functional part in the task-specific processing of your Job; however, they play an important part in debugging your Jobs and helping to ensure their smooth running.

• This article gives an overview of each of these components, providing a good understanding of where each of these may help. In later articles, we'll look at each of these components in more detail.

13

About important Components cont…

• tLogRow

• The tLogRow component allows you to write row data to the Job log file, or to the console window if you're running your Job from within Talend Studio.

• If you're running your Job from within Talend Studio, remember that writing a large volume of data to the console window can make Talend Studio very unresponsive.

14

About important Components cont…

• tAssert / tAssertCatcher

• This pair of components allows you to send and catch non-blocking trigger messages.

• tAssert sends the message, and tAssertCatcher catches it.

15

About important Components cont…

• tChronometerStart / tChronometerStop

• tChronometerStop records and displays the elapsed time since the start of a SubJob or an associated tChronometerStart.

• tChronometerStart Component Reference

• The tChronometerStart component is part of the Logs & Errors family of components and is used in conjunction with tChronometerStop.

• For more information on the usage of this component, read our tutorial on tChronometerStop.

• tChronometerStart Global Variables

• The following global variables are available for use within your Job (Flow).

• (Long) globalMap.get("tChronometerStart_1_STARTTIME")

16

• tChronometerStop Component Reference

• The tChronometerStop component is part of the Logs & Errors family of components.

• tChronometerStop records and displays the processing time of a SubJob. tChronometerStop may be used on its own, to record the processing time of a whole SubJob, or in conjunction with tChronometerStart to record from a specific SubJob.

• tChronometerStop Example

• The following example shows a simple Job with two tChronometerStop components, each recording a specific duration of execution time.

17

Chronometer demo displays the elapsed time since the start of a SubJob or an associated tChronometerStart

18

• tChronometerStop_1

• This component records and displays the elapsed time since the invocation of tChronometerStart_1.

19

• tChronometerStop_3

• This component records and displays the elapsed time since the beginning of the whole SubJob.

• Note: there appear to be some inconsistencies in Talend Studio's sequencing of the tChronometer components; hence the apparent jump in the sequences shown in this tutorial.

20

• tChronometerStop Basic Settings

• As well as specifying the Since options, you also have control over the presentation of the results.

• Execution Result

• The execution timings of this sample Job can be seen below.

• Starting job tChronometerStopExample at 18:10 28/08/2013.
[statistics] connecting to socket on port 3918
[statistics] connected
[ tChronometerStop_1 ] 3001 milliseconds
[ tChronometerStop_3 ] 10002 milliseconds
[statistics] disconnected
Job tChronometerStopExample ended at 18:10 28/08/2013. [exit code=0]

21

Log cont..

• tChronometerStop Global Variables

• The following global variables are available for use within your Job (Flow).

• (Long) globalMap.get("tChronometerStop_1_STOPTIME")

• (Long) globalMap.get("tChronometerStop_1_DURATION")

• tChronometerStart Global Variables

• The following global variables are available for use within your Job (Flow).

• (Long) globalMap.get("tChronometerStart_1_STARTTIME")
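As a rough sketch of how these variables might be read, for example from a tJava component, the snippet below casts each value to Long and formats it. Outside a Job nothing fills globalMap, so a plain java.util.HashMap stands in for it here and the values are made up for illustration. The key names are the ones listed above; the class name ChronoVarsDemo and the helper describe are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class ChronoVarsDemo {

    // Formats the chronometer values found in the map.
    // Keys follow the "<componentName>_<VARIABLE>" pattern listed above.
    static String describe(Map<String, Object> globalMap) {
        Long start = (Long) globalMap.get("tChronometerStart_1_STARTTIME");
        Long stop = (Long) globalMap.get("tChronometerStop_1_STOPTIME");
        Long duration = (Long) globalMap.get("tChronometerStop_1_DURATION");
        // In this sketch a missing key simply yields null, so guard before use.
        if (start == null || stop == null || duration == null) {
            return "chronometer values not available yet";
        }
        return "started=" + start + " stopped=" + stop + " duration=" + duration + " ms";
    }

    public static void main(String[] args) {
        // Stand-in for the globalMap that Talend provides inside a Job (assumption).
        Map<String, Object> globalMap = new HashMap<>();
        // Illustrative values only; in a real Job the components set these.
        globalMap.put("tChronometerStart_1_STARTTIME", 1_000L);
        globalMap.put("tChronometerStop_1_STOPTIME", 4_001L);
        globalMap.put("tChronometerStop_1_DURATION", 3_001L);
        System.out.println(describe(globalMap));
    }
}
```

The null check matters because globalMap.get returns null for a key that has not been set, and unboxing a null Long would throw a NullPointerException.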

22

• tDie

• This component sends a message to a tLogCatcher and allows you to terminate the Job, with a specified Exit Code, once the message has been processed.

23

• tFlowMeter / tFlowMeterCatcher

• This pair of components allows you to record and catch the data-flow metrics of your Job.

24

• tStatCatcher

• The tStatCatcher component catches statistics that are generated by a Job or by individual components. Statistics are always collected for Jobs; components must have tStatCatcher Statistics enabled (Advanced settings).

• The tStatCatcher demo displays the tStatCatcher output using a tLogRow component.

25

26

• tLogCatcher

• This component catches messages from tDie and tWarn.

27

28

29

Custom Code Components

• The Talend Custom Code components allow you to extend the functionality of Talend beyond the functionality that is available by simply connecting other components together. This may be basic functionality that is provided by the tJava component, through to more complex processing available through components such as tJavaFlex, tJavaRow and tLibraryLoad.

30

See inside each component for more details

31

tSetGlobalVar Component

• tSetGlobalVar Component

• The tSetGlobalVar Component is a convenient method for adding Global Variables to globalMap.

• In the following screenshot, you can see that a simple Job has been created to define two new Global Variables which are added to globalMap using tSetGlobalVar.

• This is equivalent to using a tJava component to make the following assignments.

• globalMap.put("myString", "Hello World!");
  globalMap.put("myInteger", 999);

• The tJava Component shown in this example simply prints the values of the newly created variables.

• System.out.println("myString=" + globalMap.get("myString"));
  System.out.println("myInteger=" + globalMap.get("myInteger"));
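Printing works without a cast, but using a value as a String or a number needs one, because the map stores values as Object. The sketch below shows that typed retrieval under stated assumptions: a plain java.util.HashMap stands in for Talend's globalMap, and the class name GlobalVarDemo and the helper greet are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class GlobalVarDemo {

    // Builds a message from the two variables, casting each to its stored type.
    static String greet(Map<String, Object> globalMap) {
        String text = (String) globalMap.get("myString");
        Integer number = (Integer) globalMap.get("myInteger");
        // Arithmetic is only possible after the cast to Integer.
        return text + " (" + (number + 1) + ")";
    }

    public static void main(String[] args) {
        // Stand-in for the globalMap available inside a Job (assumption).
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("myString", "Hello World!");
        globalMap.put("myInteger", 999); // autoboxed to Integer
        System.out.println(greet(globalMap)); // prints "Hello World! (1000)"
    }
}
```

A cast to the wrong type, for example (String) on myInteger, compiles but fails at run time with a ClassCastException, so the cast should match the type that was put into the map.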

32

tSetGlobalVar Component

33

Orchestration Components

• Orchestration Components

• The Talend Orchestration components allow you to control the behaviour and execution of your Job.

• tPreJob & tPostJob

• Add some pre and post-processing to your Jobs with this pair of Orchestration components.

• tRunJob

• Organise your Jobs using tRunJob. Modularise your Job or call reusable Jobs.

34

Processing Components

• Processing Components

• The Talend Processing components are the family of components that allow you to work with and transform your data.

• tBufferInput

• Read your buffered data, using the tBufferInput component.

• tBufferOutput

• Buffer your data for output, using the tBufferOutput component.

• tMap

• The tMap component is at the core of everything you do. Map, Join, Transform your data and more.

35

Talend null Handling

• Refer to the documentation

36

• Talend String Handling

• Talend Date Handling

• Talend Data Validation

• Talend Schema Reference

• Talend Java Tips

• Talend Routines Tutorial

• Talend Job Deployment & Scheduling

• Working with Databases

• Text Files

37

Talend Data Generation

• Talend Data Generation

• Talend provides the tRowGenerator component, for generating data. This component allows you to specify an arbitrary number of rows that should be generated, define a Schema, and then assign values to the columns that have been defined. Usually, random values are assigned, using the methods provided by Repository->Code->Routines->System->TalendDataGeneration; however, you may assign data using Routines of your choice.

• My preference is to use tRowGenerator for row generation and, perhaps, the assignment of a Primary Key; but to map the remainder of my data using one or more tMap components. This offers maximum flexibility.

• The methods provided by Talend's TalendDataGeneration Routines will give you a good start to your data generation needs; however, you may find them limiting. As part of this tutorial, I have built a set of Routines, TBEDataGeneration, which offer a greater breadth of data (for Address and Person). I will add to these, from time to time. These routines have a UK slant; however, you may modify these to suit your own needs.
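To illustrate the idea of a custom data-generation routine, here is a minimal sketch of a class with static methods that return random values. It is not the TBEDataGeneration routines mentioned above; the class name DemoDataGeneration, its methods, and the sample name lists are all hypothetical. In Talend Studio a routine would normally be a public class with public static methods so that expressions can call it.

```java
import java.util.Random;

class DemoDataGeneration {

    // Small sample lists, invented for this sketch.
    private static final String[] FIRST_NAMES = {"Alice", "Brian", "Chloe", "David"};
    private static final String[] LAST_NAMES = {"Smith", "Jones", "Taylor", "Brown"};

    // Returns one element of the given array, chosen with the supplied Random.
    static String pick(Random rnd, String[] values) {
        return values[rnd.nextInt(values.length)];
    }

    // Returns a "First Last" name assembled from the sample lists above.
    static String randomFullName(Random rnd) {
        return pick(rnd, FIRST_NAMES) + " " + pick(rnd, LAST_NAMES);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42L); // fixed seed gives repeatable test data
        for (int id = 1; id <= 3; id++) {
            // Sequential id as a simple primary key, random name as payload.
            System.out.println(id + ";" + randomFullName(rnd));
        }
    }
}
```

Passing the Random in, rather than creating one inside each method, lets a caller fix the seed and regenerate the same data set when a test needs to be repeated.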

38

Talend Performance

• Talend Performance

• As with building any software, performance (usually meaning speed of execution) is a key input to your design and development.

• Reducing the load on tMap by limiting the number of lookup components

• Using multithreading

• Using parallel execution

• Using tSortRow with the “Sort on disk” option

• Using tParallelize, which allows you to synchronize the execution of a subjob with the execution of other subjobs in your main Job. tParallelize helps you manage complex Job systems.
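The idea behind parallel execution with synchronization can be sketched in plain Java: start several independent tasks at once, then wait until all of them have finished before continuing. This is only a conceptual analogue, not the code Talend generates for tParallelize; the class name ParallelSubjobsDemo, the helper runAll, and the task names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSubjobsDemo {

    // Runs all tasks on a thread pool, waits for every one to finish,
    // and returns their results in submission order.
    static List<String> runAll(List<Callable<String>> subjobs) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, subjobs.size()));
        try {
            List<String> results = new ArrayList<>();
            // invokeAll blocks until all tasks have completed.
            for (Future<String> f : pool.invokeAll(subjobs)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a subjob failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> subjobs = new ArrayList<>();
        subjobs.add(() -> "load customers done");
        subjobs.add(() -> "load orders done");
        // Execution continues here only after both tasks have finished.
        System.out.println(runAll(subjobs));
    }
}
```

Parallelism only helps when the tasks are independent; two tasks that write to the same table or file may end up slower, or incorrect, when run side by side.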

39

Metadata

• A set of data that describes and gives information about other data.

• Metadata holds schema information so that connection and schema properties can be reused.

• The examples below illustrate this:

40

Metadata about DB connection

41

Metadata about file delimited

42

Metadata about XML

43

Metadata about .xlsx

44

Metadata about json file

45

Routine / User-defined function

• Talend provides the facility to create your own user-defined functions in Java code; once created, a function becomes available in expressions.
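As an illustration of what such a user-defined function might look like, the sketch below defines one static method that cleans up a string value and falls back to a default when the input is null or blank. The class name DemoStringRoutines and the method cleanUpper are hypothetical. In an expression it could then be called along the lines of DemoStringRoutines.cleanUpper(row1.name, "N/A"), where row1.name is an assumed column.

```java
import java.util.Locale;

class DemoStringRoutines {

    // Returns the trimmed, upper-cased input,
    // or the fallback when the input is null or blank.
    public static String cleanUpper(String input, String fallback) {
        if (input == null || input.trim().isEmpty()) {
            return fallback;
        }
        // Locale.ROOT keeps the result independent of the machine's locale.
        return input.trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(cleanUpper("  talend ", "N/A")); // prints "TALEND"
        System.out.println(cleanUpper(null, "N/A"));        // prints "N/A"
    }
}
```

Keeping the null check inside the function means every expression that uses it is null-safe without repeating the same test in each mapping.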

46

Routine / User-defined function

47

Routine / User-defined function

Talend CDC

OLAP (Online Analytical Processing) relies on the extraction and transportation of data from one or more databases into a target system or systems for analysis. But this involves the extraction and transportation of huge volumes of data and is very expensive in both resources and time.

The ability to capture only the changed source data and to move it from a source to a target system(s) in real time is known as Change Data Capture (CDC). Capturing changes reduces traffic across a network and thus helps reduce ETL time.

The CDC feature, introduced in Talend Studio, simplifies the process of identifying the change data since the last extraction. CDC in Talend Studio quickly identifies and captures data that has been added to, updated in, or removed from database tables and makes this change data available for future use by applications or individuals. The CDC feature is available for Oracle, MySQL, DB2, PostgreSQL, Sybase, MS SQL Server, Informix, Ingres, Teradata, and AS/400.

Warning

The CDC feature works only with database systems running on the same server.

Three different CDC modes are available in Talend Studio:

• Trigger: this mode is the default mode used by CDC components.

• Redo/Archive log: this mode is used with Oracle v11 and previous versions and AS/400.

• XStream: this mode is used only with Oracle v12 with OCI.

For detailed information on these three modes, see the following sections.

Trigger mode

This mode is available for the following databases: MySQL, Oracle, DB2, PostgreSQL, Sybase, MS SQL Server, Informix, Ingres, and Teradata.

The Trigger mode places a trigger that launches change data capture on every monitored source table. This, in turn, requires only minor modifications to the database structure.

With this mode, data extraction takes place at the same time the Insert, Update, or Delete operations occur in the source tables, and the change data is stored inside the database in change tables. The changed data, thus captured, is then made available to the target system(s) in a controlled manner, using subscriber views.

In Trigger mode, CDC can have only one publisher but many subscribers. CDC creates subscriber tables to control accessibility of the change table data by the target system(s). A target system is any application that wants to use the data captured from the source system.

The figure below shows the basic architecture of a CDC environment in Trigger mode in Talend Studio.

In this example, CDC monitors the changes made to a Product table. The changes are caught and published in a change table to which two subscribers have access: a CRM application and an Accounting application. These two systems fetch the changes and use them to update their data.
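The Product example can be modelled in a few lines of Java to make the roles concrete: a change table that records each operation, and named subscribers that each fetch only the changes they have not yet consumed. This is a conceptual simulation, not Talend's implementation or its actual table layout; the class CdcTriggerModeDemo, its methods, and the "I"/"U"/"D" operation labels are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CdcTriggerModeDemo {

    // One captured change: the operation type and the key of the affected row.
    static final class Change {
        final String operation; // "I", "U" or "D" (labels chosen for this sketch)
        final int productId;

        Change(String operation, int productId) {
            this.operation = operation;
            this.productId = productId;
        }

        @Override
        public String toString() {
            return operation + ":" + productId;
        }
    }

    // Stand-in for the change table kept inside the database.
    private final List<Change> changeTable = new ArrayList<>();
    // Per-subscriber index of the next change to deliver.
    private final Map<String, Integer> subscribers = new HashMap<>();

    // Registers a subscriber; only registered names may fetch changes.
    void subscribe(String name) {
        subscribers.putIfAbsent(name, 0);
    }

    // Plays the role of the trigger: records a change as it happens.
    void capture(String operation, int productId) {
        changeTable.add(new Change(operation, productId));
    }

    // Returns the changes this subscriber has not consumed yet and marks them consumed.
    List<Change> fetch(String name) {
        Integer from = subscribers.get(name);
        if (from == null) {
            throw new IllegalArgumentException("unknown subscriber: " + name);
        }
        List<Change> pending = new ArrayList<>(changeTable.subList(from, changeTable.size()));
        subscribers.put(name, changeTable.size());
        return pending;
    }

    public static void main(String[] args) {
        CdcTriggerModeDemo cdc = new CdcTriggerModeDemo();
        cdc.subscribe("CRM");
        cdc.subscribe("Accounting");
        cdc.capture("I", 1);
        cdc.capture("U", 1);
        System.out.println("CRM: " + cdc.fetch("CRM"));               // CRM: [I:1, U:1]
        cdc.capture("D", 1);
        System.out.println("CRM: " + cdc.fetch("CRM"));               // CRM: [D:1]
        System.out.println("Accounting: " + cdc.fetch("Accounting")); // Accounting: [I:1, U:1, D:1]
    }
}
```

Each subscriber keeps its own position, so the CRM application consuming a change does not hide it from the Accounting application. An unregistered name is rejected, which mirrors the controlled access the subscriber tables are meant to provide.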

CDC Redo/Archive log mode

The Redo/Archive log mode is only available for Oracle v11 and previous versions and AS/400 databases. It is equivalent to the archive log mode for Oracle and to the journal mode for AS/400.

In an Oracle database, a Redo log is a file which logs the history of changes made to data. In an AS/400 database, these changes are logged automatically in the database's internal logbook (journal). These changes include the insert, update and delete operations which data may undergo.

Redo/Archive log mode is less intrusive than Trigger mode because, in contrast to Trigger mode, it does not require modifications to the database structure.

When setting up Redo/Archive log mode for Oracle, only one subscriber can have access rights to the change table. This subscriber must be a database user who holds the subscription rights. Also, there is a subscription table which controls access to the subscriber change table. The subscription change table is a comprehensive, internal table which reflects the state of the Oracle database at the moment at which the Redo/Archive log option was activated.

When setting up this mode for AS/400, a save file, called fitcdc.savf and provided in your Studio, is restored on AS/400 and used to install a program called RUNCDC. When the subscriber views the changes made (View all changes) or consumes them for reuse (using a tAS400CDC component), the RUNCDC program reads and analyzes the logbook (journal) and the attached receiver from the source table and updates the change table accordingly. The AS/400 CDC Redo/Archive log mode (journal) creates subscription tables to prevent unauthorized target systems from accessing the data in the change tables. A target system means any application which tries to use data captured in the source system.

In this example, the CDC monitors the changes made to a Product table, thanks to the data contained in the database's logbook (journal). The CDC reads the logbook and records the changes which have been made to the data. These changes are collected and published in a table of changes to which two subscribers have access, a CRM application and an Accounting application. These two systems fetch the changes and use them to update their data.

XStream mode

XStream Out provides Oracle Database components and application programming interfaces that enable you to share data changes made to an Oracle database with other systems. It also provides a transaction-based interface for streaming the changes captured from the redo log of the Oracle database to client applications with an outbound server. An outbound server is an optional Oracle background process that sends data changes to a client application.

XStream In provides Oracle Database components and application programming interfaces that enable you to share data changes made to other systems with an Oracle database. It also provides a transaction-based interface for sending information to an Oracle database from client applications with an inbound server. An inbound server is an optional Oracle background process that receives data changes from a client application.

The XStream mode is only available for Oracle v12 with OCI in Talend Studio. For more information about the XStream mode, see http://docs.oracle.com/cd/E11882_01/server.112/e16545/toc.htm

CDC: a publish/subscribe principle

The CDC architecture is based on the publisher/subscriber model.

The publisher captures the change data and makes it available to the subscribers. The subscribers utilize the change data obtained from the publisher.

The main tasks performed by the publisher are:

• identifying the source tables from which the change data needs to be captured.

• capturing the change data and storing it in specially created change tables.

• allowing subscribers controlled access to the change data.

In Trigger mode, or the AS/400 Redo/Archive log mode (journal), the subscriber is a table that only lists the applications that have access rights to the change tables. In the Oracle Redo/Archive log mode, the subscriber is a user of the database. The subscriber may not be interested in all the data that is published by the publisher.

Setting up a CDC environment

The CDC feature is part of Talend Studio; you do not need to install any software other than Talend Studio to use CDC.

However, if you want to use CDC in Redo/Archive log mode for an Oracle database, you must first configure the database so that it generates the redo records that hold all insert, update or delete changes made in datafiles. For further information, see Prerequisites for the Oracle Redo/Archive log mode.

If you want to use CDC in Redo/Archive log mode for AS/400, you must verify that the prerequisites on your AS/400 are all met. For further information, see The prerequisites on AS/400.

Note

For the time being, CDC is only available in Java and is for Oracle, MySQL, DB2, PostgreSQL, Sybase, MS SQL Server, Informix, Ingres, and Teradata in Trigger mode, for Oracle and AS/400 databases in Redo/Archive log mode, and for Oracle in XStream mode.

Note

To set up a CDC environment you must understand the basics involved in designing a Job in Talend Studio, and particularly the definition of metadata items.

Note

When setting up a CDC environment, make sure that the database connection for CDC is on the same server as the source data for which changes are to be captured.

How to set up CDC in Trigger mode

The following two sections provide a two-step guide to setting up the CDC environment in Trigger mode in Talend Studio: the first step explains how to configure your system for CDC and the second step explains how to extract the modified data.

How to configure CDC in Trigger mode

Below are configuration steps that need to be set up just once for a given publisher/subscriber scenario.

STEP 1: SET UP A PUBLISHER

To set up a publisher, you need to:

1. Set up a database connection dedicated to CDC.

2. Set up a connection to the database where data is located.

Note: for instance, if you work with MS SQL Server, you must set the two connections to the same database but use two or more different schemas, and use the correct version of Talend Studio.

STEP 2: IDENTIFY THE SOURCE TABLE

Create the connection for the CRM schema in Talend, in the same way as any other connection, and create the customer table as mentioned above:

Create the connection for the DWH (data warehouse) schema in Talend, in the same way as any other connection, and create the customer table as mentioned above:

For monitoring, I have created a schema named Talend in MySQL.

The next step is to create the CDC in the CRM connection.

Next, click "Create subscription"; a pop-up containing a query will appear. Click "Execute".

Once the script is executed, this process will create a table in the Talend schema.

It has created the TSUBSCRIBERS table in the Talend schema.

 

Once you click Add CDC, the pop-up below appears. Change the Subscriber Name to the table name, "Customers", which is in the CRM schema.

 

Place your source file, create the Job, and run the Job.

Now the CDC entry has been marked in the TALEND schema, and the usual Insert, Update and Delete changes in the CRM schema are dumped into the DWH schema table.