50571561-isas-etl-final

70
BHAVANI.P SUBHASHINI.V PUNNIYAA

Transcript of 50571561-isas-etl-final

Page 1: 50571561-isas-etl-final

BHAVANI.P SUBHASHINI.V

PUNNIYAA

Page 2: 50571561-isas-etl-final

Introduction

Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:

• Extracting data from outside sources
• Transforming it to fit operational needs (which can include quality levels)
• Loading it into the end target (database or data warehouse)

Page 3: 50571561-isas-etl-final
Page 4: 50571561-isas-etl-final

Extract

Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even fetching from outside sources such as through web spidering or screen-scraping.

An intrinsic part of extraction is parsing the extracted data to check whether it meets an expected pattern or structure.
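As a rough illustration of the parsing check described above, the sketch below reads a flat file and separates rows that match an expected structure from rows that do not. The column names (`customer_id`, `amount`) and the helper `parse_rows` are hypothetical and are not taken from the slides.

```python
import csv
import io

# Hypothetical layout of the extracted flat file (an assumption for this sketch).
EXPECTED_COLUMNS = ["customer_id", "amount"]


def parse_rows(text):
    """Split extracted CSV text into (valid, rejected) rows."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    valid, rejected = [], []
    for row in reader:
        try:
            parsed = {
                "customer_id": int(row["customer_id"]),
                "amount": float(row["amount"]),
            }
        except (TypeError, ValueError):
            # The row does not meet the expected pattern; keep it for review.
            rejected.append(row)
        else:
            valid.append(parsed)
    return valid, rejected


sample = "customer_id,amount\n1,9.50\nabc,3.00\n2,oops\n"
valid, rejected = parse_rows(sample)
```

Here rejected rows are kept rather than dropped, so a later audit step can report on them.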

Page 5: 50571561-isas-etl-final

Transform

• Selecting only certain columns to load
• Translating coded values
• Encoding free-form values
• Deriving a new calculated value
• Filtering
• Sorting
• Joining data from multiple sources
• Aggregation
• Generating surrogate-key values
• Transposing or pivoting
• Splitting a column into multiple columns
• Dis-aggregation of repeating columns into a separate detail table
• Lookup and validate the relevant data from tables or referential files for slowly changing dimensions
• Applying any form of simple or complex data validation
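A few of the transformation types listed above can be sketched in a single pass over the rows. This is only a minimal illustration: the code table `GENDER_CODES`, the field names, and the `transform` helper are assumptions made for the example.

```python
from itertools import count

# Hypothetical code table used to translate coded values.
GENDER_CODES = {"M": "Male", "F": "Female"}


def transform(rows):
    """Filter rows, translate codes, derive a value, and assign surrogate keys."""
    surrogate_keys = count(start=1)
    out = []
    for row in rows:
        if row["quantity"] <= 0:  # filtering
            continue
        out.append({
            "sale_key": next(surrogate_keys),  # generated surrogate key
            "gender": GENDER_CODES.get(row["gender"], "Unknown"),  # translated code
            "total": row["quantity"] * row["unit_price"],  # derived calculated value
        })
    return out


rows = [
    {"gender": "M", "quantity": 2, "unit_price": 5.0},
    {"gender": "F", "quantity": 0, "unit_price": 3.0},
    {"gender": "X", "quantity": 1, "unit_price": 4.0},
]
transformed = transform(rows)
```

Surrogate keys are assigned only to rows that survive the filter, so the key sequence has no gaps.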

Page 6: 50571561-isas-etl-final

Load

The load phase loads the data into the end target, usually the data warehouse (DW).

The timing and scope of replacing or appending data are strategic design choices that depend on the time available and the business needs.

Because the load phase interacts with a database, the constraints defined in the database schema, as well as triggers activated upon data load, apply.
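The replace-or-append choice and the effect of schema constraints can be sketched with Python's built-in `sqlite3` module. The `sales` table, its constraints, and the `load` helper are hypothetical; a real warehouse load would normally use the target database's bulk loader instead of row-by-row inserts.

```python
import sqlite3


def load(conn, rows, mode="append"):
    """Load rows into the target table and return the rows the schema rejected."""
    if mode == "replace":
        conn.execute("DELETE FROM sales")  # replace: clear the target first
    rejected = []
    for row in rows:
        try:
            conn.execute("INSERT INTO sales (sale_id, amount) VALUES (?, ?)", row)
        except sqlite3.IntegrityError:
            # A constraint defined in the schema refused this row.
            rejected.append(row)
    conn.commit()
    return rejected


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        amount  REAL NOT NULL CHECK (amount >= 0)
    )
    """
)
first = load(conn, [(1, 10.0), (2, -5.0)])   # (2, -5.0) violates the CHECK
second = load(conn, [(1, 99.0), (3, 7.5)])   # (1, 99.0) violates the primary key
```

Calling `load(conn, rows, mode="replace")` empties the table before loading, which corresponds to a full refresh instead of an incremental append.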

Page 7: 50571561-isas-etl-final

ETL Cycle

The typical real-life ETL cycle consists of the following execution steps:

• Cycle initiation
• Build reference data
• Extract (from sources)
• Validate
• Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
• Stage (load into staging tables, if used)
• Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair)
• Publish (to target tables)
• Archive
• Clean up
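The core of the cycle above (extract, validate, transform, stage, publish, with an audit trail) can be sketched as a toy in-memory function. The `run_cycle` name, the `country` field, and the validation rule are assumptions chosen only to keep the example short.

```python
def run_cycle(source_rows):
    """Run a toy ETL cycle and return (published_rows, audit_log)."""
    audit = []

    extracted = list(source_rows)  # Extract
    audit.append(("extract", len(extracted)))

    # Validate: keep only rows that have a non-empty country value.
    validated = [r for r in extracted if r.get("country")]
    audit.append(("validate", len(validated)))

    # Transform: clean the country code.
    transformed = [{**r, "country": r["country"].strip().upper()} for r in validated]
    audit.append(("transform", len(transformed)))

    staging = list(transformed)  # Stage
    published = list(staging)    # Publish
    audit.append(("publish", len(published)))
    return published, audit


published, audit = run_cycle([
    {"id": 1, "country": " us "},
    {"id": 2, "country": None},
    {"id": 3, "country": "de"},
])
```

The audit log records row counts per step, which is the kind of information an audit report could use to diagnose a failed run.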

Page 8: 50571561-isas-etl-final

Challenges

ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems.

The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified.

Data warehouses are typically assembled from a variety of data sources with different formats and purposes.

Design analysts should establish the scalability of an ETL system across the lifetime of its usage.

The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time.

Page 9: 50571561-isas-etl-final

Performance

• Use the Direct Path Extract method or bulk unload whenever possible (instead of querying the database) to reduce the load on the source system while getting a high-speed extract.
• Do most of the transformation processing outside of the database.
• Use bulk load operations whenever possible.

Still, even using bulk operations, database access is usually the bottleneck in the ETL process.

• Partition tables (and indices). Try to keep partitions similar in size (watch for null values, which can skew the partitioning).
• Do all validation in the ETL layer before the load. Disable integrity checking in the target database tables during the load.
• Disable triggers in the target database tables during the load. Simulate their effect as a separate step.
• Generate IDs in the ETL layer.
• Drop the indexes (on a table or partition) before the load, and recreate them after the load.
• Use parallel bulk load when possible.
• If a requirement exists to do insertions, updates, or deletions, find out in the ETL layer which rows should be processed in which way, and then process these three operations in the database separately.
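The last point, deciding in the ETL layer which rows are inserts, updates, or deletes, can be sketched by comparing a source snapshot with the current target contents. The `classify_changes` helper and the dictionary-based snapshots are hypothetical; they assume both sides fit in memory and share a primary key.

```python
def classify_changes(source, target):
    """Compare snapshots keyed by primary key and split the work into three sets."""
    inserts = {k: v for k, v in source.items() if k not in target}
    updates = {k: v for k, v in source.items() if k in target and target[k] != v}
    deletes = [k for k in target if k not in source]
    return inserts, updates, deletes


source = {1: "alice", 2: "bob-renamed", 4: "dana"}
target = {1: "alice", 2: "bob", 3: "carol"}
inserts, updates, deletes = classify_changes(source, target)
```

Each of the three results can then be sent to the database as its own bulk operation.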

Page 10: 50571561-isas-etl-final

Parallel Processing

Sources Central ETL layer Targets

ETL applications implement three main types of parallelism:

Data: By splitting a single sequential file into smaller data files to provide parallel access.

Pipeline: Allowing the simultaneous running of several components on the same data stream. For example: looking up a value on record 1 at the same time as adding two fields on record 2.

Component: The simultaneous running of multiple processes on different data streams in the same job, for example, sorting one input file while removing duplicates on another file.
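Data parallelism, the first type above, can be sketched by splitting the input into chunks and cleaning each chunk on its own worker. The helpers `split`, `clean_chunk`, and `parallel_clean` are hypothetical, and threads are used only to keep the sketch simple; CPU-bound work would more likely use separate processes.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, parts):
    """Split a list into at most `parts` chunks of roughly equal size."""
    if not records:
        return []
    size = -(-len(records) // parts)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def clean_chunk(chunk):
    """Stand-in for a per-record transformation."""
    return [value.strip().upper() for value in chunk]


def parallel_clean(records, workers=3):
    """Clean chunks in parallel and reassemble them in the original order."""
    chunks = split(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(clean_chunk, chunks))  # map preserves chunk order
    return [value for chunk in results for value in chunk]


cleaned = parallel_clean([" a", "b ", " c ", "d", "e"], workers=2)
```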

Page 11: 50571561-isas-etl-final

Rerunnability, recoverability

Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with "row_id", and tag each piece of the process with "run_id". In case of a failure, having these IDs will help to roll back and rerun the failed piece.

Best practice also calls for "checkpoints", which are states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, and so on.
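The `run_id` tagging described above can be sketched with `sqlite3`: every staged row carries the ID of the piece that wrote it, so a failed piece can be rolled back and rerun without touching the others. The `staging` table and the helpers `run_piece` and `roll_back_run` are assumptions for illustration, and the failure is simulated.

```python
import sqlite3


def run_piece(conn, run_id, rows, fail=False):
    """Load one piece of the process, tagging every row with its run_id."""
    for row_id, payload in rows:
        conn.execute(
            "INSERT INTO staging (row_id, run_id, payload) VALUES (?, ?, ?)",
            (row_id, run_id, payload),
        )
    conn.commit()
    if fail:
        # Simulated failure after a partial load has already been committed.
        raise RuntimeError(f"run {run_id} failed after a partial load")


def roll_back_run(conn, run_id):
    """Remove everything written by the given run so it can be rerun cleanly."""
    conn.execute("DELETE FROM staging WHERE run_id = ?", (run_id,))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (row_id INTEGER, run_id INTEGER, payload TEXT)")

run_piece(conn, 1, [(1, "a"), (2, "b")])
try:
    run_piece(conn, 2, [(3, "c")], fail=True)
except RuntimeError:
    roll_back_run(conn, 2)                      # undo only the failed piece
    run_piece(conn, 2, [(3, "c"), (4, "d")])    # rerun it in full

staged = conn.execute(
    "SELECT row_id, run_id, payload FROM staging ORDER BY row_id"
).fetchall()
```

Rows from run 1 are untouched by the rollback, which is the point of tagging each piece separately.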

Page 12: 50571561-isas-etl-final

Best practices

Four-layered approach for ETL architecture design

Use file-based ETL processing where possible

Use data-driven methods and minimize custom ETL coding

Qualities of a good ETL architecture design

Page 13: 50571561-isas-etl-final

Tools

Programmers can set up ETL processes using almost any programming language, but building such processes from scratch can become complex.

ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation, and loading of data. Many ETL vendors now have data profiling, data quality, and metadata capabilities.

Page 14: 50571561-isas-etl-final

Open-source ETL frameworks

• Apatar
• CloverETL
• Flat File Checker
• Jitterbit 2.0
• Pentaho Data Integration (now included in OpenOffice Base)
• RapidMiner
• Scriptella
• Talend Open Studio

Proprietary ETL frameworks

• IBM InfoSphere DataStage
• Informatica PowerCenter
• Oracle Data Integrator (ODI)
• Ab Initio
• Altova MapForce
• HiT Software Allora
• Digital Fuel Service Flow
• Phocas ETL
• Microsoft SQL Server Integration Services (SSIS)

Page 15: 50571561-isas-etl-final
Page 16: 50571561-isas-etl-final

The Pentaho BI Project is open source application software for enterprise reporting, analysis, dashboard, data mining, workflow and ETL capabilities for business intelligence needs.

Business Model

Pentaho uses a subscription model: its commercial open source business model eliminates software license fees, providing support, services, and product enhancements via an annual subscription. A commercial open source company, Pentaho "leads and sponsors" the open source projects that are core to its suite, giving it direct influence over software development.

Page 17: 50571561-isas-etl-final

Pentaho’s Board of Directors & Investors

The composition of the Board and investors is a strong, balanced blend of skills and experience, allowing them to offer guidance in core areas important to Pentaho.

Page 18: 50571561-isas-etl-final

Management and Technical Leads

The core project team at Pentaho has been together for many years and through success after success. It includes highly experienced industry leaders with a strong record of creating successful BI products for top-tier commercial vendors, including:

• Business Objects
• Cognos
• Hyperion
• IBM
• Oracle
• SAS

Page 19: 50571561-isas-etl-final

COMPONENTS OF PENTAHO BI SUITE ENTERPRISE EDITION

•The Pentaho BI Suite provides a full spectrum of business intelligence (BI) capabilities including query and reporting, interactive analysis, dashboards, data integration/ETL, data mining, and a BI platform that has made it the world's most popular open source BI suite.

•Pentaho Enterprise Edition products provide comprehensive technical support, software maintenance, and enhanced functionality.

•Pentaho's technology was architected from the ground-up as a modern, fully integrated BI platform built on open standards.

•That means it fits easily into any IT infrastructure, out-of-the-box or embedded in a custom application.

Page 20: 50571561-isas-etl-final

Pentaho Reporting

• Flexible deployment from standalone desktop reporting to embedded reporting and enterprise business intelligence

•Broad data source support including relational, OLAP, or XML-based data sources

•Popular output options including Adobe PDF, HTML, Microsoft Excel, Rich Text Format, or plain text

•Web-based ad hoc query and reporting for business users

•Enterprise Edition provides enhanced software functionality, comprehensive professional technical support, product expertise, certified software and software maintenance.

Page 21: 50571561-isas-etl-final

Embedded reporting

Page 22: 50571561-isas-etl-final
Page 23: 50571561-isas-etl-final

Operational Reporting

Page 24: 50571561-isas-etl-final
Page 25: 50571561-isas-etl-final

Production Reporting

Page 26: 50571561-isas-etl-final
Page 27: 50571561-isas-etl-final

Pentaho Report Designer

•Design reports quickly with the streamlined report wizard that takes authors from a blank canvas to a highly polished report in four simple steps.

• Connect to diverse data sources including relational data, Pentaho Analysis, flat files, java objects, or even stream data directly from Pentaho Data Integration transformations to design reports.

•Create and view user prompts, including dynamic cascading prompts.

•Publish directly to the BI server to give business users instant access to the information they need.

•Add rich data visualizations with over 15 customizable chart types, barcodes, sparklines, survey scales, and more.

•Localize reports easily to support multi-lingual deployment with a single report file.

•Embed HTML and JavaScript controls for dynamic and interactive online reports.

•Fine-tune reports using the built-in interactive preview mode.

Page 28: 50571561-isas-etl-final
Page 29: 50571561-isas-etl-final
Page 30: 50571561-isas-etl-final

Pentaho Analysis

• Freely explore business information by drilling into and cross-tabulating data
• Experience speed-of-thought response times to complex analytical queries
• View information multi-dimensionally, choosing specific metrics and attributes to analyze
• Deploy stand-alone or integrated with other products in the Pentaho BI Suite

Pentaho Analyzer

Pentaho Analyzer provides intuitive, interactive analytical reporting letting non-technical business users quickly understand business information. As part of the enhanced functionality in Pentaho Analysis Enterprise Edition, Analyzer features:

• Web-based, drag-and-drop report creation
• Advanced sorting and filtering
• Customized totals and user-defined calculations
• Chart visualizations
• And much more

Page 31: 50571561-isas-etl-final
Page 32: 50571561-isas-etl-final
Page 33: 50571561-isas-etl-final

Pentaho Dashboards

Pentaho Dashboards delivers visibility by providing:

Rich, interactive displays including Adobe Flash-based visualizations so that business users can immediately see which business metrics are on track, and which need attention

Self-service dashboard designer that lets business users easily create personalized dashboards with zero training

Integration with Pentaho Reporting and Pentaho Analysis so that users can drill to underlying reports and analysis to understand what factors are contributing to good or bad performance

Portal integration to make it easy to deliver relevant business metrics to large numbers of users, seamlessly integrated into their application

Integrated alerting to continuously monitor for exceptions and notify users to take action

Page 34: 50571561-isas-etl-final
Page 35: 50571561-isas-etl-final
Page 36: 50571561-isas-etl-final

Pentaho Data Integration

With Pentaho Data Integration, Pentaho is redefining the way that BI applications are built and deployed. Utilizing Pentaho’s Agile BI approach, Pentaho Data Integration unifies the ETL, modeling and visualization processes into a single, integrated environment that enables developers and end-users to work seamlessly together. The end result is that BI developers and end users can build BI applications more quickly, easily and at a small fraction of the cost of traditional solutions. Pentaho’s Agile BI:

• Powers instantaneous, iterative BI application development
• Enables seamless collaboration between developers and end users
• Merges complex BI development into a single process
• Dramatically reduces time and difficulty of building and deploying BI apps

Page 37: 50571561-isas-etl-final

Pentaho Data Integration is a full-featured ETL solution including:

• Rich transformation library with over 100 out-of-the-box mapping objects
• Broad data source support including packaged applications, over 30 open source and proprietary database platforms, flat files, Excel documents and more
• Advanced data warehousing support for Slowly Changing and Junk Dimensions
• Proven enterprise-class performance and scalability
• Integration with the Pentaho BI Suite for Enterprise Information Integration (EII), advanced scheduling, and process integration
• Unified ETL, modeling and visualization development environment for design of BI applications

Page 38: 50571561-isas-etl-final

Pentaho Data Integration Transformation Screenshot

Page 39: 50571561-isas-etl-final
Page 40: 50571561-isas-etl-final

Pentaho Data Integration Job Screenshot

Page 41: 50571561-isas-etl-final
Page 42: 50571561-isas-etl-final

Common use cases for Pentaho Data Integration include

• Data warehouse population
• Agile design of BI applications
• Information enrichment by integrating data from various sources
• Data migration between applications
• Imports of data into databases from text-files, Excel spreadsheets, relational systems and more
• Data cleansing by applying complex conditions in data transformations
• Exploration of data in existing databases (tables, views, etc.)

Page 43: 50571561-isas-etl-final

Pentaho Data Mining

Data Mining is the process of running data through sophisticated algorithms to uncover meaningful patterns and correlations that may otherwise be hidden. These can be used to understand the business better and also exploited to improve future performance through predictive analytics.

Pentaho Data Mining is differentiated by its open, standards-compliant nature, use of Weka data mining technology, and tight integration with core business intelligence capabilities including reporting, analysis and dashboards. Other data mining offerings lack this level of sophistication and integration.

Page 44: 50571561-isas-etl-final

Pentaho Data Mining can be deployed as:

An out-of-the-box solution for immediate deployment to analysts. As far as end-users are concerned, data mining operates entirely in the background – users see results and recommendations through e-mail or other web pages, which can include Pentaho Dashboards.

A set of components that enable Java™ developers to quickly create custom reporting solutions using Java Objects or Java Server Pages (JSPs). These can be tightly integrated with other applications or portals.

Together with other components of the overall Pentaho BI Suite

Page 45: 50571561-isas-etl-final

Features and Benefits

Provides insight into hidden patterns and relationships in your data

Enables you to exploit these correlations to improve organizational performance

Provides indicators of future performance

Enables embedding of recommendations in your applications

Enables you to take full advantage of a range of data mining algorithms

Page 46: 50571561-isas-etl-final

Technology

Powerful Data Mining Engine

Provides a comprehensive set of machine learning algorithms from the Weka project including clustering, segmentation, decision trees, random forests, neural networks, and principal component analysis.

Pentaho has added integration with Pentaho Data Integration and automated the process of transforming data into the format the data mining engine needs.

Algorithms can either be applied directly to a dataset or called from Java code.

Output can be viewed graphically, interacted with programmatically, or used as a data source for reports, further analysis, and other processes.

Filters are provided for discretization, normalization, re-sampling, attribute selection, and transforming and combining attributes.

Page 47: 50571561-isas-etl-final

Classifiers provide models for predicting nominal or numeric quantities. Learning schemes include decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, and other advanced techniques.

The data mining engine is also well-suited for developing new machine learning schemes, enabling customers to incorporate their own models.

Inputs and outputs can be controlled programmatically, enabling developers to create completely custom solutions using the components provided.

Graphical Design Tools

Graphical user interfaces are provided for data pre-processing, classification, regression, clustering, association rules, and visualization.

Page 48: 50571561-isas-etl-final

Data Mining - Boundary Visualizer

Page 49: 50571561-isas-etl-final
Page 50: 50571561-isas-etl-final

Data Mining – Classify Panel

Page 51: 50571561-isas-etl-final
Page 52: 50571561-isas-etl-final

Data Mining - Knowledge Flow

Page 53: 50571561-isas-etl-final
Page 54: 50571561-isas-etl-final

Data Mining - Explorer

Page 55: 50571561-isas-etl-final
Page 56: 50571561-isas-etl-final

CUSTOMER SUCCESSES

Pentaho customers address a wide range of BI challenges using services and software from Pentaho. Many Pentaho customers use Pentaho for reporting, data integration, dashboards, and/or analysis. Some use multiple modules or the full Pentaho BI Suite. With subscription services and open source licensing from Pentaho, customers can get best-in-class BI capabilities with the peace of mind of professional support, software maintenance, training, consulting, and more.

The following is a small sample of the many organizations around the globe that depend on Pentaho for commercial open source business intelligence.

Page 57: 50571561-isas-etl-final

"Our only regret was that we didn't have Pentaho for data integration years ago. Immediately we were able to see the increased operational efficiency, reduced internal costs and greater customer value using Pentaho Data Integration."

Page 58: 50571561-isas-etl-final

Deployment Overview

Key Challenges
• Cumbersome, manual process for creation and distribution of reports
• Multiple data points, including Google Analytics, needed to be integrated and automated into one report

Pentaho Solution
• Pentaho Data Integration
• Business and implementation services by Pentaho Systems Integrator Partner, DEFTeam Solutions

Results
• Increased operational efficiency
• Reduced internal costs
• Greater customer value

Why Pentaho
• Low cost
• Flexibility
• Speed-to-market

Page 59: 50571561-isas-etl-final

"We needed to deliver a business intelligence solution that would show immediate benefit by increasing efficiencies, containing costs, and helping drive revenue. By using Pentaho BI Suite Enterprise Edition, we were able to do so in a fiscally responsible manner, and in today's economic climate that is of utmost importance."

Page 60: 50571561-isas-etl-final

Deployment Overview

Key Challenges
• Gaining better insight across the organization to help steer strategic decision-making
• Conducting deeper analysis on historical data across all facets of its service offerings

Pentaho Solution
• Pentaho BI Suite Enterprise Edition for data integration, reporting and analysis
• CentOS, PostgreSQL Database

Results
• Company-wide performance gains through better visibility into customer, cost, and revenue trends
• Increased operational efficiency, reduced internal costs and greater customer value

Why Pentaho
• End-to-end BI capabilities
• Value vs. proprietary BI
• Enterprise Edition features

Page 61: 50571561-isas-etl-final

"The simplicity of the interface actually allows Lifetime Entertainment Services to give direct access to business analysts, allowing them to understand and manage the business rules governing the integration of information. That wasn't previously possible with complex hand-coded integration jobs."

Page 62: 50571561-isas-etl-final

Deployment Overview

Key Challenges
• Optimizing advertising processes to drive ad revenue growth
• Adapting data integration infrastructure to keep up with changing business rules

Pentaho Solution
• Pentaho Data Integration Enterprise Edition
• Selected over Informatica and BusinessObjects Data Integrator
• Continued use of Business Objects BI tools

Results
• Ability for business analysts to manage integration rules and adapt integration processes to company business rules

Why Pentaho
• Ease of use
• Cost of ownership
• Enterprise Edition Features

Page 63: 50571561-isas-etl-final

"ActivePivot (tm) uniquely marries the concept of online analytical process with real-time position-keeping; something no other company currently offers. Thanks to Pentaho Spreadsheet Services we can now offer seamless MDX connectivity to Microsoft Excel."

Page 64: 50571561-isas-etl-final

Deployment Overview

Key Challenges
• Excel-based access to analytic application data
• Maximizing margins on analytic software solution for financial institutions

Pentaho Solution
• Pentaho Analysis
• Pentaho Spreadsheet Services

Results
• Competitive differentiation based on Excel-based access to centralized information

Why Pentaho
• Low costs delivered by commercial open source business model
• Standards-based offering allowing Excel-based connectivity to live OLAP data

Page 65: 50571561-isas-etl-final

"Pentaho's BI suite and top-notch professional support enabled us to deliver a successful, high-value BI solution at a much lower cost than would have been possible with the expensive, proprietary alternatives."

Page 66: 50571561-isas-etl-final

Deployment Overview

Key Challenges
• Understanding the effectiveness of its online marketing activities
• Outgrowing Microsoft Excel-based reporting system
• Maintaining complex, hand-coded ETL scripts

Pentaho Solution
• Pentaho BI Suite Enterprise Edition
• IBM servers, SUSE Linux, 1.5 terabyte Microsoft SQL Server data warehouse
• Professional services from Pentaho partner OpenBI

Results
• Automated integration of clickstream data with Google Analytics and catalog sales activity data
• Greater visibility into website traffic, keyword performance and revenue attribution

Why Pentaho
• Standards-based, cross platform support
• Quality of support and services

Page 67: 50571561-isas-etl-final

"With professional support and world-class ETL from Pentaho, we've been able to simplify our IT environment and lower our costs. We were also surprised at how much faster Pentaho Data Integration was than our prior solution."

Page 68: 50571561-isas-etl-final

Deployment Overview

Key Challenges
• Measuring and optimizing agent performance, customer satisfaction, and marketing ROI
• Getting an integrated, strategic view across multiple operational systems

Pentaho Solution
• Pentaho Data Integration Enterprise Edition
• Red Hat Enterprise Linux, MySQL database
• Continued use of proprietary BI tools (MicroStrategy)
• Product expertise

Results
• Three-fold performance increase, 8 hour reduction in batch load times
• Simplified maintenance and reduced costs

Why Pentaho
• Functionality and flexibility
• Professional support

Page 69: 50571561-isas-etl-final

AWARDS AND RECOGNITION


Page 70: 50571561-isas-etl-final

Thank You!