Accelerate your Data Science and DataOps projects with IBM DataStage and Watson Knowledge Catalog


Page 1: Accelerate your Data Science and DataOps projects with IBM ...

Accelerate your Data Science and DataOps projects with IBM DataStage and Watson Knowledge Catalog

Page 2: Accelerate your Data Science and DataOps projects with IBM ...

IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion.

Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Please Note


© 2019 IBM Corporation

Page 3: Accelerate your Data Science and DataOps projects with IBM ...


COLLECT

ORGANIZE

ANALYZE

INFUSE

Introducing DataOps

AI

Analytics and AI at scale and speed

to drive

Operational efficiency

Data privacy & compliance

DataOps (DevOps for Data + Data Operations)

• A concept, like DevOps for data, that enables collaboration between data consumers and data providers at speed and scale

• Automated data operations that provide a curated data pipeline with quality and governance

• Drives agility and innovation everywhere

People Process Technology

Page 4: Accelerate your Data Science and DataOps projects with IBM ...

Watson Knowledge Catalog supports end-to-end DataOps


Data Governance Teams

Data Quality – Trust your data

Data Stewards & Data Quality

Analysts

Data Consumption – Use your data

Data Citizens

Data is useful only if its quality, content, and structure are well understood. Delivering reliable, quality, timely data for business consumption is a continuous process.

To set up the foundation of a DataOps program, organizations need to comply with regulatory requirements, communicate and enforce policies and standards, and manage metadata.

Enterprises need to surface business-ready data to consumers, allowing them to deliver timely value to the business and make better decisions.

Knowledge Catalog

Data Governance – Know your data

Page 5: Accelerate your Data Science and DataOps projects with IBM ...

All capabilities in a single experience


Data Governance

Data Quality

Data Consumption

Knowledge Catalog

Business Glossary

Policy Management

Policy Enforcement

Reference Data Management

Data Lineage

Classification

Self-Service

Data Prep

Social Collaboration

Data Discovery

Data Profiling & Analysis

Business Term Suggestions

Data Quality Issue Detection

Page 6: Accelerate your Data Science and DataOps projects with IBM ...

Machine Learning and Automation make Data Governance less invasive

Getting Started Quickly

Using a body of knowledge for CCPA, GDPR, and CECL, get term assignment recommendations for assets in the catalog.

Quickly create and assign a data class to clusters of similar columns using the patent-protected Fingerprint algorithm.

Ingest a PDF and capture business terms and governance rules based on the document.

Profile data automatically and classify each column

Search across catalogs, projects and categories based on metadata and past searches.

Using historical business term assignments and business term relationships, get recommendations for business terms to assign to columns.

Based on past searches and what’s popular, see recommended data assets.

Data protection rules automatically restrict access and anonymize data

Automated Daily Activities
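As a rough illustration of two items on this slide, profiling data to classify each column and then letting a data protection rule anonymize what was classified as sensitive, here is a minimal Python sketch. It is not Watson Knowledge Catalog code and does not use any IBM API. The data class names, the regular expressions, the 80% match threshold, and the function names are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical data classes and patterns, for illustration only.
DATA_CLASS_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

# Hypothetical rule: columns with these data classes are redacted.
SENSITIVE_CLASSES = {"email", "us_ssn"}


def classify_column(values, threshold=0.8):
    """Return the data class matched by at least `threshold` of the
    non-empty values, or None if no class reaches the threshold."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    counts = Counter()
    for value in non_empty:
        for name, pattern in DATA_CLASS_PATTERNS.items():
            if pattern.match(value):
                counts[name] += 1
    if not counts:
        return None
    name, hits = counts.most_common(1)[0]
    return name if hits / len(non_empty) >= threshold else None


def profile_table(table):
    """Classify every column of a table given as {column_name: [values]}."""
    return {col: classify_column(vals) for col, vals in table.items()}


def apply_protection_rule(table, classes):
    """Return a copy of the table with sensitive columns redacted."""
    protected = {}
    for col, vals in table.items():
        if classes.get(col) in SENSITIVE_CLASSES:
            # Replace each character with "X" so only the length survives.
            protected[col] = ["X" * len(v) for v in vals]
        else:
            protected[col] = list(vals)
    return protected


customers = {
    "name": ["Ana", "Bo", "Cy"],
    "contact": ["ana@example.com", "bo@example.org", "cy@example.net"],
    "tax_id": ["123-45-6789", "987-65-4321", ""],
}
classes = profile_table(customers)
masked = apply_protection_rule(customers, classes)
```

In this sketch the rule is keyed on the data class rather than on the column name, so a newly profiled column that matches a sensitive class would be redacted without anyone editing the rule.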

Page 7: Accelerate your Data Science and DataOps projects with IBM ...

Integrations between WKC and DataStage


Data Lineage

Shared Connections

Use of Reference Data

Page 10: Accelerate your Data Science and DataOps projects with IBM ...

Demo

© 2020 IBM Corporation 10

Page 11: Accelerate your Data Science and DataOps projects with IBM ...

Poll


What integrations between Watson Knowledge Catalog and DataStage would be most useful?

Page 12: Accelerate your Data Science and DataOps projects with IBM ...

Benefits of Cloud Pak for Data


✅ Reduced cost of custom integrations with disparate tools

✅ Pay for what you need, not the entire platform

✅ Governance of data and AI lifecycle

✅ Built on open source technology

✅ APIs available to integrate with all services on the platform

✅ Common experiences and administration across offerings

Page 13: Accelerate your Data Science and DataOps projects with IBM ...

Moving Forward

Helping customers migrate to Watson Knowledge Catalog and Cloud Pak. Demonstrating how Watson Knowledge Catalog benefits and extends, delivering business-ready data to the enterprise.

Customers can migrate existing content from Information Server 11.5 and 11.7 to Watson Knowledge Catalog using existing export capabilities. This will ensure uninterrupted operability of their Glossary, Catalog and Analysis Projects.

Version 11.7 – ISTool Export Syntax

Version 11.5 – ISTool Export Syntax

Installation Guidelines will be forthcoming. Cloud Pak will require a new dedicated deployment on Red Hat Linux OpenShift. Cloud Pak supports IBM Cloud in addition to other public Cloud Vendors or on-premise deployment.

Page 14: Accelerate your Data Science and DataOps projects with IBM ...

Moving Forward

Content Migration from Information Server to Watson Knowledge Catalog

Assets which can be fully migrated to Watson Knowledge Catalog via command scripts:

• Glossary Assets: Categories & Terms, Governance Policies & Rules, Stewards, Labels

• Information Assets: Database, Data File, Business Intelligence, Data Model

• Other Assets: OpenIGC, Extended Data Sources and Mappings, FastTrack

• DataStage Projects and Job definitions

• Analysis Projects and Workspace

• Metadata Asset Manager Import Area definitions

Known Limitations

• Custom Attribute Relationships and Restrictions on Governance Assets (expected Q3 2020)

• Published Analysis Results and Quality Score will not migrate, and need to be re-generated

• Metadata Asset Manager historical Import Area information will not migrate

• Glossary Term History and Development Glossary will not migrate

• Glossary Multilingual definitions will not migrate (Translation expected Q2 2020)

• Data and Business Lineage configuration settings will need to be re-defined

The following components or capabilities are not currently supported in Watson Knowledge Catalog:

• Business Glossary Anywhere

• Business Glossary for Eclipse

• Cognos Framework Manager / Report Designer integration

• Governance Dashboard / SQL Views

• Stewardship Center / Subscription Manager

• Business Process Manager Integration

• Governance Catalog Collections

Page 15: Accelerate your Data Science and DataOps projects with IBM ...

Business Challenge

Associated Bank wanted to improve its client experiences and better analyze data from its many existing data sources.

Solution

Associated Bank is adopting IBM Cloud Pak for Data System, for rapid deployment and scaling of AI. Initial projects include a new Customer 360 system for improving client experiences and a new governed data dashboard for improved analytics results.

Outcome

⎻ Cloud Pak for Data System provides the bank a single-interface platform for end-to-end enterprise analytics

⎻ Single source of all information about customers across all the bank’s systems

⎻ Assists in compliance with privacy regulations like CCPA

Solution Components

Data Modernization and DataOps

‒ IBM Cloud Pak for Data System (on premise) with

‒ IBM DataStage

‒ IBM Watson Knowledge Catalog

‒ IBM Db2 Warehouse on Cloud

‒ Services from IBM’s Expert Labs team and IBM’s Data Science and AI Elite team

‒ Partner: iOLAP

"One of the great things about the Cloud Pak for Data System is the speed with which we'll be able to launch and scale our analytics platform. The integrated stack contains what we need to improve data quality, catalog our data assets, enable data collaboration, and build/operationalize data sciences. We're able to move quickly with design, test, build and deployment of new models and analytical applications."

Steve Lueck, Vice President, Data Management

Associated Bank

Rapid deployment and scaling of AI

Industry: Banking & Financial Markets

Geography: North America

Watch the video

Associated Bank

Page 16: Accelerate your Data Science and DataOps projects with IBM ...

Business Challenge

Large, complex bank with mix of disparate data silos with both legacy and modern capabilities. Adherence to numerous industry regulatory requirements made accessing and querying data difficult and complex; data lineage was a large factor. Data insight initiatives were often slowed or delayed.

Solution

The bank sought to move to one corporate operating model, in anticipation of GDPR and other cross-border regulatory requirements.

The Bank partnered with IBM in order to streamline data management and applications across all operational countries by developing a single operating model strategy and platform.

Outcome

⎻ Ensure proper data governance, while simultaneously leveraging data from across the bank

⎻ Consolidate stacks into a single user experience platform; increasing collaboration, streamlining application management, and optimizing licensing and IT cost drivers

⎻ Leverage data virtualization for existing on-premise investments with data to remove data silos

Solution Components

Data Modernization and DataOps

⎻ IBM Cloud Pak for Data on premise with

⎻ IBM Data Virtualization

⎻ IBM DataStage

⎻ IBM Watson Knowledge Catalog

⎻ Services from IBM’s Expert Labs team and IBM’s Data Science and AI Elite team

Developing a single operating model strategy and platform

Industry: Banking & Financial Markets

Geography: Europe

A Large European Bank

Page 17: Accelerate your Data Science and DataOps projects with IBM ...

Business Challenge

The bank, an existing IBM Information Server Suite customer (DataStage, QualityStage, IGC, IA, FastTrack), wanted to improve its data governance strategy, with a focus on data quality and data lineage, and as a result become compliant with regulations. It wanted to enhance its business user experience and better analyze data from many existing data sources. It lacked centralized, enterprise-wide Data and Analytics strategy support processes and needed an enterprise data inventory to improve data governance.

Solution

The bank partnered with IBM to streamline data collection and management across the bank by developing a single Data Governance operating model strategy and platform.

The bank sees a clear path to automating its data governance practice, and sees IBM Cloud Pak for Data as the perfect solution: a product evolution and a modern, open platform to drive core system transformation. As a first step toward automating its data governance, the bank is implementing IBM Watson Knowledge Catalog to catalog all metadata across platforms and ultimately provide real-time, quality data to its data scientists and business users as a self-service.

Outcome

⎻ Ensure proper data governance, while simultaneously leveraging data from across the bank

⎻ Consolidate stacks into a single user experience platform and increase collaboration

⎻ Get trusted data and reduce the amount of data preparation

⎻ Free up time to spend on analysis and gain new insights

Solution Components

Data Modernization and DataOps

⎻ IBM Cloud Pak for Data on premise with

⎻ IBM DataStage

⎻ IBM Watson Knowledge Catalog

⎻ Services from IBM’s Expert Labs team

Automating data governance across the bank to meet changing expectations and increased regulation

Industry: Banking & Financial Markets

Geography: Europe

A Large European Bank

Page 18: Accelerate your Data Science and DataOps projects with IBM ...

Governance

Expand connections Ecosystem

Customization of views by persona

Reference Data versioning

Enhanced Lineage – Business View

Support for Knowledge Accelerators

AI model policies and rules

Quality

ML assisted processing time estimates

DQ Remediation workflow

Address parse/enhance/verify

Delta data discovery

Connectivity to Hive over Kerberos

Consumption

Search and Drill Down asset hierarchy

Support additional asset types

Restrict access to data

Support external reporting and querying tools

Integration with ADP and Cognos

3rd Party Data Accelerators/Providers

Enhance Platform Roles & Permissions – CPD user groups

Governance

Data Protection rules in Data Virtualization

Workflow customization for governance artifacts

Rule Based Meta Data Access Control**

Tag Data Source Connection**

Data Lineage**

Quality

Enhanced learning for term suggestions

View of data quality trends over time

Data Rule Exception Management

‘Fingerprint’ data classes

Simplified Discovery Experience

WKC Instascan

Import table or column from ERWin**

Consumption

New Connectors: SharePoint, Hive MetaStore, OracleBI, Impala, Planning Analytics

Search on Description in Catalog Assets **

Overall

New look and feel!

Globalization for Brazilian Portuguese, English, French, German, Italian, Japanese, Russian, Simplified Chinese, Spanish, and Traditional Chinese

Watson Knowledge Catalog on Cloud Pak for Data

2020/2021 Roadmap

Delivered

2H 2020 Nov

1H 2021 May

Governance

Reference Data Set mapping, hierarchies & custom columns

Workflow request management

Permissions and workflow by categories

Quality

Discovery and profiling of unstructured data

ML based data sampling

Quick scan discovery into catalog of choice

View sample data in term assignment

Consumption

Custom Asset Types

Asset relationships and hierarchies

Integration with Test Data Management

On Demand View of Sample Data from Term Assign

Overall

Import/Export to support move from dev/test/prod

© 2020 IBM Corporation

Governance

Expand connections Ecosystem

Approval Process for publishing assets

Quality

Retire Information Assets View

Data Quality Analysis Workspace

Discovery queues for new term generation

Consumption

Apache Atlas Integration

Overall

Additional Languages

2H 2021

** Available Today
