From the Big Data keynote at InCSIghts 2012

10 April 2023 1

BIG DATA Defined: Data Stack 3.0

Anand Deshpande, Persistent Systems, December 2012

Congratulations to the Pune Chapter

Best Chapter Award at CSI 2012 Kolkata

COMAD 2012 14-16 December

Pune

Coming to India

Delhi 2016

The Data Revolution is Happening Now

The growing need for large-volume, multi-structured “Big Data” analytics, as well as … “Fast Data”, have positioned the industry at the cusp of the most radical revolution in database architectures in 20 years.

We believe that the economics of data will increasingly drive competitive advantage.

Source: Credit Suisse Research, Sept 2011

Organizational leaders want analytics to exploit their growing data and computational power to get smart, and get innovative, in ways they never could before.

Source: MIT Sloan Management Review, The New Intelligent Enterprise, “Big Data, Analytics and the Path From Insights to Value” by Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz, December 21, 2010

What Data Can Do For You

Source: New York Times, September 2, 2009. “Tesco, British Grocer, Uses Weather to Predict Sales” by Julia Werdigier. http://www.nytimes.com/2009/09/02/business/global/02weather.html

Britain often conjures images of unpredictable weather, with downpours sometimes followed by sunshine within the same hour — several times a day.

Such randomness has prompted Tesco, the country’s largest grocery chain, to create…its own software that calculates how shopping patterns change “for every degree of temperature and every hour of sunshine.”

Determining Shopping Patterns

British Grocer Tesco Uses Big Data by Applying Weather Results to Predict Demand and Increase Sales

GlaxoSmithKline is aiming to build direct relationships with 1 million consumers in a year using social media as a base for research and multichannel marketing. Targeted offers and promotions will drive people to particular brand websites where external data is integrated with information already held by the marketing teams.

Source: Big data: Embracing the elephant in the room By Steve Hemsley http://www.marketingweek.co.uk/big-data-embracing-the-elephant-in-the-room/3030939.article

Tracking Customers in Social Media

GlaxoSmithKline Uses Big Data to Efficiently Target Customers

What does India Think?

Persistent enabled Aamir Khan Productions and Star Plus to use Big Data to learn how people react to some of the most excruciating social issues. http://www.satyamevjayate.in/

Satyamev Jayate, Aamir Khan’s pioneering, interactive socio-cultural TV show, has caught the interest of the entire nation. It has already generated ~7.5M responses in 4 weeks over SMS, Facebook, Twitter, phone calls and discussion forums from viewers across the world. This data is being analyzed and delivered in real time to allow the producers to understand the pulse of the viewers, to gauge the appreciation for the show and, most importantly, to spread the message. Harnessing the truth from all this data is a key component of the show’s success.

WE ALREADY HAVE DATABASES. WHY DO WE NEED TO DO ANYTHING DIFFERENT?

Data Stack 1.0

Relational Database Systems for Operational Store

● Transaction processing capabilities ideally suited for transaction-oriented operational stores.

● Data types – numbers, text, etc.

● SQL as the query language.

● De facto standard as the operational store for ERP and mission-critical systems.

● Interface through application programs and query tools.

Data Stack 1.0: Online Transactions Processing (OLTP)

● High throughput for transactions (writes).

● Focus on reliability – ACID Properties.

● Highly normalized Schema.

● Interface through application programs and query tools.

Data Stack 1.0

Data Stack 2.0: Enterprise Data Warehouse for Decision Support

● Operational data stores store on-line transactions – many writes, some reads.

● Large fact table, multiple dimension tables.

● Schema has a specific pattern – star schema.

● Joins are also very standard and create cubes.

● Queries focus on aggregates.

● Users access data through tools such as Cognos, Business Objects, Hyperion etc.

Data Stack 2.0
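The star-schema pattern above (one large fact table, small dimension tables, standard joins, aggregate queries) can be sketched in a few lines of plain Java. This is only an illustration under assumptions: the `Sale` row type, the store-to-region dimension and the `salesByRegion` rollup are hypothetical names, not part of any warehouse product named in the talk.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy star-schema rollup: one fact table joined to one dimension, then aggregated. */
final class StarSchemaRollup {

    /** One fact-table row: foreign keys into the dimension tables plus a numeric measure. */
    static final class Sale {
        final int storeId;
        final int productId;
        final double amount;

        Sale(int storeId, int productId, double amount) {
            this.storeId = storeId;
            this.productId = productId;
            this.amount = amount;
        }
    }

    /** Joins each fact row to the store dimension and totals the measure per region. */
    static Map<String, Double> salesByRegion(List<Sale> facts, Map<Integer, String> storeRegion) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sale sale : facts) {
            // The "join": look up the dimension attribute through the foreign key.
            String region = storeRegion.getOrDefault(sale.storeId, "UNKNOWN");
            // The "aggregate": sum the measure per dimension value.
            totals.merge(region, sale.amount, Double::sum);
        }
        return totals;
    }
}
```

The same join-then-aggregate shape is what an OLAP cube precomputes for every combination of dimension values.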

Data Stack 2.0: Enterprise Data Warehouse

[Diagram components: ETL, Data Staging, Data Store, Data Warehouse, OLAP. User-facing layers: Reports & Ad hoc Analysis, Alerts & Dashboards, What-if Analysis and EPM, Predictive Analytics, Data Visualization.]

Standard Enterprise Data Architecture

Data Stack 1.0: Operational Data Systems

[Diagram layers: Presentation Layer, Application Logic, Relational Databases]

Data Stack 2.0: Enterprise Data Warehouse Systems

[Diagram components: sources (Relational Databases, Legacy Data, Purchased Data, ERP Systems); Extraction, Cleansing, Optimized Loader (ETL); Data Warehouse Engine; Metadata Repository; Analyze, Query]

Who are the players

ETL
– Oracle: Oracle Data Integrator
– Microsoft: SQL Server Integration Services (SSIS)
– IBM: IBM Infosphere DataStage
– SAP: Business Objects Data Integrator
– Open Source: Kettle
– SAS: Enterprise Data Integration Server
– Pure Play: Informatica PowerCenter

DWH
– Oracle: Oracle 11g/Exadata
– Microsoft: Parallel Data Warehouse (PDW)
– IBM: Netezza (Pure Data)
– SAP: Sybase IQ
– Open Source: Postgres/MySQL
– Pure Play: Teradata, Greenplum (EMC)

OLAP
– Oracle: Hyperion/Essbase
– Microsoft: SQL Server Analysis Services (SSAS)
– IBM: Cognos Powerplay
– SAP: SAP Hana
– Open Source: Mondrian
– SAS: OLAP Viewer

Reporting
– Oracle: Oracle BI (OBIEE) & Exalytics
– Microsoft: SQL Server Reporting Services (SSRS)
– IBM: Cognos BI
– SAP: Business Objects, BO Dashboard Builder
– Open Source: BIRT, Pentaho, Jasper
– SAS: Enterprise Guide, Web Report Studio
– Pure Play: MicroStrategy, Qliktech, Tableau

Predictive Analytics
– Oracle: Oracle Data Mining (ODM)
– Microsoft: SQL Server Data Mining (SSDM)
– IBM: SPSS
– SAP: SAP Hana + R
– Open Source: R/Weka
– SAS: SAS Enterprise Miner

One in two business executives believe that they do not have sufficient information across their organization to do their job

Source: IBM Institute for Business Value

Despite the two data stacks ..

Data has Variety: it doesn’t fit

Less than 40% of the Enterprise Data makes its way to Data Stack 1.0 or Data Stack 2.0.

Beyond the Operational Systems, data required for decision making is scattered within and beyond the enterprise

ERP Systems

CRM Systems

Enterprise Data Warehouse

Structured Data Sources

Email Systems

Collaboration/Wiki Sites

Document Repositories

Project artifacts

Employee Surveys

Customer Call Center Records

Unstructured Data Sources

Organizational Workflow

Sensor Data

Cloud Data Sources

CRM Systems

Expense Management System Vendor

Collaboration Systems

Supply Chain Systems

Location and Presence Data

Public Data Sources

Weather forecasts

Demographic Data

Maps

Economic Data

Social Networking Data

Twitter Feeds

Data Volumes are Growing

“5 exabytes of information was created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing.”

Eric Schmidt at the Techonomy Conference, August 4, 2010

(1 exabyte = 10^18 bytes)

The Continued Explosion of Data in the Enterprise and Beyond

80% of new information growth is unstructured content – 90% of that is currently unmanaged.

[Chart: data volume by year, 1990 to 2020]

2009: 800,000 petabytes

2020: 35 zettabytes

44x as much data and content over the coming decade

Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
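The 44x figure can be checked from the two endpoints on the chart. The working below assumes decimal units (1 petabyte = 10^15 bytes, 1 zettabyte = 10^21 bytes), which the slide does not state:

```latex
\[
\frac{35\ \text{ZB}}{800{,}000\ \text{PB}}
= \frac{35 \times 10^{21}\ \text{bytes}}{8 \times 10^{5} \times 10^{15}\ \text{bytes}}
= \frac{35 \times 10^{21}}{8 \times 10^{20}}
= 43.75 \approx 44
\]
```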

What comes first -- Structure or data?

Schema/Structure

Data

Structure First is Constraining

Time to create a new data stack for unstructured data.

Data Stack 3.0.

Time-out!

Internet companies have already addressed the same problems.

● Twitter has 140 million active users and more than 400 million tweets per day.

● Facebook has over 900 million active users, who generate an average of 3.2 billion Likes and Comments per day.

● 3.1 billion email accounts in 2011, expected to rise to over 4 billion by 2015.

● There were 2.3 billion internet users (2,279,709,629) worldwide in the first quarter of 2012, according to Internet World Stats data updated 31st March 2012.

Internet Companies have to deal with large volumes of unstructured real-time data.

Their data loads and pricing requirements do not fit traditional relational systems

● Hosted service.

● Large cluster (1000s of nodes) of low-cost commodity servers.

● Very large amounts of data – indexing billions of documents, video, images etc.

● Batch updates.

● Fault tolerance.

● Hundreds of millions of users.

● Billions of queries every day.

They built their own systems

● It is the platform that distinguishes them from everyone else.

● They required:
– high reliability across data centers
– scalability to thousands of network nodes
– huge read/write bandwidth
– support for large blocks of data which are gigabytes in size
– efficient distribution of operations across nodes to reduce bottlenecks

Relational databases were not suitable and would have been cost prohibitive.

What did the Internet Companies build? And how did they get there?

Internet Companies have open-sourced the source code they created for their own use.

Companies have created business models to support and enhance this software.

They started with a clean slate!

What features from the relational database can be compromised?

Do we need ..
● transaction support?
● rigid schemas?
● joins?
● SQL?
● on-line, live updates?

Must have
● Scale.
● Ability to handle unstructured data.
● Ability to process large volumes of data without having to start with structure first.
● Leverage distributed computing.

Rethinking ACID properties

For the internet workload, with distributed computing, ACID properties are too strong.

ACID: Atomicity, Consistency, Isolation, Durability

Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state – BASE.

BASE: Basically Available, Soft state, Eventual consistency
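A minimal sketch of what "eventually consistent" means in practice, assuming a toy store with several in-memory replicas. Everything here is hypothetical: the class name, the single shared version counter and the last-writer-wins rule are simplifications chosen for illustration, not how any particular system in this talk resolves conflicts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy replicated store: writes land on one replica, others catch up later. */
final class EventuallyConsistentStore {

    /** A value tagged with the version at which it was written. */
    static final class Versioned {
        final long version;
        final String value;

        Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final List<Map<String, Versioned>> replicas = new ArrayList<>();
    private long clock = 0; // assumption: one shared counter; real systems need a decentralised scheme

    EventuallyConsistentStore(int replicaCount) {
        for (int i = 0; i < replicaCount; i++) {
            replicas.add(new HashMap<>());
        }
    }

    /** Accepts the write on a single replica and returns at once (available, soft state). */
    void write(int replica, String key, String value) {
        replicas.get(replica).put(key, new Versioned(++clock, value));
    }

    /** Reads whatever the chosen replica holds right now; the answer may be stale or null. */
    String read(int replica, String key) {
        Versioned found = replicas.get(replica).get(key);
        return found == null ? null : found.value;
    }

    /** Background sync: every replica adopts the highest-versioned value of each key. */
    void synchronise() {
        Map<String, Versioned> newest = new HashMap<>();
        for (Map<String, Versioned> replica : replicas) {
            for (Map.Entry<String, Versioned> entry : replica.entrySet()) {
                newest.merge(entry.getKey(), entry.getValue(),
                        (a, b) -> a.version >= b.version ? a : b);
            }
        }
        for (Map<String, Versioned> replica : replicas) {
            replica.putAll(newest);
        }
    }
}
```

Between a `write` and the next `synchronise`, different replicas can return different answers for the same key; after it, they agree. That window is the price paid for never blocking a write.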

Brewer’s CAP Theorem for Distributed Systems

● Consistent – reads always pick up the latest write.

● Available – can always read and write.

● Partition tolerant – the system can be split across multiple machines and datacenters.

Can do at most two of these three.

[Diagram: triangle with corners Consistency, Availability and Partition Tolerance; the edges are labelled CA, CP and AP.]

Essential Building Blocks for Internet Data Systems

[Diagram: a cluster running the Hadoop Map-Reduce Layer on top of the Hadoop Distributed File System (HDFS); Map Reduce jobs from developers go to the Job Tracker.]

“For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy.” – Jeremy Zawodny @ Yahoo!

Challenges with Distributed Computing

● Cheap nodes fail, especially if you have many
– Mean time between failures for 1 node = 3 years
– Mean time between failures for 1000 nodes = 1 day
– Solution: Build fault-tolerance into system

● Commodity network = low bandwidth
– Solution: Push computation to the data

● Programming distributed systems is hard
– Solution: Data-parallel programming model: users write “map” & “reduce” functions, system distributes work and handles faults
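The jump from "3 years" to "1 day" is simple division, sketched below. It assumes every node fails independently at the same constant rate, which the slide implies but does not state; the class and method names are illustrative only.

```java
/** Back-of-the-envelope failure arithmetic for a cluster of identical nodes. */
final class ClusterMtbf {

    /**
     * Expected time between failures somewhere in the cluster, in days.
     * Assumes independent nodes with the same mean time between failures.
     */
    static double clusterMtbfDays(double nodeMtbfYears, int nodes) {
        double nodeMtbfDays = nodeMtbfYears * 365.0;
        return nodeMtbfDays / nodes; // failure rates add up, so the interval shrinks by 1/nodes
    }
}
```

With the slide's numbers, `clusterMtbfDays(3, 1000)` gives about 1.1 days, which the slide rounds to 1 day. At that rate failure is routine, so tolerance has to be built into the software.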

The Hadoop Ecosystem

● HDFS – distributed, fault tolerant file system
● MapReduce – framework for writing/executing distributed, fault tolerant algorithms
● Hive & Pig – SQL-like declarative languages
● Sqoop – package for moving data between HDFS and relational DB systems
● + Others…

[Diagram components: HDFS, Map/Reduce, Hive & Pig, Sqoop, Zookeeper, Avro (Serialization), HBase, ETL Tools, BI Reporting, RDBMS]

Reliable Storage is Essential

● Google GFS; Hadoop HDFS; Kosmix KFS: large distributed log structured file system that stores all types of data.

● Provides global file namespace.

● Typical usage pattern
– Huge files (100s of GB to TB)
– Data is rarely updated in place
– Reads and appends are common

● A new application coming on line can use an existing GFS cluster or create its own.

● File system can be tuned to fit individual application needs.

http://highscalability.com/google-architecture

Distributed File System

● Chunk Servers
– File is split into contiguous chunks
– Typically each chunk is 16-64MB
– Each chunk replicated (usually 2x or 3x)
– Try to keep replicas in different racks

● Master node
– a.k.a. Name Nodes in HDFS
– Stores metadata
– Might be replicated
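The chunking and replica-placement rules above can be sketched as follows. This is a hedged illustration, not GFS or HDFS code: the 64 MB chunk size is the upper end of the slide's range, and the round-robin rack choice in `racksFor` is a hypothetical policy that only guarantees the replicas of one chunk sit in different racks.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the bookkeeping a master node keeps: chunks per file, racks per chunk. */
final class ChunkPlacement {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB

    /** Number of contiguous chunks needed to hold a file of the given size. */
    static long chunkCount(long fileSizeBytes) {
        return (fileSizeBytes + CHUNK_SIZE - 1) / CHUNK_SIZE; // ceiling division
    }

    /** Picks distinct racks for one chunk, rotating the start so load spreads out. */
    static List<Integer> racksFor(long chunkIndex, int rackCount, int replicas) {
        if (replicas > rackCount) {
            throw new IllegalArgumentException("need at least one rack per replica");
        }
        List<Integer> racks = new ArrayList<>();
        for (int r = 0; r < replicas; r++) {
            racks.add((int) ((chunkIndex + r) % rackCount));
        }
        return racks;
    }
}
```

A 1 GB file becomes 16 chunks; with 3 replicas each, losing any single rack still leaves at least two copies of every chunk.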

MapReduce

Now that you have storage, how would you manipulate this data?

● Why use MapReduce?
– Nice way to partition tasks across lots of machines.
– Handle machine failure
– Works across different application types, like search and ads.
– You can pre-compute useful data, find word counts, sort TBs of data, etc.
– Computation can automatically move closer to the IO source.

Hadoop is the Apache implementation of MapReduce

● The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

● The Apache Hadoop software library is a framework that allows:
– distributed processing of large data sets across clusters of computers using a simple programming model.
– It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
– Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop MapReduce Flow

Word Count – Distributed Solution

Stages: Input, Map, Shuffle & Sort, Reduce, Output

Input (one split per Map task)
– the quick brown fox
– the fox ate the mouse
– how now the brown cow

Map output
– Map 1: the, 1; brown, 1; fox, 1; quick, 1
– Map 2: the, 1; fox, 1; the, 1; ate, 1; mouse, 1
– Map 3: how, 1; now, 1; brown, 1; the, 1; cow, 1

Shuffle & Sort (input to each Reduce task)
– Reduce 1: brown, [1,1]; fox, [1,1]; how, [1]; now, [1]; the, [1,1,1,1]
– Reduce 2: ate, [1]; cow, [1]; mouse, [1]; quick, [1]

Output
– Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 4
– Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1

Word Count in Map-Reduce

map:

    public void map(Object key, Text value, …. ) {
        // split the input line into tokens and emit (word, 1) for each
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

reduce:

    public void reduce(Text key, Iterable<IntWritable> values, ……… ) {
        // add up the counts emitted for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
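To see the three stages without a cluster, the flow can be imitated in a single process with plain Java collections. This sketch does not use Hadoop at all; `LocalWordCount` and its methods are hypothetical stand-ins for what the framework does between the user-written map and reduce functions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Single-process imitation of the map, shuffle & sort, and reduce stages. */
final class LocalWordCount {

    /** Map stage: emit a (word, 1) pair for every token in one input split. */
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle & sort stage: group emitted values by key, keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce stage: sum the list of counts collected for one word. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    /** Runs all three stages over the splits and returns word -> total count. */
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String split : splits) {
            emitted.addAll(map(split));
        }
        Map<String, Integer> totals = new TreeMap<>();
        shuffle(emitted).forEach((word, ones) -> totals.put(word, reduce(ones)));
        return totals;
    }
}
```

In real Hadoop the map calls run on different machines and the shuffle moves data over the network; the user-visible contract is the same.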

Pig and Hive

● Pig and Hive provide a wrapper to make it easier to write MapReduce jobs.

● The raw data is stored in Hadoop's HDFS.

● These scripting languages provide
– Ease of programming.
– Optimization opportunities.
– Extensibility.

Pig is a data flow scripting language. http://pig.apache.org/

Hive is a SQL-like language. https://cwiki.apache.org/confluence/display/Hive

Other Hadoop-related projects at Apache include:

● Avro™: A data serialization system. http://avro.apache.org/

● Cassandra™: A scalable multi-master database with no single points of failure. http://cassandra.apache.org/

● Chukwa™: A data collection system for managing large distributed systems. http://incubator.apache.org/chukwa/

● HBase™: A scalable, distributed database that supports structured data storage for large tables. http://hbase.apache.org/

● Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. http://hive.apache.org/

● Mahout™: A scalable machine learning and data mining library. http://mahout.apache.org/

● Pig™: A high-level data-flow language and execution framework for parallel computation.

● ZooKeeper™: A high-performance coordination service for distributed applications. http://zookeeper.apache.org/

http://hadoop.apache.org/

http://www.apache.org/

Powered by Hadoop

● Facebook
– 1100-machine cluster with 8800 cores
– store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning

● Yahoo
– Biggest cluster: 4000 nodes
– Search Marketing, People you may know, Search Assist, and many more…

● Ebay
– 532 nodes cluster (8 * 532 cores, 5.3PB).
– Using it for Search optimization and Research

http://wiki.apache.org/hadoop/PoweredBy (more than 100 companies are listed)

Hadoop is not a relational database

● Hadoop is best suited for batch processing of large volumes of unstructured data.
– Lack of schemas
– Lack of indexes
– Lack of updates – pretty much absent!
– Not designed for joins.
– No support for integrity constraints
– Limited support for data analysis tools

But what are your data analysis needs?

Hadoop is not a Relational Database: If these are important, stick to RDBMS

● OLTP
● Data Integrity
● Data Independence
● SQL
● Ad-hoc Queries
● Complex Relationships
● Maturity and Stability

NoSQL

Do you need SQL and full relational systems? If not, consider NoSQL databases for your needs.

http://nosql-database.org/

Key-value, Tabular, Document, Graph

http://redis.io/

http://www.hypertable.org/index.html

The Key-Value In-Memory DBs

● In-memory DBs are simpler and faster than their on-disk counterparts.
● Key-value stores offer a simple interface with no schema. Really a giant, distributed hash table.
● Often used as caches for on-disk DB systems.
● Advantages:
– Relatively simple
– Practically no server to server talk.
– Linear scalability
● Disadvantages:
– Doesn’t understand data – no server side operations. The key and value are always strings.
– It’s really meant to only be a cache – no more, no less.
– No recovery, limited elasticity.
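The "giant, distributed hash table" idea can be sketched as a client that hashes each key to pick a server. This is a hypothetical toy (each "server" is just an in-memory map, and the modulo partitioning rule is an assumption), but it shows why servers never need to talk to each other and why elasticity is limited.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy partitioned key-value store: the key's hash alone decides which server owns it. */
final class PartitionedKeyValueStore {
    private final List<Map<String, String>> servers = new ArrayList<>();

    PartitionedKeyValueStore(int serverCount) {
        for (int i = 0; i < serverCount; i++) {
            servers.add(new HashMap<>());
        }
    }

    /** Client-side routing; floorMod keeps the index non-negative for negative hash codes. */
    int serverFor(String key) {
        return Math.floorMod(key.hashCode(), servers.size());
    }

    void put(String key, String value) {
        servers.get(serverFor(key)).put(key, value);
    }

    /** Returns the stored string, or null on a miss (as a cache would). */
    String get(String key) {
        return servers.get(serverFor(key)).get(key);
    }

    /** Number of entries held by one server. */
    int entriesOn(int server) {
        return servers.get(server).size();
    }
}
```

Because the server index depends on the server count, adding or removing a server changes the owner of most keys; that is one concrete form of the "limited elasticity" listed above.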

Voldemort is a distributed key-value storage system

● Data is automatically
– replicated over multiple servers.
– partitioned so each server contains only a subset of the total data.

● Data items are versioned.
● Server failure is handled transparently.
● Each node is independent of other nodes with no central point of failure or coordination.
● Support for pluggable data placement strategies to support things like distribution across data centers that are geographically far apart.
● Good single node performance: you can expect 10-20k operations per second – depending on the machines, the network, the disk system, and the data replication factor.

● Voldemort is not a relational database,
– it does not attempt to satisfy arbitrary relations while satisfying ACID properties.
– Nor is it an object database that attempts to transparently map object reference graphs.
– Nor does it introduce a new abstraction such as document-orientation.

● It is basically just a big, distributed, persistent, fault-tolerant hash table.

http://project-voldemort.com/


Tabular stores

● The original: Google’s BigTable
– Proprietary, not open source.

● The open source elephant alternative – Hadoop with HBase.
● A top level Apache Project.
● Large number of users.
● Contains a distributed file system, MapReduce, a database server (HBase), and more.
● Rack aware.

http://www.meetup.com/hbaseusergroup/

What is Google’s BigTable?

● BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.

● BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.

● It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.

● Commercial databases simply don't scale to this level and they don't work across 1000s of machines.

Document Stores

● As the name implies, these databases store documents.

● Usually schema-free. The same database can store multiple documents.

● Allow indexing based on document content.

● Prominent examples: CouchDB, MongoDB.
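A rough sketch of the two properties listed above: documents of different shapes living in the same store, and an index built from document content. The class is hypothetical and in-memory only; it does not mirror the CouchDB or MongoDB APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy schema-free document store with optional indexes on document fields. */
final class TinyDocumentStore {
    private final List<Map<String, Object>> documents = new ArrayList<>();
    // field name -> (field value -> positions of the documents holding that value)
    private final Map<String, Map<Object, List<Integer>>> indexes = new HashMap<>();

    /** Stores a document of any shape; no schema is declared up front. */
    void insert(Map<String, ?> document) {
        documents.add(new HashMap<String, Object>(document));
        int position = documents.size() - 1;
        for (String field : indexes.keySet()) {
            addToIndex(field, position);
        }
    }

    /** Builds an index on one field, covering documents that are already stored. */
    void createIndex(String field) {
        indexes.put(field, new HashMap<>());
        for (int position = 0; position < documents.size(); position++) {
            addToIndex(field, position);
        }
    }

    /** Finds documents whose field equals the value, via the index when one exists. */
    List<Map<String, Object>> find(String field, Object value) {
        List<Map<String, Object>> matches = new ArrayList<>();
        Map<Object, List<Integer>> index = indexes.get(field);
        if (index != null) {
            for (int position : index.getOrDefault(value, new ArrayList<>())) {
                matches.add(documents.get(position));
            }
        } else {
            for (Map<String, Object> document : documents) { // no index: full scan
                if (value.equals(document.get(field))) {
                    matches.add(document);
                }
            }
        }
        return matches;
    }

    private void addToIndex(String field, int position) {
        Map<String, Object> document = documents.get(position);
        if (document.containsKey(field)) {
            indexes.get(field)
                   .computeIfAbsent(document.get(field), k -> new ArrayList<>())
                   .add(position);
        }
    }
}
```

Documents that lack an indexed field are simply absent from that index, which is what lets differently shaped documents share one store.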

Why MongoDB?

● Document-oriented
– Documents (objects) map nicely to programming language data types
– Embedded documents and arrays reduce need for joins
– Dynamically-typed (schemaless) for easy schema evolution
– No joins and no multi-document transactions for high performance and easy scalability

● High availability
– Replicated servers with automatic master failover

● Rich query language

● Easy scalability
– Automatic sharding (auto-partitioning of data across servers)
– Eventually-consistent reads can be distributed over replicated servers

● High performance
– No joins and embedding makes reads and writes fast
– Indexes including indexing of keys from embedded documents and arrays
– Optional streaming writes (no acknowledgements)


Mapping Systems to the CAP Theorem

Consistency: All clients have the same view of the data

Availability: Each client can always read and write

Partition Tolerance: The system works well despite physical network partitions

CP: BigTable, Hypertable, HBase, MongoDB, Terrastore, Scalaris, BerkeleyDB, MemcacheDB, Redis

CA: RDBMS (MySQL, Postgres etc.), AsterData, Greenplum, Vertica

AP: Dynamo, Voldemort, Tokyo Cabinet, KAI, Cassandra, SimpleDB, CouchDB, Riak

NoSQL Use cases: Important to align data model to the requirements

● Bigness
● Massive Write Performance
● Fast Key Value Access
● Flexible Schema and Flexible Data Types
● Schema Migration
● Write Availability
● No single point of failure
● Generally available
● Ease of programming

Mapping new Internet Data Management Technologies to the Enterprise

Enterprise data strategy is getting inclusive

From NOSQL to Not Only SQL

Open Source Rules !

Hadoop Infrastructure

What about support !

The Path to Data Stack 3.0: Must support Variety, Volume and Velocity

Data Stack 3.0: Dynamic Data Platform
– Uncovering Key Insights
– Schema-less Approach
– PBs of Data
– End User Direct Access
– Structured + Semi Structured

Data Stack 2.0: Enterprise Data Warehouse
– Support for Decision Making
– Un-normalized Dimensional Model
– TBs of Data
– End User Access Through Reports
– Structured

Data Stack 1.0: Relational Database Systems
– Recording Business Events
– Highly Normalized Data
– GBs of Data
– End User Access through Ent Apps
– Structured

Can Data Stack 3.0 Address Real Problems?

Large Data Volume at Low Price

Diverse Data beyond Structured Data

Queries that Are Difficult to Answer

Answer Queries that No One Dare Ask

How does one go about the Big Data Expedition?

PERSISTENT SYSTEMS AND BIG DATA

Persistent Systems has an experienced team of Big Data Experts that has created the technology building blocks to help you implement a Big Data Solution that offers a direct path to unlock the value in your data.

Big Data Expertise at Persistent

● 10+ projects executed with Leading ISVs and Enterprise Customers
● Dedicated group for MapReduce, Hadoop and the Big Data Ecosystem (formed 3 years ago)
● Engaged with the Big Data Ecosystem, including leading ISVs and experts
● Preferred Big Data Services Partner of IBM and Microsoft


https://www.salesforce.com/

Big Data Leadership and Contributions

● Code Contributions to Big Data Open Source Projects, including:
– Hadoop, Hive, and SciDB
● Dedicated Hadoop cluster in Persistent
● Created PeBAL – Persistent Big Data Analytics Library
● Created Visual Programming Environment for Hadoop
● Created Data Connectors for Moving Data
● Pre-built Solutions to Accelerate Big Data Projects

http://www.scidb.org/


Persistent’s Big Data Offerings

1. Setting up and Maintaining Big Data Platform
2. Data Analytics on Big Data Platform
3. Building Applications on Big Data

[Diagram, grouped under Technology Assets, People Assets and Methodology:]

– Foundational Infrastructure and Platform (Built Upon Selected 3rd Party Big Data Platforms and Technologies; Cluster of Commodity Hardware)
– Persistent Platform Enhancement IP (PeBAL Analytics Library, Data Connectors)
– Persistent Pre-built Horizontal Solutions (Email, Text, IT Analytics, …)
– Persistent Pre-built Industry Solutions: Retail, Banking, Telco
– Visual Programming Tools
– Big Data Custom Services
– Extension of Your Team
– Discovery Workshop
– Training for Your Team
– Team Formation Process
– Cluster Sizing/Config

Persistent Next Generation Data Architecture

[Diagram legend: Commercial/Open Source Product, Persistent IP, External Data Source]

[Diagram components: data sources (Email Server, IBM Tivoli, BBCA, Web Proxy, Twitter, Facebook, DW, NoSQL, RDBMS, Data Warehouse); Connector Framework and Social Media Connector; MapReduce and HDFS; PIG/JAQL, Hive, Text Analytics/GATE/SystemT; Persistent Analytics Library (PeBAL) with Graph Fn, Set Fn, … Text Analytics Fn; Solutions; Cluster Monitoring, Admin App, Workflow, Integration; BI Tools, Reports & Alerts]

Persistent Big Data Analytics Library

WHY PEBAL
• Lots of common problems – not all of them are solved in Map Reduce
• PigLatin, Hive, JAQL are languages and not libraries – something is needed to run on top that is not tied to SQL-like interfaces

BENEFITS OF A READY MADE SOLUTION
• Proven – well written and tested
• Reuse across multiple applications
• Quicker implementation of map reduce applications
• High performance

FEATURES
• Organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms.
• PeBAL functions are schema agnostic.
• All PeBAL functions are tried and tested against well defined use cases.

Graph

Set

Text Analytics

Inverted Lists

Web Analytics

Statistics

Visual Programming Environment

ADOPTION BARRIERS
• Steep Learning Curve
• Difficult to Code
• Ad-hoc reporting can’t always be done by writing programs
• Limited tooling available

VISUAL PROGRAMMING ENVIRONMENT
• Use Standard ETL tool as the UI environment for generating PIG scripts

BENEFITS
• ETL Tools are widely used in Enterprises
• Can leverage large pool of skilled people who are experts in ETL and BI tools
• UI helps in iterative and rapid data analysis
• More people will start using it

Visual Programming Environment for Hadoop

[Diagram components: Data Sources, ETL Tool (Data Flow UI), Persistent IP (PIG Convertor, PIG UDF Library), PIG code, Big Data Platform (HDFS, HDFS/Hive); data and metadata flow between them.]

Persistent Connector Framework

OUT OF THE BOX
• Database, Data Warehouse
• Microsoft Exchange
• Web proxy
• IBM Tivoli
• BBCA
• Generic Push connector for *any* content

FEATURES
• Bi-directional connector (as applicable)
• Supports Push/Pull mechanism
• Stores data on HDFS in an optimized format
• Supports masking of data

WHY CONNECTOR FRAMEWORK
• Pluggable Architecture

20+Years

Persistent Data Connectors

Persistent’s Breadth of Big Data Capabilities

Horizontal and Vertical Pre-built Solutions

Big Data Platform (PeBAL) analytics libraries and Connectors

IT Management

Big Data Application Programming

Distributed File Systems

Cluster Layer

Tooling

• RDBMS/DWH to import/export data

• Text Analytics libraries

• Data Visualization using Web2.0 and reporting tools - Cognos, MicroStrategy

• Ecosystem tools like - Nutch, Katta, Lucene

• Job configuration, management and monitoring with BigInsights’ job scheduler (MetaTracker)

• Job failure and recovery management

• Deep JAQL expertise - JAQL Programming, Extending JAQL using UDFs, Integration of third party tools/libraries, Performance tuning, ETL using JAQL

• Expertise in MR programming - PIG, Hive, Java MR

• Deep expertise in analytics - Text Analytics - IBM’s text extraction solution (AQL + SystemT)

• Statistical Analytics - R, SPSS, BigInsights Integration with R

• HDFS

• IBM GPFS

• Platform Setup on multi-node clusters, monitoring, VM based setup

• Product Deployment

Persistent IP for Big Data Solutions

Big Data Platform Components

Persistent Roadmap to Big Data

1. Learn: Discover and Define Use Cases

2. Initiate: Validate with a POC

3. Scale: Upgrade to Production if Successful

4. Measure: Measure Effectiveness and Business Value

5. Manage: Improve Knowledge Base and Shared Big Data Platform

Customer Analytics

Identifying your most influential customers?

Targeting influential customers is the best way to improve campaign ROI!

● Build a social graph of all customers
● Overlay sales data on the graph
● Identify influential customers using network analysis
● Target these customers for promotions

70 million customers

> 1 billion transactions over twenty years

Few thousand influential customers
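The four steps above can be sketched on a toy graph. The talk does not say which network-analysis measure was used, so the scoring rule here (a customer's own sales plus the sales of their direct connections) is purely an assumption for illustration, as are all class and method names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy influence ranking: a social graph with sales data overlaid on its nodes. */
final class InfluenceRanking {
    private final Map<String, Set<String>> graph = new HashMap<>();
    private final Map<String, Double> sales = new HashMap<>();

    /** Step 1: build the social graph (undirected edge between two customers). */
    void connect(String a, String b) {
        graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Step 2: overlay sales data on the graph. */
    void recordSale(String customer, double amount) {
        sales.merge(customer, amount, Double::sum);
        graph.computeIfAbsent(customer, k -> new HashSet<>());
    }

    /** Step 3 (assumed measure): own sales plus the sales of direct connections. */
    double score(String customer) {
        double total = sales.getOrDefault(customer, 0.0);
        for (String neighbour : graph.getOrDefault(customer, Collections.emptySet())) {
            total += sales.getOrDefault(neighbour, 0.0);
        }
        return total;
    }

    /** Step 4: the customers to target, highest score first, ties broken by name. */
    List<String> topInfluencers(int n) {
        List<String> customers = new ArrayList<>(graph.keySet());
        customers.sort((a, b) -> {
            int byScore = Double.compare(score(b), score(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(customers.subList(0, Math.min(n, customers.size())));
    }
}
```

At the scale quoted on the slide (70 million customers, over a billion transactions) the same computation would be expressed as MapReduce jobs rather than in-memory maps.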

Overview of Email Analytics

● Key Business Needs
– Ensure compliance with respect to a variety of business and IT communications and information sharing guidelines.
– Provide an ongoing analysis of customer sentiment through email communications.

● Use Cases
– Quickly identify if there has been an information breach or if the information is being shared in ways that are not in compliance with organizational guidelines.
– Identify if a particular customer is not being appropriately managed.

● Benefits
– Ability to proactively manage email analytics and communications across the organization in a cost-effective way.
– Reduce the response time to manage a breach and proactively address issues that emerge through ongoing analysis of email.

Using Email to Analyze Customer Sentiment

Sense the mood of your customers through their emails

Carry out detailed analysis on customer team interactions and response times

Analyzing Prescription Data

1.5 million patients are harmed by medication errors every year

Identifying erroneous prescriptions can save lives! Source: Center for Medication Safety & Clinical Improvement

Overview of IT Analytics

● Key Business Needs
– Troubleshooting issues in the world of advanced and cloud based systems is highly complex, requiring analysis of data from various systems.
– Information may be in different formats, locations, granularity, data stores.
– System outages have a negative impact on short-term revenue, as well as long-term credibility and reliability.
– The ability to quickly identify if a particular system is unstable and take corrective action is imperative.

● Use Cases
– Identify security threats and isolate the corresponding external factors quickly.
– Identify if an email server is unstable, determine the priority and take preventative action before a complete failure occurs.

● Benefits
– Reduced maintenance cost
– Higher reliability and SLA compliance

Consumer Insight from Social Media

Find out what customers are saying about your organization or product in social media

Insights for Satyamev Jayate – Variety of sources

Web/TV viewer inputs: Response to Pledge, multiple choice questions, Web, emails, IVR/Calls, individual blogs, social widgets, videos…

Channels: IVR; SMS; Web, Social Media (structured); Social Media (unstructured)

1. Structured Analysis
Responses to Pledge, multiple choice questions

2. Unstructured Analysis
Responses to following questions
• Share your story
• Ask a question to Aamir
• Send a message of hope
• Share your solution
Content Filtering Rating Tagging System (CFRTS): L0, L1, L2 phased analytics

3. Impact Analysis
Crawling general internet for measuring the before & after scenario on a particular topic

Rigorous Weekly Operation Cycle producing instant analytics

Killer combo of Human+Software to analyze the data efficiently

● Topic opens on Sunday
● Live Analytics report is sent during the show
● Data capture from SMS, phone calls, social media, website
● System runs L0 Analysis, L1, L2 Analysts continue
● JSONs are created for the external and internal dashboards
● Featured content is delivered thrice a day all throughout the week
● Episode Tags are refined and messages are re-ingested for another pass

Thank you

Anand Deshpande ([email protected])
http://in.linkedin.com/in/ananddeshpande

Persistent Systems Limited
http://www.persistentsys.com/

Enterprise Value is Shifting to Data

[Chart: value layers over time (1975, 1985, 1995, 2006, 2013): Mainframe, Operating Systems, Database, ERP, Apps, Data, with a “Line of Diminishing Value”.]
