Big Data is a Big Scam Most of the Time! (MySQL Connect Keynote 2012)
From the Big Data keynote at InCSIghts 2012
![Page 1: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/1.jpg)
10 April 2023 1
BIG DATA Defined: Data Stack 3.0
Anand Deshpande, Persistent Systems, December 2012
![Page 2: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/2.jpg)
Congratulations to the Pune Chapter
Best Chapter Award at CSI 2012 Kolkata
![Page 3: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/3.jpg)
COMAD 2012 14-16 December
Pune
Coming to India
Delhi 2016
![Page 4: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/4.jpg)
The Data Revolution is Happening Now
The growing need for large-volume, multi-structured “Big Data” analytics, as well as … “Fast Data”, have positioned the industry at the cusp of the most radical revolution in database architectures in 20 years.
We believe that the economics of data will increasingly drive competitive advantage.
Source: Credit Suisse Research, Sept 2011
![Page 5: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/5.jpg)
Organizational leaders want analytics to exploit their growing data and computational power to get smart, and get innovative, in ways they never could before.
Source: MIT Sloan Management Review, The New Intelligent Enterprise. Big Data, Analytics and the Path From Insights to Value, by Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz, December 21, 2010
What Data Can Do For You
![Page 6: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/6.jpg)
Source: New York Times, September 2, 2009. Tesco, British Grocer, Uses Weather to Predict Sales, by Julia Werdigier. http://www.nytimes.com/2009/09/02/business/global/02weather.html
Britain often conjures images of unpredictable weather, with downpours sometimes followed by sunshine within the same hour — several times a day.
Such randomness has prompted Tesco, the country’s largest grocery chain, to create…its own software that calculates how shopping patterns change “for every degree of temperature and every hour of sunshine.”
Determining Shopping Patterns
British Grocer Tesco Uses Big Data by Applying Weather Results to Predict Demand and Increase Sales
![Page 7: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/7.jpg)
GlaxoSmithKline is aiming to build direct relationships with 1 million consumers in a year using social media as a base for research and multichannel marketing. Targeted offers and promotions will drive people to particular brand websites where external data is integrated with information already held by the marketing teams.
Source: Big data: Embracing the elephant in the room By Steve Hemsley http://www.marketingweek.co.uk/big-data-embracing-the-elephant-in-the-room/3030939.article
Tracking Customers in Social Media
GlaxoSmithKline Uses Big Data to Efficiently Target Customers
![Page 8: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/8.jpg)
What does India Think?
Persistent enabled Aamir Khan Productions and Star Plus to use Big Data to learn how people react to some of the most excruciating social issues. http://www.satyamevjayate.in/
Satyamev Jayate, Aamir Khan’s pioneering, interactive socio-cultural TV show, has caught the interest of the entire nation. It has already generated ~7.5M responses in 4 weeks over SMS, Facebook, Twitter, phone calls and discussion forums from viewers across the world. This data is being analyzed and delivered in real time to allow the producers to understand the pulse of the viewers, to gauge the appreciation for the show and, most importantly, to spread the message. Harnessing the truth from all this data is a key component of the show’s success.
![Page 9: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/9.jpg)
![Page 10: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/10.jpg)
WE ALREADY HAVE DATABASES. WHY DO WE NEED TO DO ANYTHING DIFFERENT?
![Page 11: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/11.jpg)
● Transaction processing capabilities ideally suited for transaction-oriented operational stores.
● Data types – numbers, text, etc.
● SQL as the query language
● De-facto standard as the operational store for ERP and mission critical systems.
● Interface through application programs and query tools
Relational Database Systems for Operational Store
Data Stack 1.0
![Page 12: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/12.jpg)
Data Stack 1.0: Online Transactions Processing (OLTP)
● High throughput for transactions (writes).
● Focus on reliability – ACID Properties.
● Highly normalized Schema.
● Interface through application programs and query tools
Data Stack 1.0
![Page 13: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/13.jpg)
● Operational data stores store on-line transactions – Many writes, some reads.
● Large fact table, multiple dimension tables
● Schema has a specific pattern – star schema
● Joins are also very standard and create cubes
● Queries focus on aggregates.
● Users access data through tools such as Cognos, Business Objects, Hyperion etc.
Data Stack 2.0: Enterprise Data Warehouse for Decision Support
Data Stack 2.0
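The star-schema pattern on this slide can be sketched in plain Java: a fact table of sales rows refers to a dimension table by key, and a decision-support query joins the two and aggregates a measure. The class, field names and figures below (`Sale`, `storeRegion`, `revenue`) are hypothetical and only illustrate the shape of such a query; a real warehouse would express it in SQL or an OLAP tool.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal star-schema sketch: one fact table, one dimension table.
// All names and figures are made up for illustration.
class StarSchemaSketch {
    // Fact row: a foreign key into the store dimension plus a numeric measure.
    static final class Sale {
        final int storeId;
        final double revenue;

        Sale(int storeId, double revenue) {
            this.storeId = storeId;
            this.revenue = revenue;
        }
    }

    // Join each fact row to its dimension row, then aggregate revenue per region.
    static Map<String, Double> revenueByRegion(List<Sale> facts, Map<Integer, String> storeRegion) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sale sale : facts) {
            String region = storeRegion.get(sale.storeId); // the "join"
            totals.merge(region, sale.revenue, Double::sum); // the aggregate
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<Integer, String> storeRegion = Map.of(1, "North", 2, "South", 3, "North");
        List<Sale> facts = List.of(new Sale(1, 100.0), new Sale(2, 50.0), new Sale(3, 25.0));
        System.out.println(revenueByRegion(facts, storeRegion)); // {North=125.0, South=50.0}
    }
}
```

The query reads many fact rows and returns a few aggregated numbers, which is the opposite of the many-writes, few-reads profile of Data Stack 1.0.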
![Page 14: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/14.jpg)
Data Stack 2.0: Enterprise Data Warehouse
[Diagram: Data Store, ETL, Data Staging, Data Warehouse and OLAP, serving the User through Reports & Ad hoc Analysis, Alerts & Dashboards, What-if Analysis, EPM, Predictive Analytics and Data Visualization]
![Page 15: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/15.jpg)
Standard Enterprise Data Architecture
[Diagram: two stacks]
Data Stack 1.0: Operational Data Systems – Presentation Layer; Application Logic; Relational Databases
Data Stack 2.0: Enterprise Data Warehouse Systems – Sources (Relational Databases, Legacy Data, Purchased Data, ERP Systems); Extraction, Cleansing (ETL); Optimized Loader; Data Warehouse Engine; Metadata Repository; Analyze, Query
![Page 16: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/16.jpg)
Who are the players

| | Oracle | Microsoft | IBM | SAP | Open Source | SAS | Pure Play |
|---|---|---|---|---|---|---|---|
| ETL | Oracle Data Integrator | SQL Server Integration Service (SSIS) | IBM Infosphere DataStage | Business Objects Data Integrator | Kettle | Enterprise Data Integration Server | Informatica Powercenter |
| DWH | Oracle 11g/Exadata | Parallel Data Warehouse (PDW) | Netezza (Pure Data) | Sybase iQ | Postgres/MySQL | | Teradata, Greenplum (EMC) |
| OLAP | Hyperion/Essbase | SQL Server Analysis Services (SSAS) | Cognos Powerplay | SAP Hana | Mondrian | OLAP Viewer | |
| Reporting | Oracle BI (OBIEE) & Exalytics | SQL Server Reporting Services (SSRS) | Cognos BI | Business Objects, BO Dashboard Builder | BIRT, Pentaho, Jasper | Enterprise Guide, Web Report Studio | MicroStrategy, Qliktech, Tableau |
| Predictive Analytics | Oracle Data Mining (ODM) | SQL Server Data Mining (SSDM) | SPSS | SAP Hana + R | R/Weka | SAS Enterprise Miner | |
![Page 17: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/17.jpg)
One in two business executives believe that they do not have sufficient information across their organization to do their job
Source: IBM Institute for Business Value
Despite the two data stacks ..
![Page 18: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/18.jpg)
Data has Variety: it doesn’t fit
Less than 40% of the Enterprise Data makes its way to Data Stack 1.0 or Data Stack 2.0.
![Page 19: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/19.jpg)
Beyond the Operational Systems, data required for decision making is scattered within and beyond the enterprise
ERP Systems
CRM Systems
Enterprise Data Warehouse
Structured Data Sources
Email Systems
Collaboration/Wiki Sites
Document Repositories
Project artifacts
Employee Surveys
Customer Call Center Records
Unstructured Data Sources
Organizational Workflow
Sensor Data
Cloud Data Sources
CRM Systems
Expense Management System
Vendor Collaboration Systems
Supply Chain Systems
Location and Presence Data
Public Data Sources
Weather forecasts
Demographic Data
Maps
Economic Data
Social Networking Data
Twitter Feeds
![Page 20: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/20.jpg)
“5 Exabytes of information was created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing”
Eric Schmidt, at the Techonomy Conference, August 4, 2010
(1 exabyte = 10^18 bytes)
Data Volumes are Growing
![Page 21: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/21.jpg)
The Continued Explosion of Data in the Enterprise and Beyond
80% of new information growth is unstructured content; 90% of that is currently unmanaged.
[Chart: Data and Content Over Coming Decade, 1990 to 2020. 2009: 800,000 petabytes; 2020: 35 zettabytes, 44x as much]
Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
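The 44x figure follows from the two endpoints on the slide, using 1 zettabyte = 10^6 petabytes. The second line is an added back-of-envelope step, not something the slide states: it assumes steady compound growth over the 11 years from 2009 to 2020.

```latex
\[
\frac{35\ \text{ZB}}{800{,}000\ \text{PB}}
  = \frac{35 \times 10^{6}\ \text{PB}}{0.8 \times 10^{6}\ \text{PB}}
  = 43.75 \approx 44
\]
\[
44^{1/11} \approx 1.41
  \quad\text{(roughly 41\% growth per year)}
\]
```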
![Page 22: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/22.jpg)
What comes first -- Structure or data?
Schema/Structure
Data
Structure First is Constraining
![Page 23: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/23.jpg)
Time to create a new data stack for unstructured data.
Data Stack 3.0.
![Page 24: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/24.jpg)
Time-out!
Internet companies have already addressed the same problems.
![Page 25: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/25.jpg)
● Twitter has 140 million active users and more than 400 million tweets per day.
● Facebook has over 900 million active users, who generate an average of 3.2 billion Likes and Comments per day.
● 3.1 billion email accounts in 2011, expected to rise to over 4 billion by 2015.
● There were 2.3 billion internet users (2,279,709,629) worldwide in the first quarter of 2012, according to Internet World Stats data updated 31st March 2012.
Internet Companies have to deal with large volumes of unstructured real-time data.
![Page 26: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/26.jpg)
● Hosted service
● Large cluster (1000s of nodes) of low-cost commodity servers.
● Very large amounts of data -- indexing billions of documents, video, images etc.
● Batch updates.
● Fault tolerance.
● Hundreds of millions of users.
● Billions of queries every day.
Their data loads and pricing requirements do not fit traditional relational systems
![Page 27: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/27.jpg)
● It is the platform that distinguishes them from everyone else.
● They required:
– high reliability across data centers
– scalability to thousands of network nodes
– huge read/write bandwidth
– support for large blocks of data which are gigabytes in size
– efficient distribution of operations across nodes to reduce bottlenecks
Relational databases were not suitable and would have been cost prohibitive.
They built their own systems
![Page 28: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/28.jpg)
Companies have created business models to support and enhance this software.
Internet Companies have open-sourced the source code they created for their own use.
![Page 29: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/29.jpg)
What did the Internet Companies build? And how did they get there?
They started with a clean slate!
![Page 30: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/30.jpg)
Do we need ..
● transaction support?
● rigid schemas?
● joins?
● SQL?
● on-line, live updates?
Must have
● Scale
● Ability to handle unstructured data
● Ability to process large volumes of data without having to start with structure first.
● Leverage distributed computing
What features from the relational database can be compromised?
![Page 31: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/31.jpg)
For the internet workload, with distributed computing, ACID properties are too strong.
Rethinking ACID properties
Atomicity Consistency Isolation Durability
Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state -- BASE.
Basic Availability Soft-state Eventual consistency
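A toy model may make the BASE idea concrete. In the sketch below a write is acknowledged as soon as one replica has it, a read from the other replica can briefly return stale data, and a later propagation step brings the replicas into agreement. The class and method names are hypothetical, and the single pending queue is a deliberate simplification rather than how any particular product replicates.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of eventual consistency between two replicas.
// Not modeled on any particular database.
class EventualConsistencySketch {
    private final Map<String, String> replicaA = new HashMap<>();
    private final Map<String, String> replicaB = new HashMap<>();
    // Updates accepted by replica A but not yet applied to replica B.
    private final Queue<String[]> pending = new ArrayDeque<>();

    // The write returns as soon as replica A has it (basic availability).
    void write(String key, String value) {
        replicaA.put(key, value);
        pending.add(new String[] {key, value});
    }

    // Reads served by replica B may be stale (soft state).
    String readFromB(String key) {
        return replicaB.get(key);
    }

    // Background propagation; afterwards both replicas agree (eventual consistency).
    void propagate() {
        while (!pending.isEmpty()) {
            String[] update = pending.remove();
            replicaB.put(update[0], update[1]);
        }
    }

    public static void main(String[] args) {
        EventualConsistencySketch db = new EventualConsistencySketch();
        db.write("cart:42", "book");
        System.out.println(db.readFromB("cart:42")); // null: replica B has not caught up yet
        db.propagate();
        System.out.println(db.readFromB("cart:42")); // book
    }
}
```

An ACID system would not return from `write` until every reader was guaranteed to see the new value; relaxing that guarantee is what buys availability at internet scale.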
![Page 32: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/32.jpg)
● Consistent – Reads always pick up the latest write.
● Available – can always read and write.
● Partition tolerant – The system can be split across multiple machines and datacenters
Can do at most two of these three.
Brewer’s CAP Theorem for Distributed Systems
[Diagram: triangle with corners Consistency, Availability and Partition Tolerance; the pairwise combinations are labeled CA, CP and AP]
![Page 33: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/33.jpg)
Essential Building Blocks for Internet Data Systems
Hadoop Distributed File System (HDFS)
Hadoop Map-Reduce Layer
Cluster
Map Reduce Jobs (Developers)
Job Tracker
![Page 34: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/34.jpg)
“For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy” - Jeremy Zawodny @Yahoo!
![Page 35: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/35.jpg)
● Cheap nodes fail, especially if you have many
– Mean time between failures for 1 node = 3 years
– Mean time between failures for 1000 nodes = 1 day
– Solution: Build fault-tolerance into system
● Commodity network = low bandwidth
– Solution: Push computation to the data
● Programming distributed systems is hard
– Solution: Data-parallel programming model: users write “map” & “reduce” functions, system distributes work and handles faults
Challenges with Distributed Computing
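The failure figures on this slide are simple division, sketched below. The calculation assumes that node failures are independent and that every node has the same three-year mean time between failures; the class and method names are hypothetical.

```java
// Back-of-envelope failure arithmetic for a cluster of identical nodes.
// Assumes independent failures, which real clusters only approximate.
class ClusterFailureSketch {
    // Expected days between failures somewhere in the cluster.
    static double daysBetweenFailures(double nodeMtbfDays, int nodes) {
        return nodeMtbfDays / nodes;
    }

    public static void main(String[] args) {
        double threeYearsInDays = 3 * 365;
        // Roughly one failure per day with 1000 nodes.
        System.out.printf("%.1f days%n", daysBetweenFailures(threeYearsInDays, 1000));
    }
}
```

With failures arriving about daily, recovery has to be an ordinary code path of the system rather than an operator task, which is the case for building fault tolerance into the platform.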
![Page 36: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/36.jpg)
The Hadoop Ecosystem
● HDFS – distributed, fault tolerant file system
● MapReduce – framework for writing/executing distributed, fault tolerant algorithms
● Hive & Pig – SQL-like declarative languages
● Sqoop – package for moving data between HDFS and relational DB systems
● + Others…
[Diagram: HDFS, Map/Reduce, Hive & Pig, Sqoop, ZooKeeper, Avro (Serialization), HBase, ETL Tools, BI Reporting, RDBMS]
![Page 37: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/37.jpg)
● Google GFS; Hadoop HDFS; Kosmix KFS
● Large distributed log structured file system that stores all types of data.
● Provides global file namespace
● Typical usage pattern
– Huge files (100s of GB to TB)
– Data is rarely updated in place
– Reads and appends are common
● A new application coming on line can use an existing GFS cluster or create its own.
● File system can be tuned to fit individual application needs.
Reliable Storage is Essential
http://highscalability.com/google-architecture
![Page 38: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/38.jpg)
● Chunk Servers
– File is split into contiguous chunks
– Typically each chunk is 16-64MB
– Each chunk replicated (usually 2x or 3x)
– Try to keep replicas in different racks
● Master node
– a.k.a. NameNode in HDFS
– Stores metadata
– Might be replicated
Distributed File System
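The chunking and replica placement described on this slide can be sketched as follows. The 64 MB chunk size and 3x replication are taken from the ranges on the slide; the round-robin rack choice is an assumption made for illustration and is not the actual GFS or HDFS placement algorithm. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of splitting a file into fixed-size chunks and spreading each
// chunk's replicas across racks. Placement policy is a simplification.
class ChunkPlacementSketch {
    static final long CHUNK_BYTES = 64L * 1024 * 1024;
    static final int REPLICAS = 3;

    // Number of chunks needed to hold a file of the given size (ceiling division).
    static long chunkCount(long fileBytes) {
        return (fileBytes + CHUNK_BYTES - 1) / CHUNK_BYTES;
    }

    // Choose distinct racks for one chunk by walking the rack list round-robin.
    static List<Integer> racksFor(long chunkIndex, int rackCount) {
        List<Integer> racks = new ArrayList<>();
        for (int r = 0; r < Math.min(REPLICAS, rackCount); r++) {
            racks.add((int) ((chunkIndex + r) % rackCount));
        }
        return racks;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(chunkCount(oneGiB)); // 16
        System.out.println(racksFor(0, 4));     // [0, 1, 2]
        System.out.println(racksFor(3, 4));     // [3, 0, 1]
    }
}
```

In this picture the master node would hold only the mapping from file name to chunk list and from chunk to racks, while the chunk servers hold the bytes.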
![Page 39: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/39.jpg)
● Why use MapReduce?
– Nice way to partition tasks across lots of machines.
– Handles machine failure.
– Works across different application types, like search and ads.
– You can pre-compute useful data, find word counts, sort TBs of data, etc.
– Computation can automatically move closer to the IO source.
Now that you have storage, how would you manipulate this data?
MapReduce
![Page 40: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/40.jpg)
● The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.
● The Apache Hadoop software library is a framework that allows:
– distributed processing of large data sets across clusters of computers using a simple programming model.
– It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
– Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Hadoop is the Apache implementation of MapReduce
![Page 41: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/41.jpg)
Hadoop MapReduce Flow
![Page 42: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/42.jpg)
Word Count – Distributed Solution
Stages: Input → Map → Shuffle & Sort → Reduce → Output
Input (three splits):
– the quick brown fox
– the fox ate the mouse
– how now the brown cow
Map output:
– Map 1: the, 1; brown, 1; fox, 1; quick, 1
– Map 2: the, 1; fox, 1; the, 1; ate, 1; mouse, 1
– Map 3: how, 1; now, 1; brown, 1; the, 1; cow, 1
Shuffle & Sort:
– Reduce 1: brown, [1,1]; fox, [1,1]; how, [1]; now, [1]; the, [1,1,1,1]
– Reduce 2: ate, [1]; cow, [1]; mouse, [1]; quick, [1]
Output:
– Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 4
– Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1
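The three stages of this flow can be walked through in ordinary Java without a cluster. The sketch below is not the Hadoop API; it runs map, shuffle and reduce as plain method calls in one process. The wording of the third input split is garbled in the transcript, so the sketch assumes "how now the brown cow", which reproduces the counts shown on the slide. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory walk-through of map, shuffle & sort, and reduce for word count.
// Plain Java only; a real job would distribute these phases across nodes.
class WordCountFlowSketch {
    // Map phase: emit a (word, 1) pair for every token in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle & sort: group emitted values by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of ones collected for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String split : splits) {
            emitted.addAll(map(split));
        }
        return reduce(shuffle(emitted));
    }

    public static void main(String[] args) {
        List<String> splits = List.of(
                "the quick brown fox",
                "the fox ate the mouse",
                "how now the brown cow"); // third split is an assumption
        System.out.println(wordCount(splits));
        // {ate=1, brown=2, cow=1, fox=2, how=1, mouse=1, now=1, quick=1, the=4}
    }
}
```

Because `map` looks at one split at a time and `reduce` at one key at a time, each call could run on a different machine, which is what lets the framework distribute the work.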
![Page 43: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/43.jpg)
Word Count in Map-Reduce
map
// Emit (word, 1) for every token in the input line.
public void map(Object key, Text value, …. ) {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
    }
}
reduce
// Sum the counts collected for one word and emit (word, total).
public void reduce(Text key, Iterable<IntWritable> values, ……… ) {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}
![Page 44: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/44.jpg)
● Pig and Hive provide a wrapper to make it easier to write MapReduce jobs.
● The raw data is stored in Hadoop's HDFS.
● These scripting languages provide
– Ease of programming.
– Optimization opportunities.
– Extensibility.
Pig and Hive
Pig is a data flow scripting language
Hive is a SQL-like language
![Page 45: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/45.jpg)
● Avro™: A data serialization system.
● Cassandra™: A scalable multi-master database with no single points of failure.
● Chukwa™: A data collection system for managing large distributed systems.
● HBase™: A scalable, distributed database that supports structured data storage for large tables.
● Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
● Mahout™: A scalable machine learning and data mining library.
● Pig™: A high-level data-flow language and execution framework for parallel computation.
● ZooKeeper™: A high-performance coordination service for distributed applications.
Other Hadoop-related projects at Apache include:
http://hadoop.apache.org/
![Page 46: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/46.jpg)
● Facebook
– 1100-machine cluster with 8800 cores
– Stores copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning
● Yahoo
– Biggest cluster: 4000 nodes
– Search Marketing, People you may know, Search Assist, and many more…
● eBay
– 532-node cluster (8 * 532 cores, 5.3PB).
– Used for search optimization and research
Powered by Hadoop: http://wiki.apache.org/hadoop/PoweredBy (more than 100 companies are listed)
![Page 47: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/47.jpg)
● Hadoop is best suited for batch processing of large volumes of unstructured data.
– Lack of schemas
– Lack of indexes
– Lack of updates – pretty much absent!
– Not designed for joins.
– No support for integrity constraints
– Limited support for data analysis tools
Hadoop is not a relational database
But what are your data analysis needs?
![Page 48: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/48.jpg)
Hadoop is not a Relational Database: If these are important, stick to RDBMS
OLTP
Data Integrity
Data Independence
SQL
Ad-hoc Queries
Complex Relationships
Maturity and Stability
![Page 49: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/49.jpg)
Do you need SQL and full relational systems?
If not, consider NoSQL databases for your needs
NOSQL
http://nosql-database.org/
Key-value Tabular Document Graph
![Page 50: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/50.jpg)
The Key-Value In-Memory DBs
● In memory DBs are simpler and faster than their on-disk counterparts.
● Key value stores offer a simple interface with no schema. Really a giant, distributed hash table.
● Often used as caches for on-disk DB systems.
● Advantages:
– Relatively simple
– Practically no server to server talk.
– Linear scalability
● Disadvantages:
– Doesn’t understand data – no server side operations. The key and value are always strings.
– It’s really meant to only be a cache – no more, no less.
– No recovery, limited elasticity.
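The "giant, distributed hash table" description can be sketched with client-side hashing: the client hashes the key to pick a server, so the servers never need to talk to each other. In the sketch each "server" is just a local `HashMap`, and keys and values are strings as the slide says. The class and method names are hypothetical and no real cache product is modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a key-value cache spread over several servers by hashing the key.
// Each server is simulated by an in-process map.
class ShardedCacheSketch {
    private final List<Map<String, String>> servers = new ArrayList<>();

    ShardedCacheSketch(int serverCount) {
        for (int i = 0; i < serverCount; i++) {
            servers.add(new HashMap<>());
        }
    }

    // The same key always maps to the same server index.
    int serverFor(String key) {
        return Math.floorMod(key.hashCode(), servers.size());
    }

    void put(String key, String value) {
        servers.get(serverFor(key)).put(key, value);
    }

    // Returns null on a cache miss.
    String get(String key) {
        return servers.get(serverFor(key)).get(key);
    }

    public static void main(String[] args) {
        ShardedCacheSketch cache = new ShardedCacheSketch(4);
        cache.put("session:17", "alice");
        System.out.println(cache.get("session:17")); // alice
        System.out.println(cache.get("session:99")); // null
    }
}
```

Plain modulo hashing also shows where the "limited elasticity" comes from: changing the number of servers changes `serverFor` for most keys, so most cached entries are effectively lost.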
![Page 51: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/51.jpg)
● Data is automatically
– replicated over multiple servers.
– partitioned so each server contains only a subset of the total data
● Data items are versioned
● Server failure is handled transparently
● Each node is independent of other nodes with no central point of failure or coordination
● Support for pluggable data placement strategies to support things like distribution across data centers that are geographically far apart.
● Good single node performance: you can expect 10-20k operations per second
– depending on the machines, the network, the disk system, and the data replication factor
● Voldemort is not a relational database,
– it does not attempt to satisfy arbitrary relations while satisfying ACID properties.
– Nor is it an object database that attempts to transparently map object reference graphs.
– Nor does it introduce a new abstraction such as document-orientation.
● It is basically just a big, distributed, persistent, fault-tolerant hash table.
Voldemort is a distributed key-value storage system
http://project-voldemort.com/
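Partitioning, replication and transparent handling of a server failure can be illustrated together in a small model. The sketch below is not Voldemort's API or its actual placement strategy; it places each key on a few consecutive nodes of a ring and lets any live replica answer a read. Versioning is left out, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy partitioned and replicated key-value store, simulated in one process.
class ReplicatedStoreSketch {
    private final List<Map<String, String>> nodes = new ArrayList<>();
    private final boolean[] down;
    private final int replicas;

    // Assumes replicas <= nodeCount so each key lands on distinct nodes.
    ReplicatedStoreSketch(int nodeCount, int replicas) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
        this.down = new boolean[nodeCount];
        this.replicas = replicas;
    }

    // Partitioning: a key lives on `replicas` consecutive nodes of the ring.
    List<Integer> nodesFor(String key) {
        int start = Math.floorMod(key.hashCode(), nodes.size());
        List<Integer> owners = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            owners.add((start + i) % nodes.size());
        }
        return owners;
    }

    // Replication: write to every live owner of the key.
    void put(String key, String value) {
        for (int n : nodesFor(key)) {
            if (!down[n]) {
                nodes.get(n).put(key, value);
            }
        }
    }

    // Any live replica can answer, so one failed server is invisible to the caller.
    String get(String key) {
        for (int n : nodesFor(key)) {
            if (!down[n] && nodes.get(n).containsKey(key)) {
                return nodes.get(n).get(key);
            }
        }
        return null;
    }

    void fail(int node) {
        down[node] = true;
    }

    public static void main(String[] args) {
        ReplicatedStoreSketch store = new ReplicatedStoreSketch(5, 3);
        store.put("user:1", "alice");
        store.fail(store.nodesFor("user:1").get(0)); // take down the first owner
        System.out.println(store.get("user:1"));     // alice, served by another replica
    }
}
```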
![Page 52: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/52.jpg)
Tabular stores
● The original: Google’s BigTable
– Proprietary, not open source.
● The open source elephant alternative – Hadoop with HBase.
● A top level Apache Project.
● Large number of users.
● Contains a distributed file system, MapReduce, a database server (HBase), and more.
● Rack aware.
![Page 53: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/53.jpg)
● BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
● BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
● It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.
● Commercial databases simply don't scale to this level and they don't work across 1000s of machines.
What is Google’s BigTable?
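Access to structured data by key, without joins or SQL, can be sketched as a nested map from row key to column to value. This is a single-process illustration with hypothetical row and column names; the real BigTable also versions cells by timestamp, spreads rows across many servers and persists them on GFS, none of which is modeled here.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a tabular store: structured rows looked up by key.
// Names are invented for illustration; this is not any product's API.
class KeyedTableSketch {
    // Row key -> (column name -> value), both kept in sorted order.
    private final Map<String, Map<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // The only query is a lookup by row key and column: no joins, no SQL.
    String get(String rowKey, String column) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    public static void main(String[] args) {
        KeyedTableSketch table = new KeyedTableSketch();
        table.put("com.example/index.html", "contents", "<html>...</html>");
        table.put("com.example/index.html", "language", "en");
        System.out.println(table.get("com.example/index.html", "language")); // en
        System.out.println(table.get("com.example/missing.html", "language")); // null
    }
}
```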
![Page 54: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/54.jpg)
Document Stores
● As the name implies, these databases store documents.
● Usually schema-free. The same database can store multiple documents.
● Allow indexing based on document content.
● Prominent examples: CouchDB, MongoDB.
![Page 55: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/55.jpg)
● Document-oriented
– Documents (objects) map nicely to programming language data types
– Embedded documents and arrays reduce need for joins
– Dynamically-typed (schemaless) for easy schema evolution
– No joins and no multi-document transactions for high performance and easy scalability
● High availability
– Replicated servers with automatic master failover
● Rich query language
● Easy scalability
– Automatic sharding (auto-partitioning of data across servers)
– Eventually-consistent reads can be distributed over replicated servers
● High performance
– No joins and embedding makes reads and writes fast
– Indexes including indexing of keys from embedded documents and arrays
– Optional streaming writes (no acknowledgements)
Why MongoDB?
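The claim that embedded documents and arrays reduce the need for joins can be shown with a document modeled as nested Java maps and lists. This is plain Java rather than the MongoDB driver API, and the field names (`customer`, `items`, `qty`, `price`) are invented for illustration.

```java
import java.util.List;
import java.util.Map;

// An order document that embeds its customer and line items, so reading the
// whole order needs one lookup and no join against other collections.
class EmbeddedDocumentSketch {
    // Sum line-item totals straight from the embedded array.
    @SuppressWarnings("unchecked")
    static int orderTotal(Map<String, Object> order) {
        int total = 0;
        for (Map<String, Object> item : (List<Map<String, Object>>) order.get("items")) {
            total += (Integer) item.get("qty") * (Integer) item.get("price");
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Object> order = Map.of(
                "customer", Map.of("name", "Asha", "city", "Pune"),
                "items", List.of(
                        Map.of("sku", "A1", "qty", 2, "price", 50),
                        Map.of("sku", "B7", "qty", 1, "price", 30)));
        System.out.println(orderTotal(order)); // 130
    }
}
```

In a normalized relational design the same total would need a join between an orders table and a line-items table; embedding trades that join for some duplication of data.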
![Page 56: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/56.jpg)
Mapping Systems to the CAP Theorem
[Diagram: CAP triangle with corners C, A and P; systems are placed on the edges CA, CP and AP]
CP: BigTable, Hypertable, HBase, MongoDB, Terrastore, Scalaris, BerkeleyDB, MemcachedDB, Redis
CA: RDBMS (MySQL, Postgres etc.), AsterData, Greenplum, Vertica
AP: Dynamo, Voldemort, Tokyo Cabinet, KAI, Cassandra, SimpleDB, CouchDB, Riak
Consistency: All clients have the same view of the data
Availability: Each client can always read and write
Partition Tolerance: The system works well despite physical network partitions
![Page 57: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/57.jpg)
Bigness
Massive Write Performance
Fast Key Value Access
Flexible Schema and Flexible Data Types
Schema Migration
Write Availability
No single point of failure
Generally available
Ease of programming
NoSQL Use cases: Important to align data model to the requirements
![Page 58: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/58.jpg)
Mapping new Internet Data Management Technologies to the Enterprise
![Page 59: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/59.jpg)
Enterprise data strategy is getting inclusive
From NOSQL to Not Only SQL
![Page 60: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/60.jpg)
Open Source Rules !
Hadoop Infrastructure
![Page 61: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/61.jpg)
What about support !
![Page 62: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/62.jpg)
The Path to Data Stack 3.0: Must support Variety, Volume and Velocity
Data Stack 3.0: Dynamic Data Platform
Uncovering Key Insights
Schema less Approach
PBs of Data
End User Direct Access
Structured + Semi Structured
Data Stack 2.0: Enterprise Data Warehouse
Support for Decision Making
Un-normalized Dimensional Model
TBs of Data
End User Access Through Reports
Structured
Data Stack 1.0: Relational Database Systems
Recording Business Events
Highly Normalized Data
GBs of Data
End User Access through Ent Apps
Structured
![Page 63: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/63.jpg)
Can Data Stack 3.0 Address Real Problems?
• Large Data Volume at Low Price
• Diverse Data beyond Structured Data
• Queries that Are Difficult to Answer
• Answer Queries that No One Dare Ask
![Page 64: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/64.jpg)
How does one go about the Big Data Expedition?
![Page 65: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/65.jpg)
PERSISTENT SYSTEMS AND BIG DATA
![Page 66: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/66.jpg)
Persistent Systems has an experienced team of Big Data Experts that has created the technology building blocks to help you implement a Big Data Solution that offers a direct path to unlock the value in your data.
![Page 67: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/67.jpg)
Big Data Expertise at Persistent
● 10+ projects executed with Leading ISVs and Enterprise Customers
● Dedicated group to MapReduce, Hadoop and Big Data Ecosystem (formed 3 years ago)
● Engaged with the Big Data Ecosystem, including leading ISVs and experts
• Preferred Big Data Services Partner of IBM and Microsoft
![Page 68: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/68.jpg)
Big Data Leadership and Contributions
● Code Contributions to Big Data Open Source Projects, including:
– Hadoop, Hive, and SciDB
● Dedicated Hadoop cluster in Persistent
● Created PeBAL – Persistent Big Data Analytics Library
● Created Visual Programming Environment for Hadoop
● Created Data Connectors for Moving Data
● Pre-built Solutions to Accelerate Big Data Projects
![Page 69: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/69.jpg)
Persistent’s Big Data Offerings
1. Setting up and Maintaining Big Data Platform
2. Data Analytics on Big Data Platform
3. Building Applications on Big Data
Foundational Infrastructure and Platform (Built Upon Selected 3rd Party Big Data Platforms and Technologies; Cluster of Commodity Hardware)
Persistent Platform Enhancement IP (PeBAL Analytics Library, Data Connectors)
Persistent Pre-built Horizontal Solutions (Email, Text, IT Analytics, …)
Persistent Pre-built Industry Solution: Retail
Technology Assets
Visual Programming Tools
Persistent Pre-built Industry Solution: Banking
Persistent Pre-built Industry Solution: Telco
Big Data Custom Services
Extension of Your Team
Discovery Workshop
Training for Your Team
Team Formation Process
Cluster Sizing/Config
People Assets
Methodology
![Page 70: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/70.jpg)
Commercial/Open Source Product, Persistent IP, External Data Source
Email Server
Connector Framework
IBM Tivoli
BBCA
Web Proxy
Social Media Connector
Twitter, Facebook
Email Server
Web Proxy
DW
NoSQL
RDBMS
Data Warehouse
PIG/JAQL
Text Analytics/GATE/SystemT
Hive
Persistent Analytics Library (PeBAL)
Graph Fn, Set Fn, …, Text Analytics Fn
Solutions
MapReduce and HDFS
Cluster Monitoring
Admin App
Workflow
Integration
Connector Framework
BI Tools
Reports & Alerts
Persistent Next Generation Data Architecture
![Page 71: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/71.jpg)
Persistent Big Data Analytics Library
WHY PEBAL
• Lots of common problems – not all of them are solved in MapReduce
• PigLatin, Hive, JAQL are languages and not libraries – something is needed to run on top that is not tied to SQL-like interfaces
BENEFITS OF A READY MADE SOLUTION
• Proven – well written and tested
• Reuse across multiple applications
• Quicker implementation of MapReduce applications
• High performance
FEATURES
• Organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms.
• PeBAL functions are schema agnostic.
• All PeBAL functions are tried and tested against well defined use cases.
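To make "schema agnostic" concrete, here is a minimal sketch of an inverted-list builder written in a map/reduce style. It is plain Python rather than JAQL and is not PeBAL code; the function names (`map_tokens`, `reduce_postings`, `build_inverted_index`) and the sample records are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def map_tokens(doc_id: str, record: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Map step: emit (token, doc_id) for every string field, whatever it is called."""
    for value in record.values():
        if isinstance(value, str):
            for token in value.lower().split():
                yield token, doc_id


def reduce_postings(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Reduce step: group document ids by token into sorted, de-duplicated lists."""
    postings = defaultdict(set)
    for token, doc_id in pairs:
        postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def build_inverted_index(records: Dict[str, Dict[str, Any]]) -> Dict[str, List[str]]:
    """Run the map step over all records, then reduce the emitted pairs."""
    pairs = (pair for doc_id, rec in records.items() for pair in map_tokens(doc_id, rec))
    return reduce_postings(pairs)


# Hypothetical records with different field names; no schema is declared anywhere.
records = {
    "d1": {"subject": "Hadoop cluster sizing", "priority": 2},
    "d2": {"body": "sizing the cluster", "sender": "ops"},
}
index = build_inverted_index(records)
```

Because the map step inspects values rather than named columns, the same function works on records of any shape; non-string fields such as `priority` are simply skipped.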
![Page 72: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/72.jpg)
Graph
Set
Text Analytics
Inverted Lists
Web Analytics
Statistics
![Page 73: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/73.jpg)
Visual Programming Environment
ADOPTION BARRIERS
• Steep Learning Curve
• Difficult to Code
• Ad-hoc reporting can’t always be done by writing programs
• Limited tooling available
VISUAL PROGRAMMING ENVIRONMENT
• Use Standard ETL tool as the UI environment for generating PIG scripts
BENEFITS
• ETL Tools are widely used in Enterprises
• Can leverage large pool of skilled people who are experts in ETL and BI tools
• UI helps in iterative and rapid data analysis
• More people will start using it
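The idea of generating PIG scripts from a visual dataflow can be sketched as follows. This is not the Persistent converter; the step dictionary format and the `to_pig` function are hypothetical, and only a handful of Pig Latin operators (LOAD, FILTER, GROUP, FOREACH ... GENERATE, STORE) are covered.

```python
from typing import Dict, List


def to_pig(steps: List[Dict[str, str]]) -> str:
    """Translate a linear list of dataflow steps into Pig Latin statements."""
    lines = []
    prev = None     # alias produced by the previous step
    grouped = None  # alias that was fed into the most recent GROUP
    for i, step in enumerate(steps):
        alias = f"s{i}"
        op = step["op"]
        if op == "load":
            lines.append(
                f"{alias} = LOAD '{step['path']}' USING PigStorage(',') AS ({step['schema']});"
            )
        elif op == "filter":
            lines.append(f"{alias} = FILTER {prev} BY {step['condition']};")
        elif op == "group":
            grouped = prev  # the bag inside each group is named after the input alias
            lines.append(f"{alias} = GROUP {prev} BY {step['key']};")
        elif op == "count":
            # Assumes the previous step was a GROUP.
            lines.append(f"{alias} = FOREACH {prev} GENERATE group, COUNT({grouped});")
        elif op == "store":
            lines.append(f"STORE {prev} INTO '{step['path']}';")
            continue  # STORE does not define a new alias
        else:
            raise ValueError(f"unsupported step: {op}")
        prev = alias
    return "\n".join(lines)


# Hypothetical flow, as a drag-and-drop UI might serialize it.
flow = [
    {"op": "load", "path": "sales.csv", "schema": "customer:chararray, amount:double"},
    {"op": "filter", "condition": "amount > 100.0"},
    {"op": "group", "key": "customer"},
    {"op": "count"},
    {"op": "store", "path": "big_spenders"},
]
script = to_pig(flow)
print(script)
```

A real converter would also need to handle branching flows, joins and user-defined functions; the linear case is enough to show why an ETL-style UI can hide the scripting language from the analyst.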
![Page 74: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/74.jpg)
Visual Programming Environment for Hadoop
HDFS/Hive
HDFS
Persistent IP
Data Flow UI
PIG Convertor
HDFS
PIG UDF Library
Big Data Platform
ETL Tool
Metadata
Data Data
Metadata
Data Sources
PIG code
![Page 75: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/75.jpg)
Persistent Connector Framework
OUT OF THE BOX
• Database, Data Warehouse
• Microsoft Exchange
• Web proxy
• IBM Tivoli
• BBCA
• Generic Push connector for *any* content
FEATURES
• Bi-directional connector (as applicable)
• Supports Push/Pull mechanism
• Stores data on HDFS in an optimized format
• Supports masking of data
WHY CONNECTOR FRAMEWORK
• Pluggable Architecture
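A pluggable connector with data masking might look roughly like the sketch below. The slides do not show the framework's actual interfaces, so every name here (`Connector`, `register`, `StaticConnector`, `mask`, `ingest`) is hypothetical, and the example keeps records in memory instead of writing to HDFS.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]


class Connector(ABC):
    """A pull-style source; each plug-in only has to yield records."""

    @abstractmethod
    def pull(self) -> Iterator[Record]:
        ...


# Name -> factory; adding a source means registering one more class.
REGISTRY: Dict[str, Callable[[], Connector]] = {}


def register(name: str):
    """Class decorator that makes a connector available under a name."""
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap


@register("static")
class StaticConnector(Connector):
    """Stand-in source used only for this example."""

    def pull(self) -> Iterator[Record]:
        yield {"sender": "user-001", "subject": "Q3 numbers"}
        yield {"sender": "user-002", "subject": "Outage report"}


def mask(record: Record, fields: Iterable[str]) -> Record:
    """Replace sensitive fields with a short one-way hash."""
    masked = dict(record)
    for field in fields:
        if field in masked:
            digest = hashlib.sha256(masked[field].encode("utf-8")).hexdigest()
            masked[field] = digest[:12]
    return masked


def ingest(source: str, masked_fields: Iterable[str]) -> List[Record]:
    """Look up a connector by name, pull its records and mask them."""
    connector = REGISTRY[source]()
    fields = list(masked_fields)
    return [mask(rec, fields) for rec in connector.pull()]


rows = ingest("static", ["sender"])
```

The registry is what makes the design pluggable: the ingest path never names a concrete source, so a new connector can be added without touching it.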
20+Years
![Page 76: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/76.jpg)
Persistent Data Connectors
![Page 77: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/77.jpg)
Persistent’s Breadth of Big Data Capabilities
Horizontal and Vertical Pre-built Solutions
Big Data Platform (PeBAL) analytics libraries and Connectors
IT Management
Big Data Application Programming
Distributed File Systems
Cluster Layer
Tooling
• RDBMS/DWH to import/export data
• Text Analytics libraries
• Data Visualization using Web2.0 and reporting tools - Cognos, Microstrategy
• Ecosystem tools like - Nutch, Katta, Lucene
• Job configuration, management and monitoring with BigInsights’ job scheduler (MetaTracker)
• Job failure and recovery management
• Deep JAQL expertise - JAQL Programming, Extending JAQL using UDFs, Integration of third party tools/libraries, Performance tuning, ETL using JAQL
• Expertise in MR programming - PIG, Hive, Java MR
• Deep expertise in analytics - Text Analytics - IBM’s text extraction solution (AQL + SystemT)
• Statistical Analytics - R, SPSS, BigInsights Integration with R
• HDFS
• IBM GPFS
• Platform Setup on multi-node clusters, monitoring, VM based setup
• Product Deployment
Persistent IP for Big Data Solutions
Big Data Platform Components
![Page 78: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/78.jpg)
Persistent Roadmap to Big Data
1. Learn
2. Initiate
3. Scale
4. Measure
5. Manage
Discover and Define Use Cases
Improve Knowledge Base and Shared Big Data Platform
Upgrade to Production if Successful
Validate with a POC
Measure Effectiveness and Business Value
![Page 79: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/79.jpg)
Customer Analytics
Identifying your most influential customers?
Targeting influential customers is the best way to improve campaign ROI!
• Build a social graph of all customers
• Overlay sales data on the graph
• Identify influential customers using network analysis
• Target these customers for promotions
70 million customers
> 1 billion transactions over twenty years
Few thousand influential customers
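The steps on this slide can be sketched on a toy graph. The slide does not say which network-analysis method was used; the example below assumes a personalized PageRank in which the teleport vector is proportional to each customer's spend, which is one possible way to overlay sales data on the graph. The customer names, edges and sales figures are invented.

```python
from typing import Dict, List, Tuple


def influence_scores(
    edges: List[Tuple[str, str]],
    sales: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Personalized PageRank; an edge (follower, influencer) passes rank to the influencer."""
    nodes = sorted({n for edge in edges for n in edge} | set(sales))
    total = sum(sales.get(n, 0.0) for n in nodes)
    # Teleport vector: proportional to spend, uniform if no sales are known.
    teleport = {
        n: (sales.get(n, 0.0) / total if total > 0 else 1.0 / len(nodes)) for n in nodes
    }
    out_links: Dict[str, List[str]] = {n: [] for n in nodes}
    for follower, influencer in edges:
        out_links[follower].append(influencer)

    rank = dict(teleport)
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is redistributed via teleport.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping + damping * dangling) * teleport[n] for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring customers."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical data: three customers follow "asha"; "deepa" also follows "chetan".
edges = [("bina", "asha"), ("chetan", "asha"), ("deepa", "asha"), ("deepa", "chetan")]
sales = {"asha": 100.0, "bina": 500.0, "chetan": 300.0, "deepa": 200.0}
scores = influence_scores(edges, sales)
top = top_k(scores, 1)
```

With this choice, a score blends a customer's own spend with the rank received from connected customers. At the scale quoted on the slide the same iteration would run as a distributed job rather than in a single process.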
![Page 80: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/80.jpg)
Overview of Email Analytics
● Key Business Needs
– Ensure compliance with respect to a variety of business and IT communications and information sharing guidelines.
– Provide an ongoing analysis of customer sentiment through email communications.
● Use Cases
– Quickly identify if there has been an information breach or if the information is being shared in ways that are not in compliance with organizational guidelines.
– Identify if a particular customer is not being appropriately managed.
● Benefits
– Ability to proactively manage email analytics and communications across the organization in a cost-effective way.
– Reduce the response time to manage a breach and proactively address issues that emerge through ongoing analysis of email.
![Page 81: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/81.jpg)
Using Email to Analyze Customer Sentiment
Sense the mood of your customers through their emails
Carry out detailed analysis on customer team interactions and response times
![Page 82: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/82.jpg)
Analyzing Prescription Data
1.5 million patients are harmed by medication errors every year
Identifying erroneous prescriptions can save lives!
Source: Center for Medication Safety & Clinical Improvement
![Page 83: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/83.jpg)
Overview of IT Analytics
● Key Business Needs
– Troubleshooting issues in the world of advanced and cloud based systems is highly complex, requiring analysis of data from various systems.
– Information may be in different formats, locations, granularity, data stores.
– System outages have a negative impact on short-term revenue, as well as long-term credibility and reliability.
– The ability to quickly identify if a particular system is unstable and take corrective action is imperative.
● Use Cases
– Identify security threats and isolate the corresponding external factors quickly.
– Identify if an email server is unstable, determine the priority and take preventative action before a complete failure occurs.
● Benefits
– Reduced maintenance cost
– Higher reliability and SLA compliance
![Page 84: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/84.jpg)
Consumer Insight from Social Media
Find out what customers are saying about your organization or product on social media
![Page 85: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/85.jpg)
Insights for Satyamev Jayate – Variety of sources
1. Structured Analysis
Responses to Pledge, multiple choice questions
2. Unstructured Analysis
Responses to following questions
• Share your story
• Ask a question to Aamir
• Send a message of hope
• Share your solution
Content Filtering Rating Tagging System (CFRTS)
L0, L1, L2 phased analytics
3. Impact Analysis
Crawling general internet for measuring the before & after scenario on a particular topic
Web/TV Viewer
• Response to Pledge, multiple choice questions
• Web, emails, IVR/Calls
• Individual blogs
• Social widgets
• Videos
• …
IVR
SMS
Web, Social Media (Structured)
Social Media (unstructured)
![Page 86: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/86.jpg)
Rigorous Weekly Operation Cycle producing instant analytics
Killer combo of Human+Software to analyze the data efficiently
• Topic opens on Sunday
• Live Analytics report is sent during the show
• Data capture from SMS, phone calls, social media, website
• System runs L0 Analysis, L1, L2
• Analysts continue
• JSONs are created for the external and internal dashboards
• Featured content is delivered thrice a day throughout the week
• Episode Tags are refined and messages are re-ingested for another pass
![Page 87: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/87.jpg)
![Page 88: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/88.jpg)
Thank you
Anand Deshpande
http://in.linkedin.com/in/ananddeshpande
Persistent Systems Limited
www.persistentsys.com
![Page 89: From the Big Data keynote at InCSIghts 2012](https://reader033.fdocuments.in/reader033/viewer/2022061217/54b417404a79599e1f8b46b6/html5/thumbnails/89.jpg)
Enterprise Value is Shifting to Data
Chart: Line of Diminishing Value across Mainframe, Operating Systems, Database, ERP, Apps and Data; time axis 1975, 1985, 1995, 2006, 2013