Integrating Big Data Technologies
Transcript of Integrating Big Data Technologies
INTEGRATING BIG DATA Dataversity Webinar Feb 7 2012
©2012 Sixth Sense Advisors, Inc. All Rights Reserved 1
State of Data Today
A Growing Trend
Requirement    Expectations                   Reality
Speed          Speed of the Internet          Speed = Infra + Arch + Design
Accessibility  Accessibility of a Smartphone  BI Tool licenses & security
Usability      IPAD - Mobility                Web Enabled BI Tool
Availability   Google Search                  Data & Report Metadata
Delivery       Speed of questions             Methodology & Signoff
Data           Access to everything           Structured Data
Scalability    Cloud (Amazon)                 Existing Infrastructure
Cost           Cell phone or Free WIFI        Millions
Expectations for BI are changing without anyone telling us
The Wisdom of Crowds
Data Deluge = Business Insights
BIG Data
Structured: ERP, CRM, SCM
Unstructured: Content Management Systems, Email, Call Center, Documents, Contracts
Current / New
What’s so Big about Big Data
Velocity, Volume, Variety, Complexity, Ambiguity
So you are about to start the Big Data Project
Tools, Instructions, Data, Output
The Normal Way Results In ……..
Image Source: Web
Why Can Big Data Fail on the RDBMS?
New Data Types
New volume
New analytics
New workload
New metadata
Current Data Management Platform (RDBMS + ETL + BI)
• POOR Performance
• Failed Programs
Scalability; Sharding; ACID;
BIG Data
• Workload Demands
  • Process dynamic data content
  • Process unstructured data
  • Systems that can scale up and scale out with high volume data
  • Perform complex operations within reasonable response time
• Infrastructure Requirements
  • Scalable platform
  • Database independence
  • Fault tolerant architectures
  • Low cost of acquisition and storage
  • Supported by standard toolsets
Hadoop
Design Goals
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Shall Move to Data
• Simple Core, Modular and Extensible
Hadoop Differentiators
Schema-on-Write: RDBMS
• Schema must be created before data is loaded.
• An explicit load operation has to take place which transforms the data to the internal structure of the database.
• New columns must be added explicitly before data for such columns can be loaded into the database.
• Read is Fast.
• Standards/Governance.
Schema-on-Read: Hadoop
• Data is simply copied to the file store; no special transformation is needed.
• A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns.
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
• Load is Fast.
• Evolving Schemas/Agility.
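The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration, not Hadoop or Hive code: RAW_STORE, serde_v1, serde_v2 and read_table are hypothetical names, and JSON lines stand in for files copied into the file store. The raw lines are never rewritten; swapping in the updated deserializer makes the "referrer" field appear retroactively for records that were already stored.

```python
import json

# Raw records are stored exactly as they arrived; no schema is enforced on load.
RAW_STORE = [
    '{"user": "a1", "page": "/home"}',
    '{"user": "b2", "page": "/cart", "referrer": "search"}',
]

def serde_v1(line):
    """Hypothetical deserializer: extracts only the columns it knows about."""
    rec = json.loads(line)
    return {"user": rec.get("user"), "page": rec.get("page")}

def serde_v2(line):
    """Updated deserializer: also exposes 'referrer' (None when absent)."""
    rec = json.loads(line)
    return {
        "user": rec.get("user"),
        "page": rec.get("page"),
        "referrer": rec.get("referrer"),
    }

def read_table(store, serde):
    """Apply the schema at read time, one raw line at a time."""
    return [serde(line) for line in store]

rows_v1 = read_table(RAW_STORE, serde_v1)  # 'referrer' is invisible here
rows_v2 = read_table(RAW_STORE, serde_v2)  # same raw data, new column visible
```

In a schema-on-write system the equivalent change would need an explicit column addition and a reload before the new field could be queried.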
Hadoop Known Limitations
• Write-once model
• A namespace with an extremely large number of files exceeds Namenode's capacity to maintain
• Cannot be mounted by existing OS
  • Getting data in and out is tedious
  • Virtual File System can solve problem
• HDFS does not implement / support
  • User quotas
  • Access permissions
  • Hard or soft links
  • Data balancing schemes
• No periodic checkpoints
• Namenode is single point of failure
  • Automatic restart and failover to another machine not yet supported
Hadoop Tips
• Hadoop is useful
  • When you must process lots of unstructured data
  • When running batch jobs is acceptable
  • When you have access to lots of cheap hardware
• Hadoop is not useful
  • For intense calculations with little or no data
  • When your data is not self-contained
  • When you need interactive results
• Implementation
  • Think big, start small
  • Build on agile cycles
  • Focus on the data, as you will always develop schema on write.
• Available Optimizations
  • Input to Maps
  • Map only jobs
  • Combiner
  • Compression
  • Speculation
  • Fault Tolerance
  • Buffer Size
  • Parallelism (threads)
  • Partitioner
  • Reporter
  • DistributedCache
  • Task child environment settings
Hadoop Tips
• Performance Tuning
  • Increase the memory/buffer allocated to the tasks
  • Increase the number of tasks that can be run in parallel
  • Increase the number of threads that serve the map outputs
  • Disable unnecessary logging
  • Turn on speculation
  • Run reducers in one wave as they tend to get expensive
  • Tune the usage of DistributedCache, it can increase efficiency
• Troubleshooting
  • Are your partitions uniform?
  • Can you combine records at the map side?
  • Are maps reading off a DFS block worth of data?
  • Are you running a single reduce wave (unless the data size per reducer is too big)?
  • Have you tried compressing intermediate data & final data?
  • Are there buffer size issues?
  • Do you see unexplained "long tails"?
  • Are your CPU cores busy?
  • Is at least one system resource being loaded?
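Two of the troubleshooting questions, map-side combining and partition uniformity, can be illustrated with a small plain-Python sketch. It does not use the Hadoop API; map_with_combiner, partition, partition_sizes and skew are hypothetical helper names, and a crc32 hash stands in for whatever partitioner a real job would use.

```python
from collections import Counter
import zlib

def map_with_combiner(lines):
    """Map step with map-side combining: emit one (word, count) pair per
    distinct word in this split instead of one (word, 1) per occurrence."""
    combined = Counter()
    for line in lines:
        combined.update(line.split())
    return list(combined.items())

def partition(key, num_reducers):
    """Hash partitioner; crc32 keeps the assignment stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_sizes(pairs, num_reducers):
    """Count how many records each reducer would receive."""
    sizes = [0] * num_reducers
    for key, _ in pairs:
        sizes[partition(key, num_reducers)] += 1
    return sizes

def skew(sizes):
    """Largest partition divided by the mean; 1.0 means perfectly uniform."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0

split = ["big data big hadoop", "data data nosql"]
pairs = map_with_combiner(split)   # 4 combined pairs instead of 7 raw ones
sizes = partition_sizes(pairs, 2)
```

A skew value well above 1.0 suggests that one reducer receives far more records than the others, which is one common cause of the "long tails" mentioned above.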
NoSQL
• Stands for Not Only SQL
• Based on CAP Theorem
• Usually do not require a fixed table schema nor do they use the concept of joins
• All NoSQL offerings relax one or more of the ACID properties
• NoSQL databases come in a variety of flavors
  • XML (myXMLDB, Tamino, Sedna)
  • Wide Column (Cassandra, HBase, BigTable)
  • Key/Value (Redis, Memcached with BerkeleyDB)
  • Graph (neo4j, InfoGrid)
  • Document store (CouchDB, MongoDB)
NoSQL Footprint
[Figure: NoSQL footprint plotted by Size and Complexity. Categories: Key Value, Big Table, Doc Database, Graph. Examples shown: Amazon Dynamo, Google Big Table, Cassandra, Lotus Notes, HBase, Voldemort, Graph Theory.]
NoSQL
• Best Practices
  • Design for data collection
  • Plan the data store
  • Organize by type and semantics
  • Partition for performance
  • Access and Query is run time dependent
    • Horizontal scaling
    • Memory Caching
• Access and Query
  • RESTful interfaces (HTTP as an access API)
  • Query languages other than SQL
    • SPARQL - Query language for the Semantic Web
    • Gremlin - the graph traversal language
    • Sones Graph Query Language
  • Data Manipulation / Query API
    • The Google BigTable DataStore API
    • The Neo4j Traversal API
  • Serialization Formats
    • JSON
    • Thrift
    • ProtoBuffers
    • RDF
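"Partition for performance", horizontal scaling and JSON serialization can be shown together in a toy sketch. ShardedStore is a hypothetical in-memory class, not the API of any product named above; it only illustrates how a key hash decides which node holds a document and how the document is serialized on the way in and out.

```python
import json
import zlib

class ShardedStore:
    """Toy key/value document store spread over in-memory 'nodes'."""

    def __init__(self, num_nodes):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # Horizontal partitioning: a stable hash of the key picks the node.
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def put(self, key, document):
        # Documents are serialized to JSON text before being stored.
        self._node_for(key)[key] = json.dumps(document, sort_keys=True)

    def get(self, key):
        raw = self._node_for(key).get(key)
        return None if raw is None else json.loads(raw)

store = ShardedStore(num_nodes=3)
store.put("user:42", {"name": "Ada", "tags": ["bi", "hadoop"]})
doc = store.get("user:42")
```

Because lookups go straight to one node by key, there is no join across nodes, which matches the point that these stores usually avoid joins and fixed table schemas.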
Textual ETL Engine
Forest Rim Technology – Textual ETL Engine (TETLE) – is an integration tool for turning text into a structure of data that can be analyzed by standard analytical tools
• Textual ETL Engine provides a robust user interface to define rules (or patterns / keywords) to process unstructured or semi-structured data.
• The rules engine encapsulates all the complexity and lets the user define simple phrases and keywords
• Easy to implement and easy to realize ROI
• Advantages
  • Simple to use
  • No MR or coding required for text analysis and mining
  • Extensible by Taxonomy integration
  • Works on standard and new databases
  • Produces a highly columnar key-value store, ready for metadata integration
• Disadvantages
  • Not integrated with Hadoop as a rules interface
  • Currently uses Sqoop for metadata interchange with Hadoop or NoSQL interfaces
  • Current GA does not handle distributed processing outside Windows platform
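The general idea of keyword rules turning free text into key-value rows can be sketched as follows. This is not how the Textual ETL Engine is implemented; RULES, extract and the sample email text are all hypothetical and exist only to show the shape of the output such a rules-driven step might produce.

```python
import re

# Hypothetical rules: category -> keywords or phrases to look for.
RULES = {
    "complaint": ["refund", "not working", "cancel"],
    "praise": ["thank you", "great service"],
}

def extract(doc_id, text, rules):
    """Return (doc_id, category, keyword, offset) rows for every keyword hit,
    ordered by where the hit occurs in the text."""
    rows = []
    lowered = text.lower()
    for category, keywords in rules.items():
        for keyword in keywords:
            # Word boundaries avoid matching inside longer words.
            pattern = r"\b" + re.escape(keyword) + r"\b"
            for match in re.finditer(pattern, lowered):
                rows.append((doc_id, category, keyword, match.start()))
    return sorted(rows, key=lambda row: row[3])

text = "Thank you for the call. I still want a refund."
rows = extract("email-001", text, RULES)
```

Each row is a small structured record that could be loaded into a table and joined to other metadata by document id or category.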
Integration
• All RDBMS vendors today are supporting Hadoop or NoSQL as an integration or extension
  • Oracle Exalytics / Big Data Appliance
  • Teradata Aster Appliance
  • EMC Greenplum Appliance
  • IBM BigInsights
  • Microsoft Windows Azure Integration
• There are multiple providers of Hadoop distributions
  • Cloudera
  • Hortonworks
  • Zettaset
• Adapters from vendors to interface with Cloudera or Hortonworks distributions of Hadoop are available today. There are integration efforts to release Hadoop as an integral engine across the RDBMS vendor platforms
Conceptual Solution Architecture
[Figure: conceptual solution architecture. Labels: Metadata; Data Warehouse; Taxonomy; Big Data DW; Textual ETL; ETL, ELT, CDC; Reporting; Analytics; Search; OLAP; Text Mining; Content Analytics; Knowledge Analytics; MDM; Data Marts; OLTP; BIG Data (Content, Email, Docs); MR / Ruby / Java (Hadoop); And / Or]
Integration Tips
• The key to the castle in integrating Big Data is metadata
• Whatever the tool, technology and technique, if you do not know your metadata, your integration will fail
• Semantic technologies and architectures will be the way to process and integrate the Big Data, much akin to Web 2.0 models
• Data quality for Big Data is a very questionable goal. To get some semblance of quality, taxonomies and ontologies can be of help
• Third-party data providers also provide keywords, trending tags and scores; these can provide a lot of integration support
• Writing business rules for Big Data can be very cumbersome and not all programs can be written in MapReduce
Which Tool
Application Hadoop NoSQL Textual ETL
Machine Learning x x
Sentiments x x x
Text Processing x x x
Image Processing x x
Video Analytics x x
Log Parsing x x x
Collaborative Filtering x x x
Context Search x
Email & Content x
Success Stories
• Machine learning & Recommendation Engines – Amazon, Orbitz
• CRM - Consumer Analytics, Metrics, Social Network Analytics, Churn, Sentiment, Influencer, Proximity
• Finance – Fraud, Compliance
• Telco – CDR, Fraud
• Healthcare – Provider / Patient analytics, fraud, proactive care
• Lifesciences – clinical analytics, physician outreach
• Pharma – Pharmacovigilance, clinical trials
• Insurance – fraud, geo-spatial
• Manufacturing – warranty analytics, supplier quality metrics
Data Science
[Figure: Data Science, Art & Science. Labels: Data Analytics; Content; Customer; Product; Behaviors; Optimization; Big Data Processing & ETL; Applied Science; User Interest Prediction; Inventory Prediction; Machine Learning; Pattern Mining; Advanced Regression Analysis; Business Intelligence; Advanced Analytics]
Challenges
• Resources Availability
• MR is hard to implement
• Speech to text
  • Conversation context is often missing
  • Quality of recording
  • Accent issues
• Visual data tagging
  • Images
  • Text embedded within images
• Metadata is not available
• Data is not trusted
• Content management platform capabilities
• Ontologies Ambiguity
• Taxonomy Integration
Contact • Krish Krishnan [email protected]
Twitter: @datagenius