-
Building A Big Data Data Warehouse
Integrating Structured and Unstructured Data
DAMA IOWA October 2013
Krish Krishnan Founder Sixth Sense Advisors Inc
-
Discussion Focus
S Big data and the data warehouse: the new landscape
S Technology overview: Hadoop, NoSQL, Cassandra, BigQuery, Drill, Redshift, AWS (S3, EC2); programming with MapReduce; understanding analytical requirements, self-service discovery platforms
S The challenges of data processing: Workloads; data management; infrastructure limitations
S Next-generation data warehouse: Solution architectures; the three Ss: scalability, sustainability, and stability
@2013 Copyright Sixth Sense Advisors
-
A New Landscape
-
A Growing Trend
Requirement | Expectations | Reality
Speed | Speed of the Internet | Speed = Infra + Arch + Design
Accessibility | Accessibility of a smartphone | BI tool licenses & security
Usability | iPad - Mobility | Web-enabled BI tool
Availability | Google Search | Data & report metadata
Delivery | Speed of questions | Methodology & signoff
Data | Access to everything | Structured data
Scalability | Cloud (Amazon) | Existing infrastructure
Cost | Cell phone or free WiFi | Millions
Expectations for BI are changing without anyone telling us
-
State of Data Today
-
Data Growth Trends
Facebook has an average of 30 billion pieces of content added every month
YouTube receives 24 hours of video every minute
15 billion mobile phones are predicted to be in use in 2015
A leading retailer in the UK collects 1.5 billion pieces of information to adjust prices and promotions
Amazon.com: 30% of sales come from its recommendation engine
A Boeing jet engine produces 20 TB/hour for engineers to examine in real time to make improvements
The CERN Large Hadron Collider produces 15 PB of data for each cycle of execution.
-
Decision Support = #Fail?
S Decision support platforms of today are not satisfying the needs of the business user
S Decisions being driven in the organization are not based on 360 degree views of the organization and its performance
S Business transformations are not completely successful due to the lack of information presented in the Business Intelligence Architecture
S Analytics and key performance indicators are not available in a timely manner, and the data presented is not sufficient to make business decisions with confidence
-
State of the Data Warehouse
-
What We Have Built
-
Business Thinking
[Diagram: drivers of business thinking - new data (big data, social media, corporate data), increasing complexity, increased quality of service and agility, digital intelligence; customer-centric and cost-driven pressures (TCO, opportunity cost, competitive cost); digital, connected, mobile, metrics-driven, smarter consumers; global competition and cost.]
-
CIO Thinking
-
Flexibility
Reliability
Simplicity
Scalability
Modularity
Architects Thinking
-
Users Needs
All Data - Every Shape, Size and Format - Is Needed By The Users
-
Why The Database Alone Cannot Be The Platform - The Limitations of Databases
-
The Disappointment
S Distributed:
S Transactional Databases
S Data Warehouses
S Datamarts
S Analytical Databases
S CRM Databases
S SCM Databases
S ERP Databases
S Redundant
S Weak Metadata
S Weak Integration
-
Base Graph Courtesy Dr. Richard Hackathorn
Why The Data Warehouse Fails
Action time (or action distance)
Business value decays over time from the moment of the business situation:
S Data latency - until the data is ready
S Analysis latency - until the information is available
S Decision latency - until the decision is made
Lost Value = Sum (Latencies) + Opportunity Cost
-
Data Warehouse Computing Today
[Diagram: multiple transactional systems feed ODSs; data transformation loads the Enterprise Data Warehouse, which feeds datamarts & analytical databases that serve reports, dashboards, analytic models and other applications.]
-
The Bottom Line
S We have designed, architected and deployed systems built on architectures that were never intended for complex processing and compute requirements
S The real issue is that the architectures designed for the RDBMS platform differ widely in their abilities to handle diverse types of workloads
S In order to design and manage complex workloads, architects need to understand the underlying platform's capabilities in relation to the type of workload being designed
-
Shared Everything Architecture
S Resources are distributed and shared
S CPUs are shared across the databases
S Memory is shared across CPUs and databases
S Disk architecture is shared across CPUs
S The big disadvantage is that sharing resources limits scalability
S Adding resources does not increase scalability and performance linearly, only cost
-
Issues
S Shared Everything architecture cannot scale and handle workloads effectively
S You cannot achieve 100% linear scalability in a shared architecture environment
S Compute and store happen in disparate environments
S Infrastructure limitations create more latencies in the overall system
S Data governance is a complex subject area that adds to the weakness of the architecture
-
BIG Data Example
To: [email protected] Dear Mr. Collins, This email is in reference to my bank account which has been efficiently handled by your bank for more than five years. There has been no problem till date until last week the situation went out of hand. I have deposited one of my high amount cheques to my bank account no: 65656512 which was to be credited the same day but due to your staff's carelessness it wasn't done, and because of this negligence my reputation in the market has been tarnished. Furthermore I had issued one payment cheque to the party which was showing bounced due to insufficient balance just because my cheque didn't make it on time. My relationship with your bank has matured with time and it's a shame to tell you that this kind of service is not acceptable when it is a question of somebody's reputation. I hope you got my point and I am attaching a copy of the same for further rapid procedures and remit into my account in a day. Yours sincerely, Daniel Carter Ph: 564-009-2311
-
Big Data Example
S We will often imply additional information in spoken language by the way we place stress on words.
S The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.
S "I never said she stole my money" - Someone else said it, but I didn't.
S "I never said she stole my money" - I simply didn't ever say it.
S "I never said she stole my money" - I might have implied it in some way, but I never explicitly said it.
S "I never said she stole my money" - I said someone took it; I didn't say it was she.
S "I never said she stole my money" - I just said she probably borrowed it.
S "I never said she stole my money" - I said she stole someone else's money.
S "I never said she stole my money" - I said she stole something, but not my money.
S Depending on which word the speaker places the stress on, this sentence could have several distinct meanings.
Example Source: Wikipedia
-
The Normal Way Results In
-
Impact on Data Warehouse
New Data Types
New volume
New analytics
New workload
New metadata
POOR Performance
Failed Programs
Scalability; Sharding; ACID;
Why Can Big Data Fail?
-
ACID is Not Good All The Time
S Atomic All of the work in a transaction completes (commit) or none of it completes
S Consistent A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
S Isolated The results of any changes made during a transaction are not visible until the transaction has committed.
S Durable The results of a committed transaction survive failures
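The four properties above can be demonstrated with a minimal Python sketch using the standard library's sqlite3 module; the `accounts` table and `transfer` helper are illustrative assumptions, not part of the deck:

```python
import sqlite3

# In-memory database; the accounts table and transfer() helper are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 100)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Atomicity: both UPDATEs commit together, or neither is visible."""
    try:
        with conn:  # wraps the statements in one transaction
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # Consistency: enforce the "no negative balance" constraint
            row = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                               (src,)).fetchone()
            if row[0] < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the context manager already rolled back both UPDATEs

transfer(conn, "a", "b", 150)  # violates the constraint -> rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Because the constraint check raises inside the transaction, the rollback undoes both UPDATEs: a reader never observes the half-applied state (isolation), and only a committed transfer survives (durability).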
-
Where Do We Go?
[Diagram: Tools, Instructions, Data]
-
Next Generation Technologies
Integrating Big Data
-
Innovations
Category | New Frontiers
Infrastructure | Big Data and Data Warehouse appliances; in-memory technologies; SSD storage; fast networks; cloud; mobile technologies
Software | In-memory databases; Hadoop, Cassandra & NoSQL ecosystems; columnar DBMS; improved ETL-Hadoop integration (Informatica, Talend)
Algorithms | Mahout
Pre-Configured Architectures | IBM, Teradata, Kognitio, EMC, Cloudera, Hortonworks, Cirro, Intel, Cisco UCS, Pivotal, Oracle, MapR
-
BIG Data - Infrastructure Requirements
S Scalable platform
S Database independent
S Fault tolerant
S Low cost of acquisition
S Scalable and Reliable Storage
S Supported by standard toolsets
S Datacenter Ready
-
Big Data Workload Demands
S Process dynamic data content
S Process unstructured data
S Systems that can scale up with high volume data
S Systems that can scale out with high volume of users
S Perform complex operations within reasonable response time
-
Parallel databases
S Shared-nothing MPP architecture (a collection of independent machines, each with local hard disk and main memory, connected together on high-speed network)
S Machines are cheaper, lower-end, commodity hardware
S Scales well up to a point, tens of nodes
S Good performance
S Poor fault tolerance
S Problems with heterogeneous environment (machines must be equal in performance)
S Good support for flexible query interface
-
Data Warehouse Appliance
High Availability
Standard SQL Interface
Advanced Compression
MPP
Leverages existing BI, ETL and OLTP investments
Hadoop & MapReduce Interface / Embedded
Minimal disk I/O bottleneck; simultaneously load & query
Auto Database Management
A Data Warehouse (DW) Appliance is an integrated set of servers, storage, OS, database and interconnect specifically preconfigured and tuned for the rigors of data warehousing.
DW appliances offer an attractive price / performance value proposition and are frequently a fraction of the cost of traditional data warehouse solutions.
-
Hadoop Evolution
-
Hadoop
-
Why Hadoop
S Commodity HW
S Built on inexpensive servers
S Storage servers and their disks are not assumed to be highly reliable and available
S Modular expansion
S Metadata-data oriented design
S Namenode maintains metadata
S Datanodes manage data placement and storage
S Computation happens close to data
S Servers have dual goals: data storage and computation
S Single "store and compute" cluster vs. separate clusters
S File-system architecture
S Focus is mostly sequential access
S Single writers
S No file locking features
-
Hadoop Architecture
-
HDFS
S Hadoop Distributed File System
S A scalable, fault-tolerant, high-performance distributed file system
S Asynchronous replication
S Write-once, read-many (WORM)
S No RAID required
S Access from C, Java, Thrift
S NameNode holds filesystem metadata
S Files are broken up and spread over the DataNodes
-
HDFS Splits & Replication
S Data is organized into files and directories
S Files are divided into uniform sized blocks and distributed across cluster nodes
S Blocks are replicated to handle hardware failure
S Filesystem keeps checksums of data for corruption detection and recovery
S HDFS exposes block placement so that computation can be migrated to data
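The split-and-replicate idea above can be sketched in a few lines of Python; the block size, node names and rotation-based placement are toy assumptions (real HDFS uses 64 MB+ blocks and rack-aware placement):

```python
import hashlib

BLOCK_SIZE = 4   # HDFS defaults to 64 MB (or 128 MB); tiny here for illustration
REPLICATION = 3  # default replication factor

def split_blocks(data, block_size=BLOCK_SIZE):
    """Divide a file's bytes into uniform, fixed-size blocks (last may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_id, nodes, replication=REPLICATION):
    """Toy placement: choose `replication` distinct nodes, rotated by block id."""
    start = block_id % len(nodes)
    return [nodes[(start + k) % len(nodes)] for k in range(replication)]

def checksum(block):
    """The filesystem keeps a checksum per block for corruption detection."""
    return hashlib.md5(block).hexdigest()

nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_blocks(b"hello hdfs!!")
sums = [checksum(b) for b in blocks]
placement = {i: place_replicas(i, nodes) for i in range(len(blocks))}
```

Exposing `placement` is the point of the last bullet: a scheduler that knows which nodes hold a block can ship the computation to the data instead of the data to the computation.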
-
HDFS
S Data Node
S Stores data in HDFS
S Can be found in multiples
S Data is replicated across data nodes
S File size
S A typical block size is 64 MB (or even 128 MB)
S A file is chopped into 64 MB chunks and stored
S Name Node
S The Name Node is the heartbeat of an HDFS file system
S It keeps the directory of all files in the file system and tracks data distribution across the cluster
S It does not store the data of these files itself
S Cluster configuration management
S Transaction log management
S Features
S HDFS provides a Java API for applications to use
S Python access is also used in many applications
S A C language wrapper for the Java API is also available
S An HTTP browser can be used to browse the files of an HDFS instance
-
HDFS Features
S Data Correctness
S File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksum
S File access: the client retrieves the data and checksum from the DataNode; if validation fails, the client tries other replicas
S Data Pipeline
S The client retrieves a list of DataNodes on which to place replicas of a block
S The client writes the block to the first DataNode
S The first DataNode forwards the data to the next DataNode in the pipeline
S When all replicas are written, the client moves on to the next block in the file
S Block Placement
S First replica on a node in the local rack
S Second replica on a different rack
S Third replica on the same rack as the second replica
S Clients read from the nearest replica
S Heartbeats
S DataNodes send a heartbeat to the NameNode once every 3 seconds
S The NameNode uses heartbeats to detect DataNode failure
S Replication Engine
S Chooses new DataNodes for new replicas
S Balances disk usage
S Balances communication traffic to DataNodes
S Rebalancer
S Usually run when new DataNodes are added
S The cluster remains online while the Rebalancer is active
S Throttled to avoid network congestion
S Command line tool
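The per-chunk checksum validation described above can be sketched as follows; the 512-byte chunk size matches the slide, while CRC32 stands in for whatever checksum the real implementation uses:

```python
import zlib

CHUNK = 512  # the client computes one checksum per 512 bytes at file creation

def create_checksums(data):
    """File creation: a checksum per 512-byte chunk, stored alongside the data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def validate(data, checksums):
    """File access: recompute and compare; on a mismatch the client
    would fall back to another replica."""
    return create_checksums(data) == checksums

data = bytes(1500)              # 3 chunks: 512 + 512 + 476 bytes
sums = create_checksums(data)
corrupted = data[:600] + b"\x01" + data[601:]  # flip one byte inside chunk 1
```

A single flipped byte invalidates only the chunk it falls in, which is what lets the client re-fetch just that block from another replica.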
-
HBASE
S Clone of BigTable (Google)
S Implemented in Java (clients: Java, C++, Ruby...)
S Column-oriented data store
S Distributed over many servers
S Tolerant of machine failure
S Layered over HDFS
S Strong consistency
S It's not a relational database (no joins)
S Sparse data: nulls are stored for free
S Supports semi-structured and unstructured data
S Versioned data storage capability
S Extremely scalable: goal of billions of rows x millions of columns
S HBase provides storage for the Hadoop distributed computing environment.
S Data is logically organized into tables, rows and columns.
-
Hive
S Data summarization and ad-hoc query interface on top of Hadoop
S MapReduce for execution & HDFS for storage
S Hive Query Language
S Basic SQL: Select, From, Join, Group By
S Equi-Join, Multi-Table Insert, Multi-Group-By
S Batch query
S MetaStore
S Table/partition properties
S Thrift API: current clients in PHP (web interface), Python interface to Hive, Java (query engine and CLI)
S Metadata stored in any SQL backend
Image: Cloudera Hive Tutorial
-
HBase Hive Integration
HBase
Hive table definitions
Points to an existing table
Points to some column
Points to other columns, different names
-
Pig
S Pig is a platform for analyzing large data sets, built around a high-level language for expressing data analysis programs
S Pig generates and compiles Map/Reduce programs on the fly
S Abstracts you from specific details
S Focus on data processing
S Data flow
S Built for data manipulation
S Pig is workflow-driven and easy to maintain
-
S Sqoop is a tool designed to help users import existing relational databases into their Hadoop clusters
S Automatic data import: SQL to Hadoop
S Easy import of data from many databases into Hadoop
S Generates code for use in MapReduce applications
S Integrates with Hive
Sqoop
-
S All servers store a copy of the data
S A leader is elected at startup
S Followers service clients; all updates go through the leader
S Update responses are sent when a majority of servers have persisted the change
Zookeeper
-
AVRO
S A data serialization system that provides dynamic integration with scripting languages
S Avro Data
S Expressive
S Smaller and faster
S Dynamic
S Schema stored with data
S APIs permit reading and creating
S Includes a file format and a textual encoding
S Generates JSON metadata automatically
-
AVRO
S Avro RPC
S Leverages versioning support
S Provides cross-language access to Hadoop services
-
S A data collection system for managing large distributed systems
S Built on HDFS and MapReduce
S Toolkit for displaying, monitoring and analyzing the log files
Chukwa
-
Flume
S Flume is:
S A scalable, configurable, extensible and manageable distributed data collection service
S Developed as open source
S A one-stop solution for data collection of all formats
S Flexible reliability guarantees allow careful performance tuning
S Enables quick iteration on new collection strategies
-
Oozie
S Workflow engine in Hadoop: HTTP and command line interface + web console
S Used to:
S Execute and monitor workflows in Hadoop
S Periodically schedule workflows
S Trigger execution by data availability
-
Hadoop Differentiator
Schema-on-Write (RDBMS):
S Schema must be created before data is loaded.
S An explicit load operation has to take place which transforms the data to the internal structure of the database.
S New columns must be added explicitly before data for such columns can be loaded into the database.
S Read is fast.
S Standards/Governance.
Schema-on-Read (Hadoop):
S Data is simply copied to the file store; no special transformation is needed.
S A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns.
S New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
S Load is fast.
S Evolving schemas/Agility.
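The schema-on-read side can be sketched in a few lines of Python; the CSV store and the `read` helper are illustrative stand-ins for a file store and a SerDe. Data is stored as raw text, a parser projects columns at read time, and evolving the schema requires no reload:

```python
import csv
import io

# Load: raw text is simply copied into the store, no transformation (schema-on-read).
raw_store = "1,alice,2013-10-01\n2,bob,2013-10-02\n"

def read(store, columns, schema=("id", "name")):
    """SerDe-like reader: applies the schema and projects columns at read time."""
    rows = []
    for rec in csv.reader(io.StringIO(store)):
        full = dict(zip(schema, rec))
        rows.append(tuple(full.get(c) for c in columns))
    return rows

names = read(raw_store, ["name"])  # the original schema ignores the third field
# Evolving the schema at read time makes the timestamp appear retroactively:
dates = read(raw_store, ["name", "ts"], schema=("id", "name", "ts"))
```

Nothing was reloaded between the two reads; only the schema handed to the reader changed, which is exactly the agility claimed in the right-hand column.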
-
HadoopDB
S A recent study at Yale University, Database Research Department
S Hybrid architecture of parallel databases and MapReduce system
S The idea is to combine the best qualities of both technologies
S Multiple single-node databases are connected using Hadoop as the task coordinator and network communication layer
S Queries are distributed across the nodes by MapReduce framework, but as much work as possible is done in the database node
Slide Courtesy: Dr. Daniel Abadi
-
HadoopDB architecture
Slide Courtesy: Dr. Daniel Abadi
-
Hadoop Limitations
S Write-once model
S A namespace with an extremely large number of files exceeds the Namenode's capacity to maintain
S Cannot be mounted by an existing OS
S Getting data in and out is tedious
S A virtual file system can solve this problem
S HDFS does not implement / support:
S User quotas
S Access permissions
S Hard or soft links
S Data balancing schemes
S No periodic checkpoints
-
Hadoop Tips
S Hadoop is useful
S When you must process lots of unstructured data
S When running batch jobs is acceptable
S When you have access to lots of cheap hardware
S Hadoop is not useful
S For intense calculations with little or no data
S When your data is not self-contained
S When you need interactive results
S Implementation
S Think big, start small
S Build on agile cycles
S Focus on the data, as you will always develop schema on write
S Available Optimizations
S Input to Maps
S Map-only jobs
S Combiner
S Compression
S Speculation
S Fault tolerance
S Buffer size
S Parallelism (threads)
S Partitioner
S Reporter
S DistributedCache
S Task child environment settings
-
Hadoop Tips
S Performance Tuning
S Increase the memory/buffer allocated to the tasks
S Increase the number of tasks that can be run in parallel
S Increase the number of threads that serve the map outputs
S Disable unnecessary logging
S Turn on speculation
S Run reducers in one wave, as they tend to get expensive
S Tune the usage of DistributedCache; it can increase efficiency
S Troubleshooting
S Are your partitions uniform?
S Can you combine records at the map side?
S Are maps reading off a DFS block worth of data?
S Are you running a single reduce wave (unless the data size per reducer is too big)?
S Have you tried compressing intermediate and final data?
S Are there buffer size issues?
S Do you see unexplained long tails?
S Are your CPU cores busy?
S Is at least one system resource being loaded?
-
MapReduce
S Developed for processing large data sets.
S Contains Map and Reduce functions.
S Runs on a large cluster of machines.
S Goals S Use machines across the data center S Elastic scaling S Finite programming model
-
Input | Map() | Copy/Sort | Reduce() | Output
Map Phase
Raw data analyzed and converted to name/value pair
Shuffle Phase
All name/value pairs are sorted and grouped by their keys
Reduce Phase
All values associated with a key are processed for results
MapReduce
-
Programming model
S Input & Output: each a set of key/value pairs
S Programmer specifies two functions:
S map (in_key, in_value) -> list(out_key, intermediate_value)
S Processes an input key/value pair
S Produces a set of intermediate pairs
S reduce (out_key, list(intermediate_value)) -> list(out_value)
S Combines all intermediate values for a particular key
S Produces a set of merged output values (usually just one)
-
Example
S Page 1: DAMA Conference is good
S Page 2: There are good ideas presented at DAMA
S Page 3: I like DAMA because of its variety of topics.
-
Map output
S Worker 1: (DAMA 1), (Conference 1), (is 1), (good 1)
S Worker 2: (There 1), (are 1), (good 1), (ideas 1), (presented 1), (at 1), (DAMA 1)
S Worker 3: (I 1), (like 1), (DAMA 1), (because 1), (of 1), (its 1), (variety 1), (of 1), (topics 1)
-
Reduce Input
S Worker 1: (DAMA 1), (DAMA 1), (DAMA 1)
S Worker 2: (is 1)
S Worker 3: (good 1), (good 1)
S Worker 4: (There 1)
S Worker 5: (ideas 1)
S Worker 6: (presented 1)
S Worker 7: (I 1)
S Worker 8: (like 1)
S Worker 9: (its 1)
S Worker 10: (variety 1)
S Worker 11: (topics 1)
-
Reduce Output
S Worker 1: (DAMA 3)
S Worker 2: (is 1)
S Worker 3: (good 2)
S Worker 4: (There 1)
S Worker 5: (ideas 1)
S Worker 6: (presented 1)
S Worker 7: (I 1)
S Worker 8: (like 1)
S Worker 9: (its 1)
S Worker 10: (variety 1)
S Worker 11: (topics 1)
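The worked example above maps onto a few lines of Python. This sketch uses simple whitespace tokenization (so punctuation sticks to words) and abstracts the worker assignment away; the map, shuffle and reduce phases are the ones named on the earlier slide:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(page):
    # Map phase: raw text -> (word, 1) name/value pairs
    return [(word, 1) for word in page.split()]

def reduce_fn(key, values):
    # Reduce phase: all values associated with a key are combined
    return (key, sum(values))

pages = [
    "DAMA Conference is good",
    "There are good ideas presented at DAMA",
    "I like DAMA because of its variety of topics.",
]

# Shuffle phase: all intermediate pairs are sorted and grouped by key
intermediate = sorted(pair for page in pages for pair in map_fn(page))
counts = dict(
    reduce_fn(key, [v for _, v in group])
    for key, group in groupby(intermediate, key=itemgetter(0))
)
```

Running it reproduces the reduce output above, e.g. a count of 3 for "DAMA" and 2 for "good"; in a real cluster the sorted groups would be partitioned across reduce workers instead of folded into one dictionary.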
-
MapReduce Strengths
S Tunable
S Fine-grained Map and Reduce tasks
S Improved load balancing
S Faster recovery from failed tasks
S Good fault tolerance
S Can scale to thousands of nodes
S Supports heterogeneous environments
S Automatic re-execution on failure
S Localized execution
S With large data, eliminates bandwidth problems by scheduling execution close to the location of the data when possible
S MapReduce + HDFS is a very effective solution for scaling in a distributed geographical environment
-
NoSQL
S Stands for Not Only SQL
S Based on CAP Theorem
S Usually do not require a fixed table schema nor do they use the concept of joins
S All NoSQL offerings relax one or more of the ACID properties
S NoSQL databases come in a variety of flavors
S XML (myXMLDB, Tamino, Sedna)
S Wide Column (Cassandra, HBase, BigTable)
S Key/Value (Redis, Memcached with BerkeleyDB)
S Graph (Neo4j, InfoGrid)
S Document store (CouchDB, MongoDB)
-
NoSQL
[Chart: NoSQL systems plotted by data size vs. complexity - Amazon Dynamo, Google BigTable, Cassandra, HBase, Voldemort, Lotus Notes, up through graph databases (graph theory).]
-
Approaches to CAP
S Eric Brewer stated in 2000 at PODC that you have to give up one of the following in a distributed system:
S Consistency of data
S Availability
S Partition tolerance
S BASE
S No ACID; use a single version of the DB, reconcile later
S Defer transaction commit until partitions are fixed and replication can run
S Eventual consistency (e.g., Amazon Dynamo)
S Eventually, all copies of an object converge
S Restrict transactions (e.g., sharded MySQL)
S 1-machine transactions: objects in a transaction are on the same machine
S 1-object transactions: a transaction can only read/write one object
S Object timelines (PNUTS)
-
Consistency Model
S If copies are asynchronously updated, what can we say about stale copies?
S ACID guarantees require synchronous updates
S Eventual consistency: copies can drift apart, but will eventually converge if the system is allowed to quiesce
S To what value will copies converge?
S Do systems ever quiesce?
S Is there any middle ground?
-
Consistency Techniques
S Per-record mastering
S Each record is assigned a master region
S May differ between records
S Updates to the record are forwarded to the master region
S Ensures consistent ordering of updates
S Tablet-level mastering
S Each tablet is assigned a master region
S Inserts and deletes of records are forwarded to the master region
S The master region decides tablet splits
S These details are hidden from the application
S Except for the latency impact!
-
HBASE
-
Architecture
[Diagram: clients (including a Java client and a REST API) connect to the HBaseMaster, which coordinates multiple HRegionServers, each with its own disk.]
-
HRegion Server
S Records partitioned by column family into HStores
S Each HStore contains many MapFiles
S All writes to an HStore are applied to a single memcache
S Reads consult the MapFiles and the memcache
S Memcaches are flushed as MapFiles (HDFS files) when full
S Compactions limit the number of MapFiles
[Diagram: within an HRegionServer, writes go to the memcache of an HStore; the memcache is flushed to MapFiles on disk; reads consult both.]
-
Pros and Cons
S Pros
S Log-based storage for high write throughput
S Elastic scaling
S Easy load balancing
S Column storage for OLAP workloads
S Cons
S Writes not immediately persisted to disk
S Reads cross multiple disk and memory locations
S No geo-replication
S Latency/bottleneck of HBaseMaster when using REST
-
CASSANDRA
-
Architecture
S Facebook's storage system
S BigTable data model
S Dynamo partitioning and consistency model
S Peer-to-peer architecture
[Diagram: clients connect to a ring of peer Cassandra nodes, each with its own disk.]
-
Routing
S Consistent hashing, like Dynamo or Chord
S Server position = hash(server id)
S Content position = hash(content id)
S A server is responsible for all content in a hash interval
[Diagram: each server owns the hash interval that precedes its position on the ring.]
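The hash-ring routing described above can be sketched in Python; MD5 and the node names are illustrative choices, not from the deck:

```python
import hashlib
from bisect import bisect_right

def h(key):
    """Position on the ring = hash of the id (MD5 here, for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        # Sorted (position, server) pairs form the hash ring
        self.ring = sorted((h(s), s) for s in servers)
        self.positions = [pos for pos, _ in self.ring]

    def owner(self, content_id):
        """The server whose position is clockwise-next from the content's
        hash is responsible for it (wrapping around the ring)."""
        i = bisect_right(self.positions, h(content_id)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
```

A useful property of this scheme: when a server is added, any given key either keeps its previous owner or moves to the new server, so only the content in one hash interval is remapped.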
-
Cassandra Server
S Writes go to a commit log and an in-memory table (memtable)
S Periodically the memory table is merged with the disk table
[Diagram: an update is appended to the log and applied to the memtable in RAM; later the memtable is flushed to an SSTable file on disk.]
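The write path above can be sketched as a toy Python class; the memtable limit, key/value API and merge-on-read are simplifications of the real engine:

```python
class Node:
    """Toy Cassandra-style write path: append to a commit log, update an
    in-memory table, and flush the memtable to an immutable SSTable
    when it grows too large."""

    def __init__(self, memtable_limit=2):
        self.log = []        # sequential commit log (used for recovery)
        self.memtable = {}   # in-RAM table of recent writes
        self.sstables = []   # immutable on-disk tables
        self.limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # Periodically the memory table is merged with the disk tables
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:       # newest data lives in RAM
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest SSTable first
            if key in table:
                return table[key]
        return None

node = Node()
node.write("k1", "v1")
node.write("k2", "v2")   # the memtable hits its limit and is flushed
node.write("k1", "v9")   # the newer value lives only in the memtable
```

Reads consult the memtable before the SSTables, so the most recent write wins even though an older value still sits on disk; real compaction would eventually merge the tables.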
-
Pros and Cons
S Pros
S Elastic scalability
S Easy management: peer-to-peer configuration
S BigTable model is nice: flexible schema, column groups for partitioning, versioning, etc.
S Eventual consistency is scalable
S Cons
S Eventual consistency is hard to program against
S No built-in support for geo-replication
S Load balancing?
S System complexity: P2P systems are complex and have complex corner cases
-
Cassandra Tips
S Tunable memtable size
S Can have a large memtable flushed less frequently, or a small memtable flushed frequently
S The tradeoff is throughput versus recovery time
S A larger memtable will require fewer flushes, but will take a long time to recover after a failure
S With a 1 GB memtable: 45 minutes to 1 hour to restart
S Can turn off log flushing
S Risks loss of durability
S Replication is still synchronous with the write
S Durable if updates are propagated to other servers that don't fail
-
NoSQL
S Best Practices
S Design for data collection
S Plan the data store
S Organize by type and semantics
S Partition for performance
S Access and query are runtime dependent
S Horizontal scaling
S Memory caching
S Access and Query
S RESTful interfaces (HTTP as an access API)
S Query languages other than SQL
S SPARQL - query language for the Semantic Web
S Gremlin - the graph traversal language
S Sones Graph Query Language
S Data Manipulation / Query API
S The Google BigTable DataStore API
S The Neo4j Traversal API
S Serialization Formats
S JSON
S Thrift
S ProtoBuffers
S RDF
-
Textual ETL Engine
S Forest Rim Technology's Textual ETL Engine (TETLE) is an integration tool for turning text into a structure of data that can be analyzed by standard analytical tools
S Textual ETL Engine provides a robust user interface to define rules (patterns / keywords) to process unstructured or semi-structured data
S The rules engine encapsulates all the complexity and lets the user define simple phrases and keywords
S Easy to implement and easy to realize ROI
S Advantages
S Simple to use
S No MR or coding required for text analysis and mining
S Extensible by taxonomy integration
S Works on standard and new databases
S Produces a highly columnar key-value store, ready for metadata integration
S Disadvantages
S Not integrated with Hadoop as a rules interface
S Currently uses Sqoop for metadata interchange with Hadoop or NoSQL interfaces
S The current GA does not handle distributed processing outside the Windows platform
-
Amazon RedShift
S The industry's first large-scale Data Warehouse as a Service
S Designed and architected for petabyte-scale deployment
S Goal 1 - Reduce I/O
S Direct-attached storage
S Large data block sizes
S Columnar storage
S Goal 2 - Optimize hardware
S Optimized for I/O-intensive workloads
S High disk density
S Runs on a fast network - HPC
S Goal 3 - Extreme parallelism: increased speed and efficiency of
S Loading
S Querying
S Backup
S Restore
-
RedShift Architecture
[Diagram: SQL clients / BI tools connect to a leader node, which coordinates the compute nodes. Picture: Amazon presentation on RedShift.]
-
Deployment Options
S Can be hosted with RDBMS on-site and RedShift on the Cloud
-
Deployment Options
S Can be used as Live Archive on the Cloud
-
Deployment Options
S Can be used as ETL for Big Data on the Cloud
-
Big Data Technologies
S Apache Software Foundation
S Hadoop
S HBase
S Zookeeper
S Oozie
S Avro
S Pig
S Sqoop
S Flume
S Cassandra
S Cloudera
S Hortonworks
S MongoDB
S IBM BigInsights
S EMC Pivotal
S Teradata Aster Big Data Appliance
S Oracle Big Data Appliance
S Intel Hadoop Distribution
S MapR
S Datastax
S Rainstor
S QueryIO
-
Workloads, Architectures, Computing
-
Workload
S Defined as the usage of resources (CPU, disk and memory) by every query: ETL, ELT, BI and analytics
S Often misunderstood as a Database capability
S Mostly touted by vendors as a differentiator for their platform
-
Workload
S Loading S Continuous (near real-time) S Batch S Micro Batch
S Queries S Tactical S AdHoc S Analytical S Dashboard
MIXED Workload
-
What Are You Trying to Do?
Data Workloads
S OLTP (random access to a few records)
S OLAP (scan access to a large number of records)
S Combined (some OLTP and OLAP tasks)
S Dimensions: read-heavy vs. write-heavy; by rows vs. by columns; unstructured
-
Data Engineering vs. Analysis/Warehousing
S Very different workloads and requirements
S Warehoused data for analysis includes:
S Data from serving systems
S Click log streams
S Syndicated feeds
S Trend towards scalable stores with:
S Semi-structured data
S Map-reduce
S The result of analysis is stored in the Data Warehouse
-
Workload Isolation
S Assigning the appropriate systems and processes to manage workloads
S Creates an interchangeable infrastructure
S Provides for better scalability
S Will create a heterogeneous configuration, can be deployed on a homogenized platform if desired
-
Workload Isolation
Semi-Structured Data
-
Workload Isolation
Semi-Structured Data
-
Workload Isolation
Semi-Structured Data
-
Metadata
S The key to the castle in integrating Big Data is metadata
S Whatever the tool, technology and technique, if you do not know your metadata, your integration will fail
S Semantic technologies and architectures will be the way to process and integrate the Big Data.
S Business domain experts can identify large data patterns by association relationships with small metadata.
-
The Big Data - Data Warehouse
-
Multi-Tiered Workload
Application | Unstructured Data (File Based) | Semi-Structured Data (File / Digital) | Structured Data (Digital)
Social Analytics, Behavior Analytics, Recommendation Engines, Sentiment Analytics, Fraud Detection | Hadoop / NoSQL | Hadoop / NoSQL | RDBMS
CRM, SalesForce, Marketing | - | - | RDBMS
Data Mining | Hadoop / NoSQL | Hadoop / NoSQL | RDBMS
System Characteristics | Volume: Large; Concurrency: Low; Consolidation: App Specific; Availability: High; Updated: Near Real Time to Monthly | Volume: Large; Concurrency: Medium; Consolidation/Integration: Variable; Availability: Medium; Updated: Near Real Time | Volume: Large; Concurrency: High; Consolidation/Integration: High; Availability: High; Updated: Intra-Day & Daily
-
Reference Architecture
-
Which Tool
Application | Hadoop / NoSQL / Textual ETL
Machine Learning | x x
Sentiments | x x x
Text Processing | x x x
Image Processing | x x
Video Analytics | x x
Log Parsing | x x x
Collaborative Filtering | x x x
Context Search | x
Email & Content | x
-
Challenges
S Resource availability
S MR is hard to implement
S Speech to text
S Conversation context is often missing
S Quality of recording
S Accent issues
S Visual data tagging
S Images
S Text embedded within images
S Metadata is not available
S Data is not trusted
S Content management platform capabilities
S Ontology ambiguity
S Taxonomy integration
-
Thank You
Krish Krishnan [email protected] Twitter Handle: @datagenius