-
Building A Big Data Data Warehouse
Integrating Structured and Unstructured Data
DAMA IOWA October 2013
Krish Krishnan Founder Sixth Sense Advisors Inc
-
Discussion Focus
S Big data and the data warehouse: the new landscape
S Technology overview: Hadoop, NoSQL, Cassandra, BigQuery, Drill, Redshift, AWS (S3, EC2); programming with MapReduce; understanding analytical requirements, self-service discovery platforms
S The challenges of data processing: Workloads; data management; infrastructure limitations
S Next-generation data warehouse: Solution architectures; the three Ss: scalability, sustainability, and stability
@2013 Copyright Sixth Sense Advisors
-
A New Landscape
-
A Growing Trend
Requirement | Expectations | Reality
Speed | Speed of the Internet | Speed = Infra + Arch + Design
Accessibility | Accessibility of a smartphone | BI tool licenses & security
Usability | iPad - Mobility | Web-enabled BI tool
Availability | Google Search | Data & report metadata
Delivery | Speed of questions | Methodology & signoff
Data | Access to everything | Structured data
Scalability | Cloud (Amazon) | Existing infrastructure
Cost | Cell phone or free WiFi | Millions
Expectations for BI are changing without anyone telling us
-
State of Data Today
-
Data Growth Trends
Facebook has an average of 30 billion pieces of content added every month
YouTube receives 24 hours of video every minute
15 billion mobile phones are predicted to be in use in 2015
A leading retailer in the UK collects 1.5 billion pieces of information to adjust prices and promotions
Amazon.com: 30% of sales come from its recommendation engine
A Boeing jet engine produces 20 TB/hour for engineers to examine in real time to make improvements
The CERN Large Hadron Collider produces 15 PB of data for each cycle of execution.
-
Decision Support = #Fail?
S Decision support platforms of today are not satisfying the needs of the business user
S Decisions being driven in the organization are not based on 360 degree views of the organization and its performance
S Business transformations are not completely successful due to the lack of information presented in the Business Intelligence Architecture
S Analytics and key performance indicators are not available in a timely manner, and the data presented is not sufficient to make business decisions with confidence
-
State of the Data Warehouse
-
What We Have Built
-
Business Thinking
[Diagram: drivers of business thinking - new data (big data, social media, corporate data), increasing complexity, increased quality of service and agility, digital intelligence; customer-centric and cost-driven pressures (TCO, opportunity cost, competitive cost); digital, connected, mobile, metrics-driven, smarter consumers; global competition and cost.]
-
CIO Thinking
-
Flexibility
Reliability
Simplicity
Scalability
Modularity
Architects Thinking
-
Users Needs
All Data - Every Shape, Size and Format - Is Needed By The Users
-
Why The Database Alone Cannot Be The Platform - The Limitations of Databases
-
The Disappointment
S Distributed:
S Transactional Databases
S Data Warehouses
S Datamarts
S Analytical Databases
S CRM Databases
S SCM Databases
S ERP Databases
S Redundant
S Weak Metadata
S Weak Integration
-
Base Graph Courtesy Dr. Richard Hackathorn
Why The Data Warehouse Fails
Action time (or action distance)
Business value decays over time from the moment of the business situation:
S Data latency - until the data is ready
S Analysis latency - until the information is available
S Decision latency - until the decision is made
Lost Value = Sum (Latencies) + Opportunity Cost
-
Data Warehouse Computing Today
[Diagram: multiple transactional systems feed ODSs; data transformation loads the Enterprise Data Warehouse, which feeds datamarts & analytical databases that serve reports, dashboards, analytic models and other applications.]
-
The Bottom Line
S We have designed, architected and deployed systems built on architectures that were never intended for complex processing and compute requirements
S The real issue is that the architectures designed for the RDBMS platform differ widely in their abilities to handle diverse types of workloads
S In order to design and manage complex workloads, architects need to understand the underlying platform's capabilities in relation to the type of workload being designed
-
Shared Everything Architecture
S Resources are distributed and shared
S CPUs are shared across the databases
S Memory is shared across CPUs and databases
S Disk architecture is shared across CPUs
S The big disadvantage is that sharing resources limits scalability
S Adding resources does not increase scalability and performance linearly, only cost
-
Issues
S Shared Everything architecture cannot scale and handle workloads effectively
S You cannot achieve 100% linear scalability in a shared architecture environment
S Compute and store happen in disparate environments
S Infrastructure limitations create more latencies in the overall system
S Data governance is a complex subject area that adds to the weakness of the architecture
-
BIG Data Example
To: [email protected] Dear Mr. Collins, This email is in reference to my bank account which has been efficiently handled by your bank for more than five years. There has been no problem till date until last week the situation went out of hand. I have deposited one of my high amount cheques to my bank account no: 65656512 which was to be credited the same day but due to your staff's carelessness it wasn't done, and because of this negligence my reputation in the market has been tarnished. Furthermore I had issued one payment cheque to the party which was showing bounced due to insufficient balance just because my cheque didn't make it on time. My relationship with your bank has matured with time and it's a shame to tell you that this kind of service is not acceptable when it is a question of somebody's reputation. I hope you got my point and I am attaching a copy of the same for further rapid procedures and remit into my account in a day. Yours sincerely, Daniel Carter Ph: 564-009-2311
-
Big Data Example
S We will often imply additional information in spoken language by the way we place stress on words.
S The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.
S "I never said she stole my money" - Someone else said it, but I didn't.
S "I never said she stole my money" - I simply didn't ever say it.
S "I never said she stole my money" - I might have implied it in some way, but I never explicitly said it.
S "I never said she stole my money" - I said someone took it; I didn't say it was she.
S "I never said she stole my money" - I just said she probably borrowed it.
S "I never said she stole my money" - I said she stole someone else's money.
S "I never said she stole my money" - I said she stole something, but not my money.
S Depending on which word the speaker places the stress on, this sentence could have several distinct meanings.
Example Source: Wikipedia
-
The Normal Way Results In
-
Impact on Data Warehouse
New Data Types
New volume
New analytics
New workload
New metadata
POOR Performance
Failed Programs
Scalability; Sharding; ACID;
Why Can Big Data Fail?
-
ACID is Not Good All The Time
S Atomic All of the work in a transaction completes (commit) or none of it completes
S Consistent A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
S Isolated The results of any changes made during a transaction are not visible until the transaction has committed.
S Durable The results of a committed transaction survive failures
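The four properties above can be demonstrated with a minimal Python sketch using the standard library's sqlite3 module; the `accounts` table and `transfer` helper are illustrative assumptions, not part of the deck:

```python
import sqlite3

# In-memory database; the accounts table and transfer() helper are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 100)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Atomicity: both UPDATEs commit together, or neither is visible."""
    try:
        with conn:  # wraps the statements in one transaction
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # Consistency: enforce the "no negative balance" constraint
            row = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                               (src,)).fetchone()
            if row[0] < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the context manager already rolled back both UPDATEs

transfer(conn, "a", "b", 150)  # violates the constraint -> rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Because the constraint check raises inside the transaction, the rollback undoes both UPDATEs: a reader never observes the half-applied state (isolation), and only a committed transfer survives (durability).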
-
Where Do We Go?
[Diagram: Tools, Instructions, Data]
-
Next Generation Technologies
Integrating Big Data
-
Innovations
Category | New Frontiers
Infrastructure | Big Data and Data Warehouse appliances; in-memory technologies; SSD storage; fast networks; cloud; mobile technologies
Software | In-memory databases; Hadoop, Cassandra & NoSQL ecosystems; columnar DBMS; improved ETL-Hadoop integration (Informatica, Talend)
Algorithms | Mahout
Pre-Configured Architectures | IBM, Teradata, Kognitio, EMC, Cloudera, Hortonworks, Cirro, Intel, Cisco UCS, Pivotal, Oracle, MapR
-
BIG Data - Infrastructure Requirements
S Scalable platform
S Database independent
S Fault tolerant
S Low cost of acquisition
S Scalable and Reliable Storage
S Supported by standard toolsets
S Datacenter Ready
-
Big Data Workload Demands
S Process dynamic data content
S Process unstructured data
S Systems that can scale up with high volume data
S Systems that can scale out with high volume of users
S Perform complex operations within reasonable response time
-
Parallel databases
S Shared-nothing MPP architecture (a collection of independent machines, each with local hard disk and main memory, connected together on high-speed network)
S Machines are cheaper, lower-end, commodity hardware
S Scales well up to a point, tens of nodes
S Good performance
S Poor fault tolerance
S Problems with heterogeneous environment (machines must be equal in performance)
S Good support for flexible query interface
-
Data Warehouse Appliance
High Availability
Standard SQL Interface
Advanced Compression
MPP
Leverages existing BI, ETL and OLTP investments
Hadoop & MapReduce Interface / Embedded
Minimal disk I/O bottleneck; simultaneously load & query
Auto Database Management
A Data Warehouse (DW) Appliance is an integrated set of servers, storage, OS, database and interconnect specifically preconfigured and tuned for the rigors of data warehousing.
DW appliances offer an attractive price / performance value proposition and are frequently a fraction of the cost of traditional data warehouse solutions.
-
Hadoop Evolution
-
Hadoop
-
Why Hadoop
S Commodity HW
S Built on inexpensive servers
S Storage servers and their disks are not assumed to be highly reliable and available
S Modular expansion
S Metadata-data oriented design
S Namenode maintains metadata
S Datanodes manage data placement and storage
S Computation happens close to data
S Servers have dual goals: data storage and computation
S Single "store and compute" cluster vs. separate clusters
S File-system architecture
S Focus is mostly sequential access
S Single writers
S No file locking features
-
Hadoop Architecture
-
HDFS
S Hadoop Distributed File System
S A scalable, fault-tolerant, high-performance distributed file system
S Asynchronous replication
S Write-once, read-many (WORM)
S No RAID required
S Access from C, Java, Thrift
S NameNode holds filesystem metadata
S Files are broken up and spread over the DataNodes
-
HDFS Splits & Replication
S Data is organized into files and directories
S Files are divided into uniform sized blocks and distributed across cluster nodes
S Blocks are replicated to handle hardware failure
S Filesystem keeps checksums of data for corruption detection and recovery
S HDFS exposes block placement so that computation can be migrated to data
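The split-and-replicate idea above can be sketched in a few lines of Python; the block size, node names and rotation-based placement are toy assumptions (real HDFS uses 64 MB+ blocks and rack-aware placement):

```python
import hashlib

BLOCK_SIZE = 4   # HDFS defaults to 64 MB (or 128 MB); tiny here for illustration
REPLICATION = 3  # default replication factor

def split_blocks(data, block_size=BLOCK_SIZE):
    """Divide a file's bytes into uniform, fixed-size blocks (last may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_id, nodes, replication=REPLICATION):
    """Toy placement: choose `replication` distinct nodes, rotated by block id."""
    start = block_id % len(nodes)
    return [nodes[(start + k) % len(nodes)] for k in range(replication)]

def checksum(block):
    """The filesystem keeps a checksum per block for corruption detection."""
    return hashlib.md5(block).hexdigest()

nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_blocks(b"hello hdfs!!")
sums = [checksum(b) for b in blocks]
placement = {i: place_replicas(i, nodes) for i in range(len(blocks))}
```

Exposing `placement` is the point of the last bullet: a scheduler that knows which nodes hold a block can ship the computation to the data instead of the data to the computation.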
-
HDFS
S Data Node
S Stores data in HDFS
S Can be found in multiples
S Data is replicated across data nodes
S File size
S A typical block size is 64 MB (or even 128 MB)
S A file is chopped into 64 MB chunks and stored
S Name Node
S The Name Node is the heartbeat of an HDFS file system
S It keeps the directory of all files in the file system and tracks data distribution across the cluster
S It does not store the data of these files itself
S Cluster configuration management
S Transaction log management
S Features
S HDFS provides a Java API for applications to use
S Python access is also used in many applications
S A C language wrapper for the Java API is also available
S An HTTP browser can be used to browse the files of an HDFS instance
-
HDFS Features
S Data Correctness
S File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksum
S File access: the client retrieves the data and checksum from the DataNode; if validation fails, the client tries other replicas
S Data Pipeline
S The client retrieves a list of DataNodes on which to place replicas of a block
S The client writes the block to the first DataNode
S The first DataNode forwards the data to the next DataNode in the pipeline
S When all replicas are written, the client moves on to the next block in the file
S Block Placement
S First replica on a node in the local rack
S Second replica on a different rack
S Third replica on the same rack as the second replica
S Clients read from the nearest replica
S Heartbeats
S DataNodes send a heartbeat to the NameNode once every 3 seconds
S The NameNode uses heartbeats to detect DataNode failure
S Replication Engine
S Chooses new DataNodes for new replicas
S Balances disk usage
S Balances communication traffic to DataNodes
S Rebalancer
S Usually run when new DataNodes are added
S The cluster remains online while the Rebalancer is active
S Throttled to avoid network congestion
S Command line tool
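The per-chunk checksum validation described above can be sketched as follows; the 512-byte chunk size matches the slide, while CRC32 stands in for whatever checksum the real implementation uses:

```python
import zlib

CHUNK = 512  # the client computes one checksum per 512 bytes at file creation

def create_checksums(data):
    """File creation: a checksum per 512-byte chunk, stored alongside the data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def validate(data, checksums):
    """File access: recompute and compare; on a mismatch the client
    would fall back to another replica."""
    return create_checksums(data) == checksums

data = bytes(1500)              # 3 chunks: 512 + 512 + 476 bytes
sums = create_checksums(data)
corrupted = data[:600] + b"\x01" + data[601:]  # flip one byte inside chunk 1
```

A single flipped byte invalidates only the chunk it falls in, which is what lets the client re-fetch just that block from another replica.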
-
HBASE
S Clone of BigTable (Google)
S Implemented in Java (clients: Java, C++, Ruby...)
S Column-oriented data store
S Distributed over many servers
S Tolerant of machine failure
S Layered over HDFS
S Strong consistency
S It's not a relational database (no joins)
S Sparse data: nulls are stored for free
S Supports semi-structured and unstructured data
S Versioned data storage capability
S Extremely scalable: goal of billions of rows x millions of columns
S HBase provides storage for the Hadoop distributed computing environment.
S Data is logically organized into tables, rows and columns.
-
Hive
S Data summarization and ad-hoc query interface on top of Hadoop
S MapReduce for execution & HDFS for storage
S Hive Query Language
S Basic SQL: Select, From, Join, Group By
S Equi-Join, Multi-Table Insert, Multi-Group-By
S Batch query
S MetaStore
S Table/partition properties
S Thrift API: current clients in PHP (web interface), Python interface to Hive, Java (query engine and CLI)
S Metadata stored in any SQL backend
Image: Cloudera Hive Tutorial
-
HBase Hive Integration
HBase
Hive table definitions
Points to an existing table
Points to some column
Points to other columns, different names
-
Pig
S Pig is a platform for analyzing large data sets, built around a high-level language for expressing data analysis programs
S Pig generates and compiles Map/Reduce programs on the fly
S Abstracts you from specific details
S Focus on data processing
S Data flow
S Built for data manipulation
S Pig is workflow-driven and easy to maintain
-
S Sqoop is a tool designed to help users import existing relational databases into their Hadoop clusters
S Automatic data import: SQL to Hadoop
S Easy import of data from many databases into Hadoop
S Generates code for use in MapReduce applications
S Integrates with Hive
Sqoop
-
S All servers store a copy of the data
S A leader is elected at startup
S Followers service clients; all updates go through the leader
S Update responses are sent when a majority of servers have persisted the change
Zookeeper
-
AVRO
S A data serialization system that provides dynamic integration with scripting languages
S Avro Data
S Expressive
S Smaller and faster
S Dynamic
S Schema stored with data
S APIs permit reading and creating
S Includes a file format and a textual encoding
S Generates JSON metadata automatically
-
AVRO
S Avro RPC
S Leverages versioning support
S Provides cross-language access to Hadoop services
-
S A data collection system for managing large distributed systems
S Built on HDFS and MapReduce
S Toolkit for displaying, monitoring and analyzing the log files
Chukwa
-
Flume
S Flume is:
S A scalable, configurable, extensible and manageable distributed data collection service
S Developed as open source
S A one-stop solution for data collection of all formats
S Flexible reliability guarantees allow careful performance tuning
S Enables quick iteration on new collection strategies
-
Oozie
S Workflow engine in Hadoop: HTTP and command line interface + web console
S Used to:
S Execute and monitor workflows in Hadoop
S Periodically schedule workflows
S Trigger execution by data availability
-
Hadoop Differentiator
Schema-on-Write (RDBMS):
S Schema must be created before data is loaded.
S An explicit load operation has to take place which transforms the data to the internal structure of the database.
S New columns must be added explicitly before data for such columns can be loaded into the database.
S Read is fast.
S Standards/Governance.
Schema-on-Read (Hadoop):
S Data is simply copied to the file store; no special transformation is needed.
S A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns.
S New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
S Load is fast.
S Evolving schemas/Agility.
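The schema-on-read side can be sketched in a few lines of Python; the CSV store and the `read` helper are illustrative stand-ins for a file store and a SerDe. Data is stored as raw text, a parser projects columns at read time, and evolving the schema requires no reload:

```python
import csv
import io

# Load: raw text is simply copied into the store, no transformation (schema-on-read).
raw_store = "1,alice,2013-10-01\n2,bob,2013-10-02\n"

def read(store, columns, schema=("id", "name")):
    """SerDe-like reader: applies the schema and projects columns at read time."""
    rows = []
    for rec in csv.reader(io.StringIO(store)):
        full = dict(zip(schema, rec))
        rows.append(tuple(full.get(c) for c in columns))
    return rows

names = read(raw_store, ["name"])  # the original schema ignores the third field
# Evolving the schema at read time makes the timestamp appear retroactively:
dates = read(raw_store, ["name", "ts"], schema=("id", "name", "ts"))
```

Nothing was reloaded between the two reads; only the schema handed to the reader changed, which is exactly the agility claimed in the right-hand column.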
-
HadoopDB
S A recent study at Yale University, Database Research Department
S Hybrid architecture of parallel databases and MapReduce system
S The idea is to combine the best qualities of both technologies
S Multiple single-node databases are connected using Hadoop as the task coordinator and network communication layer
S Queries are distributed across the nodes by MapReduce framework, but as much work as possible is done in the database node
Slide Courtesy: Dr. Daniel Abadi
-
HadoopDB architecture
Slide Courtesy: Dr. Daniel Abadi
-
Hadoop Limitations
S Write-once model
S A namespace with an extremely large number of files exceeds the Namenode's capacity to maintain
S Cannot be mounted by an existing OS
S Getting data in and out is tedious
S A virtual file system can solve this problem
S HDFS does not implement / support:
S User quotas
S Access permissions
S Hard or soft links
S Data balancing schemes
S No periodic checkpoints
-
Hadoop Tips
S Hadoop is useful
S When you must process lots of unstructured data
S When running batch jobs is acceptable
S When you have access to lots of cheap hardware
S Hadoop is not useful
S For intense calculations with little or no data
S When your data is not self-contained
S When you need interactive results
S Implementation
S Think big, start small
S Build on agile cycles
S Focus on the data, as you will always develop schema on write
S Available Optimizations
S Input to Maps
S Map-only jobs
S Combiner
S Compression
S Speculation
S Fault tolerance
S Buffer size
S Parallelism (threads)
S Partitioner
S Reporter
S DistributedCache
S Task child environment settings
-
Hadoop Tips
S Performance Tuning
S Increase the memory/buffer allocated to the tasks
S Increase the number of tasks that can be run in parallel
S Increase the number of threads that serve the map outputs
S Disable unnecessary logging
S Turn on speculation
S Run reducers in one wave, as they tend to get expensive
S Tune the usage of DistributedCache; it can increase efficiency
S Troubleshooting
S Are your partitions uniform?
S Can you combine records at the map side?
S Are maps reading off a DFS block worth of data?
S Are you running a single reduce wave (unless the data size per reducer is too big)?
S Have you tried compressing intermediate and final data?
S Are there buffer size issues?
S Do you see unexplained long tails?
S Are your CPU cores busy?
S Is at least one system resource being loaded?
-
MapReduce
S Developed for processing large data sets.
S Contains Map and Reduce functions.
S Runs on a large cluster of machines.
S Goals S Use machines across the data center S Elastic scaling S Finite programming model
-
Input | Map() | Copy/Sort | Reduce() | Output
Map Phase
Raw data analyzed and converted to name/value pair
Shuffle Phase
All name/value pairs are sorted and grouped by their keys
Reduce Phase
All values associated with a key are processed for results
MapReduce
-
Programming model
S Input & Output: each a set of key/value pairs
S Programmer specifies two functions:
S map (in_key, in_value) -> list(out_key, intermediate_value)
S Processes an input key/value pair
S Produces a set of intermediate pairs
S reduce (out_key, list(intermediate_value)) -> list(out_value)
S Combines all intermediate values for a particular key
S Produces a set of merged output values (usually just one)
-
Example
S Page 1: DAMA Conference is good
S Page 2: There are good ideas presented at DAMA
S Page 3: I like DAMA because of its variety of topics.
-
Map output
S Worker 1: (DAMA 1), (Conference 1), (is 1), (good 1)
S Worker 2: (There 1), (are 1), (good 1), (ideas 1), (presented 1), (at 1), (DAMA 1)
S Worker 3: (I 1), (like 1), (DAMA 1), (because 1), (of 1), (its 1), (variety 1), (of 1), (topics 1)
-
Reduce Input
S Worker 1: (DAMA 1), (DAMA 1), (DAMA 1)
S Worker 2: (is 1)
S Worker 3: (good 1), (good 1)
S Worker 4: (There 1)
S Worker 5: (ideas 1)
S Worker 6: (presented 1)
S Worker 7: (I 1)
S Worker 8: (like 1)
S Worker 9: (its 1)
S Worker 10: (variety 1)
S Worker 11: (topics 1)
-
Reduce Output
S Worker 1: (DAMA 3)
S Worker 2: (is 1)
S Worker 3: (good 2)
S Worker 4: (There 1)
S Worker 5: (ideas 1)
S Worker 6: (presented 1)
S Worker 7: (I 1)
S Worker 8: (like 1)
S Worker 9: (its 1)
S Worker 10: (variety 1)
S Worker 11: (topics 1)
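The worked example above maps onto a few lines of Python. This sketch uses simple whitespace tokenization (so punctuation sticks to words) and abstracts the worker assignment away; the map, shuffle and reduce phases are the ones named on the earlier slide:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(page):
    # Map phase: raw text -> (word, 1) name/value pairs
    return [(word, 1) for word in page.split()]

def reduce_fn(key, values):
    # Reduce phase: all values associated with a key are combined
    return (key, sum(values))

pages = [
    "DAMA Conference is good",
    "There are good ideas presented at DAMA",
    "I like DAMA because of its variety of topics.",
]

# Shuffle phase: all intermediate pairs are sorted and grouped by key
intermediate = sorted(pair for page in pages for pair in map_fn(page))
counts = dict(
    reduce_fn(key, [v for _, v in group])
    for key, group in groupby(intermediate, key=itemgetter(0))
)
```

Running it reproduces the reduce output above, e.g. a count of 3 for "DAMA" and 2 for "good"; in a real cluster the sorted groups would be partitioned across reduce workers instead of folded into one dictionary.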
-
MapReduce Strengths
S Tunable
S Fine-grained Map and Reduce tasks
S Improved load balancing
S Faster recovery from failed tasks
S Good fault tolerance
S Can scale to thousands of nodes
S Supports heterogeneous environments
S Automatic re-execution on failure
S Localized execution
S With large data, eliminates bandwidth problems by scheduling execution close to the location of the data when possible
S MapReduce + HDFS is a very effective solution for scaling in a distributed geographical environment
-
NoSQL
S Stands for Not Only SQL
S Based on CAP Theorem
S Usually do not require a fixed table schema nor do they use the concept of joins
S All NoSQL offerings relax one or more of the ACID properties
S NoSQL databases come in a variety of flavors
S XML (myXMLDB, Tamino, Sedna)
S Wide Column (Cassandra, HBase, BigTable)
S Key/Value (Redis, Memcached with BerkeleyDB)
S Graph (Neo4j, InfoGrid)
S Document store (CouchDB, MongoDB)
-
NoSQL
[Chart: NoSQL systems plotted by data size vs. complexity - Amazon Dynamo, Google BigTable, Cassandra, HBase, Voldemort, Lotus Notes, up through graph databases (graph theory).]
-
Approaches to CAP
S Eric Brewer stated in 2000 at PODC that you have to give up one of the following in a distributed system:
S Consistency of data
S Availability
S Partition tolerance
S BASE
S No ACID; use a single version of the DB, reconcile later
S Defer transaction commit until partitions are fixed and replication can run
S Eventual consistency (e.g., Amazon Dynamo)
S Eventually, all copies of an object converge
S Restrict transactions (e.g., sharded MySQL)
S 1-machine transactions: objects in a transaction are on the same machine
S 1-object transactions: a transaction can only read/write one object
S Object timelines (PNUTS)
-
Consistency Model
S If copies are asynchronously updated, what can we say about stale copies?
S ACID guarantees require synchronous updates
S Eventual consistency: copies can drift apart, but will eventually converge if the system is allowed to quiesce
S To what value will copies converge?
S Do systems ever quiesce?
S Is there any middle ground?
-
Consistency Techniques
S Per-record mastering
S Each record is assigned a master region
S May differ between records
S Updates to the record are forwarded to the master region
S Ensures consistent ordering of updates
S Tablet-level mastering
S Each tablet is assigned a master region
S Inserts and deletes of records are forwarded to the master region
S The master region decides tablet splits
S These details are hidden from the application
S Except for the latency impact!
-
HBASE
-
Architecture
[Diagram: clients (including a Java client and a REST API) connect to the HBaseMaster, which coordinates multiple HRegionServers, each with its own disk.]
-
HRegion Server
S Records partitioned by column family into HStores
S Each HStore contains many MapFiles
S All writes to an HStore are applied to a single memcache
S Reads consult the MapFiles and the memcache
S Memcaches are flushed as MapFiles (HDFS files) when full
S Compactions limit the number of MapFiles
[Diagram: within an HRegionServer, writes go to the memcache of an HStore; the memcache is flushed to MapFiles on disk; reads consult both.]
-
Pros and Cons
S Pros
S Log-based storage for high write throughput
S Elastic scaling
S Easy load balancing
S Column storage for OLAP workloads
S Cons
S Writes not immediately persisted to disk
S Reads cross multiple disk and memory locations
S No geo-replication
S Latency/bottleneck of HBaseMaster when using REST
-
CASSANDRA
-
Architecture
S Facebook's storage system
S BigTable data model
S Dynamo partitioning and consistency model
S Peer-to-peer architecture
[Diagram: clients connect to a ring of peer Cassandra nodes, each with its own disk.]
-
Routing
S Consistent hashing, like Dynamo or Chord
S Server position = hash(server id)
S Content position = hash(content id)
S A server is responsible for all content in a hash interval
[Diagram: each server owns the hash interval that precedes its position on the ring.]
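The hash-ring routing described above can be sketched in Python; MD5 and the node names are illustrative choices, not from the deck:

```python
import hashlib
from bisect import bisect_right

def h(key):
    """Position on the ring = hash of the id (MD5 here, for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        # Sorted (position, server) pairs form the hash ring
        self.ring = sorted((h(s), s) for s in servers)
        self.positions = [pos for pos, _ in self.ring]

    def owner(self, content_id):
        """The server whose position is clockwise-next from the content's
        hash is responsible for it (wrapping around the ring)."""
        i = bisect_right(self.positions, h(content_id)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
```

A useful property of this scheme: when a server is added, any given key either keeps its previous owner or moves to the new server, so only the content in one hash interval is remapped.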
-
Cassandra Server
S Writes go to a commit log and an in-memory table (memtable)
S Periodically the memory table is merged with the disk table
[Diagram: an update is appended to the log and applied to the memtable in RAM; later the memtable is flushed to an SSTable file on disk.]
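The write path above can be sketched as a toy Python class; the memtable limit, key/value API and merge-on-read are simplifications of the real engine:

```python
class Node:
    """Toy Cassandra-style write path: append to a commit log, update an
    in-memory table, and flush the memtable to an immutable SSTable
    when it grows too large."""

    def __init__(self, memtable_limit=2):
        self.log = []        # sequential commit log (used for recovery)
        self.memtable = {}   # in-RAM table of recent writes
        self.sstables = []   # immutable on-disk tables
        self.limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # Periodically the memory table is merged with the disk tables
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:       # newest data lives in RAM
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest SSTable first
            if key in table:
                return table[key]
        return None

node = Node()
node.write("k1", "v1")
node.write("k2", "v2")   # the memtable hits its limit and is flushed
node.write("k1", "v9")   # the newer value lives only in the memtable
```

Reads consult the memtable before the SSTables, so the most recent write wins even though an older value still sits on disk; real compaction would eventually merge the tables.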
-
Pros and Cons
S Pros
S Elastic scalability
S Easy management: peer-to-peer configuration
S BigTable model is nice: flexible schema, column groups for partitioning, versioning, etc.
S Eventual consistency is scalable
S Cons
S Eventual consistency is hard to program against
S No built-in support for geo-replication
S Load balancing?
S System complexity: P2P systems are complex and have complex corner cases
-
Cassandra Tips
S Tunable memtable size
S Can have a large memtable flushed less frequently, or a small memtable flushed frequently
S The tradeoff is throughput versus recovery time
S A larger memtable will require fewer flushes, but will take a long time to recover after a failure
S With a 1 GB memtable: 45 minutes to 1 hour to restart
S Can turn off log flushing
S Risks loss of durability
S Replication is still synchronous with the write
S Durable if updates are propagated to other servers that don't fail
-
NoSQL
S Best Practices
S Design for data collection
S Plan the data store
S Organize by type and semantics
S Partition for performance
S Access and query are runtime dependent
S Horizontal scaling
S Memory caching
S Access and Query
S RESTful interfaces (HTTP as an access API)
S Query languages other than SQL
S SPARQL - query language for the Semantic Web
S Gremlin - the graph traversal language
S Sones Graph Query Language
S Data Manipulation / Query API
S The Google BigTable DataStore API
S The Neo4j Traversal API
S Serialization Formats
S JSON
S Thrift
S ProtoBuffers
S RDF
-
Textual ETL Engine
S Forest Rim Technology's Textual ETL Engine (TETLE) is an integration tool for turning text into a structure of data that can be analyzed by standard analytical tools
S Textual ETL Engine provides a robust user interface to define rules (patterns / keywords) to process unstructured or semi-structured data
S The rules engine encapsulates all the complexity and lets the user define simple phrases and keywords
S Easy to implement and easy to realize ROI
S Advantages
S Simple to use
S No MR or coding required for text analysis and mining
S Extensible by taxonomy integration
S Works on standard and new databases
S Produces a highly columnar key-value store, ready for metadata integration
S Disadvantages
S Not integrated with Hadoop as a rules interface
S Currently uses Sqoop for metadata interchange with Hadoop or NoSQL interfaces
S The current GA does not handle distributed processing outside the Windows platform
-
Amazon RedShift
S The industry's first large-scale Data Warehouse as a Service
S Designed and architected for petabyte-scale deployment
S Goal 1 - Reduce I/O
S Direct-attached storage
S Large data block sizes
S Columnar storage
S Goal 2 - Optimize hardware
S Optimized for I/O-intensive workloads
S High disk density
S Runs on a fast network - HPC
S Goal 3 - Extreme parallelism: increased speed and efficiency of
S Loading
S Querying
S Backup
S Restore
-
RedShift Architecture
[Diagram: SQL clients / BI tools connect to a leader node, which coordinates the compute nodes. Picture: Amazon presentation on RedShift.]
-
Deployment Options
S Can be hosted with RDBMS on-site and RedShift on the Cloud
-
Deployment Options
S Can be used as Live Archive on the Cloud
-
Deployment Options
S Can be used as ETL for Big Data on the Cloud
-
Big Data Technologies
S Apache Software Foundation
S Hadoop
S HBase
S Zookeeper
S Oozie
S Avro
S Pig
S Sqoop
S Flume
S Cassandra
S Cloudera
S Hortonworks
S MongoDB
S IBM BigInsights
S EMC Pivotal
S Teradata Aster Big Data Appliance
S Oracle Big Data Appliance
S Intel Hadoop Distribution
S MapR
S Datastax
S Rainstor
S QueryIO
-
Workloads, Architectures, Computing
-
Workload
S Defined as the usage of resources (CPU, disk and memory) by every query: ETL, ELT, BI and analytics
S Often misunderstood as a Database capability
S Mostly touted by vendors as a differentiator for their platform
-
Workload
S Loading S Continuous (near real-time) S Batch S Micro Batch
S Queries S Tactical S AdHoc S Analytical S Dashboard
MIXED Workload
-
What Are You Trying to Do?
Data Workloads
S OLTP (random access to a few records)
S OLAP (scan access to a large number of records)
S Combined (some OLTP and OLAP tasks)
S Dimensions: read-heavy vs. write-heavy; by rows vs. by columns; unstructured
-
Data Engineering vs. Analysis/Warehousing
S Very different workloads and requirements
S Warehoused data for analysis includes:
S Data from serving systems
S Click log streams
S Syndicated feeds
S Trend towards scalable stores with:
S Semi-structured data
S Map-reduce
S The result of analysis is stored in the Data Warehouse
-
Workload Isolation
S Assigning the appropriate systems and processes to manage workloads
S Creates an interchangeable infrastructure
S Provides for better scalability
S Will create a heterogeneous configuration, can be deployed on a homogenized platform if desired
-
Workload Isolation
Semi-Structured Data
-
Workload Isolation
Semi-Structured Data
-
Workload Isolation
Semi-Structured Data
-
Metadata
S The key to the castle in integrating Big Data is metadata
S Whatever the tool, technology and technique, if you do not know your metadata, your integration will fail
S Semantic technologies and architectures will be the way to process and integrate the Big Data.
S Business domain experts can identify large data patterns by association relationships with small metadata.
-
The Big Data - Data Warehouse
-
Multi-Tiered Workload
Application | Unstructured Data (File Based) | Semi-Structured Data (File / Digital) | Structured Data (Digital)
Social Analytics, Behavior Analytics, Recommendation Engines, Sentiment Analytics, Fraud Detection | Hadoop / NoSQL | Hadoop / NoSQL | RDBMS
CRM, SalesForce, Marketing | - | - | RDBMS
Data Mining | Hadoop / NoSQL | Hadoop / NoSQL | RDBMS
System Characteristics | Volume: Large; Concurrency: Low; Consolidation: App Specific; Availability: High; Updated: Near Real Time to Monthly | Volume: Large; Concurrency: Medium; Consolidation/Integration: Variable; Availability: Medium; Updated: Near Real Time | Volume: Large; Concurrency: High; Consolidation/Integration: High; Availability: High; Updated: Intra-Day & Daily
-
Reference Architecture
-
Which Tool
Application | Hadoop / NoSQL / Textual ETL
Machine Learning | x x
Sentiments | x x x
Text Processing | x x x
Image Processing | x x
Video Analytics | x x
Log Parsing | x x x
Collaborative Filtering | x x x
Context Search | x
Email & Content | x
-
Challenges
S Resource availability
S MR is hard to implement
S Speech to text
S Conversation context is often missing
S Quality of recording
S Accent issues
S Visual data tagging
S Images
S Text embedded within images
S Metadata is not available
S Data is not trusted
S Content management platform capabilities
S Ontology ambiguity
S Taxonomy integration
-
Thank You
Krish Krishnan [email protected] Twitter Handle: @datagenius