Big Data (NJ SQL Server User Group)


Introduction to Big Data and NoSQL

NJ SQL Server User Group, May 15, 2012

Don Demsak

Advisory Solutions Architect

EMC Consulting

www.donxml.com

Melissa Demsak

SQL Architect

Realogy

www.sqldiva.com

Meet Melissa

• SQL Architect, Realogy
• SqlDiva, Twitter: sqldiva
• Email: melissa@sqldiva.com

Meet Don

• Advisory Solutions Architect, EMC Consulting
• Application Architecture, Development & Design
• DonXml.com, Twitter: donxml
• Email: don@donxml.com
• SlideShare: http://www.slideshare.net/dondemsak

The era of Big Data

How did we get here?
• Expensive:
  o Processors
  o Disk space
  o Memory
  o Operating systems
  o Software
  o Programmers
• Culture of limitations:
  o Limit CPU cycles
  o Limit disk space
  o Limit memory
  o Limited OS development
  o Limited software
  o Programmers
• One language
• One persistence store

Typical RDBMS Implementations

• Fixed table schemas

• Small but frequent reads/writes

• Large batch transactions

• Focus on ACID:
  o Atomicity
  o Consistency
  o Isolation
  o Durability

How we scale RDBMS implementations

1st Step – Build a relational database

[Diagram: a single relational database]

2nd Step – Table Partitioning

[Diagram: the relational database's tables split into partitions p1, p2, p3]

3rd Step – Database Partitioning

[Diagram: each customer (#1, #2, #3) gets its own stack: Browser → Web Tier → B/L Tier → Relational Database; a routing sketch follows]
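Partitioning the database per customer means the business-logic tier has to route every request to the owning customer's database. A minimal sketch of that routing step in JavaScript; the shard layout and connection strings are hypothetical, not from the talk:

// hypothetical map of customer id -> that customer's dedicated database
var shards = {
  1: "Server=sql01;Database=Customer1",
  2: "Server=sql02;Database=Customer2",
  3: "Server=sql03;Database=Customer3"
};

// pick the connection for a request, failing loudly for unknown customers
function connectionFor(customerId) {
  var conn = shards[customerId];
  if (!conn) {
    throw new Error("No shard registered for customer " + customerId);
  }
  return conn;
}

connectionFor(2); // => "Server=sql02;Database=Customer2"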

4th Step – Move to the cloud?

[Diagram: the same per-customer stacks, with each Relational Database replaced by a SQL Azure Federation]

Problems created by too much data

• Where to store it
• How to store it
• How to process it
• Organization, searching, and metadata
• How to manage access
• How to copy, move, and back up
• Lifecycle

Polyglot Programmer

Polyglot Persistence

(how to store)

• Atlanta 2009: the no:sql(east) conference

select fun, profit from real_world where relational=false

• Billed as a “conference of no-rel datastores”

(loose) Definition
• (often) Open source
• Non-relational
• Distributed
• (often) Does not guarantee ACID

Types Of NoSQL Data Stores

5 Groups of Data Models

Relational

Document

Key Value

Graph

Column Family

Document?
• Think of a web page...
  o The relational model requires a column/tag for every attribute
  o Lots of empty columns
  o Wasted space and processing time
• The document model just stores the page as is
  o Saves on space
  o Very flexible
• Document databases (see the sketch below):
  o Apache Jackrabbit
  o CouchDB
  o MongoDB
  o SimpleDB
  o XML databases: MarkLogic Server, eXist
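As a rough illustration of “store the page as is”, here is how an arbitrarily shaped document goes into and comes out of the MongoDB shell; the collection and field names are made up for the example:

// insert the page as-is; no schema, and fields can differ per document
db.pages.insert({
  url: "http://example.com/about",   // hypothetical page
  title: "About Us",
  tags: ["company", "history"]       // arrays need no join table
});

// find every page carrying a given tag
db.pages.find({ tags: "company" });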

Key/Value Stores
• Simple index on key
• Value can be any serialized form of data
• Lots of different implementations:
  o Eventually consistent: “If no updates occur for a period, eventually all updates will propagate through the system and all replicas will be consistent”
  o Cached in RAM
  o Cached on disk
  o Distributed hash tables
• Examples (see the sketch below):
  o Azure AppFabric Cache
  o Memcached
  o VMware vFabric GemFire
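At its core a key/value store is a dictionary with a single index on the key and an opaque, serialized value. A toy in-memory sketch in JavaScript, illustrative only and not the API of any product listed above:

var store = {}; // the whole database: values indexed by one string key

function put(key, value) {
  store[key] = JSON.stringify(value); // value is saved as an opaque serialized blob
}

function get(key) {
  var raw = store[key];
  return raw === undefined ? null : JSON.parse(raw);
}

put("user:42", { name: "Melissa", role: "SQL Architect" });
get("user:42"); // => { name: "Melissa", role: "SQL Architect" }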

Graph?
• A graph consists of:
  o Nodes (the ‘stations’ of the graph)
  o Edges (the lines between them)
• Graph stores (see the sketch below):
  o AllegroGraph
  o Core Data
  o Neo4j
  o DEX
  o FlockDB: created by the Twitter folks; nodes = users, edges = the nature of the relationship between nodes
  o Microsoft Trinity (research project): http://research.microsoft.com/en-us/projects/trinity/
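Underneath, a graph store keeps nodes and the edges between them. A tiny adjacency-list sketch in JavaScript with Twitter-style “follows” edges; the user names are made up:

// each key is a node (user); the value lists the nodes its edges point to
var follows = {
  don: ["melissa"],
  melissa: ["don"],
  twitterdev: []
};

// add an edge from one node to another (no-op if it already exists)
function addEdge(from, to) {
  if (follows[from].indexOf(to) === -1) {
    follows[from].push(to);
  }
}

addEdge("twitterdev", "don");
follows["twitterdev"]; // => ["don"]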

Column Family?
• Lots of variants:
  o Object stores: Db4o, GemStone/S, InterSystems Caché, Objectivity/DB, ZODB
  o Tabular: BigTable, Mnesia, HBase, Hypertable, Azure Table Storage
  o Column-oriented: Greenplum, Microsoft SQL Server 2012 (columnstore indexes)

Okay, Got It. Now Let’s Compare Some Real-World Scenarios


You Need Constant Consistency

• You’re dealing with financial transactions
• You’re dealing with medical records
• You’re dealing with bonded goods
• Best to use an RDBMS


You Need Horizontal Scalability

• You’re working across defined time zones
• You’re aggregating large quantities of data
• You’re maintaining a chat server (think Facebook chat)
• Use column family storage


Frequently Written Rarely Read

• Think web counters and the like
• Every time a user comes to a page = ctr++
• But it’s only read when the report is run
• Use key-value storage (see the sketch below)
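For this write-heavy counter pattern, an atomic increment against a key-value-style store avoids a read on every hit. A sketch in the MongoDB shell; the collection and field names are made up, and the upsert option creates the counter on the first visit:

// one document per page; every page view is a single atomic write
db.counters.update(
  { _id: "/products/42" },   // the key: a hypothetical page URL
  { $inc: { views: 1 } },    // increment in place, no read required
  { upsert: true }           // create the document on the first hit
);

// read only when the report is run
db.counters.find();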


Here Today Gone Tomorrow

• Transient data like...
  o Web sessions
  o Locks
  o Short-term stats
• Shopping cart contents
• Use key-value storage

Where to store
• RAM
  o Fast
  o Expensive
  o Volatile
• Parallel file system
  o HDFS (Hadoop)
  o Auto-replicated for parallel, decentralized I/O
• Local disk
  o SSD: super fast
  o Fast spinning disks (7200+ RPM)
  o High bandwidth possible
  o Persistent
• SAN (Storage Area Network)
  o Fully managed
  o Expensive
• Cloud
  o Amazon
  o Box.net
  o Dropbox

Big Data

Big Data Definition

Volume

• Beyond what traditional environments can handle

Velocity

• Need decisions fast

Variety

• Many formats

Additional Big Data Concepts
• Volumes & volumes of data

• Unstructured

• Semi-structured

• Not suited for Relational Databases

• Often utilizes MapReduce frameworks

Big Data Examples
• Cassandra

• Hadoop

• Greenplum

• Azure Storage

• EMC Atmos

• Amazon S3

• SQL Azure (with Federations support)?

Real World Example

• Twitter
  o The challenges: Twitter needs to store many graphs (who you are following, who’s following you, who you receive phone notifications from, etc.)
  o Delivering a tweet requires rapid paging of followers
  o Heavy write load as followers are added and removed
  o Set arithmetic for @mentions (the intersection of users)

What did they try?
• Started with relational databases
• Tried key-value storage of denormalized lists
• Did it work?
  o Nope: each approach was either good at handling the write load or at paging large amounts of data, but not both

What did they need?

• Simplest possible thing that would work

• Allow for horizontal partitioning

• Allow write operations to:
  o Arrive out of order
  o Or be processed more than once
• Failures should result in redundant work, not lost work!

The Result Was FlockDB
• Stores graph data
• Not optimized for graph-traversal operations
• Optimized for large adjacency lists
  o A list of all the edges in a graph
  o The key is the edge; the value is a set of the node end points
• Optimized for fast reads and writes
• Optimized for page-able set arithmetic

How Does It Work?
• Stores graphs as sets of edges between nodes
• Data is partitioned by node
  o All queries can be answered by a single partition
• Write operations are idempotent
  o Can be applied multiple times without changing the result
• And commutative
  o Changing the order of operands doesn’t change the result
(see the sketch below)
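A toy illustration, not FlockDB code, of why idempotent writes make retried and reordered operations safe, sketched in JavaScript as a set of follower edges:

var followers = {}; // adjacency list for one node, stored as a set

function addEdge(userId) {
  followers[userId] = true; // idempotent: applying it twice equals applying it once
}
function removeEdge(userId) {
  delete followers[userId]; // also idempotent
}

// a duplicated stream of writes, as after failure-driven retries...
addEdge(7); addEdge(9); addEdge(7); removeEdge(3); addEdge(7);

// ...still converges to the same final state
Object.keys(followers); // => ["7", "9"]

// page-able set arithmetic, e.g. for @mentions: the intersection of two sets
function intersect(a, b) {
  return Object.keys(a).filter(function (k) { return b[k]; });
}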

How to Process Big Data

ACID
• Atomicity
  o All or nothing
• Consistency
  o Valid according to all defined rules
• Isolation
  o No transaction should be able to interfere with another transaction
• Durability
  o Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors

BASE
• Basically available
  o High availability, but not always consistent
• Soft state
  o Background cleanup mechanism
• Eventual consistency
  o Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent

Traditional (relational) Approach

[Diagram: Extract → Transform → Load (ETL), moving data from the transactional data store into the data warehouse]

Big Data Approach

• MapReduce pattern/framework:
  o An input reader
  o A map function, to transform the input to a common shape (format)
  o A partition function
  o A compare function
  o A reduce function
  o An output writer

MongoDB Example

// map function: emit each tag on a document with a count of 1
m = function () {
  this.tags.forEach(function (z) {
    emit(z, { count: 1 });
  });
};

// reduce function: sum all the counts emitted for one tag
r = function (key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++)
    total += values[i].count;
  return { count: total };
};

// execute, writing the results to the "myoutput" collection
res = db.things.mapReduce(m, r, { out: "myoutput" });
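After the job runs, db.myoutput.find() returns one result document per tag, shaped like { _id : "nosql", value : { count : 12 } } (the tag name and count here are invented for illustration).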

What is Hadoop?
• A scalable, fault-tolerant grid operating system for data storage and processing
• Its scalability comes from the marriage of:
  o HDFS: self-healing, high-bandwidth clustered storage
  o MapReduce: fault-tolerant distributed processing

• Operates on unstructured and structured data

• A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)

• Open source under the friendly Apache License

• http://wiki.apache.org/hadoop/

Hadoop Design Axioms

1. System Shall Manage and Heal Itself

2. Performance Shall Scale Linearly

3. Compute Should Move to Data

4. Simple Core, Modular and Extensible

Hadoop Core Components

• Store: HDFS (self-healing, high-bandwidth clustered storage)
• Process: Map/Reduce (fault-tolerant distributed processing)

HDFS: Hadoop Distributed File System

• Block size = 64 MB
• Replication factor = 3
• Cost/GB is a few ¢/month vs. $/month

Hadoop Map/Reduce

Hadoop Job Architecture

Microsoft embraces Hadoop

Good for enterprises & developers

Great for end users!

A SEAMLESS OCEAN OF INFORMATION PROCESSING AND ANALYTICS

[Diagram: Hadoop (Azure and enterprise) sits over an ocean of data (unstructured, semi-structured, structured) drawn from EIS/ERP, RDBMS, file systems, OData (RSS), Azure Storage, HDFS, NoSQL stores, and ETL feeds, programmable through the Java OM, Streaming OM, HiveQL, Pig Latin, (T)SQL, and .NET/C#/F#]


Hive Plug-in for Excel

THANK YOU