Big Data (NJ SQL Server User Group)

Introduction to Big Data and NoSQL NJ SQL Server User Group May 15, 2012 Don Demsak Advisory Solutions Architect EMC Consulting www.donxml.com Melissa Demsak SQL Architect Realogy www.sqldiva.com


Transcript of Big Data (NJ SQL Server User Group)

Page 1: Big Data (NJ SQL Server User Group)

Introduction to Big Data and NoSQL

NJ SQL Server User Group

May 15, 2012

Don Demsak

Advisory Solutions Architect

EMC Consulting

www.donxml.com

Melissa Demsak

SQL Architect

Realogy

www.sqldiva.com

Page 2: Big Data (NJ SQL Server User Group)

Meet Melissa

• SQL Architect – Realogy
• SqlDiva, Twitter: sqldiva
• Email – [email protected]

Page 3: Big Data (NJ SQL Server User Group)

Meet Don

• Advisory Solutions Architect – EMC Consulting
• Application Architecture, Development & Design
• DonXml.com, Twitter: donxml
• Email – [email protected]
• SlideShare - http://www.slideshare.net/dondemsak

Page 4: Big Data (NJ SQL Server User Group)

The era of Big Data

Page 5: Big Data (NJ SQL Server User Group)

How did we get here?

• Expensive
  o Processors
  o Disk space
  o Memory
  o Operating Systems
  o Software
  o Programmers
• Culture of Limitations
  o Limit CPU cycles
  o Limit disk space
  o Limit memory
  o Limited OS Development
  o Limited Software
  o Programmers
• One language
• One persistence store

Page 6: Big Data (NJ SQL Server User Group)

Typical RDBMS Implementations

• Fixed table schemas

• Small but frequent reads/writes

• Large batch transactions

• Focus on ACID
  o Atomicity
  o Consistency
  o Isolation
  o Durability

Page 7: Big Data (NJ SQL Server User Group)

How we scale RDBMS implementations

Page 8: Big Data (NJ SQL Server User Group)

1st Step – Build a relational database

Relational Database

Page 9: Big Data (NJ SQL Server User Group)

2nd Step – Table Partitioning

[Diagram: a single relational database split into table partitions p1, p2, p3]
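Table partitioning can be pictured as a function that routes each row to one of the partitions by its key. The following is only a minimal sketch, assuming range partitioning on an integer key; the boundary values and the name partitionFor are hypothetical and are not SQL Server partition-function syntax.

```javascript
// Hypothetical range boundaries: keys below 1000 go to p1,
// keys below 2000 go to p2, everything else goes to p3.
const boundaries = [1000, 2000];
const partitions = ["p1", "p2", "p3"];

// Return the name of the partition that holds a given key.
function partitionFor(key) {
  const i = boundaries.findIndex((upper) => key < upper);
  return i === -1 ? partitions[partitions.length - 1] : partitions[i];
}

console.log(partitionFor(42));   // p1
console.log(partitionFor(1500)); // p2
console.log(partitionFor(9999)); // p3
```

A query that filters on the partitioning key only has to touch the one partition that partitionFor selects.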

Page 10: Big Data (NJ SQL Server User Group)

3rd Step – Database Partitioning

[Diagram: three separate stacks, one each for Customer #1, Customer #2, and Customer #3; every stack has its own Browser, Web Tier, B/L Tier, and Relational Database]

Page 11: Big Data (NJ SQL Server User Group)

4th Step – Move to the cloud?

[Diagram: three separate stacks, one each for Customer #1, Customer #2, and Customer #3; every stack has its own Browser, Web Tier, B/L Tier, and SQL Azure Federation]

Page 12: Big Data (NJ SQL Server User Group)

Problems created by too much data

• Where to store
• How to store
• How to process
• Organization, searching, and metadata
• How to manage access
• How to copy, move, and back up
• Lifecycle

Page 13: Big Data (NJ SQL Server User Group)
Page 14: Big Data (NJ SQL Server User Group)

Polyglot Programmer

Page 15: Big Data (NJ SQL Server User Group)

Polyglot Persistence

(how to store)

Page 16: Big Data (NJ SQL Server User Group)

• Atlanta 2009 - No:sql(east) conference

select fun, profit from real_world where relational=false

• Billed as “conference of no-rel datastores”

(loose) Definition

• (often) Open source
• Non-relational
• Distributed
• (often) does not guarantee ACID

Page 17: Big Data (NJ SQL Server User Group)

Types Of NoSQL Data Stores

Page 18: Big Data (NJ SQL Server User Group)

5 Groups of Data Models

Relational

Document

Key Value

Graph

Column Family

Page 19: Big Data (NJ SQL Server User Group)

Document?

• Think of a web page...
  o Relational model requires column/tag
  o Lots of empty columns
  o Wasted space and processing time
• Document model just stores the pages as is
  o Saves on space
  o Very flexible
• Document Databases
  o Apache Jackrabbit
  o CouchDB
  o MongoDB
  o SimpleDB
  o XML Databases
    • MarkLogic Server
    • eXist
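The "lots of empty columns" point can be made concrete with a small sketch. The two page documents below are invented sample data; the code flattens them into fixed-width rows, the way a one-column-per-tag relational table would, and counts the cells that end up empty.

```javascript
// Two hypothetical web pages stored as documents; each carries only its own fields.
const pages = [
  { url: "/home", title: "Home", keywords: ["welcome"] },
  { url: "/about", title: "About", author: "Don", video: "intro.mp4" },
];

// A relational table needs one column for every field that any page uses.
const columns = [...new Set(pages.flatMap((p) => Object.keys(p)))];

// Flatten each document into a fixed-width row; missing fields become null.
const rows = pages.map((p) => columns.map((c) => (c in p ? p[c] : null)));
const emptyCells = rows.flat().filter((v) => v === null).length;

console.log(columns.length); // 5 columns for 2 pages
console.log(emptyCells);     // 3 cells hold nothing
```

The document form stores 7 values in total, while the tabular form reserves 10 cells, and the gap widens as pages grow more varied.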

Page 20: Big Data (NJ SQL Server User Group)

Key/Value Stores

• Simple Index on Key
• Value can be any serialized form of data
• Lots of different implementations
  o Eventually Consistent
    • “If no updates occur for a period, eventually all updates will propagate through the system and all replicas will be consistent”
  o Cached in RAM
  o Cached on disk
  o Distributed Hash Tables
• Examples
  o Azure AppFabric Cache
  o Memcached
  o VMware vFabric GemFire
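As a rough illustration of "simple index on key, value in any serialized form", here is a minimal in-memory sketch. The class name KeyValueStore and the choice of JSON as the serialization are assumptions made for the example; none of the products listed above is being modelled.

```javascript
// A toy key/value store: one index on the key, values kept as serialized strings.
class KeyValueStore {
  constructor() {
    this.index = new Map(); // key -> serialized value
  }

  // Serialize the value (JSON here, purely as an example format) and store it.
  put(key, value) {
    this.index.set(key, JSON.stringify(value));
  }

  // Look up by key only; there is no way to query by the value's contents.
  get(key) {
    const raw = this.index.get(key);
    return raw === undefined ? undefined : JSON.parse(raw);
  }
}

const store = new KeyValueStore();
store.put("user:1", { name: "Melissa", role: "SQL Architect" });
console.log(store.get("user:1").name); // Melissa
console.log(store.get("user:2"));      // undefined
```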

Page 21: Big Data (NJ SQL Server User Group)

Graph?

• Graph consists of
  o Nodes (‘stations’ of the graph)
  o Edges (lines between them)
• Graph Stores
  o AllegroGraph
  o Core Data
  o Neo4j
  o DEX
  o FlockDB
    • Created by the Twitter folks
    • Nodes = Users
    • Edges = Nature of relationship between nodes
  o Microsoft Trinity (research project)
    • http://research.microsoft.com/en-us/projects/trinity/
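The node/edge vocabulary maps onto a very small data structure. The sketch below uses invented users and relationship types, and a hypothetical helper named neighbors; it is not the API of any of the graph stores listed.

```javascript
// Edges connect two nodes (users) and carry the nature of the relationship.
const edges = [
  { from: "alice", to: "bob", type: "follows" },
  { from: "bob", to: "carol", type: "follows" },
  { from: "alice", to: "carol", type: "blocks" },
];

// All nodes reachable from `node` over edges of the given type.
function neighbors(node, type) {
  return edges
    .filter((e) => e.from === node && e.type === type)
    .map((e) => e.to);
}

console.log(neighbors("alice", "follows")); // [ 'bob' ]
console.log(neighbors("alice", "blocks"));  // [ 'carol' ]
```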

Page 22: Big Data (NJ SQL Server User Group)

Column Family?

• Lots of variants
  o Object Stores
    • Db4o
    • GemStone/S
    • InterSystems Caché
    • Objectivity/DB
    • ZODB
  o Tabular
    • BigTable
    • Mnesia
    • HBase
    • Hypertable
    • Azure Table Storage
  o Column-oriented
    • Greenplum
    • Microsoft SQL Server 2012

Page 23: Big Data (NJ SQL Server User Group)

Okay got it, Now Let’s Compare Some Real World Scenarios

Page 24: Big Data (NJ SQL Server User Group)


You Need Constant Consistency

• You’re dealing with financial transactions
• You’re dealing with medical records
• You’re dealing with bonded goods
• Best you use an RDBMS

Page 25: Big Data (NJ SQL Server User Group)


You Need Horizontal Scalability

• You’re working across defined timezones
• You’re aggregating large quantities of data
• Maintaining a chat server (Facebook chat)
• Use Column Family Storage

Page 26: Big Data (NJ SQL Server User Group)


Frequently Written Rarely Read

• Think web counters and the like
• Every time a user comes to a page = ctr++
• But it’s only read when the report is run
• Use Key-Value Storage
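The write-often, read-rarely counter can be sketched with a plain in-memory map standing in for the key-value store. The function names hit and report are hypothetical.

```javascript
// page path -> number of visits; a Map stands in for the key-value store.
const counters = new Map();

// Called on every page view: one cheap write keyed by the page.
function hit(page) {
  counters.set(page, (counters.get(page) || 0) + 1);
}

// Called only when the report is run: the single read path.
function report() {
  return Object.fromEntries(counters);
}

hit("/home");
hit("/home");
hit("/about");
console.log(report()); // { '/home': 2, '/about': 1 }
```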

Page 27: Big Data (NJ SQL Server User Group)


Here Today Gone Tomorrow

• Transient data like...
  o Web Sessions
  o Locks
  o Short Term Stats
• Shopping cart contents
• Use Key-Value Storage
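Transient data usually comes with an expiry time. The sketch below assumes a hypothetical ExpiringStore class with a time-to-live per key; the clock is injected so the example stays deterministic, and none of this reflects a specific product's API.

```javascript
// A toy key-value store whose entries disappear after a time-to-live.
class ExpiringStore {
  constructor(now = () => Date.now()) {
    this.now = now;         // clock function, injectable for testing
    this.items = new Map(); // key -> { value, expiresAt }
  }

  set(key, value, ttlMs) {
    this.items.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  get(key) {
    const item = this.items.get(key);
    if (!item) return undefined;
    if (this.now() >= item.expiresAt) {
      this.items.delete(key); // lazily drop expired entries on read
      return undefined;
    }
    return item.value;
  }
}

// Simulated clock in milliseconds, so the example does not have to wait.
let clock = 0;
const sessions = new ExpiringStore(() => clock);
sessions.set("session:42", { cart: ["book"] }, 1000);
console.log(sessions.get("session:42")); // { cart: [ 'book' ] }
clock = 1500;
console.log(sessions.get("session:42")); // undefined
```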

Page 28: Big Data (NJ SQL Server User Group)

Where to store

• RAM
  o Fast
  o Expensive
  o Volatile
• Parallel File System
  o HDFS (Hadoop)
  o Auto-replicated for parallel decentralized I/O
• Local Disk
  o SSD – super fast
  o Fast spinning disks (7200+ RPM)
  o High Bandwidth possible
  o Persistent
• SAN
  o Storage Area Network
  o Fully managed
  o Expensive
• Cloud
  o Amazon
  o Box.Net
  o DropBox

Page 29: Big Data (NJ SQL Server User Group)

Big Data

Page 30: Big Data (NJ SQL Server User Group)

Big Data Definition

Volume

• Beyond what traditional environments can handle

Velocity

• Need decisions fast

Variety

• Many formats

Page 31: Big Data (NJ SQL Server User Group)

Additional Big Data Concepts

• Volumes & volumes of data

• Unstructured

• Semi-structured

• Not suited for Relational Databases

• Often utilizes MapReduce frameworks

Page 32: Big Data (NJ SQL Server User Group)

Big Data Examples

• Cassandra

• Hadoop

• Greenplum

• Azure Storage

• EMC Atmos

• Amazon S3

• SQL Azure (with Federations support)?

Page 33: Big Data (NJ SQL Server User Group)

Real World Example

• Twitter
  o The challenges
    • Needs to store many graphs: who you are following, who’s following you, who you receive phone notifications from, etc.
    • To deliver a tweet requires rapid paging of followers
    • Heavy write load as followers are added and removed
    • Set arithmetic for @mentions (intersection of users)
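The set arithmetic behind @mentions can be sketched as an intersection of follower sets. The example assumes, for illustration only, that a tweet from one user mentioning another is shown to people who follow both; the user names and the helper intersect are made up.

```javascript
// Hypothetical follower sets, keyed by the user being followed.
const followers = {
  alice: new Set(["carol", "dave", "erin"]),
  bob: new Set(["dave", "erin", "frank"]),
};

// Users present in both sets.
function intersect(a, b) {
  return new Set([...a].filter((user) => b.has(user)));
}

// Assumed audience for a tweet from alice that mentions bob.
const audience = intersect(followers.alice, followers.bob);
console.log([...audience]); // [ 'dave', 'erin' ]
```

With millions of followers per set, the result has to be computed and returned a page at a time, which is why paging and set arithmetic appear together in the list of challenges.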

Page 34: Big Data (NJ SQL Server User Group)

What did they try?

• Started with Relational Databases
• Tried Key-Value storage of denormalized lists
• Did it work?
  o Nope
    • Either good at handling the write load or paging large amounts of data, but not both

Page 35: Big Data (NJ SQL Server User Group)

What did they need?

• Simplest possible thing that would work

• Allow for horizontal partitioning

• Allow write operations to
  o Arrive out of order
  o Or be processed more than once
• Failures should result in redundant work
  o Not lost work!

Page 36: Big Data (NJ SQL Server User Group)

The Result was FlockDB

• Stores graph data
• Not optimized for graph traversal operations
• Optimized for large adjacency lists
  o List of all edges in a graph
    • Key is the edge, value is a set of the node end points
• Optimized for fast read and write
• Optimized for page-able set arithmetic

Page 37: Big Data (NJ SQL Server User Group)

How Does it Work?

• Stores graphs as sets of edges between nodes
• Data is partitioned by node
  o All queries can be answered by a single partition
• Write operations are idempotent
  o Can be applied multiple times without changing the result
• And commutative
  o Changing the order of operands doesn’t change the result
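One common way to get writes that are both idempotent and commutative is to stamp each write with a timestamp and keep only the newest state per edge. The sketch below assumes that approach, with unique timestamps per edge; it illustrates the two properties and is not a description of FlockDB's actual implementation.

```javascript
// Apply one write: keep it only if it is newer than what is already stored.
// Assumes timestamps are unique per edge, so ties never need breaking.
function applyWrite(store, write) {
  const current = store.get(write.edge);
  if (!current || write.ts > current.ts) {
    store.set(write.edge, { state: write.state, ts: write.ts });
  }
  return store;
}

// Replay a list of writes against an empty store.
function replay(writes) {
  return writes.reduce(applyWrite, new Map());
}

const writes = [
  { edge: "alice->bob", state: "follow", ts: 1 },
  { edge: "alice->bob", state: "unfollow", ts: 2 },
  { edge: "alice->bob", state: "follow", ts: 3 },
];

const inOrder = replay(writes);
// Same writes out of order, with one of them delivered twice.
const shuffled = replay([writes[2], writes[0], writes[1], writes[0]]);

console.log(inOrder.get("alice->bob"));  // { state: 'follow', ts: 3 }
console.log(shuffled.get("alice->bob")); // { state: 'follow', ts: 3 }
```

Because duplicates and reordering leave the final state unchanged, a failed batch can simply be retried: the cost is redundant work, never lost work.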

Page 38: Big Data (NJ SQL Server User Group)

How to Process Big Data

Page 39: Big Data (NJ SQL Server User Group)

ACID

• Atomicity
  o All or Nothing
• Consistency
  o Valid according to all defined rules
• Isolation
  o No transaction should be able to interfere with another transaction
• Durability
  o Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
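Atomicity and consistency can be illustrated with a toy money transfer. This is only a sketch on in-memory objects, with a hypothetical transfer function and a made-up rule that balances may not go negative; a real RDBMS also provides isolation and durability, which are not modelled here.

```javascript
// Move money between two accounts as a single all-or-nothing step.
function transfer(accounts, from, to, amount) {
  const draft = { ...accounts }; // work on a copy, never on the live data
  draft[from] -= amount;
  draft[to] += amount;
  // Consistency rule (assumed for the example): no negative balances.
  if (draft[from] < 0) {
    throw new Error("insufficient funds"); // abort: the copy is discarded
  }
  return draft; // commit: both changes become visible together
}

let accounts = { checking: 100, savings: 0 };
accounts = transfer(accounts, "checking", "savings", 30);
console.log(accounts); // { checking: 70, savings: 30 }

try {
  accounts = transfer(accounts, "checking", "savings", 500);
} catch (err) {
  console.log(err.message); // insufficient funds
}
console.log(accounts); // { checking: 70, savings: 30 }
```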

Page 40: Big Data (NJ SQL Server User Group)

BASE

• Basically Available
  o High availability but not always consistent
• Soft state
  o Background cleanup mechanism
• Eventual consistency
  o Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent.
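Eventual consistency is easy to see in a toy cluster where a write lands on one replica at once and reaches the others later. The Cluster class and its propagate method are hypothetical stand-ins for whatever background replication a real system uses.

```javascript
// A toy cluster: writes are accepted by replica 0 and copied to the rest later.
class Cluster {
  constructor(replicaCount) {
    this.replicas = Array.from({ length: replicaCount }, () => new Map());
    this.pending = []; // updates not yet copied to every replica
  }

  // Basically available: the write succeeds without waiting for other replicas.
  write(key, value) {
    this.replicas[0].set(key, value);
    this.pending.push([key, value]);
  }

  // Reads go to a single replica and may return stale data.
  read(replicaIndex, key) {
    return this.replicas[replicaIndex].get(key);
  }

  // Stand-in for the background process that spreads updates in order.
  propagate() {
    for (const [key, value] of this.pending) {
      for (const replica of this.replicas) replica.set(key, value);
    }
    this.pending = [];
  }
}

const cluster = new Cluster(3);
cluster.write("cart:7", "book");
console.log(cluster.read(0, "cart:7")); // book
console.log(cluster.read(2, "cart:7")); // undefined (stale replica)
cluster.propagate();
console.log(cluster.read(2, "cart:7")); // book (replicas now agree)
```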

Page 41: Big Data (NJ SQL Server User Group)

Traditional (relational) Approach

[Diagram: Transactional Data Store → Extract → Transform → Load → Data Warehouse]

Page 42: Big Data (NJ SQL Server User Group)

Big Data Approach

• MapReduce Pattern/Framework
  o an Input Reader
  o Map Function – To transform to a common shape (format)
  o a partition function
  o a compare function
  o Reduce Function
  o an Output Writer
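The six pieces listed above can be wired together in a few lines. The following is a single-process sketch, assuming a word-count job; the function mapReduce and the property names read, map, partition, compare, reduce, and write are invented for the example and do not correspond to any framework's API.

```javascript
// Run a MapReduce job in memory using the six pluggable pieces.
function mapReduce({ read, map, partition, compare, reduce, write }, partitionCount) {
  // One bucket per partition; each bucket groups values by key.
  const buckets = Array.from({ length: partitionCount }, () => new Map());

  for (const record of read()) {                 // input reader
    for (const [key, value] of map(record)) {    // map to a common shape
      const bucket = buckets[partition(key, partitionCount)]; // partition function
      if (!bucket.has(key)) bucket.set(key, []);
      bucket.get(key).push(value);
    }
  }

  for (const bucket of buckets) {
    for (const key of [...bucket.keys()].sort(compare)) { // compare function
      write(key, reduce(key, bucket.get(key)));  // reduce, then output writer
    }
  }
}

const output = {};
mapReduce({
  read: () => ["big data", "big clusters"],
  map: (line) => line.split(" ").map((word) => [word, 1]),
  partition: (key, n) => key.length % n,
  compare: (a, b) => a.localeCompare(b),
  reduce: (key, values) => values.reduce((sum, v) => sum + v, 0),
  write: (key, value) => { output[key] = value; },
}, 2);

console.log(output.big, output.data, output.clusters); // 2 1 1
```

In a real framework each partition would be reduced on a different machine; here the buckets only mark where that split would happen.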

Page 43: Big Data (NJ SQL Server User Group)

MongoDB Example

> // map function
> m = function(){
...     this.tags.forEach(
...         function(z){
...             emit( z , { count : 1 } );
...         }
...     );
... };

> // reduce function
> r = function( key , values ){
...     var total = 0;
...     for ( var i=0; i<values.length; i++ )
...         total += values[i].count;
...     return { count : total };
... };

> // execute
> res = db.things.mapReduce(m, r, { out : "myoutput" } );
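To see what the map and reduce functions above compute without a MongoDB server, they can be run against a few in-memory documents. In this sketch the sample documents, the stand-in emit function, and the grouping loop are all assumptions that roughly imitate what the database does between the two phases.

```javascript
// Sample documents standing in for the `things` collection (invented data).
const things = [
  { tags: ["dog", "cat"] },
  { tags: ["cat"] },
  { tags: ["mouse", "cat", "dog"] },
];

// Stand-in for the server's emit(): group emitted values by key.
const groups = new Map();
function emit(key, value) {
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(value);
}

// Same map function as on the slide: one { count: 1 } per tag.
const m = function () {
  this.tags.forEach(function (z) {
    emit(z, { count: 1 });
  });
};

// Same reduce function as on the slide: sum the counts for one tag.
const r = function (key, values) {
  let total = 0;
  for (let i = 0; i < values.length; i++) total += values[i].count;
  return { count: total };
};

// Map phase: call m with each document bound to `this`.
things.forEach((doc) => m.call(doc));

// Reduce phase: one call per distinct key.
const result = {};
for (const [key, values] of groups) result[key] = r(key, values);

console.log(result); // { dog: { count: 2 }, cat: { count: 3 }, mouse: { count: 1 } }
```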

Page 44: Big Data (NJ SQL Server User Group)

What is Hadoop?

• A scalable fault-tolerant grid operating system for data storage and processing
• Its scalability comes from the marriage of:
  o HDFS: Self-Healing High-Bandwidth Clustered Storage
  o MapReduce: Fault-Tolerant Distributed Processing

• Operates on unstructured and structured data

• A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)

• Open source under the friendly Apache License

• http://wiki.apache.org/hadoop/

Page 45: Big Data (NJ SQL Server User Group)

Hadoop Design Axioms

1. System Shall Manage and Heal Itself

2. Performance Shall Scale Linearly

3. Compute Should Move to Data

4. Simple Core, Modular and Extensible

Page 46: Big Data (NJ SQL Server User Group)

Hadoop Core Components

Store

HDFS

Self-healing, high-bandwidth clustered storage

Process

Map/Reduce

Fault-tolerant distributed processing

Page 47: Big Data (NJ SQL Server User Group)

HDFS: Hadoop Distributed File System

Block Size = 64MB

Replication Factor = 3

Cost/GB is a few ¢/month vs $/month
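The two settings above translate directly into storage arithmetic. The sketch below takes the slide's values (64 MB blocks, replication factor 3) and a hypothetical helper named hdfsFootprint; it assumes the last block of a file occupies only the bytes it actually holds.

```javascript
const BLOCK_SIZE_MB = 64; // block size from the slide
const REPLICATION = 3;    // replication factor from the slide

// Estimate how a file of the given size is laid out across the cluster.
function hdfsFootprint(fileSizeMb) {
  const blocks = Math.ceil(fileSizeMb / BLOCK_SIZE_MB);
  return {
    blocks,                                  // logical blocks in the file
    blockReplicas: blocks * REPLICATION,     // physical block copies stored
    rawStorageMb: fileSizeMb * REPLICATION,  // disk space consumed in total
  };
}

console.log(hdfsFootprint(1024)); // { blocks: 16, blockReplicas: 48, rawStorageMb: 3072 }
```

So a 1 GB file becomes 16 blocks stored as 48 block copies, which is what lets a lost disk or node be healed from the surviving replicas.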

Page 48: Big Data (NJ SQL Server User Group)

Hadoop Map/Reduce

Page 49: Big Data (NJ SQL Server User Group)

Hadoop Job Architecture

Page 50: Big Data (NJ SQL Server User Group)

Microsoft embraces Hadoop

Good for enterprises & developers

Great for end users!

Page 51: Big Data (NJ SQL Server User Group)

A SEAMLESS OCEAN OF INFORMATION PROCESSING AND ANALYTICS

[Diagram labels. Sources: EIS / ERP, RDBMS, File System, OData [RSS], Azure Storage. Platform: HADOOP [Azure and Enterprise], HDFS, NOSQL, ETL. Data: OCEAN OF DATA [unstructured, semi-structured, structured]. Access: Java OM, Streaming OM, HiveQL, PigLatin, (T)SQL, .NET/C#/F#]

Page 52: Big Data (NJ SQL Server User Group)


Hive Plug-in for Excel

Page 53: Big Data (NJ SQL Server User Group)

THANK YOU