Big Data (NJ SQL Server User Group)


Introduction to Big Data and NoSQL

NJ SQL Server User Group, May 15, 2012

Don Demsak

Advisory Solutions Architect

EMC Consulting

www.donxml.com

Melissa Demsak

SQL Architect

Realogy

www.sqldiva.com

Meet Melissa

• SQL Architect, Realogy
• SqlDiva, Twitter: sqldiva
• Email: melissa@sqldiva.com

Meet Don

• Advisory Solutions Architect, EMC Consulting
• Application Architecture, Development & Design
• DonXml.com, Twitter: donxml
• Email: don@donxml.com
• SlideShare: http://www.slideshare.net/dondemsak

The era of Big Data

How did we get here?
• Expensive:
  o Processors
  o Disk space
  o Memory
  o Operating systems
  o Software
  o Programmers
• Culture of limitations:
  o Limit CPU cycles
  o Limit disk space
  o Limit memory
  o Limited OS development
  o Limited software
  o Programmers
• One language
• One persistence store

Typical RDBMS Implementations

• Fixed table schemas

• Small but frequent reads/writes

• Large batch transactions

• Focus on ACID:
  o Atomicity
  o Consistency
  o Isolation
  o Durability

How we scale RDBMS implementations

1st Step – Build a relational database

[Diagram: a single relational database]

2nd Step – Table Partitioning

[Diagram: the relational database's tables split into partitions p1, p2, p3]

3rd Step – Database Partitioning

[Diagram: each customer (#1, #2, #3) gets its own stack: Browser → Web Tier → B/L Tier → Relational Database; a routing sketch follows]
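Partitioning the database per customer means the business-logic tier has to route every request to the owning customer's database. A minimal sketch of that routing step in JavaScript; the shard layout and connection strings are hypothetical, not from the talk:

// hypothetical map of customer id -> that customer's dedicated database
var shards = {
  1: "Server=sql01;Database=Customer1",
  2: "Server=sql02;Database=Customer2",
  3: "Server=sql03;Database=Customer3"
};

// pick the connection for a request, failing loudly for unknown customers
function connectionFor(customerId) {
  var conn = shards[customerId];
  if (!conn) {
    throw new Error("No shard registered for customer " + customerId);
  }
  return conn;
}

connectionFor(2); // => "Server=sql02;Database=Customer2"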

4th Step – Move to the cloud?

[Diagram: the same per-customer stacks, with each Relational Database replaced by a SQL Azure Federation]

Problems created by too much data

• Where to store it
• How to store it
• How to process it
• Organization, searching, and metadata
• How to manage access
• How to copy, move, and back up
• Lifecycle

Polyglot Programmer

Polyglot Persistence

(how to store)

• Atlanta 2009: the no:sql(east) conference

select fun, profit from real_world where relational=false

• Billed as a “conference of no-rel datastores”

(loose) Definition
• (often) Open source
• Non-relational
• Distributed
• (often) Does not guarantee ACID

Types Of NoSQL Data Stores

5 Groups of Data Models

Relational

Document

Key Value

Graph

Column Family

Document?
• Think of a web page...
  o The relational model requires a column/tag for every attribute
  o Lots of empty columns
  o Wasted space and processing time
• The document model just stores the page as is
  o Saves on space
  o Very flexible
• Document databases (see the sketch below):
  o Apache Jackrabbit
  o CouchDB
  o MongoDB
  o SimpleDB
  o XML databases: MarkLogic Server, eXist
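As a rough illustration of “store the page as is”, here is how an arbitrarily shaped document goes into and comes out of the MongoDB shell; the collection and field names are made up for the example:

// insert the page as-is; no schema, and fields can differ per document
db.pages.insert({
  url: "http://example.com/about",   // hypothetical page
  title: "About Us",
  tags: ["company", "history"]       // arrays need no join table
});

// find every page carrying a given tag
db.pages.find({ tags: "company" });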

Key/Value Stores
• Simple index on key
• Value can be any serialized form of data
• Lots of different implementations:
  o Eventually consistent: “If no updates occur for a period, eventually all updates will propagate through the system and all replicas will be consistent”
  o Cached in RAM
  o Cached on disk
  o Distributed hash tables
• Examples (see the sketch below):
  o Azure AppFabric Cache
  o Memcached
  o VMware vFabric GemFire
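At its core a key/value store is a dictionary with a single index on the key and an opaque, serialized value. A toy in-memory sketch in JavaScript, illustrative only and not the API of any product listed above:

var store = {}; // the whole database: values indexed by one string key

function put(key, value) {
  store[key] = JSON.stringify(value); // value is saved as an opaque serialized blob
}

function get(key) {
  var raw = store[key];
  return raw === undefined ? null : JSON.parse(raw);
}

put("user:42", { name: "Melissa", role: "SQL Architect" });
get("user:42"); // => { name: "Melissa", role: "SQL Architect" }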

Graph?
• A graph consists of:
  o Nodes (the ‘stations’ of the graph)
  o Edges (the lines between them)
• Graph stores (see the sketch below):
  o AllegroGraph
  o Core Data
  o Neo4j
  o DEX
  o FlockDB: created by the Twitter folks; nodes = users, edges = the nature of the relationship between nodes
  o Microsoft Trinity (research project): http://research.microsoft.com/en-us/projects/trinity/
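Underneath, a graph store keeps nodes and the edges between them. A tiny adjacency-list sketch in JavaScript with Twitter-style “follows” edges; the user names are made up:

// each key is a node (user); the value lists the nodes its edges point to
var follows = {
  don: ["melissa"],
  melissa: ["don"],
  twitterdev: []
};

// add an edge from one node to another (no-op if it already exists)
function addEdge(from, to) {
  if (follows[from].indexOf(to) === -1) {
    follows[from].push(to);
  }
}

addEdge("twitterdev", "don");
follows["twitterdev"]; // => ["don"]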

Column Family?
• Lots of variants:
  o Object stores: Db4o, GemStone/S, InterSystems Caché, Objectivity/DB, ZODB
  o Tabular: BigTable, Mnesia, HBase, Hypertable, Azure Table Storage
  o Column-oriented: Greenplum, Microsoft SQL Server 2012 (columnstore indexes)

Okay, Got It. Now Let’s Compare Some Real-World Scenarios


You Need Constant Consistency

• You’re dealing with financial transactions
• You’re dealing with medical records
• You’re dealing with bonded goods
• Best to use an RDBMS


You Need Horizontal Scalability

• You’re working across defined time zones
• You’re aggregating large quantities of data
• You’re maintaining a chat server (think Facebook chat)
• Use column family storage


Frequently Written Rarely Read

• Think web counters and the like
• Every time a user comes to a page = ctr++
• But it’s only read when the report is run
• Use key-value storage (see the sketch below)
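For this write-heavy counter pattern, an atomic increment against a key-value-style store avoids a read on every hit. A sketch in the MongoDB shell; the collection and field names are made up, and the upsert option creates the counter on the first visit:

// one document per page; every page view is a single atomic write
db.counters.update(
  { _id: "/products/42" },   // the key: a hypothetical page URL
  { $inc: { views: 1 } },    // increment in place, no read required
  { upsert: true }           // create the document on the first hit
);

// read only when the report is run
db.counters.find();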


Here Today Gone Tomorrow

• Transient data like...
  o Web sessions
  o Locks
  o Short-term stats
• Shopping cart contents
• Use key-value storage

Where to store
• RAM
  o Fast
  o Expensive
  o Volatile
• Parallel file system
  o HDFS (Hadoop)
  o Auto-replicated for parallel, decentralized I/O
• Local disk
  o SSD: super fast
  o Fast spinning disks (7200+ RPM)
  o High bandwidth possible
  o Persistent
• SAN (Storage Area Network)
  o Fully managed
  o Expensive
• Cloud
  o Amazon
  o Box.net
  o Dropbox

Big Data

Big Data Definition

Volume

• Beyond what traditional environments can handle

Velocity

• Need decisions fast

Variety

• Many formats

Additional Big Data Concepts
• Volumes & volumes of data

• Unstructured

• Semi-structured

• Not suited for Relational Databases

• Often utilizes MapReduce frameworks

Big Data Examples
• Cassandra

• Hadoop

• Greenplum

• Azure Storage

• EMC Atmos

• Amazon S3

• SQL Azure (with Federations support)?

Real World Example

• Twitter
  o The challenges: Twitter needs to store many graphs (who you are following, who’s following you, who you receive phone notifications from, etc.)
  o Delivering a tweet requires rapid paging of followers
  o Heavy write load as followers are added and removed
  o Set arithmetic for @mentions (the intersection of users)

What did they try?
• Started with relational databases
• Tried key-value storage of denormalized lists
• Did it work?
  o Nope: each approach was either good at handling the write load or at paging large amounts of data, but not both

What did they need?

• Simplest possible thing that would work

• Allow for horizontal partitioning

• Allow write operations to:
  o Arrive out of order
  o Or be processed more than once
• Failures should result in redundant work, not lost work!

The Result Was FlockDB
• Stores graph data
• Not optimized for graph-traversal operations
• Optimized for large adjacency lists
  o A list of all the edges in a graph
  o The key is the edge; the value is a set of the node end points
• Optimized for fast reads and writes
• Optimized for page-able set arithmetic

How Does It Work?
• Stores graphs as sets of edges between nodes
• Data is partitioned by node
  o All queries can be answered by a single partition
• Write operations are idempotent
  o Can be applied multiple times without changing the result
• And commutative
  o Changing the order of operands doesn’t change the result
(see the sketch below)
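A toy illustration, not FlockDB code, of why idempotent writes make retried and reordered operations safe, sketched in JavaScript as a set of follower edges:

var followers = {}; // adjacency list for one node, stored as a set

function addEdge(userId) {
  followers[userId] = true; // idempotent: applying it twice equals applying it once
}
function removeEdge(userId) {
  delete followers[userId]; // also idempotent
}

// a duplicated stream of writes, as after failure-driven retries...
addEdge(7); addEdge(9); addEdge(7); removeEdge(3); addEdge(7);

// ...still converges to the same final state
Object.keys(followers); // => ["7", "9"]

// page-able set arithmetic, e.g. for @mentions: the intersection of two sets
function intersect(a, b) {
  return Object.keys(a).filter(function (k) { return b[k]; });
}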

How to Process Big Data

ACID
• Atomicity
  o All or nothing
• Consistency
  o Valid according to all defined rules
• Isolation
  o No transaction should be able to interfere with another transaction
• Durability
  o Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors

BASE
• Basically available
  o High availability, but not always consistent
• Soft state
  o Background cleanup mechanism
• Eventual consistency
  o Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent

Traditional (relational) Approach

[Diagram: Extract → Transform → Load (ETL), moving data from the transactional data store into the data warehouse]

Big Data Approach

• MapReduce pattern/framework:
  o An input reader
  o A map function, to transform the input to a common shape (format)
  o A partition function
  o A compare function
  o A reduce function
  o An output writer

MongoDB Example

// map function: emit each tag on a document with a count of 1
m = function () {
  this.tags.forEach(function (z) {
    emit(z, { count: 1 });
  });
};

// reduce function: sum all the counts emitted for one tag
r = function (key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++)
    total += values[i].count;
  return { count: total };
};

// execute, writing the results to the "myoutput" collection
res = db.things.mapReduce(m, r, { out: "myoutput" });
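After the job runs, db.myoutput.find() returns one result document per tag, shaped like { _id : "nosql", value : { count : 12 } } (the tag name and count here are invented for illustration).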

What is Hadoop?
• A scalable, fault-tolerant grid operating system for data storage and processing
• Its scalability comes from the marriage of:
  o HDFS: self-healing, high-bandwidth clustered storage
  o MapReduce: fault-tolerant distributed processing

• Operates on unstructured and structured data

• A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)

• Open source under the friendly Apache License

• http://wiki.apache.org/hadoop/

Hadoop Design Axioms

1. System Shall Manage and Heal Itself

2. Performance Shall Scale Linearly

3. Compute Should Move to Data

4. Simple Core, Modular and Extensible

Hadoop Core Components

• Store: HDFS (self-healing, high-bandwidth clustered storage)
• Process: Map/Reduce (fault-tolerant distributed processing)

HDFS: Hadoop Distributed File System

• Block size = 64 MB
• Replication factor = 3
• Cost/GB is a few ¢/month vs. $/month

Hadoop Map/Reduce

Hadoop Job Architecture

Microsoft embraces Hadoop

Good for enterprises & developers

Great for end users!

A SEAMLESS OCEAN OF INFORMATION PROCESSING AND ANALYTICS

[Diagram: Hadoop (Azure and enterprise) sits over an ocean of data (unstructured, semi-structured, structured) drawn from EIS/ERP, RDBMS, file systems, OData (RSS), Azure Storage, HDFS, NoSQL stores, and ETL feeds, programmable through the Java OM, Streaming OM, HiveQL, Pig Latin, (T)SQL, and .NET/C#/F#]


Hive Plug-in for Excel

THANK YOU