Big Data and NoSQL in Microsoft-Land

SQL Server Live! Orlando 2012

SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 1

Big Data and NoSQL in Microsoft-Land

Andrew Brust and Lynn Langit
Blue Badge Insights & Data Wrangler

Level: Intermediate

Meet Andrew

• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC
– http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
• brustblog.com, Twitter: @andrewbrust

Andrew’s New Blog (bit.ly/bigondata)

Meet Lynn

• CEO and Founder, Lynn Langit consulting
• Former Microsoft Evangelist (4 years)
• Google Developer Expert
• MongoDB Master
• MCT 13 years – 7 certifications
• Cloudera Certified Developer
• MSDN Magazine articles
– SQL Azure
– Hadoop on Azure
– MongoDB on Azure
• www.LynnLangit.com
• @LynnLangit

Lynn’s YouTube Channel

www.TeachingKidsProgramming.org
• Free Courseware (recipes)
• Do a Recipe Teach a Kid (Ages 10++)
• Java or Microsoft SmallBasic

Read all about it!

Agenda

• Overview / Landscape
– Big Data and Hadoop
– NoSQL
– The Big Data-NoSQL Intersection
• Drilldown on Big Data
• Drilldown on NoSQL

What is Big Data?

• 100s of TB into PB and higher
• Involving data from: financial data, sensors, web logs, social media, etc.
• Parallel processing often involved
– Hadoop is emblematic, but other technologies are Big Data too
• Processing of data sets too large for transactional databases
– Analyzing interactions, rather than transactions
– The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on small data problems

BigData = Exponentially More Data

• Retail Example -> ‘Feedback Economy’
– Number of transactions
– Number of behaviors (collected every minute)

BigData = ‘Next State’ Questions

• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?

Collecting Behavioral data

What’s MapReduce?

• “Big” input data as key-value pair series

• Partition the data and send to mappers (nodes in cluster)

• Mappers pre-process, put into key-value format, and send all output for a given (set of) key(s) to a reducer

• Reducer aggregates; one output per key, with value

• Map and Reduce code natively written as Java functions
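
The map, shuffle, and reduce steps above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not Hadoop's Java API: the function names (mapper, shuffle, reducer, run_job) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit one (word, 1) key-value pair per word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key so that each key is handled by one reducer."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: aggregate to a single output value per key."""
    return key, sum(values)


def run_job(partitions: List[List[str]]) -> Dict[str, int]:
    """Map every partition, shuffle the mapped pairs, then reduce per key."""
    mapped = [pair for part in partitions for line in part for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    # Two partitions stand in for the input splits sent to two mapper nodes.
    splits = [["big data big"], ["data and nosql"]]
    print(run_job(splits))
```

On a real cluster the partitions live on different nodes and the shuffle moves data over the network; here everything runs in one process so the shape of the computation is easy to see.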

MapReduce, in a Diagram

[Diagram: each Input split goes to a mapper; mapper Output is grouped by key (K1, K2, K3) and routed as Input to one reducer per key; each reducer writes its Output.]

A MapReduce Example

• Count by suite, on each floor

• Send per-suite, per platform totals to lobby

• Sort totals by platform

• Send two platform packets to 10th, 20th, 30th floor

• Tally up each platform

• Merge tallies into one spreadsheet

• Collect the tallies

What’s a Distributed File System?

• One where data gets distributed over commodity drives on commodity servers

• Data is replicated

• If one box goes down, no data lost
– “Shared Nothing”
• BUT: Immutable
– Files can only be written to once
– So updates require drop + re-write (slow)
– You can append though
– Like a DVD/CD-ROM
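
A rough sketch of why replication means a single failed box loses no data. The round-robin placement below is a deliberate simplification and an assumption of this example (real HDFS placement is rack-aware); the function and node names are hypothetical.

```python
from typing import Dict, List, Set


def place_blocks(blocks: List[str], nodes: List[str], replicas: int = 3) -> Dict[str, Set[str]]:
    """Assign each block to `replicas` distinct nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement: Dict[str, Set[str]] = {}
    for i, block in enumerate(blocks):
        placement[block] = {nodes[(i + r) % len(nodes)] for r in range(replicas)}
    return placement


def readable_after_failure(placement: Dict[str, Set[str]], failed: str) -> bool:
    """True if every block still has a copy on some surviving node."""
    return all(holders - {failed} for holders in placement.values())


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_blocks(["b0", "b1", "b2", "b3"], nodes, replicas=2)
    for node in nodes:
        print(node, "fails -> all blocks readable:", readable_after_failure(placement, node))
```

With one replica per block the same check fails for any node that holds data, which is the case replication is there to prevent.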

Hadoop = MapReduce + HDFS

• Modeled after Google MapReduce + GFS

• Have more data? Just add more nodes to cluster.
– Mappers execute in parallel
– Hardware is commodity
– “Scaling out”

• Use of HDFS means data may well be local to mapper processing

• So, not just parallel, but minimal data movement, which avoids network bottlenecks

Example Comparison: RDBMS vs. Hadoop

                     Traditional RDBMS          Hadoop / MapReduce
Data Size            Gigabytes (Terabytes)      Petabytes (Exabytes)
Access               Interactive and Batch      Batch – NOT Interactive
Updates              Read / Write many times    Write once, Read many times
Structure            Static Schema              Dynamic Schema
Integrity            High (ACID)                Low
Scaling              Nonlinear                  Linear
Query Response Time  Can be near immediate      Has latency (due to batch processing)

Just-in-time Schema

• When looking at unstructured data, schema is imposed at query time

• Schema is context specific
– If scanning a book, are the values words, lines, or pages?
– Are notes a single field, or is each word a value?
– Are date and time two fields or one?
– Are street, city, state, zip separate or one value?
– Pig and Hive let you determine this at query time
– So does the Map function in MapReduce code
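
One way to picture schema imposed at query time: the raw bytes stay the same, and each query supplies its own parsing function. The sketch below is illustrative only; the sample text, the log record, and the function names are assumptions of this example.

```python
from typing import Callable, List, Tuple

RAW_TEXT = "call me ishmael\nsome years ago"
RAW_EVENT = "2012-12-10 09:30:00\tlogin"


def as_lines(raw: str) -> List[str]:
    """Schema choice 1: every line is one value."""
    return raw.splitlines()


def as_words(raw: str) -> List[str]:
    """Schema choice 2: every word is one value."""
    return raw.split()


def count_values(raw: str, schema: Callable[[str], List[str]]) -> int:
    """The same stored data yields different counts under different schemas."""
    return len(schema(raw))


def timestamp_as_one_field(record: str) -> Tuple[str, str]:
    """Treat date and time as a single field: (timestamp, action)."""
    timestamp, action = record.split("\t")
    return timestamp, action


def timestamp_as_two_fields(record: str) -> Tuple[str, str, str]:
    """Treat date and time as separate fields: (date, time, action)."""
    timestamp, action = record.split("\t")
    date, time = timestamp.split(" ")
    return date, time, action


if __name__ == "__main__":
    print(count_values(RAW_TEXT, as_lines), count_values(RAW_TEXT, as_words))
    print(timestamp_as_one_field(RAW_EVENT))
    print(timestamp_as_two_fields(RAW_EVENT))
```

Nothing about the stored data changes between the two readings; only the function applied at read time does.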

What’s HBase?

• A Wide-Column Store NoSQL database

• Modeled after Google BigTable

• Uses HDFS
– Therefore, Hadoop-compatible
• Hadoop often used with HBase
– But you can use either without the other

NoSQL Confusion

• Many ‘flavors’ of NoSQL data stores
• Easiest to group by functionality, but…
– Dividing lines are not clear or consistent
• NoSQL choice(s) driven by many factors
– Type of data
– Quantity of data
– Knowledge of technical staff
– Product maturity
– Tooling

So much wrong information

• Everything is ‘new’
• People are religious about data storage
• Lots of incorrect information
• ‘Try’ before you ‘buy’ (or use)
• Watch out for oversimplification
• Confusion over vendor offerings

Common NoSQL Misconceptions

Problems

• Everything is ‘new’
• People are religious about data storage
• Open source is always cheaper
• Cloud is always cheaper
• Replace RDBMS with NoSQL

Solutions

• ‘Try’ before you ‘buy’ (or use)
• Leverage NoSQL communities
• Add NoSQL to existing RDBMS solution

NoSQL + Big Data

• HBase and Cassandra work with Hadoop, are NoSQL databases
• MongoDB brands itself a Big Data technology
• Couchbase does too
• Just-in-time schema
• MapReduce in MongoDB, others
• Hadoop and most NoSQL DBs are partitioned, scale-out technologies
• It’s all about analytics on semi- or unstructured data

DRILLDOWN ON BIG DATA

The Hadoop Stack

MapReduce, HDFS

Database

RDBMS Import/Export

Query: HiveQL and Pig Latin

Machine Learning/Data Mining

Log file integration

What’s Hive?

• Began as Hadoop sub-project
– Now top-level Apache project

• Provides a SQL-like (“HiveQL”) abstraction over MapReduce

• Has its own HDFS table file format (and it’s fully schema-bound)

• Can also work over HBase

• Acts as a bridge to many BI products which expect tabular data

Hadoop Distributions

• Cloudera
• Hortonworks
– HCatalog: Hive/Pig/MR Interop
• MapR
– Network File System replaces HDFS
• IBM InfoSphere BigInsights
– HDFS<->DB2 integration
• And now Microsoft…

Microsoft HDInsight

• Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows

• Windows Azure HDInsight and Microsoft HDInsight (for Windows Server)
– Single node preview runs on Windows client
• Includes ODBC Driver for Hive
– And Excel Add-In that uses it

• JavaScript MapReduce framework

• Contribute it all back to open source Apache Project

Amenities for Visual Studio/.NET

• Hortonworks Data Platform for Windows
• MRLib (NuGet Package)
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• Debugging
• MR code in C#, HadoopJob, MapperBase, ReducerBase

Some ways to work

• Microsoft HDInsight
– Cloud: go to www.hadooponazure.com, request invite
– Local: Download Microsoft HDInsight
  Runs on just about anything, including Windows XP
  Get it via the Web Platform installer (WebPI)
– Both are free for now; Azure HDInsight will be fee-based when RTM
• Amazon Web Services Elastic MapReduce
– Create AWS account
– Select Elastic MapReduce in Dashboard
– Cheap for experimenting, but not free
• Cloudera CDH VM image
– Download as .tar.gz file
– “Un-tar” (can use WinRAR, 7zip)
– Run via VMWare Player or Virtual Box
– Everything’s free

Some ways to work

HDInsight EMR CDH 4

Microsoft HDInsight

• Much simpler than the others
• Browser-based portal
– Launch MapReduce jobs
– Azure: Provisioning cluster, managing ports, gather external data
• Interactive JavaScript & Hive console
– JS: HDFS, Pig, light data visualization
– Hive commands and metadata discovery
– New console coming
• Desktop Shortcuts:
– Command window, MapReduce, Name Node status in browser
– Azure: from portal page you can RDP directly to Hadoop head node for these desktop shortcuts

Windows Azure HDInsight

Amazon Elastic MapReduce

• Lots of steps!

• At a high level:
– Set up AWS account and S3 “buckets”
– Generate Key Pair and PEM file
– Install Ruby and EMR Command Line Interface
– Provision the cluster using CLI
  A batch file can work very well here
– Set up and run SSH/PuTTY
– Work interactively at command line

Amazon EMR – Prep Steps

• Create an AWS account

• Create an S3 bucket for log storage
– with list permissions for authenticated users
• Create a Key Pair and save PEM file
• Install Ruby
• Install Amazon Web Services Elastic MapReduce Command Line Interface
– aka AWS EMR CLI
• Create credentials.json in EMR CLI folder
– Associate with same region as where key pair created

Amazon – Security and Startup

• Security
– Download PuTTYgen and run it
– Click Load and browse to PEM file
– Save it in PPK format
– Exit PuTTYgen
• In a command window, navigate to EMR CLI folder and enter command:
– ruby elastic-mapreduce --create --alive [--num-instances xx] [--pig-interactive] [--hive-interactive] [--hbase --instance-type m1.large]
• In AWS Console, go to EC2 Dashboard and click Instances on left nav bar
• Wait until instance is running and get its Public DNS name
– Use Compatibility View in IE or copy may not work

Connect!

• Download and run PuTTY
• Paste DNS name of EC2 instance into hostname field
• In Treeview, drill down and navigate to Connection\SSH\Auth, browse to PPK file
• Once EC2 instance(s) running, click Open
• Click Yes to “The server’s host key is not cached in the registry…” PuTTY Security Alert
• When prompted for user name, type “hadoop” and hit Enter
• cd bin, then hive, pig, hbase shell
• Right-click to paste from clipboard; option to go full-screen
• (Kill EC2 instance(s) from Dashboard when done)

Amazon Elastic MapReduce

Cloudera CDH4 Virtual Machine

• Get it for free, in VMWare and Virtual Box versions.
– VMWare player and Virtual Box are free too
• Run it, and configure it to have its own IP on your network. Use ifconfig to discover IP.
• Assuming IP of 192.168.1.59, open browser on your own (host) machine and navigate to:
– http://192.168.1.59:8888
• Can also use browser in VM and hit:
– http://localhost:8888
• Work in “Hue”…

Hue

• Browser based UI, with front ends for:
– HDFS (w/ upload & download)
– MapReduce job creation and monitoring
– Hive (“Beeswax”)
• And in-browser command line shells for:
– HBase
– Pig (“Grunt”)

Impala: What it Is

• Distributed SQL query engine over Hadoop cluster

• Announced at Strata/Hadoop World in NYC on October 24th

• In Beta, as part of CDH 4.1

• Works with HDFS and Hive data

• Compatible with HiveQL and Hive drivers
– Query with Beeswax

Impala: What it’s Not

• Impala is not Hive
– Hive converts HiveQL to Java MapReduce code and executes it in batch mode
– Impala executes query interactively over the data
– Brings BI tools and Hadoop closer together
• Impala is not an Apache Software Foundation project
– It is open source and Apache-licensed, but it’s still incubated by Cloudera
– Only in CDH

Cloudera CDH4

Hadoop commands

• HDFS
– hadoop fs -filecommand
– Create and remove directories: mkdir, rm, rmr
– Upload and download files to/from HDFS: get, put
– View directory contents: ls, lsr
– Copy, move, view files: cp, mv, cat
• MapReduce
– Run a Java jar-file based job: hadoop jar jarname params
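
A small sketch of how these commands chain together in a typical session, written as a Python helper that only builds and prints the command lines. All paths, the jar name, and the output file name are hypothetical placeholders; the helper names are assumptions of this example.

```python
import shlex
from typing import List


def hdfs_command(file_command: str, *args: str) -> List[str]:
    """Build the argument list for `hadoop fs -<file_command> <args...>`."""
    return ["hadoop", "fs", f"-{file_command}", *args]


def jar_command(jar_name: str, *params: str) -> List[str]:
    """Build the argument list for `hadoop jar <jar_name> <params...>`."""
    return ["hadoop", "jar", jar_name, *params]


if __name__ == "__main__":
    steps = [
        hdfs_command("mkdir", "/user/demo/input"),               # hypothetical path
        hdfs_command("put", "weblog.txt", "/user/demo/input"),   # hypothetical file
        hdfs_command("ls", "/user/demo/input"),
        jar_command("wordcount.jar", "/user/demo/input", "/user/demo/output"),
        hdfs_command("cat", "/user/demo/output/part-00000"),     # output name varies by job
    ]
    for step in steps:
        # Printed rather than executed; on a machine with the hadoop CLI
        # installed, each list could be passed to subprocess.run instead.
        print(shlex.join(step))
```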

Hadoop (directly)

HBase

• Concepts:
– Tables, column families
– Columns, rows
– Keys, values
• Commands:
– Definition: create, alter, drop, truncate
– Manipulation: get, put, delete, deleteall, scan
– Discovery: list, exists, describe, count
– Enablement: disable, enable
– Utilities: version, status, shutdown, exit
– Reference: http://wiki.apache.org/hadoop/Hbase/Shell
• Moreover,
– Interesting HBase work can be done in MapReduce, Pig

HBase Examples

• create 't1', 'f1', 'f2', 'f3'

• describe 't1'

• alter 't1', {NAME => 'f1', VERSIONS => 5}

• put 't1', 'r1', 'f1:c1', 'value'

• get 't1', 'r1'

• count 't1'
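
To make the table / column family / row / version vocabulary concrete, here is a toy in-memory model in Python. It is not the HBase client API; the class name TinyTable and its methods are inventions for this sketch. It assumes that a column is addressed as family:qualifier and that the family must have been declared when the table was created.

```python
from collections import defaultdict
from typing import Dict, List


class TinyTable:
    """Toy model: row key -> 'family:qualifier' -> list of versions."""

    def __init__(self, *families: str, versions: int = 1) -> None:
        self.families = set(families)
        self.versions = versions
        self.rows: Dict[str, Dict[str, List[str]]] = defaultdict(dict)

    def put(self, row: str, column: str, value: str) -> None:
        """Store a value; the column's family must already exist."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = self.rows[row].setdefault(column, [])
        cell.insert(0, value)      # newest version first
        del cell[self.versions:]   # keep at most `versions` values per cell

    def get(self, row: str) -> Dict[str, str]:
        """Return the newest value of every column stored for the row."""
        return {col: vals[0] for col, vals in self.rows.get(row, {}).items()}

    def count(self) -> int:
        """Number of rows that hold at least one cell."""
        return len(self.rows)


if __name__ == "__main__":
    t1 = TinyTable("f1", "f2", "f3", versions=5)
    t1.put("r1", "f1:c1", "value")
    print(t1.get("r1"), t1.count())
```

Rows are sparse: a row only carries the columns that were actually written, which is the "wide, sparse" property of this family of stores.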

HBase

Submitting, Running and Monitoring Jobs

• Upload a JAR

• Use Streaming
– Use other languages (i.e. other than Java) to write MapReduce code

– Python is popular option

– Any executable works, even C# console apps

– On MS HDInsight, JavaScript works too

– Still uses a JAR file: streaming.jar

• Run at command line (passing JAR name and params) or use GUI
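
Since Python is named as a popular Streaming option, here is a minimal word-count sketch in that style. Streaming passes records to the script on stdin and reads tab-separated key/value lines from stdout, and the reducer sees its input sorted by key. The single-file layout with a "map" or "reduce" argument is an assumption of this example, not a Streaming requirement.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit 'word<TAB>1' for every word read from the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Hypothetical convention: run as "wordcount.py map" or "wordcount.py reduce".
    step = map_lines if sys.argv[1:] == ["map"] else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The script can be tried locally by piping a text file through the map step, a sort, and the reduce step; on a cluster the same script would be named in the streaming jar's -mapper and -reducer options.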

Running MapReduce Jobs

Hive

• Used by most BI products which connect to Hadoop

• Provides a SQL-like abstraction over Hadoop
– Officially HiveQL, or HQL

• Works on own tables, but also on HBase

• Query generates MapReduce job, output of which becomes result set

• Microsoft has Hive ODBC driver
– Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)

Hive, Continued

• Load data from flat HDFS files
– LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable;
• SQL Queries
– CREATE, ALTER, DROP

– INSERT OVERWRITE (creates whole tables)

– SELECT, JOIN, WHERE, GROUP BY

– SORT BY, but ordering data is tricky!

– MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps utilizing Java or streaming code
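
As an illustration of the streaming side of TRANSFORM…USING, the sketch below is a Python script of the kind Hive can pipe rows through: rows arrive on stdin as tab-separated text and result rows are written the same way. The two-column input (user_id, url), the script name, and the table in the usage note are hypothetical.

```python
import sys
from typing import Iterable, Iterator


def transform(rows: Iterable[str]) -> Iterator[str]:
    """Turn 'user_id<TAB>url' rows into 'user_id<TAB>host' rows."""
    for row in rows:
        user_id, url = row.rstrip("\n").split("\t")
        # Drop an optional scheme such as "http://", then keep the host part.
        host = url.split("//", 1)[-1].split("/", 1)[0]
        yield f"{user_id}\t{host}"


if __name__ == "__main__":
    for out in transform(sys.stdin):
        print(out)
```

Assuming the script is saved as to_host.py and registered with ADD FILE, a query along the lines of `SELECT TRANSFORM (user_id, url) USING 'python to_host.py' AS (user_id, host) FROM clicks;` would run it as the custom map step.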

Excel Add-In for Hive

Hive

Pig

• Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow expressions
– Do a combo of Query and ETL
• “10 lines of Pig Latin ≈ 200 lines of Java.”
• Works with structured or unstructured data
• Operations
– As with Hive, a MapReduce job is generated
– Unlike Hive, output is only flat file to HDFS or text at command line console
– With MS Hadoop, can easily convert to JavaScript array, then manipulate
• Use command line (“Grunt”) or build scripts

Example

A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
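
For readers more at home in a general-purpose language, the intent of this data flow (keep rows with positive x, group by x, count each group) can be sketched in plain Python. This runs in memory on a list of tuples and only mirrors the logic; the function name and sample rows are assumptions of this example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Row = Tuple[int, int, int]  # (x, y, z)


def count_positive_by_x(rows: Iterable[Row]) -> Dict[int, int]:
    """Filter rows on x > 0, group them by x, and count each group."""
    filtered = (row for row in rows if row[0] > 0)    # the FILTER step
    return dict(Counter(row[0] for row in filtered))  # GROUP, then COUNT per group


if __name__ == "__main__":
    data = [(1, 2, 3), (1, 5, 6), (-1, 0, 0), (2, 0, 0)]
    print(count_positive_by_x(data))
```

Pig's advantage is that the same few statements are compiled into MapReduce jobs and run across the cluster, whereas this sketch is limited to what fits in one process.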

Pig Latin Examples

• Imperative, file system commands
– LOAD, STORE
  Schema specified on LOAD
• Declarative, query commands (SQL-like)
– xxx = file or data set
– FOREACH xxx GENERATE (SELECT…FROM xxx)
– JOIN (WHERE/INNER JOIN)
– FILTER xxx BY (WHERE)
– ORDER xxx BY (ORDER BY)
– GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*) GROUP BY)
– DISTINCT (SELECT DISTINCT)
• Syntax is assignment statement-based:
– MyCusts = FILTER Custs BY SalesPerson == 15;
• Access HBase
– CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey -returnTuple');

Pig

Sqoop

sqoop import \
  --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" \
  --table <from_table> \
  --target-dir <to_hdfs_folder> \
  --split-by <from_table_column>

Sqoop

sqoop export \
  --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" \
  --table <to_table> \
  --export-dir <from_hdfs_folder> \
  --input-fields-terminated-by "<delimiter>"

Flume NG

• Source
– Avro (data serialization system – can read JSON-encoded data files, and can work over RPC)
– Exec (reads from stdout of long-running process)
• Sinks
– HDFS, HBase, Avro
• Channels
– Memory, JDBC, file

Flume NG (next generation)

• Set up conf/flume.conf

# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414

# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1

• From the command line:

flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1

Mahout Algorithms

• Recommendation
– Your info + community info
– Give users/items/ratings; get user-user/item-item
– itemsimilarity
• Classification/Categorization
– Drop into buckets
– Naïve Bayes, Complementary Naïve Bayes, Decision Forests
• Clustering
– Like classification, but with categories unknown
– K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift

Workflow, Syntax

• Workflow
– Run the job
– Dump the output
– Visualize, predict
• mahout algorithm --input folderspec --output folderspec --param1 value1 --param2 value2 …
• Example:
– mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD
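
To give a feel for what an item-item similarity job produces, here is a small in-memory sketch. It swaps in Jaccard similarity (shared users divided by all users of either item) because it is easy to verify by hand; it is not Mahout's log-likelihood measure, and the function name and sample data are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def item_similarity(ratings: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard similarity for every item pair that shares at least one user."""
    users_by_item: Dict[str, Set[str]] = defaultdict(set)
    for user, item in ratings:
        users_by_item[item].add(user)

    scores: Dict[Tuple[str, str], float] = {}
    for a, b in combinations(sorted(users_by_item), 2):
        shared = users_by_item[a] & users_by_item[b]
        if shared:
            scores[(a, b)] = len(shared) / len(users_by_item[a] | users_by_item[b])
    return scores


if __name__ == "__main__":
    # (user, item) pairs: which user interacted with which item.
    pairs = [("u1", "book"), ("u1", "film"), ("u2", "book"), ("u2", "film"), ("u3", "book")]
    for (a, b), score in item_similarity(pairs).items():
        print(a, b, round(score, 3))
```

The output is a list of item pairs with scores, which still needs interpretation or visualization before it becomes a usable recommendation.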

The Truth About Mahout• Mahout is really just an algorithm engine

• Its output is almost unusable by non-statisticians/non-data scientists

• You need a staff or a product to visualize, or make into a usable prediction model

• Investigate Predixion Software
– CTO, Jamie MacLennan, used to lead SQL Server Data Mining team

– Excel add-in can use Mahout remotely, visualize its output, run predictive analyses

– Also integrates with SQL Server, Greenplum, MapReduce

– http://www.predixionsoftware.com

The “Data-Refinery” Idea

• Use Hadoop to “on-board” unstructured data, then extract manageable subsets

• Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them

• This is the current rationalization of Hadoop + BI tools’ coexistence

• Will it stay this way?

DRILLDOWN ON NOSQL

Hitting (Relational) Walls

• CA
– Highly-available consistency
• CP
– Enforced consistency
• AP
– Eventual consistency

The reality…two pivots

Storage Methods
• SQL (RDBMS)
• NoSQL

Storage Locations
• On premises
• Cloud-hosted

So many NoSQL options

• More than just the Elephant in the room
• 120+ types of NoSQL databases

Flavors of NoSQL

Graph Database

• Use for data with
– a lot of many-to-many relationships
– recursive self-joins
• Use when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
• Examples: Neo4J, FreeBase (Google)

Column Database

• Wide, sparse column sets

• Schema-light

• Examples:
– Cassandra
– HBase
– BigTable
– GAE HR DS

More about Column Databases

• Type A
– Column-families
– Non-relational
– Sparse
– Examples: HBase, Cassandra, xVelocity (SQL 2012 BISM)
• Type B
– Column-stores
– Relational
– Dense
– Example: SQL Server 2012 Columnstore index

Demo - Document Database (MongoDB)

• Use for data that is
– document-oriented (collection of JSON documents) w/semi-structured data
  Encodings include XML, YAML, JSON & BSON
– binary forms (PDF, Microsoft Office documents -- Word, Excel…)
• Examples: MongoDB, CouchDB
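
A short sketch of what "semi-structured" means in practice: documents in one collection need not share the same fields, and a query simply skips documents that lack the field it asks about. This uses plain Python dictionaries rather than a MongoDB driver; the find helper and the sample documents are assumptions of this example.

```python
import json
from typing import Any, Dict, List

Document = Dict[str, Any]


def find(collection: List[Document], **criteria: Any) -> List[Document]:
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


if __name__ == "__main__":
    people = [
        {"name": "Ada", "city": "NYC", "tags": ["bi", "hadoop"]},
        {"name": "Lin", "city": "NYC"},       # no "tags" field
        {"name": "Sam", "twitter": "@sam"},   # no "city" field
    ]
    print(json.dumps(find(people, city="NYC"), indent=2))
```

A relational table would need nullable columns or extra tables for the same data; here each document carries only the fields it has.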

Demo MongoDB

Persistent Key / Value Database

• Schema-less
• State - Persistent
• Examples
– AWS DynamoDB
– Azure Tables
– Project Voldemort

Volatile Key / Value Database

• Schema-less
• State - Volatile
• Examples
– Redis
– Memcached

Which type of NoSQL for which type of data?

Type of Data              Type of NoSQL solution   Example
Log files                 Wide Column              HBase
Product Catalogs          Key Value on disk        DynamoDB
User profiles             Key Value in memory      Redis
Startups                  Document                 MongoDB
Social media connections  Graph                    Neo4j
LOB w/Transactions        NONE! Use RDBMS          SQL Server

What about the cloud?

Cloud-hosted NoSQL up to 50x CHEAPER

Consumer Storage Buckets

• Dropbox

• Box

• Windows SkyDrive

• Google Drive

• Amazon Cloud Drive

• Apple iCloud

Developer BLOB Storage Buckets

• Amazon – S3 or Glacier

• Google – Cloud Storage

• Microsoft Azure BLOBS

• Others

Cloud-hosted RDBMS

• AWS RDS – SQL Server, MySQL, Oracle
– Medium cost
– Solid feature set, e.g. backup, snapshot
– Use existing tooling
• Google – MySQL
– Lowest cost
– Most limited RDBMS functionality
• Microsoft – Windows Azure SQL Database
– Highest cost
– Azure VMs w/MySQL

Other cloud data services

Hosting public datasets
• Pay to read
• Earn revenue by offering for read

Cleaning / matching (your) data
• ETL – Microsoft Data Explorer, Google Refine
• Data Quality – Windows Azure Marketplace, InfoChimps, DataMarket.com

Cloud – RDBMS, NoSQL & Hadoop

                              AWS                                  Google                                   Microsoft
Cloud RDBMS                   SQL Server, Oracle / MySQL           MySQL                                    SQL Azure
NoSQL buckets                 S3 or Glacier                        Cloud Storage                            Azure Storage
NoSQL databases               DynamoDB                             H/R Datastore on GAE                     Azure Tables
Streaming / Machine Learning  Custom EC2                           Prospective Search & Prediction API      StreamInsight & Mahout with Hadoop
Document or Graph             MongoDB on EC2                       Freebase (g)                             MongoDB on Windows Azure
Hadoop                        Elastic MapReduce using S3 & EC2     MapR & GCE                               Windows Azure HDInsight
Data sets & other             Karmasphere                          Translation API, Full-text search        Azure DataMarket

Demo Amazon RDS

Pick your mix and then…

NoSQL
• Host locally
• Host in the Cloud

RDBMS
• Host locally
• Host in the Cloud

Other Services
• Use Cloud Data Markets
• Use Cloud ETL

What about me?

Common DBA Tasks in NoSQL

RDBMS                          NoSQL
Import Data                    Import Data
Set Up Security                Set Up Security
Perform a Backup               Make a copy of the data
Restore a Database             Move a copy to a location
Create an Index                Create an Index
Join Tables Together           Run MapReduce
Schedule a Job                 Schedule a (Cron) Job
Run Database Maintenance       Monitor space and resources used
Send an Email from SQL Server  Set up resource threshold alerts
Search BOL                     Interpret Documentation

Making Sense – Asking Questions

Data Scientists…

Comparing…

Karmasphere Studio for AWS

Google BigQuery w/Excel

• Dremel-based service
– For massive amounts of data
– BigQuery currently has quota limits
– SQL-like query language

Demo Google Big Query

NoSQL To-Do List

Understand CAP & types of NoSQL databases
• Use NoSQL when business needs designate
• Use the right type of NoSQL for your business problem

Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mashup cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments

Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.

The Changing Data Landscape

• NoSQL
• RDBMS
• Other Services

NoSQL for .NET Developers

• RavenDB

• MongoDB C#/.NET Driver

• MongoDB on Windows Azure

• CouchBase .NET Client Library

• Riak client for .NET

• AWS Toolkit for Visual Studio

• Google cloud APIs (REST-based)