SQL Server 2012 and Big Data

SQL SERVER 2012 AND BIG DATA
Hadoop Connectors for SQL Server

TECHNICALLY – WHAT IS HADOOP

• Hadoop consists of two key services:
  • Data storage using the Hadoop Distributed File System (HDFS)
  • High-performance parallel data processing using a technique called MapReduce

HADOOP IS AN ENTIRE ECOSYSTEM

• HBase as the database
• Hive as the data warehouse
• Pig as the query language
• All built on top of Hadoop and the MapReduce framework

HDFS

• HDFS is designed to scale seamlessly
  • That's its strength!
• Scaling horizontally is non-trivial in most cases.
• HDFS scales by throwing more hardware at it.
  • A lot of it!
• HDFS is asynchronous
• This is what links Hadoop to cloud computing.

DIFFERENCES

• SQL Server & Windows 2008 R2's NTFS?
• Data is not stored in the traditional table/column format.
• HDFS supports only forward-only parsing
• Databases built on HDFS don't guarantee ACID properties
• Hadoop takes the code to the data
• SQL Server scales better vertically

UNSTRUCTURED DATA

• Doesn't know or care about column names, column data types, column sizes or even the number of columns.
• Data is stored in delimited flat files
• You're on your own with respect to data cleansing
• Data input in Hadoop is as simple as loading your data file into HDFS
• It's very close to copying files on an OS.
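
The points above can be illustrated with a minimal Python sketch. It assumes a small comma-delimited sample (the data, `parse_records` and `clean` are hypothetical, made up for illustration) and shows that nothing enforces column names, types or counts, so cleansing is left to the code that reads the file.

```python
import csv
import io

# Hypothetical sample of a delimited flat file as it might sit in HDFS.
# Note that the rows do not agree on the number of fields.
RAW = "1,alice,42\n2,bob\n3,carol,17,extra\n"

def parse_records(text, delimiter=","):
    """Split each line into fields; no names, types or counts are enforced."""
    return [row for row in csv.reader(io.StringIO(text), delimiter=delimiter)]

def clean(records, expected_fields):
    """Cleansing is the reader's job: keep only rows with the expected field count."""
    return [row for row in records if len(row) == expected_fields]

records = parse_records(RAW)
good = clean(records, expected_fields=3)
```

Here `records` keeps all three rows exactly as stored, while `good` keeps only the row that matches the layout the reader expected.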

NO SQL, NO TABLES, NO COLUMNS, NO DATA?

• Write code to do MapReduce
• You have to write code to get data
• The best way to get data
  • Write code that calls the MapReduce framework to slice and dice the stored data
• Step 1 is Map and Step 2 is Reduce.

MAP (REDUCE)

• Mapping
• Pick your selection of keys from each record (one record per line)
• Tell the framework what your key is and what values that key will hold
• MR will deal with the actual creation of the map
• You control which keys to include and which values to filter out
• You end up with a giant hashtable

(MAP) REDUCE

• Reducing data: once the map phase is complete, the code moves on to the reduce phase. The reduce phase works on the mapped data and can potentially do all the aggregation and summation activities.
• Finally you get a blob of the mapped and reduced data.
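
The two steps can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, not Hadoop's API: the function names, the `region,product,amount` record layout and the sample lines are all assumptions made for the example.

```python
from collections import defaultdict

def map_phase(lines, mapper):
    """Run the mapper on each record (one line) and group emitted values by key."""
    grouped = defaultdict(list)  # the "giant hashtable"
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reducer):
    """Collapse the list of values behind each key into one result."""
    return {key: reducer(key, values) for key, values in grouped.items()}

def sales_mapper(line):
    """Hypothetical layout region,product,amount; the key is the region."""
    fields = line.split(",")
    if len(fields) != 3:
        return  # filter out records that do not fit the layout
    region, _product, amount = fields
    yield region, float(amount)

def sum_reducer(_key, values):
    """Aggregation step: total of all values seen for a key."""
    return sum(values)

lines = ["east,widget,10.0", "west,gadget,5.5", "east,gadget,2.5", "broken line"]
totals = reduce_phase(map_phase(lines, sales_mapper), sum_reducer)
```

In a real cluster the framework runs many mappers and reducers in parallel across nodes; the sketch only shows what the programmer supplies (the mapper and the reducer) and what comes back.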

JAVA… VS. PIG…

• Pig is a querying engine
• Has a ‘business-friendly’ syntax
• Spits out MapReduce code
• The syntax for Pig is called Pig Latin (don't ask)
• Pig Latin is very similar syntactically to LINQ.
• Pig converts the query into MapReduce, sends it off to Hadoop, then retrieves the results
• Half the performance of hand-written Java MapReduce
• 10 times faster to write

HBASE

• HBase is a key-value store on top of HDFS
• This is the NoSQL database
• Very thin layer over raw HDFS
• Data is grouped in a table that has rows of data.
• Each row can have multiple ‘Column Families’
• Each ‘Column Family’ contains multiple columns.
• Each column name is the key, and it has its corresponding column value.
• Each row doesn't need to have the same number of columns
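
The layout described above (table, row, column family, column name, value) can be pictured as nested dictionaries. The sketch below is only a mental model in plain Python, not the HBase client API; `put`, `get`, `column_count` and the sample rows are hypothetical names chosen for illustration.

```python
# Model: row key -> column family -> column name -> value
table = {}

def put(table, row_key, family, column, value):
    """Store one cell, creating the row and the column family on demand."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

def get(table, row_key, family, column, default=None):
    """Look up one cell by row key, column family and column name."""
    return table.get(row_key, {}).get(family, {}).get(column, default)

def column_count(table, row_key):
    """Number of columns a row holds across all of its column families."""
    return sum(len(columns) for columns in table.get(row_key, {}).values())

# Rows are free to carry different columns.
put(table, "user1", "profile", "name", "Alice")
put(table, "user1", "profile", "city", "Oslo")
put(table, "user1", "stats", "logins", 7)
put(table, "user2", "profile", "name", "Bob")
```

After these calls `user1` holds three columns in two families and `user2` holds a single column, which is the "rows need not match" property from the last bullet.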

HIVE

• Hive is a little closer to RDBMS systems
• Is a DWH system on top of HDFS and HBase
• Performs join operations between HBase tables
• Maintains a meta layer
  • Data summarization, ad-hoc queries and analysis of large data stores in HDFS
• High-level language
• Hive Query Language looks like SQL, but is restricted
• No updates or deletes are allowed
• Partitioning can be used to update information
  o Essentially re-writing a chunk of data.
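
The "update by rewriting a partition" idea can be sketched in plain Python. This is a conceptual model only, not Hive syntax or a Hive API: the store, the month-style partition keys and `overwrite_partition` are hypothetical, and rows are treated as immutable.

```python
def overwrite_partition(store, partition, transform):
    """Return a new store in which one partition has been rewritten wholesale.

    Rows are never modified in place; the whole chunk is rebuilt instead.
    """
    rewritten = [transform(row) for row in store[partition]]
    new_store = dict(store)          # other partitions are carried over untouched
    new_store[partition] = rewritten
    return new_store

# Hypothetical data set partitioned by month; each row is (customer, amount).
store = {
    "2012-01": [("a", 1), ("b", 2)],
    "2012-02": [("a", 3)],
}

def fix_customer_a(row):
    """The 'update': multiply the amount for customer 'a' by 10."""
    customer, amount = row
    return (customer, amount * 10) if customer == "a" else row

updated = overwrite_partition(store, "2012-01", fix_customer_a)
```

Only the `2012-01` chunk is rebuilt; the `2012-02` partition is reused as-is, which is why partitioning keeps this kind of rewrite affordable.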

WINDOWS HADOOP - PROJECT ISOTOPE

• 2 flavours
• Cloud
  o Azure CTP
• On premise
  o Integration of the Hadoop File System with Active Directory
  o Integration of System Center Operations Manager with Hadoop
  o BI integration
• These are not all that interesting in and of themselves, but the data and tools are
  o Sqoop
    – Integration with SQL Server
  o Flume
    – Access to lots of data

SQOOP

• Is a framework that facilitates data transfer between relational database management systems (RDBMS) and HDFS.
• Uses MapReduce programs to import and export data
• Imports and exports are performed in parallel with fault tolerance.
• Source/target files used by Sqoop can be:
  • Delimited text files
  • Binary SequenceFiles containing serialized record data.
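
One way to picture the parallel import is to divide a table's key range into contiguous slices and give each slice to its own map task. The Python sketch below illustrates that idea under the assumption of an integer key; `split_ranges` is a hypothetical helper, and Sqoop's real splitting logic may differ in detail.

```python
def split_ranges(lo, hi, num_splits):
    """Divide the inclusive integer key range [lo, hi] into contiguous slices.

    Returns at most num_splits (start, end) pairs whose sizes differ by at most one.
    """
    if lo > hi or num_splits < 1:
        return []
    total = hi - lo + 1
    num_splits = min(num_splits, total)   # never create empty slices
    base, extra = divmod(total, num_splits)
    ranges = []
    start = lo
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# Each slice could be read by an independent task, e.g. keys 1..3, 4..6, ...
slices = split_ranges(1, 10, 4)
```

Because the slices do not overlap, a failed task can be rerun on its own slice without touching the others, which is one plausible reading of "in parallel with fault tolerance".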

SQL SERVER – HORTONWORKS - HADOOP

• Spin-off from Yahoo
• Bridge the technological gaps between Hadoop and Windows Server
• CTP of the Hadoop-based distribution for Windows Server (sometime in 2012)
• Will work with Microsoft's business-intelligence tools, including
  o Excel
  o PowerPivot
  o PowerView

HADOOP CONNECTORS

• SQL Server versions
  • Azure
  • PDW
  • SQL 2012
  • SQL 2008 R2

http://www.microsoft.com/download/en/details.aspx?id=27584


WITH SQL SERVER-HADOOP CONNECTOR, YOU CAN:

• Sqoop-based connector
• Import
  • Tables in SQL Server to delimited text files on HDFS
  • Tables in SQL Server to SequenceFiles on HDFS
  • Tables in SQL Server to tables in Hive
  • Results of queries executed on SQL Server to delimited text files on HDFS
  • Results of queries executed on SQL Server to SequenceFiles on HDFS
  • Results of queries executed on SQL Server to tables in Hive
• Export
  • Delimited text files on HDFS to SQL Server
  • SequenceFiles on HDFS to SQL Server
  • Hive tables to tables in SQL Server
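
Since the connector is Sqoop-based, the table import targets listed above map onto the `sqoop import` command line. The Python sketch below only assembles an argument list and does not run anything; `sqoop_import_args` is a hypothetical helper, the host, database, user and paths are placeholders, and the exact options a given connector version accepts should be checked against its documentation.

```python
def sqoop_import_args(jdbc_url, username, table, destination, target_dir=None):
    """Build an argument list for a table import.

    destination is one of "text", "sequence" or "hive" (names chosen for this sketch).
    """
    args = ["sqoop", "import",
            "--connect", jdbc_url,
            "--username", username,
            "--table", table]
    if destination in ("text", "sequence"):
        if target_dir is None:
            raise ValueError("an HDFS target directory is required for file imports")
        file_flag = "--as-textfile" if destination == "text" else "--as-sequencefile"
        args += [file_flag, "--target-dir", target_dir]
    elif destination == "hive":
        args += ["--hive-import"]
    else:
        raise ValueError("unknown destination: " + destination)
    return args

# Placeholder connection details, for illustration only.
URL = "jdbc:sqlserver://dbhost:1433;databaseName=Sales"
to_text = sqoop_import_args(URL, "etl_user", "Orders", "text", "/data/orders")
to_hive = sqoop_import_args(URL, "etl_user", "Orders", "hive")
```

Password handling is deliberately left out of the sketch; the resulting list could be handed to a process launcher on a machine where Sqoop and the connector are installed.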

SQL SERVER 2012 ALONGSIDE THE ELEPHANT

• PowerView utilizes its own class of apps, if you will, that Microsoft is calling insights.
• SQL Server will extend insights to Hadoop data sets
• Interesting insights can be
  • Brought into a SQL Server environment using connectors
  • Used to drive analysis across that data with BI tools

WHY USE HADOOP WITH SQL SERVER

• Don't just think about big data as being large volumes
• Analyze both structured and unstructured datasets
• Think about workload, growth, accessibility and even location
• Can the amount of data stored every day be reliably written to a traditional HDD?
• MapReduce is more complex than T-SQL
• Many companies try to avoid writing Java for queries
• Front ends are immature relative to the tooling available in the relational database world
• It's not going to replace your database, but your database isn't likely to replace Hadoop either.

MICROSOFT AND HADOOP

• Broader access to Hadoop for:
  • End users
  • IT professionals
  • Developers
• Enterprise-ready Hadoop distribution with greater security, performance and ease of management.
• Breakthrough insights through the use of familiar tools such as Excel, PowerPivot, SQL Server Analysis Services and Reporting Services.

ENTERPRISE HADOOP

• Installation wizard (IsotopeClusterDeployment)
• Healthcheck and monitoring pages
• Interactive JavaScript console

MICROSOFT ENTERPRISE HADOOP

• Machines in the Hadoop cluster must be running Windows Server 2008 or higher
• IPv4 network enabled on all nodes
  • Deployment does not work on an IPv6-only network.
• The ability to create a new user account called “Isotope”.
  • Will be created on all nodes of the cluster.
  • Used for running Hadoop daemons and running jobs.
  • Must be able to copy and install the deployment binaries to each machine
• Windows File Sharing services must be enabled on each machine that will be joined to the Hadoop cluster.
• .NET Framework 4 installed on all nodes.
• Minimum of 10 GB free space on the C drive (JBOD HDFS configuration is supported)

© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
