Big Data and Hadoop On Windows
.NET SIG Cleveland. Image credit: morguefile.com/creative/imelenchon
About Me
Serkan Ayvaz, Sr. Systems Analyst, Cleveland Clinic; PhD Candidate, Computer Science, Kent State University
LinkedIn: [email protected] Email: [email protected] Twitter: @sayvaz
Agenda: Introduction to Big Data, the Hadoop Framework, Hadoop on Windows, the Ecosystem, Conclusions
What is Big Data? ("Hype?")
"Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization." - Wikipedia
What is new?
Enterprise data grows rapidly
An emerging market for vendors
New data sources
Competitive industries: a need for more insights
Asking different questions
Generating models instead of transforming data into models
What is the problem?
Size of data: rapid growth; TBs to PBs are the norm for many organizations. As of 2012, data sets that were feasible to process in a reasonable amount of time were on the order of exabytes.
Variety of data: relational, device-generated data, mobile, logs, web data, sensor networks, social networks, etc.; structured, unstructured, and semi-structured.
Rate of data growth: as of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created. - Wikipedia
Particularly large datasets: meteorology, genomics, complex physics simulations, biological and environmental research, Internet search, finance, and business informatics.
Critique
Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, "big data", no matter how comprehensive or well analyzed, needs to be complemented by "big judgment", according to an article in the Harvard Business Review.
Consumer privacy concerns raised by the increasing storage and integration of personal information.
Things to consider
Return on investment may differ
Asking the wrong questions won't get the right answers
Experts need to fit into the organization
Requires a leadership decision
You might be fine with traditional systems (for now)
What is Hadoop?
Scalability: scales horizontally (vertical scaling has limits), and scales seamlessly
Moves processing to the data, as opposed to traditional methods; network bandwidth is a limited resource
Processes data sequentially in chunks, avoiding random access; seeks are expensive, disk throughput is reasonable
Fault tolerance: data replication
Economical: commodity servers ("not low-end") vs. specialized servers
Ecosystem: integration with other tools
Open source: innovative, extensible
Hadoop Core
HDFS: storage
MapReduce: processing
What can I do with Hadoop? Distributed programming (MapReduce); storage and archival of legacy data; data transformation; analysis and ad hoc reporting; looking for patterns; monitoring and processing logs; abnormality detection; machine learning and advanced algorithms; many more.
HDFS
Blocks: large enough to minimize the cost of seeks (64 MB by default); a unit of abstraction that makes storage management simpler than whole files; fits well with the replication strategy and availability
NameNode: maintains the filesystem tree and metadata for all files and directories; stores the namespace image and edit log
DataNode: stores and retrieves blocks; reports its blocks back to the NameNode periodically
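As a quick illustration of working with HDFS (the file and directory names here are made up for the example), files can be loaded and inspected with the standard hadoop fs shell:

> hadoop fs -mkdir input
> hadoop fs -put sales.log input
> hadoop fs -ls input
> hadoop fs -cat output/part-00000

-put copies a local file into HDFS, where it is split into blocks and replicated; -ls lists a directory; -cat streams a file back, such as the part file a MapReduce job writes to its output folder.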
HDFS
Good for:
Large files: designed for, and shines with, large files; Hadoop breaks data into smaller blocks and exploits data locality
Fault tolerance: data replication within and across racks
Write-once, read-many-times workloads: HDFS is most efficient with this access pattern

Not so good for:
Low-latency data access: HDFS is optimized for high-throughput data, possibly at the expense of latency; consider HBase for low latency
Lots of small files: the NameNode holds filesystem metadata in memory, which limits the number of files in a filesystem
Multiple writers and arbitrary file modifications: files in HDFS may be written to by a single writer only
Data Flow
[Figure: HDFS read and write data flows. Source: Hadoop: The Definitive Guide]
MapReduce Programming
Splits input files into blocks; operates on key-value pairs
Mappers filter and transform input data; reducers aggregate the mappers' output
Handles processing efficiently in parallel
Moves code to the data (data locality); the same code runs on all machines
Can be difficult to implement some algorithms, but can be written in almost any language: Streaming MapReduce for Python, Ruby, Perl, PHP, etc.; Pig Latin as a data flow language; Hive for SQL users
MapReduce
Programmers write two functions:
map(k, v) → <k', v'>*
reduce(k', v') → <k', v'>*
All values with the same key are reduced together.
For efficiency, programmers typically also write:
partition(k', number of partitions) → partition for k': often a simple hash of the key, e.g., hash(k') mod n; divides up the key space for parallel reduce operations
combine(k', v') → <k', v'>*: mini-reducers that run in memory after the map phase; used as an optimization to reduce network traffic
The framework takes care of the rest of the execution.
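For instance, the default hash partitioning described above can be expressed in a few lines of C# (a minimal sketch; the method name and signature are illustrative, not part of any Hadoop API):

// Assign a key to one of numPartitions reduce partitions via hash(k') mod n.
static int Partition(string key, int numPartitions)
{
    // Mask off the sign bit so the result is never negative.
    return (key.GetHashCode() & int.MaxValue) % numPartitions;
}

Because the partition depends only on the key, every occurrence of the same key lands on the same reducer, which is what makes per-key aggregation in reduce correct.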
Simple example - Word Count

// Map Reduce function in JavaScript
// -------------------------------------------------------------
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
[Figure: the full MapReduce data flow, divide and conquer. Input key-value pairs (k1 v1, k2 v2, ...) are split across parallel map tasks; combiners pre-aggregate each mapper's output; partitioners divide the key space among reducers; shuffle and sort aggregate values by key; and the reduce tasks (r1, r2, r3) write the final output, e.g., x 10, y 8, z 19.]
How does MapReduce work?

Map(String docid, String text):
    for each word w in text:
        Emit(w, 1)

Reduce(String term, Iterator<Int> values):
    int sum = 0
    for each v in values:
        sum += v
    Emit(term, sum)

Source: Hadoop: The Definitive Guide
How is it different from other systems?
Parallel, Message Passing Interface (MPI) systems: suited to compute-intensive jobs, but have issues with larger data volumes; network bandwidth becomes the bottleneck and compute nodes fall idle; hard to implement
Hadoop handles the challenges of coordinating processes in a large-scale distributed computation: handling partial failure, and managing checkpointing and recovery
Comparing MapReduce to RDBMS

            Traditional RDBMS           MapReduce
Data size   Gigabytes                   Petabytes
Access      Interactive and batch       Batch
Updates     Read and write many times   Write once, read many times
Structure   Static schema               Dynamic schema
Integrity   High                        Low
Scaling     Nonlinear                   Linear
MapReduce
MapReduce is complementary to an RDBMS, not competing with it. MapReduce is a good fit for analyzing a whole dataset in batch. An RDBMS is good for point queries and updates, where the dataset is indexed to deliver low-latency retrieval of relatively small amounts of data. MapReduce suits applications where the data is written once and read many times; an RDBMS is better for datasets that are continually updated.
Hadoop on Windows Overview
Apache Hadoop Core: the common framework, open source, shared by all distributions
Distribution: Hortonworks Data Platform on the Windows platform, 100% open source, with contributions back to the community
HDInsight: HDInsight Server and HDInsight on the cloud, with familiar tools and functionality
Hadoop on Windows
Standard Hadoop modules: HDFS, MapReduce, Pig, Hive, monitoring pages
Easy installation and configuration
Integration with Microsoft systems: Active Directory, System Center, etc.
Why is Hadoop on Windows important?
Windows Server has a large market share, a large developer and user community, and existing enterprise tools
Familiarity, simplicity of use and management
Deployment options on both Windows Server and Windows Azure
[Architecture figure: Hadoop (server and cloud) stores unstructured, semi-structured, and structured data in HDFS, alongside RDBMS (SQL) and NoSQL stores. Jobs are written in Java, Streaming, HiveQL, Pig Latin, .NET, and other languages. Data flows in from external sources (web, mobile devices, social media) and from legacy data. Users consume results through self-service tools: data viewers, BI, visualization.]
Run Jobs
Submit a JAR file (Java MapReduce); HiveQL; Pig Latin; a .NET wrapper through Streaming; .NET MapReduce; LINQ to Hive; the JavaScript console; the Excel Hive add-in
.NET MapReduce Example
Install the NuGet packages:
install-package Microsoft.Hadoop.MapReduce
install-package Microsoft.Hadoop.Hive
install-package Microsoft.Hadoop.WebClient

Reference "Microsoft.Hadoop.MapReduce.dll". Create a class that implements HadoopJob<YourMapper>, and a class called "FirstMapper" that implements MapperBase.

Run the DLL using the MRRunner utility:
> MRRunner -dll MyDll -class MyClass -- extraArg1 extraArg2

Or invoke the job from an executable:
var hadoop = Hadoop.Connect();
hadoop.MapReduceJob.ExecuteJob<JobType>(arguments);
.NET MapReduce Example

public class FirstJob : HadoopJob<SqrtMapper>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = "input/SqrtJob";
        config.OutputFolder = "output/SqrtJob";
        return config;
    }
}

public class SqrtMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Parse the input line.
        int inputValue = int.Parse(inputLine);

        // Perform the work.
        double sqrt = Math.Sqrt((double)inputValue);

        // Write output data.
        context.EmitKeyValue(inputValue.ToString(), sqrt.ToString());
    }
}
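The SqrtJob above is map-only. To aggregate mapper output, the same SDK also exposes a ReducerCombinerBase and a two-type-parameter HadoopJob<TMapper, TReducer>; the word-count sketch below assumes those types behave as described in the CodePlex Hadoop SDK documentation, so treat it as illustrative rather than definitive:

public class WordCountJob : HadoopJob<WordMapper, WordReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = "input/WordCount";    // hypothetical paths
        config.OutputFolder = "output/WordCount";
        return config;
    }
}

public class WordMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Emit <word, 1> for every word on the line.
        foreach (string word in inputLine.Split(' '))
        {
            if (word != "") context.EmitKeyValue(word.ToLower(), "1");
        }
    }
}

public class WordReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Sum the per-word counts produced by the mappers.
        int sum = 0;
        foreach (string v in values) sum += int.Parse(v);
        context.EmitKeyValue(key, sum.ToString());
    }
}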
Hadoop Ecosystem
Hadoop: Common, MapReduce, HDFS
HBase: column-oriented distributed database
Hive: distributed data warehouse, SQL-like query platform
Pig: data transformation language
Sqoop: tool for bulk import/export between HDFS, HBase, Hive, and relational databases
Mahout: data mining algorithms
ZooKeeper: distributed coordination service
Oozie: job running and workflow scheduling service
What's HBase? A column-oriented distributed DB, inspired by Google BigTable. It uses HDFS for storage and supports interactive processing; it can be used with or without MapReduce. PUT, GET, and SCAN commands, as in the shell sketch below.
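For example, from the HBase shell those commands look like this (the table, column family, and values are invented for the illustration):

hbase> put 'users', 'row1', 'info:name', 'Jane'
hbase> get 'users', 'row1'
hbase> scan 'users'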
What's Hive? It translates HiveQL, which is similar to SQL, into MapReduce. A distributed data warehouse with an HDFS table file format. It integrates with BI products on tabular data through Hive ODBC and JDBC drivers.
Hive
Good for:
o HiveQL: a familiar, high-level language
o Batch jobs and ad hoc queries
o Self-service BI tools via ODBC, JDBC
o Schemas, though not as strict as a traditional RDBMS
o Supports UDFs
o Easy access to Hadoop data

Not so good for:
• No updates or deletes; insert only
• Limited indexes, a built-in optimizer, no caching
• Not OLTP; not as fast as hand-written MapReduce

A short HiveQL sketch follows this list.
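As a small illustration of HiveQL (the table, columns, and file path are hypothetical):

CREATE TABLE pageviews (url STRING, views INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH 'input/pageviews.tsv' INTO TABLE pageviews;

-- An aggregate query like this compiles down to MapReduce jobs.
SELECT url, SUM(views) FROM pageviews GROUP BY url;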
Conclusion
Hadoop is great for its purposes and is here to stay, BUT it is not a cure for every problem. Developing standards and best practices is very important; users may abuse the resources and scalability.
Integration with the Windows platform: existing systems, tools, and expertise
Parallelization: easier to scale as needed
Economical: commodity hardware
Relatively short training and application development time with Windows
Resources & References
Hadoop: The Definitive Guide, Tom White. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979
Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer. Morgan & Claypool Publishers, 2010.
Apache Hadoop: http://hadoop.apache.org/
Microsoft Big Data page: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx
Hortonworks Data Platform: http://hortonworks.com/products/hortonworksdataplatform/
Hadoop SDK: http://hadoopsdk.codeplex.com/
Thank you!
Any Questions?