Big Data and Hadoop On Windows
.NET SIG Cleveland. Image credit: morguefile.com/creative/imelenchon
About Me
Serkan Ayvaz, Sr. Systems Analyst, Cleveland Clinic; PhD Candidate, Computer Science, Kent State University
LinkedIn: [email protected] Email: [email protected] Twitter: @sayvaz
Agenda: Introduction to Big Data, the Hadoop Framework, Hadoop on Windows, the Ecosystem, Conclusions
What is Big Data? ("Hype?")
"Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization." - Wikipedia
What is new?
Enterprise data grows rapidly
An emerging market for vendors
New data sources
Competitive industries: a need for more insights
Asking different questions
Generating models instead of transforming data into models
What is the problem?
Size of data: rapid growth; TBs to PBs are the norm for many organizations. As of 2012, data sets that were feasible to process in a reasonable amount of time were on the order of exabytes.
Variety of data: relational, device-generated data, mobile, logs, web data, sensor networks, social networks, etc.; structured, unstructured, and semi-structured.
Rate of data growth: as of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created. - Wikipedia
Particularly large datasets: meteorology, genomics, complex physics simulations, biological and environmental research, Internet search, finance, and business informatics.
Critique
Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, "big data", no matter how comprehensive or well analyzed, needs to be complemented by "big judgment", according to an article in the Harvard Business Review.
Consumer privacy concerns raised by the increasing storage and integration of personal information.
Things to consider
Return on investment may differ
Asking the wrong questions won't get the right answers
Experts need to fit into the organization
Requires a leadership decision
You might be fine with traditional systems (for now)
What is Hadoop?
Scalability: scales horizontally (vertical scaling has limits), and scales seamlessly
Moves processing to the data, as opposed to traditional methods; network bandwidth is a limited resource
Processes data sequentially in chunks, avoiding random access; seeks are expensive, disk throughput is reasonable
Fault tolerance: data replication
Economical: commodity servers ("not low-end") vs. specialized servers
Ecosystem: integration with other tools
Open source: innovative, extensible
Hadoop Core
HDFS: storage
MapReduce: processing
What can I do with Hadoop? Distributed programming (MapReduce); storage and archival of legacy data; data transformation; analysis and ad hoc reporting; looking for patterns; monitoring and processing logs; abnormality detection; machine learning and advanced algorithms; many more.
HDFS
Blocks: large enough to minimize the cost of seeks (64 MB by default); a unit of abstraction that makes storage management simpler than whole files; fits well with the replication strategy and availability
NameNode: maintains the filesystem tree and metadata for all files and directories; stores the namespace image and edit log
DataNode: stores and retrieves blocks; reports its blocks back to the NameNode periodically
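As a quick illustration of working with HDFS (the file and directory names here are made up for the example), files can be loaded and inspected with the standard hadoop fs shell:

> hadoop fs -mkdir input
> hadoop fs -put sales.log input
> hadoop fs -ls input
> hadoop fs -cat output/part-00000

-put copies a local file into HDFS, where it is split into blocks and replicated; -ls lists a directory; -cat streams a file back, such as the part file a MapReduce job writes to its output folder.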
HDFS
Good for:
Large files: designed for, and shines with, large files; Hadoop breaks data into smaller blocks and exploits data locality
Fault tolerance: data replication within and across racks
Write-once, read-many-times workloads: HDFS is most efficient with this access pattern

Not so good for:
Low-latency data access: HDFS is optimized for high-throughput data, possibly at the expense of latency; consider HBase for low latency
Lots of small files: the NameNode holds filesystem metadata in memory, which limits the number of files in a filesystem
Multiple writers and arbitrary file modifications: files in HDFS may be written to by a single writer only
Data Flow
[Figure: HDFS read and write data flows. Source: Hadoop: The Definitive Guide]
MapReduce Programming
Splits input files into blocks; operates on key-value pairs
Mappers filter and transform input data; reducers aggregate the mappers' output
Handles processing efficiently in parallel
Moves code to the data (data locality); the same code runs on all machines
Can be difficult to implement some algorithms, but can be written in almost any language: Streaming MapReduce for Python, Ruby, Perl, PHP, etc.; Pig Latin as a data flow language; Hive for SQL users
MapReduce
Programmers write two functions:
map(k, v) → <k', v'>*
reduce(k', v') → <k', v'>*
All values with the same key are reduced together.
For efficiency, programmers typically also write:
partition(k', number of partitions) → partition for k': often a simple hash of the key, e.g., hash(k') mod n; divides up the key space for parallel reduce operations
combine(k', v') → <k', v'>*: mini-reducers that run in memory after the map phase; used as an optimization to reduce network traffic
The framework takes care of the rest of the execution.
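For instance, the default hash partitioning described above can be expressed in a few lines of C# (a minimal sketch; the method name and signature are illustrative, not part of any Hadoop API):

// Assign a key to one of numPartitions reduce partitions via hash(k') mod n.
static int Partition(string key, int numPartitions)
{
    // Mask off the sign bit so the result is never negative.
    return (key.GetHashCode() & int.MaxValue) % numPartitions;
}

Because the partition depends only on the key, every occurrence of the same key lands on the same reducer, which is what makes per-key aggregation in reduce correct.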
Simple example - Word Count

// Map Reduce function in JavaScript
// -------------------------------------------------------------
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
[Figure: the full MapReduce data flow, divide and conquer. Input key-value pairs (k1 v1, k2 v2, ...) are split across parallel map tasks; combiners pre-aggregate each mapper's output; partitioners divide the key space among reducers; shuffle and sort aggregate values by key; and the reduce tasks (r1, r2, r3) write the final output, e.g., x 10, y 8, z 19.]
How does MapReduce work?

Map(String docid, String text):
    for each word w in text:
        Emit(w, 1)

Reduce(String term, Iterator<Int> values):
    int sum = 0
    for each v in values:
        sum += v
    Emit(term, sum)

Source: Hadoop: The Definitive Guide
How is it different from other systems?
Parallel, Message Passing Interface (MPI) systems: suited to compute-intensive jobs, but have issues with larger data volumes; network bandwidth becomes the bottleneck and compute nodes fall idle; hard to implement
Hadoop handles the challenges of coordinating processes in a large-scale distributed computation: handling partial failure, and managing checkpointing and recovery
Comparing MapReduce to RDBMS

            Traditional RDBMS           MapReduce
Data size   Gigabytes                   Petabytes
Access      Interactive and batch       Batch
Updates     Read and write many times   Write once, read many times
Structure   Static schema               Dynamic schema
Integrity   High                        Low
Scaling     Nonlinear                   Linear
MapReduce
MapReduce is complementary to an RDBMS, not competing with it. MapReduce is a good fit for analyzing a whole dataset in batch. An RDBMS is good for point queries and updates, where the dataset is indexed to deliver low-latency retrieval of relatively small amounts of data. MapReduce suits applications where the data is written once and read many times; an RDBMS is better for datasets that are continually updated.
Hadoop on Windows Overview
Apache Hadoop Core: the common framework, open source, shared by all distributions
Distribution: Hortonworks Data Platform on the Windows platform, 100% open source, with contributions back to the community
HDInsight: HDInsight Server and HDInsight on the cloud, with familiar tools and functionality
Hadoop on Windows
Standard Hadoop modules: HDFS, MapReduce, Pig, Hive, monitoring pages
Easy installation and configuration
Integration with Microsoft systems: Active Directory, System Center, etc.
Why is Hadoop on Windows important?
Windows Server has a large market share, a large developer and user community, and existing enterprise tools
Familiarity, simplicity of use and management
Deployment options on both Windows Server and Windows Azure
[Architecture figure: Hadoop (server and cloud) stores unstructured, semi-structured, and structured data in HDFS, alongside RDBMS (SQL) and NoSQL stores. Jobs are written in Java, Streaming, HiveQL, Pig Latin, .NET, and other languages. Data flows in from external sources (web, mobile devices, social media) and from legacy data. Users consume results through self-service tools: data viewers, BI, visualization.]
Run Jobs
Submit a JAR file (Java MapReduce); HiveQL; Pig Latin; a .NET wrapper through Streaming; .NET MapReduce; LINQ to Hive; the JavaScript console; the Excel Hive add-in
.NET MapReduce Example
Install the NuGet packages:
install-package Microsoft.Hadoop.MapReduce
install-package Microsoft.Hadoop.Hive
install-package Microsoft.Hadoop.WebClient

Reference "Microsoft.Hadoop.MapReduce.dll". Create a class that implements HadoopJob<YourMapper>, and a class called "FirstMapper" that implements MapperBase.

Run the DLL using the MRRunner utility:
> MRRunner -dll MyDll -class MyClass -- extraArg1 extraArg2

Or invoke the job from an executable:
var hadoop = Hadoop.Connect();
hadoop.MapReduceJob.ExecuteJob<JobType>(arguments);
.NET MapReduce Example

public class FirstJob : HadoopJob<SqrtMapper>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = "input/SqrtJob";
        config.OutputFolder = "output/SqrtJob";
        return config;
    }
}

public class SqrtMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Parse the input line.
        int inputValue = int.Parse(inputLine);

        // Perform the work.
        double sqrt = Math.Sqrt((double)inputValue);

        // Write output data.
        context.EmitKeyValue(inputValue.ToString(), sqrt.ToString());
    }
}
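The SqrtJob above is map-only. To aggregate mapper output, the same SDK also exposes a ReducerCombinerBase and a two-type-parameter HadoopJob<TMapper, TReducer>; the word-count sketch below assumes those types behave as described in the CodePlex Hadoop SDK documentation, so treat it as illustrative rather than definitive:

public class WordCountJob : HadoopJob<WordMapper, WordReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = "input/WordCount";    // hypothetical paths
        config.OutputFolder = "output/WordCount";
        return config;
    }
}

public class WordMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Emit <word, 1> for every word on the line.
        foreach (string word in inputLine.Split(' '))
        {
            if (word != "") context.EmitKeyValue(word.ToLower(), "1");
        }
    }
}

public class WordReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Sum the per-word counts produced by the mappers.
        int sum = 0;
        foreach (string v in values) sum += int.Parse(v);
        context.EmitKeyValue(key, sum.ToString());
    }
}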
Hadoop Ecosystem
Hadoop: Common, MapReduce, HDFS
HBase: column-oriented distributed database
Hive: distributed data warehouse, SQL-like query platform
Pig: data transformation language
Sqoop: tool for bulk import/export between HDFS, HBase, Hive, and relational databases
Mahout: data mining algorithms
ZooKeeper: distributed coordination service
Oozie: job running and workflow scheduling service
What's HBase? A column-oriented distributed DB, inspired by Google BigTable. It uses HDFS for storage and supports interactive processing; it can be used with or without MapReduce. PUT, GET, and SCAN commands, as in the shell sketch below.
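For example, from the HBase shell those commands look like this (the table, column family, and values are invented for the illustration):

hbase> put 'users', 'row1', 'info:name', 'Jane'
hbase> get 'users', 'row1'
hbase> scan 'users'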
What's Hive? It translates HiveQL, which is similar to SQL, into MapReduce. A distributed data warehouse with an HDFS table file format. It integrates with BI products on tabular data through Hive ODBC and JDBC drivers.
Hive
Good for:
o HiveQL: a familiar, high-level language
o Batch jobs and ad hoc queries
o Self-service BI tools via ODBC, JDBC
o Schemas, though not as strict as a traditional RDBMS
o Supports UDFs
o Easy access to Hadoop data

Not so good for:
• No updates or deletes; insert only
• Limited indexes, a built-in optimizer, no caching
• Not OLTP; not as fast as hand-written MapReduce

A short HiveQL sketch follows this list.
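As a small illustration of HiveQL (the table, columns, and file path are hypothetical):

CREATE TABLE pageviews (url STRING, views INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH 'input/pageviews.tsv' INTO TABLE pageviews;

-- An aggregate query like this compiles down to MapReduce jobs.
SELECT url, SUM(views) FROM pageviews GROUP BY url;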
Conclusion
Hadoop is great for its purposes and is here to stay, BUT it is not a cure for every problem. Developing standards and best practices is very important; users may abuse the resources and scalability.
Integration with the Windows platform: existing systems, tools, and expertise
Parallelization: easier to scale as needed
Economical: commodity hardware
Relatively short training and application development time with Windows
Resources & References
Hadoop: The Definitive Guide, Tom White. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979
Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer. Morgan & Claypool Publishers, 2010.
Apache Hadoop: http://hadoop.apache.org/
Microsoft Big Data page: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx
Hortonworks Data Platform: http://hortonworks.com/products/hortonworksdataplatform/
Hadoop SDK: http://hadoopsdk.codeplex.com/
Thank you!
Any Questions?