Big Data and Introduction to Hadoop « the Lazy Blogger
Big Data and Introduction to Hadoop
Posted November 4, 2011
Filed under: Big Data | Tags: Big Data, Hadoop, HBase, Hive, Map Reduce |
Last weekend (October 29, 2011) I attended a training on Hadoop, arranged by
my employer. It started on Friday afternoon and ended Sunday evening, in 6
batches of 4 hours each. By the end, all 20 attendees had their brains spilling
out of their ears, but each one of us had a blast! It was a fabulous series!
The following (semi-technical) account is my 101-level take-away from some of
the sessions.
What is Hadoop?
Cloudera, the leading vendor of Hadoop distributions, defines it as follows
on their website:
Technically, Hadoop consists of two key services: reliable data storage using
the Hadoop Distributed File System (HDFS) and high-performance parallel
data processing using a technique called MapReduce.
There you have it, from the horse’s mouth!
A little bit of history: Hadoop was sponsored by Yahoo! (yes, you read that
right), with Doug Cutting being the principal architect of the project. Once
Hadoop was mature enough, Yahoo made it an Apache project. Doug later left
Yahoo for Cloudera, which is now considered the ‘Red Hat’ of Hadoop
distributions. If you really haven’t read on Wikipedia how Hadoop got its
name: it is named after Doug’s son’s toy elephant!
More trivia: the Hadoop project is based on Google’s papers on GFS and
BigTable, systems Google uses internally.
If it is just a file system + a technique, how is it related to the cloud hoopla?
Well, when we say Hadoop in the context of the cloud, we mean the things on top
of HDFS and MapReduce. Basically, Hadoop is the entire ecosystem built on top of
the ‘classic definition’. It consists of HBase as the database, Hive as the data
warehouse, and Pig as the query language, all built on top of HDFS and the
MapReduce framework.
HDFS is designed from the ground up to scale seamlessly as you throw hardware
at it. That’s its strength! Anyone designing server farms will agree that
scaling horizontally is non-trivial in most cases. For HDFS, the problem of
scale is solved simply by throwing more hardware at the farm. A lot of this is
because actions on HDFS are asynchronous. But the concept of throwing hardware
at a farm and getting scaling automatically is what endears Hadoop to cloud
computing.
Okay, how is this different from SQL Server 2008 R2 running on top of
Windows 2008 R2’s NTFS?
Big Data and Introduction to Hadoop « The Lazy Blogger https://sumitmaitra.wordpress.com/2011/11/04/big-data-and-introductio...
1 of 6 8/11/2012 12:54 AM
Ah ha! Now that’s a loaded question. Let’s go through it one by one (the list
below is in no particular order of importance or concepts; I am just doing a
brain dump here):
1. Data is not stored in the traditional table-column format. At best, some of
the database layers mimic this, but deep in the bowels of HDFS there are no
tables, no primary keys, no indexes. Everything is a flat file with
predetermined delimiters. HDFS is optimized to recognize the <Key, Value> mode
of storage. Everything maps down to <Key, Value> pairs.
2. HDFS supports only forward-only parsing. So you are either reading ahead or
appending to the end. There is no concept of ‘Update’ or ‘Insert’.
3. Databases built on HDFS don’t guarantee ACID properties, especially
‘Consistency’. They offer what is called ‘eventual consistency’, meaning data
will be saved eventually, but because of the highly asynchronous nature of the
file system you are not guaranteed at what point that will finish. So
HDFS-based systems are NOT ideal for OLTP. RDBMSes still rock there.
4. Taking code to the data. In traditional systems you fire a query to get data
and then write code to manipulate it. In MapReduce, you write code, send it to
Hadoop’s data store, and get back the manipulated data. Essentially you are
sending the code to the data.
5. Traditional databases like SQL Server scale better vertically: more cores,
more memory and faster cores are the way to scale. Hadoop, by design, scales
horizontally: keep throwing hardware at it and it will scale.
I am beginning to get it. Why is it said that Hadoop deals with unstructured
data? How do we actually store data?
Unstructured is a slight misnomer in the very basic sense. By unstructured,
Hadoop implies it doesn’t know about column names, column data types,
column sizes or even the number of columns. There is also no implicit concept
of a table. Data is stored in flat files! Flat files with some kind of
delimiter that needs to be agreed upon by all users of the data store. So it
could be comma delimited, pipe delimited or tab delimited. A line feed, as a
rule of thumb, is always treated as the end of a record. So there is a method
to the madness (the ‘unstructured-ness’), but there are no hard bindings as
employed by traditional databases. When you are dealing with data from Hadoop,
you are on your own with respect to data cleansing.
Data input in Hadoop is as simple as loading your data file into HDFS, and
loading is very, very close to copying files in the usual sense on any OS.
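To make the flat-file idea concrete, here is a minimal sketch (in Python, purely as an illustration, since consumers of HDFS data can be in any language) of how an agreed-upon delimiter turns flat lines into <Key, Value> pairs. The field layout and sample data are made up:

```python
# Parse comma-delimited flat-file records into <Key, Value> pairs.
# Hypothetical layout: each line is "user_id,page,seconds_on_page";
# we treat user_id as the key and the remaining fields as the value.

def parse_records(lines, delimiter=","):
    """Yield (key, value) pairs; a line feed marks the end of a record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # no schema enforces anything; skip blanks ourselves
        fields = line.split(delimiter)
        yield fields[0], fields[1:]

sample = [
    "u1,/home,12",
    "u2,/pricing,45",
    "u1,/docs,30",
]

for key, value in parse_records(sample):
    print(key, value)
```

Note that all the “schema” lives in the parsing code, not in the store; change the agreed delimiter and every consumer has to change with it.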
Okay, so there is no SQL, no tables, no columns. Once I load my data, how do
I get it back?
In short: write code to do Map-Reduce.
Huh! Seriously? Map-Reduce… wha…?
Yes. You have to write code to get data out of a Hadoop system. The
abstractions on top of Hadoop are few, and all are sub-optimal. So the best
way to get data is to write Java code against the MapReduce framework, which
slices and dices the stored data for you on the fly.
The Map-Reduce framework works in two steps (no points for guessing): step 1
is Map and step 2 is Reduce.
Mapping Data: If it is plain delimited text data, you have the freedom to pick
your selection of keys and values from the record (remember, records are
typically linefeed-separated) and tell the framework what your key is and what
values that key will hold. MapReduce will deal with the actual creation of the
map. While the map is being created, you can control which keys to include and
which values to filter out. In the end you have a giant hashtable of filtered
key-value pairs. Now what?
Reducing Data: Once the map phase is complete, the code moves on to the reduce
phase. The reduce phase works on the mapped data and can potentially do all the
aggregation and summation activities.
Finally you get a blob of the mapped and reduced data.
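The two phases can be sketched in plain Python. This is a toy in-memory simulation of the idea, not the actual Hadoop Java API: the mapper emits <Key, Value> pairs, the framework groups them by key, and the reducer aggregates each group — word count being the canonical example:

```python
from collections import defaultdict

# Toy in-memory MapReduce: word count over a few lines of text.

def map_phase(records):
    """Mapper: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    """The framework's grouping step: collect all values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: aggregate (here, sum) the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big deal", "big elephant"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 1, 'deal': 1, 'elephant': 1}
```

In a real cluster the shuffle step is what Hadoop does for you between the two phases, moving each key’s values to the node that will reduce them.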
But… but… Do I really have to write Java?
Well, if you are that scared of Java, then you have Pig. No, I am not calling
names here. Pig is a querying engine that has a more ‘business-friendly’
syntax but spits out MapReduce code behind the scenes and does all the dirty
work for you. The syntax for Pig is called, of course, Pig Latin.
When you write queries in Pig Latin, Pig converts them into MapReduce, sends
them off to Hadoop, then retrieves the results and hands them back to you.
Analysis shows you get about half the performance of optimal hand-written
MapReduce Java code, but that code takes more than 10 times as long to write
as the equivalent Pig query.
If you are in the mood for a start-up idea, generating optimal MapReduce code
from Pig Latin is a topic to consider …
For those in the .NET world, Pig Latin is very similar syntactically to LINQ.
Okay, my head is now spinning. Where do Hive and HBase fit in?
Describing Hive and HBase requires full articles of their own. A very brief intro
to them is as follows:
HBase
HBase is a key-value store that sits on top of HDFS. It is a NoSQL database.
It is a very thin veneer over raw HDFS, wherein it mandates that data is
grouped into tables that have rows of data.
Each row can have multiple ‘column families’ and each ‘column family’ can
contain multiple columns.
Each column name is a key with a corresponding column value.
So a column of data can be represented as
row[family][column] = value
Each row need not have the same number of columns. Think of each row as a
horizontal linked list that links to column families, where each column
family links to multiple columns as <Key, Value> pairs.
row1 -> family1 -> col A = val A
-> family2 -> col B = val B
and so on.
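That row[family][column] = value shape maps naturally onto nested dictionaries. Here is a small Python sketch of the data model only (not the HBase client API; the row, family and column names are made up):

```python
# Model an HBase-style table as row -> column family -> column -> value.
# Rows need not share the same columns, so each level is a plain dict.

table = {}

def put(row, family, column, value):
    """Insert one cell, creating the row and family levels on demand."""
    table.setdefault(row, {}).setdefault(family, {})[column] = value

put("row1", "family1", "colA", "valA")
put("row1", "family2", "colB", "valB")
put("row2", "family1", "colC", "valC")  # a different column: perfectly fine

print(table["row1"]["family1"]["colA"])  # valA
print(sorted(table["row1"]))             # ['family1', 'family2']
```

The point of the sketch: nothing forces row1 and row2 to have the same columns, which is exactly the flexibility the column-family model gives you.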
Hive
Hive is a little closer to traditional RDBMS systems. In fact, it is a data
warehousing system that sits on top of HDFS but maintains a meta layer that
helps with data summarization, ad-hoc queries and analysis of large data
stores in HDFS.
Hive supports a high-level language called Hive Query Language, which looks
like SQL but is restricted in a few ways: no Updates or Deletes are allowed,
for example. However, Hive has a concept of partitioning that can be used to
update information, which is essentially re-writing a chunk of data whose
granularity depends on the schema design.
Hive can actually sit on top of HBase and perform join operations between
HBase tables.
I heard Hadoop only requires ‘commodity hardware’ to run. Can I bunch
together the 486 machines gathering dust in my garage and bring up a
Hadoop cluster?
In short: NO!
When Google originally set out to build its search index, ‘powerful’ computers
implied room-sized Cray supercomputers that cost a pretty penny and were
available only to the likes of the CIA!
So commodity hardware implies ‘non-supercomputers’ that can be purchased by
everybody. Today you can string together 10-12 high-end blade servers, each
with about 24 GB of RAM, 12-24 TB of disk space and as many cores as you can
get, to build an entry-level production-ready Hadoop cluster.
On a different note, you can run code samples in a VM that will run OK on a
laptop with the latest Core processors and approximately 8 GB of RAM. But
that’s only good for code samples! Even for PoCs, spinning up an EC2 cluster
is the best way to go.
Okay, with that I conclude this article. In upcoming articles we’ll see
installation as well as some real world use cases of big data on Hadoop!
4 comments so far
Ellie K on January 4, 2012
This is just the most awesome and accessible explanation I have ever read!
You did a great job here. I liked your @SignalR article, but this is the kind of
post that an actual user in the real world can enjoy and understand. You
mention flat files! Yay! I know about those from SAS (that’s not SaaaaaaS
software as a service but SAS 4th gen language for statistical analysis and
data management).
Also, you explained what Pig does! And Pig Latin is the query syntax! So cute!
I love piggies (my avatar is a very special type of piggy). You don’t disdain
RDBMS, columns, rows or blade servers. You explained Hive too.
But here’s a thought, a concern: I am a Data Governess (data governance
person, small joke), an upholder of consistency, benchmarks, scenario
analysis and standards. Hadoop works with a world of unstructured data, no
hard bindings, and you’re on your own for data cleansing, correct? Doesn’t
this lead to big problems with referential integrity?
At my last job, I designed and maintained an Oracle RDBMS that supported all
medical, accounting and regulatory aspects of a drug formulary for a
pediatric managed care program. Lots of details, reports, data integrity
issues etc. Is that something that could be done with Hadoop instead of
Oracle, SQL (and Toad)? And SAS?
Sumit on January 5, 2012
Hello Ellie,
Glad you enjoyed the article and really appreciate the feedback.
Let me try and answer your questions one by one.
1. Data Consistency: Yes. All Big Data platforms at the moment have
come to accept that it is not possible to achieve the ACID properties of
traditional RDBMS systems. Instead, their design and architecture go by
what’s referred to as the CAP theorem, and they try to achieve BASE
properties. Moral of the story: if you are looking for a strictly
transactional and mission-critical system, then Big Data is not a good
starting point.
The above kind of makes all further arguments in favor of Big Data sound
lame. That’s where use cases come into the picture. Big Data use cases
invariably involve very high volume and very high velocity, e.g. log files
from networking gear, streaming data (like the Twitter stream), click-stream
data (every click on every link from a site or group of sites) or historical
data being moved from traditional data warehouses. Typically we are
looking at beyond-terabyte data volumes. In fact, Big Data is suggested
when you know your data volume is eventually going to reach petabyte
scale. Once you think in those terms you will realize that for that volume of
data, a couple of lines out of order or corrupted don’t even register
statistically.
2. Data Cleansing/Referential Integrity: Most of the time, data in Big Data
stores is denormalized, and analysis is performed pretty much by the brute
force of map-reduce jobs. As of now these jobs are low level and written in
code. So data cleansing is a matter of handling errors correctly.
3. Statistical analysis on Hadoop can be done through specialized versions
of the SAS tools. I believe there is a version of R that can deal with Big
Data. Eventually the tool will generate a bunch of MapReduce jobs.
4. The system you are describing is a typical OLTP system and is not quite
suitable for Hadoop because of a combination of the reasons I quoted
above, but mostly due to the lack of ACID support and the query response
times of a Big Data system (which are computed in minutes and hours
instead of milliseconds and seconds). But what might be a close match
would be a system that collects genome data and pushes all its variables
into a Big Data system, then uses the power of the cluster to compute
genome maps (if that’s even a term …).
Hope this helps.
Regards.
Ellie K on January 8, 2012
Thank you so much for such a detailed and complete response. And
yes, that does help quite a bit!
Kiran My on January 18, 2012
Thanks for the detailed article on Hadoop. Hadoop can also be used for OLAP
and OLTP processing.
Please click Why Hadoop is introduced to know more on the basics of Hadoop