Course Topics
Week 1 – Understanding Big Data – Introduction to HDFS – Playing around with a Cluster – Data Loading Techniques
Week 2 – MapReduce Basics, Types and Formats – Use Cases for MapReduce – Analytics using Pig – Understanding Pig Latin
Week 3 – Analytics using Hive – Understanding HiveQL – NoSQL Databases – Understanding HBase
Week 4 – ZooKeeper, Sqoop, Flume – Debugging MapReduce Programs in Eclipse – Real-World Datasets and Analysis – Planning a Career in Big Data
What is Big Data?
Facebook Example
Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network.
An average of 3.2 billion likes and comments are posted on Facebook every day.
Twitter Example
Twitter has over 500 million registered users.
The USA's 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK, and Indonesia.
79% of US Twitter users are more likely to recommend brands they follow.
67% of US Twitter users are more likely to buy from brands they follow.
57% of all companies that use social media for business use Twitter.
Other Industrial Usecases
• Insurance
• Healthcare
• Genome Sequencing
• Utilities
Hadoop Users
http://wiki.apache.org/hadoop/PoweredBy
Data volume is growing exponentially
• Estimated Global Data Volume:
  – 2011: 1.8 ZB
  – 2015: 7.9 ZB
• The world's information doubles every two years
• Over the next 10 years:
  – The number of servers worldwide will grow by 10x
  – The amount of information managed by enterprise data centers will grow by 50x
  – The number of "files" enterprise data centers handle will grow by 75x
Source: http://www.emc.com/leadership/programs/digital-universe.htm, based on the 2011 IDC Digital Universe Study
Unstructured Data is Exploding
Why DFS?
Reading 1 TB of data:
• 1 Machine – 4 I/O channels, each channel 100 MB/s → 45 minutes
• 10 Machines – 4 I/O channels each, each channel 100 MB/s → 4.5 minutes
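The slide's timings follow from simple throughput arithmetic: total bytes divided by aggregate channel bandwidth. A minimal sketch of that calculation is below; the unit conventions (1 TB taken as 2**40 bytes, 1 MB as 10**6 bytes) are my assumption to roughly reproduce the 45-minute figure, and `read_minutes` is a hypothetical helper, not from the slides.

```python
# Back-of-the-envelope check of the slide's read times (a sketch;
# unit conventions are assumed, not stated in the deck).
DATA_BYTES = 2 ** 40                   # 1 TB of data to scan (assumed binary TB)
CHANNELS_PER_MACHINE = 4               # 4 I/O channels per machine
CHANNEL_BYTES_PER_SEC = 100 * 10 ** 6  # each channel reads 100 MB/s

def read_minutes(machines: int) -> float:
    """Minutes to read DATA_BYTES when machines scan disjoint parts in parallel."""
    throughput = machines * CHANNELS_PER_MACHINE * CHANNEL_BYTES_PER_SEC
    return DATA_BYTES / throughput / 60

print(f"1 machine:   {read_minutes(1):.1f} min")   # ≈ 45.8 min
print(f"10 machines: {read_minutes(10):.1f} min")  # ≈ 4.6 min
```

The point of the slide survives any rounding: splitting the scan across 10 machines divides the read time by 10, which is exactly the parallelism a distributed file system makes practical.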
What Is a Distributed File System (DFS)?
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
Companies using Hadoop:
- Yahoo
- Amazon
- AOL
- IBM
- And many more at http://wiki.apache.org/hadoop/PoweredBy
Hadoop Eco-System
Hadoop Core Components:
- HDFS – Hadoop Distributed File System (storage)
- MapReduce (processing)
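The processing model named above can be previewed before Week 2's details: map emits (key, value) pairs, a shuffle groups values by key, and reduce aggregates each group. The following is a minimal in-memory word-count sketch of that model, not the Hadoop Java API; the function names are my own illustration.

```python
# A toy simulation of the MapReduce model (map -> shuffle -> reduce),
# counting words across lines of text. Not the Hadoop API.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word.lower(), 1)

def run_mapreduce(lines):
    """Shuffle pairs into groups by key, then reduce each group by summing."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)          # shuffle: group values by key
    return {key: sum(values) for key, values in groups.items()}  # reduce

print(run_mapreduce(["Hadoop stores data", "Hadoop processes data"]))
```

In real Hadoop the map and reduce functions run on many machines, with HDFS supplying the input splits and the framework handling the shuffle over the network; the programming model, however, is exactly this simple.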
Any questions? See you in the next class.
Thank you. Sainagaraju Vaduka