Posted on 17-Aug-2015
An airline jet collects 10 terabytes of sensor data
for every 30 minutes of flying time.
The NYSE generates about one terabyte of new trade
data per day, which is analyzed to determine
trends for optimal trades.
Twitter has over 500 million registered users.
79% of US Twitter users are more likely to buy from brands
they follow.
67% of US Twitter users are more likely to buy from brands
they follow.
57% of all companies that use social media for business use
Twitter.
“Big Data is the frontier of a firm's ability to
store, process, and access (SPA) all the data
it needs to operate effectively, make
decisions, reduce risks, and serve
customers.”
Byte : one grain of rice
Kilobyte : cup of rice
Megabyte : 8 bags of rice
Gigabyte : 3 semi trucks
Terabyte : 2 container ships
Petabyte : blankets Manhattan
Exabyte : blankets west coast states
Zettabyte : fills the Pacific Ocean
Yottabyte : an Earth-size rice ball!
Scale labels shown alongside the list: Hobbyist, Desktop, Internet, Big Data, The Future?
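To make the scale concrete, here is a minimal Python sketch that expresses a raw byte count in the units listed above. It assumes binary multiples (each unit is 1024 times the previous one), and the function name `humanize_bytes` is made up for this illustration.

```python
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def humanize_bytes(n_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1.

    Assumption: each unit is 1024 times the previous one.
    """
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}s"
        value /= 1024


# The airline example: 10 terabytes per 30 minutes of flying time.
ten_terabytes = 10 * 1024 ** 4
print(humanize_bytes(ten_terabytes))          # 10.0 terabytes
print(humanize_bytes(ten_terabytes * 2 * 8))  # an 8-hour flight: 160.0 terabytes
```

At that rate a single 8-hour flight would already produce data in the hundreds of terabytes.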
Process data in parallel? Not simple.
An idea: parallelism
A problem: parallelism is hard
Synchronization
Deadlock
Limited bandwidth
Timing issues and coordination
Split and aggregation
Computers are complicated
Driver failure
Data availability
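The split-and-aggregate step above can be sketched in a few lines of Python. This is only an illustration on a single machine using the standard library's thread pool; the helper names `split`, `partial_sum`, and `parallel_sum` are invented here, and threads in CPython will not actually speed up this CPU-bound work. Each chunk is processed without shared state, which is what lets the sketch avoid locks and, with them, the synchronization and deadlock problems listed above.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum(chunk):
    """Work done independently on one chunk (no shared state, so no locks)."""
    return sum(chunk)


def parallel_sum(data, n_workers=4):
    """Split the input, process the chunks concurrently, then aggregate."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)  # aggregation step


print(parallel_sum(list(range(1, 101))))  # 5050
```

Real workloads are harder than this: chunks live on different machines, workers fail midway, and moving data is limited by bandwidth.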
Hey! We have distributed computing!
Yes, we have distributed computing, but it comes with
challenges of its own:
Resource sharing
Concurrency
Fault tolerance
Heterogeneity
Transparency
To address most (but not all) of these challenges, Hadoop
comes in.
Hadoop origin
• An elephant can’t jump, but it can carry a heavy load!
• Apache Hadoop is a framework that allows for the distributed
processing of large data sets across clusters of commodity
computers using a simple programming model. It is designed to scale
up from single servers to thousands of machines, each providing
computation and storage.
• Hadoop is an open-source implementation of Google's
MapReduce and GFS (Google File System, a distributed file system).
• Hadoop was created by Doug Cutting, the creator of Apache
Lucene, the widely used text search library.
Hadoop Architecture
Hadoop is designed and built on two independent frameworks.
Hadoop = HDFS + MapReduce
HDFS (storage and file system): HDFS is a reliable distributed file system
that provides high-throughput access to data.
MapReduce (processing): MapReduce is a framework for performing
high-performance distributed data processing using the divide-and-aggregate
programming paradigm.
Hadoop has a master/slave architecture for both storage and
processing.
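The divide-and-aggregate paradigm can be illustrated with the classic word count, written here in plain Python rather than against the Hadoop API. The function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are chosen for this sketch only, and everything runs in one process; in Hadoop the map and reduce tasks would run on different machines over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map over every split, then shuffle and reduce the results."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))


splits = ["big data is big", "hadoop stores big data"]
print(word_count(splits))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'stores': 1}
```

Because each call to `map_phase` looks at only one split and each key is reduced independently, both phases can be spread across many machines.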
Hadoop Master and Slave Architecture
The components of HDFS are:
NameNode
DataNode
Secondary NameNode
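As a rough mental model of how these components divide the work: the NameNode keeps only metadata (which blocks make up a file and where their replicas live), while DataNodes hold the block contents. The Python sketch below is a toy model, not the Hadoop API. The class names, the 4-byte block size, the replication factor of 2, and the round-robin placement are all simplifications chosen for illustration, and in real HDFS clients stream data directly to DataNodes rather than through the NameNode. The Secondary NameNode, which periodically checkpoints the NameNode's metadata, is left out.

```python
class DataNode:
    """Stores block contents, keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Keeps metadata only: file -> block ids -> DataNodes holding replicas."""

    def __init__(self, datanodes, block_size=4, replication=2):
        self.datanodes = datanodes
        self.block_size = block_size   # toy value; real blocks are far larger
        self.replication = replication
        self.files = {}                # filename -> list of block ids
        self.locations = {}            # block id -> list of DataNode names

    def write(self, filename, data):
        block_ids = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{filename}#{index}"
            # Round-robin placement (a simplification of real placement policy).
            targets = [self.datanodes[(index + r) % len(self.datanodes)]
                       for r in range(self.replication)]
            for node in targets:
                node.blocks[block_id] = data[offset:offset + self.block_size]
            self.locations[block_id] = [node.name for node in targets]
            block_ids.append(block_id)
        self.files[filename] = block_ids

    def read(self, filename):
        by_name = {node.name: node for node in self.datanodes}
        parts = []
        for block_id in self.files[filename]:
            replica = by_name[self.locations[block_id][0]]  # any replica works
            parts.append(replica.blocks[block_id])
        return b"".join(parts)


nodes = [DataNode(f"dn{i}") for i in range(3)]
namenode = NameNode(nodes)
namenode.write("log.txt", b"hello hadoop")
print(namenode.locations)
print(namenode.read("log.txt"))  # b'hello hadoop'
```

Since every block lives on two DataNodes in this sketch, losing any single DataNode still leaves one copy of each block readable.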
Who uses Hadoop?
Amazon/A9
IBM
Joost
Last.fm
New York Times
Powerset
Yahoo!
Cassandra
• Apache Cassandra is an open source distributed database
management system designed to handle large amounts of data
across many commodity servers, providing high availability with no
single point of failure. Cassandra offers robust support for clusters
spanning multiple datacenters.
Main features
Cassandra places a high value on performance.
In 2012, University of Toronto researchers studying NoSQL systems concluded that "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments."
Decentralized
Supports replication and multi-datacenter replication
Scalability
Fault-tolerant
Query language
MapReduce support