Big Data and Hadoop Ecosystem
Big Data and Hadoop
Presenter: Rajkumar Singh
http://rajkrrsingh.blogspot.com/
http://in.linkedin.com/in/rajkrrsingh
Big Data and Hadoop Introduction
• Volume – Facebook, Google Plus, Twitter, LinkedIn, stock exchanges, healthcare, telecom
• Variety – structured, semi-structured, and unstructured data
• Velocity – Facebook, stock exchanges, healthcare, telecom, mobile devices, GPS, security infrastructure
Challenges in Big Data
• Storage – petabytes (PB) of data
• Processing – in a timely manner
• Variety of data – structured, semi-structured, and unstructured
• Cost
Hadoop evolved to overcome the Big Data challenges
• Cost-effective – runs on commodity hardware
• Big clusters – on the order of 1000 nodes, providing both storage and processing
• Parallel processing – MapReduce
• Big storage – storage per node × number of nodes ÷ replication factor (RF)
• Failover mechanism – automatic failover
• Data distribution
• MapReduce framework
• Moves code to the data, rather than data to the code
• Heterogeneous hardware (IBM, HP, AIX, Oracle machines of any memory and CPU configuration)
• Scalable
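The usable-capacity formula above can be sketched in a few lines. The node count, disk size, and replication factor below are illustrative assumptions, not figures from the deck:

```python
def usable_capacity_tb(nodes, storage_per_node_tb, replication_factor):
    """Raw cluster storage divided by the replication factor
    gives the usable HDFS capacity (every block is stored RF times)."""
    return nodes * storage_per_node_tb / replication_factor

# Example: 100 nodes with 4 TB of disk each, default replication factor of 3
print(usable_capacity_tb(100, 4, 3))  # about 133.3 TB usable out of 400 TB raw
```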
What is Hadoop
• A Java framework to process enormous amounts of data
Hadoop core:
• HDFS (storage)
• Programming construct (MapReduce)
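Hadoop itself is a Java framework, but the map-and-reduce programming construct can be sketched in a few lines of Python. The word-count example below is purely illustrative, not Hadoop API code:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

data = ["big data and hadoop", "hadoop ecosystem"]
print(reduce_phase(map_phase(data)))
# {'big': 1, 'data': 1, 'and': 1, 'hadoop': 2, 'ecosystem': 1}
```

In real Hadoop the map and reduce functions run on different machines, with the framework handling the shuffle between them.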
Hadoop Sub-Projects
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• ZooKeeper™: A high-performance coordination service for distributed applications.
HDFS: Use Cases
Good fit:
• Very large files
• Reading/streaming data access – read data in large volumes; write once, read frequently
Poor fit:
• Expensive hardware (HDFS is designed for commodity machines)
• Low-latency access
• Lots of small files
• Parallel writes / arbitrary reads (HDFS allows a single writer and appends only)
HDFS Building Blocks
A file is split into fixed-size blocks, e.g. a 1 GB file = 1024 MB / 128 MB = 8 blocks.
Default block size: 64 MB (Hadoop 1.x) or 128 MB (Hadoop 2.x).
Small files: a 100 MB file, smaller than the 128 MB block size, still occupies only 1 HDFS block of size 100 MB – blocks are not padded to full size, so storage is optimized.
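The block arithmetic above can be sketched as follows, assuming the 128 MB default block size of Hadoop 2.x:

```python
import math

BLOCK_SIZE_MB = 128  # Hadoop 2.x default block size

def hdfs_blocks(file_size_mb):
    """Number of HDFS blocks a file occupies: full blocks plus one
    final, possibly smaller, block (blocks are not padded)."""
    return max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))

print(hdfs_blocks(1024))  # 1 GB file -> 8 blocks
print(hdfs_blocks(100))   # 100 MB file -> 1 block of 100 MB
```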
HDFS Daemon Services
• Name Node
• Secondary Name Node
• Data Node
HDFS follows a master/slave architecture, modeled on the Google File System (GFS).
HDFS Write
A 128 MB block with replication factor (RF) = 3 is written to three data nodes, e.g. D1, D2, D4.
Data nodes: D1, D2, D3, D4
File 1 replicas: D1, D2, D4
File 2 replicas: D1, D2, D3
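The placement above can be sketched with a toy placement function. Real HDFS uses rack-aware placement, so the random choice below is only an illustrative assumption:

```python
import random

def place_replicas(data_nodes, replication_factor=3):
    """Pick `replication_factor` distinct data nodes to hold copies of
    one block (toy model; real HDFS placement is rack-aware)."""
    return random.sample(data_nodes, replication_factor)

nodes = ["D1", "D2", "D3", "D4"]
print(place_replicas(nodes))  # three distinct node names, e.g. ['D1', 'D2', 'D4']
```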