Brief Introduction about Hadoop and Core Services.


Description

I have given a quick introduction to Hadoop, Big Data, Business Intelligence, and the other core services and programs involved in using Hadoop as a successful tool for Big Data analysis. My understanding of Big Data: "data" becomes "information", big data turns information into "knowledge", "knowledge" becomes "wisdom", and "wisdom" turns into business or revenue, provided you use it promptly and in a timely manner.

Transcript of Brief Introduction about Hadoop and Core Services.

Page 1: Brief Introduction about Hadoop and Core Services.

Msquare Systems Inc.,

INFORMATION TECHNOLOGY & CONSULTING FIRM

Visit: http://www.msquaresystems.com/

Page 2: Brief Introduction about Hadoop and Core Services.

What is Hadoop?

Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment.

Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It consists of three main functions: storage, processing and resource management.

Page 3: Brief Introduction about Hadoop and Core Services.

Core services on Hadoop

MapReduce:

MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of machines in a reliable and fault-tolerant manner.

Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on compute clusters of commodity hardware.

The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks.

The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
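
For illustration, here is a minimal word-count job sketch against the standard Hadoop MapReduce Java API; the class name and the input/output paths passed on the command line are placeholders, not part of the original slides.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in each input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: the framework has already sorted map outputs by word;
    // here we simply sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /data/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}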

Page 4: Brief Introduction about Hadoop and Core Services.

HDFS:

Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage across large clusters of commodity servers.

This Apache Software Foundation project provides a fault-tolerant file system designed to run on commodity hardware.

The primary objective of HDFS is to store data reliably even in the presence of failures including NameNode failures, DataNode failures and network partitions.

The NameNode is a single point of failure for the HDFS cluster, while the DataNodes store the actual data in the Hadoop file system.

Core services on Hadoop
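
A small sketch of writing and reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem); the path /user/demo/hello.txt is an illustrative placeholder and the NameNode address is assumed to come from core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // placeholder path

        // Write: the client streams data to DataNodes; blocks are replicated automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}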

Page 5: Brief Introduction about Hadoop and Core Services.

Hadoop YARN:

YARN is a next-generation framework for Hadoop data processing that extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.

It is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.

All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines or racks of machines) are common and should therefore be handled automatically in software by the framework. Hadoop is now commonly considered to consist of a number of related projects as well.

Core services on Hadoop
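
As a rough sketch of YARN's role as a resource-management platform, a client can ask the ResourceManager about the applications it is running through the YarnClient API; this assumes yarn-site.xml is on the classpath.

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Ask the ResourceManager for all applications it knows about.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.printf("%s %s %s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }
        yarn.stop();
    }
}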

Page 6: Brief Introduction about Hadoop and Core Services.

Apache Tez:

Tez generalizes the MapReduce paradigm into a generic data-processing pipeline engine, envisioned as a low-level engine for higher-level abstractions such as Apache Hadoop MapReduce, Apache Pig, and Apache Hive.

It is a data-processing pipeline engine into which one can plug input, processing, and output implementations to perform arbitrary data processing.

Every 'task' in Tez has an Input to consume key/value pairs from, a Processor to process them, and an Output to collect the processed key/value pairs. The result is a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near-real-time big data processing.

Core services on Hadoop

Page 7: Brief Introduction about Hadoop and Core Services.

Apache Pig:

It is a high-level procedural language platform developed to simplify querying large data sets in Apache Hadoop and MapReduce.

Apache Pig features a “Pig Latin” language layer that enables SQL-like queries to be performed on distributed datasets within Hadoop applications.

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Hadoop Data Services
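
For illustration, Pig Latin statements can also be driven from Java through the PigServer API; the input path, field names, and aggregation below are made-up examples, not part of the original slides.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin statements as MapReduce jobs on the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("logs = LOAD '/data/web_logs' AS (user:chararray, bytes:long);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total_bytes;");
        // Materialize the result back into HDFS.
        pig.store("totals", "/data/bytes_per_user");
    }
}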

Page 8: Brief Introduction about Hadoop and Core Services.

Apache HBase:

HBase is the Hadoop database.

It is a distributed, scalable, big data store.

HBase began as a sub-project of Apache Hadoop and is used to provide real-time read and write access to your big data.

Hadoop Data Services
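
A minimal sketch of real-time read and write access with the HBase Java client; the table page_views and column family cf are hypothetical and assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum and other settings from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("page_views"))) {

            // Write one cell: row key "page:/home", column family "cf", qualifier "count".
            Put put = new Put(Bytes.toBytes("page:/home"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(42L));
            table.put(put);

            // Read it back in real time.
            Result result = table.get(new Get(Bytes.toBytes("page:/home")));
            long count = Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("count")));
            System.out.println("count = " + count);
        }
    }
}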

Page 9: Brief Introduction about Hadoop and Core Services.

Apache Hive:

Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.

Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.

At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Hive is an open source volunteer project under the Apache Software Foundation. Previously it was a subproject of Apache Hadoop, but has now graduated to become a top-level project of its own.

Hadoop Data Services
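
For illustration, a HiveQL query can be run over JDBC against HiveServer2; the host, port, table, and credentials below are placeholders, and the Hive JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; requires the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver).
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS views FROM page_views GROUP BY page")) {
            // HiveQL looks like SQL; Hive compiles it into jobs over the data in HDFS.
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
            }
        }
    }
}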

Page 10: Brief Introduction about Hadoop and Core Services.

Apache Flume:

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

It has a simple and flexible architecture based on streaming data flows.

It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.

It uses a simple, extensible data model that allows for online analytic applications. Flume's high-level architecture is focused on delivering a streamlined codebase that is easy to use and easy to extend.

Hadoop Data Services

Page 11: Brief Introduction about Hadoop and Core Services.

Hadoop Data Services

Apache Mahout:

Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification, often leveraging, but not limited to, the Hadoop platform.

Mahout's core algorithms for clustering, classification and collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.

Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.

Page 12: Brief Introduction about Hadoop and Core Services.

Apache Accumulo:

Accumulo is a sorted, distributed key/value store and is at the core of Sqrrl Enterprise.

It handles large amounts of structured, semi-structured, and unstructured data as a robust, scalable, and real-time data storage and retrieval system.

Fine-grained security controls allow organizations to control data at the cell-level and promote a data-centric security model without degrading performance.

Accumulo can support a wide variety of real-time analytics, including statistics and graph analytics, via Accumulo’s server-side programming framework called iterators.

Hadoop Data Services
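
A rough sketch against the classic Accumulo 1.x client API showing a single cell written with a visibility label; the instance name, ZooKeeper quorum, credentials, and table are placeholders and the table is assumed to already exist.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.security.ColumnVisibility;

public class AccumuloCellWrite {
    public static void main(String[] args) throws Exception {
        // Instance name, ZooKeeper quorum, credentials, and table name are placeholders.
        Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181")
                .getConnector("user", new PasswordToken("secret"));

        BatchWriter writer = conn.createBatchWriter("events", new BatchWriterConfig());
        // One cell with a visibility expression: only scans whose authorizations
        // satisfy "analyst" can read it (Accumulo's cell-level security).
        Mutation m = new Mutation("event-0001");
        m.put("meta", "type", new ColumnVisibility("analyst"), "login");
        writer.addMutation(m);
        writer.close();
    }
}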

Page 13: Brief Introduction about Hadoop and Core Services.

Apache Storm:

Storm is a distributed realtime computation system.

Storm provides a set of general primitives for doing realtime computation.

Storm is simple, can be used with any programming language, and is a lot of fun to use!

Hadoop Data Services
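
As a sketch of Storm's primitives (spouts, bolts, and topologies), here is a toy topology that splits sentences into words and runs in-process; the spout and bolt are made-up examples.

import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordSplitTopology {

    // Spout: the source of the stream; here it just emits the same sentence forever.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            collector.emit(new Values("the quick brown fox"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: splits each sentence into words as tuples flow through in real time.
    public static class SplitBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split("\\s+")) {
                collector.emit(new Values(word));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");
        // Run the topology in-process for demonstration purposes.
        new LocalCluster().submitTopology("word-split", new Config(), builder.createTopology());
    }
}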

Page 14: Brief Introduction about Hadoop and Core Services.

Apache Sqoop:

Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop.

It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

Hadoop Data Services

Page 15: Brief Introduction about Hadoop and Core Services.

Apache HCatalog:

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid.

HCatalog is a set of interfaces that open up access to Hive's metastore for tools inside and outside of the Hadoop grid.

It includes providing a shared schema and data type mechanism for Hadoop tools.

HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored.

Hadoop Data Services
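
As a rough sketch, assuming a Hive table named page_views already exists in the default database, a MapReduce job can read that table through HCatInputFormat instead of pointing at file paths and formats directly; the class, table, and output path below are placeholders.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatTableCount {

    // The mapper receives HCatRecord values directly from the Hive/HCatalog table.
    public static class CountMapper
            extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        protected void map(WritableComparable key, HCatRecord value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("rows"), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "count rows via HCatalog");
        job.setJarByClass(HCatTableCount.class);

        // Point the job at a table in the shared metastore instead of at file paths;
        // HCatalog resolves location, storage format, and schema for us.
        HCatInputFormat.setInput(job, "default", "page_views");
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(CountMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/page_views_count"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}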

Page 16: Brief Introduction about Hadoop and Core Services.

Apache ZooKeeper:

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers, known as znodes.

Every znode is identified by a path, with path elements separated by a slash (“/”). Aside from the root, every znode has a parent, and a znode cannot be deleted if it has children.

The service is replicated over a set of machines, and each machine maintains an in-memory image of the data tree along with transaction logs.

Clients connect to a single ZooKeeper server and maintain a TCP connection through which they send requests and receive responses.

Hadoop Operational Services
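
A small sketch with the ZooKeeper Java client: connect, create a couple of znodes, and read a data register back; the connect string and paths are placeholders.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper server (placeholder host:port) and wait for the session.
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create znodes under the root; each is identified by a slash-separated path.
        if (zk.exists("/app", false) == null) {
            zk.create("/app", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // Throws NodeExistsException if the znode is already present.
        zk.create("/app/config", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read the data register back.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}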

Page 17: Brief Introduction about Hadoop and Core Services.

Apache Falcon:

Falcon is a framework for simplifying data management and pipeline processing in Apache Hadoop.

It enables users to automate the movement and processing of datasets for ingest, pipelines, disaster recovery and data retention use cases.

Instead of hard-coding complex dataset and pipeline processing logic, users can now rely on Apache Falcon for these functions, maximizing reuse and consistency across Hadoop applications.

Falcon simplifies the development and management of data processing pipelines by introducing a higher layer of abstraction for users to work with.

Hadoop Operational Services

Page 18: Brief Introduction about Hadoop and Core Services.

Apache Ambari:

Apache Ambari is a 100-percent open source operational framework for provisioning, managing and monitoring Apache Hadoop clusters.

Ambari includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters.

Ambari includes an intuitive Web interface that allows you to easily provision, configure and test all the Hadoop services and core components.

Ambari provides tools to simplify cluster management. The Web interface allows you to start/stop/test Hadoop services, change configurations and manage ongoing growth of your cluster.

Hadoop Operational Services

Page 19: Brief Introduction about Hadoop and Core Services.

Hadoop Operational Services

Apache Knox:

The Knox Gateway (“Knox”) is a system that provides a single point of authentication and access for Apache™ Hadoop® services in a cluster.

The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster.

Knox runs as a server (or a cluster of servers) that serves one or more Hadoop clusters.

Page 20: Brief Introduction about Hadoop and Core Services.

Hadoop Operational Services

Apache Oozie:

Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs.

Oozie combines multiple jobs sequentially into one logical unit of work.

It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop.

Apache Oozie allows Hadoop administrators to build complex data transformations out of multiple component tasks.

Apache Oozie helps administrators derive more value from their Hadoop investment.
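
For illustration, a workflow job can be submitted programmatically through the Oozie Java client; the server URL, HDFS application path, and property values below are placeholders.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // URL of the Oozie server and the HDFS path of the workflow app are placeholders.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/workflows/etl");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow, then print its current status.
        String jobId = oozie.run(conf);
        System.out.println("Workflow job " + jobId + " is " + oozie.getJobInfo(jobId).getStatus());
    }
}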

Page 21: Brief Introduction about Hadoop and Core Services.

What Hadoop can, and can't do

What Hadoop can't do

You can't use Hadoop for structured, transactional data (for example, OLTP workloads).

What Hadoop can do

You can use Hadoop for Big Data

Page 22: Brief Introduction about Hadoop and Core Services.

Support & Partner

Getting started with Hadoop or need support? Contact:

Muthu Natarajan

[email protected]

www.msquaresystems.com

Phone: 212-941-6000/703-222-5500