Brief Introduction about Hadoop and Core Services.


Description

I have given a quick introduction to Hadoop, Big Data, Business Intelligence, and the other core services and programs involved in using Hadoop as a successful tool for Big Data analysis. My understanding of Big Data: "data" becomes "information", big data turns information into "knowledge", "knowledge" becomes "wisdom", and "wisdom" turns into business or revenue, provided you use it promptly and in a timely manner.

Transcript of Brief Introduction about Hadoop and Core Services.

Page 1: Brief Introduction about Hadoop and Core Services.

Msquare Systems Inc.,

INFORMATION TECHNOLOGY & CONSULTING FIRM

Visit: http://www.msquaresystems.com/

Page 2: Brief Introduction about Hadoop and Core Services.

What is Hadoop?

Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment.

Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It consists of three main functions: storage, processing and resource management.

Page 3: Brief Introduction about Hadoop and Core Services.

Core services on Hadoop

MapReduce:

MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of machines in a reliable and fault-tolerant manner.

Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on compute clusters of commodity hardware.

The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks.

The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
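
For illustration, here is a minimal word-count job sketch against the standard Hadoop MapReduce Java API; the class name and the input/output paths passed on the command line are placeholders, not part of the original slides.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in each input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: the framework has already sorted map outputs by word;
    // here we simply sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /data/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}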

Page 4: Brief Introduction about Hadoop and Core Services.

HDFS:

Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage across large clusters of commodity servers.

This Apache Software Foundation project provides a fault-tolerant file system designed to run on commodity hardware.

The primary objective of HDFS is to store data reliably even in the presence of failures including NameNode failures, DataNode failures and network partitions.

The NameNode is a single point of failure for the HDFS cluster, while the DataNodes store the actual data in the Hadoop file system.

Core services on Hadoop
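
A small sketch of writing and reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem); the path /user/demo/hello.txt is an illustrative placeholder and the NameNode address is assumed to come from core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // placeholder path

        // Write: the client streams data to DataNodes; blocks are replicated automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}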

Page 5: Brief Introduction about Hadoop and Core Services.

Hadoop YARN:

YARN is a next-generation framework for Hadoop data processing that extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.

It is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.

All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines or racks of machines) are common and should therefore be handled automatically in software by the framework. Hadoop is now commonly considered to consist of a number of related projects as well.

Core services on Hadoop
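
As a rough sketch of YARN's role as a resource-management platform, a client can ask the ResourceManager about the applications it is running through the YarnClient API; this assumes yarn-site.xml is on the classpath.

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Ask the ResourceManager for all applications it knows about.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.printf("%s %s %s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }
        yarn.stop();
    }
}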

Page 6: Brief Introduction about Hadoop and Core Services.

Apache Tez:

Tez generalizes the MapReduce paradigm into a generic data-processing pipeline engine, envisioned as a low-level engine for higher-level abstractions such as Apache Hadoop MapReduce, Apache Pig, and Apache Hive.

It is a data-processing pipeline engine into which one can plug input, processing, and output implementations to perform arbitrary data processing.

Every 'task' in Tez has an Input to consume key/value pairs from, a Processor to process them, and an Output to collect the processed key/value pairs. The result is a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near-real-time big data processing.

Core services on Hadoop

Page 7: Brief Introduction about Hadoop and Core Services.

Apache Pig:

It is a high-level procedural language platform developed to simplify querying large data sets in Apache Hadoop and MapReduce.

Apache Pig features a “Pig Latin” language layer that enables SQL-like queries to be performed on distributed datasets within Hadoop applications.

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Hadoop Data Services
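
For illustration, Pig Latin statements can also be driven from Java through the PigServer API; the input path, field names, and aggregation below are made-up examples, not part of the original slides.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin statements as MapReduce jobs on the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("logs = LOAD '/data/web_logs' AS (user:chararray, bytes:long);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total_bytes;");
        // Materialize the result back into HDFS.
        pig.store("totals", "/data/bytes_per_user");
    }
}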

Page 8: Brief Introduction about Hadoop and Core Services.

Apache HBase:

HBase is the Hadoop database.

It is a distributed, scalable, big data store.

HBase began as a sub-project of Apache Hadoop and is used to provide real-time read and write access to your big data.

Hadoop Data Services
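
A minimal sketch of real-time read and write access with the HBase Java client; the table page_views and column family cf are hypothetical and assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum and other settings from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("page_views"))) {

            // Write one cell: row key "page:/home", column family "cf", qualifier "count".
            Put put = new Put(Bytes.toBytes("page:/home"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(42L));
            table.put(put);

            // Read it back in real time.
            Result result = table.get(new Get(Bytes.toBytes("page:/home")));
            long count = Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("count")));
            System.out.println("count = " + count);
        }
    }
}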

Page 9: Brief Introduction about Hadoop and Core Services.

Apache Hive:

Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.

Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.

At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Hive is an open source volunteer project under the Apache Software Foundation. Previously it was a subproject of Apache Hadoop, but has now graduated to become a top-level project of its own.

Hadoop Data Services
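
For illustration, a HiveQL query can be run over JDBC against HiveServer2; the host, port, table, and credentials below are placeholders, and the Hive JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; requires the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver).
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS views FROM page_views GROUP BY page")) {
            // HiveQL looks like SQL; Hive compiles it into jobs over the data in HDFS.
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
            }
        }
    }
}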

Page 10: Brief Introduction about Hadoop and Core Services.

Apache Flume:

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

It has a simple and flexible architecture based on streaming data flows.

It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.

It uses a simple, extensible data model that allows for online analytic applications. Flume's high-level architecture is focused on delivering a streamlined codebase that is easy to use and easy to extend.

Hadoop Data Services

Page 11: Brief Introduction about Hadoop and Core Services.

Hadoop Data Services

Apache Mahout:

Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification, often leveraging, but not limited to, the Hadoop platform.

Mahout's core algorithms for clustering, classification and collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.

Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.

Page 12: Brief Introduction about Hadoop and Core Services.

Apache Accumulo:

Accumulo is a sorted, distributed key/value store and is at the core of Sqrrl Enterprise.

It handles large amounts of structured, semi-structured, and unstructured data as a robust, scalable, and real-time data storage and retrieval system.

Fine-grained security controls allow organizations to control data at the cell-level and promote a data-centric security model without degrading performance.

Accumulo can support a wide variety of real-time analytics, including statistics and graph analytics, via Accumulo’s server-side programming framework called iterators.

Hadoop Data Services
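
A rough sketch against the classic Accumulo 1.x client API showing a single cell written with a visibility label; the instance name, ZooKeeper quorum, credentials, and table are placeholders and the table is assumed to already exist.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.security.ColumnVisibility;

public class AccumuloCellWrite {
    public static void main(String[] args) throws Exception {
        // Instance name, ZooKeeper quorum, credentials, and table name are placeholders.
        Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181")
                .getConnector("user", new PasswordToken("secret"));

        BatchWriter writer = conn.createBatchWriter("events", new BatchWriterConfig());
        // One cell with a visibility expression: only scans whose authorizations
        // satisfy "analyst" can read it (Accumulo's cell-level security).
        Mutation m = new Mutation("event-0001");
        m.put("meta", "type", new ColumnVisibility("analyst"), "login");
        writer.addMutation(m);
        writer.close();
    }
}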

Page 13: Brief Introduction about Hadoop and Core Services.

Apache Storm:

Storm is a distributed realtime computation system.

Storm provides a set of general primitives for doing realtime computation.

Storm is simple, can be used with any programming language, and is a lot of fun to use!

Hadoop Data Services
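
As a sketch of Storm's primitives (spouts, bolts, and topologies), here is a toy topology that splits sentences into words and runs in-process; the spout and bolt are made-up examples.

import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordSplitTopology {

    // Spout: the source of the stream; here it just emits the same sentence forever.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            collector.emit(new Values("the quick brown fox"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: splits each sentence into words as tuples flow through in real time.
    public static class SplitBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split("\\s+")) {
                collector.emit(new Values(word));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");
        // Run the topology in-process for demonstration purposes.
        new LocalCluster().submitTopology("word-split", new Config(), builder.createTopology());
    }
}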

Page 14: Brief Introduction about Hadoop and Core Services.

Apache Sqoop:

Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop.

It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

Hadoop Data Services

Page 15: Brief Introduction about Hadoop and Core Services.

Apache HCatalog:

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid.

HCatalog is a set of interfaces that open up access to Hive's metastore for tools inside and outside of the Hadoop grid.

It includes providing a shared schema and data type mechanism for Hadoop tools.

HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored.

Hadoop Data Services
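
As a rough sketch, assuming a Hive table named page_views already exists in the default database, a MapReduce job can read that table through HCatInputFormat instead of pointing at file paths and formats directly; the class, table, and output path below are placeholders.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatTableCount {

    // The mapper receives HCatRecord values directly from the Hive/HCatalog table.
    public static class CountMapper
            extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        protected void map(WritableComparable key, HCatRecord value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("rows"), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "count rows via HCatalog");
        job.setJarByClass(HCatTableCount.class);

        // Point the job at a table in the shared metastore instead of at file paths;
        // HCatalog resolves location, storage format, and schema for us.
        HCatInputFormat.setInput(job, "default", "page_views");
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(CountMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/page_views_count"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}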

Page 16: Brief Introduction about Hadoop and Core Services.

Apache ZooKeeper:

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers, known as znodes.

Every znode is identified by a path, with path elements separated by a slash (“/”). Aside from the root, every znode has a parent, and a znode cannot be deleted if it has children.

The service is replicated over a set of machines, and each machine maintains an in-memory image of the data tree along with transaction logs.

Clients connect to a single ZooKeeper server and maintain a TCP connection through which they send requests and receive responses.

Hadoop Operational Services
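
A small sketch with the ZooKeeper Java client: connect, create a couple of znodes, and read a data register back; the connect string and paths are placeholders.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper server (placeholder host:port) and wait for the session.
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create znodes under the root; each is identified by a slash-separated path.
        if (zk.exists("/app", false) == null) {
            zk.create("/app", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // Throws NodeExistsException if the znode is already present.
        zk.create("/app/config", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read the data register back.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}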

Page 17: Brief Introduction about Hadoop and Core Services.

Apache Falcon:

Falcon is a framework for simplifying data management and pipeline processing in Apache Hadoop.

It enables users to automate the movement and processing of datasets for ingest, pipelines, disaster recovery and data retention use cases.

Instead of hard-coding complex dataset and pipeline processing logic, users can now rely on Apache Falcon for these functions, maximizing reuse and consistency across Hadoop applications.

Falcon simplifies the development and management of data processing pipelines by introducing a higher layer of abstraction for users to work with.

Hadoop Operational Services

Page 18: Brief Introduction about Hadoop and Core Services.

Apache Ambari:

Apache Ambari is a 100-percent open source operational framework for provisioning, managing and monitoring Apache Hadoop clusters.

Ambari includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters.

Ambari includes an intuitive Web interface that allows you to easily provision, configure and test all the Hadoop services and core components.

Ambari provides tools to simplify cluster management. The Web interface allows you to start/stop/test Hadoop services, change configurations and manage ongoing growth of your cluster.

Hadoop Operational Services

Page 19: Brief Introduction about Hadoop and Core Services.

Hadoop Operational Services

Apache Knox:

The Knox Gateway (“Knox”) is a system that provides a single point of authentication and access for Apache™ Hadoop® services in a cluster.

The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster.

Knox runs as a server (or a cluster of servers) that serves one or more Hadoop clusters.

Page 20: Brief Introduction about Hadoop and Core Services.

Hadoop Operational Services

Apache Oozie:

Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs.

Oozie combines multiple jobs sequentially into one logical unit of work.

It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop.

Apache Oozie allows Hadoop administrators to build complex data transformations out of multiple component tasks.

Apache Oozie helps administrators derive more value from their Hadoop investment.
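
For illustration, a workflow job can be submitted programmatically through the Oozie Java client; the server URL, HDFS application path, and property values below are placeholders.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // URL of the Oozie server and the HDFS path of the workflow app are placeholders.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/workflows/etl");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow, then print its current status.
        String jobId = oozie.run(conf);
        System.out.println("Workflow job " + jobId + " is " + oozie.getJobInfo(jobId).getStatus());
    }
}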

Page 21: Brief Introduction about Hadoop and Core Services.

What Hadoop can, and can't do

What Hadoop can't do

You can't use Hadoop for structured, transactional data (for example, OLTP workloads).

What Hadoop can do

You can use Hadoop for Big Data

Page 22: Brief Introduction about Hadoop and Core Services.

Support & Partner

Getting started with Hadoop or need support? Contact:

Muthu Natarajan

[email protected]

www.msquaresystems.com

Phone: 212-941-6000/703-222-5500