Big data overview

Post on 06-May-2015


Transcript of Big data overview

Presented by Ladislav Urban

www.syoncloud.com

Ladislav Urban, CEO of Syoncloud.

Syoncloud is a consulting company specialized in Big Data analytics and integration of existing systems.

WWW.SYONCLOUD.COM E-MAIL : INFO@SYONCLOUD.COM MOBILE : 077 9664 6474

CURRENT SOURCES OF DATA TO BE PROCESSED AND UTILIZED

Documents

Existing relational databases (CRM, ERP, Accounting, Billing)

E-mails and attachments

Imaging data (graphs, technical plans)

Sensor or device data

Internet search indexing

Log files

Social media



Telephone conversations

Videos

Pictures

Clickstreams (user clicks on web pages)

SCALE OF THE DATA


WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?

If relational databases do not scale to your traffic needs

If the normalized schema of your relational database has become too complex

If your business applications generate lots of supporting and temporary data

If the database schema is already denormalized in order to improve response times

If joins in relational databases slow the system down to a crawl


WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?

If we try to map complex hierarchical documents to database tables

If documents from different sources require a flexible schema

When more data beats clever algorithms

When flexibility is required for analytics

When we need queries for values at a specific time in history

When we need to utilize outputs from many existing systems


WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?

To analyze unstructured data such as documents and log files, or semi-structured data such as CSV files and forms


WHAT ARE THE STRONG POINTS OF RELATIONAL DATABASES?

SQL: a well-known, standardized language based on strong mathematical theory

Database schemas that do not need to be modified during production

Workloads where scalability is not required

Mature security features: role-based security, encrypted communications, row and field access control

Full support for ACID transactions (atomicity, consistency, isolation, durability)


Support for backup and rollback in case of data loss or corruption

Relational databases have development, tuning and monitoring tools with good GUIs


Batch vs Real-time Processing

Batch processing is used when real-time processing is not required, not possible, or too expensive. Typical uses:

Conversion of unstructured data such as text files and log files into more structured records

Transformation during ETL

Ad-hoc analysis of data

Data analytics applications and reporting


BATCH PROCESSING INFRASTRUCTURE


Batch processing systems utilize the Map/Reduce and HDFS implementations in Apache Hadoop.

It is possible to develop a batch-processing application in Java using only Hadoop, but we should mention the other important systems and how they fit into the Hadoop infrastructure.
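The Map/Reduce model itself is easy to state independently of Hadoop. As a rough illustration (a toy, single-process sketch in Python, not Hadoop code), the classic word-count job looks like this:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reducer: sum the counts for each distinct word."""
    # Hadoop sorts and groups map output by key before reducing; we do the same.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data overview", "big data tools"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'overview': 1, 'tools': 1}
```

In a real Hadoop job the map and reduce functions run on many machines, with HDFS storing the input and output and the framework handling the sort/shuffle between the two phases.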


APACHE AVRO

In order to process data we need information about data types and data schemas.

This information is used for serialization and deserialization in RPC communication, as well as for reading and writing files.


An RPC and serialization system that supports rich data structures

It uses JSON to define data types and protocols

It serializes data in a compact binary format

Avro supports schema evolution: it will handle missing, extra or modified fields
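For instance, an Avro schema is itself a JSON document. The record below is an illustrative example (the `LogEvent` type and its fields are not from the slides):

```json
{
  "type": "record",
  "name": "LogEvent",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "level", "type": "string"},
    {"name": "message", "type": "string"},
    {"name": "host", "type": ["null", "string"], "default": null}
  ]
}
```

Because the `host` field declares a default, readers using this schema can still decode older records written without it, which is the schema-evolution behaviour described above.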


SCRIPT LANGUAGE FOR MAP/REDUCE

We need a quick and simple way to create Map/Reduce transformations, analysis and applications.

We need a script language that can be used in scripts as well as interactively on the command line.


APACHE PIG


A high-level procedural language for querying large semi-structured data sets using Hadoop and the Map/Reduce platform

Pig simplifies the use of Hadoop by allowing SQL-like queries to run on distributed datasets.


An example of filtering a log file for WARN messages only; the script will run in parallel on a large cluster.

The script is automatically transformed into a Map/Reduce program and distributed across the Hadoop cluster.


messages = LOAD '/var/log/messages';
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
DUMP warns;


FILTER - Select a set of tuples from a relation based on a condition.
FOREACH - Iterate over the tuples of a relation, generating a data transformation.
GROUP - Group the data in one or more relations.
JOIN - Join two or more relations (inner or outer join).
LOAD - Load data from the file system.
ORDER - Sort a relation based on one or more fields.
SPLIT - Partition a relation into two or more relations.
STORE - Store data in the file system.

APACHE PIG - Relational operators that can be used in Pig
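As an illustration of a few of these operators working together, the hypothetical script below (the input path, field names and relation names are made up for this sketch) loads a log, filters it, and counts the matching records:

```pig
-- Hypothetical input: tab-separated (level, message) records
logs    = LOAD '/var/log/app.tsv' AS (level:chararray, msg:chararray);
errors  = FILTER logs BY level == 'ERROR';
grouped = GROUP errors BY level;
counts  = FOREACH grouped GENERATE group, COUNT(errors);
STORE counts INTO '/output/error_counts';
```

Each statement defines a new relation from the previous one; Pig compiles the whole chain into Map/Reduce jobs only when STORE (or DUMP) is reached.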


What if we want to use SQL to create Map/Reduce jobs?

Apache Hive is a data warehousing infrastructure based on Hadoop.

It provides a query language called HiveQL, which is based on SQL.


APACHE HIVE

Hive functions: data summarization, query and analysis.

It uses a system catalog called the Hive Metastore.

Hive is not designed for OLTP or real-time queries.

It is best used for batch jobs over large sets of append-only data.



The HiveQL language supports the ability to:

Filter rows from a table using a WHERE clause.

Select certain columns from the table using a SELECT clause.

Do equi-joins between two tables.

Evaluate aggregations on multiple GROUP BY columns for the data stored in a table.

Store the results of a query in another table.

Download the contents of a table to a local (NFS) directory.


The HiveQL language also supports the ability to:

Store the results of a query in an HDFS directory.

Manage tables and partitions (create, drop and alter).

Plug in custom scripts, in the language of choice, for custom map/reduce jobs.
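Several of these capabilities can appear in a single statement. The hypothetical HiveQL below (table and column names are illustrative, not from the slides) filters, aggregates with GROUP BY, and stores the result in a new table:

```sql
-- Count page views per user for one day and keep the result in a new table
CREATE TABLE page_view_summary AS
SELECT userid, COUNT(*) AS views
FROM page_views
WHERE dt = '2013-01-01'
GROUP BY userid;
```

Hive translates this into one or more Map/Reduce jobs over the files backing the page_views table.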


APACHE OOZIE

Map/Reduce jobs, Pig scripts and Hive queries should be simple and single-purpose.

How can we create complex ETL or data analysis in Hadoop? We chain scripts so that the output of one script is the input of another.

Complex workflows that represent real-world scenarios need a workflow engine such as Apache Oozie.


Oozie is a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop Map/Reduce jobs, Pig jobs and others.

An Oozie workflow is a collection of actions arranged in a DAG (Directed Acyclic Graph). This means that the second action cannot run until the first one has completed.

Oozie workflow definitions are written in hPDL (an XML process definition language similar to JBoss jBPM jPDL).


APACHE OOZIE

Workflow actions start jobs in the Hadoop cluster. Upon action completion, Hadoop calls back Oozie to notify it of the completion; at this point Oozie proceeds to the next action in the workflow.

Oozie workflows contain control-flow nodes (start, end, fail, decision, fork and join) and action nodes (the actual jobs).

Workflows can be parameterized (using variables like ${inputDir} within the workflow definition).


Example of OOZIE workflow definition


<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
    </kill>
    <end name='end'/>
</workflow-app>

APACHE Sqoop


Apache Sqoop is a tool for transferring bulk data between Apache Hadoop and structured datastores such as relational databases or data warehouses.

It can be used to populate tables in Hive and HBase.

Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.

Sqoop uses a connector-based architecture that supports plugins providing connectivity to external systems.


APACHE Sqoop

Sqoop includes connectors for databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2, plus a generic JDBC connector.

The transferred dataset is sliced into partitions, and a map-only job is launched with individual mappers responsible for transferring a slice of the dataset.

Sqoop uses the database metadata to infer data types.


Apache Sqoop – Import to HDFS


APACHE Sqoop

Sqoop example: import the ORDERS table from a MySQL database into a Hive table running on Hadoop.

sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** --hive-import

Sqoop takes care of populating the Hive metastore with appropriate metadata for the table and also invokes the necessary commands to load the table or partition.


Apache Sqoop – Export to Database
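Mirroring the import example above, an export back to the relational database might look like the sketch below (the table, database and HDFS directory are illustrative; `--export-dir` points at the HDFS data to push out):

```
sqoop export --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --export-dir /user/hive/warehouse/orders
```

As with import, the export runs as a map-only job, with each mapper writing one slice of the HDFS data into the target table.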

APACHE FLUME

▪ A distributed system to reliably collect, aggregate and move large amounts of log data from many different sources to a centralized data store.


APACHE FLUME


A Flume Source consumes events delivered to it by an external source, such as a web server.

When a Flume Source receives an event, it stores it in one or more Channels.

A Channel is a passive store that keeps the event until it is consumed by a Flume Sink.

The Sink removes the event from the Channel and puts it into an external repository such as HDFS.
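This source-to-channel-to-sink pipeline is wired up in a Flume agent's properties file. A minimal sketch (the agent name, component names, port and HDFS path are placeholders chosen for this example):

```
# One source, one in-memory channel, one HDFS sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: listen for raw text events on a TCP port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: passive buffer between source and sink
agent1.channels.ch1.type = memory

# Sink: write events into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.channel = ch1
```

Note that a source can fan out to several channels, while each sink drains exactly one channel.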

APACHE FLUME FEATURES


Flume allows you to build multi-hop flows where events travel through multiple agents before reaching the final destination.

It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.

Flume uses a transactional approach to guarantee reliable delivery of events.

Events are staged in the channel, which manages recovery from failure.

Flume supports source types such as Avro, Syslog and Netcat.

DISTCP - DISTRIBUTED COPY

DistCp (distributed copy) is a tool for large inter- and intra-cluster copying.

It uses Map/Reduce for its distribution, error handling, recovery and reporting.

It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
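A typical invocation copies a directory tree between two clusters (the NameNode addresses and paths below are placeholders):

```
hadoop distcp hdfs://nn1:8020/source/dir hdfs://nn2:8020/dest/dir
```

The file list is split across map tasks, so the copy throughput scales with the size of the cluster running the job.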


REAL-TIME PROCESSING – NOSQL DATABASES

▪ Document stores: Apache CouchDB, MongoDB

▪ Graph stores: Neo4j

▪ Key-value stores: Apache Cassandra, Riak

▪ Tabular stores: Apache HBase


CAP THEOREM


HBASE ARCHITECTURE

QUESTIONS & ANSWERS

www.syoncloud.com

info@syoncloud.com

Mobile : 077 9664 6474

LADISLAV URBAN