The Open Source Stack of Big Data Technology
Muhammad Rifqi Ma'arif
openSUSE Asia Summit 2016
Presentation Outline
• Big Data – Formal Introduction
• The Technological Stack
• Implementing Big Data Tech.
• Beyond Hadoop
Big Data – Formal Introduction
The World is Changing
• Old World → A few companies generate data and the rest of the world consumes it.
• Current and Future World → All of us generate data and all of us consume it.
The Four Elements
The Technological Stack
Hadoop Ecosystem – The Ancestor
Hadoop in a Nutshell
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005
• A data processing framework
• Large-scale data
• Distributed manner
• Horizontal scaling on commodity hardware
HDFS (Hadoop Distributed File System)
• Scalable distributed filesystem
• Distributes data across the local disks of several nodes running on low-cost commodity hardware.
• HDFS design goals:
‒ Data replication – helps handle hardware failures
‒ Move computation close to data
• Single name node as a master (the boss) and multiple data nodes that listen to the master and manage their own logical storage
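The replication goal can be illustrated with a toy model. The sketch below is plain Python, not the HDFS API: it cuts a file into fixed-size blocks and assigns each block to several data nodes, with a dictionary standing in for the name node's metadata. The block size, replication factor, node names, and round-robin placement rule are all assumptions made for illustration; real HDFS uses its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, data_nodes: list, replication: int) -> dict:
    """Toy name-node metadata: block index -> data nodes holding a replica.

    Round-robin placement is an assumption for illustration only.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(data_nodes)
        placement[block_id] = [
            data_nodes[(start + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement: dict, failed_node: str) -> bool:
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in nodes) for nodes in placement.values()
    )


# Hypothetical 1000-byte file on a hypothetical four-node cluster.
blocks = split_into_blocks(b"x" * 1000, block_size=256)
nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(len(blocks), nodes, replication=3)
```

With three replicas per block, losing any single data node leaves every block readable; with a replication factor of one, the same failure loses data.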
Moving computation to data...
• Old fashion:
‒ Separated data → integration → computation → information
• HDFS fashion:
‒ Separated data → computation → integration → information
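The difference between the two flows can be sketched in plain Python. Each hypothetical node below holds its own partition of numbers. The HDFS-style flow computes a small partial result where the data lives and only then integrates the partials, instead of copying every record to one place first. Node names and values are invented for illustration.

```python
# Hypothetical partitions, one per node.
partitions = {
    "node-a": [3, 5, 8],
    "node-b": [1, 9],
    "node-c": [4, 4, 6, 2],
}


def integrate_then_compute(parts: dict) -> float:
    """Old fashion: ship all records to one place, then compute the mean."""
    all_records = [x for records in parts.values() for x in records]
    return sum(all_records) / len(all_records)


def compute_then_integrate(parts: dict) -> float:
    """HDFS fashion: compute (sum, count) next to the data, then merge the partials."""
    partials = [(sum(records), len(records)) for records in parts.values()]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count
```

Both functions return the same mean, but the second one only moves one small `(sum, count)` pair per node across the network rather than every record.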
MapReduce – The Programming Framework
• Originated at Google, which describes it as a simple programming model for processing large-scale data in a parallel and distributed way
https://www.tutorialspoint.com/map_reduce/
MapReduce – The Hello World
https://www.tutorialspoint.com/map_reduce/
MapReduce – The Hello World Scripts
mapper
reducer
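The mapper and reducer scripts from the original slide are not reproduced in this transcript. As a stand-in, here is a minimal word-count sketch in Python that follows the same map, shuffle, and reduce phases. The function names and the in-memory shuffle are assumptions for illustration; this is neither the slide's code nor the Hadoop API.

```python
from collections import defaultdict


def mapper(line: str):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs) -> dict:
    """Shuffle phase: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(word: str, counts: list) -> tuple:
    """Reduce phase: collapse the list of counts for one word into a total."""
    return word, sum(counts)


def word_count(lines) -> dict:
    """Run the three phases in sequence on an iterable of text lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())
```

On a real cluster the mapper calls run in parallel on different nodes and the shuffle moves data between them; here everything runs in one process so the logic is easy to follow.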
We need to make the elephant faster
Inside The Zoo
• Coordination of configuration, data naming, and synchronization for Hadoop projects
• Monitoring and management of Hadoop clusters and nodes
• Tool for data ingestion from various data sources into Hadoop
• A workflow scheduler tool to manage MapReduce jobs
• A tool for managing data transfer between Hadoop and a Relational Database Management System (RDBMS)
• Data mining/machine learning library that works directly with Hadoop data. You can also use R with its library RHadoop
• Scripting language for analyzing large datasets. Compiled to MapReduce jobs
• Facilitates easy ad-hoc queries and summarization of data stored in HDFS through a SQL-like interface named HiveQL
• A non-relational, distributed database system that runs on top of the HDFS file system
Implementing Big Data
The Lambda Architecture
http://jameskinley.tumblr.com
Lambda Architecture Workflow
• All data entering the system is dispatched to both the batch layer and the speed layer for processing.
• The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
• The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
• The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
• Any incoming query can be answered by merging results from batch views and real-time views.
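The workflow above can be sketched as a small in-memory model. This is a simplification under stated assumptions, not a production design: the class name, the page-view counting use case, and the choice to clear the real-time view after each batch run are all hypothetical, and concurrent arrivals during a batch run are ignored.

```python
from collections import Counter


class LambdaSketch:
    """Toy Lambda Architecture that counts page views."""

    def __init__(self):
        self.master_dataset = []        # immutable, append-only raw events
        self.batch_view = Counter()     # pre-computed by the batch layer
        self.realtime_view = Counter()  # kept up to date by the speed layer

    def ingest(self, page: str) -> None:
        """Dispatch a new event to both the batch layer and the speed layer."""
        self.master_dataset.append(page)  # batch layer: append only
        self.realtime_view[page] += 1     # speed layer: incremental update

    def run_batch(self) -> None:
        """Recompute the batch view from the whole master dataset.

        The data it now covers is dropped from the real-time view, so the
        speed layer only ever holds recent data.
        """
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, page: str) -> int:
        """Answer a query by merging the batch view and the real-time view."""
        return self.batch_view[page] + self.realtime_view[page]
```

A query issued between batch runs still sees the latest events, because the real-time view covers whatever the batch view has not absorbed yet.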
Plotting Arsenal to Lambda Architecture
[Figure: tools mapped to the Batch Layer, Speed Layer, Serving Layer, and Query]
Simpler Implementation – Data Warehouse
Data Warehouse using Hadoop Framework
Schema
Sqoop (SQL to Hadoop)
• Importing MySQL table values to HDFS can be done straightforwardly with Sqoop
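As a rough illustration of such an import, the Python sketch below only assembles a `sqoop import` command line; it does not run it. The options used (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop 1 import arguments, while the host, database, user, table, and HDFS path are hypothetical placeholders.

```python
import shlex


def sqoop_import_command(jdbc_url, username, table, target_dir, num_mappers=1):
    """Build the argument list for importing one MySQL table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password instead of embedding it
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]


# All connection details below are made-up placeholders.
cmd = sqoop_import_command(
    jdbc_url="jdbc:mysql://dbhost/sales",
    username="etl_user",
    table="orders",
    target_dir="/user/etl/orders",
)
print(shlex.join(cmd))
```

Keeping the command as a list makes it safe to hand to `subprocess.run` later without shell quoting issues.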
Apache Pig
• Pig provides an engine for executing data flows in parallel on Hadoop and makes use of HDFS and MapReduce
• Pig philosophy → Pigs eat anything: input data can come in any format, including the popular ones.
• Pig includes a language called Pig Latin for expressing data flows
‒ Pig Latin includes operators for many of the traditional data operations, so they need not be re-invented as in Hadoop: JOIN, ORDER BY (sort), FILTER, FOREACH, GROUP, LOAD and STORE.
‒ Express data transformation tasks in just a few lines of code
‒ 10 lines of Pig Latin = ~200 lines of Java
‒ Simplifies the process of writing MapReduce programs
• Most importantly, you can create user-defined functions (UDFs) in Pig!
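To give a feel for the data-flow style, here is a Python analogue of a Pig-like pipeline. It is not Pig Latin and does not use any Pig API. Each step mirrors one of the operators named above (LOAD, FILTER, FOREACH, GROUP), and `normalize_city` plays the role of a user-defined function. The record layout, sample rows, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: name,city,amount per line.
RAW = "alice,Jakarta ,120\nbob,bandung,40\ncarol,JAKARTA,75\ndave,Bandung,10\n"


def load(text: str):
    """LOAD: parse comma-separated lines into (name, city, amount) tuples."""
    for line in text.strip().splitlines():
        name, city, amount = line.split(",")
        yield name, city, int(amount)


def normalize_city(city: str) -> str:
    """Stand-in for a UDF: clean up a single field."""
    return city.strip().title()


def run_flow(text: str, min_amount: int) -> dict:
    """Total amount per city for records at or above min_amount."""
    records = load(text)
    big = (r for r in records if r[2] >= min_amount)            # FILTER
    cleaned = ((normalize_city(c), amt) for _, c, amt in big)   # FOREACH ... GENERATE
    groups = defaultdict(list)                                  # GROUP ... BY city
    for city, amt in cleaned:
        groups[city].append(amt)
    return {city: sum(amts) for city, amts in groups.items()}   # FOREACH group: SUM
```

The point of the analogy is the shape of the program: a short chain of named transformations, with custom logic plugged in as a function, rather than hand-written mapper and reducer classes.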
How Pig Works in a Nutshell
Pig Latin Example – The Wordcount
Apache Hive
• Hive is an open source, petabyte-scale data warehousing framework based on Hadoop that was developed by the Data Infrastructure Team at Facebook
• MapReduce is powerful, but writing an M/R program is like asking application developers to specify the database's physical execution plan in their own code!
• Combine the simplicity of SQL and the power of MapReduce
‒ Efficient implementations of SQL statements on top of MapReduce via Hive Query Language (HiveQL)
Hive Architecture
http://www.hadooptpoint.com/hadoop-hive-architecture/
Data Warehouse using Hadoop Framework
Schema
Beyond Hadoop
Lambda Architecture
[Figure: Batch Layer, Speed Layer, Serving Layer, and Query/Viz]
Another Data Processing Framework
SMACK Stack
Spark, Mesos, Akka, Cassandra, Kafka
http://www.natalinobusa.com
Pick up your own weapon
References
• Thomas Bernardz, 2015, Big Data Workshop – Queensland University of Technology
• Guido Schmutz, 2014, Big Data and Fast Data, http://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action
• James Kinley, 2015, The Lambda architecture: principles for architecting realtime Big Data systems, http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for
• https://www.tutorialspoint.com
• http://hadoopoints.com
• http://hortonworks.com
Thank you.
Join the conversation, contribute & have a lot of fun!
www.opensuse.org
General Disclaimer
This document is not to be construed as a promise by any participating organisation to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. openSUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for openSUSE products remains at the sole discretion of openSUSE. Further, openSUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All openSUSE marks referenced in this presentation are trademarks or registered trademarks of SUSE LLC, in the United States and other countries. All third-party trademarks are the property of their respective owners.
License
This slide deck is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license. It can be shared and adapted for any purpose (even commercially) as long as Attribution is given and any derivative work is distributed under the same license.
Details can be found at https://creativecommons.org/licenses/by-sa/4.0/
Credits
Template: Richard Brown
Design & Inspiration: openSUSE Design Team, http://opensuse.github.io/branding-guidelines/