Bigdata design doc.pptx
Uploaded by amir-sanjar · Category: Data & Analytics
Developing Big Data Solutions with Juju
From complexity to simplicity
Agenda
● What is Canonical?
● Challenges of building big data solutions
● What is Juju?
● Apache Hadoop pluggable model
  ○ Plugging big data services into Hadoop
● Demo
Canonical, the company behind Ubuntu.
Challenges of building big data solutions.
● Many Hadoop distributions
● Many Apache projects to integrate into solutions
Hadoop distributions
● Similar to Linux, Hadoop has many distributions
● Top commercial offerings: Cloudera, MapR, Hortonworks, IBM BigInsights
● Open source distribution: Apache Hadoop
● Issues:
  ○ Each distribution has a different packaging style
  ○ Each distribution has different installation blueprints (e.g., users, install locations)
  ○ Different dependencies (e.g., IBM BigInsights requires IBM Java)
  ○ Different hardware (e.g., POWER, x86, ARM)
Big Data Ecosystem
● Apache Hadoop provides the following services:
  ○ HDFS - Hadoop Distributed File System; manages data
  ○ MapReduce - Hadoop's data processing unit; a Hadoop job
  ○ YARN - Hadoop's resource manager and job scheduler; manages jobs
● Apache Spark - in-memory data processing unit, integrated with YARN
● The Hadoop ecosystem includes many additional components - big data service consumers:
  ○ Data ingestion: Flume, Hue, Sqoop, etc.
  ○ Data analysis: Spark, Hive, Pig, Impala, etc.
  ○ Data visualization: Hue, Zeppelin, etc.
What is Juju?
● Juju is the modeling language for service-oriented deployment in the cloud.
● Juju allows you to deploy, configure, manage, maintain, and scale big data services quickly on public clouds, as well as on physical servers, OpenStack, and containers.
  ○ Juju's major properties:
    ■ Deploy, connect, scale
    ■ Reliability
    ■ Open source
    ■ Repeatability
    ■ Speed
    ■ Observability
What is Juju?....
● Juju has two components:
  ○ Charms - a model of how a (unique) micro-service shall be deployed, scaled, and integrated
    ■ Can be written in any language; big data charms are mostly coded in Python
  ○ Bundles - a set of charms/services integrated together, regardless of their individual scale
    ■ A big data solution
What is Juju?...
How did we use Juju to solve the problem?
● Developed vendor/release-agnostic installation charms.
● All big data services use Apache standard interfaces (dfs, map-reduce) to connect to big data services.
● Introduced the Apache Hadoop plugin charm.
  ○ Enables diverse solutions regardless of core and surrounding services
● Swappable components mean rapid development at every layer:
  ○ Data ingestion
  ○ Data processing
  ○ Data visualization
What does that mean?
● Install time: common installation method for services
  ○ Vendor agnostic
  ○ Release agnostic (except for new features)
● Run time: pluggable service interactions
  ○ Like Lego blocks
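The "Lego blocks" idea can be sketched as a toy model (plain Python, not Juju code; all service and interface names here are illustrative): services connect through named interfaces, so any component that speaks the same interfaces can be swapped in without touching the rest.

```python
# Toy model of pluggable services: each service provides or requires
# named interfaces (the "studs" on the Lego blocks).
model = {
    "hdfs-master": {"provides": ["dfs"]},
    "yarn-master": {"provides": ["map-reduce"]},
    "hive": {"requires": ["dfs", "map-reduce"]},
}

def can_relate(model, consumer, providers):
    """True if every interface the consumer requires is provided."""
    provided = {i for p in providers for i in model[p]["provides"]}
    return set(model[consumer]["requires"]) <= provided

# Hive plugs into the Hadoop core...
assert can_relate(model, "hive", ["hdfs-master", "yarn-master"])

# ...and Pig can be swapped in because it speaks the same interfaces.
model["pig"] = {"requires": ["dfs", "map-reduce"]}
assert can_relate(model, "pig", ["hdfs-master", "yarn-master"])
```

Run-time pluggability falls out of the same idea: as long as the interfaces match, the core services never need to know which consumer is attached.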
Apache Hadoop Pluggable model
YARN Service
HDFS Service
Spark Service
Big Data services accessing Apache Hadoop HDFS and YARN
● Hadoop-client charm
  ○ Uses the Hadoop command-line component
  ○ Preconfigured to run MapReduce jobs
  ○ Preconfigured to access HDFS
● Hadoop-plugin charm
  ○ For big data services requiring the Hadoop Java API (used by Hive, Pig, etc.)
  ○ Preconfigured to connect to the Hadoop cluster
● Hadoop service relations
  ○ Provide hostname/port to communicate with HDFS/YARN
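The relation data is just a hostname and port; a consuming charm turns that into the HDFS URI it writes into its Hadoop client configuration (core-site.xml's fs.defaultFS). A minimal sketch of that step (the helper name and default port are our assumptions, not the actual jujubigdata API):

```python
def hdfs_url(hostname, port=8020):
    """Build the fs.defaultFS value a client writes into core-site.xml.

    8020 is the conventional HDFS NameNode RPC port; in a real
    deployment both values arrive over the Juju relation.
    """
    return "hdfs://%s:%d" % (hostname, port)

print(hdfs_url("namenode-0"))       # hdfs://namenode-0:8020
print(hdfs_url("namenode-0", 9000)) # hdfs://namenode-0:9000
```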
Vendor-agnostic Installation
● Operating system independence
  ○ Tarballs
  ○ Eliminate OS packaging dependencies
● Architecture independence
  ○ Determine requirements at deployment time
● Example from Hive:
  ○ http://bazaar.launchpad.net/~bigdata-dev/charms/trusty/apache-hive/trunk/view/head:/resources.yaml
resources:
  hive-ppc64le:
    url: http://<url>/apache-hive-0.13.0-bin.tar.gz
    hash: 4c835644eb72a08df059b86c45fb159b95df08e831334cb57e24654ef078e7ee
    hash_type: sha256
  hive-x86_64:
    url: http://<url>/apache-hive-1.0.0-bin.tar.gz
    hash: b8e121f435defeb94d810eb6867d2d1c27973e4a3b4099f2716dbffafb274184
    hash_type: sha256
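Each resources.yaml entry pins its tarball to a sha256 digest, so at deploy time a charm can reject a download whose hash does not match. A minimal sketch of that check (the helper name is ours, not the charm's actual code):

```python
import hashlib

def verify_resource(data, expected_hash, hash_type="sha256"):
    """True if the downloaded bytes match the digest pinned in resources.yaml."""
    digest = hashlib.new(hash_type, data).hexdigest()
    return digest == expected_hash

# Stand-in payload; a real charm hashes the fetched tarball bytes.
payload = b"apache-hive-1.0.0-bin.tar.gz contents"
pinned = hashlib.sha256(payload).hexdigest()
assert verify_resource(payload, pinned)
assert not verify_resource(b"tampered bytes", pinned)
```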
Vendor-agnostic Installation ….
● Vendor properties
  ○ Provide default values and allow fine-tuning
  ○ Allow vendor-specific configuration
● Example from Hive:
  ○ http://bazaar.launchpad.net/~bigdata-dev/charms/trusty/apache-hive/trunk/view/head:/dist.yaml
vendor: 'apache'
hadoop_version: '2.4.1'
packages:
  - 'libmysql-java'
  - 'mysql-client'
groups:
  - 'hadoop'
users:
  hive:
    groups: ['hadoop']
dirs:
  hive:
    path: '/usr/lib/hive'
    owner: 'hive'
    group: 'hadoop'
ports:
  hive:
    port: 10000
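dist.yaml centralizes the vendor-specific values, so a vendor build only has to override the keys that differ from the stock defaults. A toy sketch of that merge, using plain dicts in place of the charm's real YAML loading (function and values are illustrative):

```python
# Stock defaults, mirroring a few keys from the Hive dist.yaml above.
defaults = {
    "vendor": "apache",
    "hadoop_version": "2.4.1",
    "dirs": {"hive": {"path": "/usr/lib/hive", "owner": "hive"}},
}

def with_overrides(defaults, overrides):
    """Shallow-merge vendor overrides onto the stock defaults."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged

# A hypothetical vendor build changes only what it must.
cfg = with_overrides(defaults, {"vendor": "ibm", "hadoop_version": "2.4.1-bi"})
assert cfg["vendor"] == "ibm"
assert cfg["dirs"]["hive"]["path"] == "/usr/lib/hive"  # untouched default
```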
Hadoop Plugin Charm
● Single, simplified connection point to Hadoop HDFS and YARN
● Relating to the plugin installs and manages:
  ○ Java runtime
  ○ Access to interact with the data set
    ■ Hadoop API and CLI
    ■ Hadoop config (/etc/hadoop/conf)
    ■ /etc/hosts updates
    ■ Environment updates, e.g. HADOOP_CONF_DIR
● Allows service reusability across Hadoop versions and distributions
from jujubigdata.relations import HadoopPlugin

if HadoopPlugin().hdfs_is_ready():
    pig.install()
    pig.configure()
First, the “hard” part:
juju quickstart apache-core-batch-processing

If needed:
juju add-unit -n 10 compute-slave
Ok, now that you have a fully working Big Data deployment with Apache Hadoop, let’s get to the interesting bit.
Let’s add Apache Hive
Data Analytics with MySQL
juju deploy apache-hive hive
juju deploy mysql
juju add-relation plugin hive
juju add-relation hive mysql
And Apache Pig
Data Analysis with Apache Pig language
juju deploy apache-pig pig
juju add-relation plugin pig
References and Contact Info
● Core bundle technical documentation
● Mailing lists
  ○ [email protected]
  ○ Juju: https://lists.ubuntu.com/mailman/listinfo/juju
● IRC (Freenode)
  ○ #juju
    ■ asanjar, cory_fu, kwmonroe
● Web
  ○ jujucharms.com/big-data
  ○ jujucharms.com/docs