Hadoop & Greenplum: Why Do Such a Thing?
Transcript of Hadoop & Greenplum: Why Do Such a Thing?
© Copyright 2012 EMC Corporation. All rights reserved.
Greenplum & Hadoop
Why do such a thing?
Donald Miner, Solutions Architect, Advanced Technologies
[email protected]
QUICK INTRODUCTION TO GREENPLUM DATABASE
Greenplum Database Basics
Massively Parallel Processing (MPP) Database
Uses commodity hardware
Data is distributed by a user-defined “distribution key”
Master node delegates queries to segments
1:1 segment and master mirroring for redundancy
[Diagram: a master (with a standby master) sitting above four segments]
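The distribution-key bullet above can be sketched in Python. This is a hypothetical illustration, not Greenplum's actual hash function: the idea is simply that hashing a user-defined key modulo the segment count places each row deterministically on one segment.

```python
import hashlib

NUM_SEGMENTS = 4  # one entry per segment in the cluster

def segment_for(distribution_key: str) -> int:
    """Hash the user-defined distribution key to pick a segment."""
    digest = hashlib.md5(distribution_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SEGMENTS

# The same key always lands on the same segment, so joins on the
# distribution key can run segment-locally without data motion.
rows = ["cust_1001", "cust_1002", "cust_1003", "cust_1004"]
placement = {key: segment_for(key) for key in rows}
print(placement)
```

Choosing a distribution key with many distinct values keeps the rows spread evenly across segments.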
Greenplum Database Features
Full SQL support based on PostgreSQL 8.2
Columnar or row-oriented storage with compression
Multi-level table partitioning with query time partition pruning
B-tree and bitmap indexes
JDBC, ODBC, OLEDB, etc. interfaces
High speed, parallel bulk ingest
Parallel query optimizer
External tables
MADlib Analytics with Greenplum
Scalable and in-database
Mathematical, statistical, machine learning
Active open source project
> SELECT householdID, variables FROM households
  ORDER BY RANDOM() LIMIT 100000;
> SELECT run_univariate_analysis('households_training', 'variables')
  WHERE pvalue < .01 AND r2 > .01;
> SELECT run_regression('univariate_results', 'households_training');
> SELECT householdID,
         madlib.array_dot(coef::REAL[], xmatrix::REAL[])
  FROM coefficients, households;
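The last statement scores each household with madlib.array_dot, which is just the dot product of the fitted regression coefficients and a feature vector. A minimal Python equivalent (the MADlib function itself runs in-database over REAL[] arrays; the values below are made up for illustration):

```python
def array_dot(coef, x):
    """Dot product of two equal-length numeric arrays,
    as madlib.array_dot computes in-database."""
    assert len(coef) == len(x)
    return sum(c * v for c, v in zip(coef, x))

coef = [0.5, -1.2, 2.0]       # fitted regression coefficients
household = [1.0, 0.3, 0.7]   # one household's feature vector
score = array_dot(coef, household)
print(score)  # 0.5 - 0.36 + 1.4 = 1.54
```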
MADlib In-Database Analytical Functions
Descriptive Statistics                                  | Modeling
--------------------------------------------------------+--------------------------------------------
Quantile                                                | Correlation Matrix
Profile                                                 | Association Rule Mining
CountMin (Cormode-Muthukrishnan) Sketch-based Estimator | K-Means Clustering
FM (Flajolet-Martin) Sketch-based Estimator             | Naïve Bayes Classification
MFV (Most Frequent Values) Sketch-based Estimator       | Linear Regression
Frequency                                               | Logistic Regression
Histogram                                               | Support Vector Machines
Bar Chart                                               | SVD Matrix Factorisation
Box Plot Chart                                          | Decision Trees/CART
                                                        | Latent Dirichlet Allocation Topic Modeling
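As a concrete example of one modeling entry, linear regression with a single feature reduces to the closed-form normal equations; MADlib computes the same sums in-database and in parallel. A minimal sketch (hypothetical data):

```python
def linear_regression(xs, ys):
    """Ordinary least squares for y = slope*x + intercept,
    via the single-feature normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# y = 2x exactly, so we expect slope 2 and intercept 0
print(linear_regression([1, 2, 3, 4], [2, 4, 6, 8]))  # (2.0, 0.0)
```

Because each sum decomposes over rows, every segment can accumulate partial sums over its local data and the master combines them, which is what makes the in-database approach scale.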
PostGIS Support in Greenplum DB
PostGIS adds support for geographic objects in PostgreSQL
Example: find all records within 25 miles of hurricane path
 customer_id | st_astext                   | phone_num
-------------+-----------------------------+------------
      493140 | POINT(-80.040397 26.570613) | 1231231234
      192401 | POINT(-81.820933 26.242611) | 2342342345

select customer_id, ST_AsText(lat_lon), phone_num
from clients
where ST_DWithin(lat_lon,
  ST_GeometryFromText('LINESTRING(-79.3 17, -79.3 17.1, -79.3 17.3,
    -79.7 17.6, -79.6 17.4, -79.6 16.8, -79.9 15.8, -80.2 15.8,
    -80 15.7, -80 15.7, -80.2 15.9, -80.6 16.5, -81.1 16.7,
    -81.8 16.7, -82.1 16.8, -82.5 17.2, -83.9 17.9, -85.2 18.3,
    -85.5 18.4)', 4326),
  25.0/3959.0 * 180.0/PI());
http://postgis.refractions.net/
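The last argument to ST_DWithin above is a distance in degrees, because the geometry is lon/lat (SRID 4326): 25 miles divided by an Earth radius of 3959 miles gives radians, and 180/π converts radians to degrees. The conversion, worked out:

```python
import math

EARTH_RADIUS_MILES = 3959.0

def miles_to_degrees(miles: float) -> float:
    """Convert a surface distance in miles to degrees of arc,
    matching the 25.0/3959.0 * 180.0/PI() expression in the query."""
    return miles / EARTH_RADIUS_MILES * 180.0 / math.pi

print(round(miles_to_degrees(25.0), 4))  # roughly 0.3618 degrees
```

This is an approximation: a degree of longitude shrinks away from the equator, so for precise distances a geography type or a projected coordinate system is the usual fix.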
Solr integration with GPDB
Solr is an open source enterprise search engine
Enable in-database text indexing and search
select t.id, q.score, t.message_text
from message t,
     gptext.search(
       'twitter.public.message',
       '(iphone and (hate or love))',
       'author_lang:en', 100) q
where t.id = q.id
order by score desc;
    id     |      score       | message_text
-----------+------------------+-------------------------------------------
  71552856 | 5.43078422546387 | Hates BB's Love IPhones!
91373993 | 4.06371879577637 | Its a love hate relationship with iPhone spellcheck
25444233 | 4.05911064147949 | #iPhone autocorrect is a love/hate relationship...
120166038 | 3.39410924911499 | Love the new iPhone 4s, hate @ATT service #Verizonhereicome
117498183 | 3.39181470870972 | I got a love-hate relationship for my iPhone!!!
86416378 | 3.39180779457092 | Absolutely love the new iPhone, but Siri seems to hate me..
GREENPLUM HADOOP
Greenplum “HD”
• Bundled open source
• HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Mahout
Greenplum “MR”
• Bundled MapR, a commercial version of Hadoop
• API compatible with traditional Hadoop
• MapR improvements over Hadoop:
  – Improved control system
  – Major portions of HDFS re-implemented in C++
  – HDFS is NFS mountable
  – Improved shuffle and sort
  – Distributed NameNode
  – Supports large numbers of files
  – Mirroring, snapshot capability
Why do such a thing? Greenplum DB
[Diagram: on a spectrum from structured to semi-structured to unstructured data, Greenplum DB covers the structured end with SQL, RDBMS tables and schemas, indexing, and partitioning, and reaches into semi-structured data with GP MapReduce, text objects, GP Solr/Lucene, MADlib, and PostGIS]
Why do such a thing? Hadoop
[Diagram: Hadoop spans semi-structured and unstructured data (XML, JSON, flat files) with MapReduce, Hive, and Pig, using schema on load]
Why do such a thing? HBase
[Diagram: HBase spans structured and semi-structured data with HBase tables, row keys, and flexible schemas, accessed via MapReduce, Hive, and Pig]
Why do such a thing? Hybrid architecture with all three (or two…)
[Diagram: the three spectra overlaid: Greenplum DB (SQL, RDBMS tables and schemas, indexing, partitioning, GP MapReduce, text objects, GP Solr/Lucene, MADlib, PostGIS) on the structured end; HBase (tables, row keys, flexible schema) across structured and semi-structured data; Hadoop (MapReduce, Hive, Pig; XML, JSON, flat files; schema on load) through unstructured data]
Greenplum Unified Analytics Platform
Hadoop External Tables in GPDB
External tables bring external data into the database.
Native support for HDFS with parallelized loading.
Can write to HDFS or read from HDFS.
> CREATE EXTERNAL TABLE hdfs_document_feature (
    docid integer,
    term text,
    freq integer)
  LOCATION ('gphdfs://namenode:9000/user/don/docs/part-*')
  FORMAT 'text' (delimiter '|');

> SELECT COUNT(*)
  FROM hdfs_document_feature h, gpdb_words g
  WHERE h.term = g.word;

> INSERT INTO hdfs_export SELECT * FROM gpdb_source;
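A hypothetical Python sketch of what the FORMAT 'text' (delimiter '|') definition implies: each line of the HDFS part files is split on '|' and coerced to the declared (docid integer, term text, freq integer) column types, which is essentially the work the parallel loaders do per segment.

```python
def parse_line(line: str):
    """Parse one '|'-delimited line into the declared column types."""
    docid, term, freq = line.rstrip("\n").split("|")
    return int(docid), term, int(freq)

# Sample lines standing in for the contents of part-* files
sample = ["1|hadoop|3", "1|greenplum|5", "2|hadoop|1"]
rows = [parse_line(l) for l in sample]
print(rows)  # [(1, 'hadoop', 3), (1, 'greenplum', 5), (2, 'hadoop', 1)]
```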
Why do such a thing?
Many of the same use cases as an HBase/Hadoop environment
Use Hadoop as a data groomer
Do rollups in Hadoop and store results in GPDB
Use the best tool for the job (structured vs. unstructured)
Use GPDB to host data sets in a more real-time layer for ad-hoc analytics
EMC Isilon
Hardware appliance for scale-out network-attached storage (NAS)
Stripes data across all nodes
Uses InfiniBand for intra-cluster communication
Up to 15.5PB total storage
3 different hardware configurations to handle different workloads
Uses “OneFS”, Isilon’s operating system and file system
Interfaces with iSCSI, NFS, CIFS, HTTP, HDFS, and a few more.
Isilon HDFS interface
Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data.
Underlying system is OneFS and does not follow the traditional HDFS scheme.
Point HDFS clients (MapReduce, command line, etc.) to any IP in the Isilon cluster.
Pros & Cons
Isilon is more dense
Isilon can be mounted via a number of protocols
– Easier ingest / egress
– Raw data accessible by applications
Isilon is easy to manage
Free of certain HDFS limitations
Isilon loses data locality (~250MB/sec throughput per node over network)
Why do such a thing?
Hadoop backup or archive
– More dense than HDFS, more accessible than tape, no need for compute
Complete HDFS replacement
– More dense, more accessible, utilizes existing Isilon; slower per terabyte of storage
Hot/warm storage
– Use HDFS as primary, but Isilon as secondary
Storage for original content
– Use MapReduce to extract metadata from original content, and leave original content in place
HBase External Tables in GPDB
Project in development
Load data in parallel from HBase by specifying table name and column qualifiers
> SELECT COUNT(*) FROM hbase_document_feature h, gpdb_words g WHERE h.term = g.word;
> CREATE EXTERNAL TABLE hbase_document_feature (
    "HBASEROWKEY" text,
    "term" text,
    "freq" integer)
  LOCATION ('gphbase://docfeatures')
  FORMAT 'CUSTOM' (formatter='gpdbwriteable_import');
HBase External Tables in GPDB
Possible TODO list:
Specify range of rowkeys
Support writes into HBase
Specify filter criteria on the external table
select * from hbase_external where ROWKEY='abc'
Accumulo?
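The “specify range of rowkeys” item above works because HBase stores rows sorted by rowkey, so a predicate like ROWKEY='abc' (or a range) can be pushed down to scan only the matching slice instead of the whole table. A hypothetical sketch of that pruning over an in-memory sorted list:

```python
from bisect import bisect_left

# Stand-in for an HBase table's sorted rowkeys
sorted_rowkeys = ["aaa", "abc", "abd", "bcd", "xyz"]

def scan_range(start: str, stop: str):
    """Return rowkeys in [start, stop) using binary search on the
    sorted order, instead of filtering a full-table scan."""
    lo = bisect_left(sorted_rowkeys, start)
    hi = bisect_left(sorted_rowkeys, stop)
    return sorted_rowkeys[lo:hi]

print(scan_range("abc", "b"))  # ['abc', 'abd']
```

A point lookup like ROWKEY='abc' is just the degenerate range whose start and stop bracket a single key.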
Why do such a thing?
Have HBase store semi-structured data
Exploit the strengths of each
Use HBase for really really wide tables
Use HBase as a scalable archive of raw records
Leverage existing HBase applications
Greenplum On HDFS
Get Greenplum Database to run natively off of HDFS
Underlying Greenplum Database data is stored in HDFS
Unifies the two platforms further – no need for external tables
Fully supports Greenplum’s append-only tables
Early project in R&D
Talk will be given by Chang Lei at Yahoo Summit
Greenplum On HDFS
[Diagram: the master host performs meta ops against the HDFS NameNode; segment hosts, each running a segment and a mirror, communicate over the interconnect and read/write tables stored in an HDFS filespace, with blocks replicated across DataNodes in Rack1 and Rack2]
Why do such a thing?
Covers many of the same use cases as Hive
Run Hadoop MapReduce over data managed by Greenplum DB
Initial results show it is faster than Hive
You only have to store your data in one system