How Hue integrates Hadoop with Django

Post on 26-Jan-2015

Because big data systems are structured so differently from one another, they can be difficult to query and even more difficult to explore. Hue, a Django-driven web application, integrates with these components and provides a clean, easy-to-use interface. In this discussion, we'll cover how the Hue project approached communicating with HBase, HDFS, and various query engines. We'll also cover the reasons behind these design decisions.

Django + NoSQL

HOW Hue Integrates with Hadoop

Abraham Elmahrek

Cloudera - March 5th, 2014

Monday, March 3, 14

What is Hue?

HISTORY

HUE 1

A desktop-like UI in the browser. It did its job but was slow, leaked memory, and was not very IE friendly. Still, it was advanced for its time (2009-2010).

HUE 2

The first port to a flat structure, with Twitter Bootstrap all over the place.

HUE 2.5

New apps and an improved UX, with nice new features such as autocomplete and drag & drop.

HUE 3 ALPHA

A proposed design that didn’t make it.

HUE 3

Transition to the new UI, with major improvements and new apps.

HUE 3.5+

APPS

PIG

JOB BROWSER

JOB DESIGNER

OOZIE

HIVE

IMPALA

METASTORE BROWSER

SEARCH

HBASE BROWSER

SQOOP

ZOOKEEPER

USER ADMIN

DB QUERY

SPARK

FILE BROWSER

HOME

...

(Diagram labels: GUI DESIGN, USER WORKFLOWS, USER)

YARN

JobTracker

Oozie

Pig

HDFS

HiveServer2

Hive Metastore

Cloudera Impala

Solr

HBase

Sqoop2

ZooKeeper

LDAP

SAML

Hue Plugins

APPS

FAST PACE

LAST MONTH

91 issues created and 90 resolved.

Core team + Community

STACK

BACKEND

Python 2.6+ and Django 1.4.5

FRONTEND

jQuery

Bootstrap

Knockout.js

Love

HADOOP INTERFACES

REST & THRIFT

Many Hadoop interfaces used

WebHDFS, YARN API (RM, NM, MR...), HiveServer2, Impala, HBase, Oozie, Sqoop2, ZooKeeper, ...

CUSTOM CLIENTS

Provide custom clients for more explicit API definitions

PROTOCOLS

REST

Use python-requests and a custom client to streamline RESTful interface calls.

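    # Excerpt from the body of a client factory: build the client,
    # then switch on Kerberos only when the cluster is secured.
    client = http_client.HttpClient(url,
                                    exc_class=WebHdfsException,
                                    logger=LOG)
    if security_enabled:
        client.set_kerberos_auth()
    return client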
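To make the "custom client" idea concrete, here is a minimal sketch of a REST wrapper that logs each call and converts transport and parsing failures into a caller-supplied exception class, similar in spirit to the exc_class argument above. It is not Hue's http_client: the class, the get_json method, and the opener parameter are hypothetical names for illustration, and the standard library's urllib stands in for python-requests so the example has no dependencies.

```python
import json
import logging
import urllib.error
import urllib.request

LOG = logging.getLogger(__name__)


class RestException(Exception):
    """Hypothetical default error type for this sketch."""


class HttpClient:
    """Tiny REST wrapper: one base URL, one exception type, one logger."""

    def __init__(self, base_url, exc_class=RestException, logger=LOG, opener=None):
        self._base_url = base_url.rstrip("/")
        self._exc_class = exc_class
        self._logger = logger
        # The opener is injectable so the client can be exercised without a network.
        self._open = opener or urllib.request.urlopen

    def get_json(self, path, timeout=10):
        url = self._base_url + "/" + path.lstrip("/")
        self._logger.debug("GET %s", url)
        try:
            with self._open(url, timeout=timeout) as resp:
                return json.loads(resp.read().decode("utf-8"))
        except (urllib.error.URLError, ValueError) as exc:
            # Callers only ever see the exception class they asked for.
            raise self._exc_class("GET %s failed: %s" % (url, exc)) from exc
```

Each Hadoop service can then get its own exception type (for example a WebHDFS-specific one) while sharing the same transport code.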

Thrift

Custom connection pooling and socket multiplexing to streamline Thrift calls.

    thrift_util.get_client(TCLIService.Client,
                           query_server['server_host'],
                           query_server['server_port'],
                           service_name=query_server['server_name'],
                           kerberos_principal=kerberos_principal_short_name,
                           use_sasl=use_sasl,
                           mechanism=mechanism,
                           username=user.username,
                           timeout_seconds=conf.SERVER_CONN_TIMEOUT.get(),
                           use_ssl=conf.SSL.ENABLED.get(),
                           ca_certs=conf.SSL.CACERTS.get(),
                           keyfile=conf.SSL.KEY.get(),
                           certfile=conf.SSL.CERT.get(),
                           validate=conf.SSL.VALIDATE.get())
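The slide does not show how the pooling works, so the following is only a generic sketch of connection pooling, not Hue's thrift_util. ClientPool and its factory argument are hypothetical; the factory stands in for whatever opens a Thrift connection. The pool creates clients lazily up to a limit and hands idle ones back out instead of reconnecting.

```python
import queue
import threading
from contextlib import contextmanager


class ClientPool:
    """Bounded pool that reuses idle clients instead of opening new ones."""

    def __init__(self, factory, max_size=4):
        self._factory = factory          # hypothetical: opens one connection
        self._max_size = max_size
        self._idle = queue.LifoQueue()   # most recently used client first
        self._lock = threading.Lock()
        self._created = 0

    def _acquire(self, timeout):
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            pass
        with self._lock:
            can_create = self._created < self._max_size
            if can_create:
                self._created += 1
        if can_create:
            try:
                return self._factory()
            except Exception:
                with self._lock:
                    self._created -= 1   # give the slot back on failure
                raise
        # Pool is full: wait for another caller to release a client.
        return self._idle.get(timeout=timeout)

    @contextmanager
    def client(self, timeout=None):
        c = self._acquire(timeout)
        try:
            yield c
        finally:
            self._idle.put(c)
```

A caller would wrap each remote call in `with pool.client() as c:` so the connection returns to the pool even when the call raises.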

ACCESSIBILITY

Middleware

Make Hadoop interfaces accessible in request objects

    class ClusterMiddleware(object):
        def process_view(self, request, ...):
            request.fs = cluster.get_hdfs(request.fs_ref)
            if request.user.is_authenticated():
                if request.fs is not None:
                    request.fs.setuser(request.user.username)

    def download(request, path):
        if not request.fs.exists(path):
            raise Http404(_("File not found."))
        if not request.fs.isfile(path):
            raise PopupException(_("not a file."))
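One benefit of hanging the filesystem on the request is that views can be exercised against a fake. The sketch below runs without Django or a cluster: InMemoryFs, FsMiddleware, and the plain-Python exceptions are hypothetical stand-ins, and is_authenticated is modeled as an attribute rather than the method call used in the Django 1.4 code above.

```python
from types import SimpleNamespace


class InMemoryFs:
    """Hypothetical stand-in for an HDFS client; a value of None marks a directory."""

    def __init__(self, files):
        self._files = dict(files)
        self.user = None

    def setuser(self, username):
        self.user = username

    def exists(self, path):
        return path in self._files

    def isfile(self, path):
        return self.exists(path) and self._files[path] is not None

    def read(self, path):
        return self._files[path]


class FsMiddleware:
    """Attach a filesystem client to every request before the view runs."""

    def __init__(self, fs_factory):
        self._fs_factory = fs_factory

    def process_view(self, request, view_func, view_args, view_kwargs):
        request.fs = self._fs_factory()
        if request.fs is not None and request.user.is_authenticated:
            # Act on the filesystem as the logged-in user.
            request.fs.setuser(request.user.username)
        return None  # returning None lets the normal view run


def download(request, path):
    """View that only talks to request.fs, never to the cluster directly."""
    if not request.fs.exists(path):
        raise FileNotFoundError(path)
    if not request.fs.isfile(path):
        raise IsADirectoryError(path)
    return request.fs.read(path)
```

Because the view depends only on request.fs, swapping the real client for the in-memory one needs no change to the view itself.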

HDFS

Goal

Easily browse, create, read, update, and delete files in HDFS

HDFS - Communication

REST

The NameNode provides a RESTful server called WebHDFS

    http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE
    http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN
    ...

Request Accessible

Provide a middleware for populating a request member

    def download(request, path):
        if not request.fs.exists(path):
            raise Http404(_("File not found."))
        if not request.fs.isfile(path):
            raise PopupException(_("not a file."))

Explicit Client

Provide an API that is explicit

    class WebHdfs(Hdfs):
        def create(self, path, ...):
            ...
        def read(self, path, ...):
            ...
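As an illustration of what an explicit client hides, here is a small sketch that builds WebHDFS URLs of the form shown on this slide. The helper name webhdfs_url is hypothetical and this is not Hue's WebHdfs class; op, user.name, offset, and length are standard WebHDFS query parameters, and port 50070 in the usage note is only an assumed NameNode HTTP port.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, user=None, **params):
    """Build http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=<OP>&... for one call."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute: %r" % path)
    query = [("op", op)]
    if user is not None:
        query.append(("user.name", user))
    # Sort the remaining parameters so the URL is deterministic.
    query.extend(sorted(params.items()))
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, quote(path), urlencode(query))
```

For example, `webhdfs_url("nn.example.com", 50070, "/user/demo/a b.txt", "OPEN", user="demo", offset=0, length=1024)` yields `http://nn.example.com:50070/webhdfs/v1/user/demo/a%20b.txt?op=OPEN&user.name=demo&length=1024&offset=0`. A method such as read(path, ...) on an explicit client would wrap exactly this kind of URL construction.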

HDFS - Cool Things

MIME Type Detection

Detect the various kinds of files being read: Avro, GZIP, etc.

Pagination

Pagination by block size when viewing a file (soon to work more like a PDF reader, with content added automatically)
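Both ideas can be sketched in a few lines. The slide does not say how Hue detects file types, so the magic-byte check below is an assumption about one common approach: gzip streams start with the bytes 1f 8b and Avro object container files start with "Obj" followed by 0x01. The function names and the 1-based page convention are hypothetical.

```python
GZIP_MAGIC = b"\x1f\x8b"
AVRO_MAGIC = b"Obj\x01"


def sniff_format(head):
    """Guess a file format from the first bytes of its content."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(AVRO_MAGIC):
        return "avro"
    return "unknown"


def page_range(page, block_size, file_size):
    """Return (offset, length) of a 1-based page when paging by block size."""
    if page < 1 or block_size < 1:
        raise ValueError("page and block_size must be positive")
    offset = (page - 1) * block_size
    # The last page may be short; pages past the end have length 0.
    length = max(0, min(block_size, file_size - offset))
    return offset, length
```

The (offset, length) pair maps directly onto the offset and length parameters of a WebHDFS OPEN call, so only the visible block needs to be fetched.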

HBase

Goal

Make it easy to view and search HBase

HBase - Technical Risk

2 Dimensions

Infinitely many columns and rows

Sparseness

Column names will often differ per row
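To see why sparseness complicates a table view, consider turning sparse rows into a rectangular grid. The sketch below is one generic way to do it, not a description of Hue's HBase browser: the header is the union of all column names seen, and cells a row does not have are left as None. The function name and the sample data are hypothetical.

```python
def to_grid(rows):
    """Turn {row_key: {column: value}} into (columns, grid) for display."""
    # Union of every column name seen in any row, in a stable order.
    columns = sorted({col for cells in rows.values() for col in cells})
    grid = [
        [key] + [cells.get(col) for col in columns]
        for key, cells in sorted(rows.items())
    ]
    return columns, grid


rows = {
    "row1": {"cf:a": "1"},
    "row2": {"cf:b": "2", "cf:c": "3"},
}
columns, grid = to_grid(rows)
# columns: ['cf:a', 'cf:b', 'cf:c']
# grid:    [['row1', '1', None, None], ['row2', None, '2', '3']]
```

With many rows and many distinct columns the grid becomes mostly empty, which is the risk the slide points to.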

HBase - Communication

Thrift

Communicate with HBase using Thrift for better filtering

Explicit Client

Provide an API that is explicit

    class HBaseApi(Hdfs):
        def createTable(self, cluster, tableName, ...):
            ...
        def getRows(self, cluster, tableName, columns, ...):
            ...

Hive

Goal

Make it easy to run queries in Hive

Hive - Communication

Thrift

Communicate with HiveServer2 using Thrift

DBMS

Extend the capabilities of the DBMS layer in Hue

Explicit Client

Provide a higher level API that is explicit and easy to configure

    class HiveServerClient:
        HS2_MECHANISMS = {'KERBEROS': 'GSSAPI', 'NONE': 'PLAIN', 'NOSASL': 'NOSASL'}

        def __init__(self, query_server, user, ...):
            thrift_util.get_client(TCLIService.Client,
                                   query_server['server_host'],
                                   query_server['server_port'],
                                   service_name=query_server['server_name'],
                                   ...)

    class HiveServer2Dbms(object):
        def get_databases(self):
            return self.client.get_databases()
        ...
        def select_star_from(self, database, table):
            hql = "SELECT * FROM `%s.%s` %s" % (database, table.name, self._get_browse_limit_clause(table))
            return self.execute_statement(hql)
        ...
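A higher-level API like select_star_from is also a natural place to validate names before they reach the query string. The sketch below is a hypothetical variant, not Hue's code: it quotes the database and table separately (the slide quotes them together in one pair of backticks, and which form a given Hive version accepts is an assumption here) and rejects anything that is not a plain identifier.

```python
import re

# Plain identifiers only: letters, digits, underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_identifier(name):
    """Backtick-quote a name after checking it is a plain identifier."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError("unsupported identifier: %r" % name)
    return "`%s`" % name


def select_star(database, table, limit=None):
    """Build a browse query for one table, with an optional row limit."""
    hql = "SELECT * FROM %s.%s" % (quote_identifier(database), quote_identifier(table))
    if limit is not None:
        hql += " LIMIT %d" % int(limit)
    return hql
```

For example, `select_star("default", "sample_07", limit=100)` produces ``SELECT * FROM `default`.`sample_07` LIMIT 100``, while a name containing a backtick or a semicolon raises ValueError instead of being interpolated.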

Hive - Results

One Page App

Intelligent view that lets users worry only about their queries

Navigation

Able to navigate databases and tables easily

Secure

Achieved some level of security through SASL, Kerberos, and SSL

DEMO TIME

What else does Hue do with Django?

Extensible settings

Configuration of settings.py provided through the hue.ini

Testing

Mocked and functional tests via nose + django-nose

Authentication

LDAP, PAM, OAuth, etc. provided through authentication backends

Security

Configurable session timeouts, SAML authentication, etc.

Doc Model

Polymorphic documents via a base document model

Permissions

Per-app permissions configurable in the UserAdmin
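The "extensible settings" item pairs with the conf.SERVER_CONN_TIMEOUT.get() style seen in the Thrift call earlier: each setting is an object that reads its value from the ini file and falls back to a default. Below is a minimal sketch of that pattern using the standard library's configparser. It is not Hue's configuration code, a real hue.ini may use nested sections that configparser does not parse, and the section and key names here are hypothetical.

```python
import configparser


class Setting:
    """One typed option with a default, resolved lazily via get()."""

    def __init__(self, parser, section, key, default, cast=str):
        self._parser = parser
        self._section = section
        self._key = key
        self._default = default
        self._cast = cast

    def get(self):
        if not self._parser.has_option(self._section, self._key):
            return self._default
        if self._cast is bool:
            # configparser understands true/false, yes/no, on/off, 1/0.
            return self._parser.getboolean(self._section, self._key)
        return self._cast(self._parser.get(self._section, self._key))


def load_settings(ini_text):
    """Parse ini text and expose a couple of illustrative settings."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {
        "SERVER_CONN_TIMEOUT": Setting(parser, "beeswax", "server_conn_timeout", 120, int),
        "SSL_ENABLED": Setting(parser, "ssl", "enabled", False, bool),
    }
```

A Django settings.py could then call .get() on such objects to fill in values, which keeps operators editing one ini file instead of Python code.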

GET HUE

TARBALL

Try the latest and greatest in advance, but you’ll have to configure everything on your own.

CLOUDERA’S DEMO VM

Get to play with Hue and various Hadoop components in 5 minutes. It’s a self-contained CDH environment, ready to use.

CLOUDERA’S CDH

Stable and highly tested releases perfectly integrated with the Hadoop ecosystem, automagically configured by Cloudera Manager.

HORTONWORKS*

In HDP there’s an old forked version of Hue 2.3.

MAPR*

Newer version than HDP, close to the original 2.5 minus apps like HBase, Impala, Sqoop, Search.

HP CLOUD*

The newest addition, ships Hue 3.0 through the GreenButton products.

BIGTOP

EMBEDDED/DEMO IN IND. COMPANIES

* YOUR MILEAGE MAY VARY.

THANKS.

gethue.com

QUESTIONS?
