Hydra - Getting Started
Hydra - A Practical Introduction
Big Data DC - @bigdatadc Matt Abrams - @abramsm
March 4th 2013
Agenda
• What is Hydra?
• Sample Data and Analysis Questions
• Getting started with a local Hydra dev environment
• Hydra’s Key Concepts
• Creating your first Hydra job
• Putting it all together
Hydra’s Goals
• Support Streaming and Batch Processing
• Massive Scalability
• Fault tolerant by design (bend but do not break)
• Incremental Data Processing
• Full stack operational support
• Command and Control
• Alerting
• Resource Management
• Data/Task Rebalancing
• Data replication and Backup
What Exactly is Hydra?
• File System
• Data Processing
• Query System
• Job/Cluster Management
• Operational Alerting
• Open Source
Hydra - Terms
• Job: a process that consumes and transforms data
• Task: a processing component of a job. A job can have one to n tasks
• Node: A logical unit of processing capacity available to a cluster
• Minion: Management process that runs on cluster nodes. Acts as gatekeeper for controlling task processes
• Spawn: Cluster management controller and UI
Hydra Cluster
Our Sample Data (Log-Synth)
3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
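The sample rows appear to be comma-separated fields of response time, user ID, IP address, and quoted search terms (the column names here are assumptions based on the analysis questions that follow). A minimal Python sketch of a reader for this format:

```python
import csv
import io

SAMPLE = '''3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"'''

def parse_rows(text):
    """Parse log-synth-style lines into (response_time, uid, ip, terms) tuples.
    skipinitialspace lets the csv module treat the quoted terms field correctly
    even though a space follows each comma."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [(float(t), uid, ip, q) for t, uid, ip, q in reader]

rows = parse_rows(SAMPLE)
```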
What do we want to know?
• What are the top IP addresses by request count?
• What are the top IP addresses by unique user count?
• What are the most common search terms?
• What are the most common search terms in the slowest 5% of queries?
• What are the daily number of unique searches, unique users, unique IP addresses, and distribution of response times (all approximates)?
Setting up Hydra’s Local Stack
Vagrant
• $ vagrant init precise32 http://files.vagrantup.com/precise32.box
• // add: config.vm.network :forwarded_port, guest: 5052, host: 5052 to your Vagrantfile
• $ vagrant up
• $ vagrant ssh
Java7
• $ sudo apt-get update
• $ sudo apt-get install python-software-properties
• $ sudo add-apt-repository ppa:webupd8team/java
• $ sudo apt-get update
• $ sudo apt-get install oracle-java7-installer
RabbitMQ, Maven, Git, Make
• $ sudo apt-get install rabbitmq-server
• $ sudo apt-get install maven
• $ sudo apt-get install git
• $ sudo apt-get install make
Copy on Write
• $ wget http://xmailserver.org/fl-cow-0.10.tar.gz
• $ tar zxvf fl-cow-0.10.tar.gz
• $ cd fl-cow-0.10
• $ ./configure --prefix=/usr
• $ make; make check
• $ sudo make install
• $ export LD_PRELOAD=/usr/lib/libflcow.so:$LD_PRELOAD
Hydra
• $ git clone https://github.com/addthis/hydra.git
• $ cd hydra; mvn clean -Pbdbje package
• $ ./hydra-uber/bin/local-stack.sh start
• $ ./hydra-uber/bin/local-stack.sh start
• $ ./hydra-uber/bin/local-stack.sh seed
Stage Sample Data in Stream Directory
• $ mkdir ~/hydra/hydra-local/streams/log-synth
• $ cp $YOUR_SAMPLE_DATA_DIR ~/hydra/hydra-local/streams/log-synth
Pipes and Filters
BundleFilters
• Return true or false
• Operate on entire rows
• Add/Remove columns
• Edit column values
• May include a call to a ValueFilter
ValueFilters
• Operate on single values
• Return a value or null
• No visibility to the full row
• Often take input from a BundleFilter
// chain of bundle filters
{"op":"chain", "filter":[
  // LIST OF BUNDLE FILTERS
  ….
]}
BundleFilter - Chain
// false if UID column is null {"op":"field", "from":"UID"},
BundleFilter - Existence
// joins FOO and BAR
// Stores output in new column "OUTPUT"
{"op":"concat", "in":["FOO", "BAR"], "out":"OUTPUT"},
Bundle Filter - Concatenation
// FIELD_ONE == FIELD_TWO
{"op":"equals", "left":"FIELD_ONE", "right":"FIELD_TWO"},
BundleFilter - Equality Testing
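The filters above can be read as predicates over a row (bundle): existence returns false on a null column, concat always passes but adds a column, equals compares two columns, and a chain passes only while every member passes. A hedged Python sketch of those semantics (these function names are illustrative, not Hydra's API):

```python
def field_filter(bundle, col):
    """Existence check: false if the column is missing or null."""
    return bundle.get(col) is not None

def concat_filter(bundle, in_cols, out_col):
    """Join input columns; store the result in a new column. Always passes."""
    bundle[out_col] = "".join(str(bundle[c]) for c in in_cols)
    return True

def equals_filter(bundle, left, right):
    """Equality test between two columns."""
    return bundle[left] == bundle[right]

def chain(bundle, *filters):
    """A chain passes only while every filter in order returns true."""
    return all(f(bundle) for f in filters)

b = {"UID": "5dbd9451948ad895", "FOO": "a", "BAR": "b"}
ok = chain(b,
           lambda x: field_filter(x, "UID"),
           lambda x: concat_filter(x, ["FOO", "BAR"], "OUTPUT"))
```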
// DUR = Math.round((end-start)/1000)
{"op":"num", "columns":["END", "START", "DUR"], "define":"c0,c1,sub,v1000,ddiv,toint,v2,set"}
BundleFilter - Math!
Stack Math - Sample Data
C0 (START_TIME): 100,234    C1 (END_TIME): 200,468
Stack Math
c0,c1,sub,v1000,ddiv,toint,v2,set
• sub: 200,468 - 100,234 = 100,234
• v1000, ddiv: 100,234 / 1000 = 100.234
• toint: 100.234 → 100
• v2, set: store 100 in column 2
Stack Math - Sample Result
C0 (START_TIME): 100,234    C1 (END_TIME): 200,468    C2 (DURATION): 100
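The walkthrough above can be reproduced with a tiny stack-machine evaluator. This is a sketch for intuition only: the op names come from the slides, and the operand ordering is inferred from the worked example, not from Hydra's implementation.

```python
def stack_eval(define, row):
    """Evaluate a Hydra-style stack-math program against a row.
    cN pushes column N, vX pushes the literal X; sub/ddiv/toint/set
    follow the ordering shown in the walkthrough (an assumption)."""
    stack = []
    for tok in define.split(","):
        if tok == "sub":                   # later operand minus earlier
            b, a = stack.pop(), stack.pop()
            stack.append(b - a)
        elif tok == "ddiv":                # earlier operand divided by later
            b, a = stack.pop(), stack.pop()
            stack.append(a / b)
        elif tok == "toint":
            stack.append(int(stack.pop()))
        elif tok == "set":                 # pop column index, then value
            idx, val = stack.pop(), stack.pop()
            row[idx] = val
        elif tok.startswith("c"):          # push column value
            stack.append(row[int(tok[1:])])
        elif tok.startswith("v"):          # push literal value
            stack.append(int(tok[1:]))
    return row

row = stack_eval("c0,c1,sub,v1000,ddiv,toint,v2,set", [100234, 200468, None])
```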
{from:"SOURCE", filter:{op:"glob", pattern:"Log_[0-9]*"}}
ValueFilter - Glob
{op:"field", from:"LIST", filter:
  {op:"chain", filter:[
    {op:"split", split:"="},
    {op:"index", index:0}
]}},
ValueFilter - Chain, Split, Index
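The chained value filter above splits a single value on "=" and keeps element 0, i.e. it extracts the key from a KEY=VALUE string. A one-function Python sketch of that behavior (the sample input is hypothetical):

```python
def split_index(value, sep="=", index=0):
    """Sketch of the split-then-index value-filter chain: split the value
    on sep and return the element at index, or None if out of range."""
    parts = value.split(sep)
    return parts[index] if index < len(parts) else None

key = split_index("country=US")
```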
Data Attachments
Data Attachments are Hydra’s Secret Weapon
• Top-K Estimator
• Cardinality Estimation (HyperLogLog Plus)
• Quantile Estimation (Q,T-Digest)
• Bloom Filters
• Multiset streaming summarization (CountMin Sketch)
Data Attachment Example
A single node that tracks the top 1000 unique search terms, the distinct count of UIDs, and provides quantile estimation for the query time
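Hydra computes these with streaming estimators (a top-K estimator, HyperLogLog++, T-Digest). For intuition, here is an exact, in-memory stand-in for the node described above; the class and method names are illustrative, not Hydra's API:

```python
from collections import Counter

class SearchNodeStats:
    """Exact equivalent of the data attachments above: top-K terms,
    distinct UID count, and query-time quantiles. Hydra's versions trade
    exactness for bounded memory on unbounded streams."""
    def __init__(self, k=1000):
        self.k = k
        self.terms = Counter()
        self.uids = set()
        self.times = []

    def update(self, response_time, uid, terms):
        self.times.append(response_time)
        self.uids.add(uid)
        for t in terms.split():
            self.terms[t] += 1

    def top_terms(self):
        return self.terms.most_common(self.k)

    def uid_count(self):
        return len(self.uids)

    def quantile(self, q):
        """Nearest-rank quantile over observed response times."""
        self.times.sort()
        idx = min(int(q * len(self.times)), len(self.times) - 1)
        return self.times[idx]

stats = SearchNodeStats(k=3)
stats.update(3.535, "5214d63bab95687d", "the then good")
stats.update(3.568, "5dbd9451948ad895", "know boys")
stats.update(4.206, "5dbd9451948ad895", "to")
```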
Putting it All Together
Job Structure
• Jobs have three sections
• Source
• Map
• Output
Source
• Defines the properties of the input data set
• Several built in source types:
• Mesh
• Local File System
• Kafka
Map
• Select fields from input record to process
• Apply filters to rows and columns
• Drop or expand rows
Output - Tree
• Output(s) can be trees or data files
• Trees represent data aggregations that can be queried
• File Output Targets:
• File System
• Cassandra
• HDFS
Let's put it all together
Create Hydra Job
Run Job
Query
What are the top IP Addresses By Record Count?
• Exact
• path: root/byip/+:+hits
• ops: gather=ks;sort=1:n:d;limit=100
• Approximate
• path: root/byip/+$+uidcount
• ops: gather=ks;sort=1:n:d;limit=100
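Reading the ops string above: gather=ks groups on the key column and sums the count column, sort=1:n:d orders by column 1 numerically descending, and limit=100 truncates the result. A hedged Python sketch of that pipeline (my reading of the op codes from the slides, not Hydra's query engine):

```python
def gather_ks(rows):
    """gather=ks: group on column 0 (key), sum column 1."""
    sums = {}
    for key, val in rows:
        sums[key] = sums.get(key, 0) + val
    return list(sums.items())

def sort_desc_numeric(rows, col=1):
    """sort=1:n:d: order by column 1, numerically, descending."""
    return sorted(rows, key=lambda r: r[col], reverse=True)

def limit(rows, n=100):
    """limit=100: keep at most n rows."""
    return rows[:n]

rows = [("166.144.203.186", 2), ("88.120.153.226", 3), ("166.144.203.186", 2)]
top = limit(sort_desc_numeric(gather_ks(rows)), n=2)
```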
What are the top IPs by unique user count?
• Exact
• path: root/byip/+/+
• ops: gather=kk;sort=0;gather=ku;sort=1:n:d
• Approximate
• path: root/byip/+$+uidcount
• ops: gather=ks;sort=1:n:d;limit=100
What are the search terms for the slowest 5%?
• First get the 95th percentile query time
• path: /root$+timeDigest=quantile(.95)
• ops: num=c0,toint,v0,set;gather=a
• Now find all queries slower than the 95th percentile
• path: /root/bytime/+/+:+hits
• ops: num=c0,v950,gteq;gather=iks;sort=1:n:d
Daily Unique Searches, Users, IPs, and Distribution of Response Times?
• Query Path:
• root$+termcount$+uidcount$+ipcount$+timeDigest=quantile(.25)$+timeDigest=quantile(.50)$+timeDigest=quantile(.75)$+timeDigest=quantile(.95)$+timeDigest=quantile(.999):+hits
• Ops:
• gather=sssaaaaaa;title=total,searches,uids,ips,.25,.50,.75,.95,.999
• Remote Ops:
• num=c4,toint,v4,set;num=c5,toint,v5,set;num=c6,toint,v6,set;num=c7,toint,v7,set;num=c8,toint,v8,set;
But yeah, I could do that with the CLI!
Related Open Source Projects
• Meshy - https://github.com/addthis/meshy
• Codec - https://github.com/addthis/codec
• Muxy - https://github.com/addthis/muxy
• Bundle - https://github.com/addthis/bundle
• Basis - https://github.com/addthis/basis
• Column Compressor - https://github.com/addthis/columncompressor
• Cluster Boot Service - https://github.com/stewartoallen/cbs
Helpful Resources
• Hydra - https://github.com/addthis/hydra
• Hydra User Reference - http://oss-docs.addthiscode.net/hydra/latest/user-reference/
• Hydra User Guide - http://oss-docs.addthiscode.net/hydra/latest/user-guide/
• IRC - #hydra
• Mailing List - https://groups.google.com/forum/#!forum/hydra-oss