Post on 28-Jul-2015
HADOOP EAGLE: Full-stack real-time monitoring framework for eBay Hadoop
Edward Zhang @yonzhang2012 | Hao Chen @ihaoch
Use case: Detect node anomalies by analyzing the task failure ratio across all nodes
Assumption: The task failure ratio should be approximately equal on every node
Algorithm: Node-by-node comparison (symmetry violation) and per-node trend
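The symmetry assumption above can be illustrated with a minimal sketch. This is not Eagle's actual algorithm: the class name `NodeFailureRatioCheck`, the use of the cluster median as the baseline, and the fixed `tolerance` are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: flags hosts whose task failure ratio deviates
// from the cluster median by more than a fixed tolerance.
class NodeFailureRatioCheck {

    // Median of the ratios, used here as the cluster-wide baseline (an assumption).
    static double median(List<Double> values) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    // Returns the hosts whose ratio exceeds the median by more than `tolerance`.
    static List<String> anomalousHosts(Map<String, Double> failureRatioByHost, double tolerance) {
        List<String> result = new ArrayList<>();
        if (failureRatioByHost.isEmpty()) {
            return result;
        }
        double baseline = median(new ArrayList<>(failureRatioByHost.values()));
        for (Map.Entry<String, Double> e : failureRatioByHost.entrySet()) {
            if (e.getValue() - baseline > tolerance) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> ratios = new LinkedHashMap<>();
        ratios.put("host1", 0.02);
        ratios.put("host2", 0.03);
        ratios.put("host3", 0.25);
        ratios.put("host4", 0.02);
        System.out.println(anomalousHosts(ratios, 0.10));
    }
}
```

A per-node trend check could reuse the same comparison against a host's own history instead of the cluster median.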
HADOOP EAGLE – EBAY INC
HADOOP EAGLE
Background – initial use cases
Host: Task failure based anomaly host detection
Anomaly Detection & Alerting Analysis Auto-Remediation
Scale Challenges @ eBay Hadoop Monitoring
• 10+ large Hadoop clusters
• 10,000+ data nodes
• 50,000+ jobs per day
• 50,000,000+ tasks per day
• 500+ types of Hadoop/HBase native metrics
• Billions of audit events and metrics per day
Use Case Challenges @ eBay Hadoop Monitoring
• Host
  • Task-failure-ratio-based machine anomaly detection
• Job monitoring across its lifetime
  • Real-time running job performance analysis
  • Near real-time job history analytics
  • Data skew detection
• Hadoop native metrics
  • HDFS
  • HBase
  • M/R
• Logs
  • GC log
  • Hadoop daemon log
  • Audit log
  • HDFS image file
• YARN framework
  • Queue
Engineering Challenges @ eBay Hadoop Monitoring
• Varieties of data sources: M/R job history, running jobs, GC log, NameNode log, Hadoop native metrics, YARN queue, audit log, HDFS image file, etc.
• Varieties of data collectors: pull from HDFS, pull from the YARN API, ship logs, …
• Complex business logic: join outside data, pre-aggregations, memory window, …
• Alert rules can’t be hot deployed
• Scalability issue with single process
Job History Performance Analyzer
• Monitor job history files in near real time
• Crawl each job history file immediately after the job completes
• Apply expert rules to produce job performance suggestions
• Job history trend for the same type of job
[Diagram: event sequence: Job Start Event, Task Start Event, Task End Event, Task roll-up, Task2 Start Event, Task2 End Event, Task roll-up, Job End Event, Job Suggestion Rules]
Job real-time monitoring
• Monitor running jobs in real time
• Minute-level job progress snapshots
• Minute-level resource usage snapshots
  • CPU, HDFS I/O, disk I/O, slot seconds
• Roll up to user/queue/cluster level
• Sliding-window-based alerts
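A sliding-window alert over minute-level snapshots can be sketched as below. This is only an illustration under stated assumptions: the class `SlidingWindowAlert`, the average-over-window rule, and the threshold semantics are hypothetical and not taken from Eagle, and samples are assumed to arrive in timestamp order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a time-based sliding window over minute-level snapshots.
class SlidingWindowAlert {

    private static final class Sample {
        final long timestampMs;
        final double value;

        Sample(long timestampMs, double value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    private final Deque<Sample> window = new ArrayDeque<>();
    private final long windowMs;
    private final double threshold;
    private double sum;

    SlidingWindowAlert(long windowMs, double threshold) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        this.windowMs = windowMs;
        this.threshold = threshold;
    }

    // Adds a snapshot, evicts samples that fell out of the window, and
    // returns true when the window average exceeds the threshold.
    boolean add(long timestampMs, double value) {
        window.addLast(new Sample(timestampMs, value));
        sum += value;
        while (window.peekFirst().timestampMs <= timestampMs - windowMs) {
            sum -= window.removeFirst().value;
        }
        return sum / window.size() > threshold;
    }

    public static void main(String[] args) {
        // Three-minute window, alert when average CPU usage is above 80.
        SlidingWindowAlert cpuAlert = new SlidingWindowAlert(3 * 60_000L, 80.0);
        double[] cpuPerMinute = {50, 90, 100, 100};
        for (int minute = 0; minute < cpuPerMinute.length; minute++) {
            boolean fired = cpuAlert.add(minute * 60_000L, cpuPerMinute[minute]);
            System.out.println("minute " + minute + " alert=" + fired);
        }
    }
}
```

The newest sample is never evicted because the window length is positive, so the average is always taken over at least one sample.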
Service: GC Log / Server Log
• GC event detection and prediction
• Log metrics statistics
• Real-time log anomaly detection
• Data collector -> data processing -> metric pre-aggregation/alert engine -> storage -> dashboards
• We need to create a framework that covers the full stack of a monitoring system
Programming Paradigm and Abstraction
As a framework, Eagle does not assume:
• Data source (where, what)
• Business logic execution path (how)
• Policy engine implementation (how)
• Data sink (where, what)
Eagle as a Framework
As a framework, Eagle provides the following:
• SQL-like service API
• High-performing query framework
• Lightweight streaming process Java API
• Extensible policy engine implementation
• Scalable and distributed rule evaluation
• Native HBase data storage support
• Metadata-driven stream processing
• Data source extensibility
• Data sink extensibility
• Interactive dashboard
Eagle Monitoring Framework Internals
• Lightweight Streaming Process Framework
• Extensible & Scalable Policy Framework for Alert
• Eagle Query Framework
• Interactive Dashboards
Facts
• Computation is based on single events, which together constitute an endless continuous stream
• Computation can be aggregation, time window, length window, join with outside data, etc.
• The filter design pattern was used at the beginning to modularize code
Lightweight Streaming Process Framework
Abstraction
• Inspired by the Cascading framework, we abstract a lightweight streaming programming API that is independent of the execution environment
• A streaming process is a directed acyclic graph
• This layer of indirection provides code modularization, code reuse, and prevention of coupling with a specific execution environment
• Runs in a single process, on Storm, or on other streaming technology such as Spark
Step 1: Task DAG graph setup
Eagle Stream Data Processing API
@Override
protected void buildDependency(FlowDef def, DataProcessConfig config) {
    // Head of the graph: the source task
    Task header = Task.newTask("wordgenerator")
        .setExecutor(source)
        .completeBuild();
    // Second task, fed by the source task
    Task uppertask = Task.newTask("uppercase")
        .setExecutor(new UppercaseExecutor())
        .connectFrom(header)
        .completeBuild();
    // Third task, fed by the uppercase task
    Task groupbyUppercaseTask = Task.newTask("groupby_uppercase")
        .setExecutor(new GroupbyCountExecutor())
        .connectFrom(uppertask)
        .completeBuild();
    // Mark the last task as the end of the flow
    def.endBy(groupbyUppercaseTask);
}
Step 2: Inter-task data exchange protocol
Execution Graph development, compile and deploy
Development / Compile Phase Deployment / Runtime Phase
Extensible & Scalable Policy framework
Scalability
• Dynamic policy partitioning across compute nodes based on a configurable partition class
• Dynamic policy deployment
• Event partitioning by Storm and policy partitioning by Eagle (N events * M policies)
Extensibility
• Support for new policy evaluation engines, for example Siddhi, Esper, machine learning, etc.
Features
• Policy CRUD
• Stream metadata (event attribute name, attribute type, attribute value resolver, …)
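Policy partitioning through a configurable partition class can be sketched as follows. All names here (`PolicyPartitionerSketch`, `HashPolicyPartitioner`, `PolicyPartitionDemo`) and the hash-based strategy are assumptions made for illustration; they are not taken from the Eagle code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical partition contract: maps a policy to one of N evaluator instances.
interface PolicyPartitionerSketch {
    int partition(int numPartitions, String policyId);
}

// Default strategy for this sketch: a stable hash of the policy id.
class HashPolicyPartitioner implements PolicyPartitionerSketch {
    @Override
    public int partition(int numPartitions, String policyId) {
        // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
        return Math.floorMod(policyId.hashCode(), numPartitions);
    }
}

class PolicyPartitionDemo {

    // Policies that the evaluator instance `partitionIndex` is responsible for.
    static List<String> policiesFor(int partitionIndex, int numPartitions,
                                    List<String> policyIds,
                                    PolicyPartitionerSketch partitioner) {
        List<String> owned = new ArrayList<>();
        for (String id : policyIds) {
            if (partitioner.partition(numPartitions, id) == partitionIndex) {
                owned.add(id);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        List<String> policies =
                Arrays.asList("highFailureRatio", "gcPause", "queueBacklog", "dataSkew");
        PolicyPartitionerSketch partitioner = new HashPolicyPartitioner();
        for (int i = 0; i < 3; i++) {
            System.out.println("evaluator " + i + " -> "
                    + policiesFor(i, 3, policies, partitioner));
        }
    }
}
```

In this sketch every policy is owned by exactly one evaluator instance, so adding instances spreads the M policies across them instead of duplicating the evaluation work.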
Extensibility of policy framework
public interface PolicyEvaluatorServiceProvider {
    public String getPolicyType();
    public Class<? extends PolicyEvaluator> getPolicyEvaluator();
    public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser();
    public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder();
    public List<Module> getBindingModules();
}
The policy evaluator provider uses SPI to register policy engine implementations
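SPI registration of this kind is commonly done in Java with `java.util.ServiceLoader`. The sketch below shows the general mechanism only; whether Eagle uses `ServiceLoader` directly is an assumption, and `EvaluatorProvider`, `SiddhiLikeProvider`, and `ProviderRegistry` are simplified, hypothetical stand-ins for the interface shown above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Simplified, hypothetical stand-in for a policy evaluator provider interface.
interface EvaluatorProvider {
    String getPolicyType();
}

// Example implementation; the policy type name is made up for this sketch.
class SiddhiLikeProvider implements EvaluatorProvider {
    @Override
    public String getPolicyType() {
        return "siddhi-like";
    }
}

class ProviderRegistry {

    // Indexes providers by policy type and rejects duplicate registrations.
    static Map<String, EvaluatorProvider> index(Iterable<EvaluatorProvider> providers) {
        Map<String, EvaluatorProvider> byType = new HashMap<>();
        for (EvaluatorProvider provider : providers) {
            if (byType.putIfAbsent(provider.getPolicyType(), provider) != null) {
                throw new IllegalStateException(
                        "Duplicate policy type: " + provider.getPolicyType());
            }
        }
        return byType;
    }

    public static void main(String[] args) {
        // ServiceLoader discovers implementations listed in a classpath resource named
        // META-INF/services/<fully qualified interface name>.
        // Without such a file, nothing is discovered and the map stays empty.
        Map<String, EvaluatorProvider> discovered =
                index(ServiceLoader.load(EvaluatorProvider.class));
        System.out.println("Discovered policy types: " + discovered.keySet());
    }
}
```

With this pattern a new policy engine can be added by shipping a jar that contains an implementation and the matching `META-INF/services` entry, without changing the framework code.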
Eagle Query Framework
Persistence
• Metric
• Event
• Metadata
• Alert
• Log
• Customized structure
• …
Query
• Search
• Filter
• Aggregation
• Sort
• Expression
• …
Features
• Simple API
• Powerful query
• High performance
• Scalability
• Pluggable
• …
A lightweight, metadata-driven store layer that serves the storage and query requirements commonly shared by most monitoring systems
Eagle Query Framework
• Metadata definition ORM
• High-performance RESTful API supporting CRUD
• SQL-like declarative query syntax
• Generic service client library
• Native support for HBase and RDBMS
• Interactive and customizable dashboard
• Annotations are metadata for entities
• Metadata-driven query compiling and response rendering
• Metadata-driven ser/deser
• Rename columns to shorter strings (HBase)
• Entity metadata primitives
  • Table
  • ColumnFamily
  • Prefix (the very first partition key)
  • Service (entity identifier)
  • Partition
  • Tags
  • Indexes
  • Column
Metadata definition ORM
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@TimeSeries(false)
@Partition({"cluster", "datacenter"})
@Tags({"programId", "alertExecutorId", "policyId", "policyType"})
@Indexes({
    @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" })
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity {
    // Each field is stored under a short column qualifier
    @Column("a")
    private String desc;
    @Column("b")
    private String policyDef;
    @Column("c")
    private String dedupeDef;
    @Column("d")
    private String notificationDef;
    @Column("e")
    private String remediationDef;
    @Column("f")
    private boolean enabled;
}
Generic RESTful API & Query
eagle-service/rest/entities?query=<Query>
<Query> ::= <EntityName> "[" <FilterCondition> "]" "<" <GroupbyFields> ">" "{" <AggregatedFunctions> "}" [ "." "{" <SortbyOptions> "}" ]
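The grammar above can be assembled and URL-encoded with a small helper, sketched below. The class `EagleQueryBuilder`, its method names, and the `http://localhost:8080` host are hypothetical; only the `rest/entities?query=` path and the bracket syntax come from the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that assembles a query string following the grammar above.
class EagleQueryBuilder {

    // EntityName[filter]<groupBy>{functions} with an optional .{sortBy} suffix.
    static String aggregateQuery(String entity, String filter, String groupBy,
                                 String functions, String sortBy) {
        StringBuilder query = new StringBuilder(entity)
                .append('[').append(filter).append(']')
                .append('<').append(groupBy).append('>')
                .append('{').append(functions).append('}');
        if (sortBy != null && !sortBy.isEmpty()) {
            query.append(".{").append(sortBy).append('}');
        }
        return query.toString();
    }

    // Percent-encodes the query so that brackets, quotes and '@' survive in a URL.
    static String toUrl(String serviceBaseUrl, String query, int pageSize) {
        try {
            return serviceBaseUrl + "/rest/entities?query="
                    + URLEncoder.encode(query, "UTF-8")
                    + "&pageSize=" + pageSize;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String query = aggregateQuery(
                "JobExecutionService",
                "@cluster=\"xyz\" AND @datacenter=\"abc\"",
                "@user",
                "count, min(endTime-startTime)",
                null);
        System.out.println(query);
        System.out.println(toUrl("http://localhost:8080/eagle-service", query, 100));
    }
}
```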
Generic RESTful API Query Syntax
Search Query
query=JobExecutionService[@cluster="xyz" AND @datacenter="abc"]{@startTime,@numTotalMaps}&startTime=&endTime=&pageSize=100

Aggregate Query
Aggregation Query ::= <EntityName> [QueryCondition]<GroupbyFields>{AggregatedFunctions}.{SortbyOptions}
query=JobExecutionService[@cluster="xyz" AND @datacenter="abc"]<@user>{count, min(endTime-startTime)}&startTime=&endTime=&pageSize=100

Numeric Filters
query=TaskFailureCountService[@cluster="xyz" AND @datacenter="abc" AND @failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100

Operators
CONTAINS, IN, !=, =, <, <=, >, >=

Pagination
query=TaskFailureCountService[@cluster="xyz" AND @datacenter="abc" AND @failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100&startRowkey=BgVz-9R…….

TimeSeries Histogram Query
query=GenericMetricService[@cluster="ares" AND @datacenter="lvs"]<@user>{sum(value)}.{sum(value) desc}&timeSeries=true&intervmin=1440&pageSize=10000000&startTime=2014-07-01 00:00:00&endTime=2014-08-01 00:00:00&metricName=eagle.hdfs.spacesize.cluster
Generic Eagle Service Client Library
• Basic CRUD
• Fluent DSL
• Metric Builder API
• Parallel Client
• Asynchronous Client

client.metric("unit.test.metrics")
    .batch(5)
    .tags(tags)
    .send("unit.test.metrics", System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
    .send(System.currentTimeMillis(), 0.1)
    .send(System.currentTimeMillis(), 0.1, 0.2)
    .send(System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
    .send("unit.test.anothermetrics", System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
    .flush();

client.search("GenericMetricService[@cluster=\"cluster4ut\" AND @datacenter = \"datacenter4ut\"]<@cluster>{sum(value)}")
    .startTime(0)
    .endTime(System.currentTimeMillis() + 24 * 3600 * 1000)
    .metricName("unit.test.metrics")
    .pageSize(1000)
    .send();
HBase Storage Design
Uniform rowkey design
Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …
• Metric
Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …
• Entity
Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …
• Log
Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …
Rowvalue ::= Log Content
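The uniform rowkey layout can be sketched as a byte-level composition. The field order follows the slide, but the encoding is an assumption made for this example: 4-byte hash codes for string components, an 8-byte big-endian timestamp, and tags sorted by name. The class `RowkeySketch` is hypothetical and does not reflect Eagle's actual byte format.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of: Prefix | Partition Keys | timestamp | tagName | tagValue | ...
class RowkeySketch {

    static byte[] rowkey(String prefix, List<String> partitionValues,
                         long timestamp, Map<String, String> tags) {
        // Sort tags by name so the same tag set always yields the same key (assumption).
        Map<String, String> sortedTags = new TreeMap<>(tags);
        int size = 4                          // prefix hash
                + 4 * partitionValues.size()  // one hash per partition value
                + 8                           // timestamp
                + 8 * sortedTags.size();      // name hash + value hash per tag
        ByteBuffer buffer = ByteBuffer.allocate(size);
        buffer.putInt(prefix.hashCode());
        for (String value : partitionValues) {
            buffer.putInt(value.hashCode());
        }
        buffer.putLong(timestamp);
        for (Map.Entry<String, String> tag : sortedTags.entrySet()) {
            buffer.putInt(tag.getKey().hashCode());
            buffer.putInt(tag.getValue().hashCode());
        }
        return buffer.array();
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("user", "alice");
        tags.put("queue", "default");
        byte[] key = rowkey("alertdef", Arrays.asList("cluster1", "dc1"),
                System.currentTimeMillis(), tags);
        System.out.println("rowkey length in bytes: " + key.length);
    }
}
```

Because the prefix and partition values lead the key, rows for the same entity type and partition are stored contiguously and can be read with a single range scan.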
com.ebay.eagle.coprocessor.AggregateProtocol
HBase Coprocessor
[Chart: aggregation results for avg, count, max, min, and sum; series: no coprocessor in single region, coprocessor in single region, estimated in cluster; axis range 0 to 20000]
• Uniform HBase rowkey design for all types of monitoring data sources
• Logically partition data by tags, which are defined in the annotation @Partition({"cluster", "datacenter"})
• Physically shard data using a native HBase feature: rowkey range and region mapping
• Write throughput optimized by using HBase multi-put
• Coprocessor to maximize query performance
• Push evaluation of numeric filters down to HBase
• Secondary index support
• Inspection of RESTful resources and entity metadata
• Numeric filters
• Expression evaluation in output fields
• Rowkey inspection
Tuning for HBase Storage
• Interactive: IPython notebook-like interactive visualization analysis and troubleshooting.
• Dashboard: Customizable dashboard layout and drill-down path, persist and share.
Generic Dashboard Analytics for Eagle Store
Open Source Soon …
• First use case: Eagle to secure Hadoop platform based on Eagle framework
• Work closely with Hortonworks, Dataguise, …
• Share with community and get community’s support
• Continue to open source job monitoring, GC monitoring etc.