Hadoop @ … and other stuff. Who Am I? I'm this guy! [email protected] Staff.
Hadoop @… and other stuff
Hadoop… what is it good for?
Directly influenced by Hadoop
Indirectly influenced by Hadoop
Additionally, 50% for business analytics
A long long time ago (or 2009)
• 40 Million members
• Apache Hadoop 0.19
• 20 node cluster
• Machines built from Fry's parts (pizza boxes)
• PYMK in 3 days!
Now-ish
• Over 5000 nodes
• 6 clusters (1 production, 1 dev, 2 etl, 2 test)
• Apache Hadoop 1.0.4 (Hadoop 2.0 soon-ish)
• Security turned on
• About 900 users
• 15-20K Hadoop job submissions a day
• PYMK < 12 hours!
Current Setup
• Use Avro (mostly)
• Dev/Adhoc cluster
o Used for development and testing of workflows
o For analytic queries
• Prod clusters
o Data that will appear on our website
o Only reviewed workflows
• ETL clusters
• Walled off
Three Common Problems
Hadoop Cluster (not to scale)
[Diagram: Data In → Process → Data Out]
Data In
Databases (c. 2009-2010)
• Originally pulled directly through JDBC on a backup DB
o Pulled deltas when available and merged
• Data came extra late (waiting for replication of replicas)
o Large data pulls affected by daily locks
• Very manual: schemas, connections, repairs
• No deltas meant no Sqoop
• Costly (Oracle)
[Diagram: Live Sites → Offline Data copies → DWH (24 hr) → Hadoop (5-12 hr)]
Databases (Present)
• Commit logs/deltas from Production
• Copied directly to HDFS
• Converted/merged to Avro
• Schema is inferred
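A minimal sketch of the merge step described above, assuming each record carries a primary key and a commit timestamp (the field names `id` and `ts` are illustrative, not LinkedIn's actual schema):

```python
# Hypothetical sketch of the snapshot + commit-log merge: the previous full
# snapshot is combined with deltas, keeping the newest record per primary key.

def merge_snapshot(snapshot, deltas):
    """snapshot/deltas: lists of dicts with 'id' and 'ts' (commit timestamp)."""
    merged = {row["id"]: row for row in snapshot}
    for delta in deltas:
        current = merged.get(delta["id"])
        # A delta wins only if it is newer than what we already have.
        if current is None or delta["ts"] > current["ts"]:
            merged[delta["id"]] = delta
    return sorted(merged.values(), key=lambda r: r["id"])

snapshot = [{"id": 1, "ts": 10, "name": "a"}, {"id": 2, "ts": 10, "name": "b"}]
deltas = [{"id": 2, "ts": 12, "name": "b2"}, {"id": 3, "ts": 11, "name": "c"}]
print(merge_snapshot(snapshot, deltas))  # id 2 updated, id 3 appended
```

Doing this lazily (as the "Future" slide suggests) just means deferring the merge until read time instead of rewriting the snapshot on every delta.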
[Diagram: Live Site datastores → Hadoop, < 12 hr]
Databases (Future 2014?)
• Diffs sent directly to Hadoop
• Avro format
• Lazily merge
• Explicit schema
[Diagram: Databus → Hadoop, < 15 min]
Webtrack (c. 2009-2011)
• Flat files (xml)
• Pulled from every server periodically, grouped and gzipped
• Uploaded into Hadoop
• Failures nearly untraceable
[Diagram: servers → NAS → ? → NAS → Hadoop]
I seriously don't know how many hops and copies
Webtrack (Present)
• Apache Kafka!! Yay!
• Avro in, Avro out
• Automatic pulls into Hadoop
• Auditing
[Diagram: Kafka brokers → Hadoop, 5-10 mins end to end]
Apache Kafka
• LinkedIn Events
• Service metrics
• Use schema registry
o Compact data (md5)
o Auto register
o Validate schema
o Get latest schema
• Migrating to Kafka 0.8
o Replication
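One way the registry idea might look, sketched with an in-process dict; LinkedIn's registry is a service, and the exact hashing and registration details here are assumptions:

```python
import hashlib
import json

# Hypothetical sketch of the schema registry described above: register a
# schema once, get back a compact MD5 id, and tag each message with the id
# instead of the full schema.

class SchemaRegistry:
    def __init__(self):
        self._by_id = {}

    def register(self, schema: dict) -> str:
        canonical = json.dumps(schema, sort_keys=True)
        schema_id = hashlib.md5(canonical.encode()).hexdigest()
        self._by_id[schema_id] = schema  # auto-register on first sight
        return schema_id

    def lookup(self, schema_id: str) -> dict:
        return self._by_id[schema_id]

registry = SchemaRegistry()
sid = registry.register({"type": "record", "name": "PageView",
                         "fields": [{"name": "url", "type": "string"}]})
print(registry.lookup(sid)["name"])  # → PageView
```

The payoff is that the 16-byte id replaces a repeated schema blob on every event, and a consumer can always resolve the id back to the exact writer schema.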
Apache Kafka + Hadoop = Camus
• Avro only
• Uses ZooKeeper
o Discover new topics
o Find all brokers
o Find all partitions
• Mappers pull from Kafka
• Keeps offsets in HDFS
• Partitions data into hourly buckets
• Counts incoming events
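The hourly partitioning and offset bookkeeping above can be sketched roughly as follows; the path layout and offset format are illustrative, not Camus's actual ones:

```python
import datetime

# Hypothetical sketch of two Camus behaviors: bucketing each event into an
# hourly HDFS directory, and serializing consumed offsets so the next run
# can resume where this one stopped.

def hourly_path(topic: str, epoch_seconds: int) -> str:
    t = datetime.datetime.fromtimestamp(epoch_seconds, tz=datetime.timezone.utc)
    return f"/data/{topic}/{t:%Y/%m/%d/%H}"

def checkpoint(offsets: dict) -> str:
    # Camus keeps offsets as files in HDFS; here we just serialize them.
    return "\n".join(f"{tp}:{off}" for tp, off in sorted(offsets.items()))

print(hourly_path("PageView", 1372636800))  # → /data/PageView/2013/07/01/00
```

Partitioning by event hour is what makes downstream queries over a time window touch only the directories they need.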
Kafka Auditing
• Use Kafka to Audit itself
• Tool to audit and alert
• Compare counts
• Kafka 0.8?
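The count comparison behind the auditing can be sketched as: each tier (producer, broker, Hadoop) reports per-bucket event counts, and any disagreement flags lost or duplicated events. Tier and bucket names here are made up:

```python
# Hypothetical sketch of Kafka audit-count comparison across pipeline tiers.

def audit(counts_by_tier: dict, tolerance: float = 0.0):
    """counts_by_tier: {tier: {bucket: count}}. Returns mismatched buckets."""
    tiers = list(counts_by_tier)
    baseline = counts_by_tier[tiers[0]]  # first tier is the reference
    mismatches = []
    for bucket, expected in baseline.items():
        for tier in tiers[1:]:
            got = counts_by_tier[tier].get(bucket, 0)
            if abs(got - expected) > expected * tolerance:
                mismatches.append((bucket, tier, expected, got))
    return mismatches

counts = {"producer": {"2013-07-01-00": 1000},
          "hadoop":   {"2013-07-01-00": 997}}
print(audit(counts))  # → [('2013-07-01-00', 'hadoop', 1000, 997)]
```

A non-empty result is what the alerting tool would page on; the audit events themselves flow through Kafka, which is the "audit itself" point above.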
Lessons We Learned
• Avoid lots of small files
• Automation with Auditing = sleep for me
• Group similar data = smaller/faster
• Spend time writing to spend less time reading
o Convert to binary, partition, compress
• Future:
o Adaptive replication (higher for new, lower for old)
o Metadata store (HCatalog)
o Columnar store (ORC? Parquet?)
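A toy illustration of the write-side investment: grouping similar records by a partition key and compressing each group turns many small files into fewer, denser ones (stdlib gzip stands in for whatever codec a real cluster would use):

```python
import collections
import gzip

# Hypothetical sketch of "group similar data = smaller/faster": pack many
# small records into one compressed blob per partition key.

def pack(records):
    """records: iterable of (partition_key, line). Returns {key: gzipped bytes}."""
    groups = collections.defaultdict(list)
    for key, line in records:
        groups[key].append(line)
    return {key: gzip.compress("\n".join(lines).encode())
            for key, lines in groups.items()}

packed = pack([("2013-07-01", "a"), ("2013-07-01", "b"), ("2013-07-02", "c")])
print(sorted(packed))  # → ['2013-07-01', '2013-07-02']
```

Two partition files instead of three tiny ones; at cluster scale this is the difference between a NameNode drowning in block metadata and one that is not.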
Processing Data
Pure Java
• Time consuming writing jobs
• Little code re-use
• Shoot yourself in the face
• Only used when necessary
o Performance
o Memory
• Lots of libraries to help (boiler plate stuff)
Little Piggy (Apache Pig)
• Mainly a pigsty (Pig 0.11)
• Used by data products
• Transparent
• Good performance, tunable
• UDFs, DataFu
• Tuples and bags? WTF
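For anyone stuck on the same question: a Pig tuple is an ordered row of fields, and a bag is a collection of tuples; GROUP BY yields one (key, bag) pair per key. A rough Python analogy (not actual Pig):

```python
import collections

# Rough analogy for Pig's data model: a tuple is an ordered row, a bag is a
# collection of tuples, and GROUP produces (key, bag) pairs.

rows = [("alice", 3), ("bob", 1), ("alice", 5)]  # tuples
bags = collections.defaultdict(list)
for name, score in rows:
    bags[name].append((name, score))             # each value is a bag

# COUNT over each bag, as a Pig FOREACH ... GENERATE COUNT(...) would do:
counts = {name: len(bag) for name, bag in bags.items()}
print(counts)  # → {'alice': 2, 'bob': 1}
```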
Hive
• Hive 0.11
• Only for ad hoc queries
o Biz ops, PMs, analysts
• Hard to tune
• Easy to use
• Lots of adoption
• ETL data in external tables :/
• Hive server 2 for JDBC
Disturbing Mascot
Future in Processing
• Giraph
• Impala, Shark/Spark… etc
• Tez
• Crunch
• Other?
• Say no to streaming
Workflows
Azkaban
• Run hadoop jobs in order
• Run regular schedules
• Be notified on failures
• Understand how flows are executed
• View execution history
• Easy to use
Azkaban @ LinkedIn
• Used in LinkedIn since early 2009
• Powers all our Hadoop data products
• Been using 2.0+ since late 2012
• 2.0 and 2.1 quietly released early 2013
Azkaban @ LinkedIn
• One Azkaban instance per cluster
• 6 clusters total
• 900 Users
• 1500 projects
• 10,000 flows
• 2500 flow executions per day
• 6500 job executions per day
Azkaban (before)
Engineer-designed UI...
Azkaban 2.0
Azkaban Features
• Schedules DAGs for executions
• Web UI
• Simple job files to create dependencies
• Authorization/Authentication
• Project Isolation
• Extensible through plugins (works with any version of Hadoop)
• Prison for dark wizards
Azkaban - Upload
• Zip Job files, jars, project files
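The job files in that zip are plain property files; a minimal two-job flow might look like this (job names and commands are illustrative):

```properties
# foo.job (runs first)
type=command
command=echo "extract data"

# bar.job (Azkaban runs this only after foo succeeds)
type=command
command=hadoop jar my-workflow.jar
dependencies=foo
```

The `dependencies` lines are what Azkaban stitches into the DAG it schedules and draws in the web UI.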
Azkaban - Execute
Azkaban - Schedule
Azkaban - Viewer Plugins
HDFS Browser
Reportal
Future Azkaban Work
• Higher availability
• Generic Triggering/Actions
• Embedded graphs
• Conditional branching
• Admin client
Data Out
Voldemort
• Distributed Key-Value Store
• Based on Amazon Dynamo
• Pluggable
• Open-source
Voldemort Read-Only
• Filesystem store for RO
• Create data files and index on Hadoop
• Copy data to Voldemort
• Swap
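A minimal sketch of the build/copy/swap cycle described above; the class and its versioning scheme are invented for illustration, not Voldemort's actual API:

```python
# Hypothetical sketch of a read-only store swap: data files are built
# offline (on Hadoop), copied next to the live version, and a pointer is
# flipped; rollback is just flipping the pointer back.

class ReadOnlyStore:
    def __init__(self):
        self.versions = {}   # version number -> data files (dict stands in)
        self.current = None
        self.previous = None

    def push(self, version, data):
        self.versions[version] = data  # the slow copy finishes before swap

    def swap(self, version):
        self.previous, self.current = self.current, version

    def rollback(self):
        self.current = self.previous

    def get(self, key):
        return self.versions[self.current][key]

store = ReadOnlyStore()
store.push(1, {"alice": "v1"})
store.swap(1)
store.push(2, {"alice": "v2"})
store.swap(2)
store.rollback()               # bad push? serve the old version again
print(store.get("alice"))      # → v1
```

Because readers only ever see a fully built version, the swap is effectively atomic and rollback costs nothing, which is much of why it stays "operationally low maintenance."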
Voldemort + Hadoop
• Transfers are parallel
• Transfer records in bulk
• Ability to Roll back
• Simple, operationally low maintenance
• Why not HBase, Cassandra?
o Legacy, and no compelling reason to change
o Simplicity is nice
o Real answer: I don't know. It works, we're happy.
Apache Kafka
• Reverse the flow
• Messages produced by Hadoop
• Consumer upstream takes action
• Used for emails, read/write store updates, and other cases where Voldemort doesn't make sense
Nearing the End
Misc Hadoop at LinkedIn
• Believe in optimization
o File size, task count and utilization
o Reviews, culture
• Strict limits
o Quotas on size/file count
o 10K task limit
• Use capacity scheduler
o Default queue with a 15m limit
o marathon queue for others
We do a lot with little…
• 50-60% cluster utilization
o Or about 5x more work than some other companies
• Every job is reviewed for production
o Teaches good practices
o Scheduled to optimize utilization
o Prevents future headaches
• These keep our team size small
o Since 2009, Hadoop users grew 90x, clusters grew 25x, LinkedIn employees grew 15x
o Hadoop team grew 5x (to 5 people)
More info
Our data site: data.linkedin.com
Kafka: kafka.apache.org
Azkaban: azkaban.github.io/azkaban2
Voldemort: project-voldemort.com
The End