Hadoop Summit 2010 Keynote

Hadoop: Trends, Opportunities, Challenges. Hemanth Yamijala (Committer, Hadoop; Technical Lead, Map/Reduce, Yahoo!)

Transcript of Hadoop Summit 2010 Keynote

Page 1: Hadoop Summit 2010 Keynote

Hadoop

Trends, Opportunities, Challenges

Hemanth Yamijala

Committer, Hadoop

Technical Lead, Map/Reduce, Yahoo!

Page 2: Hadoop Summit 2010 Keynote

What is Hadoop?

• Distributed computing framework

– Offers storage and batch processing for petabytes of data

– Very suitable for ad-hoc textual processing applications

• Components

– Hadoop Distributed File System

– Map/Reduce programming framework

• Apache Software Foundation project
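
To make the Map/Reduce programming model on this slide concrete, here is a minimal in-memory sketch of a word count. It is an illustration only, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps as distributed tasks over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The point of the split is that map_phase and reduce_phase see only one record or one key at a time, which is what lets a framework spread them across many machines.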

Page 3: Hadoop Summit 2010 Keynote

Hadoop on your Yahoo! page …

Page 4: Hadoop Summit 2010 Keynote

Hadoop Adoption Trends - Yahoo!

• Runs the Yahoo! Distribution of Hadoop

• http://github.com/yahoo/hadoop

• 230 jobs/hour on average

• 4.38 Tb/hour of input, 936 Gb/hour of output
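
The hourly figures above can be turned into rough per-job averages. The sketch below assumes that "Tb" and "Gb" on the slide belong to the same unit family and that 1 Tb = 1000 Gb; both are assumptions, as the slide does not say. Under them, an average job reads roughly 19 Gb and writes roughly 4 Gb, so output is about a fifth of input. The function name per_job_averages is hypothetical.

```python
JOBS_PER_HOUR = 230        # from the slide
INPUT_TB_PER_HOUR = 4.38   # from the slide
OUTPUT_GB_PER_HOUR = 936   # from the slide
GB_PER_TB = 1000           # assumption: decimal units


def per_job_averages(jobs, input_tb, output_gb, gb_per_tb=GB_PER_TB):
    """Derive average per-job input, output and output/input ratio from hourly totals."""
    input_gb = input_tb * gb_per_tb
    return {
        "input_gb_per_job": input_gb / jobs,
        "output_gb_per_job": output_gb / jobs,
        "output_to_input_ratio": output_gb / input_gb,
    }


if __name__ == "__main__":
    averages = per_job_averages(JOBS_PER_HOUR, INPUT_TB_PER_HOUR, OUTPUT_GB_PER_HOUR)
    for name, value in averages.items():
        print(f"{name}: {value:.2f}")
```

These are averages only; the slide gives no information about how job sizes are distributed.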

Page 5: Hadoop Summit 2010 Keynote

Hadoop on your FB, Twitter pages

• Facebook

– Reporting, analytics, machine learning

• Amazon

– Hosted Hadoop on top of EC2 and S3

– Product search index

• Twitter

– Analytics, social network graphs

• AOL, Microsoft (PowerSet), IBM, …

• http://wiki.apache.org/hadoop/PoweredBy

Page 6: Hadoop Summit 2010 Keynote

Support of a vibrant community

Hadoop contributions:

– Core: HDFS, Map/Reduce

– Non-core: sub-projects

Hadoop mailing list traffic

Cloudera Distribution of Hadoop – paid, supported service offering from Cloudera

Page 7: Hadoop Summit 2010 Keynote

Support from Academia, Research

• PSG Tech, Coimbatore

– Semantic search, information retrieval, scheduling, applications in molecular biology (deep dive on this later)

• IIIT, Hyderabad

– Applications in Indian language content processing, scheduling

• IISc, Bangalore

– Modeling a simulator for Hadoop

• Many more – M45, OpenCirrus, …

Page 8: Hadoop Summit 2010 Keynote

Hadoop – a RAD tool?

• Without Hadoop

– Build-out and maintenance of hardware

– Transfer, storage of data (deep dive on this later)

– Handling failures, efficiency

• Enables rapid experimentation, iteration, repeatability, low cost of failure

• Great ecosystem: Streaming, Pig, Hive, HBase, Oozie, Avro…

Page 9: Hadoop Summit 2010 Keynote

Technical focus areas at Yahoo!

• Security

– Kerberos based authentication

• Backwards Compatibility – 1.0

– APIs cannot be broken between major releases

– A new API in Map/Reduce that enables this

• Robustness

– Multiple bug fixes

– Map/Reduce framework refactoring for better concurrency, simplifying control flow logic

Page 10: Hadoop Summit 2010 Keynote

Technical focus areas at Yahoo!

• Append / Sync / Flush

– Until Hadoop 0.20, files were write once

– Append is going to open Hadoop up to more applications

• Efficiency in scheduling, data processing

– Task scheduling for better utilization, better sharing policies

– Zero data copy – usage of direct I/O buffers

• Quality engineering

– Automated distributed system testing, performance benchmarks (deep dive coming)

Page 11: Hadoop Summit 2010 Keynote

Agenda for Hadoop Summit

• Lightning Talk by Hari Vasudev (VP Platform Tech Group, Yahoo!)

• Data Management on Grid by Srikanth Sundarrajan (Yahoo!)

• Machine Learning using Hadoop - Real Case Study by Krishna Prasad Chitrapura (Yahoo!)

• Multiple Sequence Alignment using Hadoop by Dr. Sudha Sadhasivam (PSG Tech, Coimbatore)

Page 12: Hadoop Summit 2010 Keynote

Agenda for Hadoop Summit

• Benchmarking and Optimizing Hadoop deployments (benchmarking on HiBench) by Mukesh Gangadhar (Intel)

• Challenges and Uniqueness of QE and RE processes in Hadoop by Jayant Mahajan (Yahoo!)

• Tuning Hadoop to deliver performance to your application by Srigurunath Chakravarthi (Yahoo!)

• Panel Discussion: Moderator: Basant Verma (Yahoo!); Panelists: T. S. Mohan (Infosys), Sudha Sadhasivam (PSG Tech), Chidambaran Kollengode (Yahoo!) & Jothi Padmanabhan (Yahoo!)

• Yahoo booth throughout the day: win cool prizes ☺

Page 13: Hadoop Summit 2010 Keynote

Thank You! – Q&A

Hemanth Yamijala


Page 14: Hadoop Summit 2010 Keynote

Backup Slides

Page 15: Hadoop Summit 2010 Keynote

Challenges for Yahoo!

• No longer just a wildly successful cool project!

– People are demanding we deliver!

• Production usage, availability, SLAs

– Jobs that MUST finish in 15 minutes, or revenue is lost, and the time limits are going down

• Usability, Operability

• Scale, Performance

– Ever-increasing demands mean we need larger clusters, faster throughput

Page 16: Hadoop Summit 2010 Keynote

Design considerations

• Cost Effectiveness

– Runs on commodity hardware, Linux

• Linear Scale

• Fault Tolerance

– Block replication, checksums

– Transparent monitoring and re-execution of tasks

• Efficiency

– Data locality

– Efficient resource usage
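
The fault-tolerance bullets above (block replication, checksums) can be sketched in a few lines. This is a toy model under stated assumptions, not HDFS's actual implementation: nodes are plain dictionaries, the names Block, write_block and read_block are hypothetical, CRC32 stands in for whatever checksum the file system uses, and the replication factor of 3 is only an illustrative default.

```python
import zlib


class Block:
    """One replica of a block: the payload plus the checksum recorded at write time."""

    def __init__(self, data):
        self.data = data
        self.checksum = zlib.crc32(data)


def write_block(block_id, data, nodes, replication=3):
    """Store an independent copy of the block on the first `replication` nodes."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    for node in nodes[:replication]:
        node[block_id] = Block(data)


def read_block(block_id, nodes):
    """Return the data from the first replica whose checksum still matches."""
    for node in nodes:
        replica = node.get(block_id)
        if replica is None:
            continue  # this node holds no copy of the block
        if zlib.crc32(replica.data) == replica.checksum:
            return replica.data
        # Checksum mismatch: treat this replica as corrupt and try the next one.
    raise IOError(f"no healthy replica found for block {block_id!r}")


if __name__ == "__main__":
    cluster = [{}, {}, {}, {}]  # four hypothetical nodes
    write_block("blk_1", b"hello hdfs", cluster)
    cluster[0]["blk_1"].data = b"corrupted!"  # simulate disk corruption on one node
    print(read_block("blk_1", cluster))
```

The reader never needs to know which copy was damaged: a bad replica is skipped and another is used, which is the sense in which failures are handled transparently.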