His2013 edwardyoon
-
Upload
edward-j-yoon -
Category
Technology
-
view
667 -
download
0
description
Transcript of His2013 edwardyoon
![Page 2: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/2.jpg)
I am ..
● PMC member and Committer of ..○ Apache Hama ○ Apache BigTop○ Apache MRQL
● Oracle corp. (2012 ~ 2013)● Korea Telecom (2011 ~ 2012)● NHN, corp. (2006 ~ 2011)
![Page 3: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/3.jpg)
Agenda
1. Introduction of Hamaa. What's Hama?b. Why Hama?
i. Evolution of the World Wide Webii. Evolution of the Infrastructureiii. Transition of the Technologiesiv. So, why Hama?
c. Benchmarksi. Hama vs. Spark - KMeans
2. Use cases of Apache Hamaa. Netflow Analytics at Korea Telecomb. SiteRank at Sogou.com
3. What's Next?
![Page 4: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/4.jpg)
Introduction of Hama BSP
![Page 5: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/5.jpg)
What's Hama?
● Apache Top Level Project● Written in Java● a general BSP computing engine
○ Java, Python, and C/C++ Interface○ MRQL - BSP Query Language
● 8+ Active committers!
![Page 6: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/6.jpg)
What's BSP?
● a Parallel computing model on Message-Passing Architecture
![Page 7: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/7.jpg)
MapReduce vs. Hama BSP
VS
Data-IntensiveComplex Computation-intensive
![Page 8: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/8.jpg)
MPI on YARN vs. BSP
How can you solve the below problems of MPI?
● Data partitioning and data locality aware task scheduling
● Job Fault Tolerance● Deadlock or Race Conditions● Complexity of Interface
The BSP is answer!
![Page 9: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/9.jpg)
Hadoop Ecosystem
HDFS
MapReduce Apache Hama (BSP)
Apache MRQL
YARN
Hive, Mahout, …, etc.
Spark, MPI, …, etc.
![Page 10: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/10.jpg)
Why Hama?
![Page 11: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/11.jpg)
Evolution of the WWW
● 1990 ~ : Web Documents Web 2.0 Blog, Open API
Smartphone Social Network
● ~ 2013 : Responsive Apps for multi-devices
![Page 12: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/12.jpg)
Evolution of the Infrastructure
● 1990 ~ : Server/Web Hosting Google Apps Cloud Computing IaaS, PaaS, SaaS
● ~ 2013 : Cloud/App Hosting
![Page 13: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/13.jpg)
Transition of the Technologies
● 2003 ~ : ○ SQL Database connectivity interface○ Web-scale data processing
■ MapReduce■ Hive, Pig, Mahout, ....
● 2007 ~ : ○ Key/Value interface○ Realtime, ML and Graph Processing
■ Storm■ Apache Hama!■ GraphLab, Spark, …, etc.
![Page 14: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/14.jpg)
So, why Hama?
● Simple and Flexible message-passing programming Interface
And,
● Machine Learning Package○ K-Means clustering is almost 500x ~ 1000x faster
than Mahout MR version● Graph Package (Google's Pregel)
○ PageRank is almost 10 ~ 20x faster than MapReduce version
![Page 15: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/15.jpg)
Benchmarks with 256 cores
● SSSP on random 1 Billion edges
400 secs!
● PageRank on Wikipedia link DataSet, contains 5,716,808 pages and 130,160,392 links).
17 secs!
![Page 16: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/16.jpg)
Hama vs. Spark - KMeans
Nodes
Seconds
![Page 17: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/17.jpg)
Use cases of Apache Hama
![Page 18: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/18.jpg)
Netflow Analytics at Korea Telecom
Weather forecasting for Clouds
● 4 Full Racks● Used as a Real-time event processing
○ Monitoring the network usage of each VMs as a real time
○ Detecting anomaly traffics○ Sharing the risks among Servers○ Billing, …, etc.
![Page 19: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/19.jpg)
SiteRank at Sogou.com
Sogou.com runs SiteRank algorithm on a 7,200 cores Hama cluster. ● SiteRank is the ranking generated by
applying the classical PageRank algorithm to the graph of Web sites.
● Dataset is about 400GB contains about 600M vertices and 6B edges
![Page 20: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/20.jpg)
What's Next?
● Performance Improvement● Fault-Tolerant● Asynchronous Scalable Messaging● GPU acceleration● NoSQL In/Out Formatters
● Develop the Hadoop Ecosystem
![Page 21: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/21.jpg)
Hadoop Ecosystem
HDFS
MapReduce Apache Hama (BSP)
Apache MRQL
YARN
Hive, Mahout, …, etc.
Spark, MPI, …, etc.
![Page 22: His2013 edwardyoon](https://reader033.fdocuments.in/reader033/viewer/2022051816/54620e5caf79597c138b4875/html5/thumbnails/22.jpg)
Thanks, Questions?