eHarmony in the Cloud
Uploaded by craig-dickson
![Page 1: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/1.jpg)
eHarmony in the Cloud
Brian Ko
![Page 2: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/2.jpg)
eHarmony
• Online subscription-based matchmaking service
• Available in the United States, Canada, Australia, and the United Kingdom.
• On average, 236 members in the US marry every day.
• More than 20 million registered users.
![Page 3: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/3.jpg)
Why Cloud?
• The problem exceeds the limits of the data center and data warehouse environment.
• Leverage EC2 and Hadoop to scale data processing
![Page 4: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/4.jpg)
Finding Matches
• Model creation
![Page 5: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/5.jpg)
Finding Matches
• Matching
![Page 6: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/6.jpg)
Finding Matches
• Predictive model scores
![Page 7: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/7.jpg)
Requirements
• All matches, scores, and user information should be archived daily
• Ready for 10X growth
• Possible O(n²) problem
• Need to support a set of models that keeps becoming more complex
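The O(n²) concern and the 10X growth target interact: if every user could in principle be scored against every other user, the number of candidate pairs grows with the square of the user count. A minimal sketch of that arithmetic follows, assuming an all-pairs comparison (the slides do not say how candidates are pruned in practice). `candidate_pairs` is a hypothetical helper and the user count is illustrative, not a figure from the presentation.

```python
def candidate_pairs(n_users: int) -> int:
    """Number of unordered pairs among n_users (n choose 2)."""
    return n_users * (n_users - 1) // 2


# Illustrative user count only; not a figure from the presentation.
baseline = 1_000_000
grown = 10 * baseline  # the "10X growth" requirement

ratio = candidate_pairs(grown) / candidate_pairs(baseline)
print(f"pairs at baseline: {candidate_pairs(baseline):,}")
print(f"pairs after 10X growth: {candidate_pairs(grown):,}")
print(f"growth in pairs: about {ratio:.0f}x")
```

Under this assumption, ten times the users means roughly a hundred times the pairs, which is why a design that only just fits today would not survive the growth target.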
![Page 8: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/8.jpg)
Challenge
• Current architecture is multi-tiered with a relational back-end
• Scoring is DB-join intensive
• Data needs constant archiving
– Matches, match scores, and user attributes at time of match creation
– Model validation is done at a later time across many days
• Need a non-DB solution
![Page 9: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/9.jpg)
Solution
• Hadoop: an open-source Java implementation of Google’s MapReduce framework
– Distributes work across vast amounts of data
– Hadoop Distributed File System (HDFS) provides reliability through replication
– Automatic re-execution on failure/distribution
– Scales horizontally on commodity hardware
![Page 10: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/10.jpg)
Amazon Web Services
• Simple Storage Service (S3) provides cheap, effectively unlimited storage.
• Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand.
![Page 11: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/11.jpg)
MapReduce
• A large server farm can use MapReduce to process huge datasets.
• Map step
– Master node takes the input
– Chops it up into smaller sub-problems
– Distributes those to worker nodes
• Reduce step
– Master node takes the answers to all the sub-problems
– Combines them to produce the output
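The two steps above can be sketched on a single machine. This is only an illustration of the split, distribute, and combine pattern as the slide describes it, not Hadoop code: threads stand in for worker nodes, and the function names (`split`, `map_worker`, `reduce_step`, `run_job`) and the sum-of-squares task are assumptions chosen for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Chop the input into roughly equal sub-problems."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_worker(chunk):
    """Work done by one worker: a partial sum of squares."""
    return sum(x * x for x in chunk)


def reduce_step(partials):
    """Combine the workers' answers into the final output."""
    return sum(partials)


def run_job(data, n_workers=4):
    chunks = split(data, n_workers)
    # Threads stand in for worker nodes in this single-machine sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_worker, chunks))
    return reduce_step(partials)


result = run_job(list(range(1, 101)))
print(result)  # sum of the squares of 1..100
```

Because each chunk is processed independently, adding workers shortens the map phase without changing the result.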
![Page 12: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/12.jpg)
Why Hadoop
• Mapper and Reducer are written by you
• Hadoop provides
– Parallelization
– Shuffle and sort
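The division of labor can be illustrated with a small sketch: the mapper and reducer are the parts you write, while `shuffle_and_sort` is a stand-in for what the framework does between them. None of this is the Hadoop API. The record layout (a user id and a score) and the "best score per user" task are hypothetical, picked only to show keys being grouped.

```python
from collections import defaultdict


# --- Written by you -------------------------------------------------
def mapper(record):
    """Emit (key, value) pairs for one input record."""
    user_id, score = record
    yield user_id, score


def reducer(key, values):
    """Collapse all values for one key into a single result."""
    return key, max(values)


# --- Stand-in for what the framework provides -----------------------
def shuffle_and_sort(pairs):
    """Group values by key and hand the keys back in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def run(records):
    mapped = (pair for record in records for pair in mapper(record))
    return [reducer(key, values) for key, values in shuffle_and_sort(mapped)]


records = [("u2", 0.4), ("u1", 0.9), ("u2", 0.7), ("u1", 0.3)]
best = run(records)
print(best)
```

The reducer never sees values for two different keys mixed together; the grouping guarantee from shuffle and sort is what keeps the user-written code this small.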
![Page 13: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/13.jpg)
Actual Process
• Upload to S3 and start EC2 cluster
![Page 14: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/14.jpg)
Actual Process
• Process and archive
![Page 15: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/15.jpg)
Amazon Elastic MapReduce
• It is a web service
• EC2 cluster is managed for you behind the scenes
• Starts a Hadoop implementation of the MapReduce framework on Amazon EC2
• Each step can read and write data directly from and to S3
• Based on Hadoop 0.18.3
![Page 16: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/16.jpg)
Elastic MapReduce
• No need to explicitly allocate, start, and shut down EC2 instances
• Individual jobs were previously managed by a remote script running on the master node (no longer required)
• Jobs are arranged into a job flow, created with a single command
• Status of a job flow and all its steps is accessible through a REST service
![Page 17: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/17.jpg)
Before Elastic MapReduce
• Allocate/Verify cluster
• Push application to cluster
• Run a control script on the master
• Kick off each job step on the master
• Create and detect a job completion token
• Shut the cluster down
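The "create and detect a job completion token" step is a common way for a control script to learn that a job step has finished. The slides do not say where the token lived or what it was called, so the following is only a sketch under assumptions: the token is an empty marker file named `_DONE` (a hypothetical name) in a local directory, and `write_token` and `wait_for_token` are illustrative helpers.

```python
import tempfile
import time
from pathlib import Path

TOKEN_NAME = "_DONE"  # hypothetical marker name, not from the slides


def write_token(output_dir: Path) -> Path:
    """Called as the last action of a job step to signal completion."""
    token = output_dir / TOKEN_NAME
    token.touch()
    return token


def wait_for_token(output_dir: Path, timeout_s: float, poll_s: float = 0.05) -> bool:
    """Poll until the token appears; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if (output_dir / TOKEN_NAME).exists():
            return True
        time.sleep(poll_s)
    return (output_dir / TOKEN_NAME).exists()


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    seen_before = wait_for_token(out, timeout_s=0.2)  # no token yet
    write_token(out)
    seen_after = wait_for_token(out, timeout_s=0.2)
    print(seen_before, seen_after)
```

A timeout matters here: without one, a failed job step would leave the control script waiting forever and the cluster running.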
![Page 18: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/18.jpg)
After Elastic MapReduce
• With Elastic MapReduce we can do all this with a single local command
• Uses jar and conf files stored on S3
• Various monitoring tools for EC2 and S3 are provided
![Page 19: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/19.jpg)
Development Environment
• Cheap to set up on Amazon
• Quick setup: the number of servers is controlled by a config variable
• Identical to production
• Separate development account recommended
![Page 20: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/20.jpg)
Cost comparison
• Average EC2 and S3 cost
– Each run takes 2 to 3 hours
– $1200/month for EC2
– $100/month for S3
• Projected in-house cost
– $5000/month for a local cluster of 50 nodes running 24/7
– A new company would also need to add data center and operations personnel expenses
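The comparison above reduces to simple arithmetic. The sketch below uses only the monthly figures quoted on the slide; the variable names are illustrative.

```python
# Monthly figures quoted on the slide, in US dollars.
ec2_monthly = 1200
s3_monthly = 100
in_house_monthly = 5000  # 50-node local cluster running 24/7

cloud_monthly = ec2_monthly + s3_monthly
savings = in_house_monthly - cloud_monthly
savings_pct = 100 * savings / in_house_monthly

print(f"cloud: ${cloud_monthly}/month")
print(f"in-house: ${in_house_monthly}/month")
print(f"difference: ${savings}/month ({savings_pct:.0f}% lower)")
```

Since the in-house figure excludes the data center and personnel costs a new company would face, the real gap for such a company would likely be larger.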
![Page 21: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/21.jpg)
Summary
• Dev tools are really easy to work with and work right out of the box
• Standard Hadoop AMI worked great
• Easy to write unit tests for MapReduce
• Hadoop community support is great.
• EC2/S3/EMR are cost-effective
![Page 22: eHarmony in the Cloud](https://reader033.fdocuments.in/reader033/viewer/2022060110/555eca5dd8b42af67f8b51f1/html5/thumbnails/22.jpg)
The End
5 minutes of question time starts now!