1 Large Scale Applications on Hadoop in Yahoo Vijay K Narayanan, Yahoo! Labs 04.26.2010 Massive Data...

Large Scale Applications on Hadoop in Yahoo

Vijay K Narayanan, Yahoo! Labs04.26.2010

Massive Data Analytics

Over the Cloud

(MDAC 2010)

Outline

Hadoop in Yahoo!

Common types of applications on Hadoop

Sample applications in:

› Content Analysis

› Web Graph

› Mail Spam Filtering

› Search

› Advertising

User Modeling on Hadoop

Challenges and Practical Considerations

Hadoop in Yahoo

By the Numbers

About 30,000 nodes in tens of clusters› 1 Node = 4 *1 TB disk, 8 cores, 16 GB RAM as a typical configuration.

Largest single cluster of about 4000 nodes

4 tiers of clusters› Application research and development

› Production clusters

› Hadoop platform development and testing

› Proof of concepts and ad-hoc work

Over 1000 users across research, engineering, operations etc.› Running more than 100,000 jobs per day

More than 3 PB of data › Compressed and un-replicated volume

Currently running Hadoop 0.20

Advantages

Wide applicability of the M/R computing model

› Many problems in internet domain can be solved by data parallelism

High throughput

› Stream through 100 TB of data in less than 1 hour

› Applications that took weeks earlier complete in hours

Research prototyping, development, and production deployment systems are (almost) identical

Scalable, economical, fault-tolerant

Shared resource with common infrastructure operations

Entities in internet eco-system

Content(pages, blogs etc.)

Search

Engine

Search

Advertising

Content/Display

Advertising

Queries Ads(Text, Display etc.)

Browses

Searches Interacts

Leverage Hadoop extensively in all of these domains in Yahoo!

Common Types of Applications

Common applications on Hadoop in Yahoo!

1. Near real-time data pipelines

› Backbone for analytics, reporting, research etc.

› Multi-step pipelines to create data feeds from logs

• Web-servers - page content, layout and links, clicks, queries etc.

• Ad servers – ad serving opportunity data, impressions

• Clicks, beacons, conversion data servers

› Process large volume of events

• Tens of billions events/day

• Tens of TB (compressed) data/day

› Latencies of tens of minutes to a few hours.

› Continuous runs of jobs working on chunks of data

Example: Data Pipelines

• Tens of billions events/day

• Parse and Transform event streams

• Join clicks with views

• Filter out robots

• Aggregate, Sort, Partition

• Data Quality Checks

Analytics

Ads and

Content

User Profiles

User Sessions

• Network analytics

• Experiment reporting

• Optimize traffic &engagement• User session & click-stream• Path and funnel analysis

• User segment analysis• Interest

• Measurements• Modeling and Scoring• Experimentation

2. High throughput engine for ETL and reporting applications

› Put large data sources (e.g. logs) on HDFS

› Run canned aggregations, transformations, normalizations

› Load reports to RDBMS/data marts

› Hourly and Daily batch jobs

3. Exploratory data research

› Ad-hoc analysis and insights into data

› Leveraging Pig and custom Map Reduce scripts

› Pig is based on Pig Latin (up-coming support for SQL)

• Procedural language, designed for data parallelism

• Supports nested relational data structures

4. Indexing for efficient retrieval

› Build and update indices of content, ads etc.

› Updated in batch mode and pushed for online serving

› Efficient retrieval of content and ads during serving

5. Offline modeling

› Supervised and un-supervised learning algorithms

› Outlier detection methods

› Association rule mining techniques

› Graph analysis methods

› Time series analysis etc.

6. Batch and near real-time scoring applications

› Offline model scoring for upload to serving applications

› Frequency: hourly or daily jobs

7. Near real-time feedback from serving systems

› Update features and model weights based on feedback from serving

› Periodically push these updates to online scoring and serving

› Typical updates in minutes or hours

8. Monitoring and performance dashboards

› Analyze scoring and serving logs for:

• Monitoring end to end performance of scoring and serving systems

• Measurements of model performance and measurements

Sample Applications

Application: Content Analysis

Web data

› Information about every web site, page, and link crawled by Yahoo

› Growing corpus of more than 100Tb+ data from 10’s of billions documents

Document processing pipeline on Hadoop

Enrich with features from page, site etc.

› Page segmentation

› Term document vector and weighted variants

Entity anlaysis

› Detection, disambiguation, resolution of entities in page

Concepts and topic modeling and clustering

Page quality analysis

Application: Web graph analysis

Directed graph of the web

Aggregated views by different dimensions

› Sites, Domains, Hosts etc.

Large scale analysis of this graph

› 2 trillion links

› Jobs utilize 100,000+ maps, ~10,000 reduces

› ~300 TB compressed output

Attribute Before Hadoop With Hadoop

Time 1 month Days

Maximum number of URLs

~ Order of 100 billion Many times larger

Application: Mail spam filtering

Scale of the problem

› ~ 25B Connections, 5B deliveries per day

› ~ 450M mailboxes

User feedback on spam is often late, noisy and not always actionable Problem Algorithm Data size Running time

on Hadoop

Detecting spam campaigns

Frequent Itemset mining

~ 20 MM spam votes

1 hour

“Gaming” of spam IP votes by spammers

Connected component (squaring a bi-partite graph)

~ 500K spammers, 500k spam IPs

1 hour

Application: Mail Spam Filtering Campaigns

9 2595 (IPTYPE:none,FROMUSER:sales,SUBJ:It's Important You Know,FROMDOM:dappercom.info,URL:dappercom.info,ip_D:66.206.14.77,)

9 2457 (IPTYPE:none,FROMUSER:sales,SUBJ:Save On Costly Repairs,FROMDOM:aftermoon.info,URL:aftermoon.info,ip_D:66.206.14.78,)

9 2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)

9 2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)

9 2376 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:articulatedispirit.com,ip_D:216.218.201.149,)

9 2184 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:stratagemnepheligenous.com,ip_D:216.218.201.149,)

9 1990 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:sastlg.info,URL:sastlg.info,ip_D:66.206.25.227,)

9 1899 (IPTYPE:none,FROMUSER:sales,FROMDOM:brunhil.info,SUBJ:700-CreditScore-What-Is-Yours?,URL:brunhil.info,ip_D:66.206.25.227,)

9 1743 (IPTYPE:none,FROMUSER:sales,SUBJ:Now exercise can be fun,FROMDOM:accordpac.info,URL:accordpac.info,ip_D:66.206.14.78,)

9 1706 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:rionel.info,URL:rionel.info,ip_D:66.206.25.227,)

9 1693 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:astroom.info,URL:astroom.info,ip_D:66.206.25.227,)

9 1689 (IPTYPE:none,FROMUSER:sales,SUBJ:eBay: Work@Home w/Solid-Income-Strategies,FROMDOM:stamine.info,URL:stamine.info,ip_D:66.165.232.203,)

2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info, ip_D:66.206.25.227,)

2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)

Application: Search Ranking

Rank web-pages based on relevance to queries

› Features based on content of page, site, queries, web graph etc.

› Train machine learning models to rank relevant pages for queries

› Periodically learn new models

Dimension Before Hadoop Using Hadoop

Features ~ 100’s ~ 1000’s

Running Time ~ Days to weeks ~ hours

Application: Search AssistTM

• Related concepts occur together. Analyze ~ 3 years of logs

• Build dictionaries on Hadoop and push to online serving

Dimension Before Hadoop Using Hadoop

Time 4 weeks < 30 minutes

Language C++ Python

Development Time 2-3 weeks 2-3 days

Applications in Advertising

Expanding sets of seed keywords for matching with text ads

› Analyze text corpus, user query sessions, clustering keywords etc.

Indexing ads for fast retrieval

› Build and update index of more than a billion text ads

Response prediction and Relevance modeling

Categorization of pages and queries to help in matching

› Adult pages, gambling pages etc.

Forecasting of ad inventory

User modeling

Model performance dashboards

User Modeling on Hadoop

User activities

Large dimensionality of possible user activities

But a typical user has a sparse activity vector

Attributes of the events change over time

Building a pipeline on Hadoop to model user interests from activities

Attribute Possible Values Typical values per user

Pages ~ MM 10 – 100

Queries ~ 100s of MM Few

Ads ~ 100s of thousands 10s

User Modeling Pipeline

5 main components to train, score and evaluate models

1. Data Generation

a. Data Acquisition

b. Feature and Target Generation

2. Model Training

3. Offline Scoring and Evaluation

4. Batch scoring and upload to online serving

5. Dashboard to monitor the online performance

Model FilesUser event History files

Scores and Reports

Feature and Target Set

Manager

Online Serving Systems

Data Generation Modeling Engine Scoring and Evaluation

Filtering

Aggregations

Scoring

Score & graph based eval

Merging Projection

Transformations

Filtering

Model Training

Hadoop

Overview of User Modeling Pipeline

Models and

Scores

1a. Data Acquisition

› Multiple user event feeds (browsing activities, search etc.) per time period

User Time Event Source

U1 T0 visited autos.yahoo.com Web server logs

U1 T1 searched for “car insurance” Search logs

U1 T2 browsed stock quotes Web server logs

U1 T3 saw an ad for “discount brokerage”, but did not click

Ad logs

U1 T4 checked Yahoo Mail Web server logs

U1 T5 clicked on an ad for “auto insurance”

Ad logs, click server logs

Event Feeds

event Normalized

Events (NE)

Project relevant

event attributes

Filter irrelevant

events

Tag and Transform

• Categorization

• Topic

• ….

HDFSUser

Map Operations

Output:

› Single normalized feed containing all events for all users per time period

User Time Event Tag

U1 T0 Content browsing Autos, Mercedes Benz

U2 T2 Search query Category: Auto Insurance

… … ……. ………

... … ……. ………

UU2323 TT2323 Mail usageMail usage Drop eventDrop event

U36 T36 Ad click Category: Auto Insurance

1b. Feature and Target Generation

Features:› Summaries of user activities over a time window

› Aggregates, Moving averages, Rates etc. over moving time windows

› Support online updates to existing features

Targets:› Constructed in the offline model training phase

› Typically user actions in the future time period indicating interest

• Clicks/Click-through rates on ads and content

• Site and page visits

• Conversion events

– Purchases, Quote requests etc.

– Sign-ups to newsletters, Registrations etc.

1b. Feature and Target Windows

Query Visit Y! finance

Feature Window Target Window

Interest event

Moving Window

1b. Feature Generation

NE 1Feature

SetHDFSNE 4

NE 5 NE 6

NE 7 NE 8 NE 9

Aggregate

Normalized

events

U1, Event 1

U1, Event 2

Reduce 1 Reduce 2

All events for U1

U2, Event 2

U2, Event 3

U2, Event 1

All events for U2

U1 T0 Content browsing Autos, Mercedes Benz

U1 T2 Search query Category: Auto Insurance

U1 T3 Click on search result Category: Insurance premiums

U1 T4 Ad click Category: Auto Insurance

Summaries over

user event history

Aggregates within window

Time and event weighted averages

Event rates

……..

1b. Joining Features and Targets

Low target rates

› Typical response rates are in the range of 0.01% ~ 1%

Many users have no interest activities in the target window

First construct the targets

Compute the feature vector only for users with targets

› Reduces the need for computing features for users without target actions

Allows stratified sampling of users with different target and feature attributes

2. Model Training

Supervised models trained using a variety of techniques Regressions

› Different flavors: Linear, Logistic, Poisson etc.

› Constraints on weights

› Different regularizations: L1 and L2

Decision trees› Used for both regression and ranking problems

› Boosted trees

Naïve Bayes Support vector machines

› Commonly used in text classification, query categorization etc.

Online learning algorithms

2. Model Training

Maximum Entropy modeling

› Log-linear link function.

› Classification problems in large dimensional, sparse features

Constrained Random Fields

› Sequence labeling and named-entity recognition problems

Some of these algorithms are implemented in Mahout

Not all algorithms are easy to implement in MR framework

Train one model per node.

› Each node can train model for one target response

3. Offline Scoring and Evaluation

Apply weights from model training phase to features from Feature generation component

Mapper operations only

Janino* equation editor

› Embedded compiler can compile arbitrary scoring equations.

› Can also embed any class invoked during scoring

› Can modify features on the fly before scoring

Evaluation metrics

› Sort by scores and compute metrics in reducer

› Precision vs. Recall curve

› Lift charts

* http://docs.codehaus.org/display/JANINO/Home

Modeling Workflow

Target generation

Feature generation

Data Acquisition

history

Targets

Features

Model Training

Weights

Training

Target generation

Feature generation

Data Acquisition

history

Targets

Features

Evaluation

Model Scoring

Evaluation

Scores

4. Batch Scoring

Data Acquisition

history

Feature generation

Features

Online Serving

Systems

Model Scoring

Scores

Weights

User modeling pipeline system

Component Data Processed Time

Data Acquisition ~ 1 Tb per time period

2 – 3 hours

Feature and Target Generation

~ 1 Tb * Size of feature window

4 - 6 hours

Model Training ~ 50 - 100 Gb 1 – 2 hours for 100’s of models

Scoring ~ 500 Gb 1 hour

Challenges and Practical Considerations

Current challenges

Limited size of name-node

› File and block meta-data in HDFS is in RAM on name-node

› On name-node with 64Gb RAM

• ~ 100 million file blocks and 60 million files

› Upper limit of 4000 node limit cluster

› Adding more reducers leads to a large number of small files

Copying data in/out of HDFS

› Limited by read/write rates of external file systems

High latency for small jobs

› Overhead to set up may be large for small jobs

Practical considerations

Reduce amount of data transfer from mapper to reducer

› There is still disk write/read in going from mapper to reducer

• Mapper output = Reducer input files can become large

• Can run out of disk space for intermediate storage

› Project a subset of relevant attributes in mapper to send to reducer

› Use combiners

› Compress intermediate data

Distribution of keys

› Reducer can become a bottleneck for common keys

› Use Partitioner to control distribution of map records to reducers

• E.g. distribute mapper records with common keys across multiple reducers in a round robin manner

Practical considerations

Judicious partitioning of data

› Multiple files helps parallelism, but hit name-node limits

› Smaller number of files keeps name-node happy but at the expense of parallelism

Less ideal for distributed computing algorithms requiring communications (e.g. distributed decision trees)

› MPI on top of the cluster for communication

Acknowledgment

Numerous wonderful colleagues!

Questions?

Appendix:

More Applications

Application: Content Optimization

Optimizing content across the Yahoo portal pages

› Rank articles from an editorial pool of articles based on interest

• Yahoo Front Page,

• Yahoo News etc.

› Customizing feeds in My Yahoo portal page

› Top buzzing queries

› Content recommendations (RSS feeds)

Use Hadoop for feature aggregates and model weight updates

• near real-time and uploaded to online serving

Ads Optimization

Search Index

Machine Learned

Spam filters

RSS Feed Recos.

Content Optimization

Yahoo Front Page – Case Study

Application: Search Logs Analysis

Analyze search result view and click logs

› Reporting and measurement of user click response

› User session analysis

• Enrich, expand and re-write queries

• Spelling corrections

• Suggesting related queries

Traffic quality and protection

› Detect and filter out fraudulent traffic and clicks

Mail Spam Filtering: Connected ComponentsY1 = Yahoo user 1, Y2 = Yahoo user 2

IP1 = IP address of the host Y1 “voted” not-spam from

weight = 2SQUARING

Mail Spam Filtering: Connected Components Voting

y1 IP3

Set of “voted from” IPs

Set of “voted on” IPsSet of Yahoo IDs

voting notspam

Set of IPs/YIDs used exclusively for voting notspam

Set of (likely new) spamming IPs which are “worth” voting for

1 Large Scale Applications on Hadoop in Yahoo Vijay K Narayanan, Yahoo! Labs 04.26.2010 Massive Data...

Documents

Transcript of 1 Large Scale Applications on Hadoop in Yahoo Vijay K Narayanan, Yahoo! Labs 04.26.2010 Massive Data...

Chapter 22 ADS and MDAC, OLE DB, ADO, and Visual Basic

Altman, Haldeman & Narayanan - Zetatm Analysis

ASSIST by Mr. Sreeni Narayanan

Microprocessors - Meppayil Narayanan

Narayanan, 2005

Infographics Jayan Narayanan

Technical Support for Yahoo Mail, Yahoo Mail Support - PCCare247 Yahoo Support

Narayanan ramakrishnan assignment3_critique

Suresh Narayanan

eeic7.sogang.ac.kreeic7.sogang.ac.kr/paper file/domestic journal/[42... · 2008-04-02 · 80 -gee 130MS/s 108mW 1.8mm2 0.18um CMOS A/D 12Blg ADC'--I [181 01 MDAC ADC* c}. MDAC qe

Directory of Nurseries and Nursery Dealers - MDAC

January 2020 - MDAC

PowerShell Conference Gayathri Narayanan

Moving Right Along, Sirini Narayanan

MDAC - NASA · MDAC I SPACE STATION DATA SYSTEM ANALYSIS/ARCHITECTURE STUDY Task 5 - Program Plan ... Controlled System Test Library Computer Techno 1 og y As s oc 1 at e s Coefficient

Audiolab Mdac Hfc

1 Copyright 2010. All Rights Reserved. 10/13/2010 · Nov‐09 Dec‐09 Mar‐10 Jun‐10 1311.3.1 1321.3.2 1411.4.1 1441.4.4 MSIE - MDAC MSIE - MDAC MDAC MDAC MSIE - MS009-02 MSIE

Madhuram Narayanan Centre, Chennai

Narayanan 1997

Nived Narayanan CV Latest