Machine Learning for Q&A Sites: The Quora Example

Xavier Amatriain (@xamat)

04/11/2016

“To share and grow the world’s knowledge”

• Millions of questions & answers

• Millions of users

• Thousands of topics

• ...

Demand

Quality

Relevance

Data

Machine Learning Applications for Q&A Sites

Answer Ranking

Goal

• Given a question and n answers, come up with the ideal ranking of those n answers

What is a good Quora answer?

• truthful

• reusable

• provides explanation

• well formatted

• ...

How are those dimensions translated into features?

• Features that relate to the text quality itself

• Interaction features (upvotes/downvotes, clicks, comments…)

• User features (e.g. expertise in topic)
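The three feature groups above could feed a scoring function whose output orders the answers. The following is only a minimal sketch: the field names (`text_quality`, `upvotes`, `downvotes`, `author_expertise`) and the hand-picked weights are illustrative assumptions, not values from the talk, and in a real system the weights would be learned from data.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    answer_id: str
    text_quality: float      # hypothetical text-quality score in [0, 1]
    upvotes: int             # interaction feature
    downvotes: int           # interaction feature
    author_expertise: float  # hypothetical user feature in [0, 1]


def score(a: Answer) -> float:
    """Toy linear score; the weights are illustrative assumptions."""
    votes = a.upvotes + a.downvotes
    # Smoothed upvote ratio: answers with few votes are pulled toward 0.5.
    vote_ratio = (a.upvotes + 1) / (votes + 2)
    return 0.4 * a.text_quality + 0.4 * vote_ratio + 0.2 * a.author_expertise


def rank_answers(answers: list[Answer]) -> list[Answer]:
    """Return the answers ordered from highest to lowest score."""
    return sorted(answers, key=score, reverse=True)


answers = [
    Answer("a1", text_quality=0.9, upvotes=40, downvotes=2, author_expertise=0.8),
    Answer("a2", text_quality=0.3, upvotes=5, downvotes=9, author_expertise=0.1),
    Answer("a3", text_quality=0.7, upvotes=0, downvotes=0, author_expertise=0.5),
]
ranked_ids = [a.answer_id for a in rank_answers(answers)]
```

Here the well-written, well-received answer `a1` comes first, the unvoted `a3` second, and the downvoted `a2` last.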

Feed Ranking

• Goal: Present the most interesting stories for a user at a given time

• Interesting = topical relevance + social relevance + timeliness

• Stories = questions + answers

• ML: Personalized learning-to-rank approach

• Relevance-ordered (vs time-ordered) feed = big gains in engagement

• Challenges:

• potentially many candidate stories

• real-time ranking

• optimize for relevance

Feed dataset: impression logs

[Figure: logged feed impressions annotated with the user actions taken on each story: click, upvote, downvote, expand, share, answer, pass, follow]

What is relevance?

● Value of showing a story to a user, e.g. weighted sum of actions:

$v = \sum_a v_a \, \mathbb{1}\{y_a = 1\}$

● Goal: predict this value for new stories. 2 possible approaches:

○ predict value directly

$v_{\text{pred}} = f(x)$

■ pros: single regression model

■ cons: target can be ambiguous; action models and action values are coupled

○ predict probabilities for each action, then compute expected value:

$v_{\text{pred}} = E[V \mid x] = \sum_a v_a \, p(a \mid x)$

■ pros: better use of supervised signal, decouples action models from action values

■ cons: more costly, one classifier per action
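The two value formulas can be made concrete with a small sketch. The action values in `ACTION_VALUES` and the probabilities passed to `expected_value` are made-up numbers for illustration; the talk does not give the actual weights or say how the per-action classifiers are built.

```python
# Hypothetical action values v_a; the actual weights are not given in the talk.
ACTION_VALUES = {"click": 1.0, "upvote": 5.0, "share": 10.0, "downvote": -8.0}


def observed_value(actions_taken: set[str]) -> float:
    """v = sum_a v_a * 1{y_a = 1}: add the value of every action that occurred."""
    return sum(v for a, v in ACTION_VALUES.items() if a in actions_taken)


def expected_value(action_probs: dict[str, float]) -> float:
    """v_pred = sum_a v_a * p(a | x), with p(a | x) from one classifier per action."""
    return sum(v * action_probs.get(a, 0.0) for a, v in ACTION_VALUES.items())


# Logged impression: the user clicked and upvoted.
v_logged = observed_value({"click", "upvote"})

# New story: per-action probabilities as they might come from the classifiers.
v_new = expected_value(
    {"click": 0.30, "upvote": 0.05, "share": 0.01, "downvote": 0.02}
)
```

Because the action values live in a separate table, they can be retuned without retraining the per-action models, which is the decoupling the second approach refers to.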

Feature engineering

● Essential for getting good rankings

● Better if updated in real-time (more reactive)

● Main sets of features:

○ user (e.g. age, country, recent activity)

○ story (e.g. popularity, trendiness, quality)

○ interactions between the two (e.g. topic or author affinity)

Models

● Linear

○ simple, fast to train

○ manual, non-linear transforms for richer representation (buckets, ngrams)

● Decision trees

○ learn non-linear representations

● Tree ensembles

○ Random forests

○ Gradient boosted decision trees

● In-house C++ training code, third-party libraries for prototyping new models
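As an illustration of the manual transforms mentioned for linear models, the sketch below one-hot encodes a numeric feature into buckets and extracts word n-grams from text. The bucket boundaries and the example feature are assumptions chosen for the example, not details from the talk.

```python
from bisect import bisect_right


def bucketize(value: float, boundaries: list[float]) -> list[int]:
    """One-hot encode a numeric feature by the bucket it falls into.

    With k sorted boundaries there are k + 1 buckets.
    """
    one_hot = [0] * (len(boundaries) + 1)
    one_hot[bisect_right(boundaries, value)] = 1
    return one_hot


def word_ngrams(text: str, n: int) -> list[str]:
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical boundaries for an "answer upvotes" feature.
upvote_buckets = bucketize(37, boundaries=[1, 10, 100])
bigrams = word_ngrams("What is machine learning", n=2)
```

Each bucket gets its own weight in the linear model, so the model can respond non-linearly to the raw value even though it stays linear in the encoded features.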

Ask2Answer

● Given a question and a viewer, rank all other users based on how “well-suited” they are.

○ “Well-suited” = likelihood of viewer sending a request + likelihood of the candidate adding a good answer.

● A2A = extension of CTR-prediction

○ We care not only about the viewer’s probability of sending a request, but also about the recipient’s probability of writing a good answer

A2A

● Example labels:

○ Binary label: 0 if no request was sent or no answer was added, and 1 if a request was sent and yielded an answer with a goodness score above some threshold.

○ Continuous label:

$w_1 \cdot \text{had\_request} + w_2 \cdot \text{had\_answer} + w_3 \cdot \text{answer\_goodness} + \cdots$
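Both label definitions are easy to express in code. In this sketch the threshold and the weights `w1`, `w2`, `w3` are placeholder assumptions, only the three terms spelled out on the slide are included (the slide's trailing "⋯" is left open), and an answer whose goodness falls below the threshold is assumed to get label 0.

```python
def binary_label(had_request: bool, had_answer: bool,
                 answer_goodness: float, threshold: float = 0.5) -> int:
    """1 only if a request was sent and produced an answer above the threshold.

    The threshold value is an assumption; an answer below it is labeled 0 here.
    """
    return int(had_request and had_answer and answer_goodness > threshold)


def continuous_label(had_request: bool, had_answer: bool, answer_goodness: float,
                     w1: float = 0.2, w2: float = 0.3, w3: float = 0.5) -> float:
    """w1*had_request + w2*had_answer + w3*answer_goodness (weights are assumptions)."""
    return w1 * had_request + w2 * had_answer + w3 * answer_goodness


good_outcome = binary_label(True, True, answer_goodness=0.8)
no_answer = binary_label(True, False, answer_goodness=0.0)
graded = continuous_label(True, True, answer_goodness=0.8)
```

The continuous label gives partial credit for a request that was sent but not answered, which the binary label treats the same as no request at all.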

● Features

○ Based on what the viewer or candidate has done in the past.

○ Historical features that encapsulate the relationship of the viewer to the candidate.

○ In addition to historical features, other features can be devised (e.g. a binary feature saying whether the viewer follows the candidate)

● Many more features are possible. Feature engineering is a crucial component of any ML system.

Topics & Users Recommendations

Goal: Recommend new topics for the user to follow

● Based on:

○ Other topics followed

○ Users followed

○ User interactions

○ Topic-related features

○ ...
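One simple way to use the "other topics followed" signal is co-follow counting: topics followed by users with similar follow sets get recommended. This is a toy sketch of that single signal, not the method described in the talk; the function name, the scoring rule, and the sample data are all assumptions.

```python
from collections import Counter


def recommend_topics(user: str, follows: dict[str, set[str]], k: int = 3) -> list[str]:
    """Recommend topics the user does not follow yet, scored by co-follows.

    Each other user votes for the topics they follow that the target user
    does not, with a vote weight equal to the number of followed topics the
    two users share.
    """
    mine = follows[user]
    scores = Counter()
    for other, topics in follows.items():
        if other == user:
            continue
        overlap = len(mine & topics)
        if overlap == 0:
            continue
        for topic in topics - mine:
            scores[topic] += overlap
    # Sort by score (descending), then by name for a deterministic order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [topic for topic, _ in ranked][:k]


follows = {
    "ana": {"Machine Learning", "Python"},
    "bo": {"Machine Learning", "Python", "Statistics"},
    "cy": {"Python", "Cooking"},
    "di": {"Gardening"},
}
recommended = recommend_topics("ana", follows)
```

The other signals on the slide (users followed, interactions, topic-related features) would enter as additional features in a learned model rather than through a fixed counting rule like this one.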

Goal: Recommend new users to follow

● Based on:

○ Other users followed

○ Topics followed

○ User interactions

○ User-related features

○ ...

Related Questions

● Given interest in question A (source), what other questions will be interesting?

● Not only about similarity, but also “interestingness”

● Features such as:

○ Textual

○ Co-visit

○ Topics

○ …

● Important for logged-out use case

Duplicate Questions

● Important issue for Q&A Sites

○ Want to make sure we don’t disperse knowledge about the same question across duplicates

● Solution: binary classifier trained with labelled data

● Features

○ Textual vector space models

○ Usage-based features

○ ...
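A textual vector space feature for a pair of questions can be as simple as the cosine similarity of their bag-of-words vectors. The sketch below uses raw term counts and a naive tokenizer, both simplifying assumptions; the resulting similarity would be one input to the binary classifier, not a duplicate detector on its own.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


q1 = bag_of_words("How do I learn machine learning?")
q2 = bag_of_words("How can I learn machine learning?")
q3 = bag_of_words("What is the best pizza in Rome?")

sim_near_duplicate = cosine_similarity(q1, q2)
sim_unrelated = cosine_similarity(q1, q3)
```

The first pair shares five of six terms and scores high, while the unrelated pair shares none and scores zero.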

User Trust

Goal: Infer user’s trustworthiness in relation to a given topic

● We take into account:

○ Answers written on topic

○ Upvotes/downvotes received

○ Endorsements

○ ...

● Trust/expertise propagates through the network

● Must be taken into account by other algorithms
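The talk does not say how trust propagates through the network. As one possible illustration, the sketch below uses a PageRank-style iteration in which each user's trust is a blend of a base score from their own activity and the trust passed on by users who endorse them. The function name, the damping factor, and the sample data are all assumptions.

```python
def propagate_trust(base: dict[str, float],
                    endorsements: dict[str, list[str]],
                    damping: float = 0.5,
                    iterations: int = 20) -> dict[str, float]:
    """Blend each user's own topic score with trust flowing from endorsers.

    base: per-user score from their own activity on the topic.
    endorsements: endorser -> users they endorse (all must appear in base).
    """
    trust = dict(base)
    for _ in range(iterations):
        incoming = {user: 0.0 for user in base}
        for endorser, endorsed in endorsements.items():
            if not endorsed:
                continue
            # An endorser splits their current trust evenly among endorsees.
            share = trust[endorser] / len(endorsed)
            for user in endorsed:
                incoming[user] += share
        trust = {u: (1 - damping) * base[u] + damping * incoming[u] for u in base}
    return trust


base = {"expert": 1.0, "newcomer": 0.1, "lurker": 0.1}
endorsements = {"expert": ["newcomer"]}
trust = propagate_trust(base, endorsements)
```

In this toy data the newcomer and the lurker start with the same base score, but the expert's endorsement lifts the newcomer's trust above the lurker's.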

Trending Topics

Goal: Highlight current events that are interesting for the user

● We take into account:

○ Global “Trendiness”

○ Social “Trendiness”

○ User’s interest

○ ...

● Trending topics are a great discovery mechanism

Moderation

● Very important for Quora to keep quality of content

● Pure manual approaches do not scale

● Hard to get algorithms 100% right

● ML algorithms detect content/user issues

○ Output of the algorithms feeds manually curated moderation queues

Content Creation Prediction

● Quora’s algorithms optimize not only for probability of reading

● Important to predict probability of a user answering a question

● Parts of our system completely rely on that prediction

○ E.g. A2A (ask to answer) suggestions

Models

● Logistic Regression

● Elastic Nets

● Gradient Boosted Decision Trees

● Random Forests

● (Deep) Neural Networks

● LambdaMART

● Matrix Factorization

● LDA

● ...

Experimentation

⚫ Extensive A/B testing, data-driven decision-making

⚫ Separate, orthogonal “layers” for different parts of the system

⚫ Experiment framework showing comparisons for various metrics
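A common way to get orthogonal layers is to hash the user id together with a per-layer name, so that a user's bucket in one layer tells you nothing about their bucket in another. The sketch below shows that general technique; it is not a description of Quora's experiment framework, and the layer names and bucket count are assumptions.

```python
import hashlib


def assign_bucket(user_id: str, layer: str, num_buckets: int = 100) -> int:
    """Deterministically map a user to a bucket within one experiment layer.

    Hashing the layer name together with the user id gives each layer its own
    pseudo-random split, so assignments in different layers are approximately
    independent of each other.
    """
    digest = hashlib.sha256(f"{layer}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def in_treatment(user_id: str, layer: str, treatment_percent: int) -> bool:
    """Users in the first `treatment_percent` buckets of a layer get the treatment."""
    return assign_bucket(user_id, layer) < treatment_percent


# Hypothetical layer names for two independent parts of the system.
feed_bucket = assign_bucket("user_42", layer="feed_ranking")
a2a_bucket = assign_bucket("user_42", layer="a2a_suggestions")
```

Because the assignment is a pure function of the user id and layer name, a user sees a consistent experience across sessions without any stored state.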

Conclusions

• Q&A sites have not only Big, but also “rich” data

• Algorithms need to understand and optimize complex aspects such as quality, interestingness, or user expertise

• ML is one of the keys to success

• Many interesting problems, and many unsolved challenges

Questions?