Machine Learning for Web Data

Machine Learning for Web Data Hilary Mason Web Directions USA 2010


Presentation at Web Directions 2010, Atlanta, GA.

Transcript of Machine Learning for Web Data

Page 1: Machine Learning for Web Data

Machine Learning for Web Data

Hilary Mason

Web Directions USA 2010

Page 2: Machine Learning for Web Data
Page 3: Machine Learning for Web Data

= new capacities

(superpowers)

Machine learning is a way of thinking about data.

Page 4: Machine Learning for Web Data
Page 5: Machine Learning for Web Data

http://www.meetup.com/NYC-Tech-Talks/calendar/12939544/?from=list&offset=0

http://bit.ly/9N7VB1

Page 6: Machine Learning for Web Data


Page 7: Machine Learning for Web Data

wicked hard problem

10s of millions of URLs / day

100s of millions of events / day

1000s of millions of data points

Page 8: Machine Learning for Web Data
Page 9: Machine Learning for Web Data

?

Page 10: Machine Learning for Web Data
Page 11: Machine Learning for Web Data

@hmason

Page 12: Machine Learning for Web Data

[archive photo]

Page 13: Machine Learning for Web Data

ELIZA

Page 14: Machine Learning for Web Data
Page 15: Machine Learning for Web Data
Page 16: Machine Learning for Web Data
Page 17: Machine Learning for Web Data

ML Today

Page 18: Machine Learning for Web Data

Algorithms +

On-demand computing +

Ubiquitous data

Page 19: Machine Learning for Web Data

Algorithms

New frames for modeling the world with data.

Page 20: Machine Learning for Web Data
Page 21: Machine Learning for Web Data

[moar data and new kinds of data]

Page 22: Machine Learning for Web Data

Examples

Page 23: Machine Learning for Web Data

[spam filters]

Page 24: Machine Learning for Web Data

[netflix movie recommendations]

Page 25: Machine Learning for Web Data

Language Identification

Page 26: Machine Learning for Web Data

Face Identification

Page 27: Machine Learning for Web Data

Machine Learning

Page 28: Machine Learning for Web Data

Supervised Learning

Vs

Unsupervised Learning

Page 29: Machine Learning for Web Data

Clustering

immunity

ultrasound

medical imaging

medical devices

thermoelectric devices

fault-tolerant circuits

low power devices
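A rough sketch of the idea behind clustering, using the phrases on this slide: group items that share something, without any labels. The word-overlap rule here is a deliberately crude stand-in chosen for illustration (the function name is hypothetical); real systems typically use vector representations and algorithms such as k-means or hierarchical clustering.

```python
def cluster_by_shared_words(phrases):
    """Group phrases that share at least one word (single-link grouping)."""
    clusters = []  # each cluster holds its word set and its member phrases
    for phrase in phrases:
        words = set(phrase.lower().split())
        merged = {"words": set(words), "phrases": [phrase]}
        # Merge every existing cluster that overlaps with this phrase.
        for cluster in [c for c in clusters if c["words"] & words]:
            merged["words"] |= cluster["words"]
            merged["phrases"] = cluster["phrases"] + merged["phrases"]
            clusters.remove(cluster)
        clusters.append(merged)
    return [c["phrases"] for c in clusters]


phrases = [
    "immunity",
    "ultrasound",
    "medical imaging",
    "medical devices",
    "thermoelectric devices",
    "fault-tolerant circuits",
    "low power devices",
]
groups = cluster_by_shared_words(phrases)
for group in groups:
    print(group)
```

No category names are supplied anywhere; the groups fall out of the data, which is what makes this unsupervised.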

Page 30: Machine Learning for Web Data

Entity disambiguation

This is important.

Page 31: Machine Learning for Web Data

ME

UGLY HAG

Page 32: Machine Learning for Web Data

Entity disambiguation

This is important.

Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
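A minimal sketch of the company-name case, assuming a hand-built alias table: strip legal suffixes and punctuation, then look the result up. `ALIASES`, `LEGAL_SUFFIXES`, and both function names are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical alias table: normalized variant -> canonical entity.
ALIASES = {
    "microsoft": "Microsoft",
    "ms": "Microsoft",
}

# Common legal suffixes that rarely help tell companies apart.
LEGAL_SUFFIXES = re.compile(
    r"\b(corporation|corp|incorporated|inc|ltd|llc)\b\.?", re.IGNORECASE
)


def normalize(name):
    """Lowercase, drop legal suffixes and punctuation, collapse whitespace."""
    name = LEGAL_SUFFIXES.sub("", name)
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(name.split())


def canonical_company(name):
    """Return the canonical entity, or None when the name is unknown."""
    return ALIASES.get(normalize(name))


for variant in ["Microsoft", "Microsoft Corporation", "MS", "Apple Inc."]:
    print(variant, "->", canonical_company(variant))
```

A lookup table like this is only a starting point: a short form such as "MS" can refer to other entities, so practical systems also weigh the surrounding context.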

Page 33: Machine Learning for Web Data

Classification

Page 34: Machine Learning for Web Data

classification

Text Feature Extractor

Trained Classifier

Cats

Dogs

Fire

Training Data

Feature Extractor
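The pipeline in this diagram (training data, feature extractor, trained classifier, labels) can be sketched in a few lines. This is an illustrative naive Bayes classifier with add-one smoothing, not necessarily the model behind the slide; the training sentences are invented, and the labels follow the slide's Cats / Dogs / Fire.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor: a lowercase bag of words."""
    return text.lower().split()


class NaiveBayesClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of documents
        self.vocabulary = set()

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.label_counts[label] += 1
            for word in extract_features(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.label_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.label_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in extract_features(text):
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data for illustration only.
training_data = [
    ("my cat purrs and naps on the sofa", "cats"),
    ("the kitten chased a toy mouse", "cats"),
    ("my dog barks at the mailman", "dogs"),
    ("the puppy fetched the ball", "dogs"),
    ("smoke and flames filled the building", "fire"),
    ("firefighters put out the blaze", "fire"),
]

classifier = NaiveBayesClassifier()
classifier.train(training_data)
print(classifier.classify("the puppy barks"))
```

The same `extract_features` function runs at training time and at classification time, which is why the feature extractor appears twice in the diagram.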

Page 35: Machine Learning for Web Data

<math>

Page 36: Machine Learning for Web Data

Probability

P(A) is the probability that A is true.

Page 37: Machine Learning for Web Data

Axioms of Probability

0 ≤ P(A) ≤ 1

P(True) = 1

P(False) = 0

P(A or B) = P(A) + P(B) – P(A and B)

Page 38: Machine Learning for Web Data

P(A or B) = P(A) + P(B) – P(A and B)

P(A)

P(B)

P(A and B)
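The addition rule can be checked by counting outcomes. The sketch below assumes a toy sample space, one roll of a fair die, which is not from the talk; exact fractions avoid rounding noise.

```python
from fractions import Fraction

# Assumed toy sample space: one roll of a fair six-sided die.
outcomes = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event (a set of outcomes) when all are equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))


A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

p_union = prob(A | B)
p_sum = prob(A) + prob(B) - prob(A & B)
print(p_union, p_sum)
```

Subtracting P(A and B) removes the overlap (here the outcomes 4 and 6) that would otherwise be counted twice.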

Page 39: Machine Learning for Web Data

Bayes' Law

Page 40: Machine Learning for Web Data

Example

There are 10,000 people.

1% have a rare disease.

Page 41: Machine Learning for Web Data

Example

• Population of 10,000
• 1% have rare disease
• There’s a test that is 99% effective.
  – 99% of sick patients test positive
  – 99% of healthy patients test negative

Page 42: Machine Learning for Web Data

Given a positive test result, what is the probability that the patient is sick?

Page 43: Machine Learning for Web Data
Page 44: Machine Learning for Web Data

Disease Diagnosis

99 sick patients test positive, 99 healthy patients test positive

Given a positive test, there is a 50% probability that the patient is sick.
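The 50% figure follows from counting heads in the population of 10,000. A short sketch of that arithmetic, with variable names chosen for this illustration:

```python
population = 10_000

sick = population * 1 // 100        # 1% have the disease
healthy = population - sick

true_positives = sick * 99 // 100   # 99% of sick patients test positive
false_positives = healthy * 1 // 100  # 1% of healthy patients test positive

p_sick_given_positive = true_positives / (true_positives + false_positives)
print(sick, healthy, true_positives, false_positives, p_sick_given_positive)
```

The healthy group is 99 times larger than the sick group, so its 1% error rate produces as many positive results as the sick group's 99% detection rate.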

Page 45: Machine Learning for Web Data

Bayesian Disease

Know the probability of testing positive given healthy, and of testing negative given sick

Use Bayes' theorem to invert probabilities
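Written out for the disease example, with S meaning "sick", H meaning "healthy", and + meaning "tests positive" (symbols chosen here for illustration), the inversion looks like this:

```latex
\begin{align*}
P(S \mid +)
  &= \frac{P(+ \mid S)\,P(S)}{P(+ \mid S)\,P(S) + P(+ \mid H)\,P(H)} \\
  &= \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} \\
  &= \frac{0.0099}{0.0198} = 0.5
\end{align*}
```

The test gives P(+ | S) directly; Bayes' theorem turns it into P(S | +), the quantity a patient actually cares about.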

Page 46: Machine Learning for Web Data

</math>

Page 47: Machine Learning for Web Data

Obtain

Scrub

Explore

Model

iNterpret

Page 48: Machine Learning for Web Data
Page 49: Machine Learning for Web Data

1. Obtain Data

“pointing and clicking does not scale!”

http://www.delicious.com/pskomoroch/dataset

Page 50: Machine Learning for Web Data

lynx -dump http://www.nytimes.com

Lynx: http://bit.ly/a6Pumm
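For a scripted pipeline, roughly the same text dump can be approximated with the Python standard library. This is a simplified sketch, not a replacement for Lynx: it ignores layout and does not list link references. The names `TextExtractor`, `html_to_text`, and `dump_url` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def dump_url(url):
    """Fetch a page and return its text (requires network access)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))


# Offline demonstration on an inline document.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Headline</h1><p>Story text.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(sample)
print(text)
```

Calling `dump_url("http://www.nytimes.com")` would fetch the live page; the inline sample keeps the example runnable without a network connection.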

2. Scrub

Page 51: Machine Learning for Web Data
Page 52: Machine Learning for Web Data

3. Explore

http://vis.stanford.edu/protovis/

Page 53: Machine Learning for Web Data

4. Model

Google Prediction API

http://code.google.com/apis/predict/

Page 54: Machine Learning for Web Data

4. Model

Python

• NLTK - http://www.nltk.org/
• Scikits Learn - http://scikit-learn.sourceforge.net/

Page 55: Machine Learning for Web Data

4. Model

http://www.alchemyapi.com/

Page 56: Machine Learning for Web Data

5. Interpret

Andrew Vande Moere – Visual Poetry 06

Page 57: Machine Learning for Web Data

http://www.dataists.com

Page 58: Machine Learning for Web Data

One Final Example

Twitter is full of noise.

Sports – down

Math – UP!

Narcissism – down
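One simple way to express "sports down, math up, narcissism down" is a weighted keyword score used for both filtering and ordering. The sketch below is an assumption-laden illustration, not the code from the talk: the keyword lists, the weights, and the function names are all invented, and a trained classifier would normally replace the hand-picked keywords.

```python
# Hypothetical topic keywords and weights, invented for this sketch.
TOPIC_KEYWORDS = {
    "sports": {"game", "score", "team", "playoffs"},
    "math": {"theorem", "proof", "probability", "statistics"},
    "narcissism": {"i", "me", "my", "myself"},
}
TOPIC_WEIGHTS = {"sports": -1.0, "math": 2.0, "narcissism": -0.5}


def relevance(tweet):
    """Sum the weights of every topic keyword that appears in the tweet."""
    words = [w.strip(".,!?:;\"'()").lower() for w in tweet.split()]
    score = 0.0
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits = sum(1 for w in words if w in keywords)
        score += TOPIC_WEIGHTS[topic] * hits
    return score


def rank(tweets, threshold=0.0):
    """Drop tweets scoring below the threshold, then sort the best first."""
    kept = [t for t in tweets if relevance(t) >= threshold]
    return sorted(kept, key=relevance, reverse=True)


tweets = [
    "What a game! My team made the playoffs",
    "A neat proof of Bayes' theorem using conditional probability",
    "Me and my coffee this morning",
    "New dataset released today",
]
ranked = rank(tweets)
for tweet in ranked:
    print(relevance(tweet), tweet)
```

Filtering and relevance ordering are two uses of the same score: the threshold removes the noise, and the sort decides what to read first.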

Page 59: Machine Learning for Web Data

Code!

Page 60: Machine Learning for Web Data

Filtering & Relevance Ordering

http://github.com/hmason/tc

Page 61: Machine Learning for Web Data

What’s next?

Page 62: Machine Learning for Web Data

Soon:

Natural Language Generation

Rich media classification

Contextual everything

Page 63: Machine Learning for Web Data

Algorithms-As-A-Service

Page 64: Machine Learning for Web Data

infer links in data

Page 65: Machine Learning for Web Data

Filtering

Page 66: Machine Learning for Web Data

Relevance

Page 67: Machine Learning for Web Data

@hmason

Thank you!