Machine Learning for Web Data

Post on 27-Jan-2015

Presentation at Web Directions 2010, Atlanta, GA.

Hilary Mason

Web Directions USA 2010

= new capacities

(superpowers)

Machine learning is a way of thinking about data.

http://www.meetup.com/NYC-Tech-Talks/calendar/12939544/?from=list&offset=0

http://bit.ly/9N7VB1

wicked hard problem

10s of millions of URLs /day

100s of millions of events / day

1000s of millions of data points

?

@hmason

[archive photo]

ELIZA

ML Today

Algorithms +

On-demand computing +

Ubiquitous data

Algorithms

New frames for modeling the world with data.

[moar data and new kinds of data]

Examples

[spam filters]

[netflix movie recommendations]

Language Identification

Face Identification

Machine Learning

Supervised Learning

Vs

Unsupervised Learning

Clustering

immunity

ultrasound

medical imaging

medical devices

thermoelectric devices

fault-tolerant circuits

low power devices

Entity disambiguation

This is important.

ME

UGLY HAG

Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
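One very simple way to approach the company question is to normalize each name to a canonical key and compare keys. The sketch below is only an illustration under stated assumptions: the `ALIASES` table, the `SUFFIXES` list, and the `canonical_company` helper are all hypothetical, and real disambiguation usually needs context beyond string cleanup.

```python
import re

# Hypothetical alias table: maps known short forms to a canonical key.
ALIASES = {"ms": "microsoft"}

# Illustrative, incomplete list of legal suffixes to strip before comparing.
SUFFIXES = {"corporation", "corp", "inc", "incorporated", "ltd", "llc", "co"}

def canonical_company(name):
    """Reduce a company name to a rough canonical key."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    # Drop trailing legal suffixes such as "Corporation" or "Inc".
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    key = " ".join(tokens)
    # Fall back to the cleaned key when no alias is known.
    return ALIASES.get(key, key)

names = ["Microsoft", "Microsoft Corporation", "MS"]
keys = {canonical_company(n) for n in names}
```

If all three names collapse to a single key, they are treated as the same company. An alias table like this would also wrongly merge any other entity abbreviated "MS", which is why the problem is hard.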

Classification

Text Feature Extractor

Trained Classifier

Cats

Dogs

Fire

Training Data

Feature Extractor
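The pipeline on these slides (training data, a feature extractor, a trained classifier, and labels such as Cats, Dogs, and Fire) can be sketched with a small naive Bayes text classifier. This is a minimal illustration, not the talk's implementation: the bag-of-words extractor, the add-one smoothing, and the training sentences are all assumptions made up for the example.

```python
import math
import re
from collections import Counter, defaultdict

def extract_features(text):
    """Text feature extractor: lowercase word tokens (bag of words)."""
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """Count words per label from (text, label) training pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(extract_features(text))
    return word_counts, label_counts

def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in extract_features(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data using the labels from the slide.
training_data = [
    ("my cat purrs and meows on the sofa", "Cats"),
    ("the kitten chased a mouse", "Cats"),
    ("my dog barks and fetches the ball", "Dogs"),
    ("the puppy wagged its tail", "Dogs"),
    ("smoke and flames spread through the building", "Fire"),
    ("firefighters put out the blaze", "Fire"),
]
model = train(training_data)
```

Calling `classify("the kitten purrs", model)` runs new text through the same feature extractor that was used for training, which is the key design point of the diagram.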

<math>

Probability

P(A) is the probability that A is true.

Axioms of Probability

0 ≤ P(A) ≤ 1

P(True) = 1

P(False) = 0

P(A or B) = P(A) + P(B) – P(A and B)

P(A)

P(B)

P(A and B)
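The addition rule can be checked by counting outcomes directly. The sketch below uses two dice as an illustrative sample space (an assumption for the example, not something from the talk) and exact fractions so the identity holds without rounding.

```python
from fractions import Fraction

# Sample space: the 36 equally likely outcomes of rolling two dice.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

def prob(event):
    """P(event) as the fraction of outcomes where the predicate holds."""
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))

def A(o):
    return o[0] == 6            # first die shows a six

def B(o):
    return o[0] + o[1] >= 10    # total is at least ten

p_a = prob(A)
p_b = prob(B)
p_a_and_b = prob(lambda o: A(o) and B(o))
p_a_or_b = prob(lambda o: A(o) or B(o))
```

Subtracting `p_a_and_b` corrects for the outcomes that sit in both events and would otherwise be counted twice.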

Bayes' Law

Example

There are 10,000 people.

1% have a rare disease.

Example

• Population of 10,000
• 1% have rare disease
• There’s a test that is 99% effective.
  – 99% of sick patients test positive
  – 99% of healthy patients test negative

Given a positive test result, what is the probability that the patient is sick?

Disease Diagnosis

99 of the 100 sick patients test positive, and 99 of the 9,900 healthy patients also test positive

Given a positive test, there is a 50% probability that the patient is sick.

Bayesian Disease

We know the probability of testing positive given healthy, and of testing negative given sick

Use Bayes' theorem to invert the probabilities
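The disease example can be worked through numerically. This sketch plugs the numbers from the slides into Bayes' theorem; the variable names are illustrative, and exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

population = 10_000
p_sick = Fraction(1, 100)                 # 1% have the disease
p_pos_given_sick = Fraction(99, 100)      # 99% of sick patients test positive
p_neg_given_healthy = Fraction(99, 100)   # 99% of healthy patients test negative

p_healthy = 1 - p_sick
p_pos_given_healthy = 1 - p_neg_given_healthy

# Total probability of a positive test, over sick and healthy patients.
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * p_healthy

# Bayes' theorem: P(sick | positive) = P(positive | sick) * P(sick) / P(positive)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos

# The same result as head counts in the population of 10,000.
sick_positives = p_pos_given_sick * p_sick * population
healthy_positives = p_pos_given_healthy * p_healthy * population
```

The posterior is only one half because healthy people outnumber sick people 99 to 1, so a 1% false-positive rate produces as many positives as the disease itself.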

</math>

Obtain

Scrub

Explore

Model

iNterpret

1. Obtain Data

“pointing and clicking does not scale!”

http://www.delicious.com/pskomoroch/dataset

lynx -dump http://www.nytimes.com

Lynx: http://bit.ly/a6Pumm
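The same idea, turning a page into plain text without pointing and clicking, can be sketched in Python with only the standard library. This is a rough stand-in for what `lynx -dump` produces, not a replacement for it: the `TextDump` class, the `dump_text` helper, and the sample HTML are all made up for the example.

```python
from html.parser import HTMLParser

class TextDump(HTMLParser):
    """Collect visible text, skipping script and style contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)

def dump_text(html):
    """Return the text chunks of an HTML string, one per line."""
    parser = TextDump()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Made-up page; a real one could be fetched with urllib.request.urlopen.
sample = (
    "<html><head><title>News</title><style>p{color:red}</style></head>"
    "<body><h1>Headline</h1><p>First <b>story</b>.</p></body></html>"
)
text = dump_text(sample)
```

Because the extraction is a function, it can be run over thousands of pages in a loop, which is the point of the "does not scale" remark.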

2. Scrub

3. Explore

http://vis.stanford.edu/protovis/

4. Model

Google Prediction API

http://code.google.com/apis/predict/

4. Model

Python

• NLTK - http://www.nltk.org/
• Scikits Learn - http://scikit-learn.sourceforge.net/

4. Model

http://www.alchemyapi.com/

5. Interpret

Andrew Vande Moere – Visual Poetry 06

http://www.dataists.com

One Final Example

Twitter is full of noise.

Sports – down

Math – UP!

Narcissism – down

Code!

Filtering & Relevance Ordering

http://github.com/hmason/tc
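Filtering and relevance ordering can be sketched as scoring each tweet and sorting by score. The sketch below is not the code in the repository above. It is a toy under stated assumptions: the topic weights mirror the "Math up, Sports and Narcissism down" slide, and the hypothetical keyword sets stand in for a trained classifier.

```python
# Hypothetical per-topic weights: math is boosted, the others are penalized.
TOPIC_WEIGHTS = {"math": 2.0, "sports": -1.0, "narcissism": -1.0}

# Hypothetical keyword sets standing in for a trained classifier.
TOPIC_KEYWORDS = {
    "math": {"theorem", "proof", "probability", "bayes"},
    "sports": {"game", "score", "team", "playoffs"},
    "narcissism": {"i", "me", "my", "selfie"},
}

def relevance(tweet):
    """Sum the topic weights over the distinct keywords a tweet contains."""
    words = set(tweet.lower().split())
    return sum(
        weight * len(words & TOPIC_KEYWORDS[topic])
        for topic, weight in TOPIC_WEIGHTS.items()
    )

def filter_and_rank(tweets, threshold=0.0):
    """Drop tweets at or below the threshold, then sort best first."""
    scored = [(relevance(t), t) for t in tweets]
    kept = [(s, t) for s, t in scored if s > threshold]
    return [t for s, t in sorted(kept, key=lambda pair: pair[0], reverse=True)]

tweets = [
    "great game by my team tonight",
    "a neat proof of bayes theorem",
    "look at me and my new haircut",
    "probability puzzle of the day",
]
ranked = filter_and_rank(tweets)
```

Filtering removes the noise and ordering puts the most relevant items first; in a real system the keyword lookup would be replaced by per-topic classifier probabilities.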

What’s next?

Soon:

Natural Language Generation

Rich media classification

Contextual everything

Algorithms-As-A-Service

infer links in data

Filtering

Relevance

h@bit.ly

@hmason

Thank you!