Post on 27-Jan-2015
Machine Learning for Web Data
Hilary Mason
Web Directions USA 2010
= new capacities
(superpowers)
Machine learning is a way of thinking about data.
http://www.meetup.com/NYC-Tech-Talks/calendar/12939544/?from=list&offset=0
http://bit.ly/9N7VB1
wicked hard problem
10s of millions of URLs /day
100s of millions of events / day
1000s of millions of data points
?
@hmason
[archive photo]
ELIZA
ML Today
Algorithms +
On-demand computing +
Ubiquitous data
Algorithms
New frames for modeling the world with data.
[moar data and new kinds of data]
Examples
[spam filters]
[netflix movie recommendations]
Language Identification
Face Identification
Machine Learning
Supervised Learning
Vs
Unsupervised Learning
Clustering
immunity
ultrasound
medical imaging
medical devices
thermoelectric devices
fault-tolerant circuits
low power devices
Entity disambiguation
This is important.
ME
UGLY HAG
Entity disambiguation
This is important.
Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
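A minimal sketch of one naive approach to the company example, assuming plain string normalization plus a hand-written alias table. The names LEGAL_SUFFIXES, ALIASES, and canonical_company are hypothetical and do not come from the talk.

```python
import re

# Hypothetical suffix list and alias table, for illustration only.
LEGAL_SUFFIXES = {"corporation", "corp", "inc", "incorporated", "ltd", "llc", "co"}
ALIASES = {"ms": "microsoft", "msft": "microsoft"}


def canonical_company(name):
    """Map a raw company mention to a normalized key."""
    # Lowercase and replace anything that is not a letter, digit, or space.
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    # Drop legal suffixes such as "Corporation" or "Inc".
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    key = " ".join(tokens)
    # Fall back to the alias table for known abbreviations.
    return ALIASES.get(key, key)


mentions = ["Microsoft", "Microsoft Corporation", "MS"]
keys = {m: canonical_company(m) for m in mentions}
print(keys)
```

All three mentions map to the same key here, but only because "MS" was hard-coded as an alias. In other text "MS" could mean something else entirely, which is why string rules alone are rarely enough and the surrounding context usually has to be modeled.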
Classification
classification
Text Feature Extractor
TrainedClassifier
Cats
Dogs
Fire
Training Data
Feature Extractor
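The pipeline in the diagram (training data, feature extractor, trained classifier) can be sketched in a few lines. This is an assumption-laden illustration: the talk does not say which classifier it uses, so naive Bayes with add-one smoothing is swapped in here, and the training sentences and the function names extract_features, train, and classify are made up.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor: lowercase word tokens (a bag of words)."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_texts):
    """Count words per label from (text, label) training pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(extract_features(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior plus log likelihood with add-one smoothing.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in extract_features(text):
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data using the labels from the diagram.
training_data = [
    ("my cat purrs and chases mice", "Cats"),
    ("the kitten sleeps in the sun", "Cats"),
    ("my dog barks and fetches the ball", "Dogs"),
    ("the puppy wags its tail", "Dogs"),
    ("smoke and flames filled the building", "Fire"),
    ("firefighters put out the blaze", "Fire"),
]
model = train(training_data)
print(classify("the dog barks at the ball", model))
```

The same feature extractor is used for training and for classification, which is the point of the diagram: whatever turns text into features at training time must be applied unchanged to new text.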
<math>
Probability
P(A) is the probability that A is true.
Axioms of Probability
0 ≤ P(A) ≤ 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) – P(A and B)
P(A)
P(B)
P(A and B)
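A small check of the addition rule, assuming a fair six-sided die as the sample space. The events A and B are chosen for illustration and are not from the talk.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (an assumed example).
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # event A: the roll is even
B = {4, 5, 6}  # event B: the roll is greater than 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))


lhs = prob(A | B)                       # P(A or B)
rhs = prob(A) + prob(B) - prob(A & B)   # P(A) + P(B) - P(A and B)
print(lhs, rhs)
```

Subtracting P(A and B) removes the outcomes 4 and 6, which would otherwise be counted once in P(A) and again in P(B).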
Bayes' Law
Example
There are 10,000 people.
1% have a rare disease.
Example
• Population of 10,000
• 1% have rare disease
• There’s a test that is 99% effective.
  – 99% of sick patients test positive
  – 99% of healthy patients test negative
Given a positive test result, what is the probability that the patient is sick?
Disease Diagnosis
99 sick patients test positive, 99 healthy patients test positive
Given a positive test, there is a 50% probability that the patient is sick.
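The counts behind that 50% can be reproduced directly from the numbers on the slides. The variable names below are illustrative, not from the talk.

```python
population = 10_000
disease_rate = 0.01   # 1% have the disease
sensitivity = 0.99    # 99% of sick patients test positive
specificity = 0.99    # 99% of healthy patients test negative

sick = round(population * disease_rate)
healthy = population - sick

# Sick patients who correctly test positive.
true_positives = round(sick * sensitivity)
# Healthy patients who wrongly test positive.
false_positives = round(healthy * (1 - specificity))

p_sick_given_positive = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, p_sick_given_positive)
```

The healthy group is 99 times larger than the sick group, so a 1% false-positive rate produces as many positive tests as the 99% true-positive rate does among the sick.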
Bayesian Disease
We know the probability of testing positive given healthy, and of testing negative given sick
Use Bayes theorem to invert probabilities
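Written out for the disease example, the inversion looks like this. The symbols S (patient is sick) and + (test is positive) are notation introduced here, not taken from the slides.

```latex
\begin{align*}
P(S \mid +) &= \frac{P(+ \mid S)\,P(S)}{P(+ \mid S)\,P(S) + P(+ \mid \neg S)\,P(\neg S)} \\
            &= \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} \\
            &= 0.5
\end{align*}
```

The denominator is the total probability of a positive test, summed over sick and healthy patients.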
</math>
Obtain
Scrub
Explore
Model
iNterpret
1. Obtain Data
“pointing and clicking does not scale!”
http://www.delicious.com/pskomoroch/dataset
lynx -dump http://www.nytimes.com
Lynx: http://bit.ly/a6Pumm
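For cases where the page is already in hand as a string, a rough Python counterpart to the text-dumping step might look like the sketch below. It uses only the standard library; TextExtractor and html_to_text are hypothetical names, the sample HTML is made up, and the output is far cruder than what lynx produces (no link list, no layout).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample page, so the example runs without network access.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Headline</h1><p>Story text.</p></body></html>"
)
print(html_to_text(sample))
```

To run it on a live page, the HTML string could be fetched first, for example with urllib.request.urlopen from the standard library, and then passed to html_to_text.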
2. Scrub
4. Model
Google Prediction API
http://code.google.com/apis/predict/
4. Model
Python
• NLTK - http://www.nltk.org/
• Scikits Learn - http://scikit-learn.sourceforge.net/
5. Interpret
Andrew Vande Moere – Visual Poetry 06
http://www.dataists.com
One Final Example
Twitter is full of noise.
Sports – down
Math – UP!
Narcissism – down
Code!
What’s next?
Soon:
Natural Language Generation
Rich media classification
Contextual everything
Algorithms-As-A-Service
infer links in data
Filtering
Relevance
h@bit.ly
@hmason
Thank you!