devoxx
-
Upload
narasimha-swamy -
Category
Documents
-
view
7 -
download
0
description
Transcript of devoxx
![Page 1: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/1.jpg)
Apache MahoutMaking data analysis easy
![Page 2: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/2.jpg)
Isabel Drost
Nighttime:
Co-Founder, committer Apache Mahout. Organiser of Berlin Hadoop Get Together.
Daytime:Software developer.
Guest lecturer at TU Berlin.Co-Organiser Berlin Buzzwords 2010.
![Page 3: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/3.jpg)
● “Mastering Data-Intensive Collaboration and Decision Making”
● EU funded research project– Number of partners: 8– Coordinator: Research Academic Computer Technology
Institute (CTI), Greece
![Page 4: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/4.jpg)
Hello Devoxx!
![Page 5: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/5.jpg)
Hello Devoxx!
![Page 6: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/6.jpg)
Hello Devoxx!
![Page 7: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/7.jpg)
Hello Devoxx!
![Page 8: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/8.jpg)
Hello Devoxx!
![Page 9: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/9.jpg)
Hello Devoxx!
Machine learningbackground?
![Page 10: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/10.jpg)
Hello Devoxx!
![Page 11: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/11.jpg)
Agenda
● Data Mining/ Machine Learning?
● Why is scaling hard?
● Going beyond simple statistics.
![Page 12: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/12.jpg)
Data Mining Applications
● Marketing.● Surveillance.● Fraud Detection.● Scientific Discovery.● Discover items usually purchased together.
= Extracting patterns from data.
![Page 13: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/13.jpg)
Machine Learning Applications
● E-Mail spam classification.● News-topic discovery.● Building recommender systems.
= Extracting prediction models from data.
![Page 14: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/14.jpg)
Machine learning – what's that?
![Page 15: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/15.jpg)
Image by John Leech, from: The Comic History of Rome by Gilbert Abbott A Beckett.
Bradbury, Evans & Co, London, 1850sArchimedes taking a Warm Bath
![Page 16: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/16.jpg)
Archimedes model of nature
![Page 17: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/17.jpg)
June 25, 2008 by chase-mehttp://www.flickr.com/photos/sasy/2609508999
![Page 18: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/18.jpg)
An SVM's model of nature
![Page 19: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/19.jpg)
The challenge
![Page 20: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/20.jpg)
Mission
Provide scalable data mining algorithms.
![Page 21: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/21.jpg)
http://www.flickr.com/photos/honou/2936937247/
![Page 22: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/22.jpg)
HowTo: From data to information.
![Page 23: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/23.jpg)
January 3, 2006 by Matt Callowhttp://www.flickr.com/photos/blackcustard/81680010
![Page 24: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/24.jpg)
http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/
http://www.flickr.com/photos/redux/409356158/
![Page 25: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/25.jpg)
http://www.flickr.com/photos/disowned/1158260369/
The HDFS filesystem is not restricted to MapReduce jobs. It can be used for other applications, many of which are under way at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and matrix operations.
![Page 26: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/26.jpg)
http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/ http://www.flickr.com/photos/redux/409356158/in/photostream/
http://www.flickr.com/photos/noodlepie/2675987121/http://www.flickr.com/photos/topsy/204929063/
![Page 27: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/27.jpg)
http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/
http://www.flickr.com/photos/redux/409356158/
![Page 28: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/28.jpg)
From data to information.From data to information.
● Collect data and define your learning problem.
● Data preparation.
● Training a prediction model.
● Checking the performance of your model.
![Page 29: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/29.jpg)
![Page 30: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/30.jpg)
● Remove noise.
![Page 31: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/31.jpg)
● Remove noise.● Convert text to vectors.
![Page 32: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/32.jpg)
From texts to vectors
![Page 33: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/33.jpg)
Sunny weather
High performance computing
If we looked at two words only:
![Page 34: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/34.jpg)
Aaron
Zuse
![Page 35: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/35.jpg)
Binary bag of words
● Imagine a n-dimensional space.● Each dimension = one possible word in texts.● Entry in vector is one, if word occurs in text.
● Problem:● Number of word occurrences not accounted for.
bi , j={1∀ x i∈d j0else }
![Page 36: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/36.jpg)
Term Frequency
● Imagine a n-dimensional space.● Each dimension = one possible word in texts.● Entry in vector equal to the words frequency.
● Problem:● Common words dominate vectors.
bi , j=ni , j
![Page 37: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/37.jpg)
TF with stop wording
● Imagine a n-dimensional space.● Each dimension = one possible word in texts.● Filter stopwords.● Entry in vector equal to the words frequency.
● Problem:● Common and uncommon words with same weight.
bi , j=ni , j
![Page 38: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/38.jpg)
TF- IDF
● Imagine a n-dimensional space.● Each dimension = one possible word in texts.● Filter stopwords.● Entry in vector equal to the weighted frequency.
● Problem:● Long texts get larger values.
bi , j=ni , j×log ∣D∣
∣{d : ti∈d }∣
![Page 39: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/39.jpg)
Normalized TF- IDF
● Imagine a n-dimensional space.● Each dimension = one possible word in texts.● Filter stopwords.● Entry in vector equal to the weighted frequency.● Normalize vectors.
● Problem:● Additional domain knowledge ignored.
bi , j=ni , j
∑knk , j
× log ∣D∣
∣{d : ti∈d }∣
![Page 40: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/40.jpg)
Reality
● There are a few more words in news.● Use all relevant features/ signals available.
● Words.● Header fields.● Characteristics of publishing url.● …
● Usually pipeline of feature extractors.
![Page 41: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/41.jpg)
From data to information.
● Collect data and define your learning problem.
● Data preparation.
● Training a prediction model.
● Checking the performance of your model.
![Page 42: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/42.jpg)
Step 2: Similarity
![Page 43: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/43.jpg)
Euclidian
![Page 44: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/44.jpg)
Euclidian
![Page 45: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/45.jpg)
Euclidian
Cosine
![Page 46: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/46.jpg)
Step 3: Clustering
![Page 47: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/47.jpg)
![Page 48: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/48.jpg)
![Page 49: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/49.jpg)
![Page 50: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/50.jpg)
![Page 51: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/51.jpg)
Until stable.
![Page 52: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/52.jpg)
Reality
● Seed selection.
● Choice of initial k.
● Continuous updates.
● Regular addition of clusters.
![Page 53: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/53.jpg)
From data to information.
● Collect data and define your learning problem.
● Data preparation.
● Training a prediction model.
● Checking the performance of your model.
![Page 54: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/54.jpg)
Evaluation
● Compare against gold standard.
● Use quality measures.
● Manual inspection.
![Page 55: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/55.jpg)
From data to information.
● Collect data and define your learning problem.
● Data preparation.
● Training a prediction model.
● Checking the performance of your model.
![Page 56: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/56.jpg)
http://www.flickr.com/photos/generated/943078008/
![Page 57: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/57.jpg)
![Page 58: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/58.jpg)
What else does Mahout have to offer.
![Page 59: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/59.jpg)
Identify dominant topics
● Given a dataset of texts, identify main topics.
● Examples:● Dominant topics in set of mails.● Identify news message categories.
Algorithms: Parallel LDA
![Page 60: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/60.jpg)
Assign items to defined categories.
● Given pre-defined categories, assign items to it.
![Page 61: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/61.jpg)
By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/
![Page 62: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/62.jpg)
![Page 63: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/63.jpg)
![Page 64: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/64.jpg)
Recommendation mining.
● Collaborative filtering.
![Page 65: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/65.jpg)
Show most relevant ads
![Page 66: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/66.jpg)
Show most relevant ads
![Page 67: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/67.jpg)
http://www.flickr.com/photos/alainpicard/4175214747
http://www.flickr.com/photos/25831000@N08/4156701164
http://www.flickr.com/photos/jfclere/4061801735
http://www.flickr.com/photos/claudio_ar/2643165035/
http://www.flickr.com/photos/claudio_ar/2643180457
Thanks to Falko Menge for the pictures of Brussels.
http://www.flickr.com/photos/joachim_s_mueller/2417313476/
http://www.flickr.com/photos/sebastian_bergmann/1244514498
http://www.flickr.com/photos/philfotos/4510197138/
Recommending places
![Page 68: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/68.jpg)
Recommending people
![Page 69: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/69.jpg)
Recommendation mining.
● Online collaborative filtering on single machine.● Offline Map/Reduce based version.● Content similarity can be integrated.
● Based on former Taste project.
![Page 70: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/70.jpg)
Frequent pattern mining
● Given groups of items, find commonly co-occurring items.
● Examples:● In shopping carts find items bought together.● In query logs find queries issued in one session.
![Page 71: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/71.jpg)
By crypto, http://www.flickr.com/photos/crypto/3201254932/sizes/l/
By libraryman, http://www.flickr.com/photos/libraryman/78337046/sizes/l/
![Page 72: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/72.jpg)
By crypto, http://www.flickr.com/photos/crypto/3201254932/sizes/l/
By libraryman, http://www.flickr.com/photos/libraryman/78337046/sizes/l/
By quinnanya, http://www.flickr.com/photos/quinnanya/2806883231/
![Page 73: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/73.jpg)
March 14, 2009 by Artful Magpiehttp://www.flickr.com/photos/kmtucker/3355551036/
Requirements to get started
![Page 74: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/74.jpg)
![Page 75: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/75.jpg)
![Page 76: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/76.jpg)
![Page 77: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/77.jpg)
![Page 78: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/78.jpg)
Why go for Apache Mahout?
![Page 79: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/79.jpg)
Jumpstart your project with proven code.
January 8, 2008 by dreizehn28http://www.flickr.com/photos/1328/2176949559
![Page 80: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/80.jpg)
Discuss ideas and problems online.
November 16, 2005 [phil h]http://www.flickr.com/photos/hi-phi/64055296
![Page 81: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/81.jpg)
![Page 82: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/82.jpg)
Become a committer.
![Page 83: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/83.jpg)
Become a committer:Of Apache Mahout
Sebastian SchelterJake Mannix
Benson MarguliesRobin AnilDavid Hall
AbdelHakim DenecheKarl Wettin
Sean OwenGrant Ingersoll
Otis GospodneticDrew Farris
Jeff EastmanTed DunningIsabel Drost
Emeritus:
Niranjan BalasubramanianErik Hatcher
Ozgur YilmazelDawid Weiss
![Page 84: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/84.jpg)
Interest in solving hard problems.
Being part of lively community.
Engineering best practices.
Bug reports, patches, features.
Documentation, code, examples.Image by: Patrick McEvoy
![Page 85: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/85.jpg)
![Page 86: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/86.jpg)
Thanks to Tim Lossen et. al for taking amazing pictures of the conf.
![Page 87: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/87.jpg)
Berlin Buzzwords 2011
Search/ Store/ Scale
May/ June 2011
Thanks to Tim Lossen et. al for taking amazing pictures of the conf.
![Page 88: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/88.jpg)
Interest in solving hard problems.
Being part of lively community.
Engineering best practices.
Bug reports, patches, features.
Documentation, code, examples.Image by: Patrick McEvoy
![Page 89: devoxx](https://reader030.fdocuments.in/reader030/viewer/2022032515/563db9e5550346aa9aa0e3c2/html5/thumbnails/89.jpg)