Prof. Erik Demaine - courses.csail.mit.edu · 2011-02-03 · Python Implementations docdist1...
6.006 Introduction to Algorithms
Lecture 1: Document Distance
Prof. Erik Demaine
Your Professors
Prof. Erik Demaine, Prof. Piotr Indyk, Prof. Manolis Kellis
Your TAs
Kevin Kelley, Joseph Laurendi, Tianren Qi, Nicholas Zehender, David Wen
Your Textbook
Administrivia
• Handout: Course information
• Webpage: http://courses.csail.mit.edu/6.006/spring11/
• Sign up for recitation if you didn't fill out form already
• Sign up for problem set server: https://alg.csail.mit.edu/
• Sign up for Piazzza account to ask/answer questions: http://piazzza.com/
• Prereqs: 6.01 (Python), 6.042 (discrete math)
• Grades:
  – Problem sets (30%)
  – Quiz 1 (20%; [email protected]–9.30pm)
  – Quiz 2 (20%; [email protected]–9.30pm)
  – Final (30%)
• Lectures & Recitations; Homework labs; Quiz reviews
• Read collaboration policy!
Today
• Class overview
  – What's a (good) algorithm?
  – Topics
• Document Distance
  – Vector space model
  – Algorithms
  – Python profiling & gotchas
What's an Algorithm?
• Mathematical abstraction of computer program
• Well-specified method for solving a computational problem
  – Typically, a finite sequence of operations
• Description might be structured English, pseudocode, or real code
• Key: no ambiguity
http://en.wikipedia.org/wiki/File:Euclid_flowchart_1.png
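As one concrete instance of an unambiguous, finite sequence of operations, here is a minimal Python sketch of Euclid's greatest-common-divisor algorithm, the subject of the flowchart credited above. The function name `gcd` and the remainder-based formulation are illustrative choices made here; the flowchart on the slide may state the steps differently.

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor of two non-negative integers (Euclid's algorithm)."""
    while b != 0:
        # Replace (a, b) with (b, a mod b); b strictly decreases, so the loop ends.
        a, b = b, a % b
    return a


print(gcd(1071, 462))  # 21
```

Every step is fully determined by the current values of `a` and `b`, which is the "no ambiguity" property the slide asks for.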
al-Khwārizmī (c. 780–850)
• "al-kha-raz-mi"
http://en.wikipedia.org/wiki/File:Abu_Abdullah_Muhammad_bin_Musa_al-Khwarizmi_edit.png
http://en.wikipedia.org/wiki/Al-Khwarizmi
al-Khwārizmī (c. 780–850)
• "al-kha-raz-mi"
• Father of algebra
  – The Compendious Book on Calculation by Completion and Balancing (c. 830)
  – Linear & quadratic equations: some of the first algorithms
http://en.wikipedia.org/wiki/File:Image-Al-Kit%C4%81b_al-mu%E1%B8%ABta%E1%B9%A3ar_f%C4%AB_%E1%B8%A5is%C4%81b_al-%C4%9Fabr_wa-l-muq%C4%81bala.jpg
http://en.wikipedia.org/wiki/Al-Khwarizmi
Efficient Algorithms
• Want an algorithm that's
  – Correct
  – Fast
  – Small space
  – General
  – Simple
  – Clever
Efficient Algorithms
• Mainly interested in scalability as problem size grows
Why Efficient Algorithms?
• Save wait time, storage needs, energy consumption/cost, …
• Scalability = win
  – Solve bigger problems given fixed resources (CPU, memory, disk, etc.)
• Optimize travel time, schedule conflicts, …
How to Design an Efficient Algorithm?
1. Define computational problem
2. Abstract irrelevant detail
3. Reduce to a problem you learn here (or 6.046 or algorithmic literature)
4. Else design using "algorithmic toolbox"
5. Analyze algorithm's scalability
6. Implement & evaluate performance
7. Repeat (optimize, generalize)
Modules & Applications

| Module | Application |
|---|---|
| 1. Introduction | Document similarity |
| 2. Binary Search Trees | Scheduling |
| 3. Hashing | File synchronization |
| 4. Sorting | Spreadsheets |
| 5. Graph Search | Rubik's Cube |
| 6. Shortest Paths | Google Maps |
| 7. Dynamic Programming | Justifying text, packing, … |
| 8. Numbers Pictures (NP) | Computing π, collision detection, hard problem |
| 9. Beyond | Folding, streaming, bio |
Document Distance
• Given two documents, how similar are they?
• Applications:
  – Find similar documents
  – Detect plagiarism / duplicates
  – Web search (one "document" is query)
http://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks/
http://www.google.com/
Document Distance
• How to define "document"?
• Word = sequence of alphanumeric characters
• Document = sequence of words
  – Ignore punctuation & formatting
Document Distance
• How to define "distance"?
• Idea: focus on shared words
• Word frequencies:
  – $D(w)$ = # occurrences of word $w$ in document $D$
Vector Space Model [Salton, Wong, Yang 1975]
• Treat each document as a vector of its words
  – One coordinate for every possible word
• Example:
  – $D_1$ = "the cat"
  – $D_2$ = "the dog"
• Similarity between vectors?
  – Dot product: $D_1 \cdot D_2 = \sum_w D_1(w) \, D_2(w)$
http://portal.acm.org/citation.cfm?id=361220
[Figure: plot with axes 'the', 'cat', 'dog'; tick marks at 1]
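To make the example concrete, fix the coordinate order ('the', 'cat', 'dog') and write $D_1$ for "the cat" and $D_2$ for "the dog". The ordering and the symbols are notation assumed here for illustration. The two documents then become the vectors below, and the dot product picks up only the one shared word:

$$
D_1 = (1, 1, 0), \qquad D_2 = (1, 0, 1), \qquad
D_1 \cdot D_2 = 1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 = 1 .
$$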
Vector Space Model [Salton, Wong, Yang 1975]
• Problem: Dot product not scale invariant
• Example 1:
  – $D_1$ = "the cat"
  – $D_2$ = "the dog"
  – $D_1 \cdot D_2 = 1$
• Example 2:
  – $D_1$ = "the cat the cat"
  – $D_2$ = "the dog the dog"
  – $D_1 \cdot D_2 = 4$
[Figure: plot with axes 'the', 'cat', 'dog'; tick marks at 0, 1, and 2]
http://portal.acm.org/citation.cfm?id=361220
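Vector Space Model [Salton, Wong, Yang 1975]
• Idea: Normalize by # words, for documents $D_1$ and $D_2$: $\dfrac{D_1 \cdot D_2}{|D_1| \, |D_2|}$
• Geometric solution: angle between vectors: $\theta(D_1, D_2) = \arccos \dfrac{D_1 \cdot D_2}{|D_1| \, |D_2|}$, where $|D| = \sqrt{D \cdot D}$ is the length of the vector
  – 0° = "identical", 90° = orthogonal (no shared words)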
[Figure: plot with axes 'the', 'cat', 'dog'; tick marks at 1]
http://portal.acm.org/citation.cfm?id=361220
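The angle measure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's code: the function name `angle` is made up here, words are split on whitespace only, and the vector length is taken to be the Euclidean norm. Running it on the two slide examples suggests why the angle, unlike the raw dot product, is unaffected by repeating a document.

```python
import math
from collections import Counter


def angle(doc1: str, doc2: str) -> float:
    """Angle in radians between the word-frequency vectors of two non-empty strings."""
    v1, v2 = Counter(doc1.split()), Counter(doc2.split())
    # Counter returns 0 for a missing word, so unshared words contribute nothing.
    dot = sum(count * v2[word] for word, count in v1.items())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    # Clamp to [-1, 1] so floating-point rounding cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.acos(cosine)


# Both pairs give an angle of about 60 degrees.
print(math.degrees(angle("the cat", "the dog")))
print(math.degrees(angle("the cat the cat", "the dog the dog")))
```

An empty document has a zero-length vector, so this sketch would divide by zero on it; a fuller version would need to decide what distance means in that case.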
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product
Algorithm
1. Read documents
2. Split each document into words
   – re.findall(r'\w+', doc)
   – But how does this actually work?
3. Count word frequencies (document vectors)
4. Compute dot product
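The regular-expression approach to step 2 can be tried directly. In this minimal sketch the function name `split_words` and the lowercasing are assumptions added for illustration; only the `re.findall` call with the `\w+` pattern comes from the slide.

```python
import re


def split_words(doc: str) -> list[str]:
    """Return the maximal runs of word characters in doc, lowercased."""
    # \w matches letters, digits, and underscore, so this is slightly
    # broader than a strictly "alphanumeric" definition of a word.
    return re.findall(r"\w+", doc.lower())


print(split_words("The cat, the dog!"))  # ['the', 'cat', 'the', 'dog']
```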
Algorithm
1. Read documents
2. Split each document into words
   – For each line in document:
       For each character in line:
         If not alphanumeric:
           Add previous word (if any) to list
           Start new word
3. Count word frequencies (document vectors)
4. Compute dot product
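The character-by-character scan in step 2 might look like the following in Python. The names `words_in_line` and `split_document` are hypothetical. The slide does not spell out what happens to a word that runs to the end of a line, so the final flush is an assumption about the intended behaviour.

```python
def words_in_line(line: str) -> list[str]:
    """Split one line into words by scanning it character by character."""
    words = []
    current = []  # characters of the word being built
    for ch in line:
        if ch.isalnum():
            current.append(ch)
        elif current:
            # A non-alphanumeric character ends the previous word, if any.
            words.append("".join(current))
            current = []
    if current:
        # Assumption: a word that runs to the end of the line is also added.
        words.append("".join(current))
    return words


def split_document(lines) -> list[str]:
    """Apply words_in_line to every line and collect the words in order."""
    words = []
    for line in lines:
        words.extend(words_in_line(line))
    return words


print(split_document(["the cat,", "the dog!"]))  # ['the', 'cat', 'the', 'dog']
```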
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
   a. Sort the word list
   b. For each word in word list:
      – If same as last word: Increment counter
      – Else: Add last word and its counter to list; reset counter to 1 for the new word
4. Compute dot product
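Step 3 by sorting can be sketched as below. The function name `count_frequency_sorted` is made up for this example, and Python's built-in `sorted` stands in for whichever sorting routine is used. The counter starts at 1 for each new word so that its first occurrence is counted, and the last run is flushed after the loop; both details are assumptions about how the steps are meant to be completed.

```python
def count_frequency_sorted(word_list: list[str]) -> list[tuple[str, int]]:
    """Return (word, count) pairs in sorted word order."""
    pairs = []
    last_word, counter = None, 0
    for word in sorted(word_list):
        if word == last_word:
            counter += 1
        else:
            if last_word is not None:
                pairs.append((last_word, counter))
            last_word, counter = word, 1  # the new word has been seen once
    if last_word is not None:
        pairs.append((last_word, counter))  # flush the final run
    return pairs


print(count_frequency_sorted("the cat the dog the".split()))
# [('cat', 1), ('dog', 1), ('the', 3)]
```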
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
   For every possible word:
     Look up frequency in each document
     Multiply
     Add to total
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
   For every word in first document:
     If it appears in second document:
       Multiply word frequencies
       Add to total
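If the document vectors are stored as plain lists of (word, count) pairs, "if it appears in second document" becomes a linear scan, as in this sketch. The list-of-pairs representation and the name `inner_product_lists` are assumptions for illustration. The nested loop makes the cost proportional to the product of the two list lengths.

```python
def inner_product_lists(freq1, freq2) -> int:
    """Dot product of two documents given as lists of (word, count) pairs."""
    total = 0
    for word1, count1 in freq1:
        for word2, count2 in freq2:  # linear scan: does word1 appear?
            if word1 == word2:
                total += count1 * count2
                break  # each word occurs at most once per list
    return total


print(inner_product_lists([("cat", 1), ("the", 1)], [("dog", 1), ("the", 1)]))  # 1
```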
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
   a. Start at first word of each document (in sorted order)
   b. If words are equal:
        Multiply word frequencies
        Add to total
   c. In whichever document has lexically lesser word, advance to next word
   d. Repeat until either document out of words
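Steps a to d describe a merge-style walk over two sorted lists. A minimal sketch follows, with the hypothetical name `inner_product_sorted` and inputs assumed to be lists of (word, count) pairs already sorted by word. The steps do not say what to advance after a match, so moving both positions forward is an assumption.

```python
def inner_product_sorted(freq1, freq2) -> int:
    """Dot product of two (word, count) lists that are sorted by word."""
    i = j = total = 0
    while i < len(freq1) and j < len(freq2):
        word1, count1 = freq1[i]
        word2, count2 = freq2[j]
        if word1 == word2:
            total += count1 * count2
            i += 1  # assumption: advance both lists after a match
            j += 1
        elif word1 < word2:
            i += 1  # first list holds the lexically lesser word
        else:
            j += 1
    return total


print(inner_product_sorted([("cat", 2), ("the", 2)], [("dog", 2), ("the", 2)]))  # 4
```

Each iteration advances at least one position, so the walk takes time proportional to the sum of the two list lengths rather than their product.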
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
   a. Initialize a dictionary mapping words to counts
   b. For each word in word list:
      – If in dictionary: Increment counter
      – Else: Put 1 in dictionary
4. Compute dot product
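The dictionary version of step 3 needs no sorting. In this sketch the function name `count_frequency` is an illustrative choice, and a word seen for the first time is stored with a count of 1 so that the totals come out right.

```python
def count_frequency(word_list: list[str]) -> dict[str, int]:
    """Map each distinct word to its number of occurrences."""
    counts = {}
    for word in word_list:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1  # first occurrence of this word
    return counts


print(count_frequency("the cat the dog the".split()))
# {'the': 3, 'cat': 1, 'dog': 1}
```

The standard library's `collections.Counter` does the same job in one call; the explicit loop is kept here to mirror the steps on the slide.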
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
   For every word in first document:
     If it appears in second document:
       Multiply word frequencies
       Add to total
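With both documents stored as dictionaries, the membership test in step 4 becomes a single lookup. A minimal sketch, assuming word-to-count dictionaries as input and using the made-up name `inner_product`:

```python
def inner_product(counts1: dict[str, int], counts2: dict[str, int]) -> int:
    """Dot product of two word-to-count dictionaries."""
    total = 0
    for word, count in counts1.items():
        if word in counts2:  # dictionary lookup instead of a scan
            total += count * counts2[word]
    return total


print(inner_product({"the": 2, "cat": 2}, {"the": 2, "dog": 2}))  # 4
```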
Python Implementations
Python Profiling
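The profiler output from the lecture is not reproduced in this transcript. As a rough illustration of how a function can be profiled in current Python, the sketch below uses the standard `cProfile` and `pstats` modules. The helper `profile_call` and the toy workload `slow_concat` are hypothetical names invented here, and the course's own setup may well have differed.

```python
import cProfile
import io
import pstats


def slow_concat(lines):
    """Toy workload: build a word list with repeated list concatenation."""
    words = []
    for line in lines:
        words = words + line.split()
    return words


def profile_call(func, *args):
    """Run func(*args) under the profiler; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the top five entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


result, report = profile_call(slow_concat, ["the cat", "the dog"] * 1000)
print(report)
```

The report lists, per function, the number of calls and the time spent, which is the kind of evidence used to decide which step to optimize next.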
Culprit
Fix
Python Implementations

| Version | Change | Time |
|---|---|---|
| docdist1 | initial version | |
| docdist2 | add profiling | 192.5 sec |
| docdist3 | replace + with extend | 126.5 sec |
| docdist4 | count frequencies using dictionary | 73.4 sec |
| docdist5 | split words with string.translate | 18.1 sec |
| docdist6 | change insertion sort to merge sort | 11.5 sec |
| docdist7 | no sorting, dot product with dictionary | 1.8 sec |
| docdist8 | split words on whole document, not line by line | 0.2 sec |

Experiments on Intel Pentium 4, 2.8GHz, Python 2.6.2, Linux 2.6.18.
Document 1 (t2.bobsey.txt) has 268,778 lines, 49,785 words, 3,354 distincts.
Document 2 (t3.lewis.txt) has 1,031,470 lines, 182,355 words, 8,530 distincts.
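The docdist3 change, replacing `+` with `extend`, can be illustrated with two small functions. This is a sketch of the general idea rather than the course's code: the names `get_words_concat` and `get_words_extend` are hypothetical, and words are split on whitespace for brevity.

```python
def get_words_concat(lines):
    """Collect words using list concatenation."""
    word_list = []
    for line in lines:
        # `+` builds a brand-new list each time, copying everything so far.
        word_list = word_list + line.split()
    return word_list


def get_words_extend(lines):
    """Collect words using in-place extension."""
    word_list = []
    for line in lines:
        # extend appends in place; only the new words are copied.
        word_list.extend(line.split())
    return word_list


lines = ["the cat", "the dog"] * 3
print(get_words_concat(lines) == get_words_extend(lines))  # True
```

Both functions return the same list, but the concatenating version recopies the growing list on every line, so its total work grows roughly quadratically with the number of words, while the extending version stays roughly linear.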
Don't Forget!
• Webpage: http://courses.csail.mit.edu/6.006/spring11/
• Sign up for recitation if you didn't already receive a recitation assignment from us
• Sign up for problem set server: https://alg.csail.mit.edu/
• Sign up for Piazzza account to ask/answer questions: http://piazzza.com/