10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new...

64
10-405 ML from Large Datasets William Cohen 1

Transcript of 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new...

Page 1: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

10-405MLfromLargeDatasets

WilliamCohen

1

Page 2: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

ADMINISTRIVIA

Page 3: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Who/What/Where/When• Wiki:google://”cohen CMU”àteachingà

– http://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018

– thisshouldpointtoeverythingelse

• OfficeHours:– StillTBAforallofus,checkthewiki

• Courseassistant:DorothyHolland-Minkley(dfh@cs)

• Instructor:–WilliamW.Cohen

• TAs:– Introscomingup….

Page 4: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Who/What/Where/When• 10-405isabrandnewcourse!• Butit’sgoingtobeverycloseto10-605whichhasbeenhappeningsince2012

• It’saimedatundergrads–willcoveraboutthesametopics– notaprojectcourse– onechancetodoanopen-ended“extensionofanassignment”

• Anyonecanenrollbutnotallgradprogramswillgiveyoufullcreditfor4xxcoursessoreadthefineprint

Page 5: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Who/What/Where/When• Whoishere?

Page 6: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Who/What/Where/When• Mostdaystherewillbeanon-linequizyoushoulddoafterlecture

• ThequizzeswillusuallycloseFridaynoon• Wehaveonetoday– seethewiki!• Theydon’tcountforalotbuttherearenomakeups–Quizzesreinforcethelecture–Andmakesureyoukeepup

Page 7: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Who/What/Where/When• WilliamW.Cohen– 1990-mid1990’s:AT&TBellLabs(workingonILPandscalablepropositionalrule-learningalgorithms)

– Mid1990’s—2000:AT&TResearch(textclassification,dataintegration,knowledge-as-text,informationextraction)

– 2000-2002:Whizbang!Labs(infoextractionfromWeb)

– 2002-2008,2010-now:CMU(IE,biotext,socialnetworks,learningingraphs,infoextractionfromWeb,scalablefirst-orderlearning)• 2008-2009,Jan-Jul2017:VisitingScientistatGoogle

Page 8: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

TAs

Page 9: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

I am a Masters Student in Information Networking Institute going to join Microsoft AI & Research Teampost this semester. My research interests are in the domains of Scalable Machine Learning and Deep Learning.

I took the course last semester (Fall 2017) and learnt a lot.

Hope you all have a great semester. See you at the office hours!

Vidhan Agarwal (MSIN)

Page 10: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

SarthakGarg- MSinCS• IamafirstyearMastersstudentintheComputerScienceDepartment

• IaminterestedinDeepGenerativeModelsandDistributedSystemsforMachineLearning

• Itook10-605lastfallandfounditveryinteresting,hopeyouenjoythecourse!

Page 11: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Nitish Kulkarni MSinMCDS

11

I am a first year master's student in the Master of Computational Data Science program, LTI Department.

I’m interested in scalable machine learning algorithms and information retrieval.

I took 10-805 in fall ‘17. The course was quite fun and incredibly useful. I’m glad to be a TA for the course this semester, and hope you have a similar experience.

Page 12: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

VivekShankar– BSinSCS

• VivekShankar– BSinSCS

• IamafourthyearundergraduateintheSchoolofComputerScience.

• Thissummer,IinternedatGoogle,MontrealworkingonbuildingaURL-basedMachineLearningmodelfordetectingmaliciousChromeextensions.I’llbejoiningGooglefulltimeinPittsburgh.

• Iaminterestedindesigningparallelizablemachinelearningalgorithmsthatscalewelltolargedatasets– abigthemein10605.Itook10605inFall'17,anditwasoneofmyfavoriteclassesthusfaratCMU!

Page 13: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

What/HowIkindoflikelanguagetasks,especiallyforthistask:– Thedata(usually)makessense– Themodels(usually)makesense– Themodels(usually)arecomplex,so• Moredataactuallyhelps• Learningsimplemodelsvs complexonesissometimescomputationallydifferent

Page 14: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

What/How• ProgrammingLanguagesandSystems:– Python– HadoopandSpark

• Resources:– unix.andrew machines– Stoathadoop cluster:• 104workernodes,with8cores,16GBRAM,41TB.• 30workernodes,with8cores,16GBRAM, 250Gb+

– AmazonElasticCloud• AmazonEC2[http://aws.amazon.com/ec2/]• Allocation:$50worthoftimeperstudent

Page 15: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

What/How:601co-req• Youshouldhaveasaprereq orco-req oneoftheMLD’sintroMLcourses:10-401,10-601,10-701,10-715

• Lecturesaredesignedtocomplement thatmaterial– computationalaspectsvs informationalaspects

Page 16: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

What/How:cheatingvs workingtogether• Ihavealongandexplicitpolicy– stolenfromRoni Rosenfeld- readthewebpage– tl;dr:transmitinformationliketheydidinthestoneage,brain-to-brain,anddocumentit

– donotcopyanythingdigitally– exceptions(eg projects)willbeexplicitlystated

– everybodyinvolvedwillfail bydefault– every infractionalwaysgetsreporteduptotheOfficeofAcademicIntegrity,theheadofyourprogram,thedeanofyourschool,….

– asecondoffenseisverybad

Page 17: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

BIGDATAHISTORY:FROMTHEDAWNOFTIMETOTHEPRESENT

17

Page 18: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Big ML c. 1993 (Cohen, “Efficient…Rule Learning”, IJCAI 1993)

$ ripper ../tdata/talksFinal hypothesis is:talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).talk_announcement :- WORDS ~ '2d416' (26/3).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).talk_announcement :- WORDS ~ presentations (2/1).default non_talk_announcement (390/1).

18

Page 19: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

19

Page 20: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

MoreonthispaperAlgorithm

• Phase1:buildrules– Discretegreedysearch:– Startingwithemptyruleset,addconditionsgreedily

• Phase2:prunerules– startingwithphase1output,removeconditions

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma.talk_announcement :- WORDS ~ '2d416', WORDS ~ be.talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).default non_talk_announcement .

20

Page 21: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

MoreonthispaperAlgorithm

• Phase1:buildrules– Discretegreedysearch:– Startingwithemptyruleset,addconditionsgreedily

• Phase2:prunerules– startingwithphase1output,removeconditions,greedily

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma (54/0).talk_announcement :- WORDS ~ '2d416', WORDS ~ be (19/0).talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).default non_talk_announcement .

21

Page 22: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

MoreonthispaperAlgorithm

• Phase1:buildrules– Discretegreedysearch:– Startingwithemptyruleset,addconditionsgreedily

• Phase2:prunerules– startingwithphase1output,removeconditions,greedily

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).talk_announcement :- WORDS ~ '2d416' (26/3).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).talk_announcement :- WORDS ~ presentations (2/1).default non_talk_announcement (390/1).

22

Page 23: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

MoreonthispaperAlgorithm

• FitthePOS,NEGexample• WhilePOSisn’tempty:

– LetR be“ifTrueè pos”– WhileNEGisn’tempty:

• Pickthe“best”[i]conditionc oftheform“xi=True”or“xi=false”

• Addc totheLHSofR• Removeexamplesthatdon’tsatisfyc fromNEG

• AddRtotheruleset[ii]• RemoveexamplesthatsatisfyRfromPOS

• Prunetheruleset:– …

Analysis

[i] “Best” is wrt some statistics on c’s coverage of POS,NEG

[ii] R is now of the form “if xi1=_ and xi2=_ and … è pos”

• ThetotalnumberofiterationsofL1isthenumberofconditionsintheruleset– callitm

• Pickingthe“best”conditionrequireslookingatallexamples– saytherearen ofthese

• Timeisatleastm*n• Theproblem:

– Whentherearenoisypositiveexamplesthealgorithmbuildsrulesthatcoverjust1-2ofthem

– Sowithhugenoisydatasetsyoubuildhugerulesets

L1

quadratic

cubic!

23

Page 24: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Related paper from 1995…

24

Page 25: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Soinmid1990’s…..• Experimentaldatasetsweresmall• Manycommonlyusedalgorithmswereasymptotically “slow”

• Notmanypeoplereallycared

25

Page 26: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)

Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context

26

Page 27: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)

27

Page 28: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

WhyMoreDataHelps:ADemo• Data:–All5-gramsthatappear>=40timesinacorpusof1MEnglishbooks• approx 80Bwords• 5-grams:30Gbcompressed,250-300Gbuncompressed• Each5-gramcontainsfrequencydistributionoveryears

28

Page 29: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

29

Page 30: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

30

Page 31: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

http://xkcd.com/ngram-charts/

31

Page 32: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

WhyMoreDataHelps:ADemo• Data:– All5-gramsthatappear>=40timesinacorpusof1MEnglishbooks• approx 80Bwords• 5-grams:30Gbcompressed,250-300Gbuncompressed• Each5-gramcontainsfrequencydistributionoveryears

– Wrotecodetocompute• Pr(A,B,C,D,E|C=affectorC=effect)• Pr(anysubsetofA,…,E|any otherfixedvaluesofA,…,EwithC=affectVeffect)

– Demo:• /Users/wcohen/Documents/code/pyhack/bigml• eg:pythonngram-query.py data/aeffect-train.txt __Beffect__

32

Page 33: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)

Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context

33

Page 34: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

WhyMoreDataHelps• Data:

– All5-gramsthatappear>=40timesinacorpusof1MEnglishbooks• approx 80Bwords• 5-grams:30Gbcompressed,250-300Gbuncompressed• Each5-gramcontainsfrequencydistributionoveryears

– Wrotecodetocompute• Pr(A,B,C,D,E|C=affectorC=effect)• Pr(anysubsetofA,…,E|any otherfixedvaluesofA,…,EwithC=affectVeffect)

• Observations [fromplayingwithdata]:– Mostlyeffect notaffect– Mostcommonwordbeforeaffect isnot– After noteffectmostcommonwordisa– …

34

Page 35: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Soin2001…..• We’relearning:– “there’snodatalikemoredata”–Formanytasks,there’snorealsubstituteforusinglotsofdata

35

Page 36: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

…andin2009Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc2. Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages.

Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

Norvig, Pereira, Halevy, “The Unreasonable Effectiveness of Data”, 200936

Page 37: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

…andin2012

Dec 2011

37

Page 38: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

…andin2013

38

Page 39: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

…andin2014

39

Page 40: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Bengio,Foundations&Trends,2009

40

Page 41: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

naïve vsclever optimization

1Mvs10M

examples

2.5Mexamplesfor“pretraining”

41

Page 42: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Today….• Commonlyuseddeeplearningdatasets:– Images/videos:• ImageNet:20k+categories,14M+images• MSCOCO:91categories,2.5Mlabels,328kimages• YouTube-M:8Murls,4800classes,0.5Mhours

– Readingcomprehension:• Children’sbooktest:600k+context/querypairs• CNN/Dailymail:~300kdocs,1.2Mclozequestions

–Other:• Ubuntudialog:7M+utterances,1M+dialogs• …

42

Page 43: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

REVIEW:ASYMPTOTICCOMPLEXITY

43

Page 44: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Howdoweuseverylargeamountsofdata?• Workingwithbigdataisnot about– codeoptimization– learningdetailsoftodays hardware/software:• GraphLab,Hadoop,Spark,parallelhardware,….

• Workingwithbigdataisabout– Understandingthecostofwhatyouwanttodo– Understandingwhatthetoolsthatareavailableoffer– Understandinghowmuchcanbeaccomplishedwithlinearornearly-linearoperations(e.g.,sorting,…)

– Understandinghowtoorganizeyourcomputationssothattheyeffectivelyusewhatever’sfast

– Understandinghowtotest/debug/verifywithlargedata

*

* according to William44

Page 45: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

AsymptoticAnalysis:BasicPrinciples

)()(,:, iff ))(()( 00 ngkxfnnnkngOnf ×£>"$Î

)()(,:, iff ))(()( 00 ngkxfnnnkngnf ׳>"$WÎ

Usually we only care about positive f(n), g(n), n here…

45

Page 46: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

AsymptoticAnalysis:BasicPrinciples

)()(,:, iff ))(()( 00 ngkxfnnnkngOnf ×£>"$=

)()(,:, iff ))(()( 00 ngkxfnnnkngnf ׳>"$W=

Less pedantically:

Some useful rules:

)O( )( 434 nnnO =+

)O( )1273( 434 nnnO =+

)loglog4log 4 nO(n) O() nO( =×=

Only highest-order terms matter

Leading constants don’t matter

Degree of something in a log doesn’t matter

46

Page 47: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Backtorulepruning….

Algorithm• FitthePOS,NEGexampleWhile POSisn’tempty:

– LetR be“ifTrueè pos”– WhileNEGisn’tempty:

• Pickthe“best”[1]conditionc oftheform“xi=True”or“xi=false”

• Addc totheLHSofR• Removeexamplesthatdon’tsatisfycfromNEG

• AddRtotheruleset[2]• RemoveexamplesthatsatisfyRfromPOS

• Prunetheruleset:– Foreachconditioncintheruleset:

• Evaluatetheaccuracyoftheruleset w/oconheldout data

– Ifremovinganycimprovesaccuracy• Removecandrepeatthepruningstep

Analysis• Assumenexamples• Assume mconditionsinruleset• Growingrulestakestimeatleast

Ω(m*n)ifevaluatingcisΩ(n)• Whendataiscleanmissmall,

fittingtakeslineartime• Whenk%ofdataisnoisy,m is

Ω(n*0.01*k)sogrowingrulestakesΩ(n2)

• Pruningarulesetwithm =0.01*kn extraconditionsisveryslow:Ω(n3)ifimplementednaively

[1] “Best” is wrt some statistics on c’s coverage of POS,NEG

[2] R is now of the form “if xi1=_ and xi2=_ and … è pos”

47

Page 48: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Empiricalanalysis of complexity: plot run-time on a log-log plot and measure the slope (using linear regression)

48

Page 49: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Wheredoasymptotics breakdown?• Whentheconstantsaretoobig– ornistoosmall

• Whenwecan’tpredictwhattheprogramwilldo– Eg,howmanyiterationsbeforeconvergence?Doesitdependondatasizeornot?– Thisiswhenyouneedexperiments

• Whentherearedifferenttypesofoperationswithdifferentcosts–Weneedtounderstandwhatweshouldcount

49

Page 50: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Whatdowecount?

• Compilersdon’twarnJeffDean.JeffDeanwarnscompilers.

• JeffDeanbuildshiscodebeforecommittingit,butonlytocheckforcompilerandlinkerbugs.

• JeffDeanwritesdirectlyinbinary.Hethenwritesthesourcecodeasadocumentationforotherdevelopers.

• JeffDeanonceshiftedabitsohard,itendeduponanothercomputer.

• WhenJeffDeanhasanergonomicevaluation,itisfortheprotectionofhiskeyboard.

• gcc -O4emailsyourcodetoJeffDeanforarewrite.• WhenheheardthatJeffDean'sautobiographywouldbe

exclusivetotheplatform,RichardStallmanboughtaKindle.

• JeffDeanputshispantsononelegatatime,butifhehadmorelegs,you’drealizethealgorithmisactuallyonlyO(logn)

50

Page 51: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Numbers(JeffDeansays)EveryoneShouldKnow

51

Page 52: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Update:ColinScott,UCB

file:///Users/wcohen/Documents/code/interactive_latencies/interactive_latency.html - *may need to open this from shell

52

Page 53: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

What’sHappeningwithHardware?• Clockspeed:stuckat3Ghzfor~10years• Netbandwidthdoubles~2years• Diskbandwidthdoubles~2years• SSDbandwidthdoubles~3years• Diskseekspeeddoubles~10years• SSDlatencynearlysaturated

53

Page 54: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

54

Page 55: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Numbers(JeffDeansays)EveryoneShouldKnow

55

Page 56: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU

16x bigger

256x bigger

Hard disk (1Tb) 128x bigger

56

Page 57: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU

16x bigger

256x bigger

Hard disk (1Tb) 128x bigger

57

Page 58: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU

16x bigger

256x bigger

Hard disk (1Tb) 128x bigger

58

Page 59: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU

16x bigger

256x bigger

Hard disk (1Tb) 128x bigger

59

Page 60: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Atypicaldisk

60

Page 61: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Numbers(JeffDeansays)EveryoneShouldKnow

~= 10x

~= 15x

~= 100,000x

40x

61

Page 62: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Whatdowecount?

• Compilersdon’twarnJeffDean.JeffDeanwarnscompilers.• ….

• Memoryaccess/instructionsarequalitativelydifferentfromdiskaccess

• Seeksarequalitativelydifferentfromsequentialreadsondisk

• Cache,diskfetches,etcworkbestwhenyoustreamthroughdatasequentially

• Bestcasefordataprocessing:streamthroughthedataonce insequentialorder,asit’sfoundondisk.

62

Page 63: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

Otherlessons-?

* but not important enough for this class’s assignments….

*

63

Page 64: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening

What/How–Nextlecture:probabilityreviewandNaïveBayes.–Homework:• WatchthereviewlectureIlinkedtoonthewiki

– I’mnot goingtorepeatit

64