Database design and implementationavid.cs.umass.edu/courses/645/s2017/lectures/02-Projects.pdf ·...
Transcript of Database design and implementationavid.cs.umass.edu/courses/645/s2017/lectures/02-Projects.pdf ·...
Database design and implementation CMPSCI 645
Lecture 02: Project overview
Overview of deliverables
2
} Formgroups(week3)} Projectproposal(week5)} Literaturesurvey(week7)} Midtermstatusreport(week10)} ProjectpresentaAons(lastweekofclasses)} Projectreport
Data errors
3
} Prevent} Detect} Repair
Data cleaning
4
} Costofdataerrors:} PoordataqualitycoststheUSeconomymorethan$600billionperyear[Eckerson’02]
} ErroneouspricedatainretaildatabasescostsUSconsumers$2.5billioneachyear[Fanetal.’08]
} Consumes30-80%ofthedevelopmentAmeandbudgetindatawarehouses[Fanetal.’08]
Examples:FD:STR→ZIPcFD:[CC,ZIP]→STR,(44,_||_)
Using FDs for cleaning
5
Challenge1:discoverFDsfromdataChallenge2:useFDstocleanthedata
credit:WenfeiFan
Data fusion
6
6:15 PM
6:15 PM
6:22 PM
9:40 PM 8:33 PM
9:54 PM
credit:LunaDong,DiveshSrivastava
Ideas?
7
} MajorityvoAng?} WhatifwebsitescopyinformaAon?} CanyouaccountforcorrelaAons?} Whatiftherearemaliciouscontributors/spammers?
Data disambiguation
8
9
Ideas?
10
} AffiliaAoninformaAon?} Co-authorshipgraph?} Papercontent?} Crowdsourcing?
Data diagnosis
11
<subject,predicate,object> <subject,predicate,object> <subject,predicate,object>
<subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object>
<subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object>
<subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object>
<subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object> <subject,predicate,object>
Post-factum cleaning with uncertainty
12
0.016 True 0.067 0 0.4 0.004 0.86 0.036 10
0.0009 False 0 0 0.2 0.0039 0.81 0.034 68
0.005 True 0.19 0 0.03 0.003 0.75 0.033 17
0.0008 True 0.003 0 0.1 0.003 0.8 0.038 18
Whatcausedtheseerrors?
Data
Accelerometer
Cell Tower
GPS
Light
Audio
Periodicity p
HasSignal? h
Rate of Change r
Avg. Intensity i
Speed s
Avg. Strength a
Zero crossing rate z
Spectral roll-off c
Transformations
Is Indoor?M(¬h, i < Ii)
Is Driving?M(p < Pd, r > Rd, h, s > Sd)
Is Walking?M(p > Pw, Rs < r < Rw,¬h � (s < Sw))
Alone?(A2 � a > A1) � ((a > A2) � (z > Z)) �
((a > A3) � (z < Z) � (c > C))
Is Meeting?M(¬h, i < Im, a > Am, z > Zm)
Outputs
true!
false!
false!
true!
false!
Sensorsmaybefaultyorinhibited
Itisnotstraigheorwardtospotsucherrorsintheprovenance
sensordata
Post-factum cleaning with uncertainty
13
} Whatifwedon’tknowforsurethatthereisanerror?
} Whatifweknowforsurethatthereisanerror,butwedon’tknowwhere?
Query debugging
14
} Example:findallactorswithBaconnumber2.SELECT!a3.fname, a3.lname FROM !Actor a0, Casts c0, Casts c1, !
!Casts c2, Casts c3, Actor a3 WHERE !a0.fname = 'Kevin' AND a0.lname = 'Bacon' AND !
!c0.pid = a0.id AND c0.mid = c1.mid AND !!c1.pid = c2.pid AND c2.mid = c3.mid AND !!c3.pid = a3.id AND !!NOT (a3.fname = 'Kevin' and a3.lname = 'Bacon') AND
!NOT EXISTS (SELECT !xc1.pid ! !FROM !Actor xa0, Casts xc0, Casts xc1 WHERE !xa0.fname = 'Kevin' AND !
! ! ! !xa0.lname = 'Bacon' AND xa0.id = xc0.pid AND !
! ! ! !xc0.mid = xc1.mid AND xc1.pid = a3.id)!GROUP BY a3.id, a3.fname, a3.lname;!!
Ideas?
15
} Datasampling?} VisualizaAons?} Differentquerylanguages?} Querybyexample?} InsteadofwriAngaquery,giveexamplesoftheresults
} Canyouquerywithnaturallanguage?
Query containment/equivalence
16
} Equivalence:
} Containment:
} Queryrewritewithviews.
} Idea:useSATsolvers
q1 � q2 � q1(D) = q2(D),�D
q1 � q2 � q1(D) � q2(D),�D
NP-hard
Databases for applications
17
} AdeclaraAvelanguageforanimaAon} Createcharactersby“joining”exisAngcomponents} Findrigs(skeletons)foragivencharacter
Databases for applications
18
} NavigaAngandqueryinglargeimagedata
credit:Sloandigitalskysurvey
Probabilistic databases
19
} WhataretheprobabiliAesinthejoinresults} Canyoufindqueryclassesforwhichinferenceistractable?
credit:DanSuciu
Provenance-based content authenticity
20
} Provenance:sourcesandprocessthatderivesnewdata.
} Example:Wikipedia} Openlyauthored} MistakencontribuAons} MaliciouscontribuAons
} Idea:} Usehistorytoassesstrustworthiness} Analyzeeditbehavior} AnalyzecitaAons} Propagatetrust
Data auditing
21
} Example:medicalrecords} Whoshouldhaveaccess?
} QuerylanguagetomodelsecurityviolaAons} QueryexecuAonengineforthislanguage
} TrustpropagaAon} Whatisdatagetstainted?} Whatotherdataisaffected?} Whichdataissafe?
Data understanding
22
Why??
Support to understand surprising results
What if the results are not surprising? Can DBs add context?
Explain by example
Fairness in Big Data
23
AccuracyisbeperformajoriAes
credit:MoritzHardt
Data-driven decisions WheretoofferservicesProductrecommendaAonsWhogetsaloanCrimesentencingSelf-drivingcarsWorkauthorizaAon(e.g.,e-Verify)
Other projects
24
} ACMSIGMODprogrammingcontest
} PVLDBexperimentandanalysistrack} thoroughevaluaAonofexisAngapproaches
Brainstorming session
25
} Formgroupsof6-7.} Take5mintothinkindividuallyandfillasmuchasyoucanthequesAonsonthefirstpageoftheprintout.
} Presentyourideasinacircleasfollows:} Eachpersondescribestheirideaathighlevel.Nomorethan1minute.
} Thepersononhis/herlesrepeatstheideaintheirownwords.Thenproceedstodescribehis/heridea.
} Discussandfillthesecondpageoftheprintout