Database design and implementation · avid.cs.umass.edu/courses/645/s2017/lectures/02-Projects.pdf

Database design and implementation CMPSCI 645 Lecture 02: Project overview

Database design and implementation CMPSCI 645

Lecture 02: Project overview

Overview of deliverables

- Form groups (week 3)
- Project proposal (week 5)
- Literature survey (week 7)
- Midterm status report (week 10)
- Project presentations (last week of classes)
- Project report

Data errors

- Prevent
- Detect
- Repair

Data cleaning

- Cost of data errors:
  - Poor data quality costs the US economy more than $600 billion per year [Eckerson ’02]
  - Erroneous price data in retail databases costs US consumers $2.5 billion each year [Fan et al. ’08]
  - Consumes 30-80% of the development time and budget in data warehouses [Fan et al. ’08]

Using FDs for cleaning

Examples:
- FD: STR → ZIP
- cFD: [CC, ZIP] → STR, (44, _ || _)

- Challenge 1: discover FDs from data
- Challenge 2: use FDs to clean the data

credit: Wenfei Fan
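
The detection half of Challenge 2 can be sketched as follows, assuming the relation is held as a list of Python dicts. An FD X → Y is violated by any group of tuples that agree on X but disagree on Y. The helper name `fd_violations`, its `condition` argument (standing in for the constant part of a cFD pattern such as CC = 44), and the sample rows are illustrative assumptions, not material from the lecture.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs, condition=None):
    """Return {lhs_value: set_of_rhs_values} for groups that violate lhs -> rhs.

    condition is an optional dict of attribute -> constant; only rows matching
    it are checked, which mimics the constant part of a cFD pattern.
    """
    groups = defaultdict(set)
    for row in rows:
        if condition and any(row[a] != v for a, v in condition.items()):
            continue  # the pattern does not apply to this row
        key = tuple(row[a] for a in lhs)
        groups[key].add(tuple(row[a] for a in rhs))
    # A group with more than one distinct right-hand side violates the dependency.
    return {k: vals for k, vals in groups.items() if len(vals) > 1}

# Hypothetical sample data (CC = country code).
rows = [
    {"CC": 44, "ZIP": "EH4 1DT", "STR": "Mayfield Rd"},
    {"CC": 44, "ZIP": "EH4 1DT", "STR": "Crichton St"},    # conflicts with the row above
    {"CC": 1,  "ZIP": "01003",   "STR": "Governors Dr"},
    {"CC": 1,  "ZIP": "01003",   "STR": "N Pleasant St"},  # not covered by the CC = 44 pattern
]

violations = fd_violations(rows, lhs=["CC", "ZIP"], rhs=["STR"], condition={"CC": 44})
print(violations)
```

Detection only reports the conflicting groups. Deciding which value to keep (the repair step) needs extra evidence and is left open here.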

Data fusion

[Figure: time values shown in the slide image: 6:15 PM, 6:15 PM, 6:22 PM, 9:40 PM, 8:33 PM, 9:54 PM]

credit: Luna Dong, Divesh Srivastava

Ideas?

- Majority voting?
- What if websites copy information?
- Can you account for correlations?
- What if there are malicious contributors/spammers?
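
A minimal sketch of the first two ideas, assuming each source reports a single value and that copy relationships are already known. In practice, discovering the copying is the hard part. The function name `majority_vote`, the `copies_of` mapping, and the site names are hypothetical.

```python
from collections import Counter

def majority_vote(claims, copies_of=None):
    """Pick the most frequently claimed value.

    claims: dict source -> value.
    copies_of: optional dict copier -> original source; a copier's claim is
    ignored when it merely repeats the value of the source it copies from.
    """
    copies_of = copies_of or {}
    counts = Counter()
    for source, value in claims.items():
        original = copies_of.get(source)
        if original is not None and claims.get(original) == value:
            continue  # do not count a copied claim twice
        counts[value] += 1
    winner = counts.most_common(1)[0][0]
    return winner, dict(counts)

# Hypothetical claims about one arrival time.
claims = {
    "site_a": "9:40 PM",
    "site_b": "6:15 PM",
    "site_c": "6:15 PM",
    "site_d": "6:15 PM",
    "site_e": "9:40 PM",
}
copies = {"site_c": "site_b", "site_d": "site_b"}

naive = majority_vote(claims)
copy_aware = majority_vote(claims, copies)
print(naive, copy_aware)
```

With plain voting the copied value wins three to two. Once the two copiers are discounted, the independent sources favor the other value, which is why copying and correlations matter.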

Data disambiguation

Ideas?

- Affiliation information?
- Co-authorship graph?
- Paper content?
- Crowdsourcing?

Data diagnosis

[Figure: several groups of <subject, predicate, object> triples]

Post-factum cleaning with uncertainty

[Figure: sensor data passes through transformations to produce outputs]

What caused these errors?

Data (sensor data), four sample rows:
  0.016   True   0.067  0  0.4   0.004   0.86  0.036  10
  0.0009  False  0      0  0.2   0.0039  0.81  0.034  68
  0.005   True   0.19   0  0.03  0.003   0.75  0.033  17
  0.0008  True   0.003  0  0.1   0.003   0.8   0.038  18

Sensors: Accelerometer, Cell Tower, GPS, Light, Audio

Variables:
- Periodicity p
- HasSignal? h
- Rate of Change r
- Avg. Intensity i
- Speed s
- Avg. Strength a
- Zero crossing rate z
- Spectral roll-off c

Transformations:
- Is Indoor? M(¬h, i < Ii)
- Is Driving? M(p < Pd, r > Rd, h, s > Sd)
- Is Walking? M(p > Pw, Rs < r < Rw, ¬h ∨ (s < Sw))
- Alone? (A2 ≥ a > A1) ∨ ((a > A2) ∧ (z > Z)) ∨ ((a > A3) ∧ (z < Z) ∧ (c > C))
- Is Meeting? M(¬h, i < Im, a > Am, z > Zm)

Outputs: true, false, false, true, false

- Sensors may be faulty or inhibited
- It is not straightforward to spot such errors in the provenance

Post-factum cleaning with uncertainty

- What if we don’t know for sure that there is an error?
- What if we know for sure that there is an error, but we don’t know where?

Query debugging

- Example: find all actors with Bacon number 2.

    SELECT a3.fname, a3.lname
    FROM   Actor a0, Casts c0, Casts c1,
           Casts c2, Casts c3, Actor a3
    WHERE  a0.fname = 'Kevin' AND a0.lname = 'Bacon' AND
           c0.pid = a0.id AND c0.mid = c1.mid AND
           c1.pid = c2.pid AND c2.mid = c3.mid AND
           c3.pid = a3.id AND
           NOT (a3.fname = 'Kevin' AND a3.lname = 'Bacon') AND
           NOT EXISTS (SELECT xc1.pid
                       FROM   Actor xa0, Casts xc0, Casts xc1
                       WHERE  xa0.fname = 'Kevin' AND
                              xa0.lname = 'Bacon' AND
                              xa0.id = xc0.pid AND
                              xc0.mid = xc1.mid AND xc1.pid = a3.id)
    GROUP BY a3.id, a3.fname, a3.lname;
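
One way to debug a query like the one above is to run it on a tiny hand-built database whose correct answer is known in advance. The sketch below uses Python's built-in `sqlite3` module. The schema Actor(id, fname, lname) and Casts(pid, mid) is inferred from the column names in the query, and every actor other than Kevin Bacon is made up for illustration.

```python
import sqlite3

BACON_2 = """
SELECT a3.fname, a3.lname
FROM   Actor a0, Casts c0, Casts c1, Casts c2, Casts c3, Actor a3
WHERE  a0.fname = 'Kevin' AND a0.lname = 'Bacon'
  AND  c0.pid = a0.id AND c0.mid = c1.mid
  AND  c1.pid = c2.pid AND c2.mid = c3.mid
  AND  c3.pid = a3.id
  AND  NOT (a3.fname = 'Kevin' AND a3.lname = 'Bacon')
  AND  NOT EXISTS (SELECT xc1.pid
                   FROM   Actor xa0, Casts xc0, Casts xc1
                   WHERE  xa0.fname = 'Kevin' AND xa0.lname = 'Bacon'
                     AND  xa0.id = xc0.pid
                     AND  xc0.mid = xc1.mid AND xc1.pid = a3.id)
GROUP BY a3.id, a3.fname, a3.lname;
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Actor (id INTEGER PRIMARY KEY, fname TEXT, lname TEXT);
CREATE TABLE Casts (pid INTEGER, mid INTEGER);
""")
# A chain of co-stars: Bacon - Ann (movie 10), Ann - Bob (11), Bob - Cy (12).
conn.executemany("INSERT INTO Actor VALUES (?, ?, ?)", [
    (1, "Kevin", "Bacon"),
    (2, "Ann", "One"),    # Bacon number 1
    (3, "Bob", "Two"),    # Bacon number 2
    (4, "Cy", "Three"),   # Bacon number 3
])
conn.executemany("INSERT INTO Casts VALUES (?, ?)", [
    (1, 10), (2, 10),
    (2, 11), (3, 11),
    (3, 12), (4, 12),
])

result = conn.execute(BACON_2).fetchall()
print(result)  # only the actor at distance exactly 2 should be returned
```

On this chain the expected answer is the single actor two steps from Bacon, so any extra or missing row points at a faulty join or exclusion condition.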

Ideas?

- Data sampling?
- Visualizations?
- Different query languages?
- Query by example?
  - Instead of writing a query, give examples of the results
- Can you query with natural language?

Query containment/equivalence

- Equivalence: $q_1 \equiv q_2 \iff q_1(D) = q_2(D),\ \forall D$
- Containment: $q_1 \subseteq q_2 \iff q_1(D) \subseteq q_2(D),\ \forall D$
- Query rewrite with views.
- Idea: use SAT solvers

NP-hard
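
For conjunctive queries, containment can be decided with the classical homomorphism test: q1 is contained in q2 exactly when some mapping of q2's variables onto q1's variables preserves the head and sends every body atom of q2 to a body atom of q1. The sketch below is a brute-force search over such mappings, not a SAT encoding, and it takes exponential time. The query representation (a head tuple plus a list of `(relation, args)` atoms, variables only) and the name `contained_in` are assumptions made for illustration.

```python
from itertools import product

def contained_in(q1, q2):
    """Return True if conjunctive query q1 is contained in q2.

    A query is (head, body): head is a tuple of variables, body is a list of
    atoms (relation_name, tuple_of_variables). Constants are not supported.
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    vars2 = sorted({v for _, args in body2 for v in args} | set(head2))
    terms1 = sorted({v for _, args in body1 for v in args} | set(head1))
    atoms1 = set(body1)
    # Try every mapping h from the variables of q2 to the variables of q1.
    for image in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        if tuple(h[v] for v in head2) != tuple(head1):
            continue  # the head must be preserved
        if all((rel, tuple(h[v] for v in args)) in atoms1 for rel, args in body2):
            return True  # h is a homomorphism from q2 to q1
    return False

# q1(x) :- R(x, y), R(y, x)      q2(x) :- R(x, y)
q1 = (("x",), [("R", ("x", "y")), ("R", ("y", "x"))])
q2 = (("x",), [("R", ("x", "y"))])

print(contained_in(q1, q2), contained_in(q2, q1))
```

Equivalence follows by checking containment in both directions. Here q1 adds a constraint to q2, so q1 is contained in q2 but not the other way around.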

Databases for applications

- A declarative language for animation
- Create characters by “joining” existing components
- Find rigs (skeletons) for a given character

Databases for applications

- Navigating and querying large image data

credit: Sloan Digital Sky Survey

Probabilistic databases

- What are the probabilities in the join results?
- Can you find query classes for which inference is tractable?

credit: Dan Suciu
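
To make the first question concrete, here is a small sketch that assumes a tuple-independent model: each tuple is present independently with its own probability. It computes the probability that a join is non-empty by enumerating all possible worlds. The relations, probabilities, and the name `query_probability` are hypothetical, and the enumeration is exponential in the number of tuples, which is why tractable query classes are of interest.

```python
from itertools import product

def query_probability(R, S):
    """Probability that the Boolean query q :- R(x), S(x, y) is true.

    R: dict x -> probability; S: dict (x, y) -> probability.
    Tuples are assumed independent; every possible world is enumerated.
    """
    tuples = [("R", k, p) for k, p in R.items()] + [("S", k, p) for k, p in S.items()]
    total = 0.0
    for present in product([True, False], repeat=len(tuples)):
        weight = 1.0
        r_world, s_world = set(), set()
        for (rel, key, p), keep in zip(tuples, present):
            weight *= p if keep else 1.0 - p
            if keep:
                (r_world if rel == "R" else s_world).add(key)
        # The join is non-empty if some S tuple matches a present R tuple.
        if any(x in r_world for x, _ in s_world):
            total += weight
    return total

R = {"a": 0.5}
S = {("a", "b1"): 0.5, ("a", "b2"): 0.5}
p = query_probability(R, S)
print(p)
```

For this instance the result can be checked by hand: R(a) must be present (0.5) and at least one of the two S tuples must be present (1 - 0.5 × 0.5 = 0.75), giving 0.375.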

Provenance-based content authenticity

- Provenance: the sources and processes that derive new data.
- Example: Wikipedia
  - Openly authored
  - Mistaken contributions
  - Malicious contributions
- Idea:
  - Use history to assess trustworthiness
  - Analyze edit behavior
  - Analyze citations
  - Propagate trust

Data auditing

- Example: medical records
  - Who should have access?
- Query language to model security violations
  - Query execution engine for this language
- Trust propagation
  - What if data gets tainted?
  - What other data is affected?
  - Which data is safe?

Data understanding

Why??

Support to understand surprising results

What if the results are not surprising? Can DBs add context?

Explain by example

Fairness in Big Data

Accuracy is better for majorities

credit: Moritz Hardt

Data-driven decisions
- Where to offer services
- Product recommendations
- Who gets a loan
- Crime sentencing
- Self-driving cars
- Work authorization (e.g., e-Verify)

Other projects

- ACM SIGMOD programming contest
- PVLDB experiment and analysis track
  - thorough evaluation of existing approaches

Brainstorming session

- Form groups of 6-7.
- Take 5 min to think individually and answer as many of the questions on the first page of the printout as you can.
- Present your ideas in a circle as follows:
  - Each person describes their idea at a high level. No more than 1 minute.
  - The person on his/her left repeats the idea in their own words, then proceeds to describe his/her own idea.
- Discuss and fill in the second page of the printout