Where do data come from?
CS100 - Guest Lecture - Databases and Provenance
Boris Glavic
DBGroup
Illinois Institute of Technology
October 25, 2019
-
Outline
1 Who I am
2 What are Databases?
3 Data Provenance
4 Questions
-
Who I am
Hi, I am Boris
I am a database guy!
I will tell you:
1) Why DBs are important
2) Why DBs are interesting
3) My Research
Slide 5 of 44 Boris Glavic - Where do data come from?: Who I am
-
Where do data come from?
-
You might have heard . . .
-
What Are Databases?
Database systems and databases
* database systems manage databases
* a database is a structured collection of data
What do database systems do?
1 Provide persistent storage of data
2 Efficient declarative access to data: querying
3 Protection from data loss under failures
4 Safe concurrent access to data
-
Who Uses Databases?
Most large software systems use databases!
* Business Intelligence, e.g., IBM Cognos
* Web-based systems
Desktop software
* Your music player
* Your email client (most likely at least)
Every big company uses DBs
* banks
* insurance
* government agencies
* ...
-
Who Creates Databases?
Relational databases are big business
* IBM DB2
* Oracle
* Microsoft SQL Server
* Teradata
* Open source systems: PostgreSQL, MySQL
Distributed systems
* Cloud storage and key-value stores
  * Amazon S3, Google Bigtable, Cassandra
* Big data analytics
  * MapReduce, Spark, Flink
-
Why Are Databases Interesting?
Combination of systems and theoretical research
Interesting systems problems
* Hacking complex and large systems
* Low-level optimizations
  * exploit modern hardware
Interesting theoretical foundations
* Complexity of answering queries
* Expressiveness of query languages
* Strong connections to logic
-
Why Are Databases Interesting (cont.)
Connections to other CS fields
* Distributed systems
  * getting more and more important
* Compilers
* Modeling
* AI and machine learning
  * Data mining
* Operating and File Systems
-
The Relational Data Model
Relations aka Tables
* a table consists of columns and rows
* tables store one type of entity
  * e.g., students, bank accounts, loans, ...
* each row is one entity
  * e.g., one student
* columns store a particular type of information about an entity
  * e.g., name of a student
-
The Relational Data Model (cont.)
Example Tables
Students table

CWID      Name   Major  GPA  Phone
A1333331  Peter  CS     3.5  312 555 8888
A5552341  Alice  CS     4.0  312 555 7777
A1325324  Elisa  Bio    3.2  312 555 5555

Grades table

CWID      Course  Grade
A1333331  CS100   A
A5552341  CS425   C
A1325324  CS525   A
A1325324  CS566   B
-
Queries
What do I do with the data in my database?
* you can interrogate the database system to extract information about your data
* this is done using a programming language called SQL
* SQL is a declarative language
  * say what data you want, not how to compute it
* Queries return tables (closed language)
-
Queries (cont.)

CWID      Name   Major  GPA  Phone
A1333331  Peter  CS     3.5  312 555 8888
A5552341  Alice  CS     4.0  312 555 7777
A1325324  Elisa  Bio    3.2  312 555 5555

How many students are in my database?

#Students
3

Who has the highest GPA?

Name
Alice

What are the names of CS students?

Name
Peter
Alice
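The three questions above can be written as SQL queries. The following is a minimal sketch, assuming Python's built-in sqlite3 module as the database system and a table named students with lowercase column names (both are choices made for this illustration, not part of the slides).

```python
import sqlite3

# In-memory database holding the Students table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (cwid TEXT, name TEXT, major TEXT, gpa REAL, phone TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        ("A1333331", "Peter", "CS", 3.5, "312 555 8888"),
        ("A5552341", "Alice", "CS", 4.0, "312 555 7777"),
        ("A1325324", "Elisa", "Bio", 3.2, "312 555 5555"),
    ],
)

# How many students are in my database?
num_students = conn.execute("SELECT count(*) FROM students").fetchall()

# Who has the highest GPA?
top_student = conn.execute(
    "SELECT name FROM students WHERE gpa = (SELECT max(gpa) FROM students)"
).fetchall()

# What are the names of CS students?
cs_students = conn.execute(
    "SELECT name FROM students WHERE major = 'CS' ORDER BY cwid"
).fetchall()

print(num_students, top_student, cs_students)
```

Each query only states which data is wanted; how the rows are scanned and compared is left to the database system. Each result is again a table (here a list of rows).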
-
Persistence and Recovery
What if you shut down your computer?
* will you lose your precious data?
What happens when your computer crashes?
* will you lose your precious data?
No!
* the database system stores your data on stable storage (disk)
* database systems know how to recover from failures
* when the database system signals to you that a change you made was applied, then you will never lose it
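A small sketch of the persistence part, assuming Python's built-in sqlite3 module and a temporary file path chosen only for this illustration. It shows a committed change surviving a restart of the program; it does not simulate a crash.

```python
import os
import sqlite3
import tempfile

# Hypothetical file location; any path on stable storage would do.
db_path = os.path.join(tempfile.mkdtemp(), "students.db")

# First session: make a change and commit it.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE students (cwid TEXT, name TEXT)")
conn.execute("INSERT INTO students VALUES ('A1333331', 'Peter')")
conn.commit()  # the database system signals that the change was applied
conn.close()   # stands in for shutting down the program

# Second session: open the same file again and read the data back.
conn = sqlite3.connect(db_path)
rows_after_restart = conn.execute("SELECT cwid, name FROM students").fetchall()
conn.close()

print(rows_after_restart)
```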
-
Concurrent Access
Banking Example
* Account A: $50
* Account B: $50
* Transfer $25 from A to B
* Bank gives all accounts 10% interest
Interleaved execution

Transfer Money        Give 10% interest   Account A   Account B
Subtract $25 from A                       $25         $50
                      Add 10% interest    $27.50      $55
Add $25 to B                              $27.50      $80

We have lost interest! Running the complete transfer first and the interest afterwards would give $27.50 and $82.50, so $2.50 of interest is missing.
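The lost interest can be reproduced with a few lines of arithmetic. This is only a sketch of the two execution orders, not of how a database system schedules operations; the function names are made up for the illustration, and amounts are kept in cents to avoid rounding issues.

```python
def interleaved():
    """Interest is applied between the two steps of the transfer."""
    a, b = 5000, 5000          # $50 each, in cents
    a -= 2500                  # transfer, step 1: subtract $25 from A
    a = a * 110 // 100         # 10% interest on A
    b = b * 110 // 100         # 10% interest on B
    b += 2500                  # transfer, step 2: add $25 to B
    return a, b


def serial():
    """The transfer runs completely before the interest is applied."""
    a, b = 5000, 5000
    a -= 2500
    b += 2500
    a = a * 110 // 100
    b = b * 110 // 100
    return a, b


lost_cents = sum(serial()) - sum(interleaved())
print(interleaved(), serial(), lost_cents)
```

In the interleaved order the $25 in transit belongs to neither account when the interest is computed, which is exactly the missing $2.50.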
-
Concurrent Access (cont.)
Concurrency Control
* databases manage concurrent operations
* prevent bad things from happening
* from the user's perspective:
  * behaves like your program is the only one running!
Can we lose interest?
* Nope!
-
What is Provenance?
Provenance in Art
* record of ownership of a piece of art
Arnolfini Portrait
The provenance of the painting begins in 1434 when it was dated by van Eyck and presumably owned by the sitter(s). At some point before 1516 it came into the possession of Don Diego de Guevara (d. Brussels 1520), a Spanish career courtier of the Habsburgs ... By 1516 he had given the portrait to Margaret of Austria, ...
-
What is Provenance?
Provenance in Databases
* Records how data was produced
  * which other data was used in the creation process
  * which operations were involved in its creation
* For the sake of this lecture, only provenance of queries
Provenance of a query result
* Select one row from the result of a query
* Which input rows were used to compute it?
* Maybe also: how were these rows combined?
-
Example
Compute average salary of employees per department

Query
SELECT dept, avg(salary) AS avgsal
FROM emp
GROUP BY dept

Input table emp
name   salary  dept
Peter  10      HR
Bob    20      HR
Alice  5       IT

Result
dept  avgsal
HR    15
IT    5
-
Example (cont.)
The first result row (HR, 15) depends on the first two input rows (Peter and Bob). The second result row (IT, 5) depends on the third input row (Alice).
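The dependency between result rows and input rows can be made explicit by remembering, for every group, which input rows went into it. The following sketch evaluates the query by hand in plain Python; the variable names are chosen for this illustration only.

```python
from collections import defaultdict

# Input table emp from the slide: (name, salary, dept)
emp = [("Peter", 10, "HR"), ("Bob", 20, "HR"), ("Alice", 5, "IT")]

# GROUP BY dept: collect the input rows of each department.
groups = defaultdict(list)
for row in emp:
    groups[row[2]].append(row)

# avg(salary) per group is the query result ...
result = {dept: sum(r[1] for r in rows) / len(rows) for dept, rows in groups.items()}

# ... and the rows of the group are the provenance of that result row.
provenance = {dept: [r[0] for r in rows] for dept, rows in groups.items()}

print(result)
print(provenance)
```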
-
Some Observations
Provenance maps output rows of queries to input rows
* here we track this per row
* could also track attribute values (higher fidelity)
* could also track tables (lower fidelity)
-
Why should I care?
Use cases
* Debugging queries and data
* Auditing
* Explainability
* Optimizing DB operations
* Determining trust in data
-
Queries are Functions
Functional View of Querying
* A query takes as input a database and outputs a table
* We can think about queries as functions from databases to result tables!
What then is provenance?
* Select one of the outputs of the query
* Which inputs were used to compute it?
-
Recap Functions
What are functions in math?
* you already know functions from high school math!
Examples
* $f(x) = x$
  * $f(1) = 1$
  * $f(2) = 2$
  * ...
* $f(x) = x^2$
  * $f(1) = 1$
  * $f(2) = 4$
  * ...
* $f(x, y) = x + y$
  * $f(1, 2) = 1 + 2 = 3$
  * ...
-
Recap Functions (cont.)
What makes a function a function?
* Does it have to return numbers?
* Does it have to take numbers as input?
Counterexamples
* f takes a name as input and returns the name converted to lowercase
  * $f(\text{Peter}) = \text{peter}$
  * $f(\text{Bob}) = \text{bob}$
* f takes text as input and returns the number of characters in the text
  * $f(\text{Bob}) = 3$
  * $f(\text{Alice}) = 5$
-
Recap Functions (cont.)
Definition (Function)
* Input domain: A set of values I
* Output domain: A set of values O
* Mapping: Associate each value from I with one value from O
Queries as Functions
* Input domain: databases
* Output domain: tables
-
What is the Provenance of a Function?
Provenance = Inversion?
* We want to understand which input was used to generate an output
* In math this is called function inversion
* The inverse $f^{-1}$ of a function $f$ takes an output of $f$ and returns the corresponding input
  * When $f(x) = y$ then $f^{-1}(y) = x$
Examples
* if $f(x) = 2x$ then $f^{-1}(x) = 0.5x$
* if $f(x) = x^3$ then $f^{-1}(x) = \sqrt[3]{x}$
* if $f(x) = x^2$ then $f^{-1}(x) = \sqrt{x}$?
  * this does not work (two possible solutions, e.g., $f(2) = f(-2) = 4$)
-
From Invertible Functions to Queries
Queries are typically not invertible!
* Return the number of students in CS100
* Let's say the result is 3 students
* The inverse function would have to magically guess who these 3 students are!
Queries operate on tables
* We want more fine-granular information:
  * Which rows from the input affected which rows from the output!
-
Annotation Propagation to the Rescue
Quadratic function

Input  Output
-2     4
-1     1
0      0
1      1
2      4

Cannot invert this
* The output is not enough to compute provenance!
* How can we deal with that?
-
Magic Machine Riddle
“Magic Machine”
-
Magic Machine Solution
“Magic Machine”
-
Generalizing the solution
Approach
* annotate input data with unique identifiers (colors)
* outputs are annotated with the color of the input they are derived from
Quadratic function (c1 to c5 stand in for the five colors on the original slide, which did not survive extraction)

Input     Output
(-2, c1)  (4, c1)
(-1, c2)  (1, c2)
(0, c3)   (0, c3)
(1, c4)   (1, c4)
(2, c5)   (4, c5)
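The approach can be sketched in a few lines of Python. The helper names annotated_apply and provenance are made up for this illustration, and the strings c1 to c5 are arbitrary identifiers standing in for colors.

```python
def square(x):
    return x * x


def annotated_apply(f, annotated_inputs):
    """Apply f to each (value, annotation) pair and carry the annotation along."""
    return [(f(value), annotation) for value, annotation in annotated_inputs]


def provenance(annotated_output, annotated_inputs):
    """Look up the input values that carry the same annotation as the output."""
    _, annotation = annotated_output
    return [value for value, ann in annotated_inputs if ann == annotation]


inputs = [(-2, "c1"), (-1, "c2"), (0, "c3"), (1, "c4"), (2, "c5")]
outputs = annotated_apply(square, inputs)

print(outputs)
print(provenance(outputs[0], inputs), provenance(outputs[4], inputs))
```

Both -2 and 2 produce the value 4, but the annotation tells the two outputs apart, so no inverse of the function is needed.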
-
Caveats
* We assumed that the function happily accepts inputs that are pairs of numbers and colors
* If inputs and outputs are tables, then we need to understand the internals of the function to know how they are related
-
Provenance for Queries (how it is actually done)
Encode annotations by extending tables
* each row is extended with extra attributes
* these attributes are used to store provenance
Query instrumentation
* We rewrite input queries into queries that
  1 create annotations for each input
  2 propagate these annotations to produce annotated outputs
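To give a feeling for what an instrumented query can look like, here is a hand-written rewrite of the average-salary query, run with Python's built-in sqlite3 module. It is a sketch of the general idea only: the attribute names prov_name, prov_salary, and prov_dept are assumptions made for this illustration, and the rewrite is not claimed to be what any particular provenance system generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER, dept TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("Peter", 10, "HR"), ("Bob", 20, "HR"), ("Alice", 5, "IT")],
)

original_query = "SELECT dept, avg(salary) AS avgsal FROM emp GROUP BY dept"

# Rewritten query: every result row is extended with extra attributes that
# hold one input row it depends on (one output row per contributing input row).
instrumented_query = f"""
SELECT q.dept, q.avgsal,
       e.name AS prov_name, e.salary AS prov_salary, e.dept AS prov_dept
FROM ({original_query}) AS q
JOIN emp AS e ON e.dept = q.dept
ORDER BY q.dept, e.rowid
"""

annotated_rows = conn.execute(instrumented_query).fetchall()
for row in annotated_rows:
    print(row)
```

The result is an ordinary table, so the provenance can itself be queried with SQL.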
-
DBGroup - What do we do?
Distributed and High-performance Databases
* HRDBMS - a scalable database with high per-node performance (http://www.cs.iit.edu/~dbgroup/projects/hrdbms.html)
* HCDF - operating system - database co-design
Data Integration and Cleaning
* how to systematically evaluate cleaning and integration systems
  * BART (http://db.unibas.it/projects/bart/)
  * iBench (http://www.cs.iit.edu/~dbgroup/projects/ibench.html)
Data Provenance
* GProM - a generic provenance middleware (http://www.cs.iit.edu/~dbgroup/projects/gprom.html)
* Relevance-based Data Management - optimizing data operations based on what data is relevant (http://www.cs.iit.edu/~dbgroup/projects/relevance.html)
-
DBGroup - What do we do? (cont.)
Data Science
* We are data science enablers!
* Vizier - a data-centric notebook platform with uncertainty tracking (https://vizierdb.info/)
-
Questions?
IIT DBGroup
* students: 7 Ph.D., 1 Master, 1 Undergraduate
* research group: http://www.cs.iit.edu/~dbgroup/
* personal page: http://www.cs.iit.edu/~dbgroup/members/bglavic.html
* github: https://github.com/IITDBGroup