A Sightseeing Tour of Provenance in Databases & Workflows

Bertram Ludäscher, Director, Center for Informatics Research in Science & Scholarship (CIRSS), Graduate School of Library and Information Science (GSLIS) & National Center for Supercomputing Applications (NCSA)


Outline of the Tour

•  Provenance: Alta Vista
•  Provenance
   –  … in Scientific Workflows
   –  … in Databases
•  Time allowing: Quick Demos

Introductions should come first!

•  What is “provenance”?
•  “… a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing”.
•  Come again? Who said that?
•  … what isn’t provenance??
•  Reminds me of “What is a species?”
•  … when the answer is (should be) quite clear:
   –  There are different notions!
•  … of species and of provenance!

Kicking Bird by Shahin Gholizadeh

Introductions should come first!

•  Let’s not be too hard on ourselves …
•  Definitions can be difficult!
   –  see Proofs and Refutations (Imre Lakatos, 1976)
   –  or ask Hermann Grassmann about Exterior Algebra
   –  … via (Gian-Carlo Rota, 1997) … via Bob Morris …
      •  “He gave his entire life to understanding and developing this definition.”
      •  “It took almost one hundred years before mathematicians realized the greatness of Grassmann's discovery.”
•  This will do for now: Oxford English Dictionary
   –  The place of origin or earliest known history of something:
      •  an orange rug of Iranian provenance
   –  The beginning of something’s existence; its origin:
      •  they try to understand the whole universe, its provenance and fate
   –  A record of ownership of a work of art or an antique, used as a guide to authenticity or quality:
      •  the manuscript has a distinguished provenance

Provenance Research everywhere …

1st Tour Stop: The Fine Arts

•  One of these has been sold for nearly $180m.
•  The other could be worth as much or more.
•  Which is which?
•  What is the difference?

2nd Stop: Liberal Arts & Sciences

•  What’s so “provenance” about this?
•  Grand Canyon’s rock layers are a record of the early geologic history of North America.

The ancestral puebloan granaries at Nankoweap Creek tell archaeologists about more recent human history. (By Drenaline, licensed under CC BY-SA 3.0)

Provenience vs Provenance

The Many Faces of Provenance

•  What are those?
•  Cosmology
•  Geology, Stratigraphy
•  Phylogeny
   –  the Tree of Life
•  Genealogy
   –  your family: literally
•  Academic Pedigree
   –  “Doktorvater” (doctoral advisor)
•  Etymology
•  Chain of custody
   –  of art(ifacts)
•  Yes! It’s all about origins, history …

Natural History: Understanding what happened…

Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.

Author: Jkwchui (Based on drawing by Truth-seeker2004)

Provenance Sleuth or Engineer?

•  Scientists are Provenance (i.e., Natural History) Sleuths
•  {Computational, Computer, Information}-Scientists should (also) be Provenance Engineers
   –  Ensure your “Data Tree of Life” (data provenance) is correct!
   –  What is the origin and processing history of your data?
•  With great provenance come great questions!
   –  “We store everything!”
   –  Huh? Yes, provenance is the answer… (yawn..)
   –  But what is the question??
•  Engineer’s Stance:
   –  What questions do you want to answer?
   –  Let’s find out what observables we need to capture, what query language we should use, how we do that efficiently (later), …

Computational Provenance

•  Origin and processing history of an artifact
   –  usually: data (products), figures, ...
   –  sometimes: workflow (and script) evolution …
•  Different sub-communities:
   –  Provenance in databases
   –  Provenance in (scientific) workflows
   –  ... programming languages, systems/security, …

… now arriving at 3rd stop: Scientific Workflows!

•  Automation
   –  wfs to automate computational aspects of science
•  Scaling (exploit and optimize machine cycles)
   –  wfs should make use of parallel compute resources
   –  wfs should be able to handle large data
•  Abstraction, Evolution, Reuse (human cycles)
   –  wfs should be easy to (re-)use, evolve, share
•  Provenance
   –  wfs should capture processing history, data lineage ⇒ traceable data- and wf-evolution ⇒ Reproducible Science

Trident Workbench

VisTrails

Once upon a time …

4th Stop: Different Kinds of Data Provenance in Workflows

•  Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)
•  Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)

ProvONE: W3C PROV++ for scientific workflows (Transfer station to any of several other “standard extensions”)

http://purl.dataone.org/provone-v1-dev

[Figure: ProvONE model with three regions: Trace-Land, Workflow-Land, Data-Land (extensible)]

SKOPE: Synthesized Knowledge Of Past Environments

Bocinsky, Kohler et al. study rain-fed maize of Anasazi
   –  Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstructing.

K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618

… implemented as an R Script …

Yes, scripts are (can be) workflows too!

Interactive Visualization

5th Stop: YesWorkflow: Yes, scripts are workflows, too!

[Figure: workflow graph hidden in the R script ("?"). Steps include GetModernClimate, SubsetAllData, CAR_Analysis_unique, CAR_Analysis_union, CAR_Reconstruction_union, CAR_Reconstruction_union_output. Data and parameters include master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years, PRISM_annual_growing_season_precipitation, dendro_series_for_calibration, dendro_series_for_reconstruction, cellwise_unique_selected_linear_models, cellwise_union_selected_linear_models, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif]

•  Script vs Workflows/ASAP:
   –  Automation: *****
   –  Scaling: **
   –  Abstraction: *
   –  Provenance: **

YesWorkflow.org

•  YesWorkflow (YW)
   –  Started as a grass-roots effort (Kurator, SKOPE, ..)
   –  … meeting the scientists/users where they R!
      •  R, Matlab, (i)Python, Jupyter, …
   –  Scripts + simple user annotations
      •  => Reveal the workflow model/abstraction … that underlies the (script) implementation
•  => YW can give us more of ASAP!
   –  First YW: ASAP (Abstraction)...
   –  Then YW-recon: ASAP (reconstructing runtime Provenance)

YW (prospective) and YW-Recon (retrospective) Provenance

•  1. YW: Annotate Script => YW Model
   –  Annotate @BEGIN..@END, @IN, @OUT
   –  Visualize, share, be happy :-)
•  2. Run script
   –  Files are read and written
   –  Folder- & Filenames have metadata
•  3. YW-Recon
   –  Use @URI tags that link YW Model ⇔ Persisted Data
   –  Run URI-template queries
      •  cf. “ls -R” & RegEx matching
•  4. YW-Query
   –  Answer the user’s provenance queries
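Step 1 can be pictured with a small Python sketch. Only the tag names (@BEGIN, @END, @IN, @OUT, @URI) and the block and data names are taken from the slides; the toy script body, the regular expression and the helpers `extract_blocks` and `dataflow_edges` are hypothetical illustrations and not the YesWorkflow tool's own parser.

```python
import re

# A toy script annotated with YW-style comment tags (function bodies are made up).
SCRIPT = '''
# @BEGIN load_screening_results
# @IN cassette_id
# @OUT sample_name
def load(cassette_id): ...
# @END load_screening_results

# @BEGIN collect_data_set
# @IN sample_name
# @OUT raw_image @URI file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
def collect(sample_name): ...
# @END collect_data_set
'''

# Hypothetical pattern: one tag per comment line, optionally followed by @URI.
TAG = re.compile(r"#\s*@(BEGIN|END|IN|OUT)\s+(\S+)(?:\s+@URI\s+(\S+))?")


def extract_blocks(text):
    """Return {block: {"in": [...], "out": [...], "uri": {...}}} from tagged comments."""
    blocks, current = {}, None
    for line in text.splitlines():
        m = TAG.match(line.strip())
        if not m:
            continue  # ordinary code line, ignored by the model
        tag, name, uri = m.groups()
        if tag == "BEGIN":
            current = blocks.setdefault(name, {"in": [], "out": [], "uri": {}})
        elif tag == "END":
            current = None
        elif current is not None:
            current[tag.lower()].append(name)
            if uri:
                current["uri"][name] = uri
    return blocks


def dataflow_edges(blocks):
    """Connect block A to block B whenever an output of A is an input of B."""
    return sorted((a, d, b)
                  for a, x in blocks.items() for d in x["out"]
                  for b, y in blocks.items() if d in y["in"])


blocks = extract_blocks(SCRIPT)
edges = dataflow_edges(blocks)
print(edges)
```

The resulting edge list is the prospective provenance (the workflow model); nothing has been executed yet.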

YW annotations: Model your Workflow!

YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!

•  YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …

[Figure: YW workflow view of the X-ray data collection script.
Inputs: cassette_id, sample_score_cutoff, sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv), calibration_image (file:calibration.img).
Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity.
Data: run_log (file:run/run_log.txt), sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, rejection_log (file:/run/rejected_samples.txt), sample_id, energy, frame_number, raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw), corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img), total_intensity, pixel_count, corrected_image_path, collection_log (file:run/collected_images.csv).]

Paleoclimate Reconstruction (EnviRecon.org)

•  … explained using YesWorkflow!

[Figure: YW workflow view of the paleoclimate reconstruction R script, with steps GetModernClimate, SubsetAllData, CAR_Analysis_unique, CAR_Analysis_union, CAR_Reconstruction_union, CAR_Reconstruction_union_output and outputs ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif and ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif]

Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."

Get 3 views for the price of 1!

Process  view  

Data  view  

Combined  view  

Multi-Scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP)

[Figure: YW workflow view of the MsTMIP script.
Inputs: input_drough_variable, input_effect_variable.
Steps: fetch_drought_variable, fetch_effect_variable, convert_effect_variable_units, create_land_water_mask, init_data_variables, define_droughts, detrend_deseasonalize_effect_variable, calculate_data_variables, export_recovery_time_figure, export_drought_value_variable_figure, export_predrought_effect_variable_figure, export_drought_number_variable_figure.
Data: drought_variable_1, effect_variable_1, effect_variable_2, effect_variable_3, land_water_mask, sigma_dv_event, month_dv_length, predrought_effect_variable_1/2, drought_value_variable_1/2, recovery_time_variable_1/2, drought_number_variable_1/2.
Outputs: output_recovery_time_figure, output_drought_value_variable_figure, output_predrought_effect_variable_figure, output_drought_number_figure.]

Christopher Schwalm, Yaxing Wei

Figure 4: Process workflow view of an Affymetrix analysis script (in R).

4 YesWorkflow Examples

In the following we show YesWorkflow views extracted from real-world scientific use cases. The scripts were annotated with YW tags by scientists and script authors, using a very modest training and mark-up effort.¹ Due to lack of space, the actual MATLAB and R scripts with their YW markup are not included here. However, they are all available from the yw-idcc-15 repository on the YW GitHub site [Yes15].

4.1 Analysis of Gene Expression Microarray Data

Bioinformatics workflows commonly possess a pattern of large numbers of incoming parameters and outputs at each stage of computation. In addition, analysis of even a single bioinformatics dataset tends to yield a large number of different output files. Hence, bioinformatics pipelines are attractive candidates for workflow systems, which can capture this complexity [Bie12]. Figure 4 shows a YesWorkflow representation of an R script performing a classic, complex bioinformatics task: analysis of Affymetrix gene expression microarray data. This R script was modeled on our previous workflows developed in the Kepler environment [SMLB12]. The script analyzes experiment designs consisting of two conditions (e.g., microarrays from control-treated cells vs microarrays from drug-treated cells) with multiple replicates in each condition. The R script employs a set of standard BioConductor [GCB+04] packages mixed with custom programming. The workflow consists of four fundamental tasks: normalization of data across microarray datasets (Normalize), selection of differentially expressed genes (DEGs) between conditions (SelectDEGs), determination of gene ontology (GO) statistics for the resulting datasets (GO Analysis), and creation of a heatmap of the differentially expressed genes (MakeHeatmap). Each module produces outputs, and each module (aside from MakeHeatmap) requires external parameter inputs. Importantly, this graphical representation clearly indicates the dependence of each module on datasets and parameter inputs. This example demonstrates that YesWorkflow can provide informative visualizations of bioinformatics workflows, especially workflows involving large numbers of inputs and outputs.

¹ For all of these scripts, learning the YW model and annotating the scripts was done in a few hours.

Gene Expression Microarray Data Analysis

•  [Normalize]
   –  Normalization of data across microarray datasets
•  [SelectDEGs]
   –  Selection of differentially expressed genes between conditions
•  [GO Analysis]
   –  Determination of gene ontology statistics for the resulting datasets
•  [MakeHeatmap]
   –  Creation of a heatmap of the differentially expressed genes

Tyler Kolisnik, Mark Bieda

Data collection workflow (X-ray diffraction)

[Figure: YW workflow view of the X-ray data collection script, from initialize_run through load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set and transform_images to log_average_image_intensity, with URI templates file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw for raw_image and file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img for corrected_image]

YW-RECON: Prospective & Retrospective Provenance … (almost) for free!

•  URI-templates link conceptual entities to runtime provenance “left behind” by the script author …
•  … facilitating provenance reconstruction

[Figure: the YW workflow view of the X-ray data collection script next to the files and folders left by a run:]

run/
├── raw
│   └── q55
│       ├── DRT240
│       │   ├── e10000
│       │   │   ├── image_001.raw
│       │   │   ...
│       │   │   └── image_037.raw
│       │   └── e11000
│       │       ├── image_001.raw
│       │       ...
│       │       └── image_037.raw
│       └── DRT322
│           ├── e10000
│           │   ├── image_001.raw
│           │   ...
│           │   └── image_030.raw
│           └── e11000
│               ├── image_001.raw
│               ...
│               └── image_030.raw
├── data
│   ├── DRT240
│   │   ├── DRT240_10000eV_001.img
│   │   ...
│   │   └── DRT240_11000eV_037.img
│   └── DRT322
│       ├── DRT322_10000eV_001.img
│       ...
│       └── DRT322_11000eV_030.img
│
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt
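The "ls -R & RegEx matching" idea behind URI-template queries can be sketched in a few lines of Python. The two templates are the ones shown in the workflow view; the helper names `template_to_regex` and `bindings`, and the choice of `[^/_]+` as the pattern for a variable value, are assumptions made for this illustration and not YW-Recon's actual implementation.

```python
import re

RAW_TEMPLATE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED_TEMPLATE = "data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"


def template_to_regex(template):
    """Turn each {name} into a named group; literal text is escaped as-is."""
    # re.split with one capture group alternates literal text and variable names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern, seen = "", set()
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)
        elif part in seen:
            pattern += f"(?P={part})"          # repeated variable: must match again
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/_]+)"   # assumed shape of a variable value
    return re.compile("^" + pattern + "$")


def bindings(template, paths):
    """Variable bindings for every path that matches the template."""
    rx = template_to_regex(template)
    return [m.groupdict() for m in map(rx.match, paths) if m]


# A few paths as they would appear in a recursive listing of run/.
paths = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/run_log.txt",
]
found = bindings(RAW_TEMPLATE, paths)
print(found)
```

Each matching file yields one record of template variables, which is the runtime metadata that folder and file names carry.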


Data collection workflow: runtime data

[Figure: YW workflow view of the data collection script and the run/ directory listing, as shown earlier]

1. YW annotations => YW model
2. Files & Folders left by a run => runtime (meta-)data

Q1: What samples did the script run collect images from?

[Figure: YW workflow view of the data collection script and the run/ directory listing, as shown earlier]

Q2: What energies were used for image collection from sample DRT322?

[Figure: YW workflow view of the data collection script and the run/ directory listing, as shown earlier]
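Q1 and Q2 can be answered from file names alone. The sketch below assumes the raw-image paths have already been listed (only four of them are spelled out here) and uses a hand-written regular expression that mirrors the raw_image URI template; the variable names `facts`, `q1` and `q2` are illustrative.

```python
import re

# A subset of the raw-image paths left by the run (see the run/ listing).
RAW_PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

# Mirrors file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
RAW = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw$"
)

# One "fact" per file: the template variables recovered from its path.
facts = [m.groupdict() for m in map(RAW.match, RAW_PATHS) if m]

# Q1: samples from which images were collected.
q1 = sorted({f["sample_id"] for f in facts})

# Q2: energies used for sample DRT322.
q2 = sorted({int(f["energy"]) for f in facts if f["sample_id"] == "DRT322"})

print(q1, q2)
```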

Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?

[Figure: YW workflow view of the data collection script and the run/ directory listing, as shown earlier]

Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?

[Figure: YW workflow view of the data collection script and the run/ directory listing, as shown earlier]
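Q3 and Q5 need a join between two URI templates. The sketch below assumes, as the workflow view suggests, that transform_images turns a raw image into the corrected image with the same sample_id, energy and frame_number; the function name `raw_sources` and the hand-written patterns are illustrative only.

```python
import re

RAW_PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

RAW = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw$"
)
# sample_id occurs twice in the corrected-image template, hence the backreference.
CORRECTED = re.compile(
    r"data/(?P<sample_id>[^/]+)/(?P=sample_id)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img$"
)


def raw_sources(corrected_path, raw_paths):
    """Raw images (with their cassette id) that share all variables with the corrected image."""
    key = CORRECTED.match(corrected_path)
    if key is None:
        return []
    want = key.groupdict()
    hits = []
    for path in raw_paths:
        m = RAW.match(path)
        if m and all(m.group(name) == value for name, value in want.items()):
            hits.append((path, m.group("cassette_id")))
    return hits


q3 = raw_sources("data/DRT322/DRT322_11000eV_030.img", RAW_PATHS)
q5 = raw_sources("data/DRT240/DRT240_10000eV_001.img", RAW_PATHS)
print(q3, q5)
```

The raw path answers Q3; the cassette id that comes along with the match answers Q5.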


6th Stop: Provenance in Databases

•  Some key questions:
   –  Why is t in q(D)?
   –  Which set of tuples L in D does t depend on? i.e., what is the lineage of t?
   –  How was t derived from its lineage L?
•  Also:
   –  Where in D do the values in t come from?
   –  Why is t’ not in q(D)?
•  .. fasten your seatbelts …
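The first two questions can be made concrete on a toy database. Everything in the sketch below is hypothetical: the relations R and S, the tuple identifiers r1..s2 and the query (a join followed by a projection) are chosen only to show the difference between the witnesses of an answer (why) and their union (lineage).

```python
from collections import defaultdict

# Toy database D; every tuple carries an identifier so that it can be cited.
R = {"r1": ("ann", "cs"), "r2": ("ann", "lis"), "r3": ("bob", "cs")}   # (name, dept)
S = {"s1": ("cs", "siebel"), "s2": ("lis", "siebel")}                   # (dept, building)


def query_with_witnesses(R, S):
    """q(D) = project[name, building](R join S), keeping one witness per derivation."""
    witnesses = defaultdict(set)
    for rid, (name, dept) in R.items():
        for sid, (dept2, building) in S.items():
            if dept == dept2:
                witnesses[(name, building)].add(frozenset({rid, sid}))
    return witnesses


def lineage(witnesses, t):
    """All input tuples that occur in some derivation of t (empty if t is not an answer)."""
    return set().union(*witnesses[t]) if t in witnesses else set()


W = query_with_witnesses(R, S)
print(W[("ann", "siebel")])           # two alternative witnesses: why t is in q(D)
print(lineage(W, ("ann", "siebel")))  # the tuples t depends on
```

An absent answer such as ("carl", "siebel") has no witnesses at all, which is why the why-not question needs different machinery.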

Provenance in Databases (fine-grained, white-box)

Land of many different provenance species: Why? How? Where?

Later: Why-Not? How many? How long?

Compare with: Provenance in Scientific Workflows

•  Some key questions:
   –  What is the lineage/trace T of data product (output) yi: (y1, …, yn) = execute(W, x, p)?
      •  … given workflow/script W with inputs x and parameters p?
      •  … i.e., find the subset of x, p, and (program slices of) W on which a specific yi depends!
   –  How can we store, query the provenance (trace) graph effectively, efficiently?
      •  Regular Path Queries (RPQs), Lowest Common Ancestor (LCA)
      •  Temporal Query Languages (e.g. Past-Temporal Logic)
      •  other graph queries
   –  What is the difference between traces T1, T2?
   –  Does the trace (retrospective provenance) match the workflow (prospective provenance)?
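The lineage question for workflows is a reachability query on the trace graph. The sketch below uses a made-up trace (node names x1, p1, step1, … are hypothetical) stored as a "derived from" adjacency map, and computes the ancestors of an output with a plain graph traversal; real systems would use a graph database or RPQ engine instead.

```python
# Retrospective provenance as a graph: each node lists what it was derived from.
TRACE = {
    "y1": ["step2"],
    "y2": ["step3"],
    "step2": ["d1", "p1"],   # step2 used intermediate data d1 and parameter p1
    "step3": ["d1", "x2"],
    "d1": ["step1"],
    "step1": ["x1"],
}


def lineage(trace, node):
    """All nodes that `node` transitively depends on."""
    seen, stack = set(), [node]
    while stack:
        for parent in trace.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def common_ancestors(trace, a, b):
    """Shared history of two data products (a superset of their lowest common ancestors)."""
    return lineage(trace, a) & lineage(trace, b)


print(sorted(lineage(TRACE, "y1")))
print(sorted(common_ancestors(TRACE, "y1", "y2")))
```

Restricting the lineage of y1 to inputs and parameters gives {x1, p1}: the input x2 does not influence y1.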

Provenance in (Scientific) Workflows (“Coarse-grained”, “Black-box”)


What people do with “provenance”

•  Result validation
•  Result debugging (science vs wf logic)
•  Reproducibility and Repeatability
•  Explanation (derivations, traces, proof trees)
•  Runtime monitoring
   –  Profiling, benchmarking
•  Performance Optimization (“smart re-run”)
•  Fault-tolerance, crash-recovery
•  Database view maintenance (e.g. data warehousing)
•  Workflow design

Database Provenance: Some Pioneers …

•  Cui (PhD 2001), Widom: TODS’00, VLDB’03
•  Buneman et al. ICDT 2001 (citations: 1000+)

Provenance  Semirings:    The  Great  Database  Provenance  UnificaBon*!  

TJ  Green  et  al:  PODS’07,  

SIGMOD  Record’12  

44  

*Restrictions apply: positive queries only…

7th Stop: Provenance Polynomials
One Semiring to Rule them all! (Theory strikes!)

Green, Karvounarakis, Tannen. Provenance semirings, PODS, 2007

45

Example: Go from X to Y in 3 hops! (a = CS, b = NCSA, c = GSLIS)

•  Database: hop(X,Y) :=
   hop(a,a, p). hop(a,b, q). hop(b,a, r). hop(b,c, s).

•  Query:
   3hop(X,Y) :- hop(X,Z1), hop(Z1,Z2), hop(Z2,Y).

•  Result:
   3hop(a,a, p^3 + 2pqr). 3hop(a,b, p^2 q + q^2 r). … 3hop(a,c, pqs).

Note: Cannot go from c to a in 3 hops!

[Figure: input graph with edges a→a (p), a→b (q), b→a (r), b→c (s); result graph with edges a→a (ppp+pqr+qrp), a→b (ppq+qrq), a→c (pqs), b→a (ppr+qrr), b→b (rpq), b→c (rqs)]
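The 3-hop example can be replayed mechanically. The following is a minimal Python sketch, not the authors' implementation: it assumes a polynomial can be represented as a `Counter` from monomials (sorted tuples of edge variables) to coefficients, and the helper names `three_hop` and `show` are made up for this illustration.

```python
from collections import Counter
from itertools import product

# Annotated input relation from the slide: hop(X, Y) with edge variables p, q, r, s.
HOP = {("a", "a"): "p", ("a", "b"): "q", ("b", "a"): "r", ("b", "c"): "s"}

def three_hop(hop):
    """Return {(x, y): Counter}; each Counter maps a monomial to its coefficient.

    A monomial is a sorted tuple of edge variables, so ('p', 'q', 'r') stands
    for pqr (multiplication in the provenance semiring is commutative).
    """
    result = {}
    for e1, e2, e3 in product(hop.items(), repeat=3):
        (x, z1), v1 = e1
        (z1b, z2), v2 = e2
        (z2b, y), v3 = e3
        if z1 == z1b and z2 == z2b:  # join conditions of the rule body
            monomial = tuple(sorted((v1, v2, v3)))
            result.setdefault((x, y), Counter())[monomial] += 1
    return result

def show(poly):
    """Render a polynomial in the style p^3 + 2pqr."""
    terms = []
    for monomial, coeff in sorted(poly.items()):
        powers = Counter(monomial)
        body = "".join(v if n == 1 else f"{v}^{n}" for v, n in sorted(powers.items()))
        terms.append(body if coeff == 1 else f"{coeff}{body}")
    return " + ".join(terms)

prov = three_hop(HOP)
print(show(prov[("a", "a")]))   # the three derivations ppp, pqr, qrp collapse to two monomials
print(("c", "a") in prov)       # no 3-hop path from c to a, so no polynomial at all
```

Note that a missing answer such as 3hop(c,a) simply has no entry: the polynomial view says nothing about why it is absent, which is the gap the later stops address.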

46  

Provenance  Polynomials      

“Mein Schatz!” (“My precious!”)

     p^3 + 2pqr

     p^3 + pqr          p + 2pqr

     p + pqr

     pqr

     p + pqr

     p

[Figure: 3hop result graph with provenance polynomials, as in the example above]

47  

8th Stop: The Negation & Why-Not Problem

•  Provenance Semirings work well for:
–  Positive Queries (e.g., RA+)

•  Challenges: Handling of
–  set difference (~ negation)
–  Why-Not provenance
–  Missing Answer provenance

•  A fresh look at provenance!
•  … using an old idea: Game semantics!
–  for query evaluation

48

Query evaluation game

EDB: e(a,b), e(b,b)

tc(X,Y) :- e(X,Y).      # (1)--e(X,Y)-->(2)
tc(X,Y) :-              # (1)--exists:Z-->(3)
    e(X,Z),             # (3)->(4)-e(X,Z)->(5)
    tc(Z,Y).            # (3)--X:=Z-->(1)

[Figure: game diagram with positions 1–5 and moves labeled e(X,Y), exists:Z, e(X,Z), X := Z; instantiated move graph over the EDB with positions such as 1:(a,b), 2:(a,b), 3:(a,b,b), 4:(a,b), 5:(a,b)]

Provenance’12  @Dagstuhl      with  JanVdB  TJ  Green        

Flum, Kubierschky, Ludäscher, Total and partial well-founded Datalog coincide, ICDT-The-Bag-1997, Delphi, Greece

Eureka!

49  

EDB: e(a,b), e(b,b)

[Figure: game diagram and instantiated move graph for the tc rules on the previous slide]

Flum, Kubierschky, Ludäscher, Total and partial well-founded Datalog coincide, ICDT-The-Bag-1997, Delphi, Greece

50

Eureka moment:
1. query evaluation = evaluation game (argument about truth in a database)
2. provenance = winning strategies (justified/winning arguments)

9th Stop: A Game

[Figure: game graph with positions a–h and k–n; edges are moves]

51

Solving the Game

•  All successors won ⇒ position lost
•  Some successor lost ⇒ position won

1. All leaves (dead-ends) are immediately lost!
2. X is won if there exists a move to a lost Y
3. X is lost if all moves lead to a won Y
4. Repeat until no change ⇒ drawn positions remain

56
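The solving steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `solve_game` and the example graph are made up (the edges of the graph on the slides are not recoverable from the transcript), and the naive loop below is quadratic in the worst case rather than linear.

```python
def solve_game(moves):
    """Label every position as 'W' (won), 'L' (lost) or 'D' (drawn).

    `moves` maps a position to the list of positions it can move to.
    """
    positions = set(moves) | {y for ys in moves.values() for y in ys}
    succ = {x: list(moves.get(x, [])) for x in positions}
    label = {}
    changed = True
    while changed:                      # repeat until no change
        changed = False
        for x in positions:
            if x in label:
                continue
            if any(label.get(y) == "L" for y in succ[x]):
                label[x] = "W"          # some successor is lost
                changed = True
            elif all(label.get(y) == "W" for y in succ[x]):
                label[x] = "L"          # all successors won (also covers dead ends)
                changed = True
    for x in positions:
        label.setdefault(x, "D")        # whatever remains unlabeled is drawn
    return label

# Hypothetical example graph (not the one shown on the slides).
moves = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "e": ["f"], "f": ["e"]}
labels = solve_game(moves)
print(sorted(labels.items()))
```

Here d is a dead end and therefore lost, b and c can move to d and are won, a can only move to won positions and is lost, and the cycle between e and f is never decided, so both stay drawn.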

10th Stop: Game Provenance

[Figure: solved game graph over positions a–h and k–n; nodes are annotated with 1, 2, 3, or ∞]

•  Game can be solved in time linear in |Move|

•  One rule to rule them all!
   win(X) :- move(X,Y), not win(Y)

•  node color => edge color
–  good vs bad moves

•  good moves = natural, new notion of provenance!

Aside: Games ~ Argumentation Frameworks
   win(X) :- move(X,Y), not win(Y)
   def(X) :- attacks(Y,X), not def(Y)

Eureka!

57  

Game  Provenance  

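Move types (from position x to position y):

              y won (W)       y drawn (D)     y lost (L)
x won (W)     bad             bad             winning (G)
x drawn (D)   bad             drawing (Y)     n/a
x lost (L)    delaying (R)    n/a             n/a

[Figure: solved game graph over positions a–h and k–n; nodes are annotated with 1, 2, 3, or ∞]

Extracting Provenance:
✓ Why/how win(x)?
•  [x] –G.(R.G)*–> [y]

✓ Why-not win(x)?
•  [x] –(R.G)*–> [y]
•  [x] –(Y+)–> [y]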
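As a rough illustration of how move types turn a solved game into provenance, the Python sketch below follows only winning and delaying moves from a start position, which yields the alternating G/R paths named by the regular path queries. The names `MOVE_TYPE` and `game_provenance`, the example graph, and its labels are assumptions made for this sketch.

```python
# Move types keyed by (label of source, label of target); other pairs cannot occur.
MOVE_TYPE = {
    ("W", "L"): "winning",    # G
    ("L", "W"): "delaying",   # R
    ("D", "D"): "drawing",    # Y
    ("W", "W"): "bad", ("W", "D"): "bad", ("D", "W"): "bad",
}

def game_provenance(moves, label, start):
    """Collect the edges reachable from `start` via winning/delaying moves only."""
    edges, todo, seen = [], [start], {start}
    while todo:
        x = todo.pop()
        for y in moves.get(x, []):
            kind = MOVE_TYPE.get((label[x], label[y]), "n/a")
            if kind in ("winning", "delaying"):   # good moves; bad moves are dropped
                edges.append((x, kind, y))
                if y not in seen:
                    seen.add(y)
                    todo.append(y)
    return sorted(edges)

# Hypothetical solved game: a is lost, b and c are won, d is lost, e and f are drawn.
moves = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "e": ["f"], "f": ["e"]}
label = {"a": "L", "b": "W", "c": "W", "d": "L", "e": "D", "f": "D"}
why_not_a = game_provenance(moves, label, "a")
print(why_not_a)
```

The move from c to e is a bad move (won to drawn) and is therefore not part of the why-not explanation for a.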

58  

Game Provenance

[Figure: solved game graph and provenance extraction queries, as on the previous slide]

•  Next: play a query evaluation game

•  => new why-(not) provenance via games!

59  

11th Stop: Provenance (or Query Evaluation) Games Construction

“SLD-resolution game”
Next (Example):

A(X) :- B(X,Y,Z) … not C(X,Y) …

Eureka!

60  

Translation: Q(I) => G_Q(I)

[Figure 3(a): Game template for Q_ABC: A(X) :- B(X,Y), ¬C(Y).]

[Figure 3(b): Instantiated Q_ABC game on I = {B(a,b), B(b,a), C(a)}.]

[Figure 3(c): Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not).]

Figure 3: Provenance game for Q_ABC. The well-founded model of win(X) :- M(X,Y), ¬win(Y), applied to move graph M, solves the game.

… the new binding for X; a condition “B(X,Y)” means that a move is possible only if B(X,Y) is true in I for the current X, Y values.²

Given database I, a template can be instantiated yielding a game graph G_Q(I) as in Fig. 3b. Note how template variables (e.g., Y) have been replaced by domain values (a or b), and that conditional edges (e.g., labeled “C(X)”) became unconditional edges (e.g., C(a) → r_C(a)) or no edge at all (e.g., from C(b)), depending on whether or not the condition holds in I. To extract why(-not) provenance from a game graph G_Q(I) as in Fig. 3b, we need to solve the game first, i.e., determine which positions are won (light green) or lost (dark red); see Fig. 3c. There is a surprisingly simple and elegant solution: the (unstratified) Datalog¬ rule Q_wm :=

    win(X) :- move(X,Y), ¬win(Y)

when evaluated under the well-founded semantics [VGRS91] solves the game! Thus we can use Q_wm as a “game engine” to solve the provenance game with a move relation given by G_Q(I).³

Finally, the solved game is a labeled version of the graph G_Q(I), i.e., each node carries a new label indicating whether a position is won (light green) or lost (dark red). As shown in [KLZ13], only edges from won to lost nodes (green) and lost to won nodes (red) are part of the provenance; other edges (grey, dashed in Fig. 3c) correspond to “bad moves” (invalid arguments in the query evaluation game) and are excluded from the provenance. The provenance subgraph of the solved game thus consists only of edges that are matched by the regular path queries (g.r)+ and r.(g.r)*, i.e., alternating sequences of green (winning) and red (delaying) moves [KLZ13].

² Readers familiar with logic programming semantics may recognize that provenance games mimic a form of SLD(NF) resolution.
³ Indeed, both our prototypes use Q_wm to compute (constraint) provenance.

[Figure 4: Altered subgraph of Fig. 3c after adding c to the active domain.]

[Figure 5: Constraint provenance game for Q_ABC. Unlike in Figure 3, nodes may represent finite or infinite sets here.]

3. Constraint Provenance Games

Consider the solved game graph of Fig. 3c. If the value c were added to the active domain, the provenance would be incomplete: e.g., to explain why-not A(b) there are two ∃a, ∃b branches emanating from A(b). However, with c in the active domain there is a third ∃c branch via r2(b, c): see Fig. 4. We show that a modified game construction (Fig. 5) based on constraints can be used to automatically include such extensions of the active domain, thereby eliminating the domain dependence of the original approach.

Similarly, one could conclude from Fig. 2 that the absence of 3hop(c, a) from the query answer is due entirely to the absence of hop(a, c), hop(c, a), hop(c, c), hop(c, b), and hop(b, b). Also this explanation, however, is complete only relative to the active domain: if d was introduced into the domain, new why-not answers such as r1(c, a, d, d) would have to be added to the provenance graph in Fig. 2. The new version of the provenance game (Fig. 9), however, takes care of this via a more general constraint node R1 : X≠a, X≠b, Z1≠c, Z1≠a, Z1≠b, Z2≠c, Z2≠a, Z2≠b, Y≠c.

In constraint provenance games, nodes stand for sets of ground nodes. A constraint tuple such as “3hop(x, y): x=a, y=b” may stand for a single tuple (here: 3hop(a, b)), or for (possibly infinitely) many: e.g., “3hop(x, y): x≠a, x≠b, y=a” stands for the set { 3hop(x, a) | x ∈ D \ {a, b} } over any underlying domain D (finite or infinite).
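To make the idea of a constraint tuple concrete, here is a small Python sketch. It assumes a deliberately simple representation (one equality or one set of disequalities per variable); the class name `ConstraintTuple` and its fields are invented for this illustration and are not taken from the paper's prototype.

```python
from dataclasses import dataclass, field

@dataclass
class ConstraintTuple:
    """A relation name plus equalities and disequalities over its variables."""
    relation: str
    equals: dict = field(default_factory=dict)       # e.g. {"y": "a"} for y = a
    not_equals: dict = field(default_factory=dict)   # e.g. {"x": {"a", "b"}} for x != a, x != b

    def covers(self, **binding):
        """True if the ground tuple given by `binding` satisfies every constraint."""
        for var, const in self.equals.items():
            if binding[var] != const:
                return False
        for var, excluded in self.not_equals.items():
            if binding[var] in excluded:
                return False
        return True

# "3hop(x, y): x != a, x != b, y = a" from the text above.
t = ConstraintTuple("3hop", equals={"y": "a"}, not_equals={"x": {"a", "b"}})
print(t.covers(x="c", y="a"))   # inside the active domain {a, b, c}
print(t.covers(x="d", y="a"))   # a new constant d is covered without recomputation
print(t.covers(x="a", y="a"))   # excluded by x != a
```

The point of the representation is the second call: a value outside the active domain is handled by the same finite object, so nothing has to be recomputed when the domain grows.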

61  

Solve G_Q(I) => Provenance!

[Figure: same paper excerpt as on the previous slide (Figures 3–5 and Section 3 of “Towards Constraint Provenance Games”)]
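The instantiate-then-solve pipeline for Q_ABC can be sketched in Python as follows. This is a simplified sketch under stated assumptions: node names such as `("g12", x, y)` mirror the labels in Figure 3, the ¬A and ¬C literal nodes of the template are omitted because they are not needed to decide A(x), and `q_abc_game` and `solve` are hypothetical helper names.

```python
from itertools import product

def solve(moves):
    """Won if some successor is lost; lost if all successors are won; else drawn."""
    nodes = set(moves) | {y for ys in moves.values() for y in ys}
    label, changed = {}, True
    while changed:
        changed = False
        for x in nodes - set(label):
            succ = moves.get(x, [])
            if any(label.get(y) == "L" for y in succ):
                label[x], changed = "W", True
            elif all(label.get(y) == "W" for y in succ):
                label[x], changed = "L", True
    return {x: label.get(x, "D") for x in nodes}

def q_abc_game(B, C, domain):
    """Move graph for r2: A(X) :- B(X,Y), not C(Y), instantiated over `domain`."""
    moves = {}
    def add(src, dst):
        moves.setdefault(src, []).append(dst)
    for x, y in product(domain, repeat=2):
        add(("A", x), ("r2", x, y))          # pick a value y for the ∃-quantified Y
        add(("r2", x, y), ("g12", x, y))     # opponent attacks subgoal B(x, y)
        add(("r2", x, y), ("g22", y))        # ... or subgoal not C(y)
        add(("g12", x, y), ("notB", x, y))
        add(("notB", x, y), ("B", x, y))
        if (x, y) in B:
            add(("B", x, y), ("rB", x, y))   # fact node: a dead end, hence lost
    for y in domain:
        add(("g22", y), ("C", y))
        if y in C:
            add(("C", y), ("rC", y))
    return moves

# I = {B(a,b), B(b,a), C(a)} as in Figure 3(b).
game = q_abc_game(B={("a", "b"), ("b", "a")}, C={"a"}, domain=["a", "b"])
labels = solve(game)
print(labels[("A", "a")], labels[("A", "b")])
```

The solved labels agree with the caption of Figure 3(c): A(a) is won (true) and A(b) is lost (false). Because the query is non-recursive, no position ends up drawn.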

62  

Happy  End  (1  of  3)  

Towards Constraint Provenance Games

Sean Riddle, Sven Köhler, Bertram Ludäscher
Department of Computer Science, University of California, Davis, CA 95616

{swriddle, svkoehler, ludaesch}@ucdavis.edu

Abstract

Provenance for positive queries is well understood and elegantly handled by provenance semirings [GKT07], which subsume many earlier approaches. However, the semiring approach does not extend easily to why-not provenance or, more generally, first-order queries with negation. An alternative approach is to view query evaluation as a game between two players who argue whether, for given database I and query Q, a tuple t is in the answer Q(I) or not. For first-order logic, the resulting provenance games [KLZ13] yield a new provenance model that coincides with provenance semirings (how provenance) on positive queries, but also is applicable to first-order queries with negation, thus providing an elegant, uniform treatment of earlier approaches, including why-not provenance and negation. In order to obtain a finite answer to a why-not question, provenance games employ an active domain semantics and enumerate tuples that contribute to failed derivations, resulting in a domain dependent formalism. In this paper, we propose constraint provenance games as a means to address this issue. The key idea is to represent infinite answers (e.g., to why-not questions) by finite constraints, i.e., equalities and disequalities.

1. Introduction

Consider the relation hop(x, y) in Fig. 1a and query Q_3hop :=

    r1 : 3hop(X,Y) :- hop(X,Z1), hop(Z1,Z2), hop(Z2,Y).

Q_3hop asks for pairs of nodes that are reachable via exactly three edges (“hops”). If we ask why and how a tuple such as 3hop(a, a) came about, we can use polynomials over a provenance semiring [GKT07, KG12] to get a precise answer, here: p^3 + 2pqr. In Fig. 1a we see that one can “go” from node a to itself in three hops in distinct ways: (i) by using the edge p (= hop(a, a), a self-loop) three times: p·p·p, or p^3 for short, (ii) by using the p edge once, followed by q (= hop(a, b)) and then r (= hop(b, a)), so p·q·r, or (iii) by following q, r, and then p, i.e., q·r·p. Since semiring provenance is commutative, p·q·r + q·r·p = 2pqr as shown in the figure. Many prior provenance approaches can be understood as special provenance semirings: e.g., Trio provenance [BSHW06], why-provenance [BKT01], and lineage [CWW00] all yield coarser versions of the provenance p^3 + 2pqr of 3hop(a, a), i.e., p + 2pqr, p + pqr, and pqr, respectively [KG12].

Provenance through Games. In Fig. 1c we see that 3hop(c, a) is absent, so 3hop(c, a) is false. We cannot use semiring provenance to explain why-not, since the approach is not defined for negative queries and extensions for negation (or set-difference) are not obvious [GP10, GIT11, ADT11a, ADT11b]. On the other hand, if an approach can explain the provenance of ¬A, this naturally provides a why-not explanation for A. In [KLZ13] we proposed an alternative model of provenance that naturally supports negation. Consider the graph in Fig. 1d. It can be understood as the move graph of a query evaluation game in which two players argue whether or not a tuple t ∈ Q(I). If a player wants to prove that t = 3hop(a, a) is in Q_3hop, she needs to move to a ground rule r with t in the head, thereby claiming that this rule instance is deriving t. In Fig. 1d, there are three choices, starting from the root node 3hop(a, a): the move to r1(a, a, a, a), to r1(a, a, a, b), or to r1(a, a, b, a). Here r1(x, y, z1, z2) identifies ground instances of r1. There are two ∀-quantified variables X and Y occurring in the head and body, and two (implicitly) ∃-quantified variables Z1 and Z2, occurring only in r1’s body. By moving to a ground instance of r1 in the game, the player tries to pick values for the ∃-quantified variables that make the rule body true while deriving t in the head. For r1, the middle edge hop(z1, z2) fixes the bindings of Z1 and Z2. For the given database instance I, there are three choices that “work”: (a, a), (a, b), and (b, a). This means that there are exactly three different ways to obtain 3hop(a, a) via r1 over input I: if we choose the p-hop (a, a) as the middle edge, we have p·p·p; for the q-hop (a, b) we have p·q·r; and for the r-hop (b, a) we have q·r·p.¹ The opponent can now challenge each of these claims, by selecting a subgoal

¹ Game provenance [KLZ13] can distinguish p·q·r and q·r·p and is thus even more fine-grained than the provenance semirings in [GKT07, KG12].

(a) input I: graph with edges a→a (p), a→b (q), b→a (r), b→c (s)

(b) ... annotated:

    hop
    a  a  p
    a  b  q
    b  a  r
    b  c  s

(c) 3hop with provenance:

    3hop
    a  a  p^3 + 2pqr
    a  b  p^2 q + q^2 r
    a  c  pqs
    b  a  p^2 r + q r^2
    b  b  pqr
    b  c  qrs

[Figure 1(d): The game provenance of 3hop(a, a) ...]

[Figure 1(e): ... is p^3 + 2pqr.]

Figure 1: Each edge hop(x, y) in the input graph I in (a) is annotated (p, q, r, ...) in (b). The answer to Q_3hop is shown in (c) with provenance polynomials [KG12]. The game provenance [KLZ13], e.g., of 3hop(a, a) in (d) corresponds to the semiring provenance polynomial in (c): see (e).

Provenance Game on G_Q(I) = Provenance Polynomials … for positive queries!

Yes!

63  

Happy  End  (2  of  3)  

… but also works for Why-Not provenance & non-monotonic queries (i.e., Q can have negation)!!
Here: not 3hop(c,a) – can’t go back from GSLIS (c) to CS (a)

[Figure 2: Why-not provenance for 3hop(c, a) using provenance games.]

gi1 in the body of r1, thus claiming that gi1 is false and hence that the r1 instance doesn’t derive t. The first player can counter and demonstrate that gi1 is true by selecting a rule instance or fact as evidence for gi1. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it was shown how the provenance of a tuple t can be obtained via a regular path query over a solved game graph like the one in Fig. 1d: e.g., p^3 + 2pqr for 3hop(a, a) is represented by a solved game as shown in Fig. 1e: for positive queries, solved games represent semiring provenance by noting that won (green) and lost (red) positions correspond to “+” and “×” operations, respectively (leaves represent input annotations, here: p, q, r, s) [KLZ13].

Why-Not Provenance and the Many Ways to Fail. Since games are inherently symmetric (one player’s win is the opponent’s loss and vice versa), the approach yields an elegant provenance model that unifies why and why-not provenance. Consider the (dark, red) node 3hop(c, a) in Fig. 2. The color coding indicates that the position 3hop(c, a) is lost (the atom is false), i.e., all outgoing moves to a node r1(x, y, z1, z2) lead to a position that is won for the opponent. There are 9 such positions, e.g., r1(c, a, c, b) is one of them (third from the right). Recall that an instance of r1 means that one can do a 3-hop from x to y (here: c to a) via intermediate nodes z1 and z2 (here: c and b). However, in the given database I in Fig. 1(a), there is no hop(c, z) – neither for z = b nor for any other z, since there are no outgoing moves from c. In this case, the opponent can successfully attack the goals in the body. Note how the why-not provenance of 3hop(c, a) in Fig. 2 is similar but different from the why provenance of 3hop(a, a) in Fig. 1: In order to show that 3hop(c, a) is false, one has to show that all possible ways that it could be true are failing, i.e., for all z1, z2, the ground instances r1(c, a, z1, z2) do not derive 3hop(c, a) (since at least one goal in r1’s body is always false). In contrast, to prove that 3hop(a, a) is true, it is sufficient to find some ground instance r1(a, a, z1, z2) whose body is true. Earlier we saw that there are exactly three such instances, corresponding to p·p·p + p·q·r + q·r·p (= p^3 + 2pqr).

Domain Dependence of Provenance Games. As seen, 3hop(a, a) has three derivations, represented by the first provenance polynomial in Fig. 1(c) and the game provenance in Fig. 1(d) and (e). How many ways are there to show that 3hop(c, a) is false (why-not provenance), or equivalently, that ¬3hop(c, a) is true? If we annotate the leaves of the game graph in Fig. 2 with identifiers u1, . . . , u5 for the five different hop tuples missing in I, we can construct a provenance expression that represents the many ways why 3hop(c, a) is not in the answer. While this answer provides a comprehensive, instance-based why-not explanation, it also exhibits a problem with the current approach: In order to obtain finite (why and why-not) provenance answers for all first-order queries, game provenance employs an active domain semantics: e.g., the provenance game for Q_3hop(I) considers only ground instances of r1 over the active domain adom(I) = {a, b, c}. If additional elements d, e, . . . are added to I (e.g., via a disconnected graph component), the why-not provenance in Fig. 2 becomes incomplete and the provenance has to be recomputed for the larger domain.

Constraint Provenance Games. We propose to solve the problem of domain dependence by modifying provenance games so that they can handle certain infinite relations that can be finitely represented. For example, in addition to the finitely many reasons why 3hop(c, a) fails over the active domain adom(I), there are infinitely many others, if we consider new constants d, e, . . . outside of adom(I). For example, let relation R = {a, b} have two tuples R(a) and R(b). If we want to know why-not R(c), we just point to c ∉ R. But we could also return a more general answer for why-not R(x) and say that ¬R(x) is true for all x with x ≠ a ∧ x ≠ b (not just for x = c). This approach is inspired by Chan’s Constructive Negation [Cha88], a form of constraint logic programming [Stu95]. The key idea is to represent (potentially infinite) relations through constraints, i.e., Boolean combinations of equalities x = c and disequalities x ≠ c.

Overview and Contributions. Section 2 briefly explains how first-order queries are translated into games and how provenance is extracted from solved games. In Section 3 we describe the construction of constraint provenance games; additional details and examples are contained in the appendix. Our main contributions are: (i) game provenance provides a uniform treatment of why and why-not provenance for first-order logic (= relational algebra with set-difference); (ii) for positive queries, the approach captures the most informative semiring provenance [GKT07, KG12]; (iii) we develop a constraint provenance framework which yields domain independent provenance expressions, extending prior results [KLZ13]; and (iv) we implemented a prototype of constraint provenance games.

2. Provenance through Games

We first sketch how a query Q over database I gives rise to a game G_Q(I) and how to obtain provenance from the solved game. Consider, e.g., input relations B(X,Y) and C(Y) and a relational query Q_ABC with set-difference: A = π_X(B ⋈ (π_Y(B) \ C)). It is well-known that any relational algebra query can be translated into a non-recursive Datalog¬ program. Here, we have Q_ABC =

    r2 : A(X) :- B(X,Y), ¬C(Y).

The key idea of provenance games is to understand query evaluation as a game between players I and II who argue whether or not a tuple is in the answer. In [KLZ13] we showed that the solved game is a representation of why (why-not) provenance of answer tuples (missing tuples), respectively. Fig. 3a shows the game template for Q_ABC: to prove that A(x) is true, player I needs to find a rule instance of r2, say A(x) :- B(x, y), ¬C(y), which derives the desired tuple A(x) and whose choice y for the ∃-quantified variable Y in the body satisfies all literals (subgoals) in the rule body. In the game template in Fig. 3a this corresponds to a move from A(X) to r2(X,Y) while choosing a suitable domain value y for the ∃-quantified variable Y. Player II can challenge this claim by “attacking” one of the subgoals g in the rule body. If player I chose the “wrong” y for the instance r2(x, y), then II can always attack at least one subgoal that falsifies the body. The game continues in turns, until a player cannot move and loses, and the opponent wins.

A game template G_Q for query Q contains literal nodes (oval; for atoms or their negation), rule nodes (boxes; for Datalog¬ rules), and goal nodes (rounded boxes; subgoals of rules): see Fig. 3a. Edge labels indicate a condition for a move: e.g., the label “∃Y” between a literal node, say A(X), and a rule node, say r2(X,Y), requires a player to pick a value y for the ∃-quantified variable Y when moving from an atom to the rule that derives it. Similarly, a condition “X:=Y” means that the current choice of Y becomes

Yes!

64

Happy  End  (2  of  3)  

5  leaf  nodes  ~    5  missing  (“hypotheBcal”)  edges    Insert  those    =>  3hop(c,a)  will  be  true!    

g21(c, a)

¬3hop(c, a)

g21(c, c)g11(c, c)

r1(c, a, c, b)

¬hop(c, b)

hop(c, a)

g21(b, b)

¬hop(a, c)

hop(c, c)

g11(c, a)

r1(c, a, b, c)r1(c, a, a, b)

3hop(c, a)

hop(b, b)

g21(c, b)g21(a, c)

r1(c, a, a, c)

¬hop(c, c)

hop(c, b)

¬hop(c, a)

g11(c, b)

r1(c, a, b, b)

¬hop(b, b)

g31(c, a)

r1(c, a, a, a) r1(c, a, b, a)

hop(a, c)

r1(c, a, c, a) r1(c, a, c, c)

9 a,b 9 a,c 9 c,a 9 c,c9 b,c 9 b,b9 b,a9 a,a 9 c,b

Figure 2: Why-not provenance for 3hop(c, a) using provenance games.

gi1 in the body of r1, thus claiming that gi1 is false and hence thatthe r1 instance doesn’t derive t. The first player can counter anddemonstrate that gi1 is true by selecting a rule instance or fact asevidence for gi1. The game proceeds in rounds until some playercannot move and thus loses (the opponent wins). In [KLZ13] itwas shown how the provenance of a tuple t can be obtained via aregular path query over a solved game graph like the one in Fig. 1d:e.g., p3 + 2pqr for 3hop(a, a) is represented by a solved gameas shown in Fig. 1e: for positive queries, solved games representsemiring provenance by noting that won (green) and lost (red) po-sitions correspond to “+” and “⇥” operations, respectively (leavesrepresent input annotations, here: p, q, r, s) [KLZ13].

Why-Not Provenance and the Many Ways to Fail. Since gamesare inherently symmetric (one player’s win is the opponent’s lossand vice versa), the approach yields an elegant provenance modelthat unifies why and why-not provenance. Consider the (dark, red)node 3hop(c, a) in Fig. 2. The color coding indicates that the posi-tion 3hop(c, a) is lost (the atom is false), i.e., all outgoing movesto a node r1(x, y, z1, z2) lead to a position that is won for the oppo-nent. There are 9 such positions, e.g., r1(c, a, c, b) is one of them(third from the right). Recall that an instance of r1 means that onecan do a 3-hop from x to y (here: c to a) via intermediate nodesz1 and z2 (here: c and b). However, in the given database I inFig. 1(a), there is no hop(c, z) – neither for z = b nor for any otherz, since there are no outgoing moves from c. In this case, the op-ponent can successfully attack the goals in the body. Note how thewhy-not provenance of 3hop(c, a) in Fig. 2 is similar but differentfrom the why provenance of 3hop(a, a) in Fig. 1: In order to showthat 3hop(c, a) is false, one has to show that all possible ways thatit could be true are failing, i.e., for all z1, z2, the ground instancesr1(c, a, z1, z2) do not derive 3hop(c, a) (since at least one goal inr1’s body is always false). In constrast, to prove that 3hop(a, a)is true, it is sufficient to find some ground instance r1(a, a, z1, z2)whose body is true. Earlier we saw that there are exactly three suchinstances, corresponding to p ·p ·p+p ·q ·r+q ·r ·p (= p3+2pqr).

Domain Dependence of Provenance Games. As seen, 3hop(a, a) has three derivations, represented by the first provenance polynomial in Fig. 1(c) and the game provenance in Fig. 1(d) and (e). How many ways are there to show that 3hop(c, a) is false (why-not provenance), or equivalently, that ¬3hop(c, a) is true? If we annotate the leaves of the game graph in Fig. 2 with identifiers u1, . . . , u5 for the five different hop tuples missing in I, we can construct a provenance expression that represents the many ways why 3hop(c, a) is not in the answer. While this answer provides a comprehensive, instance-based why-not explanation, it also exhibits a problem with the current approach: In order to obtain finite (why and why-not) provenance answers for all first-order queries, game provenance employs an active domain semantics: e.g., the provenance game for Q_3hop(I) considers only ground instances of r1 over the active domain adom(I) = {a, b, c}. If additional elements d, e, . . . are added to I (e.g., via a disconnected graph component), the why-not provenance in Fig. 2 becomes incomplete and the provenance has to be recomputed for the larger domain.
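The domain dependence can be made concrete by enumerating the failing ground instances of r1 for 3hop(c, a). This is a sketch under the same assumptions as the text (hop edges a→a, a→b, b→a, b→c, and r1 chaining three hops); failing_instances is a hypothetical helper, not part of the prototype.

```python
from itertools import product

HOP = {("a", "a"), ("a", "b"), ("b", "a"), ("b", "c")}  # assumed edges p, q, r, s

def failing_instances(x, y, domain):
    """Map each failing r1(x, y, z1, z2) to the hop tuples its body is missing."""
    failures = {}
    for z1, z2 in product(sorted(domain), repeat=2):
        body = [(x, z1), (z1, z2), (z2, y)]
        missing = [e for e in body if e not in HOP]
        if missing:
            failures[(x, y, z1, z2)] = missing
    return failures

adom = {"a", "b", "c"}
over_adom = failing_instances("c", "a", adom)
missing_tuples = {e for ms in over_adom.values() for e in ms}
over_larger = failing_instances("c", "a", adom | {"d"})

print(len(over_adom), sorted(missing_tuples), len(over_larger))
```

Over adom(I) there are nine failing instances that together blame five missing hop tuples; adding a single new constant d raises the number of failing instances to sixteen, so an explanation computed over the old domain is no longer complete.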

Constraint Provenance Games. We propose to solve the problem of domain dependence by modifying provenance games so that they can handle certain infinite relations that can be finitely represented. For example, in addition to the finitely many reasons why 3hop(c, a) fails over the active domain adom(I), there are infinitely many others, if we consider new constants d, e, . . . outside of adom(I). For example, let relation R = {a, b} have two tuples R(a) and R(b). If we want to know why-not R(c), we just point to c ∉ R. But we could also return a more general answer for why-not R(x) and say that ¬R(x) is true for all x with x ≠ a ∧ x ≠ b (not just for x = c). This approach is inspired by Chan’s Constructive Negation [Cha88], a form of constraint logic programming [Stu95]. The key idea is to represent (potentially infinite) relations through constraints, i.e., Boolean combinations of equalities x = c and disequalities x ≠ c.
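The R = {a, b} example can be written down directly. The sketch below is illustrative only: the Atom class, admits and why_not_R are hypothetical names, and a constraint is modeled as a plain conjunction (a list) of equalities and disequalities.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    var: str
    op: str      # "=" or "!="
    const: str

    def holds(self, binding):
        same = binding[self.var] == self.const
        return same if self.op == "=" else not same

def admits(conjunction, binding):
    """A conjunction admits a binding if every (dis)equality holds."""
    return all(atom.holds(binding) for atom in conjunction)

# R = {a, b}: one disjunct per tuple, and a single disjunct for the complement.
R_POS = [[Atom("x", "=", "a")], [Atom("x", "=", "b")]]
R_NEG = [[Atom("x", "!=", "a"), Atom("x", "!=", "b")]]

def why_not_R(value):
    """Return the negative constraints that explain why R(value) is absent."""
    return [c for c in R_NEG if admits(c, {"x": value})]

print(why_not_R("c"))   # explained by x != a, x != b
print(why_not_R("d"))   # same explanation, although d is outside adom
print(why_not_R("a"))   # empty: R(a) is present
```

The same constraint answers why-not R(c) and why-not R(d), which is the point of the construction: the explanation does not have to be recomputed when the domain grows.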

Overview and Contributions. Section 2 briefly explains how first-order queries are translated into games and how provenance is extracted from solved games. In Section 3 we describe the construction of constraint provenance games; additional details and examples are contained in the appendix. Our main contributions are: (i) game provenance provides a uniform treatment of why and why-not provenance for first-order logic (= relational algebra with set-difference); (ii) for positive queries, the approach captures the most informative semiring provenance [GKT07, KG12]; (iii) we develop a constraint provenance framework which yields domain independent provenance expressions, extending prior results [KLZ13]; and (iv) we implemented a prototype of constraint provenance games.

2. Provenance through Games

We first sketch how a query Q over database I gives rise to a game G_Q(I) and how to obtain provenance from the solved game G^γ_Q(I). Consider, e.g., input relations B(X, Y) and C(Y) and a relational query Q_ABC with set-difference: A := π_X(B ⋈ (π_Y(B) \ C)). It is well-known that any relational algebra query can be translated into a non-recursive Datalog¬ program. Here, we have Q_ABC =

r2 : A(X) :- B(X, Y), ¬C(Y).

The key idea of provenance games is to understand query evaluation as a game between players I and II who argue whether or not a tuple is in the answer. In [KLZ13] we showed that the solved game is a representation of why (why-not) provenance of answer tuples (missing tuples), respectively. Fig. 3a shows the game template for Q_ABC: to prove that A(x) is true, player I needs to find a rule instance of r2, say A(x) :- B(x, y), ¬C(y), which derives the desired tuple A(x) and whose choice y for the ∃-quantified variable Y in the body satisfies all literals (subgoals) in the rule body. In the game template in Fig. 3a this corresponds to a move from A(X) to r2(X, Y) while choosing a suitable domain value y for the ∃-quantified variable Y. Player II can challenge this claim by “attacking” one of the subgoals g in the rule body. If player I chose the “wrong” y for the instance r2(x, y), then II can always attack at least one subgoal that falsifies the body. The game continues in turns, until a player cannot move and loses, and the opponent wins.
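Before turning to the game, it may help to evaluate Q_ABC directly. This is a minimal sketch over the instance I = {B(a, b), B(b, a), C(a)} used in Fig. 3b; the two function names are hypothetical and simply mirror the Datalog¬ rule and the relational algebra expression.

```python
B = {("a", "b"), ("b", "a")}
C = {("a",)}

def q_abc(B, C):
    """A(X) :- B(X, Y), not C(Y)."""
    return {(x,) for (x, y) in B if (y,) not in C}

def q_abc_algebra(B, C):
    """pi_X(B join (pi_Y(B) minus C)), the relational algebra form."""
    ys = {(y,) for (_, y) in B} - C
    return {(x,) for (x, y) in B if (y,) in ys}

print(q_abc(B, C), q_abc_algebra(B, C))
```

Both forms return only A(a): the witness is Y = b, since B(a, b) holds and C(b) does not. A(b) is missing because its only B-partner is a, and C(a) holds.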

A game template G_Q for query Q contains literal nodes (oval; for atoms or their negation), rule nodes (boxes; for Datalog¬ rules), and goal nodes (rounded boxes; subgoals of rules): see Fig. 3a. Edge labels indicate a condition for a move: e.g., the label “∃Y” between a literal node, say A(X), and a rule node, say r2(X, Y), requires a player to pick a value y for the ∃-quantified variable Y when moving from an atom to the rule that derives it. Similarly, a condition “X:=Y” means that the current choice of Y becomes

A. Why-Not 3hop(c, a) Dissected

Consider the input graph in Fig. 1a and its why-not provenance for 3hop(c, a) in Fig. 2. The graph encodes the reasons why 3hop(c, a) is not in the answer. Moving from the lost 3hop(c, a) in Fig. 2, there are nine possible rule instantiations r1(c, a, z1, z2), all of which represent a reason why there is no 3hop(c, a) via intermediate nodes z1, z2 ∈ {a, b, c}. To better understand these why-not explanations, consider the input graph in Fig. 7. It contains the original database instance I plus a number of hypothetical (or missing) edges (dotted), with labels t, u, v, w, and x. These missing edges correspond to the failed leaf nodes in Fig. 2. The table in Fig. 6 contains the why-not provenance, with different combinations of missing edges as preconditions for a derivation of 3hop(c, a).

[Figure 7 (graph): nodes a, b, c; edge labels p, q, r, s, t, u, v, w, x.]

Figure 7: Input graph I with five additional, hypothetical edges (dashed).

B. Constraint Game Construction

Consider the query Q_ABC. To build the game, each ground tuple in the program such as B(a, b) is replaced by a constraint B: x1=a, x2=b (a conjunction).

First, the subgraph for EDB predicates is created. The remainder of the game is constructed iteratively, similar to query execution. For rules whose subgoals are all on EDB predicates, goal/rule nodes/edges are generated. For IDB predicates that were only in the head of EDB-only rules, tuple nodes are generated. Goal and rule nodes/edges are added for rules when the subgraph for all their subgoals has been generated, and for predicates when the subgraph for all the rules deriving into it has been generated.

For each EDB predicate, an expression is generated that is a disjunction of all tuples in the predicate. This expression and its negation are both processed to produce orthogonal DNF expressions (i.e., the conjunction of any two disjuncts in the expression is unsatisfiable). Tuple nodes t+ = P : c and t− = !P : c and an edge (t−, t+) are added to the graph for each disjunct in the constraint.

Those EDB nodes created from a positive expression disjunct are connected negative to positive and positive to a new sink node. Those from a negative disjunct are connected negative to positive, the positive node being a sink.

Orthogonalization is applied to the tuple constraints to ensure that each variable-free tuple is admitted by exactly one node.

Rule nodes are created next: IDB tuple nodes for the head predicate connect to them, and they in turn connect to goal nodes representing the uses of predicates in subgoals of the rule. A rule node is generated for each combination of body tuple nodes such that, if variables in the tuple node constraints were renamed as in the rule, the constraints would be satisfiable when conjoined. The rule node is given this simplified conjunction as a constraint, each goal node is created with an edge to its originating tuple node, and the rule node is connected to all these goals.

When all rules deriving a predicate have been processed, tuple nodes for the predicate are created. All constraints for rule nodes corresponding to these rules are combined in a disjunction and this expression is restricted to the variables in the rule node.⁴ This expression is then treated like that of an EDB predicate: it is simplified and converted to orthogonal DNF. A pair (positive and negative) of tuple nodes is created for each disjunct in the DNF. Edges are created from positive tuple nodes to rule nodes if the tuple node constraint (with variables renamed appropriately), when conjoined with the rule node constraint, can be satisfied.

A player selecting a goal node for goal g with conjunction e argues that a tuple agreeing with e can be used to satisfy g. A player currently ‘at’ a rule node is fighting the implicit claim that this rule firing is satisfied and creates the tuple in question. To rebut this claim, the player moves to a goal node claimed to be unsatisfied. The goal, if unsatisfied, will be lost; the rule node will be won iff at least one goal is unsatisfied. This provides the desired semantics for the rule node.

A detailed example using the game in Fig. 5 can be found in the next section.

Constraint provenance games improve grounded provenance games by making them domain independent. To return to our motivating example, consider Fig. 5. Observe that the won/lost states are effectively the same as in Fig. 3c, but compressed into constraint nodes that apply to more than one tuple. If one is interested in why the firing r2(b, c) was not sufficient to derive A(b), then one just has to find the node admitting this rule firing (r2 : X≠a, Y≠a). The subgraph of this node reachable using provenance edges will explain why rule firings admitted by this node are invalid.

Example. Consider the example Q_ABC corresponding to the constraint game in Fig. 5. After all EDB facts of B and C have been processed, the rule is processed. Intuitively, a way to show the presence of A(X) is to select a node which represents the presence of tuples in B and a node for the absence of tuples in C, which conjunctively correspond to a valid rule firing deriving A(X). This is equivalent to evaluating the ∃Y from the game template (see Fig. 3a) without having to enumerate all possible assignments of values to Y. Expressions that are not satisfiable in conjunction represent insoluble join conditions between the goals.

When creating nodes for the rule, one could consider the combination !B : x1=a, x2=b and C : x1≠a. Goal nodes are created for these (g12 : B : X=a, Y=b and g22 : !C : Y≠a, respectively) and since X=a ∧ Y=b ∧ Y≠a is satisfiable, a rule node r2 : X=a, Y=b is created and edges are drawn from the rule node to each goal node and from each goal to the corresponding tuple node. In contrast, the combination !B : x1=b, x2=a and C : x1≠a would not be satisfiable after renaming and conjunction.
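The satisfiability test that decides whether a rule node is created can be sketched without a solver for conjunctions of equalities and disequalities. This is an illustration only: the prototype described in the conclusions uses the Z3 theorem prover to simplify constraints, whereas the functions satisfiable and rename below are hypothetical pure-Python stand-ins that assume an infinite underlying domain.

```python
def satisfiable(conjunction):
    """conjunction: list of (var, op, const) with op in {"=", "!="}.

    Over an infinite domain, disequalities alone never conflict, so only
    X=a with X=b, or X=a with X!=a, can make a conjunction unsatisfiable.
    """
    equal, unequal = {}, {}
    for var, op, const in conjunction:
        if op == "=":
            if equal.setdefault(var, const) != const:
                return False                      # e.g. X=a and X=b
        else:
            unequal.setdefault(var, set()).add(const)
    return all(equal[v] not in unequal.get(v, set()) for v in equal)

def rename(constraint, mapping):
    """Rename tuple-node variables (x1, x2, ...) to the rule's variables."""
    return [(mapping[v], op, c) for v, op, c in constraint]

# Tuple-node constraints from the example above.
b_ab = [("x1", "=", "a"), ("x2", "=", "b")]
b_ba = [("x1", "=", "b"), ("x2", "=", "a")]
c_not_a = [("x1", "!=", "a")]

# r2: A(X) :- B(X, Y), not C(Y)
goal_b = {"x1": "X", "x2": "Y"}
goal_c = {"x1": "Y"}

ok = satisfiable(rename(b_ab, goal_b) + rename(c_not_a, goal_c))   # X=a, Y=b, Y!=a
bad = satisfiable(rename(b_ba, goal_b) + rename(c_not_a, goal_c))  # X=b, Y=a, Y!=a
print(ok, bad)
```

The first combination is satisfiable and yields the rule node r2 : X=a, Y=b; the second is not, because Y=a contradicts Y≠a after renaming, so no rule node is created for it.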

Consider the (valid) rule firing A(a) :- B(a, b), ¬C(b). In constructing the game, the node !B : x1=a, x2=b is used for the first goal as this node has the only expression to agree with B(a, b). A goal node is created signifying the use of this conjunction in the context of this goal: g12 : B : X=a, Y=b. Consider the conjunction of the expressions of nodes g12 : B : X=a, Y=b and g22 : !C : Y≠a. It can be satisfied, so a rule node is created representing this combination of goal nodes. The corresponding expression is the simplified conjunction of all the goal expressions used.

The rule firing r2 : X=a, Y=b is lost because both the connected goal nodes g12 and g22 are won (ultimately because B(a, b) is in the EDB and C(b) is not, respectively).

An expression for A/1 is generated by taking the disjunction of all the expressions for rule nodes deriving into A/1.⁵ This expression is then restricted to X (yielding X=a ∨ X=b ∨ X≠a). Orthogonalization ensures that each tuple will correspond to a single conjunction: (X=a) ∨ (X=b) ∨ (X≠a, X≠b).

⁴ All other variables are replaced with true.
⁵ This yields (Y≠b, X=a, Y≠a) ∨ (X=a, Y=a) ∨ (X=a, Y=b) ∨ (X≠a, Y≠a) ∨ (X=b, Y=a) ∨ (X≠a, X≠b, Y=a)

To that end, a tuple R(X) with variables X = X1, . . . , Xn is associated with a Boolean expression over equalities of the form X = c, or disequalities of the form X ≠ c. Thus, each (dis)equality is between a variable from X and a constant c ∈ adom(I).

Since nodes in a constraint game no longer correspond to a single concrete value, but a constrained set, a tuple node being won in the solved game corresponds to the presence of all tuples which satisfy the node’s constraint (the node is said to admit the tuples), and if lost to their absence. The advantage of this approach is that one can query provenance of tuples involving elements not in the active domain and provenance answers will stay correct in light of changes in the active domain.

Figure 9 (in the appendix) encompasses the why-not explanations that involve only the active domain, as well as the infinite explanations that can be generated when one considers values outside the active domain. The table below shows each explanation involving hypothetical tuples in the active domain and the corresponding rule node in Fig. 9. The rule nodes in Fig. 9 can be considered to be numbered from 1 on the left to 15 on the right. Each rule node is won, which agrees with the fact that each of the paths shown in Fig. 6 is only hypothetical.

Since all rule firings which would derive 3hop(c, a) are won/unsatisfied, the tuple 3hop(c, a) does not exist and the node indicating its positive presence in Fig. 9 (the source node) is accordingly lost.

Each of the rule nodes referenced in the table, which explain the negative provenance of a rule firing grounded in the active domain, also captures the rule non-satisfaction of an infinite set of possible variable bindings to elements possibly outside the active domain. Any constraint that has a variable that is only disequality-constrained represents an infinite set of firings. Consider the rule node: R1 : X≠a, X≠b, Z1=a, Z2=a, Y=a. This corresponds to the (hypothetical) 3hop path c -t-> a -p-> a -p-> a and the situation in which the edge t exists (see first row of Fig. 6). However, it also explains why the rule firing d → a → a → a is not successful. The explanation is the failure of the first goal of the rule. In the case of X=c, it represents that there are no outgoing edges from c. In the case of X=d or any other invented value this is trivially true.

This shows that constraint provenance games do not suffer from the same problems as their fully-grounded counterparts. Provenance can be queried for any imaginable tuple, including one not in the active domain, and the provenance presented is still correct in the presence of a growing active domain.

r1(X, Y, Z1, Z2) [Fig. 2] | X → Z1 → Z2 → Y [Fig. 7] | Why-Not Provenance | R1 Node [Fig. 9]
r1(c, a, a, a) | c -t-> a -p-> a -p-> a | t ⇒ t·p·p | 2
r1(c, a, a, b) | c -t-> a -q-> b -r-> a | t ⇒ t·q·r | 3
r1(c, a, a, c) | c -t-> a -u-> c -t-> a | t, u ⇒ t·u·t | 7
r1(c, a, c, a) | c -v-> c -t-> a -p-> a | t, v ⇒ v·t·p | 14
r1(c, a, b, c) | c -w-> b -s-> c -t-> a | t, w ⇒ w·s·t | 6
r1(c, a, c, c) | c -v-> c -v-> c -t-> a | t, v ⇒ v·v·t | 12
r1(c, a, c, b) | c -v-> c -w-> b -r-> a | v, w ⇒ v·w·r | 15
r1(c, a, b, a) | c -w-> b -r-> a -p-> a | w ⇒ w·r·p | 4
r1(c, a, b, b) | c -w-> b -x-> b -r-> a | w, x ⇒ w·x·r | 1

Figure 6: The nine r1-instances in the first column correspond to those in Fig. 2 from left to right. The 3hop-path is shown in the second column, with missing/hypothetical edges (dashed) t, u, v, w, x and existing edges p, q, r, s; see Fig. 7. The third column shows the why-not provenance of 3hop(c, a): e.g., if an edge t from c to a were present, there would be two derivations t·p·p and t·q·r. The last column identifies the R1 rule node (labeled from 1 to 15, left to right) in Fig. 9 which subsumes the corresponding rule node in Fig. 2.

4. Conclusions

In earlier work we proposed provenance games as an elegant and novel approach to unify why and why-not provenance [KLZ13]. The problem of domain dependence for why-not answers led us to develop our domain independent extension using concepts from constructive negation [Cha88]. This approach increases the complexity of individual nodes, but has the advantage that provenance queries are not limited to the active domain, and a constraint provenance graph is still correct when considering a program executed under a larger active domain, which is not guaranteed for non-constraint games. This domain independent extension of provenance games [KLZ13] to use constraints is implemented as a prototype that uses a Datalog¬ engine to solve games via Q_wm (Sec. 2), and the Z3 theorem prover to simplify constraints (most figures in the paper and appendix are automatically generated by our prototype).

Acknowledgments. Supported in part by NSF awards IIS-1118088 and ACI-0830944.

References

[ADT11a] Y. Amsterdamer, D. Deutch, and V. Tannen. On the Limitations of Provenance for Queries With Difference. In TaPP, Heraklion, Crete, 2011.

[ADT11b] Y. Amsterdamer, D. Deutch, and V. Tannen. Provenance for aggregate queries. In PODS, pp. 153–164. ACM, 2011.

[BKT01] P. Buneman, S. Khanna, and W.-C. Tan. Why and where: A characterization of data provenance. In ICDT, pp. 316–330. Springer, 2001.

[BSHW06] O. Benjelloun, A. Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In VLDB, pp. 953–964, 2006.

[Cha88] D. Chan. Constructive Negation Based on the Completed Database. In ICLP/SLP, pp. 111–125, 1988.

[CWW00] Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM (TODS), 25(2):179–227, 2000.

[GIT11] T. Green, Z. Ives, and V. Tannen. Reconcilable differences. Theory of Computing Systems, 49(2):460–488, 2011.

[GKT07] T. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pp. 31–40, 2007.

[GP10] F. Geerts and A. Poggi. On database query languages for k-relations. Journal of Applied Logic, 8(2):173–185, 2010.

[KG12] G. Karvounarakis and T. J. Green. Semiring-annotated data: queries and provenance. ACM SIGMOD Record, 41(3):5–14, 2012.

[KLZ13] S. Köhler, B. Ludäscher, and D. Zinn. First-Order Provenance Games. In In Search of Elegance in the Theory and Practice of Computation, pp. 382–399. Springer, 2013.

[Stu95] P. J. Stuckey. Negation and constraint logic programming. Information and Computation, 118(1):12–33, 1995.

[VGRS91] A. Van Gelder, K. Ross, and J. Schlipf. The well-founded semantics for general logic programs. Journal of the ACM (JACM), 38(3):619–649, 1991.

=>  What-­‐If  provenance!  

Yes!

65

Are  there  more  ways  to  fail?  

[Figure 3, panels (a) to (c): game graphs over the nodes A, ¬A, B, ¬B, C, ¬C, r2, rB, rC, g12, g22; template edge labels include ∃Y, X:=Y, B(X, Y) and C(X).]

(a) Game template for Q_ABC: A(X) :- B(X, Y), ¬C(Y).

(b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}.

(c) Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not).

Figure 3: Provenance game for Q_ABC. The well-founded model of win(X) :- M(X, Y), ¬win(Y), applied to move graph M, solves the game.

the new binding for X; a condition “B(X, Y)” means that a move is possible only if B(X, Y) is true in I for the current X, Y values.²

Given database I, a template can be instantiated yielding a game graph G_Q(I) as in Fig. 3b. Note how template variables (e.g., Y) have been replaced by domain values (a or b), and that conditional edges (e.g., labeled “C(X)”) became unconditional edges (e.g., C(a) → rC(a)) or no edge at all (e.g., from C(b)), depending on whether or not the condition holds in I. To extract why(-not) provenance from a game graph G_Q(I) as in Fig. 3b, we need to solve the game first, i.e., determine which positions are won (light green) or lost (dark red); see Fig. 3c. There is a surprisingly simple and elegant solution: the (unstratified) Datalog¬ rule Q_wm :=

win(X) :- move(X, Y), ¬win(Y)

when evaluated under the well-founded semantics [VGRS91] solves the game! Thus we can use Q_wm as a “game engine” to solve the provenance game with a move relation given by G_Q(I).³
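The effect of Q_wm can be imitated with a small fixpoint loop. The sketch below is not the prototype (which runs the rule in a Datalog¬ engine under the well-founded semantics); it labels a position lost when all its moves lead to won positions, won when some move leads to a lost position, and drawn otherwise. The move graph produced by q_abc_game is an assumption: it is one reading of the template in Fig. 3a, and both function names are hypothetical.

```python
from itertools import product

def solve(moves):
    """Label positions for win(X) :- move(X, Y), not win(Y)."""
    nodes = set(moves) | {y for ys in moves.values() for y in ys}
    label = {}
    changed = True
    while changed:
        changed = False
        for n in nodes - set(label):
            succ = moves.get(n, ())
            if any(label.get(s) == "lost" for s in succ):
                label[n] = "won"       # some move leaves the opponent lost
                changed = True
            elif all(label.get(s) == "won" for s in succ):
                label[n] = "lost"      # no move, or every move helps the opponent
                changed = True
    return {n: label.get(n, "drawn") for n in nodes}

def q_abc_game(B, C, domain):
    """Instantiate the (assumed) template for A(X) :- B(X, Y), not C(Y)."""
    moves = {}
    def add(src, dst):
        moves.setdefault(src, []).append(dst)
    for x, y in product(domain, repeat=2):
        add(f"A({x})", f"r2({x},{y})")        # pick a value for the ∃-variable Y
        add(f"r2({x},{y})", f"g12({x},{y})")  # attack the first subgoal
        add(f"r2({x},{y})", f"g22({y})")      # attack the second subgoal
        add(f"g12({x},{y})", f"¬B({x},{y})")
        add(f"¬B({x},{y})", f"B({x},{y})")
        if (x, y) in B:
            add(f"B({x},{y})", f"rB({x},{y})")  # fact node, a sink
    for y in domain:
        add(f"¬A({y})", f"A({y})")
        add(f"¬C({y})", f"C({y})")
        add(f"g22({y})", f"C({y})")
        if y in C:
            add(f"C({y})", f"rC({y})")
    return moves

labels = solve(q_abc_game({("a", "b"), ("b", "a")}, {"a"}, ["a", "b"]))
print(labels["A(a)"], labels["A(b)"], labels["r2(a,b)"])
```

On this instance A(a) comes out won and A(b) lost, in line with the caption of Fig. 3c, and the rule node r2(a, b) is lost because both of its goals are won. Since the game for a non-recursive query has no cycles, no position is drawn here.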

² Readers familiar with logic programming semantics may recognize that provenance games mimic a form of SLD(NF) resolution.
³ Indeed, both our prototypes use Q_wm to compute (constraint) provenance.

[Figure 4 (graph): nodes A(b), r2(b, a), r2(b, b), r2(b, c), g12(b, b), g12(b, c), g22(a), B(b, b), ¬B(b, b), B(b, c), ¬B(b, c), C(a), rC(a); edge labels ∃a, ∃b, ∃c.]

Figure 4: Altered subgraph of Fig. 3c after adding c to the active domain.

[Figure 5 (graph): constraint nodes for A, ¬A, B, ¬B, C, ¬C, RB, RC, R2, G12, G22, e.g., R2 : X≠a, Y≠a and G22 : ¬C : Y≠a.]

Figure 5: Constraint provenance game for Q_ABC. Unlike in Figure 3, nodes may represent finite or infinite sets here.

Finally, the solved game is a labeled graph G^γ_Q(I), i.e., each node carries a new label γ, indicating whether a position is won (light green) or lost (dark red). As shown in [KLZ13], only edges from won to lost nodes (green) and lost to won nodes (red) are part of the provenance; other edges (grey, dashed in Fig. 3c) correspond to “bad moves” (invalid arguments in the query evaluation game) and are excluded from the provenance. The provenance subgraph of G^γ_Q(I) thus consists only of edges that are matched by the regular path queries (g.r)+ and r.(g.r)*, i.e., alternating sequences of green (winning) and red (delaying) moves [KLZ13].
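The filter on edges can be sketched as a reachability pass that only follows won→lost (green) and lost→won (red) moves; because each kept edge flips the label, every path it finds alternates colors automatically. The graph below is a hand-labeled fragment for why-not A(b), written out under the same reading of Fig. 3 as assumed elsewhere; provenance_edges is a hypothetical helper and assumes that no position is drawn.

```python
def provenance_edges(moves, labels, start):
    """Keep only edges between a won and a lost position, reachable from start."""
    kept, stack, seen = [], [start], {start}
    while stack:
        n = stack.pop()
        for s in moves.get(n, ()):
            if {labels[n], labels[s]} == {"won", "lost"}:
                kept.append((n, s))
                if s not in seen:
                    seen.add(s)
                    stack.append(s)
    return kept

# Fragment of the solved game below A(b) for I = {B(a,b), B(b,a), C(a)}.
moves = {
    "A(b)": ["r2(b,a)", "r2(b,b)"],
    "r2(b,a)": ["g12(b,a)", "g22(a)"],
    "r2(b,b)": ["g12(b,b)", "g22(b)"],
    "g12(b,a)": ["¬B(b,a)"], "¬B(b,a)": ["B(b,a)"], "B(b,a)": ["rB(b,a)"],
    "g12(b,b)": ["¬B(b,b)"], "¬B(b,b)": ["B(b,b)"],
    "g22(a)": ["C(a)"], "C(a)": ["rC(a)"],
    "g22(b)": ["C(b)"],
}
labels = {
    "A(b)": "lost", "r2(b,a)": "won", "r2(b,b)": "won",
    "g12(b,a)": "won", "¬B(b,a)": "lost", "B(b,a)": "won", "rB(b,a)": "lost",
    "g12(b,b)": "lost", "¬B(b,b)": "won", "B(b,b)": "lost",
    "g22(a)": "lost", "C(a)": "won", "rC(a)": "lost",
    "g22(b)": "won", "C(b)": "lost",
}

edges = provenance_edges(moves, labels, "A(b)")
for src, dst in edges:
    print(src, "->", dst)
```

The move from r2(b, a) to g12(b, a) connects two won positions, so it is a bad move and is dropped; the explanation for why-not A(b) via r2(b, a) therefore runs through g22(a), C(a) and rC(a), i.e., it blames the presence of C(a).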

3. Constraint Provenance Games

Consider the solved game graph of Fig. 3c. If the value c were added to the active domain, the provenance would be incomplete: e.g., to explain why-not A(b) there are two branches, ∃a and ∃b, emanating from A(b). However, with c in the active domain there is a third ∃c branch via r2(b, c): see Fig. 4. We show that a modified game construction (Fig. 5) based on constraints can be used to automatically include such extensions of the active domain, thereby eliminating the domain dependence of the original approach.

Similarly, one could conclude from Fig. 2 that the absence of 3hop(c, a) from the query answer is due entirely to the absence of hop(a, c), hop(c, a), hop(c, c), hop(c, b), and hop(b, b). This explanation, too, is complete only relative to the active domain: if d were introduced into the domain, new why-not answers such as r1(c, a, d, d) would have to be added to the provenance graph in Fig. 2. The new version of the provenance game (Fig. 9), however, takes care of this via a more general constraint node R1 : X≠a, X≠b, Z1≠c, Z1≠a, Z1≠b, Z2≠c, Z2≠a, Z2≠b, Y≠c.

In constraint provenance games, nodes stand for sets of ground nodes. A constraint tuple such as “3hop(x, y): x=a, y=b” may stand for a single tuple (here: 3hop(a, b)), or for (possibly infinitely) many: e.g., “3hop(x, y): x≠a, x≠b, y=a” stands for the set { 3hop(x, a) | x ∈ D \ {a, b} } over any underlying domain D (finite or infinite).


Two  branches  that  explain    Why-­‐not  A(b)  

Adding  a  new  constant  c  to  the  domain  =>  new  why-­‐not  answer!  

Oh no … ☹

66  

A(X)

C(X)

B(X,Y )

r2(X,Y )g12(X,Y )

g22(Y )

rB

(X,Y )

rC

(X)

¬A(X)

¬B(X,Y )

¬C(X)

B(X,Y )

C(X)X:=Y

9Y

(a) Game template for QABC

: A(X) :� B(X,Y ),¬C(Y ).

¬C(a)

¬C(b)

¬B(a, a)

¬B(a, b)

rB

(b, a)

r2(b, a)¬A(b)

¬A(a)

g12(a, a)

B(a, b)

B(a, a)

C(a)

g22(a)

g22(b)C(b)

¬B(b, a)

¬B(b, b)

rC

(a)

A(b)

A(a)

r2(a, b)

r2(a, a)

g12(a, b) rB

(a, b)

r2(b, b)g12(b, b)

g12(b, a)

B(b, b)

B(b, a)

9a

9b

9b

9a

(b) Instantiated Q

ABC

game on I = {B(a, b), B(b, a), C(a)}.

¬C(a)

¬C(b)

¬B(a, a)

¬B(a, b)

rB

(b, a)

r2(b, a)¬A(b)

¬A(a) rB

(a, b)B(a, b)

B(a, a)

C(a)

g22(a)

g22(b)C(b)

¬B(b, a)

¬B(b, b)

rC

(a)

A(b)

A(a)

r2(a, b)

r2(a, a)

g12(a, b)

g12(a, a)

r2(b, b)g12(b, b)

g12(b, a)

B(b, b)

B(b, a)

9a

9b

9b

9a

(c) Solved game: lost positions are (dark) red; won positionsare (light) green. Provenance edges (= good moves) are solid.Bad moves are dashed and not part of the provenance. A(a) istrue (A(b) is false) as it is won (lost) in the solved game; thegame provenance explains why (why-not).

Figure 3: Provenance game for Q

ABC

. The well-founded model ofwin(X) :� M(X,Y ),¬win(Y ), applied to move graph M, solves the game.

the new binding for X; a condition “B(X,Y )” means that a moveis possible only if B(X,Y ) is true in I for the current X , Y values.2

Given database I , a template can be instantiated yielding a gamegraph G

Q(I) as in Fig. 3b. Note how template variables (e.g., Y )have been replaced by domain values (a or b), and that conditionaledges (e.g., labeled “C(X)”) became unconditional edges (e.g.,C(a)! r

C

(a)) or no edge at all (e.g., from C(b)), depending onwhether or not the condition holds in I . To extract why(-not)provenance from a game graph G

Q(I) as in Fig. 3b, we need tosolve the game first, i.e., determine which positions are won (lightgreen) or lost (dark red); see Fig. 3c. There is a surprisingly simpleand elegant solution: the (unstratified) Datalog¬ rule Q

wm

:=

win(X) :� move(X,Y ),¬win(Y )

when evaluated under the well-founded semantics [VGRS91]solves the game! Thus we can use Q

wm

as a “game engine” tosolve the provenance game with a move relation given by G

Q

(I).3

Finally, the solved game is a labeled graph G^γ_Q(I), i.e., each node carries a new label γ, indicating whether a position is won (light green) or lost (dark red). As shown in [KLZ13], only edges from won to lost nodes (green) and lost to won nodes (red) are part of the provenance; other edges (grey, dashed in Fig. 3c) correspond to “bad moves” (invalid arguments in the query evaluation game) and are excluded from the provenance. The provenance subgraph of G^γ_Q(I) thus consists only of edges that are matched by the regular path queries (g.r)⁺ and r.(g.r)*, i.e., alternating sequences of green (winning) and red (delaying) moves [KLZ13].

² Readers familiar with logic programming semantics may recognize that provenance games mimic a form of SLD(NF) resolution.
³ Indeed, both our prototypes use Q_wm to compute (constraint) provenance.

Figure 4: Altered subgraph of Fig. 3c after adding c to the active domain.

Figure 5: Constraint provenance game for Q_ABC. Unlike in Figure 3, nodes may represent finite or infinite sets here.

3. Constraint Provenance Games

Consider the solved game graph of Fig. 3c. If the value c were added to the active domain, the provenance would be incomplete: e.g., to explain why-not A(b) there are two ∃a, ∃b branches emanating from A(b). However, with c in the active domain there is a third ∃c branch via r2(b, c): see Fig. 4. We show that a modified game construction (Fig. 5) based on constraints can be used to automatically include such extensions of the active domain, thereby eliminating the domain dependence of the original approach.

Similarly, one could conclude from Fig. 2 that the absence of 3hop(c, a) from the query answer is due entirely to the absence of hop(a, c), hop(c, a), hop(c, c), hop(c, b), and hop(b, b). Also this explanation, however, is complete only relative to the active domain: if d was introduced into the domain, new why-not answers such as r1(c, a, d, d) would have to be added to the provenance graph in Fig. 2. The new version of the provenance game (Fig. 9), however, takes care of this via a more general constraint node R1 : X≠a, X≠b, Z1≠c, Z1≠a, Z1≠b, Z2≠c, Z2≠a, Z2≠b, Y≠c.

In constraint provenance games, nodes stand for sets of ground nodes. A constraint tuple such as “3hop(x, y): x=a, y=b” may stand for a single tuple (here: 3hop(a, b)), or for (possibly infinitely) many: e.g., “3hop(x, y): x≠a, x≠b, y=a” stands for the set { 3hop(x, a) | x ∈ D \ {a, b} } over any underlying domain D (finite or infinite).
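The reading of a constraint tuple as a set can be illustrated with a small Python sketch. The encoding of a constraint as a list of (variable, operator, constant) atoms and the helper names `admits` and `expand` are assumptions made for illustration only; they are not taken from the paper's prototype.

```python
from itertools import product

def admits(constraint, values):
    """True if the ground assignment `values` satisfies every atom."""
    return all(
        (values[var] == const) if op == "=" else (values[var] != const)
        for var, op, const in constraint
    )

def expand(constraint, variables, domain):
    """Enumerate the ground tuples over a finite `domain` that the constraint admits."""
    return {
        t for t in product(sorted(domain), repeat=len(variables))
        if admits(constraint, dict(zip(variables, t)))
    }

# "3hop(x, y): x=a, y=b" stands for a single tuple.
single = [("x", "=", "a"), ("y", "=", "b")]
# "3hop(x, y): x!=a, x!=b, y=a" stands for one tuple per x outside {a, b}.
many = [("x", "!=", "a"), ("x", "!=", "b"), ("y", "=", "a")]

small = expand(many, ["x", "y"], {"a", "b", "c"})
large = expand(many, ["x", "y"], {"a", "b", "c", "d", "e"})
print(expand(single, ["x", "y"], {"a", "b", "c"}), small, large)
```

The same constraint node covers more ground tuples as the domain grows, which is why the constraint itself does not have to be recomputed when new constants appear.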

Happy End (3 of 3)… sort of…


Why-not provenance complete only for adom(I) = { a, b }!

Constraint why-not provenance also captures new constants, i.e., for an unlimited domain D = { a, b, c, … }

=> Constraint Provenance answer is domain independent! (sort of)

67  

Why-Not: The Full Story Emerges… (sort of…)

Figure 9: The why-not provenance of 3hop(c, a). The provenance is represented in the failure of the claim that 3hop(c, a) is in the answer. This is argued over the Boolean expression defining 3hop(x, y). A move from the source node to a child represents the choice of a Boolean expression that is sufficient to capture a rule deriving 3hop(c, a). The opponent counters with a subset of this conjunction that is claimed not to be true. The game continues until it reaches the EDB. There exists no equivalent grounded provenance game.

Figure 2: Why-not provenance for 3hop(c, a) using provenance games.

gi1 in the body of r1, thus claiming that gi1 is false and hence that the r1 instance doesn’t derive t. The first player can counter and demonstrate that gi1 is true by selecting a rule instance or fact as evidence for gi1. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it was shown how the provenance of a tuple t can be obtained via a regular path query over a solved game graph like the one in Fig. 1d: e.g., p^3 + 2pqr for 3hop(a, a) is represented by a solved game as shown in Fig. 1e: for positive queries, solved games represent semiring provenance by noting that won (green) and lost (red) positions correspond to “+” and “×” operations, respectively (leaves represent input annotations, here: p, q, r, s) [KLZ13].

Why-Not Provenance and the Many Ways to Fail. Since games are inherently symmetric (one player’s win is the opponent’s loss and vice versa), the approach yields an elegant provenance model that unifies why and why-not provenance. Consider the (dark, red) node 3hop(c, a) in Fig. 2. The color coding indicates that the position 3hop(c, a) is lost (the atom is false), i.e., all outgoing moves to a node r1(x, y, z1, z2) lead to a position that is won for the opponent. There are 9 such positions, e.g., r1(c, a, c, b) is one of them (third from the right). Recall that an instance of r1 means that one can do a 3-hop from x to y (here: c to a) via intermediate nodes z1 and z2 (here: c and b). However, in the given database I in Fig. 1(a), there is no hop(c, z) – neither for z = b nor for any other z, since there are no outgoing moves from c. In this case, the opponent can successfully attack the goals in the body. Note how the why-not provenance of 3hop(c, a) in Fig. 2 is similar but different from the why provenance of 3hop(a, a) in Fig. 1: In order to show that 3hop(c, a) is false, one has to show that all possible ways that it could be true are failing, i.e., for all z1, z2, the ground instances r1(c, a, z1, z2) do not derive 3hop(c, a) (since at least one goal in r1’s body is always false). In contrast, to prove that 3hop(a, a) is true, it is sufficient to find some ground instance r1(a, a, z1, z2) whose body is true. Earlier we saw that there are exactly three such instances, corresponding to p·p·p + p·q·r + q·r·p (= p^3 + 2pqr).

Domain Dependence of Provenance Games. As seen, 3hop(a, a) has three derivations, represented by the first provenance polynomial in Fig. 1(c) and the game provenance in Fig. 1(d) and (e). How many ways are there to show that 3hop(c, a) is false (why-not provenance), or equivalently, that ¬3hop(c, a) is true? If we annotate the leaves of the game graph in Fig. 2 with identifiers u1, . . . , u5 for the five different hop tuples missing in I, we can construct a provenance expression that represents the many ways why 3hop(c, a) is not in the answer. While this answer provides a comprehensive, instance-based why-not explanation, it also exhibits a problem with the current approach: In order to obtain finite (why and why-not) provenance answers for all first-order queries, game provenance employs an active domain semantics: e.g., the provenance game for Q_3hop(I) considers only ground instances of r1 over the active domain adom(I) = {a, b, c}. If additional elements d, e, . . . are added to I (e.g., via a disconnected graph component), the why-not provenance in Fig. 2 becomes incomplete and the provenance has to be recomputed for the larger domain.
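The contrast between why and why-not, and the domain dependence just described, can be checked with a short Python sketch. The instance `HOP` is an assumption: it is reconstructed from the derivations p·p·p, p·q·r, q·r·p and from the five missing hop tuples named in the text, since Fig. 1(a) itself is not reproduced here. The function names are hypothetical.

```python
from itertools import product

# Assumed instance I with annotations p, q, r, s (reconstructed from the text).
HOP = {("a", "a"): "p", ("a", "b"): "q", ("b", "a"): "r", ("b", "c"): "s"}

def r1_instances(x, y, domain):
    """Yield each ground instance r1(x, y, z1, z2) with its goals and missing goals."""
    for z1, z2 in product(domain, repeat=2):
        goals = [(x, z1), (z1, z2), (z2, y)]
        missing = [g for g in goals if g not in HOP]
        yield (z1, z2), goals, missing

def why(x, y, domain):
    """Successful derivations, each as a product of input annotations."""
    return ["·".join(HOP[g] for g in goals)
            for _, goals, missing in r1_instances(x, y, domain) if not missing]

def why_not(x, y, domain):
    """Map each failing instance (z1, z2) to the hop tuples it lacks."""
    return {z: missing
            for z, _, missing in r1_instances(x, y, domain) if missing}

adom = ["a", "b", "c"]
derivations = why("a", "a", adom)                 # one success suffices for "why"
failed = why_not("c", "a", adom)                  # every instance must fail
missing_tuples = sorted({g for ms in failed.values() for g in ms})
failed_with_d = why_not("c", "a", adom + ["d"])   # enlarged domain
print(derivations, len(failed), missing_tuples, len(failed_with_d))
```

Over adom(I) all 9 instances fail and together they mention 5 missing hop tuples; adding a single new constant d raises the number of failing instances to 16, so the grounded why-not answer has to be recomputed.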

Constraint Provenance Games. We propose to solve the problem of domain dependence by modifying provenance games so that they can handle certain infinite relations that can be finitely represented. For example, in addition to the finitely many reasons why 3hop(c, a) fails over the active domain adom(I), there are infinitely many others, if we consider new constants d, e, . . . outside of adom(I). For example, let relation R = {a, b} have two tuples R(a) and R(b). If we want to know why-not R(c), we just point to c ∉ R. But we could also return a more general answer for why-not R(x) and say that ¬R(x) is true for all x with x ≠ a ∧ x ≠ b (not just for x = c). This approach is inspired by Chan’s Constructive Negation [Cha88], a form of constraint logic programming [Stu95]. The key idea is to represent (potentially infinite) relations through constraints, i.e., Boolean combinations of equalities x = c and disequalities x ≠ c.

Overview and Contributions. Section 2 briefly explains how first-order queries are translated into games and how provenance is extracted from solved games. In Section 3 we describe the construction of constraint provenance games; additional details and examples are contained in the appendix. Our main contributions are: (i) game provenance provides a uniform treatment of why and why-not provenance for first-order logic (= relational algebra with set-difference); (ii) for positive queries, the approach captures the most informative semiring provenance [GKT07, KG12]; (iii) we develop a constraint provenance framework which yields domain independent provenance expressions, extending prior results [KLZ13]; and (iv) we implemented a prototype of constraint provenance games.

2. Provenance through Games

We first sketch how a query Q over database I gives rise to a game G_Q(I) and how to obtain provenance from the solved game G^γ_Q(I). Consider, e.g., input relations B(X,Y) and C(Y) and a relational query Q_ABC with set-difference: A := π_X(B ⋈ (π_Y(B) \ C)). It is well-known that any relational algebra query can be translated into a non-recursive Datalog¬ program. Here, we have Q_ABC =

r2 : A(X) :- B(X,Y), ¬C(Y).
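As a quick sanity check of this translation, the following Python sketch evaluates both forms of Q_ABC on the instance used in Fig. 3b, I = {B(a, b), B(b, a), C(a)}. Relations are modeled as plain Python sets; the function names are hypothetical.

```python
B = {("a", "b"), ("b", "a")}
C = {"a"}

def q_abc_algebra(B, C):
    """A := pi_X(B join (pi_Y(B) minus C)), evaluated step by step."""
    y_not_in_c = {y for (_, y) in B} - C                   # pi_Y(B) \ C
    joined = {(x, y) for (x, y) in B if y in y_not_in_c}   # natural join on Y
    return {x for (x, _) in joined}                        # pi_X

def q_abc_datalog(B, C):
    """A(X) :- B(X,Y), not C(Y)."""
    return {x for (x, y) in B if y not in C}

print(q_abc_algebra(B, C), q_abc_datalog(B, C))
```

Both evaluations return {a}: A(a) is derived via B(a, b) and the absence of C(b), while A(b) is not derived because its only B-tuple, B(b, a), is blocked by C(a).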

The key idea of provenance games is to understand query evaluation as a game between players I and II who argue whether or not a tuple is in the answer. In [KLZ13] we showed that the solved game is a representation of why (why-not) provenance of answer tuples (missing tuples), respectively. Fig. 3a shows the game template for Q_ABC: to prove that A(x) is true, player I needs to find a rule instance of r2, say A(x) :- B(x, y), ¬C(y) which derives the desired tuple A(x) and whose choice y for the ∃-quantified variable Y in the body satisfies all literals (subgoals) in the rule body. In the game template in Fig. 3a this corresponds to a move from A(X) to r2(X,Y) while choosing a suitable domain value y for the ∃-quantified variable Y. Player II can challenge this claim by “attacking” one of the subgoals g in the rule body. If player I chose the “wrong” y for the instance r2(x, y), then II can always attack at least one subgoal that falsifies the body. The game continues in turns, until a player cannot move and loses, and the opponent wins.

A game template G_Q for query Q contains literal nodes (oval; for atoms or their negation), rule nodes (boxes; for Datalog¬ rules), and goal nodes (rounded boxes; subgoals of rules): see Fig. 3a. Edge labels indicate a condition for a move: e.g., the label “∃Y” between a literal node, say A(X), and a rule node, say r2(X,Y), requires a player to pick a value y for the ∃-quantified variable Y when moving from an atom to the rule that derives it. Similarly, a condition “X:=Y” means that the current choice of Y becomes

A. Why-Not 3hop(c, a) Dissected

Consider the input graph in Fig. 1a and its why-not provenance for 3hop(c, a) in Fig. 2. The graph encodes the reasons why 3hop(c, a) is not in the answer. Moving from the lost 3hop(c, a) in Fig. 2, there are nine possible rule instantiations r1(c, a, z1, z2), all of which represent a reason why there is no 3hop(c, a) via intermediate nodes z1, z2 ∈ {a, b, c}. To better understand these why-not explanations, consider the input graph in Fig. 7. It contains the original database instance I plus a number of hypothetical (or missing) edges (dotted), with labels t, u, v, w, and x. These missing edges correspond to the failed leaf nodes in Fig. 2. The table in Fig. 6 contains the why-not provenance, with different combinations of missing edges as preconditions for a derivation of 3hop(c, a).

Figure 7: Input graph I with five additional, hypothetical edges (dashed).

B. Constraint Game Construction

Consider the query Q_ABC. To build the game, each ground tuple in the program such as B(a, b) is replaced by a constraint B: x1=a, x2=b (a conjunction).

First, the subgraph for EDB predicates is created. The remainder of the game is constructed iteratively similar to query execution. For rules whose subgoals are all on EDB predicates, goal/rule nodes/edges are generated. For IDB predicates that were only in the head of EDB-only rules, tuple nodes are generated. Goal and rule nodes/edges are added for rules when the subgraph for all their subgoals has been generated, and for predicates when the subgraph for all the rules deriving into it has been generated.

For each EDB predicate, an expression is generated that is a disjunction of all tuples in the predicate. This expression and its negation are both processed to produce orthogonal DNF expressions (i.e., the conjunction of any two disjuncts in the expression is unsatisfiable). Tuple nodes t⁺ = P : c and t⁻ = ¬P : c and an edge (t⁻, t⁺) are added to the graph for each disjunct in the constraint.

Those EDB nodes created from a positive expression disjunct are connected negative to positive and positive to a new sink node. Those from a negative disjunct are connected negative to positive, the positive node being a sink.

Orthogonalization is applied to the tuple constraints to ensure that each variable-free tuple is admitted by exactly one node.

Rule nodes are created to which connect IDB tuple nodes for the head predicate and which connect to goal nodes representing the uses of predicates in subgoals of the rule. A rule node is generated for each combination of body tuple nodes such that, if variables in the tuple node constraints were renamed as in the rule, the constraints would be satisfiable when conjuncted. The rule node is given this simplified conjunction as a constraint, each goal node is created with an edge to its originating tuple node, and the rule node is connected to all these goals.

When all rules deriving a predicate have been processed, tuple nodes for the predicate are created. All constraints for rule nodes corresponding to these rules are disjuncted and this expression is restricted to the variables in the rule node.⁴ This expression is then treated like that of an EDB predicate: it is simplified and converted to orthogonal DNF. A pair (positive and negative) of tuple nodes is created for each disjunct in the DNF. Edges are created from positive tuple nodes to rule nodes if the tuple node constraint (with variables renamed appropriately) when conjuncted with the rule node constraint can be satisfied.

A player selecting a goal node for goal g with conjunction e argues that a tuple agreeing with e can be used to satisfy g. A player currently ‘at’ a rule node is fighting the implicit claim that this rule firing is satisfied and creates the tuple in question. To rebut this claim, the player moves to a goal node claimed to be unsatisfied. The goal, if unsatisfied, will be lost; the rule node will be won iff at least one goal is unsatisfied. This provides the desired semantics for the rule node.

A detailed example using the game in Fig. 5 can be found in the next section.

Constraint provenance games improve grounded provenance games by making them domain independent. To return to our motivating example, consider Fig. 5. Observe that the won/lost states are effectively the same as in Fig. 3c, but compressed into constraint nodes that apply to more than one tuple. If one is interested in why the firing r2(b, c) was not sufficient to derive A(b), then one just has to find the node admitting this rule firing (r2 : X≠a, Y≠a). The subgraph of this node reachable using provenance edges will explain why rule firings admitted by this node are invalid.

Example. Consider the example Q_ABC corresponding to the constraint game in Fig. 5. After all EDB facts of B and C have been processed, the rule is processed. Intuitively, a way to show the presence of A(X) is to select a node which represents the presence of tuples in B and a node for the absence of tuples in C, which conjunctively correspond to a valid rule firing deriving A(X). This is equivalent to evaluating the ∃Y from the game template (see Fig. 3a) without having to enumerate all possible assignments of values to Y. Expressions that are not satisfiable in conjunction represent insoluble join conditions between the goals.

When creating nodes for the rule, one could consider the combination ¬B : x1=a, x2=b and C : x1≠a. Goal nodes are created for these (g12 : B : X=a, Y=b and g22 : ¬C : Y≠a, respectively) and since X=a ∧ Y=b ∧ Y≠a is satisfiable, a rule node r2 : X=a, Y=b is created and edges are drawn from the rule node to each goal node and from each goal to the corresponding tuple node. To contrast, the combination ¬B : x1=b, x2=a and C : x1≠a would not be satisfiable after renaming and conjunction.

Consider the (valid) rule firing A(a) :- B(a, b), ¬C(b). In constructing the game, the node ¬B : x1=a, x2=b is used for the first goal as this node has the only expression to agree with B(a, b). A goal node is created signifying the use of this conjunction in the context of this goal: g12 : B : X=a, Y=b. Consider the conjunction of the expressions of nodes g12 : B : X=a, Y=b and g22 : ¬C : Y≠a. It can be satisfied, so a rule node is created representing this combination of goal nodes. The corresponding expression is the simplified conjunction of all the goal expressions used.

The rule firing r2 : X=a, Y=b is lost because both the connected goal nodes g12 and g22 are won (ultimately because B(a, b) is in the EDB and C(b) is not, respectively).

An expression for A/1 is generated by disjuncting all the expressions for rule nodes deriving into A/1.⁵ This expression is then restricted to X (yielding X=a ∨ X=b ∨ X≠a). Orthogonalization ensures that each tuple will correspond to a single conjunction: (X=a) ∨ (X=b) ∨ (X≠a, X≠b).
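The orthogonalization step for this unary case can be sketched in Python as follows. This is an illustrative stand-in, not the procedure used in the authors' prototype: it assumes a single variable and an unbounded domain, emits one equality disjunct per mentioned constant that the input admits, and one "all other values" disjunct if the input admits values that are not mentioned. The names `admits` and `orthogonalize` are hypothetical.

```python
def admits(conj, value):
    """True if `value` satisfies every (op, const) atom of the conjunction."""
    return all((value == c) if op == "=" else (value != c) for op, c in conj)

def orthogonalize(disjuncts):
    """Rewrite a unary DNF so that every value is admitted by at most one disjunct."""
    consts = sorted({c for conj in disjuncts for _, c in conj})
    fresh = object()  # stands for any value not mentioned in the constraints
    result = [[("=", c)] for c in consts
              if any(admits(conj, c) for conj in disjuncts)]
    if any(admits(conj, fresh) for conj in disjuncts):
        result.append([("!=", c) for c in consts])
    return result

# X=a or X=b or X!=a, as obtained after restricting to X.
dnf = [[("=", "a")], [("=", "b")], [("!=", "a")]]
ortho = orthogonalize(dnf)
print(ortho)
```

On the example this yields the three disjuncts (X=a), (X=b), and (X≠a, X≠b), so each tuple A(x) corresponds to exactly one tuple node.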

⁴ All other variables are replaced with true.
⁵ This yields (Y≠b, X=a, Y≠a) ∨ (X=a, Y=a) ∨ (X=a, Y=b) ∨ (X≠a, Y≠a) ∨ (X=b, Y=a) ∨ (X≠a, X≠b, Y=a)

5 missing edges, 9 minimal combinations

 

A. Why-Not 3hop(c, a) Dissected

Consider the input graph in Fig. 1a and its why-not provenancefor 3hop(c, a) in Fig. 2. The graph encodes the reasons why3hop(c, a) is not in the answer. Moving from the lost 3hop(c, a) inFig. 2, there are nine possible rule instantiations r1(c, a, z1, z2), allof which represent a reason why there is no 3hop(c, a) via interme-diate nodes z1, z2 2 {a, b, c}. To better understand these why-notexplanations, consider the input graph in Fig. 7. It contains the orig-inal database instance I plus a number of hypothetical (or missing)edges (dotted), with labels t, u, v, w, and x. These missing edgescorrespond to the failed leaf nodes in Fig. 2. The table in Fig. 6contains the why-not provenance, with different combinations ofmissing edges as preconditions for a derivation of 3hop(c, a).

a p

b

q

c

u

r

x

s

t

w

v

Figure 7: Input graph I with five additional, hypothetical edges (dashed).

B. Constraint Game Construction

Consider the query QABC

. To build the game, each ground tu-ple in the program such as B(a, b) is replaced by a constraintB:x1=a, x2=b (a conjunction).

First, the subgraph for EDB predicates is created. The remainderof the game is constructed iteratively similar to query execution.For rules whose subgoals are all on EDB predicates, goal/rulenodes/edges are generated. For IDB predicates that were only inthe head of EDB-only rules, tuple nodes are generated. Goal andrule nodes/edges are added for rules when the subgraph for all theirsubgoals has been generated, and for predicates when the subgraphfor all the rules deriving into it has been generated.

For each EDB predicate, an expression is generated that is adisjunction of all tuples in the predicate. This expression and itsnegation are both processed to produce orthogonal DNF expres-sions (i.e., the conjunction of any two disjuncts in the expression isunsatisfiable). Tuple nodes t+= P : c and t�= !P : c and an edge(t�, t+) are added to the graph for each disjunct in the constraint.

Those EDB nodes created from a positive expression disjunctare connected negative to positive and positive to a new sink node.Those from a negative disjunct are connected negative to positive,the positive node being a sink.

Orthogonalization is applied to the tuple constraints to ensurethat each variable-free tuple is admitted by exactly one node.

Rule nodes are created to which connect IDB tuple nodes for thehead predicate and which connect to goal nodes representing theuses of predicates in subgoals of the rule. A rule node is generatedfor each combination of body tuple nodes such that, if variablesin the tuple node constraints were renamed as in the rule, theconstraints would be satisfiable when conjuncted. The rule nodeis given this simplified conjunction as a constraint, each goal nodeis created with an edge to its originating tuple node, and the rulenode is connected to all these goals.

When all rules deriving a predicate have been processed, tuplenodes for the predicate are created. All constraints for rule nodescorresponding to these rules are disjuncted and this expression is

restricted to the variables in the rule node.4 This expression is thentreated like that of an EDB predicate: it is simplified and convertedto orthogonal DNF. A pair (positive and negative) of tuple nodesis created for each disjunct in the DNF. Edges are created frompositive tuple nodes to rule nodes if the tuple node constraint (withvariables renamed appropriately) when conjuncted with the rulenode constraint can be satisfied.

A player selecting a goal node for goal g with conjunction eargues that a tuple agreeing with e can be used to satisfy g. A playercurrently ‘at’ a rule node is fighting the implicit claim that this rulefiring is satisfied and creates the tuple in question. To rebut thisclaim, the player moves to a goal node claimed to be unsatisfied.The goal, if unsatisfied, will be lost; the rule node will be won iffat least one goal is unsatisfied. This provides the desired semanticsfor the rule node.

A detailed example using the game in Fig. 5 can be found in thenext section.

Constraint provenance games improve grounded provenancegames by making them domain independent. To return to our mo-tivating example, consider Fig. 5. Observe that the won/lost statesare effectively the same as in Fig. 3c, but compressed into constraintnodes that apply to more than one tuple. If one is interested in whythe firing r2(b, c) was not sufficient to derive A(b), then one justhas to find the node admitting this rule firing (r2 : X 6=a, Y 6=a).The subgraph of this node reachable using provenance edges willexplain why rule firings admitted by this node are invalid.

Example Consider the example QABC

corresponding to the con-straint game in Fig. 5. After all EDB facts of B and C have been pro-cessed, the rule is processed. Intuitively, a way to show the presenceof A(X) is to select a node which represent the presence of tuplesin B and a node for the absence of tuples in C, which conjunctivelycorrespond to a valid rule firing deriving A(X). This is equivalentto evaluating the 9Y from the game template (see Fig. 3a) withouthaving to enumerate all possible assignments of values to Y . Ex-pressions that are not satisfiable in conjunction represent insolublejoin conditions between the goals.

When creating nodes for the rule, one could consider the com-bination !B : x1=a, x2=b and C : x1 6=a. Goal nodes are createdfor these (g12 : B : X=a, Y=b and g22 : !C : Y 6=a, respectively) andsince X=a^Y=b^Y 6=a is satisfiable, a rule node r2 : X=a, Y=bis created and edges are drawn from the rule node to each goal nodeand from each goal to the corresponding tuple node. To contrast, thecombination !B : x1=b, x2=a and C : x1 6=a would not be satisfi-able after renaming and conjunction.

Consider the (valid) rule firing A(a) :� B(a, b),¬C(b). In con-structing the game, the node !B : x1=a, x2=b is used for the firstgoal as this node has the only expression to agree with B(a, b). Agoal node is created signifying the use of this conjunction in thecontext of this goal: g12 : B:X=a, Y=b. Consider the conjunctionof the expressions of nodes g12 : B:X=a, Y=b and g22 : !C:Y 6=a. Itcan be satisfied, so a rule node is created representing this combina-tion of goal nodes. The corresponding expression is the simplifiedconjunction of all the goal expressions used.

The rule firing r2:X=a, Y=b is lost because both the con-nected goal nodes g12 and g22 are won (ultimately because B(a, b)is in the EDB and C(a) is not, respectively).

An expression for A/1 is generated by disjuncting all the ex-pressions for rule nodes deriving into A/1.5 This expression is thenrestricted to X (yielding X=a _ X=b _ X 6=a). Orthogonaliza-tion ensures that each tuple will correspond to a single conjunction:(X=a) _ (X=b) _ (X 6=a, X 6=b).

⁴ All other variables are replaced with true.
⁵ This yields (Y≠b, X=a, Y≠a) ∨ (X=a, Y=a) ∨ (X=a, Y=b) ∨ (X≠a, Y≠a) ∨ (X=b, Y=a) ∨ (X≠a, X≠b, Y=a)

+  …  ?  

Constraints imply 15 disjoint relations over key variables X, Z1, Z2, Y

 

Oh Boy!

68  

Provenance Games: Summary
•  (1) Game Provenance
  –  The win-move game has a natural why and why-not provenance “built-in”
    •  “good” and “bad moves”
    •  → discard bad moves → game provenance
•  (2) Provenance Games
  –  Query evaluation also is a game!
  –  Game provenance can be applied to query evaluation game => uniform why + why-not provenance
•  (3) Constraint Provenance
  –  Domain independent (some infinite domains OK)
  –  Prototypically implemented
•  (4) Future Work
  –  Make theory practical!
    •  e.g. implement in Boris Glavic’s Perm or GProM system
  –  Theoretical properties
  –  Relation to Argumentation Frameworks
  –  Clarify relationship to monus semirings (Floris Geerts et al.)
  –  Higher-order reasons!
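Point (1) can be illustrated with a small sketch of solving a win-move game and discarding bad moves. It is an assumption-laden illustration, not the authors' implementation: a position counts as won if some move leads to a lost position, and as lost if every move leads to a won position (including having no moves at all). Treating won→lost, lost→won and drawn→drawn as the "good" moves is an assumption made here, and the function names are hypothetical.

```python
def solve(moves):
    """Label every position 'won', 'lost', or 'drawn' for the player to move."""
    positions = set(moves) | {q for succ in moves.values() for q in succ}
    label = {}
    changed = True
    while changed:
        changed = False
        for p in positions:
            if p in label:
                continue
            succ = moves.get(p, ())
            if any(label.get(q) == "lost" for q in succ):
                label[p] = "won"    # some move reaches a lost position
                changed = True
            elif all(label.get(q) == "won" for q in succ):
                label[p] = "lost"   # no moves, or every move reaches a won position
                changed = True
    for p in positions:
        label.setdefault(p, "drawn")  # never decided: drawn
    return label


# Assumption: these label pairs are the "good" moves kept as game provenance.
GOOD = {("won", "lost"), ("lost", "won"), ("drawn", "drawn")}


def good_moves(moves, label):
    """Keep only the good moves; everything else is discarded as a bad move."""
    return {(p, q) for p, succ in moves.items() for q in succ
            if (label[p], label[q]) in GOOD}


# Hypothetical example game: a -> b -> d, a -> c, and a cycle e <-> f.
moves = {"a": ["b", "c"], "b": ["d"], "c": [], "d": [], "e": ["f"], "f": ["e"]}
label = solve(moves)
kept = good_moves(moves, label)
```

In this example the move a→b goes from a won position to another won position, so it is discarded, while a→c (won→lost) is kept.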

69  

Why-Not: so many answers, so little time
•  The crux of current why-not approaches:
  –  Enumerate all ways that could/might have worked, but failed…
•  Idea → abstract those many, many explanations!

TaPP’15  

70  

Conclusions
•  Provenance is an active and broad area of research
  –  … in databases
  –  … in scientific workflows
  –  Both in specialized (TaPP, IPAW) and mainstream venues (VLDB, SIGMOD, EDBT, ICDE, PODS, ICDT, ...)
•  Great topics in theory, practice/engineering or both!

71  

YesWorkflow References
•  http://yesworkflow.org
•  T. McPhillips, S. Bowers, K. Belhajjame, B. Ludäscher (2015). Retrospective Provenance Without a Runtime Provenance Recorder. 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15).
•  T. McPhillips, T. Song, T. Kolisnik, S. Aulenbach, K. Belhajjame, R.K. Bocinsky, Y. Cao, J. Cheney, F. Chirigati, S. Dey, J. Freire, C. Jones, J. Hanken, K.W. Kintigh, T.A. Kohler, D. Koop, J.A. Macklin, P. Missier, M. Schildhauer, C. Schwalm, Y. Wei, M. Bieda, B. Ludäscher (2015). YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. International Journal of Digital Curation 10, 298-313.

72  

Why-Not Provenance References
•  Köhler, Sven, Bertram Ludäscher, and Daniel Zinn. "First-order provenance games." In Search of Elegance in the Theory and Practice of Computation. Peter Buneman Festschrift, LNCS 8000. Springer Berlin Heidelberg, 2013.
•  Riddle, Sean, Sven Köhler, and Bertram Ludäscher. "Towards constraint provenance games." 6th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2014).
•  Glavic, Boris, Sven Köhler, Sean Riddle, and Bertram Ludäscher. "Towards constraint-based explanations for answers and non-answers." 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2015).

73