Resources Computing International Ltd
Incomplete and missing data in geoscience databases
Towards the OWA relational model ?
Stephen Henley
Presented at the eSI workshop The Closed World of Databases Meets the Open World of the Semantic Web, Edinburgh, 12-13 Oct 2006
Not just geoscience
• The title says this is about geoscience – but the conclusions are much more widely applicable
Geoscience data
• Very commonly may be
  – Imprecise
  – Incomplete
  – Missing
Typical imprecise data
Sample SiO2 % Cu ppm
#101 53.5 128
#102 49.2 185
#103 66.3 163
What is “49.2% SiO2” ?
• A recorded value from a laboratory
• Imprecise: the true value could be 49.2%, 49.21% or 48.55%?
• because of instrumental errors
• and because of sampling errors
• The full data item should include “49.2” AND data about the error distribution
Each value has its own error distribution
Sample SiO2 % Cu ppm
#101 ~53.5 ~128
#102 ~49.2 ~185
#103 ~66.3 ~163
What about queries ?
• Given the SiO2 value of “~ 49.2”
  – Query “WHERE SiO2 > 50 …”
  – Not TRUE or FALSE but P = 0.317 (for example)
• So the simple 2VL does not apply – instead a continuous scale of probability estimates from P=0 (FALSE) to P=1 (TRUE)
• Related to ‘fuzzy logic’ – but let’s not go there today !
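As a rough illustration of how such a probability could be computed, the sketch below assumes the error on a recorded value is normally distributed. The slides give no error model, so the function name `prob_greater_than` and the standard deviation of 1.68 are assumptions made for this example; 1.68 is chosen only so that the result lands near the illustrative P = 0.317.

```python
import math


def prob_greater_than(recorded: float, sigma: float, threshold: float) -> float:
    """P(true value > threshold), assuming a normal error distribution
    centred on the recorded value with standard deviation sigma."""
    z = (threshold - recorded) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


# Hypothetical error model for sample #102: SiO2 ~ Normal(49.2, 1.68)
p = prob_greater_than(recorded=49.2, sigma=1.68, threshold=50.0)
print(f"P(SiO2 > 50) = {p:.3f}")
```

A real implementation would take the error distribution from the laboratory's stated precision and the sampling design rather than from a single assumed sigma.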
Incomplete data
Hole_ID Total_Depth D_green
#301 320.0 250.0
#302 300.0 270.0
#303 200.0 Unknown ?
Incomplete data
Hole_ID Total_Depth D_green
#301 320.0 250.0
#302 300.0 270.0
#303 200.0 > 200.0
Incomplete data
• This value “>200.0” is semi-quantitative
• (another similar example – “below detection limit” in chemical analysis data – e.g. “< 5 ppm”)
• It is not a NULL so Chris Date ought to be quite happy about it
• BUT queries will not always give unambiguous TRUE or FALSE
Querying a tuple with D_green value “>200”
• WHERE D_green > 150 … TRUE
• WHERE D_green < 100 … FALSE
• WHERE D_green > 250 … UNKNOWN
• WHERE D_green < 250 … UNKNOWN
• So 2VL (true/false) is inadequate here also
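The four outcomes above can be reproduced with a small sketch that stores a semi-quantitative value as an open interval and returns `None` for UNKNOWN. The class `Bounded` and the functions `greater_than` and `less_than` are hypothetical names introduced for this illustration, not part of any existing DBMS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bounded:
    """A value known only to lie strictly between low and high."""
    low: float = float("-inf")
    high: float = float("inf")


def greater_than(value: Bounded, threshold: float) -> Optional[bool]:
    """Three-valued 'value > threshold': True, False, or None (UNKNOWN)."""
    if value.low >= threshold:
        return True    # every possible value exceeds the threshold
    if value.high <= threshold:
        return False   # no possible value exceeds the threshold
    return None        # some possible values do, some do not


def less_than(value: Bounded, threshold: float) -> Optional[bool]:
    """Three-valued 'value < threshold': True, False, or None (UNKNOWN)."""
    if value.high <= threshold:
        return True
    if value.low >= threshold:
        return False
    return None


d_green = Bounded(low=200.0)          # the recorded ">200.0" for hole #303
print(greater_than(d_green, 150.0))   # True
print(less_than(d_green, 100.0))      # False
print(greater_than(d_green, 250.0))   # None
print(less_than(d_green, 250.0))      # None
```

The same representation would cover a "below detection limit" value such as "< 5 ppm", written here as `Bounded(high=5.0)`.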
Missing data
• Very often there are genuine gaps in data sets, for many possible reasons
  – Samples not collected
  – Observations not taken
  – Instrumental malfunction
  – . . . 1001 other possible reasons
• These gaps may be single data items or whole rows (tuples)
Missing data item
Sample SiO2 % Cu ppm
#101 53.5 128
#102 - 185
#103 66.3 163
Missing data item
• We know sample #102 must have a SiO2 value – we just don’t know what it is
• So this value is just “missing”. It’s not “inapplicable” (which might justify re-designing the database)
• If we use Chris Date’s ‘CWA relational’ model then we are not allowed ‘NULL’ so how do we represent this ?
CWA: Avoiding NULL
• Several proposed methods to get around the prohibition of NULL, including
  – Default-value solutions (Chris Date)
  – Other suggestions (Hugh Darwen and Fabian Pascal)
The default-value ‘solution’ as proposed by Date
• Instead of a global ‘null’
• A default value defined separately for each domain
• If a legitimate value for the domain, how are missing values distinguished from actual values ?
• If not a legitimate value for the domain, it’s just another sort of ‘null’ – no better, but more complicated, so in fact worse
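The dilemma can be made concrete with a small sketch. Both default values below (`0.0` as an in-domain default and `-1.0` as an out-of-domain default for SiO2 %) are hypothetical choices made for this illustration, not values proposed by Date.

```python
from statistics import mean

DEFAULT_IN_DOMAIN = 0.0       # hypothetical default that is also a legal SiO2 %
DEFAULT_OUT_OF_DOMAIN = -1.0  # hypothetical default outside the legal range

# SiO2 % for three samples; the value for #102 is missing in both versions.
in_domain = {"#101": 53.5, "#102": DEFAULT_IN_DOMAIN, "#103": 66.3}
out_of_domain = {"#101": 53.5, "#102": DEFAULT_OUT_OF_DOMAIN, "#103": 66.3}

# Case 1: nothing distinguishes the default from a genuine 0.0 measurement,
# so a naive aggregate silently treats it as real data.
naive_mean = mean(in_domain.values())

# Case 2: the default is recognisable, but every query has to test for it
# explicitly, which is the same burden that a null imposes.
known = [v for v in out_of_domain.values() if v != DEFAULT_OUT_OF_DOMAIN]
filtered_mean = mean(known)

print(f"mean with in-domain default:   {naive_mean:.2f}")
print(f"mean after filtering sentinel: {filtered_mean:.2f}")
```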
Proposals by Darwen and Pascal
• Different in detail, but both involve decomposition to ‘hide’ the missingness of data values
Decompose into ‘null-free’ relations
Original relation:
Sample SiO2 % Cu ppm
#101 53.5 128
#102 - 185
#103 66.3 163

Decomposed relations:
Sample SiO2 %
#101 53.5
#103 66.3

Sample Cu ppm
#101 128
#102 185
#103 163
In this way …
• We certainly get rid of the ‘NULL’
• Any missing data item is expressed instead as a missing tuple in a binary relation
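A minimal sketch of this decomposition follows, using Python sets of tuples as stand-ins for relations. The helper name `decompose` is an assumption for this example, and `None` is used only to mark the missing SiO2 value in the input.

```python
# Original relation; None marks the missing SiO2 value of sample #102.
samples = [
    ("#101", 53.5, 128),
    ("#102", None, 185),
    ("#103", 66.3, 163),
]


def decompose(rows, column):
    """Build the binary relation {(sample, value)} for one attribute,
    omitting any row whose value for that attribute is missing."""
    return {(row[0], row[column]) for row in rows if row[column] is not None}


sio2 = decompose(samples, 1)   # two tuples: #102 has no entry
cu = decompose(samples, 2)     # three tuples

print(sorted(sio2))
print(sorted(cu))
```

The missing item has not gone away; it has become the absence of a tuple for #102 in `sio2`.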
The CWA states that …
• where r is any relation and t is any possible tuple that conforms to the heading of r :-
• If t appears in the body of r, then it is a true instantiation of the predicate (i.e. the corresponding proposition is considered to be true);
• conversely, if t does not appear in the body of r, then it is a false instantiation (i.e. the corresponding proposition is considered to be false)
– Date & Darwen 1998, 2000, …
…. so
• Under the CWA, any tuple that is legitimate but is missing is assumed to represent a FALSE proposition.
• So what about our decomposed ‘null-free’ relations ? …
No tuple for sample #102 in the SiO2 relation
Sample SiO2 %
#101 53.5
#103 66.3

Sample Cu ppm
#101 128
#102 185
#103 163
Under the CWA …
• There are infinitely many possible legitimate tuples for sample #102: for example

Sample SiO2 %
#102 51.2
or
Sample SiO2 %
#102 45.5

• But NONE of them is included
• So ALL are interpreted as FALSE propositions
• This means that under the CWA interpretation, for sample #102 there is NO acceptable value of SiO2 – it does not mean that the value is merely unknown.
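The difference between the two readings can be sketched as follows. The function names `holds_cwa` and `holds_owa` are hypothetical, and the open-world version is only one possible interpretation: it assumes each sample has exactly one SiO2 value, so a recorded value rules out all others, while a sample with no tuple at all yields UNKNOWN (`None`).

```python
from typing import Optional

# The decomposed, 'null-free' SiO2 relation: no tuple for sample #102.
sio2 = {("#101", 53.5), ("#103", 66.3)}


def holds_cwa(relation, candidate) -> bool:
    """Closed world: a tuple absent from the relation is a FALSE proposition."""
    return candidate in relation


def holds_owa(relation, candidate) -> Optional[bool]:
    """Open world (assumed reading): True if recorded, False if the sample
    has a different recorded value, None (UNKNOWN) if it has no tuple."""
    if candidate in relation:
        return True
    sample = candidate[0]
    if any(s == sample for s, _ in relation):
        return False   # the sample has a recorded value, and it differs
    return None        # nothing recorded for this sample


for value in (51.2, 45.5):
    candidate = ("#102", value)
    print(candidate, holds_cwa(sio2, candidate), holds_owa(sio2, candidate))
```

Under `holds_cwa` every candidate value for #102 comes back False, which is the "no acceptable value" conclusion described above; under `holds_owa` each comes back UNKNOWN.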
This implies that …
• If we have any missing data, then the CWA is not appropriate.
• Does this mean we can’t use the relational model for geoscience data ?
• Of course not. Just that the narrow ‘CWA’ version of relational, defined by Date, Darwen, & Pascal, is inadequate
• - but is that really the only game in town ?
Codd was right
• We need to revert to the “true” relational model as defined by Codd – which ALLOWS for the reality, that there will always be missing and incomplete data
• Codd’s 1979 RM/T paper and his 1990 book leave many unanswered questions – but they do allow us to use the open world assumption
• This does not restrict us to 2VL but uses a 3VL - including truth value UNKNOWN.
So let’s take a look at the truth tables
• 2VL – two valued logic for CWA
• Then extended to allow for probabilities
• Then 3VL as needed by OWA
CWA - 2VL
NOT
T    F
F    T

AND  T  F
T    T  F
F    F  F

OR   T  F
T    T  T
F    T  F
T represents TRUE
F represents FALSE
2VL with probabilities
NOT
T      F
p      1-p
F      T

AND    T      p(A)     F
T      T      p(A)     F
p(B)   p(B)   p(A∧B)   F
F      F      F        F

OR     T      p(A)     F
T      T      T        T
p(B)   T      p(A∨B)   p(B)
F      T      p(A)     F

T represents p=1; F represents p=0
p(A∧B), p(A∨B) in general need statistical computation
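To see why the joint probabilities for AND and OR need more than the two marginal values, the sketch below computes them under an assumed independence of the two propositions and also gives the range that is possible when nothing is known about their dependence. The function names are hypothetical, and independence is an assumption of this example, not a claim about real analytical data.

```python
def and_independent(p_a: float, p_b: float) -> float:
    """p(A AND B) if A and B are assumed independent."""
    return p_a * p_b


def and_bounds(p_a: float, p_b: float) -> tuple:
    """Lowest and highest p(A AND B) consistent with the two marginals
    when the dependence between A and B is unknown."""
    return max(0.0, p_a + p_b - 1.0), min(p_a, p_b)


def or_from_and(p_a: float, p_b: float, p_and: float) -> float:
    """p(A OR B) = p(A) + p(B) - p(A AND B)."""
    return p_a + p_b - p_and


p_a, p_b = 0.317, 0.5
p_and = and_independent(p_a, p_b)
print(f"independent: AND = {p_and:.4f}, OR = {or_from_and(p_a, p_b, p_and):.4f}")
print("possible range for AND:", and_bounds(p_a, p_b))
```

When one operand is certain (p = 1 or p = 0) the range collapses to a single value, which is why the T and F rows of the tables need no extra computation.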
OWA - 3VL
T represents TRUE
F represents FALSE
U represents UNKNOWN
NOT
T    F
U    U
F    T

AND  T  U  F
T    T  U  F
U    U  U  F
F    F  F  F

OR   T  U  F
T    T  T  T
U    T  U  U
F    T  U  F
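These three-valued tables are short to implement. The sketch below uses Python's `None` for UNKNOWN; the function names `not3`, `and3` and `or3` are assumptions made for this illustration.

```python
from typing import Optional

Truth = Optional[bool]   # True, False, or None for UNKNOWN


def not3(a: Truth) -> Truth:
    """NOT: UNKNOWN stays UNKNOWN."""
    return None if a is None else (not a)


def and3(a: Truth, b: Truth) -> Truth:
    """AND: a single FALSE decides the result even if the other is UNKNOWN."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a: Truth, b: Truth) -> Truth:
    """OR: a single TRUE decides the result even if the other is UNKNOWN."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


print(and3(None, False))   # False
print(or3(None, True))     # True
print(and3(None, True))    # None
```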
Conclusions
• If any data are imprecise, incomplete, or missing, then CWA and 2VL are inadequate
• Imprecise data need a probabilistic approach – is this an extension of CWA / 2VL ?
• If we have any incomplete (e.g. truncated) or missing data we need OWA / 3VL
Conclusions
• A database is not about what IS, but about what IS KNOWN.
• Perfectly reasonable to use the CWA about what IS –
• – but not about what IS KNOWN – precisely because ‘I don’t know’ has to be a valid answer: hence truth value UNKNOWN must be legal
Conclusions
• 3VL need not be scary. It isn’t actually much more complex than 2VL
• Relational databases can use the richness of the OWA. We just need to do it right.
• See www.OpenWorldDBMS.com
Some final words from E. F. Codd (1990)
• In developing the relational model, I have tried to follow Einstein’s advice, “Make it as simple as possible, but no simpler”. I believe that in the last clause he was discouraging the pursuit of simplicity to the extent of distorting reality.
• Is insistence on CWA and 2VL perhaps distorting reality ? A little TOO simple ?