Maybe - Just Maybe: the problem of partially missing data
This month we continue looking in a little more depth at the problems of managing geological databases that contain missing and partially missing data.
"Select all sample data where the gold grade is greater than 10 grams per tonne". This is a simple request, which can be translated directly into SQL to perform a straightforward retrieval operation on a table within a database. That is the first impression, at least. A little more thought shows, however, that it is not so simple at all. Let us imagine that one of the records retrieved contains a gold assay of 11 grams per tonne. This matches the selection criterion, doesn't it? Well, not exactly. A gold assay of 11 grams per tonne might reflect a gold grade in the sample of 11 grams per tonne. Or the sample's real grade might be 5 g/t, or 9 g/t, or 12 g/t, or 20 g/t. The real grade is actually unknown. The assay result, whether it is a single determination or the average of several replicate determinations, can never be more than an estimate of the true gold grade of the sample.
All SQL database management systems were developed primarily for handling business data, where the precision (if not the accuracy) of numeric data can be relied on. Thus a similar-looking request "Select all departments where the number of employees is greater than 10" presents no problems.
So how can we address the gold assay database problem? (1) We could ignore the problem and do a simple SQL retrieval. This is what existing mining software systems would do if they rely on ODBC database connections. There is no harm in doing this, as long as the user interprets the results correctly: the data set returned contains all samples whose reported gold grade is greater than 10 g/t. (2) We could look into the statistics of sampling and assaying error. If we want to be reasonably sure of finding all samples whose real gold grade is likely to be greater than 10 g/t, then this is what we must do. The statistics will let us define some lower reported grade (say 7 g/t) at which we can be sure, to some degree of confidence (say 90%), that we have retrieved all the samples we want. Of course we also retrieve many samples whose real gold grade is below 10 g/t - but from reported grades, which are all we have, there is no way to tell which is which. Given this lower reported grade, we can substitute it in the SQL statement and again use a standard database SELECT.
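As a sketch of approach (2): if we assume the reported grade is normally distributed about the true grade with a known relative standard deviation (the 20% figure below is purely illustrative, not from any particular laboratory), the lowered cutoff follows directly from the normal quantile:

```python
from statistics import NormalDist

def retrieval_cutoff(target_grade, relative_sd, confidence):
    """Reported-grade cutoff that retrieves a sample whose TRUE grade
    equals target_grade with the given probability, assuming the
    reported grade is Normal(true, relative_sd * true)."""
    z = NormalDist().inv_cdf(confidence)   # one-sided z-score
    return target_grade - z * relative_sd * target_grade

# With a 20% relative assay error and 90% confidence, the 10 g/t
# request becomes a retrieval at roughly 7.4 g/t, which can then be
# substituted into an ordinary SQL statement:
cutoff = retrieval_cutoff(10.0, 0.20, 0.90)
query = f"SELECT * FROM assaytable WHERE Au > {cutoff:.2f}"
```

The table and column names here are the article's own illustrative ones; the error model is an assumption for the sketch.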
This is all very well for a simple retrieval, but what if we want to do something more complex? Perhaps we want to compare the grades of two different grain-size fractions. We could of course do a simple SQL comparison (e.g. "SELECT * FROM assaytable WHERE Au1 > Au2"). This would work, but of course only on reported values, not on the unknown real grades. If we know something of the sample statistics, we could use a statistical method such as Student's t-test; if we know very little, we could use non-parametric statistics. But what can we do if one or both of the grades are reported as "below detection limit" or "trace"? What would SQL make of such a value in Au1, or Au2, or both?
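When the distributional assumptions behind a t-test are doubtful, one non-parametric option is the paired sign test, which needs nothing beyond counting. The sketch below (function and variable names are my own, not from any mining package) compares two size fractions sample by sample:

```python
from math import comb

def sign_test(pairs):
    """Two-sided sign test for paired observations (au1, au2).
    Returns the p-value for the hypothesis that neither fraction
    tends to run higher than the other. Tied pairs are discarded."""
    diffs = [a - b for a, b in pairs if a != b]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)          # pairs where Au1 > Au2
    tail = min(k, n - k)
    # Two-sided tail probability under Binomial(n, 1/2)
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(p, 1.0)

# Illustrative data: fraction 1 runs higher in 9 of 10 samples,
# giving p = 22/1024, about 0.021.
pairs = [(12, 8), (15, 9), (7, 5), (11, 6), (9, 4),
         (14, 10), (8, 3), (13, 7), (10, 6), (2, 5)]
```

Note that this, too, operates only on reported values; it sidesteps the distribution of assay error, not the error itself.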
Indeed, 'trace' or 'below detection limit' is a classic example of partially missing data. There is some information - the grade is known to be very low - but not enough to allow numerical comparison between the 'trace' values of different samples: "trace < trace" evaluates to unknown. Numerical comparisons between trace and actual reported values are, however, certainly possible: "trace < 10 g/t" evaluates to true. So 'trace' cannot simply be treated as a 'null'.
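One way to encode exactly this behaviour (a sketch only - TRACE is a sentinel invented for illustration, not a feature of any DBMS) is a comparison that can return True, False, or None for 'unknown':

```python
TRACE = object()   # sentinel for 'trace' / below-detection-limit values

def grade_less_than(a, b):
    """Three-valued 'a < b' for grades that may be TRACE.
    Returns True, False, or None (meaning unknown)."""
    if a is TRACE and b is TRACE:
        return None        # trace < trace: unknown
    if a is TRACE:
        return True        # trace is below any reported grade
    if b is TRACE:
        return False       # a reported grade is never below trace
    return a < b           # ordinary numeric comparison
```

This mirrors the rules stated above: `grade_less_than(TRACE, 10.0)` is True, while `grade_less_than(TRACE, TRACE)` is None, so any query logic built on it must decide explicitly what to do with the unknown case.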
So how do we deal with such cases in a database? Conventional logic allows for just two truth values, true and false, leaving no room for uncertainty and no provision for missing information. It was recognised very early that this was inadequate for database management systems, and the NULL concept was introduced. SQL provides a three-valued logic ('3VL') solution, in which most logical operations involving NULLs yield a third logic value, unknown.

Unfortunately, as C.J. Date demonstrates, standard SQL offers incomplete support for the 3VL model and as a result can lead to serious database integrity problems. His preferred solution is to use only 2VL and a 'default value' approach: the database designer is responsible for defining one or more special values in each column, each with a set of operations that are allowed. E.F. Codd argues convincingly that it is an abdication of the database management system's responsibility to maintain integrity if such a crucial role is left to the user or the application to define on a case-by-case basis. His approach is to extend the logic framework beyond even the 3VL of SQL: Codd introduces a four-valued logic (4VL) model containing two types of NULL ('missing but applicable' and 'missing and inapplicable').

This may be adequate for most situations in commercial database management. However, I would argue that it should be generalised to an n-valued logic model in order to handle the more complex situations of partially missing data in fields such as the geosciences. At present, much of the intelligence needed for managing such data is developed on a case-by-case basis within individual applications.
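SQL's 3VL behaviour is easy to demonstrate: a row whose grade is NULL is excluded both by "Au > 10" and by its negation, because both predicates evaluate to unknown. A minimal sketch using SQLite (table and column names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assaytable (sample TEXT, au REAL)")
conn.executemany("INSERT INTO assaytable VALUES (?, ?)",
                 [("A", 12.0), ("B", 4.0), ("C", None)])  # C: missing assay

high = [r[0] for r in conn.execute(
    "SELECT sample FROM assaytable WHERE au > 10")]
low = [r[0] for r in conn.execute(
    "SELECT sample FROM assaytable WHERE au <= 10")]

# Sample C appears in NEITHER result set: 'NULL > 10' and
# 'NULL <= 10' both evaluate to unknown, so the row is dropped.
```

This silent disappearance of rows is one symptom of the integrity problems Date describes: the union of two apparently complementary queries is not the whole table.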
Next month we look more specifically at some of the problems associated with spatial data.
References
Codd, E.F., 1990: The Relational Model for Database Management: Version 2. Addison-Wesley, p. 203-204.
Date, C.J., 1995: Relational Database Writings, 1991-1994. Addison-Wesley, chapter 9, p. 341-362.
Copyright © 2003 Stephen Henley
Earth Science Computer Applications, v.18, no.5, p.1-2