Silicon Dale

Database Management Systems and Data Models

The closest that many geoscientists come (or want to come) to database management systems (DBMS) is the Microsoft Access that comes bundled with their Office suite, or a packaged ODBC-compliant system sitting underneath a mining applications suite. Yet effective management of their geological data is vital for all exploration and mining projects.

From the 1970s onwards, database management has been and remains an intensely fought-over battlefield. The original protagonists were, in one corner, hierarchical and network DBMSs following international CODASYL standards, and in the other corner, relational systems following (more or less) the principles first developed by Ted Codd in 1970. During the 1980s the relational systems came to dominate the marketplace, largely by default as the older COBOL-based hierarchical systems became obsolete with the mainframe computers which hosted them. In the late 1980s and the 1990s a new army of contenders arose - the so-called >object-oriented= systems, and a wide range of offspring resulted from the hybridisation of these, as >object-relational= systems.

There are, however, many misconceptions, and many corruptions of the relational principles. Furthermore there is still not a single commercial DBMS which can be called truly relational in that it implements the full set of relational features identified in Codd=s 1990 book >The Relational Model for Database Management, version 2'.

All DBMSs (should) contain at least three components:-

  1. a schema, which contains a description of the data within the database.
  2. one or more files or tables containing the data
  3. a set of program modules which carry out functions on the data - selection, update, validation, etc.

They may range from completely rigid, predefined structures to completely open unstructured systems. CODASYL hierarchical and network systems were examples of rigid databases defined to meet pre-defined requirements of commercial organisations - the database structure was specified in detail in advance of implementation, everything was encoded in the schema, and access paths and allowable operations were controlled by the database design. In such systems, it might be very difficult or impossible to add later functionality or new types of data without major re-design, because they tended to rely on artificial keys and pointers, and limited access paths, to maximise efficiency.

Relational systems provide a great deal more flexibility in that (at least in theory) implementation at the table level could be modified to reflect changed requirements: for example it is relatively easy to add new columns or even entire new tables, linked to an existing database through values of (real) key fields. Artificial pointers and data-specific access methods are frowned upon, though some systems do allow them: one of the great strengths of relational systems is their data independence.

In relational databases, all data are held in flat tables (consisting of rows and columns, with each space containing one and only one data item). Each table has its own sub-schema (data definition) and the tables are linked through the values of key fields. In strict implementations of relational database systems, every table contains one pre-defined primary key, and multiple repeated values of such keys are prohibited: the tables can be decomposed in a process known as >normalisation= into simpler tables which contain unique-valued primary keys.

Object-oriented (>OO=) systems might be regarded as >New Age= DBMSs in that they were developed from the principle that anything could be defined as an >object=, with a set of properties - which included even the operations which might be carried out upon or by the object. There was a deliberate blurring of the boundary between data and software, and also between temporary objects (involved in actual applications) and persistent objects (the stored database). One of the less fortunate results of the development of OO systems has been a loss of data independence. While relational systems will operate >out of the box= on any data, OO systems generally require a considerable amount of systems analysis and design work before they can be used.

Each type of DBMS is associated with a particular way of looking at the data universe, and a particular way of structuring it in what can loosely be called >Data Models=.

One particular type of data model which has become popular following a paper by Chen (1976) is the >entity-relationship= approach, in which >objects= are identified as either entities or relationships - the idea being that entities represent concrete things connected together by the relationships. One basic problem with this type of model is that one person=s entity is another person=s relationship. Another is that relationships are not themselves supposed to have properties or attributes, yet clearly in many cases do. Let us take a very simple geological example: two entities defined as geological formations, and a relationship defined as the boundary between them. If this is a fault, for example, it will clearly have properties such as its throw, whether it is mineralised, whether it is currently active, and the presence or absence of breccia, slickensides, etc. A pure relational data model would handle such data without any difficulty and without any need to define what is an >entity= and what a >relationship=.

Object-oriented models can be seen as an extension and generalisation of entity-relationship types of model. However, the very freedom they allow in definition of object attributes and behaviours causes a significant problem, in that it restricts the adaptability of such models to change in the data content or structure, and moreover, the presence of software in the definition of methods will necessarily limit the adaptability of such databases to changes in the programming environment (language, operating system, host machine, etc). A further very powerful criticism of the object-oriented approach is that it does not encourage rigour, but on the contrary allows the survival of very fuzzy thinking and design.

Object-relational DBMSs as proposed by Date and Darwen (1998) add a level of complexity to relational systems which (in my opinion) is simply not necessary, but is added largely to appease the current vogue for OO systems. In fact, the Date and Darwen model is in effect an extended relational model with very little of the >object-oriented= about it. They have updated Codd=s definitions without adding much to them.

So what about real DBMSs ? The most widely used, such as Oracle, Access, dBase, Paradox, Ingres, and others, are all claimed to be relational. Certainly they can all use SQL (Structured Query Language) which itself is supposedly an indicator of a relational database system. Unfortunately, SQL itself violates some of the relational principles, and fails to support others, so it is not a good yardstick. It has been found to work, in a rather clunky way, on most types of data. In one area, however - that of 2D and 3D graphical data, it has been found wanting. This is an area of very active development at present, with a number of different approaches by database suppliers such as Oracle, GIS companies such as ESRI (the suppliers of Arcview), and independent international teams (such as OpenGIS).

References:
Chen, P.P. 1976: The entity-relationship model - toward a unified view of data. ACM TODS 1:1.
Codd, E.F. 1990: The relational model for database management: version 2, Addison?Wesley, 538pp.
Date, C.J. and Darwen, H. 1998: Foundation for object/relational databases: the third manifesto, Addison?Wesley, 496pp

Stephen Henley
Matlock, England
steve@silicondale.co.uk
www.SiliconDale.com

Copyright © 2001 Stephen Henley
Database management Systems and Data Models: Earth Science Computer Applications, v.16, no.8, p.1-3