XML and Databases
If one were to believe all the hype, XML is the solution for everything from e-commerce to organising file systems to database management. It is reported that Microsoft intends to base its next generation operating system on XML, with replacement of the FAT file management system by one based on XML - and possibly imposing an XML structure on the content of files themselves. This is in the name of 'knowledge engineering', or making all file content accessible to search engines.
Long before this comes about - if it is more than Microsoft pie in the sky - there are worrying signs that many people are expecting way too much of XML. Let us look at what this three- letter acronym actually represents. XML is an eXtensible Markup Language. It is similar in structure to the familiar HTML which is used to encode most web pages, and this is no coincidence as both are subsets of the same Standard General Markup Language (SGML) itself originally developed in the 1980s for word-processing applications but found to be useful in a much wider context. In fact XML, like SGML, is a meta-language because it allows users to define their own markup tags.
The crucial difference between HTML and XML is that HTML markup tags encode (principally) information on how to represent data on screen or printer, while XML markup tags describe the content and context of a data file. For example, there is an XML implementation GML ('Geographic Markup Language') which is used for spatially coded data relevant to GIS packages, and an extension of this currently under development is XMML which contains - apart from all the GML markup - tags that are specifically relevant to exploration and mining.
The express purpose of many of these new XML 'dialects' is to facilitate data transfer, and in principle they are quite good at this. Although implementations can be a little verbose, these languages do define uniquely the structure and content of a data set which can be arbitrarily complex.
However, there have been serious suggestions that XML should be taken much further, into system-critical areas such as operating systems and databases. Let us look at the database management question in particular.
In the 1960s the first database management systems were developed. These were intended for use with commercial data which appeared generally to be hierarchic in structure, and so it seemed natural to create a database structure that mirrored this 'reality'. Very soon it was found that a hierarchic structure was insufficient. Real connections between items of data were found to cut across branches of the pre-defined trees, and so network-structured database systems were developed. By the early 1970s it was starting to become apparent that even such network-structured databases were inadequate. The key problem with both hierarchic and network DBMSs was that to access any particular data item efficiently (or at all) it was necessary to know in advance where it lay in the data structure, and to follow a pre- defined access path - navigate the hierarchy or the network - in order to retrieve it. With ever more complex databases this was fast becoming a nightmare for the support programmers.
In 1970, Dr Edgar Codd at IBM developed an alternative database concept which was based on predicate logic and assumed no pre-defined access methods or navigational paths. This was the relational data model. In this model, which has since been extended and refined further by Codd and more recently by Date, Darwen, and Pascal (see references below), data structure is defined in terms of flat tables (relations) in which each row ('tuple') represents a statement (predicate) which is asserted to be true: a particular association of data values ('employee A is of age NN and receives salary $X'). It is, of course, possible to include more than merely numeric or alphabetic data in a relational table: it may contain also many other types of data such as images, other forms of binary-coded data (eg audio clips), and even programs.
In the late 1980s and 1990s it seemed that the IT industry was becoming bored with the relational data model, and the so-called 'object-oriented' model was developed. Unfortunately, although superficially attractive in that it did not require the careful database design and relation normalisation required for relational systems, there was no consistent underlying data model. Object-oriented systems proliferated because they were easy to construct. An 'object' consists of a basket of attributes and 'methods' (pieces of program to carry out defined types of processing on the attributes). There are many types of relationship between objects, and no control over the complexity which may be defined in a collection of objects. It is as easy to produce unnavigable spaghetti as in the worst of the old network-structured DBMSs. A classic example of this was the EU 'DEEP' project in 1992-95 which was intended to define a 'standard' database structure for mineral exploration and mining data but instead led to a complex and rigid framework of objects linked by a variety of relationships. A relational model for the same purpose might have had just as many tables as DEEP had objects, but the relationships among them could be defined, flexibly, only as and when required. Furthermore, the disciplines of the relational data model (definition of key fields and normalisation of tables) would have ensured a valid database - such validation is virtually impossible in an object- oriented environment.
However useful the object-oriented paradigm might have been for development of software systems, the deficiencies of the object-oriented model for database management quickly became apparent, and a generation of DBMSs known as 'object-relational' or 'post-relational' were developed. The best of these - such as the newer versions of Oracle - are nothing more nor less than relational systems with a few clever add-ons and some object-oriented jargon. There is little real advance over the relational DBMSs developed in the 1980s. What is perhaps worse (though off the point of this note) is that none of the commercially available relational or 'post-relational' DBMSs actually implement the complete relational model as defined by Codd (1985), let alone the more advanced versions from Codd (1990) and Date and Darwen (1998).
So what about XML ? There are some in the database community who are looking for yet another alternative to the relational model - who still do not accept that the relational database can and will provide the full answer, because it is based on a solid mathematical foundation. XML is irrelevant to this question for one crucial reason: it organises data hierarchically. To access any individual data item in an XML structure, you must navigate the hierarchy, which means that you must know the hierarchy. Although there are indeed software tools which will quite happily parse any XML document, these tools, at best, can only ever be pre-processors to a database management system, mapping data items into a structure which can better (more efficiently) deal with database queries.