
THE VMINE FILE FORMATS


A new file structure has been developed for use with VMine.
VMDD/VMDA files are used within the system, providing random access for efficient storage and processing. The VMDD/VMDA format uses paired files, xxxx.VMDD and xxxx.VMDA.

The VMDD file is a sequential ASCII file which contains the data definition. This includes the file name, number of fields and number of records. For each field, it contains the field name, field type (currently A=alpha or N=numeric but notionally extendable to include other types), field length in bytes for alpha fields, and default value for numeric fields.

The VMDA file is a random access binary file containing all data records. The record length is the sum of three terms: the number of numeric fields multiplied by 8 (numeric values are stored in Fortran double precision format), the total number of bytes in alpha fields, and one additional byte per field to hold the Codd mark. A non-blank mark indicates absent data, and the particular mark used gives the reason for absence, for example 'A' for simply missing, or 'I' for missing and inapplicable. The use of these marks allows implementation of open-world database management.

The VMDD file holds the data definition in a sequential formatted file, as follows:
Record                 Bytes    Contents
---------------------  -------  --------
1                      1-8      NUMREC - number of data records
                       9-16     NUMFLD - number of stored fields
                       17-24    NUMCON - number of file constants
                       29-60    FILENAME

Next NUMFLD records             Definitions for each field:
                       1-32     Field name
                       33       Field type: N=numeric or A=text
                       34-41    Field length: 1 for numeric fields, or the maximum number of bytes for text fields
                       42-49    Start position of the field within its data buffer. For numeric fields this is simply the word number; for text fields it is a byte number (data for the two types are held in separate buffers)
                       50-72    For numeric fields, the default value which may be used in place of missing data entries. For text fields, blank (undefined: not used)

Next NUMCON records             Definitions for each file constant (numeric type only):
                       1-32     Constant name
                       33       Blank
                       34-41    Blank
                       42-49    Blank
                       50-72    Numeric value of the file constant
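
The column layout above can be read with plain string slicing. The following Python sketch is illustrative only and is not part of VMine: the names `FieldDef`, `col`, `to_float` and `parse_vmdd` are hypothetical, and it assumes that each VMDD record is one text line, that integers are written as ASCII digits within their columns, and that numeric values may use a Fortran-style `D` exponent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldDef:
    name: str
    ftype: str                # 'N' = numeric, 'A' = text
    length: int               # 1 for numeric, max bytes for text
    start: int                # word number (numeric) or byte number (text)
    default: Optional[float]  # numeric default; None for text fields


def col(line, first, last):
    """Return 1-based, inclusive columns first..last of a line, stripped."""
    return line[first - 1:last].strip()


def to_float(text):
    # Assumption: values may carry a Fortran 'D' exponent such as 0.5D0.
    return float(text.upper().replace("D", "E"))


def parse_vmdd(text):
    lines = text.splitlines()
    head = lines[0]
    numfld = int(col(head, 9, 16))
    numcon = int(col(head, 17, 24))

    fields = []
    for line in lines[1:1 + numfld]:
        ftype = col(line, 33, 33)
        default_text = col(line, 50, 72)
        has_default = ftype == "N" and default_text != ""
        fields.append(FieldDef(
            name=col(line, 1, 32),
            ftype=ftype,
            length=int(col(line, 34, 41)),
            start=int(col(line, 42, 49)),
            default=to_float(default_text) if has_default else None,
        ))

    # File constants reuse the field layout, with columns 33-49 blank.
    constants = {}
    for line in lines[1 + numfld:1 + numfld + numcon]:
        constants[col(line, 1, 32)] = to_float(col(line, 50, 72))

    return {
        "numrec": int(col(head, 1, 8)),
        "numfld": numfld,
        "numcon": numcon,
        "filename": col(head, 29, 60),
        "fields": fields,
        "constants": constants,
    }


# A made-up definition file: two numeric fields, one text field, one constant.
sample = "\n".join([
    f"{2:8d}{3:8d}{1:8d}{'':4}{'COLLARS':<32}",
    f"{'X':<32}N{1:8d}{1:8d}{'0.0':>23}",
    f"{'Y':<32}N{1:8d}{2:8d}{'0.0':>23}",
    f"{'HOLEID':<32}A{8:8d}{1:8d}{'':23}",
    f"{'CUTOFF':<32}{'':17}{'0.5D0':>23}",
])
vmdd = parse_vmdd(sample)
```

The field and constant names in `sample` are invented for the example; a real VMDD file may justify or format its values differently.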

The VMDA file holds data records in random-access binary format. It contains NUMREC records of fixed length LENTOTAL bytes, where LENTOTAL is defined as:

LENTOTAL = LENNUM + LENALF + NUMFLD

where LENNUM is the total number of bytes of numeric data (the number of numeric fields, multiplied by 8, which is the number of bytes in a Fortran double precision variable); LENALF is the total number of bytes in text fields; NUMFLD is the total number of fields of both types - but excluding file constants.
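
As a worked example of this formula, the sketch below computes LENTOTAL for a hypothetical file with three numeric fields and one 12-byte text field. The function name `record_length` and the `(type, length)` pair representation are assumptions made for illustration.

```python
DOUBLE_BYTES = 8  # bytes in a Fortran double precision variable


def record_length(fields):
    """fields: list of (ftype, length) pairs, excluding file constants."""
    lennum = DOUBLE_BYTES * sum(1 for ftype, _ in fields if ftype == "N")
    lenalf = sum(length for ftype, length in fields if ftype == "A")
    numfld = len(fields)  # one Codd mark byte per field
    return lennum + lenalf + numfld


fields = [("N", 1), ("N", 1), ("N", 1), ("A", 12)]
lentotal = record_length(fields)  # 24 + 12 + 4 = 40 bytes
```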

Because of the strong typing of modern dialects of Fortran, the old G-EXEC and DATAMINE storage format, with everything held in a REAL array, can no longer be guaranteed to be accepted by Fortran compilers. It is non-standard Fortran. Therefore, in the VMDA file, numeric and character format fields are stored separately. This has the small additional benefit that for character fields with lengths that are not exact multiples of 4 bytes, there is some space saving by comparison with, for example, Datamine files. Indeed, compared with Datamine 'Extended Precision' files there is substantial space saving, as the 8-byte words of such files contain a maximum of only 4 characters.

Each record in the VMDA file contains three buffers, in the following order:

  1. Numeric data, with double precision variables holding data for numeric fields in the order in which they are defined in the VMDD file.
  2. Text data, with character format variables holding data for text fields in the order in which they are defined in the VMDD file. There is no need for any spacing or delimiter between fields as their positions and lengths are fully defined in the VMDD file.
  3. At the end of the record, Codd marks, one byte for each field whether numeric or text, in the order in which the fields are defined in the VMDD file. Any non-blank value represents missing data. These marks are described in more detail below.
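
The three-buffer layout can be illustrated by splitting one record with Python's `struct` module. This is a sketch under stated assumptions, not VMine code: `unpack_record` is a hypothetical name, the byte order is assumed to be little-endian (it is not specified here and depends on the machine that wrote the file), text is assumed to be ASCII, and field positions are derived from the field order rather than from the start positions stored in the VMDD file.

```python
import struct


def unpack_record(record, fields, byteorder="<"):
    """Split one VMDA record into {name: (value, mark)}.

    fields: list of (name, ftype, length) in VMDD order.
    A value is None whenever its Codd mark is non-blank.
    """
    n_num = sum(1 for _, ftype, _ in fields if ftype == "N")
    lennum = 8 * n_num
    lenalf = sum(length for _, ftype, length in fields if ftype == "A")
    expected = lennum + lenalf + len(fields)
    if len(record) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(record)}")

    numbers = struct.unpack(f"{byteorder}{n_num}d", record[:lennum])  # buffer 1
    text = record[lennum:lennum + lenalf]                             # buffer 2
    marks = record[lennum + lenalf:]                                  # buffer 3

    row = {}
    num_index = 0
    text_pos = 0
    for i, (name, ftype, length) in enumerate(fields):
        mark = chr(marks[i])
        if ftype == "N":
            value = numbers[num_index]
            num_index += 1
        else:
            value = text[text_pos:text_pos + length].decode("ascii").rstrip()
            text_pos += length
        row[name] = (value if mark == " " else None, mark)
    return row


# Hypothetical record: X present, Y missing ('A' mark), HOLEID present.
fields = [("X", "N", 1), ("Y", "N", 1), ("HOLEID", "A", 8)]
record = struct.pack("<2d", 100.5, 0.0) + b"DH001   " + b" A "
row = unpack_record(record, fields)
```

Returning the mark alongside each value keeps the reason for absence available to the caller instead of collapsing every missing item to a single null.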

The "Codd mark" is an indicator for any item in a database that the data value is missing or otherwise undefined. It is much more powerful than the NULL of SQL-based systems as it has a well-defined logical basis and, because it can have many alternative codings, it provides a means to treat databases in a more consistent and meaningful way. In his 1990 book, Codd defined two marks, 'A' for simply missing and unknown, and 'I' for missing and inapplicable. Initially these two marks will be implemented in VMine. The 'I' mark is used in situations where a value is undefined as a result of program execution - for example in an outer join where key fields in the two files being joined cannot be matched.
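
The outer join case can be sketched as follows. This is not VMine's join implementation, which is not described here: `left_outer_join` and the row representation (a dictionary mapping each field name to a `(value, mark)` pair) are assumptions for illustration, and the sketch assumes the key is unique in the right-hand file.

```python
PRESENT, MISSING, INAPPLICABLE = " ", "A", "I"


def left_outer_join(left_rows, right_rows, key, right_fields):
    """Join rows of the form {field: (value, mark)} on a key field.

    Right-hand fields of unmatched left rows receive the 'I' mark.
    """
    # Index the right-hand rows by key value (assumes unique keys).
    index = {}
    for row in right_rows:
        value, mark = row[key]
        if mark == PRESENT:
            index[value] = row

    joined = []
    for row in left_rows:
        value, mark = row[key]
        match = index.get(value) if mark == PRESENT else None
        out = dict(row)
        for name in right_fields:
            out[name] = match[name] if match else (None, INAPPLICABLE)
        joined.append(out)
    return joined


# Hypothetical data: DH002 has no assay record, and its X was never entered.
collars = [
    {"HOLEID": ("DH001", PRESENT), "X": (100.5, PRESENT)},
    {"HOLEID": ("DH002", PRESENT), "X": (None, MISSING)},
]
assays = [{"HOLEID": ("DH001", PRESENT), "AU": (1.2, PRESENT)}]
result = left_outer_join(collars, assays, "HOLEID", ["AU"])
```

In the joined output the two kinds of absence stay distinct: the X value of DH002 keeps its 'A' mark from the source file, while its AU value carries the 'I' mark produced by the join.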