
THE DATAMINE .DM FILE FORMAT

The DM format is used for binary files within the Datamine* system (now marketed as 'Studio' by Constellation Software Inc). It is a modification of the G-STAR file structure originally defined for the G-EXEC system in the British Geological Survey in 1972-3. The DM format is not used within VMINE, but DM files may be imported into VMINE.

There are two versions of the DM format - single precision (SP) and extended precision (EP). The SP format evolved from the G-STAR format in the 1980s for use in Datamine. The EP format was developed during the late 1990s. Confusingly, both formats use files with the .DM extension.

(1) single precision

The original single-precision .DM format was based on 2048-byte pages. The first page contained the data definition while subsequent pages contained the data records.

Public domain C++ coding to read .DM format files (it may possibly also handle the 'extended precision' version of the format) is available online from the University of Manchester AVS project at http://www.iavsc.org/repository/express/source/data_io/rd_dmine/

The description below is new: it was not taken from any Constellation Software Inc copyright material. Even though I wrote the original version, this description is not taken from the old coding but is reverse-engineered from actual .DM files generated more recently. However, it cannot be guaranteed that it correctly describes the .DM format as currently used in products supplied by Constellation Software Inc. If you use this to read files produced by or write files intended to be read by Constellation Software Inc products, you do so at your own risk.

However, the file structure described below is used to import Datamine files into VMINE and the description below should allow development of software which will read and write .DM files without the need for separate conversion programs.

This is a random-access file with filename extension .DM. It is organised as 'pages' (these are the Fortran records) with page length of 2048 bytes (512 4-byte words) into which the data are mapped. The first page contains the DD (data definition) and the second and following pages contain the data.
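A minimal sketch of the page addressing described above. It assumes that pages are numbered from 1 and that page 1 starts at byte 0 of the file with no extra record markers (typical of Fortran direct-access files, but not confirmed here). The names `page_offset` and `read_page` are illustrative only.

```python
SP_PAGE_BYTES = 2048   # single-precision page length given in the text
EP_PAGE_BYTES = 4096   # extended-precision page length given in the text


def page_offset(page_number, page_bytes=SP_PAGE_BYTES):
    """Byte offset of a 1-based page, assuming page 1 starts at byte 0."""
    if page_number < 1:
        raise ValueError("pages are numbered from 1")
    return (page_number - 1) * page_bytes


def read_page(path, page_number, page_bytes=SP_PAGE_BYTES):
    """Read one whole page; page 1 is the DD, page 2 onwards hold data."""
    with open(path, "rb") as handle:
        handle.seek(page_offset(page_number, page_bytes))
        page = handle.read(page_bytes)
    if len(page) != page_bytes:
        raise EOFError(f"page {page_number} is missing or incomplete")
    return page
```

For example, `read_page("FILENAME.DM", 1)` would return the 2048 bytes of the DD page of a single-precision file.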

In all pages the last 4 words (16 bytes) contain security information, but I think this is no longer used so can probably (no guarantees!) safely be coded as blanks. However, these words are not available to be used for data, so the effective page size is actually 2032 bytes (508 4-byte words).

There are two data types, text or alpha ('A') and floating point numeric ('N'). Some integers are used within the first DD page but these are all stored as 4-byte floating-point values.

Data items may either be (a) stored within data records, or (b) file constants whose value is the same for every record in the file and defined once only in the DD.

Integer items in the Data definition page are stored as Fortran REAL*4 or REAL*8 values in the single and extended precision formats respectively.
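As an illustration of reading such an integer, the sketch below decodes one word of the DD page as a floating-point value and converts it to a Python integer. The byte order is not stated in this description; little-endian IEEE 754 is assumed here and can be changed through the hypothetical `byte_order` parameter.

```python
import struct


def read_dd_integer(page, word_number, extended=False, byte_order="<"):
    """Decode the integer held in a 1-based DD word stored as REAL*4 or REAL*8.

    byte_order="<" (little-endian) is an assumption, not part of the format
    description; pass ">" for big-endian files.
    """
    size, code = (8, "d") if extended else (4, "f")
    offset = (word_number - 1) * size
    (value,) = struct.unpack_from(byte_order + code, page, offset)
    if value != int(value):
        raise ValueError(f"word {word_number} is not a whole number: {value!r}")
    return int(value)
```

Because the EP format simply widens every word from 4 to 8 bytes, the same word number addresses the same item in both formats: word 26 is bytes 101-104 in an SP file and bytes 201-208 in an EP file.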

First page structure:

SP bytes    EP bytes                     Content
1-8         1-4 and 9-12                 File name (max 8 characters) which usually matches the actual file name (e.g. FILENAME.DM - not case-sensitive)
9-16        17-20 and 25-28              Database name (max 8 characters) - I think this is no longer used
17-96       33-36, 41-44, ..., 185-188   File description (free text, max 80 characters)
97-100      193-200                      Numeric date coded as 10000*year + 100*month + day
101-104     201-208                      Total number of fields in the file (alpha fields are counted as the number of 4-byte blocks they occupy)
105-108     209-216                      Number of last page in the file
109-112     217-224                      Number of last logical data record within the last page
113-2032    225-4064                     Field definitions (layout given below)
2033-2048   4065-4096                    Security information: I think no longer used

Field definitions each occupy a group of 28 bytes (SP) or 56 bytes (EP). Alpha fields are always recorded in 4-byte units (in both SP and EP files), and thus for fields wider than 4 bytes more than one field definition is required, with the same field name but different values of LENF (1,2,...). Within each group the layout is:

SP bytes    EP bytes         Content
1-8         1-4 and 9-12     Field name (max 8 characters)
9-12        17-20            Field type ('A ' or 'N ')
13-16       25-32            SW Stored word number - set to zero if the field is a 'file constant' (defined by the 'default' value). This is the storage position within a logical data record.
17-20       33-40            Word number within field (always 1 for numeric fields, 1,2,3... for text fields)
21-24       41-48            Not used (provision for subsequent inclusion of a code for 'units of measurement' but which was never implemented)
25-28       49-56            Default value or file constant value. Default is the value optionally to be used in the event that a data value is missing.

When reading or writing a DD, the position of each field in a logical record is given by the SW value.

MAXLEN is the total number of words stored in data fields, counting one for each floating-point value and one for each 4-byte word of text data. It is the length of each logical record and is equal to the maximum storage-position SW value.

It should be noted that successive words of a text field may not always be contiguous (i.e. adjacent to each other), but their positions in each logical data record are given by the storage position value SW for each word of a field, allowing the field to be reconstructed correctly even if its constituent 4-byte words are separated. Every integer from 1 up to and including MAXLEN appears as the SW value of some field definition. Fields with storage position = 0 are file constants: they are not stored in the data records, and their value for every record is taken from the default value given in the DD.
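The first-page layout above can be sketched as a small decoder for single-precision files. This is an illustration under stated assumptions, not a tested reader: little-endian IEEE 754 floats and ASCII text are assumed, the "total number of fields" is taken to equal the number of 28-byte field definitions, and all function and key names are hypothetical. Note also that, assuming IEEE 754 storage, a 4-byte float holds whole numbers exactly only up to 16777216, so a date coded with a four-digit year may not have survived storage exactly; the sketch simply rounds whatever value it finds.

```python
import struct

SP_PAGE_BYTES = 2048
FIELD_DEF_BYTES = 28
FIELD_DEFS_START = 112    # 0-based offset of byte 113
FIELD_DEFS_END = 2032     # the last 16 bytes of the page are reserved


def _text(raw):
    return bytes(raw).decode("ascii", errors="replace")


def _whole(page, offset, byte_order):
    return int(round(struct.unpack_from(byte_order + "f", page, offset)[0]))


def split_date(coded):
    """Undo the 10000*year + 100*month + day coding."""
    year, rest = divmod(coded, 10000)
    month, day = divmod(rest, 100)
    return year, month, day


def parse_sp_dd(page, byte_order="<"):
    """Decode a single-precision DD page into a plain dictionary."""
    if len(page) != SP_PAGE_BYTES:
        raise ValueError("an SP page is 2048 bytes")
    count = _whole(page, 100, byte_order)
    if FIELD_DEFS_START + count * FIELD_DEF_BYTES > FIELD_DEFS_END:
        raise ValueError("field count does not fit in the DD page")
    dd = {
        "file_name": _text(page[0:8]).rstrip(),
        "database": _text(page[8:16]).rstrip(),
        "description": _text(page[16:96]).rstrip(),
        "date": _whole(page, 96, byte_order),
        "last_page": _whole(page, 104, byte_order),
        "last_record": _whole(page, 108, byte_order),
        "fields": [],
    }
    for index in range(count):
        off = FIELD_DEFS_START + index * FIELD_DEF_BYTES
        kind = _text(page[off + 8:off + 12]).rstrip()
        raw_default = bytes(page[off + 24:off + 28])
        if kind == "A":
            default = _text(raw_default)          # keep all 4 characters
        else:
            default = struct.unpack(byte_order + "f", raw_default)[0]
        dd["fields"].append({
            "name": _text(page[off:off + 8]).rstrip(),
            "type": kind,
            "sw": _whole(page, off + 12, byte_order),
            "word": _whole(page, off + 16, byte_order),
            "default": default,
        })
    # MAXLEN is the largest storage position among the field definitions.
    dd["maxlen"] = max((f["sw"] for f in dd["fields"]), default=0)
    return dd
```

An EP decoder would follow the same pattern with 8-byte words, reading text from the first four bytes of each word.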

The number of logical data records per page is calculated by NLRP = INT(508/MAXLEN) - thus in general there will be a few unused bytes at the end of a page in addition to the 16 bytes reserved for security data.
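The arithmetic can be sketched as follows. The helper names are hypothetical, and two assumptions are made that go beyond the text: that the DD page counts as page 1 (so data start on page 2), and that the 'last page' and 'last record' values in the DD can be combined to give the total record count.

```python
WORDS_PER_PAGE = 508   # usable words per page, SP (4-byte) or EP (8-byte)


def records_per_page(maxlen):
    """NLRP = INT(508 / MAXLEN)."""
    if not 1 <= maxlen <= WORDS_PER_PAGE:
        raise ValueError("MAXLEN must be between 1 and 508")
    return WORDS_PER_PAGE // maxlen


def locate_record(record_number, maxlen):
    """Return (page number, first word within page) for a 1-based record."""
    page_index, slot = divmod(record_number - 1, records_per_page(maxlen))
    return page_index + 2, slot * maxlen + 1


def total_records(last_page, last_record, maxlen):
    """Record count implied by the DD, assuming the DD itself is page 1."""
    if last_page < 2:
        return 0
    return (last_page - 2) * records_per_page(maxlen) + last_record
```

With MAXLEN = 7, for instance, 72 records fit on a page and 4 of the 508 usable words are left over at the end of each page.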

The structure of data pages is simply
Words (4-byte for SP, 8-byte for EP)      Content
1 to MAXLEN                               Data for first data record within page
MAXLEN+1 to 2*MAXLEN                      Data for second record
...                                       and so on
Bytes 2033-2048 (SP) or 4065-4096 (EP)    Security information: I think no longer used
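A sketch of pulling one logical record out of a single-precision data page, using the SW values to find each word and reassembling text fields whose words may not be adjacent. The field definitions are passed in as a list of dictionaries with the hypothetical keys `name`, `type`, `sw`, `word` and `default`; little-endian floats and ASCII text are assumptions.

```python
import struct


def decode_record(page, slot, maxlen, fields, byte_order="<"):
    """Decode the record in 0-based position `slot` of an SP data page."""
    base = slot * maxlen * 4
    record, pieces = {}, {}
    for f in fields:
        if f["sw"] == 0:
            value = f["default"]              # file constant from the DD
        else:
            start = base + (f["sw"] - 1) * 4
            raw = bytes(page[start:start + 4])
            if f["type"] == "A":
                value = raw.decode("ascii", errors="replace")
            else:
                value = struct.unpack(byte_order + "f", raw)[0]
        if f["type"] == "A":
            # Collect (word number, 4 characters) so words can be reordered.
            pieces.setdefault(f["name"], []).append((f["word"], value))
        else:
            record[f["name"]] = value
    for name, parts in pieces.items():
        joined = "".join(text.ljust(4) for _, text in sorted(parts))
        record[name] = joined.rstrip()
    return record
```

Sorting on the word number rather than on SW is what allows a text field to be rebuilt correctly when its words are scattered through the record.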

As many additional pages are used as needed. The last page is unlikely to be completely filled with data, and any remaining words are unused and undefined.

When writing a file, data are mapped into each logical record according to the SW (storage word) values for each numeric field or for each word of a text field. On writing a file, these logical records are accumulated in a page buffer until it is filled (i.e. all NLRP records have been generated) and the whole page is then written to the file. For the last page, after generating the last record, the page buffer is written to the file, and the last page and last record values are updated on the DD page which is then also written to the file. This is why the file must be specified as random access.
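The buffering scheme just described might be sketched like this for a single-precision file. It assumes the DD page has already been written as page 1, that each record arrives as `maxlen * 4` ready-packed bytes, that floats are little-endian, and that unused and security words can be left as blanks (the text above only says this is probably safe). The function names are hypothetical.

```python
import struct

PAGE_BYTES = 2048
WORDS_PER_PAGE = 508


def _put_page(handle, page_number, buffer):
    handle.seek((page_number - 1) * PAGE_BYTES)
    handle.write(buffer)


def write_data_pages(handle, records, maxlen, byte_order="<"):
    """Write records to pages 2, 3, ... and update the DD page counters."""
    rec_bytes = maxlen * 4
    nlrp = WORDS_PER_PAGE // maxlen
    page_number, count = 1, 0          # page 1 is the DD page
    buffer = None
    for rec in records:
        if len(rec) != rec_bytes:
            raise ValueError("record length must be MAXLEN words")
        if buffer is None:             # start a fresh, blank-filled page
            page_number += 1
            count = 0
            buffer = bytearray(b" " * PAGE_BYTES)
        buffer[count * rec_bytes:(count + 1) * rec_bytes] = rec
        count += 1
        if count == nlrp:              # page full: write it out
            _put_page(handle, page_number, buffer)
            buffer = None
    if buffer is not None:             # partly filled last page
        _put_page(handle, page_number, buffer)
    # Go back and record the last page and last record in the DD
    # (bytes 105-112), which is why random access is needed.
    handle.seek(104)
    handle.write(struct.pack(byte_order + "2f", float(page_number), float(count)))
    return page_number, count
```

The handle must be opened for binary random access (for example mode `"r+b"`), since the DD page is revisited after the data pages have been written.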

There are a few special numeric codes which are used within the data.

  • -1.0 E30 = 'bottom', used as the missing data code for numeric fields (for text fields, missing data is simply all blanks)
  • +1.0 E30 = 'top' and is used if a representation of 'infinity' is needed.
  • +1.0 E-30 = 'TR' or 'DL' and is used if it is required to represent an assay value of 'trace' or 'below detection limit'.

All text data is held in REAL variables, not the Fortran CHARACTER type - though the stored format is identical. This allows use of a simple REAL array to hold a whole page buffer, and another REAL array to hold the whole of each logical record for writing or reading. This concept originated in the British Geological Survey G-EXEC system in 1972-3 and was the key to Datamine's generality - rather than needing to pre-define specific data formats for every different combination of text and numeric fields. The same generality is achieved today through standards such as XML which do not prescribe storage formats or processing methods. VMINE binary files achieve this generality through a rather different mechanism.
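The single-buffer idea can be sketched in Python by treating a logical record as one flat run of 4-byte words, into which text goes as raw characters and numbers as 4-byte floats, each at the position given by SW. Field definitions are passed as dictionaries with the hypothetical keys `name`, `type`, `sw` and `word`; little-endian floats and ASCII text are assumptions.

```python
import struct

MISSING = -1.0e30   # 'bottom', the missing-data code for numeric fields


def encode_record(values, fields, maxlen, byte_order="<"):
    """Pack a dict of field values into one SP logical record."""
    buffer = bytearray(b" " * (4 * maxlen))       # blanks = missing text
    for f in fields:
        if f["sw"] == 0:
            continue                  # file constants live only in the DD
        start = (f["sw"] - 1) * 4
        if f["type"] == "A":
            text = values.get(f["name"], "")
            chunk = text[(f["word"] - 1) * 4:f["word"] * 4]
            buffer[start:start + 4] = chunk.encode("ascii").ljust(4)
        else:
            number = values.get(f["name"], MISSING)
            struct.pack_into(byte_order + "f", buffer, start, number)
    return bytes(buffer)
```

The same routine serves any mixture of text and numeric fields, because the layout is driven entirely by the field definitions rather than by a format fixed in the code.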

(2) extended precision

The so-called 'extended precision' (EP) Datamine file format was developed after I left the company, and in my opinion is badly designed. It also has extension .DM (which can make it difficult to identify when reading the file!). Its pages are 4096 bytes in length, and it seems that the page structure is simply mapped into 8-byte words instead of 4-byte words. The byte mapping is shown in the table above.

This is fine for numeric data (it allows the full Fortran REAL*8 or DOUBLE PRECISION), but for text data only the first four bytes of each double-precision word are used. The EP file structure is therefore inefficient in data storage terms for files which contain significant amounts of text data.
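As a sketch of what this means when reading EP text, the function below keeps the first four bytes of each 8-byte word and discards the rest. What the unused four bytes contain is not known from this description, so they are simply ignored; ASCII text is assumed and the function names are illustrative.

```python
def ep_text(raw):
    """Extract text from a run of 8-byte EP words (first 4 bytes of each)."""
    if len(raw) % 8:
        raise ValueError("EP text must be a whole number of 8-byte words")
    used = b"".join(bytes(raw[i:i + 4]) for i in range(0, len(raw), 8))
    return used.decode("ascii", errors="replace").rstrip()


def ep_storage_bytes(n_chars):
    """Bytes an EP file spends on a text field of n_chars characters."""
    words = -(-n_chars // 4)       # ceiling division: one word per 4 chars
    return words * 8
```

An 80-character text field thus occupies 160 bytes of every record in an EP file, twice what the same field needs in an SP file.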

VMINE does not use .DM format files of either kind, but can convert them to the VMDD/VMDA format. Back-conversion to .DM files is not offered in VMINE but can be done through export of CSV files, and re-import into Datamine.

* Datamine is a trademark now owned by Constellation Software Inc and much of the Datamine system code is copyright owned by Constellation Software Inc. Note that, under the EU Software Directive, data file formats are specifically excluded from copyright protection, so you may use and extract information from the AVS public domain coding without problems. General information on their code repository is available from here.