(Discussion of XML DTD for HDF5)
REMcG
Purpose and use of DTD
Some global issues were suggested in passing, which should be noted.
- What will the DTD and XML descriptions of the file be used for? Clearly, we are not sure. Will people want to use them mainly as a way to get a table of contents, which may not need all the details and data contents? Or will they want to have a representation of the data? We may have occasion to optimize some kinds of use at the expense of others, and will be guided by what we learn about how people use the XML.
- Is the DTD required to be able to reconstruct the HDF-5 file? If so, does it need to match byte-for-byte, or at some structural level? There seemed to be a consensus that it should be possible to recreate the file from an XML description.
- Is the DTD constrained to support specific processing models, e.g., one pass parsing of the HDF-5 file or XML, minimal memory use by parsing, etc.? There was no explicit discussion of any limiting assumptions.
What to do about data, especially binary data?
In general, XML does not have numeric types, nor does it understand the semantics of numbers.
XML can:
- express some aspects of the format of numbers, in an awkward way.
- use tags and attributes to indicate the intended interpretation, e.g., FORTRAN formatting.
- include uninterpreted blobs of Unicode
- include detailed and 'smart' pointers to external data
For handling the data in HDF5 files, there seem to be two basic
strategies:
- point to the data, e.g., with a URL+path, with tags/attributes indicating that this should be read with HDF-5 software.
- include the data in a character encoding
<DATA_FROM_FILE>
<POINTER_TO_DATA>
URL, path etc.
</POINTER_TO_DATA>
<DATA>
... character/Unicode encoding of the
data....
</DATA>
</DATA_FROM_FILE>
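As a rough illustration of the two strategies, the sketch below builds a DATA_FROM_FILE element in either form using Python's standard xml.etree.ElementTree module. The element names come from the outline above; the helper name, the whitespace-separated text encoding, and the file URL and object path are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

def data_from_file(pointer=None, values=None):
    """Build a DATA_FROM_FILE element holding a pointer, inline data, or both."""
    root = ET.Element("DATA_FROM_FILE")
    if pointer is not None:
        # Pointer form: the data stays in the HDF5 file and is only referenced.
        ET.SubElement(root, "POINTER_TO_DATA").text = pointer
    if values is not None:
        # Inline form: numbers written as whitespace-separated decimal text
        # (an assumed encoding, not one fixed by the notes).
        ET.SubElement(root, "DATA").text = " ".join(str(v) for v in values)
    return root

# Hypothetical file location and object path, for illustration only.
by_pointer = data_from_file(pointer="file:///data/example.h5#/group1/dset1")
inline = data_from_file(values=[1, 2, 3])

print(ET.tostring(by_pointer, encoding="unicode"))
print(ET.tostring(inline, encoding="unicode"))
```

The pointer form keeps the XML small and leaves interpretation to HDF-5 software; the inline form is self-contained, but XML itself attaches no numeric meaning to the text.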
To Do:
Investigate and propose details of both pointer and character representation of data.
RE pointer: Need to investigate the XML standards for XPointer and XLink, and "do the right thing".
RE character representation: Data in an HDF-5 file is often structured, i.e., it can be an array of structures. This opens the question of whether we want to "mark up" the data elements themselves, e.g., marking the rows, cols, fields, etc. of the data. This could be done by defining additional tags to be used within the '<DATA>' element.
An alternative is to have a standard for "flattening" data into a one-dimensional array of Unicode.
And, of course, we will probably follow a mixed approach. Strings and scalars can be represented in a straightforward way as Unicode strings with standard formats. Other data elements might be represented as several sub-elements, with further structure flattened. For example, a 2D array of compound data types might be represented as several "<ROW>" elements with <CELL> elements, but perhaps each cell might be stored as a flat array of bytes with no further mark up.
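The mixed approach might look something like the following sketch, which marks up rows and cells but stores each cell as a flat run of bytes. The ROW and CELL names are taken from the example above; the compound type (a 32-bit integer plus a 64-bit float), the little-endian layout, and the hexadecimal text encoding are assumptions chosen only for illustration.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical compound type: a 32-bit int and a 64-bit float, little-endian.
CELL_FORMAT = "<id"

def encode_rows(rows):
    """Mark up rows and cells; leave the fields inside each cell flat."""
    data = ET.Element("DATA")
    for row in rows:
        row_el = ET.SubElement(data, "ROW")
        for cell in row:
            # Each cell is packed to bytes and written as hexadecimal text,
            # with no markup for the individual fields.
            packed = struct.pack(CELL_FORMAT, *cell)
            ET.SubElement(row_el, "CELL").text = packed.hex()
    return data

def decode_rows(data):
    """Recover the 2D array of (int, float) tuples from the markup."""
    return [
        [struct.unpack(CELL_FORMAT, bytes.fromhex(c.text))
         for c in row.findall("CELL")]
        for row in data.findall("ROW")
    ]

rows = [[(1, 0.5), (2, 1.5)], [(3, 2.5), (4, 3.5)]]
xml_text = ET.tostring(encode_rows(rows), encoding="unicode")
round_trip = decode_rows(ET.fromstring(xml_text))
```

A reader of such XML can find rows and cells with ordinary XML tools, but still needs the type description from elsewhere in the file description to interpret the bytes of a cell.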
Attributes/Elements
We discussed the use of XML attributes and elements. There is some freedom here, and sometimes the decision may be a matter of taste. We can and should choose whichever makes sense in a given case.
Rules and tips for choosing
- If order is significant, use Elements
- If the values are an ENUM, use an Attribute
- If there can be more than one instance of an item, Elements must be used
- There may be advantages to using Attributes if you want to use XML parameters
- If the value is structured, Elements must be used
- Attributes can have default values
Case by case, decide about using attributes or elements....
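To make a few of these rules concrete, here is a small sketch in Python. The Dataset, Dimension, and ByteOrder names are hypothetical and are not taken from any agreed DTD; they are chosen only to show an enumerated value as an attribute and repeated, order-significant values as elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names, chosen only to illustrate the rules.
doc = """
<Dataset ByteOrder="LE">
  <Dimension>4</Dimension>
  <Dimension>6</Dimension>
</Dataset>
"""

dataset = ET.fromstring(doc)

# Enumerated, unstructured value: an attribute.
byte_order = dataset.get("ByteOrder")

# Repeated and order-significant values: child elements, read in document order.
shape = [int(d.text) for d in dataset.findall("Dimension")]

print(byte_order, shape)
```

An attribute can appear only once on an element and carries no order, which is why the dimensions could not be expressed as attributes without inventing names such as one attribute per dimension.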
File Structure, Links
We discussed how best to represent the structure of an HDF-5 file with
XML.
- XML is strictly a tree; all objects are nested in the outermost object.
- XML has ID and IDREFS, which can be used to implement 'link' objects which can represent arbitrary relationships.
Two approaches were considered:
- All the nameable objects are at the top level, and all links are explicitly included. The structure of the file has to be reconstructed from all the links.
- The objects are all nested in the RootGroup, with extra link objects for cases where there are multiple references to the same object.
The first approach is 'elegant', and represents the actual way that HDF-5 works. However, this isn't the way the documentation describes the file, nor is it how the API or dumper works. Also, this approach does not take advantage of the 'treeness' of XML, even when the file really is a tree. It is more complicated than needed for the common simple cases.
The second approach still has links, but they are needed only for objects with more than one link. In most cases, the object will be nested in a natural way, with the XML matching the HDF-5 (and the DDL).
The general consensus was to do the tree plus links, because this makes the common case easy.
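A minimal sketch of how a reader might process the tree-plus-links form is shown below. The element names (Group, Dataset, Link) and attribute names (Name, ObjId, Target) are assumptions for illustration; the DTD revision has not defined them, and a real DTD would presumably declare ObjId as type ID and Target as type IDREF.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names; the notes do not fix the DTD details.
doc = """
<RootGroup>
  <Group Name="a">
    <Dataset Name="d1" ObjId="id1"/>
  </Group>
  <Group Name="b">
    <Link Name="alias" Target="id1"/>
  </Group>
</RootGroup>
"""

root = ET.fromstring(doc)

# Index every object that carries an identifier, so links can be resolved.
by_id = {el.get("ObjId"): el for el in root.iter() if el.get("ObjId")}

def paths(group, prefix=""):
    """Yield (path, object) pairs, following Link elements to their targets."""
    for child in group:
        path = prefix + "/" + child.get("Name")
        if child.tag == "Link":
            # A link names an object that is nested somewhere else in the tree.
            yield path, by_id[child.get("Target")]
        else:
            yield path, child
            if child.tag == "Group":
                yield from paths(child, path)

resolved = dict(paths(root))
print(sorted(resolved))
```

In the common case with no shared objects, the by_id lookup is never used and the XML nesting alone gives the file structure. The sketch does not descend through links, so a link that points back to an enclosing group cannot cause an endless walk.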
To Do:
Revise DTD to do the tree with aux. links. Note: will need to define heuristics beyond the DTD for how to construct the tree.
Last modified: June 26th 2007
