Robert E. McGrath
September 9, 2000
Requirements
The goal of this task is to add an option to the h5dump utility to output a description of the HDF5 file formatted in XML. The XML must conform to the HDF5 DTD. The following features are required:
- The user may select either standard or XML output with a simple option.
- If XML is not selected, the h5dump utility will be exactly as before.
- The XML output must conform to the HDF5 DTD.
- It should be possible to reconstruct the HDF5 file from XML that conforms to the DTD.
- The XML option will be implemented entirely within the existing code modules, and will add no new external dependencies.
- As far as possible, the XML output should use the existing h5dump code.
- Some options in the standard output may be unsupported or partly supported in the XML.
- Some objects and output available in the standard output may not be available with the XML option.
- Space and time efficiencies are not important.
Proposed User Interface
This feature will add two new flags to the h5dump command.
- '-xml' - (required) selects XML instead of standard output. May also disable other flags, such as subsetting.
- '-dtd <alternate URI>' - (optional) specifies the path (URI) to the HDF5 DTD which should be written in the output. This may be used to point to a local copy of the DTD, for instance, instead of the default web address.
Explanation: In an example XML file, the default preamble is
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HDF5-File PUBLIC "HDF5-File.dtd"
"http://hdf.ncsa.uiuc.edu/HDF5/XML/DTD/HDF5-File.dtd">
<HDF5-File>
<RootGroup OBJ-XID="root">
....
This instructs an interpreter to look for the DTD needed to interpret the XML at 'http://hdf.ncsa.uiuc.edu...' If the file is to be used off the network, or behind a firewall, or with some custom version of the DTD, then the third line would be changed to point to a different file or URL. E.g., to use my own copy of the DTD:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HDF5-File PUBLIC "HDF5-File.dtd"
"/tmp/mcgrath/my-HDF5-File.dtd">
<HDF5-File>
<RootGroup OBJ-XID="root">
....
This could be done by editing the output file, or by using:
h5dump -xml -dtd /tmp/mcgrath/my-HDF5-File.dtd file.h5
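The effect of the two flags on the preamble can be illustrated with a minimal sketch. This is Python for brevity, not the h5dump code itself; the function names parse_flags and xml_preamble are hypothetical, and the only assumption taken from this note is that '-dtd' replaces the URI on the third line of the preamble.

```python
# Default DTD location, as shown in the example preamble in this note.
DEFAULT_DTD_URI = "http://hdf.ncsa.uiuc.edu/HDF5/XML/DTD/HDF5-File.dtd"


def xml_preamble(dtd_uri=DEFAULT_DTD_URI):
    """Return the three preamble lines; only the third depends on -dtd."""
    return "\n".join([
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<!DOCTYPE HDF5-File PUBLIC "HDF5-File.dtd"',
        '"%s">' % dtd_uri,
    ])


def parse_flags(argv):
    """Hypothetical hand-rolled parser for the two proposed flags.

    Returns (use_xml, dtd_uri, remaining_arguments).
    """
    use_xml, dtd_uri, rest = False, DEFAULT_DTD_URI, []
    args = iter(argv)
    for arg in args:
        if arg == "-xml":
            use_xml = True
        elif arg == "-dtd":
            # The flag takes the next argument as the alternate URI.
            dtd_uri = next(args, None)
            if dtd_uri is None:
                raise ValueError("-dtd requires a URI argument")
        else:
            rest.append(arg)
    return use_xml, dtd_uri, rest


if __name__ == "__main__":
    use_xml, uri, files = parse_flags(
        ["-xml", "-dtd", "/tmp/mcgrath/my-HDF5-File.dtd", "file.h5"])
    if use_xml:
        print(xml_preamble(uri))
```

When '-dtd' is absent, the default web address is written, so existing consumers of the XML would see the same preamble as in the first example.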
Technical Approach and Design Notes
Based on a detailed analysis and experiments with the current dumper, it is clear that the XML feature can be added to the existing code without disrupting the standard options. Most of the dumper code will be shared between the different versions of the output, with some additional code to support XML (described below). To date, no changes to libtool or any other code are needed, although at least one libtool call needs to be overridden for XML (see #4 below).
Key Changes
1. Add the options as described above, and global variables
to store their values. Logic will also be needed to disallow
options that are not supported when XML is selected.
2. The dump_header format table must be changed.
Note that the XML code will not use the 'header' strings, but will use
other strings from that table.
3. Implement alternative versions of object dumps. The XML output is not only syntactically different, but some of the order of elements is different. The cleanest implementation will be to provide alternative versions of:
- dump_group,
- dump_named_datatype,
- dump_dataset,
- dump_dataspace,
- dump_datatype,
- dump_attr,
- dump_data
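One way to picture "alternative versions" of the dump routines is a pair of function tables selected once by the '-xml' flag. The sketch below is Python for brevity and is an illustration under stated assumptions, not the planned C implementation: the table names, select_dumpers, and the two dump_dataset variants are hypothetical, and only the opening line of a dataset (copied from the examples in this note) is produced.

```python
def dump_dataset_std(name):
    """Opening line of a dataset in the standard output."""
    return 'DATASET "%s" {' % name


def dump_dataset_xml(name):
    """Opening tag of a dataset in the XML output.

    Simplification: OBJ-XID reuses the name and Parents is left empty,
    mirroring the single example shown in this note.
    """
    return '<Dataset Name="%s" OBJ-XID="%s" Parents="">' % (name, name)


# One table per output format; a full version would carry an entry for
# each routine listed above (dump_group, dump_datatype, and so on).
STANDARD_DUMPERS = {"dump_dataset": dump_dataset_std}
XML_DUMPERS = {"dump_dataset": dump_dataset_xml}


def select_dumpers(use_xml):
    """Pick the table once, so shared code never tests the flag again."""
    return XML_DUMPERS if use_xml else STANDARD_DUMPERS


if __name__ == "__main__":
    for flag in (False, True):
        print(select_dumpers(flag)["dump_dataset"]("Dataset3"))
```

Keeping the choice in one place leaves the standard routines untouched, which matches the requirement that h5dump behaves exactly as before when XML is not selected.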
4. XML needs to output the target of references, not the value of the reference. This is needed because XML conforming to the DTD must be usable to create a new HDF5 file. (The dumper prints the reference value, which cannot be used to create a new copy of the file.)
The proposed output for reference data is a path that can be used to create a reference to the correct object. Region references would be a path plus additional markup (TBD) describing the region.
Implementing this feature requires adding code to the dumper.
First, there must be some mechanism for looking up at least one path, given an object reference.
There are two suggested implementations for this.
- A table of all targets of references, keyed by object reference. The entries are (obj_ref, aPath) tuples.
- Adding information to the global object table already in use.
The first option is recommended.
The second change will be to not call the tools library to dump references.
Instead, a new routine will be called to read the object reference and look
up the path. The path is then written to the XML file as the value
of the <DataFromFile> element.
Example:
The dumper would show the value of an object reference thus:
DATASET "Dataset3" {
DATATYPE { H5T_REFERENCE }
DATASPACE { SIMPLE ( 4 ) / ( 4 ) }
DATA {
DATASET 0:1696, DATASET 0:2152,
GROUP 0:1320, DATATYPE 0:2268
}
}
The XML for the data part should be something like:
<Dataset Name="Dataset3" OBJ-XID="Dataset3" Parents="">
<Dataspace>
<SimpleDataspace Ndims="1">
<Dimension DimSize="4" MaxDimSize="4"/>
</SimpleDataspace>
</Dataspace>
<DataType>
<AtomicType>
<ReferenceType>
<ObjectReferenceType />
</ReferenceType>
</AtomicType>
</DataType>
<Data>
<DataFromFile>
"/Group1/Dataset1"
"/Group1/Dataset2"
"/Group1"
"/Group1/Datatype1"
</DataFromFile>
</Data>
</Dataset>
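The recommended first option, a separate table keyed by object reference, can be sketched as follows. This is Python for brevity and only an illustration: build_reference_table and dump_object_references are hypothetical names, and the pairing of reference values with paths is assumed from the order of the two listings in the example above.

```python
from xml.sax.saxutils import escape


def build_reference_table(objects):
    """Build the (obj_ref -> path) table from (obj_ref, path) tuples.

    An object may be reachable by several paths; only one is needed,
    so the first path seen for each reference is kept.
    """
    table = {}
    for obj_ref, path in objects:
        table.setdefault(obj_ref, path)
    return table


def dump_object_references(refs, table):
    """Write the target paths as the value of <DataFromFile>.

    Raises KeyError if a reference has no entry in the table.
    """
    lines = ["<DataFromFile>"]
    for ref in refs:
        lines.append('"%s"' % escape(table[ref]))
    lines.append("</DataFromFile>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Pairing assumed from the order of the example listings in this note.
    table = build_reference_table([
        ("0:1696", "/Group1/Dataset1"),
        ("0:2152", "/Group1/Dataset2"),
        ("0:1320", "/Group1"),
        ("0:2268", "/Group1/Datatype1"),
    ])
    print(dump_object_references(
        ["0:1696", "0:2152", "0:1320", "0:2268"], table))
```

The table would have to be filled during a pass over the file before any reference data is written, since a reference may point to an object that appears later in the output.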
5. Changes to the DTD
The DTD will need to be updated to support the following:
- Object References
- Region References
- BitFields
- Opaque types
6. Other changes yet to be determined
There are several questions that remain open at this time and need to be investigated:
- Some features may need clarification. For instance, it is not clear whether dumping selected objects should be supported by the XML option.
- It is not yet known whether the existing dumper code can be used for compound data type data, or whether an alternative will be needed.
- The DTD does not support all objects. For instance, there is no provision for dumping object IDs.
- Some legal file structures (such as 'loops' in the graph and forward object references) may produce XML that is difficult to parse.
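The difficulty with loops can be made concrete with a small sketch. This note does not specify how loops will be handled; the code below only shows a common technique, a depth-first walk that emits each object once and records links back to objects already seen. It is Python for brevity, and walk_once and the sample graph are hypothetical.

```python
def walk_once(graph, root):
    """Visit every object reachable from root exactly once.

    graph maps an object id to a list of (link_name, child_id) pairs.
    Returns (order, repeats): objects in first-visit order, and the
    links that lead to an object that was already visited.
    """
    seen, order, repeats = set(), [], []

    def visit(node):
        seen.add(node)
        order.append(node)
        for link_name, child in graph.get(node, []):
            if child in seen:
                # A loop or a second path: record it, do not descend again.
                repeats.append((node, link_name, child))
            else:
                visit(child)

    visit(root)
    return order, repeats


if __name__ == "__main__":
    # Hypothetical file: Group1 contains a link back to the root group.
    graph = {
        "root": [("Group1", "g1")],
        "g1": [("Dataset1", "d1"), ("back", "root")],
    }
    print(walk_once(graph, "root"))
```

A reader of the XML still has to resolve the recorded links after all objects are known, which is the parsing burden described in the last item above.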
Summary and Miscellaneous comments
The overall changes are feasible, requiring several hundred lines of additional code and modifications to about 50 lines of existing code.
An initial version, supporting the most important data types, can be done in a month of part-time work.
Some of this work is uncovering bugs in the XML DTD and h5gen tool, which makes debugging the h5dump code more complicated.
Last modified: May 16th 2011
