Robert E. McGrath
March 27, 2001
HDF5 compound data types are very general and flexible C data structures. They present a complex problem for non-C applications, including XML parsers and Java applications. This document explains key issues, some of which have been solved and some of which are still open.
1. Overview
The HDF5 data model includes a flexible and general model for defining data and records of heterogeneous elements, called "compound data types". A dataset declared to have a compound datatype is then a multi-dimensional array of records. Compound data types may have fields of any atomic type, and also may have fields that are compound (sometimes termed "nested compound types"). The data type defines the offset and storage layout of all the elements in a compound element.
The HDF5 library also supports reading and writing data to and from a dataset with compound type. The access may write (read) one or more complete records, or may write to (read from) particular fields of the records. The HDF5 library manages the scatter/gather necessary to transfer the values to correct offsets on disk or memory.
Compound data creates a number of challenges for non-C environments. These stem from several fundamental features:
- Compound datatypes are defined in terms of layout in memory, a la C structs.
- The type of the data is defined at run time, i.e., you must read the file to determine the type of the data objects to read.
In addition, accessing data must be done according to C language rules, i.e., reading into arrays of C structs. This model is completely different from the I/O of Java or XML, and requires complex plumbing to convert from one organization to another. The reorganization must be done at run time, if it can be done at all.
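To make these two features concrete, here is a minimal Java sketch of a compound type held as a run-time description rather than as a compiled class. The names `FieldDesc` and `CompoundDesc` are hypothetical and are not part of any HDF5 interface; the offsets and sizes would come from reading the file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical description of one field of a compound type. */
final class FieldDesc {
    final String name;      // field name, e.g. "a_name"
    final String typeName;  // atomic type name, e.g. "H5T_STD_I32BE"
    final int offset;       // byte offset of the field within one record
    final int size;         // size of the field in bytes

    FieldDesc(String name, String typeName, int offset, int size) {
        this.name = name;
        this.typeName = typeName;
        this.offset = offset;
        this.size = size;
    }
}

/** Hypothetical run-time description of a compound type (a C-struct-like layout). */
final class CompoundDesc {
    final int recordSize;                             // bytes per record
    final List<FieldDesc> fields = new ArrayList<>();

    CompoundDesc(int recordSize) {
        this.recordSize = recordSize;
    }

    /** Adds a field, rejecting one that would not fit inside a record. */
    CompoundDesc add(String name, String typeName, int offset, int size) {
        if (offset < 0 || size <= 0 || offset + size > recordSize) {
            throw new IllegalArgumentException("field does not fit in record: " + name);
        }
        fields.add(new FieldDesc(name, typeName, offset, size));
        return this;
    }

    /** Byte position of a named field of record number recordIndex in a packed buffer. */
    long position(int recordIndex, String fieldName) {
        for (FieldDesc f : fields) {
            if (f.name.equals(fieldName)) {
                return (long) recordIndex * recordSize + f.offset;
            }
        }
        throw new IllegalArgumentException("no such field: " + fieldName);
    }
}
```

Because the layout is ordinary data, nothing about it is known to the Java compiler; every access has to go through a lookup like `position`.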
Section 2 explains problems that general purpose viewers have. Section 3 discusses how HDF5 Compound data works in Java. Section 4 explains the challenges in XML.
2. Issues for a Viewer/Editor
The Java viewer has additional problems with compound data, problems that are an issue for any visual viewer or editor, not just Java. The issue is how to display and/or manipulate compound datasets.
Arrays of data have conventional methods for display, as images or as 'spreadsheet'-like grids. Multidimensional arrays can be stacked in slices, etc. Each entry is a single number or string, and there is no need to label each value with a name or type.
In the case of compound data, though, each element has structure. E.g., in a spreadsheet, there would be a set of values in each cell. Furthermore, the values really need to be labeled to identify the fields. Calculating a reasonable layout for an arbitrary record is non-trivial, and a model of 'navigation' (e.g., to "drill down" from a summary to more detailed records, etc.) is not obvious.
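One small piece of the layout problem, labeling the values, can be sketched as follows. This is only an illustration under assumptions: the type is represented as a nested `Map` (a `String` value stands for an atomic type name, a `Map` value for a nested compound type), and the class name `ColumnLabels` is hypothetical. It flattens a nested type into dotted column labels such as a spreadsheet-style view might use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ColumnLabels {
    /** Returns one label per atomic field, in declaration order of the maps. */
    static List<String> labels(Map<String, Object> type) {
        List<String> out = new ArrayList<>();
        collect("", type, out);
        return out;
    }

    @SuppressWarnings("unchecked")
    private static void collect(String prefix, Map<String, Object> type, List<String> out) {
        for (Map.Entry<String, Object> e : type.entrySet()) {
            String label = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                // Nested compound type: descend and extend the label prefix.
                collect(label, (Map<String, Object>) e.getValue(), out);
            } else {
                // Atomic field: it becomes one column.
                out.add(label);
            }
        }
    }
}
```

Flattening gives usable column headers, but it does not address navigation or editing of whole fields, which remain design questions.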
In order to support editing compound data, it is also necessary to select and modify individual values within a record. For a single record this is often done with a 'form', which could be automatically generated from the type description. However, for an array of hundreds of records, one would also like to be able to manipulate whole fields (e.g., a column of a table), or groups of records. This is very tricky.
This is not to say that nothing can be done. The conclusion is that careful design is needed.
These problems are not specific to Java, except for the fact that several standard widgets work extremely well for arrays of numbers or strings, but cannot deal with compound records. This means that an implementation will require more work, compared to simple types that are already supported by standard widgets.
3. Compound Data in Java
3.1. The Problem
The first question to ask is "doesn't Java have structures like C, that can be mapped to compound data?" The answer is that Java objects can be conceptually equivalent to C structs, but cannot easily be mapped to them. The reasons are basically:
- Java is extremely strongly typed and types are declared at compile time.
- Java does not allow any access to storage layout of objects.
For any given record format, one can create a Java object and special C code that converts (field by field) between a C struct and the members of the Java object. An example of this can be seen in the NCSA HDF4 Java interface for HDF4 compression parameters: the class ncsa.hdf.hdflib.HDFCompInfo and its sub-classes (see Appendix 1). However, there is no way to do this for arbitrary records at run time.
3.2. What Can Be Done
The upshot is that for a Java program to read or write HDF5 compound data, it must do one of the following:
- convert the data of the Java object into an array of bytes, in a way that assures that the bytes will be arranged in correct C order on any platform (and vice versa for read). This is extremely difficult to do in a portable way: the Java code must attempt to lay out (or interpret) bytes according to the C layout of the platform and compiler used by the library code, which is completely unknown to the Java language.
- access the data field by field, i.e., read 'field a' from all the elements into an array of appropriate type. (See Appendix 2.) This is quite inefficient, as the read or write is skipping across memory or disk. However, the data in the HDF5 file is completely compatible with any other HDF5 program, and can be read and written by C or Fortran.
However, for "reasonable" compound data, it is perfectly possible to analyze the type, allocate buffers of the appropriate type and size (e.g., an array of int for a field of type integer 32, a second array of floats for a field of type float 32, etc.), and then read or write data.
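The two options can be related in a short sketch. It assumes a record of one 32-bit integer followed by one 32-bit float with no padding (offsets 0 and 4, record size 8), and it takes the byte order as a parameter; the class name `RecordPacker` is hypothetical. The per-field arrays are the buffers described above, and `pack` shows the byte-level conversion of the first option. In a real program the offsets and byte order would have to match what the C library expects, which is exactly the information Java cannot see.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class RecordPacker {
    // Assumed layout: { int at offset 0, float at offset 4 }, no padding.
    static final int RECORD_SIZE = 8;
    static final int INT_OFFSET = 0;
    static final int FLOAT_OFFSET = 4;

    /** Interleaves two per-field arrays into one byte array of packed records. */
    static byte[] pack(int[] ints, float[] floats, ByteOrder order) {
        if (ints.length != floats.length) {
            throw new IllegalArgumentException("field arrays differ in length");
        }
        ByteBuffer buf = ByteBuffer.allocate(ints.length * RECORD_SIZE).order(order);
        for (int i = 0; i < ints.length; i++) {
            buf.putInt(i * RECORD_SIZE + INT_OFFSET, ints[i]);
            buf.putFloat(i * RECORD_SIZE + FLOAT_OFFSET, floats[i]);
        }
        return buf.array();
    }

    /** Splits packed records back into the caller's per-field arrays. */
    static void unpack(byte[] bytes, ByteOrder order, int[] ints, float[] floats) {
        if (ints.length != floats.length || bytes.length != ints.length * RECORD_SIZE) {
            throw new IllegalArgumentException("buffer sizes do not match");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(order);
        for (int i = 0; i < ints.length; i++) {
            ints[i] = buf.getInt(i * RECORD_SIZE + INT_OFFSET);
            floats[i] = buf.getFloat(i * RECORD_SIZE + FLOAT_OFFSET);
        }
    }
}
```

The hard-coded offsets are the weak point: a compiler that pads or reorders the struct would silently break this code, which is why the field-by-field approach is the safer default.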
4. Compound Data in XML
XML is character oriented, and does not specify a standard for representing arrays of numbers or arrays of records. XML has no problem marking up an HDF5 compound data type description, and the nested nature of the type model fits nicely into XML. However, there is no standard for how the contents of the data records should be represented.
4.1. No markup: how it is done currently
The current h5dump program writes the values of data from an HDF5
file into ASCII. In the case of an array of numbers, the result is a
block of characters, with the numbers delimited by separators, in C memory
order. (Figure 1) The '--xml' option uses this same code,
with white space as the separator.
In the case of compound data, the values of the fields in each element are written in order, with the elements in C memory order. Note that the alignment (or size) of the elements is irrelevant and not present in the ASCII or XML file. The values are a block of characters, with individual values separated by markers.
In this scheme, an XML parser must parse the <Datatype> element to discover the type of the data, and the <Dataspace> element to learn how many elements there are. From this, an array of appropriate type and size is allocated (using heuristics as needed). The <DataFromFile> element is read as a large block of characters. The string read from <DataFromFile> must be parsed to pull out each value, convert it to the appropriate type, and write it into the array. For an array of numbers, there is a single array and the string is parsed to find numbers of a particular type, e.g., integers. (The numbers have to be checked by the parser to make sure that they are within the required range, do or don't have signs, etc.)
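For the simple case of an array of 32-bit integers, the parsing and checking step might look like the following sketch. The class name `IntBlockParser` is hypothetical, and the expected count is assumed to have been taken from the <Dataspace> element already.

```java
final class IntBlockParser {
    /**
     * Parses a white-space separated block of text into an int array,
     * checking the number of values and that each fits in 32 bits.
     */
    static int[] parse(String block, int expectedCount) {
        String trimmed = block.trim();
        String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        if (tokens.length != expectedCount) {
            throw new IllegalArgumentException(
                "expected " + expectedCount + " values, found " + tokens.length);
        }
        int[] out = new int[expectedCount];
        for (int i = 0; i < tokens.length; i++) {
            // Parse as long first so that out-of-range values can be reported clearly.
            long v = Long.parseLong(tokens[i]);
            if (v < Integer.MIN_VALUE || v > Integer.MAX_VALUE) {
                throw new NumberFormatException(
                    "out of range for a 32-bit integer: " + tokens[i]);
            }
            out[i] = (int) v;
        }
        return out;
    }
}
```

An unsigned or narrower type would need its own range check, so a general parser needs one such routine per atomic type.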
Compound data, on the other hand, is heterogeneous. Figure 2 shows
a sample of compound data, as implemented in the current h5dump.
Ignoring the issue of storage allocation (which is a problem for Java),
the parser must parse the <Datatype> element, and then construct
a state machine to reflect the order of the elements expected. Then
the <DataFromFile> string is parsed, looking for the next field
of the current element, converting to a number according to the type of
the field, and then writing to the appropriate memory location. This
must be able to handle any number and order of atomic fields. It is important
to note that this is a problem for any XML parser, not just for
Java.
<!-- A one D array with 5 elements -->
<!-- Each element is compound with the following fields:
     DATATYPE H5T_COMPOUND {
        H5T_STD_I32BE "a_name";
        H5T_IEEE_F32BE "b_name";
        H5T_IEEE_F64BE "c_name";
     }
-->
<Data>
   <DataFromFile>
      0 0 1 1 1 0.5 2 4 0.333333 3 9 0.25 4 16 0.2
   </DataFromFile>
</Data>
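The state machine described above can be sketched for this one fixed type (int, float, double). The class name `CompoundTextParser` is hypothetical, and the field order is hard-coded here; a general parser would build the sequence of states from the <Datatype> element at run time.

```java
final class CompoundTextParser {
    final int[] a;     // values of field "a_name"
    final float[] b;   // values of field "b_name"
    final double[] c;  // values of field "c_name"

    private CompoundTextParser(int count) {
        a = new int[count];
        b = new float[count];
        c = new double[count];
    }

    /** Parses count records of (int, float, double) from a white-space separated block. */
    static CompoundTextParser parse(String block, int count) {
        String[] tokens = block.trim().split("\\s+");
        if (tokens.length != 3 * count) {
            throw new IllegalArgumentException(
                "expected " + (3 * count) + " values, found " + tokens.length);
        }
        CompoundTextParser out = new CompoundTextParser(count);
        int field = 0;   // state: which field of the current record comes next
        int record = 0;  // index of the current record
        for (String token : tokens) {
            switch (field) {
                case 0:  out.a[record] = Integer.parseInt(token);   break;
                case 1:  out.b[record] = Float.parseFloat(token);   break;
                default: out.c[record] = Double.parseDouble(token); break;
            }
            field = (field + 1) % 3;
            if (field == 0) {
                record++;  // all three fields seen: move to the next record
            }
        }
        return out;
    }
}
```

Note that the values are stored in one array per field, which sidesteps the storage allocation problem in Java but is not the interleaved layout a C program would use.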
4.2. What about marking up the compound data?
Part of the problem stems from the representation of the data values as a single block of characters. The 'structure' of the data values is un-marked, so the parser has to be very smart.
The burden on the parser can be reduced by adding mark up to the data. There are two general flavors. First, the data can be left as a single XML element, but extra markers besides white spaces can be added. For instance, the h5dump DDL output uses {} to delimit values of a record. (Figure 1) This sort of markup could simplify the string parser by unambiguously marking the limits of records (especially nested types).
The second approach would be to use XML mark up. For example,
there could be a <Record> element, for each element of the
array, and there could be a <Field> element, to give the name
and value. This sort of mark up would give the XML parser much more
to work with. (XML DTDs allow this kind of markup, and XML Schema
supports this sort of mark up quite nicely.) Of course, the parser would still
have to check that the marked up data matches the HDF5 datatype
declaration, and would still have to convert each value into a number
and store it in the appropriate data item.
<!-- A one D array with 5 elements -->
<!-- Each element is compound with the following fields:
     DATATYPE H5T_COMPOUND {
        H5T_STD_I32BE "a_name";
        H5T_IEEE_F32BE "b_name";
        H5T_IEEE_F64BE "c_name";
     }
-->
<Data>
   <DataFromFile>
      <Record>
         <Field Name="a_name">0</Field>
         <Field Name="b_name">1</Field>
         <Field Name="c_name">1</Field>
      </Record>
      <Record>
         <Field Name="a_name">1</Field>
         <Field Name="b_name">0.5</Field>
         <Field Name="c_name">2</Field>
      </Record>
      <!-- and so on... -->
   </DataFromFile>
</Data>
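With this kind of markup a standard XML parser does most of the work. The following sketch uses the DOM parser from the Java standard library to collect each <Record> as a map from field name to text value; the class name `RecordMarkupReader` is hypothetical, and the element and attribute names are the ones proposed in the example above, not an established format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class RecordMarkupReader {
    /** Returns one map per <Record>, keyed by the Name attribute of each <Field>. */
    static List<Map<String, String>> read(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
        List<Map<String, String>> records = new ArrayList<>();
        NodeList recordNodes = doc.getElementsByTagName("Record");
        for (int i = 0; i < recordNodes.getLength(); i++) {
            Element record = (Element) recordNodes.item(i);
            Map<String, String> fields = new LinkedHashMap<>();
            NodeList fieldNodes = record.getElementsByTagName("Field");
            for (int j = 0; j < fieldNodes.getLength(); j++) {
                Element field = (Element) fieldNodes.item(j);
                // The value is still text; converting it to the declared
                // HDF5 type is left to the caller.
                fields.put(field.getAttribute("Name"), field.getTextContent().trim());
            }
            records.add(fields);
        }
        return records;
    }
}
```

The values still arrive as strings, so the checks against the datatype declaration and the numeric conversions described above are not avoided, only made easier to organize.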
Clearly, some kind of mark up would be a good idea. Two important barriers have kept us from doing so up to now.
- First, there is no recognized standard, so we need to design carefully and be prepared to use such standards as they do arise.
- Second, the size of the XML file balloons alarmingly. A record with two 4-byte numbers would be several times larger in marked up Unicode, and the whole data block will be huge. The first option (adding markup in a string) uses less space, but is less likely to be standardized than the bulkier XML.
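The size penalty is easy to estimate. The sketch below builds the marked up form of one record of 4-byte integers, using the <Record> and <Field> elements from the example above, and counts its bytes in UTF-8. The class name `MarkupSize` and the field names and values in the usage are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class MarkupSize {
    /** Size in bytes of a record of 4-byte numbers in binary form. */
    static int binaryBytes(int fieldCount) {
        return 4 * fieldCount;
    }

    /** Size in bytes (UTF-8) of the same record written with Record/Field markup. */
    static int markedUpBytes(String[] names, int[] values) {
        if (names.length != values.length) {
            throw new IllegalArgumentException("names and values differ in length");
        }
        StringBuilder sb = new StringBuilder("<Record>");
        for (int i = 0; i < names.length; i++) {
            sb.append("<Field Name=\"").append(names[i]).append("\">")
              .append(values[i])
              .append("</Field>");
        }
        sb.append("</Record>");
        return sb.toString().getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Even with one-letter field names and one-digit values, as in `MarkupSize.markedUpBytes(new String[] {"a", "b"}, new int[] {1, 2})`, the tags alone outweigh the 8 bytes of binary data many times over; indentation and longer names make it worse.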
4.3. What about 'pointing' to the data in the file?
In many cases XML is used for a description of an object, such as a
dataset. In this use, the data is commonly not represented in the
XML; rather, the <Data> element is a pointer to the data.
Figure 4 shows an example of how this might look.
<!-- A one D array with 5 elements -->
<!-- Each element is compound with the following fields:
     DATATYPE H5T_COMPOUND {
        H5T_STD_I32BE "a_name";
        H5T_IEEE_F32BE "b_name";
        H5T_IEEE_F64BE "c_name";
     }
-->
<Data>
   <NativeHDF5>
      <xlink type="locator" href="url of HDF5 file" H5Path="/Dataset3" />
      <!-- will need other attributes to indicate which elements to read -->
   </NativeHDF5>
</Data>
The XML software will be able to analyze the type (from the <Datatype> element), and have a complete picture of what is pointed to (including how many elements, the fields, etc.). The <xlink> element is modeled on XLink, the W3C linking specification that accompanies XML, which is designed to implement various kinds of 'smart pointers'. The <xlink> would presumably contain information about what software or service to call, and what data from the file to request. Assuming that the required service or external library is available, the XML program can retrieve the data directly when it needs it.
This approach will meet the needs of many uses. The data remains in an HDF5 dataset, perhaps at a network data provider. The data is retrieved directly, and is never converted into XML. This saves space and the time needed to convert from binary to characters. The external library or service can provide useful options for accessing the compound data, e.g., by fields, or for selected records.
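On the XML side, the work then reduces to extracting the pointer. The sketch below reads the href and H5Path attributes from an <xlink> element like the one in the example above, using the DOM parser from the Java standard library. The class name `DataLocator` is hypothetical, and the step of actually calling a service or library with the result is left out because no such interface is defined here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DataLocator {
    final String href;    // location of the HDF5 file
    final String h5Path;  // path of the dataset inside the file

    private DataLocator(String href, String h5Path) {
        this.href = href;
        this.h5Path = h5Path;
    }

    /** Extracts the first <xlink> locator found in the given XML text. */
    static DataLocator fromXml(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
        NodeList links = doc.getElementsByTagName("xlink");
        if (links.getLength() == 0) {
            throw new IllegalArgumentException("no <xlink> element found");
        }
        Element link = (Element) links.item(0);
        return new DataLocator(link.getAttribute("href"), link.getAttribute("H5Path"));
    }
}
```

The XML program never sees the record layout at all in this design; it only passes the locator to whatever component knows how to read HDF5.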
Appendix 1: How Java can access C structs
Java can interface to C routines that use structs using the Java
Native Interface. This is explained here.
Figure 1 sketches the layers of software involved: the Java program
(which sees only Java objects, some with 'native' methods), Native methods
(which have a Java stub and a C stub), and the C code (regular C functions,
called from the JNI). The JNI support library provides the means
to pass objects of atomic type (such as int) from Java to C, and also has
methods to extract the value of public fields of a Java object into C.
Figure 1. Sketch of the data flow when passing a Java object
as a C struct
The overall action is:
- In the Java program, an instance of a Java object, e.g., HDFCompInfo, is created and initialized.
- The object is passed as an argument to a native method. The object is cast to type java.lang.Object (i.e., 'any').
- In the C stub, calls are made to access the Java object, and to read the values of public fields into C variables.
- The C stub declares the appropriate C struct, and fills in the fields from the values from the Java object.
- The regular C function is called with the C struct, e.g., a 'comp_info' struct is passed to HDF.
Example code is shown below.
Example: passing a 'comp_info' struct to HDF
In HDF 4, the 'comp_info' parameter is a union of structs. Table
1 shows this structure.
typedef union tag_comp_info |
This is represented in Java as a set of classes. The 'union'
is represented as a super class with several sub-classes. For example, HDFCompInfo
has sub-classes HDFDeflateCompInfo, HDFNBITCompInfo, and so on.
Each sub-class has the fields that correspond to one of the structs in
the C union.
| Class Hierarchy
class ncsa.hdf.hdflib.HDFCompInfo
|
Table 3 shows some examples of these classes. Each class represents
a C struct, so it has data fields but no methods (other than constructors
and accessors). For example, the class HDFJPEGCompInfo has
public fields for quality and force_baseline.
package ncsa.hdf.hdflib; |
The JNI C stubs can access the values in the Java HDFCompInfo object
via built-in JNI calls. Table 4 shows an example C function that
is passed a reference to a Java HDFCompInfo object, and initializes a comp_info
structure with the appropriate values. This code determines the sub-class
of the object, and then extracts each field by name.
jboolean getOldCompInfo( JNIEnv *env, jobject ciobj, comp_info *cinf) |
Similar code can be written to go the other way: to read a C struct, and then construct a corresponding Java object, initialized with appropriate values.
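The idea of extracting each field by name can also be illustrated in pure Java with reflection, as a rough analogue of what the C stub does through JNI. This is not the code used in the HDF interface: the classes `DemoCompInfo` and `DemoJpegCompInfo` are hypothetical stand-ins (only the field names quality and force_baseline come from the description above), and `FieldExtractor` is likewise made up for this sketch.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the super class that represents the C union. */
class DemoCompInfo { }

/** Hypothetical stand-in for one sub-class: public data fields, no behavior. */
class DemoJpegCompInfo extends DemoCompInfo {
    public int quality;
    public int force_baseline;
}

final class FieldExtractor {
    /** Reads every public, non-static int field of the object, keyed by field name. */
    static Map<String, Integer> intFields(Object obj) {
        Map<String, Integer> out = new TreeMap<>();
        for (Field f : obj.getClass().getFields()) {
            if (f.getType() == int.class && !Modifier.isStatic(f.getModifiers())) {
                try {
                    // One lookup and one copy per field: there is no bulk copy.
                    out.put(f.getName(), f.getInt(obj));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("cannot read field " + f.getName(), e);
                }
            }
        }
        return out;
    }
}
```

Whether done through JNI or reflection, the copy is always one named field at a time, which is why each new struct needs its own matching class and conversion code.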
Conclusion
As this example shows, it is possible (if very annoying) to create tightly linked Java objects and C structs. Note that it is definitely not possible to simply copy bytes from the Java object to the C struct (or vice versa), e.g., with a memcpy. Note, too, that this is all hand-coded. To add a new struct, it would be necessary to create a new Java class, and new C code to access it.
Appendix 2: Example of Compound Data: Reading field by field
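// Fragment: assumes the NCSA Java HDF5 interface classes used below (H5,
// HDF5Constants, HDF5CDataTypes, HDF5Exception) are imported, and that file,
// dataset1Name, dataspace1, dataset2, and numElements are defined elsewhere.

/* the HDF5 Compound data type is ( int, float ) */
datatype1 = -1;
try {
    // 8-byte record: "int" at offset 0, "float" at offset 4
    datatype1 = H5.H5Tcreate(HDF5Constants.H5T_COMPOUND, 8);
    H5.H5Tinsert(datatype1, "int", 0,
        H5.J2C(HDF5CDataTypes.JH5T_NATIVE_INT));
    H5.H5Tinsert(datatype1, "float", 4,
        H5.J2C(HDF5CDataTypes.JH5T_NATIVE_FLOAT));
    dataset1 = H5.H5Dcreate(file, dataset1Name,
        datatype1, dataspace1, HDF5Constants.H5P_DEFAULT);
} catch (HDF5Exception ex) {
    // error handling omitted in this example
}

// ...

// Read the compound data by field

// Read only the "int" field: a 4-byte compound type with one member at offset 0
int[] outIData = new int[numElements];
datatype2 = -1;
try {
    datatype2 = H5.H5Tcreate(HDF5Constants.H5T_COMPOUND, 4);
    H5.H5Tinsert(datatype2, "int", 0,
        H5.J2C(HDF5CDataTypes.JH5T_NATIVE_INT));
    status = H5.H5Dread(dataset2,
        datatype2,
        HDF5Constants.H5S_ALL,
        HDF5Constants.H5S_ALL,
        HDF5Constants.H5P_DEFAULT,
        outIData);
} catch (Exception ex) {
    // error handling omitted in this example
}

// Read only the "float" field in the same way
float[] outFData = new float[numElements];
datatype2 = -1;
try {
    datatype2 = H5.H5Tcreate(HDF5Constants.H5T_COMPOUND, 4);
    H5.H5Tinsert(datatype2, "float", 0,
        H5.J2C(HDF5CDataTypes.JH5T_NATIVE_FLOAT));
    status = H5.H5Dread(dataset2,
        datatype2,
        HDF5Constants.H5S_ALL,
        HDF5Constants.H5S_ALL,
        HDF5Constants.H5P_DEFAULT,
        outFData);
} catch (Exception ex) {
    // error handling omitted in this example
}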
Last modified: June 26th 2007
