
Compound Data: Technical Issues for XML, Java, and Tools

Robert E. McGrath
March 27, 2001

HDF5 compound data types are very general and flexible C data structures. They present a complex problem for non-C applications, including XML parsers and Java applications. This document explains the key issues, some of which have been solved and some of which are still open.

1. Overview

The HDF5 data model includes a flexible and general model for defining data and records of heterogeneous elements, called "compound data types". A dataset declared to have a compound datatype is then a multi-dimensional array of records. Compound data types may have fields of any atomic type, and also may have fields that are compound (sometimes termed "nested compound types"). The data type defines the offset and storage layout of all the elements in a compound element.

The HDF5 library also supports reading and writing data to and from a dataset with compound type. The access may write (read) one or more complete records, or may write to (read from) particular fields of the records. The HDF5 library manages the scatter/gather necessary to transfer the values to correct offsets on disk or memory.
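To make the scatter/gather idea concrete, here is a minimal Java sketch. It is not HDF5 library code. It assumes a hypothetical packed record of 8 bytes (a 32-bit int at offset 0 and a 32-bit float at offset 4), and the class and method names are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: a hypothetical 8-byte record { int a; float b; }.
final class FieldGather {
    static final int RECORD_SIZE = 8;   // assumed packed record size in bytes
    static final int OFFSET_A = 0;      // assumed offset of the int field
    static final int OFFSET_B = 4;      // assumed offset of the float field

    // Scatter: write parallel field arrays into one buffer of whole records.
    static ByteBuffer scatter(int[] a, float[] b) {
        ByteBuffer buf = ByteBuffer.allocate(a.length * RECORD_SIZE)
                                   .order(ByteOrder.BIG_ENDIAN);
        for (int i = 0; i < a.length; i++) {
            buf.putInt(i * RECORD_SIZE + OFFSET_A, a[i]);
            buf.putFloat(i * RECORD_SIZE + OFFSET_B, b[i]);
        }
        return buf;
    }

    // Gather: pull only the float field out of every record.
    static float[] gatherB(ByteBuffer records) {
        int n = records.capacity() / RECORD_SIZE;
        float[] out = new float[n];
        for (int i = 0; i < n; i++) {
            out[i] = records.getFloat(i * RECORD_SIZE + OFFSET_B);
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer records = scatter(new int[] {0, 1, 2}, new float[] {0f, 1f, 4f});
        float[] b = gatherB(records);
        System.out.println(b.length + " float values gathered");
    }
}
```

The HDF5 library does the equivalent work internally, using the offsets recorded in the compound datatype rather than hard-coded constants.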

Compound data creates a number of challenges for non-C environments. These stem from several fundamental features:

The first feature is basically irrelevant to XML, and is outside the programming model of Java entirely. The second feature is outside the model of basic XML, and is impossible in Java.

In addition, accessing data must be done according to C language rules, i.e., reading into arrays of C structs. This model is completely different from the I/O of Java or XML, and requires complex plumbing to convert from one organization to another. The reorganization must be done at run time, if it can be done at all.

Section 2 explains problems that general purpose viewers have. Section 3 discusses how HDF5 Compound data works in Java. Section 4 explains the challenges in XML.

2. Issues for a Viewer/Editor

The Java viewer has additional problems with compound data, problems that are an issue for any visual viewer or editor, not just Java. The issue is how to display and/or manipulate compound datasets.

Arrays of data have conventional methods for display, as images or as 'spreadsheet'-like grids. Multidimensional arrays can be stacked in slices, etc. Each entry is a single number or string, and there is no need to label each value with a name or type.

In the case of compound data, though, each element has structure. E.g., in a spreadsheet, there would be a set of values in each cell. Furthermore, the values really need to be labeled to identify the fields. Calculating a reasonable layout for an arbitrary record is non-trivial, and a model of 'navigation' (e.g., to "drill down" from a summary to more detailed records, etc.) is not obvious.

In order to support editing compound data, it is also necessary to select and modify individual values within a record. For a single record this is often done with a 'form', which could be automatically generated from the type description. However, for an array of hundreds of records, one would also like to be able to manipulate whole fields (e.g., a column of a table), or groups of records. This is very tricky.

This is not to say that nothing can be done. The conclusion is that careful design is needed.

These problems are not specific to Java, except for the fact that several standard widgets work extremely well for arrays of numbers or strings, but cannot deal with compound records. This means that an implementation will require more work, compared to simple types that are already supported by standard widgets.

3. Compound Data in Java

3.1 The Problem

The first question to ask is "doesn't Java have structures like C, that can be mapped to compound data?" The answer is that Java objects can be conceptually equivalent to C structs, but cannot easily be mapped to them. The reasons are basically:

One implication of these facts is that, unlike C, you cannot simply allocate a block of bytes, read data into it, and then cast the bytes into some other type. Since the memory layout of objects is never accessible to Java programs, there is no way to construct a valid object in memory from an array of bytes. It is necessary to create an instance of an object, and then assign values to each field, one by one.
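The field-by-field construction can be sketched as follows. This is only an illustration: the class Sample and its 16-byte packed, big-endian layout (int at offset 0, float at offset 4, double at offset 8) are assumptions made for the example, and the sketch relies on java.nio.ByteBuffer being available.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical Java counterpart of the C struct { int a; float b; double c; }.
final class Sample {
    final int a;
    final float b;
    final double c;

    Sample(int a, float b, double c) {
        this.a = a;
        this.b = b;
        this.c = c;
    }

    // Decode one record. There is no cast from bytes to object: each field
    // is read at its assumed offset and handed to the constructor.
    static Sample fromBytes(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.BIG_ENDIAN);
        int a = buf.getInt(0);
        float b = buf.getFloat(4);
        double c = buf.getDouble(8);
        return new Sample(a, b, c);
    }

    public static void main(String[] args) {
        // Build a 16-byte record by hand (ByteBuffer is big-endian by default).
        byte[] raw = ByteBuffer.allocate(16)
                .putInt(2).putFloat(4.0f).putDouble(0.5).array();
        Sample s = Sample.fromBytes(raw);
        System.out.println(s.a + " " + s.b + " " + s.c);
    }
}
```

Note that fromBytes is written by hand for this one layout, which is exactly the limitation described above.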

For any given record format, one can create a Java object and special C code that converts (field by field) between a C struct and the members of the Java object. An example of this can be seen in the NCSA HDF4 Java interface for HDF4 compression parameters: the class ncsa.hdf.hdflib.HDFCompInfo and its sub-classes (see Appendix 1). However, there is no way to do this for arbitrary records at run time.

3.2. What Can Be Done

The upshot is that for a Java program to read or write HDF5 compound data, it must do one of the following options:

In either case, there is considerable overhead in the Java program, required to convert between the natural Java representation and HDF5/C.

However, for "reasonable" compound data, it is perfectly possible to analyze the type, allocate buffers of the appropriate type and size (e.g., an array of int for a field of type integer 32, a second array of floats for a field of type float 32, etc.), and then read or write data.
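A minimal sketch of the "analyze the type, then allocate buffers" step is shown below. The type labels ("int32", "float32", "float64") and the class name are hypothetical stand-ins for whatever a program extracts from the real HDF5 datatype. The sketch does not call the HDF5 library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the type labels are made-up names, not HDF5 constants.
final class FieldBuffers {
    // Allocate one primitive array per field, chosen from the field's type label.
    static Map<String, Object> allocate(Map<String, String> fieldTypes, int numElements) {
        Map<String, Object> buffers = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fieldTypes.entrySet()) {
            switch (field.getValue()) {
                case "int32":
                    buffers.put(field.getKey(), new int[numElements]);
                    break;
                case "float32":
                    buffers.put(field.getKey(), new float[numElements]);
                    break;
                case "float64":
                    buffers.put(field.getKey(), new double[numElements]);
                    break;
                default:
                    // Nested or exotic types fall outside the "reasonable" subset.
                    throw new IllegalArgumentException(
                        "unsupported type: " + field.getValue());
            }
        }
        return buffers;
    }

    public static void main(String[] args) {
        Map<String, String> type = new LinkedHashMap<>();
        type.put("a_name", "int32");
        type.put("b_name", "float32");
        type.put("c_name", "float64");
        Map<String, Object> buffers = allocate(type, 5);
        System.out.println(buffers.keySet());
    }
}
```

Each buffer can then be filled by a separate read of the corresponding field.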

4. Compound Data in XML

XML is character oriented, and does not specify a standard for representing arrays of numbers or arrays of records. XML has no problem marking up an HDF5 compound data type description, and the nested nature of the type model fits nicely into XML. However, there is no standard for how the contents of the data records should be represented.

4.1. No markup: how it is done currently

The current h5dump program writes the values of data from an HDF5 file into ASCII. In the case of an array of numbers, the result is a block of characters, with the numbers separated by separators, in C memory order (Figure 1). The '--xml' option uses this same code, with white space as the separator.


DATASET "dset1" {
DATATYPE H5T_COMPOUND {
H5T_STD_I32BE "a_name";
H5T_IEEE_F32BE "b_name";
H5T_IEEE_F64BE "c_name";
}
DATASPACE SIMPLE { ( 5 ) / ( 5 ) }
DATA {
{
0,
0,
1
},
{
1,
1,
0.5
},
{
2,
4,
0.333333
},
{
3,
9,
0.25
},
{
4,
16,
0.2
}
}
}
Figure 1. h5dump output

In the case of compound data, the values of the fields in each element are written in order, with the elements in C memory order. Note that the alignment (or size) of the elements is irrelevant and not present in the ASCII or XML file. The values are a block of characters, with individual values separated by markers.

In this scheme, an XML parser must parse the <Datatype> element to discover the type of the data, and the <Dataspace> element to learn how many elements there are. From this, an array of appropriate type and size is allocated (using heuristics as needed). The <DataFromFile> element is read as a large block of characters. The string read from <DataFromFile> must be parsed to pull out each value, convert it to the appropriate type, and write it into the array. For an array of numbers, there is a single array and the string is parsed to find numbers of a particular type, e.g., integers. (The numbers have to be checked by the parser to make sure that they are within the required range, do or don't have signs, etc.)
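For the homogeneous case, the parsing step might look like the sketch below. It assumes the type turned out to be a signed 32-bit integer and that the element count is already known from the dataspace. The class name is invented for illustration.

```java
// Illustrative only: parse a whitespace-separated block into 32-bit integers.
final class IntBlockParser {
    static int[] parse(String block, int expectedCount) {
        String trimmed = block.trim();
        String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        if (tokens.length != expectedCount) {
            throw new IllegalArgumentException(
                "expected " + expectedCount + " values, found " + tokens.length);
        }
        int[] out = new int[expectedCount];
        for (int i = 0; i < tokens.length; i++) {
            // Integer.parseInt throws NumberFormatException for non-numeric
            // text and for values outside the signed 32-bit range.
            out[i] = Integer.parseInt(tokens[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] values = parse(" 0 1 2 3 4 ", 5);
        System.out.println(values.length + " values parsed");
    }
}
```

An unsigned or narrower type would need an explicit range check in place of the one that Integer.parseInt provides.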

Compound data, on the other hand, is heterogeneous. Figure 2 shows a sample of compound data, as implemented in the current h5dump. Ignoring the issue of storage allocation (which is a problem for Java), the parser must parse the <Datatype> element, and then construct a state machine to reflect the order of the elements expected. Then the <DataFromFile> string is parsed, looking for the next field of the current element, converting to a number according to the type of the field, and then writing to the appropriate memory location. This must be able to handle any number and order of atomic fields. It is important to note that this is a problem for any XML parser, not just for Java.

<!-- A one D array with 5 elements -->
<!-- Each element is compound with the following fields:
DATATYPE H5T_COMPOUND {
H5T_STD_I32BE "a_name";
H5T_IEEE_F32BE "b_name";
H5T_IEEE_F64BE "c_name";
}
-->
<Data>
<DataFromFile>
0 0 1 1 1 0.5 2 4 0.333333 3 9 0.25 4 16 0.2
</DataFromFile>
</Data>
Figure 2. Example of how compound data is written in XML.
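The state machine described above can be sketched for flat (non-nested) records as follows. The Kind enum and the class name are hypothetical. A real parser would derive the field order from the <Datatype> element instead of receiving it as an argument, and would also need to handle nested compound fields.

```java
// Illustrative only: parses an unmarked block of flat compound records.
final class CompoundTextParser {
    enum Kind { INT32, FLOAT32, FLOAT64 }

    // Returns one column per field (int[], float[] or double[]), in field order.
    static Object[] parse(String block, Kind[] order, int numElements) {
        String[] tokens = block.trim().split("\\s+");
        if (tokens.length != order.length * numElements) {
            throw new IllegalArgumentException(
                "token count does not match the datatype and dataspace");
        }
        Object[] columns = new Object[order.length];
        for (int f = 0; f < order.length; f++) {
            switch (order[f]) {
                case INT32:   columns[f] = new int[numElements];    break;
                case FLOAT32: columns[f] = new float[numElements];  break;
                default:      columns[f] = new double[numElements]; break;
            }
        }
        for (int i = 0; i < tokens.length; i++) {
            int record = i / order.length;
            int field = i % order.length;  // the parser "state": which field is next
            switch (order[field]) {
                case INT32:
                    ((int[]) columns[field])[record] = Integer.parseInt(tokens[i]);
                    break;
                case FLOAT32:
                    ((float[]) columns[field])[record] = Float.parseFloat(tokens[i]);
                    break;
                default:
                    ((double[]) columns[field])[record] = Double.parseDouble(tokens[i]);
                    break;
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        String data = "0 0 1 1 1 0.5 2 4 0.333333 3 9 0.25 4 16 0.2";
        Kind[] order = { Kind.INT32, Kind.FLOAT32, Kind.FLOAT64 };
        Object[] columns = parse(data, order, 5);
        System.out.println(((int[]) columns[0]).length + " records parsed");
    }
}
```

Because nothing in the text marks record boundaries, a single missing or extra token shifts every later value into the wrong field. The only defense available here is the token count check.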

4.2. What about marking up the compound data?

Part of the problem stems from the representation of the data values as a single block of characters. The 'structure' of the data values is un-marked, so the parser has to be very smart.

The burden on the parser can be reduced by adding mark up to the data. There are two general flavors. First, the data can be left as a single XML element, but extra markers besides white spaces can be added. For instance, the h5dump DDL output uses {} to delimit values of a record. (Figure 1) This sort of markup could simplify the string parser by unambiguously marking the limits of records (especially nested types).

The second approach would be to use XML mark up. For example, there could be a <Record> element, for each element of the array, and there could be a <Field> element, to give the name and value. This sort of mark up would give the XML parser much more to work with. (XML DTDs allow this kind of markup, and XML Schema supports this sort of mark up quite nicely.) Of course, it would still have to check that the marked up data matches the HDF5 datatype declaration, and will still have to convert each value into a number and store it in the appropriate data item.

<!-- A one D array with 5 elements -->
<!-- Each element is compound with the following fields:
DATATYPE H5T_COMPOUND {
H5T_STD_I32BE "a_name";
H5T_IEEE_F32BE "b_name";
H5T_IEEE_F64BE "c_name";
}
-->
<Data>
<DataFromFile>
<Record>
<Field Name="a_name">0</Field>
<Field Name="b_name">0</Field>
<Field Name="c_name">1</Field>
</Record>
<Record>
<Field Name="a_name">1</Field>
<Field Name="b_name">1</Field>
<Field Name="c_name">0.5</Field>
</Record>
<!-- and so on... -->
</DataFromFile>
</Data>
Figure 4. Example of how compound data could be marked up in XML.
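With markup of this kind, a standard XML parser can deliver the record structure directly. The sketch below uses the JAXP DOM API from the Java platform. The <Record> and <Field> element names follow the proposal above, while the class name and the sample values are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative only: reads flat <Record>/<Field> markup into name -> text maps.
final class RecordMarkupReader {
    static List<Map<String, String>> read(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        List<Map<String, String>> records = new ArrayList<>();
        NodeList recordNodes = doc.getElementsByTagName("Record");
        for (int i = 0; i < recordNodes.getLength(); i++) {
            Element record = (Element) recordNodes.item(i);
            Map<String, String> fields = new LinkedHashMap<>();
            NodeList fieldNodes = record.getElementsByTagName("Field");
            for (int j = 0; j < fieldNodes.getLength(); j++) {
                Element field = (Element) fieldNodes.item(j);
                fields.put(field.getAttribute("Name"), field.getTextContent().trim());
            }
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<Data><DataFromFile>"
                + "<Record><Field Name=\"a_name\">1</Field>"
                + "<Field Name=\"b_name\">1</Field>"
                + "<Field Name=\"c_name\">0.5</Field></Record>"
                + "</DataFromFile></Data>";
        Map<String, String> first = read(xml).get(0);
        // Values still arrive as text; converting and range checking each one
        // against the HDF5 datatype remains the caller's job.
        System.out.println(first.keySet());
    }
}
```

The XML parser now finds the record and field boundaries, but the comparison with the declared datatype and the numeric conversion are still left to the application.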

Clearly, some kind of mark up would be a good idea. There are two important barriers to doing so up to now.


4.3. What about 'pointing' to the data in the file?

In many cases XML is used for a description of an object, such as a dataset. In this use, the data is commonly not represented in the XML; rather, the <Data> element is a pointer to the data. Figure 5 shows an example of how this might look.

<!-- A one D array with 5 elements -->
<!-- Each element is compound with the following fields:
DATATYPE H5T_COMPOUND {
H5T_STD_I32BE "a_name";
H5T_IEEE_F32BE "b_name";
H5T_IEEE_F64BE "c_name";
}
-->
<Data>
<NativeHDF5>
<xlink type="locator" href="url of HDF5 file" H5Path="/Dataset3" />
<!-- will need other attributes to indicate which elements to read -->
</NativeHDF5>
</Data>
Figure 5. Example of how compound data could be pointed to in XML.

The XML software will be able to analyze the type (from the <Datatype> element), and have a complete picture of what is pointed to (including how many elements, the fields, etc.). The <xlink> element is modeled on XLink, the W3C linking specification that accompanies XML, which is designed to implement various kinds of 'smart pointers'. The <xlink> would presumably contain information about what software or service to call, and what data from the file to request. Assuming that the required service or external library is available, the XML program can retrieve the data directly when it needs it.

This approach will meet the needs of many uses. The data remains in an HDF5 dataset, perhaps at a network data provider. The data is retrieved directly, and is never converted into XML. This saves space and the time needed to convert from binary to characters. The external library or service can provide useful options for accessing the compound data, e.g., by fields, or for selected records.


Appendix 1: How Java can access C structs


Java can call C routines that take structs by using the Java Native Interface (JNI). This appendix explains how.

Figure 1 sketches the layers of software involved: the Java program (which sees only Java objects, some with 'native' methods), Native methods (which have a Java stub and a C stub), and the C code (regular C functions, called from the JNI). The JNI support library provides the means to pass objects of atomic type (such as int) from Java to C, and also has methods to extract the value of public fields of a Java object into C.





Figure 1. Sketch of the data flow when passing a Java object as a C struct

The overall action is:

  1. In the Java program, an instance of a Java object, e.g., HDFCompInfo, is created and initialized.
  2. The object is passed as an argument to a native method. The object is cast to type java.lang.Object (i.e., 'any').
  3. In the C stub, calls are made to access the Java object, and to read the values of public fields into C variables.
  4. The C stub declares the appropriate C struct, and fills in the fields from the values from the Java object.
  5. The regular C function is called with the C struct, e.g., a 'comp_info' struct is passed to HDF.


Example code is shown below.

Example: passing a 'comp_info' struct to HDF

In HDF 4, the 'comp_info' parameter is a union of structs. Table 1 shows this structure.

Table 1. From HDF4: 'comp_info' is a union of structs
typedef union tag_comp_info
{
    struct
    {
        intn quality;
        intn force_baseline;
    }
    jpeg;
    struct
    {
        int32 nt;
        intn sign_ext;
        intn fill_one;
        intn start_bit;
        intn bit_len;
    }
    nbit;
    struct
    {
        intn skp_size; /* size of the individual elements when skipping */
    }
    skphuff;
    struct
    {
        intn level; /* how hard to work when compressing the data */
    }
    deflate;
}
comp_info;

This is represented in Java as a set of classes. The 'union' is represented as a super class with several sub-classes: HDFCompInfo is the super class, with sub-classes such as HDFDeflateCompInfo, HDFNBITCompInfo, and so on. Each sub-class has the fields that correspond to one of the structs in the C union.

Table 2. Class hierarchy for CompInfo
Class Hierarchy

class ncsa.hdf.hdflib.HDFCompInfo
    class ncsa.hdf.hdflib.HDFNewCompInfo
        class ncsa.hdf.hdflib.HDFDeflateCompInfo
        class ncsa.hdf.hdflib.HDFNBITCompInfo
        class ncsa.hdf.hdflib.HDFRLECompInfo
        class ncsa.hdf.hdflib.HDFSKPHUFFCompInfo
    class ncsa.hdf.hdflib.HDFOldCompInfo
        class ncsa.hdf.hdflib.HDFIMCOMPCompInfo
        class ncsa.hdf.hdflib.HDFJPEGCompInfo
        class ncsa.hdf.hdflib.HDFOldRLECompInfo
class ncsa.hdf.hdflib.HDFCompModel

Table 3 shows some examples of these classes. Each class represents a C struct, so it has data fields but no methods (other than constructors and accessors). For example, the class HDFJPEGCompInfo has public fields for quality and force_baseline.

Table 3. Examples of the classes
package ncsa.hdf.hdflib;

// Each public class below belongs in its own source file.

public class HDFOldCompInfo extends HDFCompInfo {
    public int ctype;

    // The constructor name must match the class name.
    public HDFOldCompInfo() {
        ctype = HDFConstants.COMP_CODE_NONE;
    }
}

public class HDFJPEGCompInfo extends HDFOldCompInfo {
    public int quality;
    public int force_baseline;

    public HDFJPEGCompInfo() {
        ctype = HDFConstants.COMP_JPEG;
    }

    public HDFJPEGCompInfo(int qual, int fb) {
        ctype = HDFConstants.COMP_JPEG;
        quality = qual;
        force_baseline = fb;
    }
}

public class HDFNBITCompInfo extends HDFNewCompInfo {
    // Public, like the other classes, so the fields mirror the C struct.
    public int nt;
    public int sign_ext;
    public int fill_one;
    public int start_bit;
    public int bit_len;

    public HDFNBITCompInfo() {
        ctype = HDFConstants.COMP_CODE_NBIT;
    }

    public HDFNBITCompInfo(int Nt, int Sign_ext, int Fill_one,
            int Start_bit, int Bit_len) {
        ctype = HDFConstants.COMP_CODE_NBIT;
        // Assign to the fields; declaring new locals here would leave them unset.
        nt = Nt;
        sign_ext = Sign_ext;
        fill_one = Fill_one;
        start_bit = Start_bit;
        bit_len = Bit_len;
    }
}

The JNI C stubs can access the values in the Java HDFCompInfo object via built-in JNI calls. Table 4 shows an example C function that is passed a reference to a Java HDFCompInfo object, and initializes a comp_info structure with the appropriate values. This code determines the sub-class of the object, and then extracts each field by name.

Table 4. C routine, using JNI to get the fields of an object. (Code which is called via '(*env)->' is support code from the JNI library.)
#include <jni.h>
#include "hdf.h"   /* comp_info and the COMP_* constants */

jboolean getOldCompInfo( JNIEnv *env, jobject ciobj, comp_info *cinf)
{
    jfieldID jf;
    jclass jc;
    jint ctype;

    /* Read the 'ctype' field to find out which sub-class was passed. */
    jc = (*env)->FindClass(env, "ncsa/hdf/hdflib/HDFOldCompInfo");
    if (jc == NULL) {
        return JNI_FALSE;
    }
    jf = (*env)->GetFieldID(env, jc, "ctype", "I");
    if (jf == NULL) {
        return JNI_FALSE;
    }
    ctype = (*env)->GetIntField(env, ciobj, jf);

    switch(ctype) {
    case COMP_JPEG:
        /* Copy each field of HDFJPEGCompInfo, by name, into the C struct. */
        jc = (*env)->FindClass(env, "ncsa/hdf/hdflib/HDFJPEGCompInfo");
        if (jc == NULL) {
            return JNI_FALSE;
        }
        jf = (*env)->GetFieldID(env, jc, "quality", "I");
        if (jf == NULL) {
            return JNI_FALSE;
        }
        cinf->jpeg.quality = (*env)->GetIntField(env, ciobj, jf);

        jf = (*env)->GetFieldID(env, jc, "force_baseline", "I");
        if (jf == NULL) {
            return JNI_FALSE;
        }
        cinf->jpeg.force_baseline = (*env)->GetIntField(env, ciobj, jf);
        break;

    /* Nothing is copied for these cases. */
    case COMP_NONE:
    case COMP_RLE:
    case COMP_IMCOMP:
    default:
        break;
    }

    return JNI_TRUE;
}

Similar code can be written to go the other way: to read a C struct, and then construct a corresponding Java object, initialized with appropriate values.

Conclusion

As this example shows, it is possible (if very annoying) to create tightly linked Java objects and C structs. Note that it is definitely not possible to simply copy bytes from the Java object to the C struct (or vice versa), e.g., with a memcpy. Note, too, that this is all hand-coded. To add a new struct, it would be necessary to create a new Java class, and new C code to access it.


Appendix 2: Example of Compound Data: Reading field by field


/* The HDF5 compound data type is ( int, float ): 8 bytes per record. */
datatype1 = -1;
try {
    datatype1 = H5.H5Tcreate(HDF5Constants.H5T_COMPOUND, 8);

    /* int field at offset 0 */
    H5.H5Tinsert(datatype1, "int", 0,
        H5.J2C(HDF5CDataTypes.JH5T_NATIVE_INT));

    /* float field at offset 4 */
    H5.H5Tinsert(datatype1, "float", 4,
        H5.J2C(HDF5CDataTypes.JH5T_NATIVE_FLOAT));

    dataset1 = H5.H5Dcreate(file, dataset1Name,
        datatype1, dataspace1, HDF5Constants.H5P_DEFAULT);
} catch (HDF5Exception ex) {
    ex.printStackTrace();
}

// ...

// Read the compound data by field

// Memory type with only the "int" field: 4 bytes, field at offset 0.
int[] outIData = new int[numElements];
datatype2 = -1;
try {
    datatype2 = H5.H5Tcreate(HDF5Constants.H5T_COMPOUND, 4);

    H5.H5Tinsert(datatype2, "int", 0,
        H5.J2C(HDF5CDataTypes.JH5T_NATIVE_INT));

    status = H5.H5Dread(dataset2,
        datatype2,
        HDF5Constants.H5S_ALL,
        HDF5Constants.H5S_ALL,
        HDF5Constants.H5P_DEFAULT,
        outIData);
} catch (Exception ex) {
    ex.printStackTrace();
}

// Memory type with only the "float" field: 4 bytes, field at offset 0.
float[] outFData = new float[numElements];
datatype2 = -1;
try {
    datatype2 = H5.H5Tcreate(HDF5Constants.H5T_COMPOUND, 4);

    H5.H5Tinsert(datatype2, "float", 0,
        H5.J2C(HDF5CDataTypes.JH5T_NATIVE_FLOAT));

    status = H5.H5Dread(dataset2,
        datatype2,
        HDF5Constants.H5S_ALL,
        HDF5Constants.H5S_ALL,
        HDF5Constants.H5P_DEFAULT,
        outFData);
} catch (Exception ex) {
    ex.printStackTrace();
}

Last modified: June 26th 2007