Technical Note: Escape Characters for XML/HDF5

Robert E. McGrath
November 22, 2000

Basic Principle and Goal:

HDF5 names and strings may contain almost any ASCII character. XML reserves certain characters and defines standard escape sequences for them, the 'predefined entities' (for example, &lt; for <). (Technically XML uses Unicode, but this has no practical impact here.) In addition, the H5gen classes parse the contents of data blocks using Java's 'StringTokenizer' class and our own logic.

When outputting a description of an HDF5 file (e.g., from H5dump or the H5view XMLWriter), it is necessary to replace certain HDF5 characters with escape sequences that will be parsed correctly by both XML and the H5gen classes.

When inputting, the H5gen classes must handle 'parsed character' values from the XML parser, and also unparsed data (e.g., the <DataFromFile> contents), which must be further processed by the H5gen classes.

One important constraint is that an HDF5 object reference (which appears in unparsed data) must exactly match the object name of its target (which is in parsed data). The rules below ensure that the resulting strings in memory will be identical in this case.

The Two Cases

There are two cases: data that is in an XML 'parsed character' object, and data that is 'unparsed character' data. The former are items within tags, such as XML attributes; the latter are the contents of CDATA blocks, such as the values of HDF5 data. For parsed character data, the XML parser does all of the parsing and resolves every escape sequence. For unparsed data, XML does only part of this work, and the H5gen classes must do the rest.

Example 1:

<Dataset Name="bob" OBJ-XID="bob1" Parents="/">


The quoted strings in this element are all parsed entities, and any escaped characters are completely handled by XML.

Example 2:

<Attribute Name="attr5">
<Dataspace>
<ScalarDataspace />
</Dataspace>
<DataType>
<AtomicType>
<StringType Cset="H5T_CSET_ASCII" StrSize="17" StrPad="H5T_STR_NULLTERM"/>
</AtomicType>
</DataType>
<Data>
<DataFromFile>
"string attribute"
</DataFromFile>
</Data>
</Attribute>


The content of the <DataFromFile> element is CDATA and must be parsed by the H5gen classes.

The Escape Characters

When writing parsed data to XML:

Character in HDF5    Output in XML
-----------------    ------------------
quote (")            &quot;
apostrophe (')       &apos;
ampersand (&)        &amp;
less than (<)        &lt;
greater than (>)     &gt;
backslash (\)        No escape required
space                No escape required

When writing unparsed data to XML:

Character in HDF5    Output in XML
-----------------    ------------------
quote (")            \"
apostrophe (')       &apos;
ampersand (&)        &amp;
less than (<)        &lt;
greater than (>)     &gt;
backslash (\)        \\
space                No escape required
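
The two sets of writing rules above differ only in how the quote and the backslash are treated, so they can be condensed into a single table-driven routine. The following is a minimal sketch under that reading, not the h5dump.c code given in the Appendix; the names replacement() and escape_for_xml() are hypothetical and were chosen for illustration only. Unlike the Appendix routines, this sketch always returns newly allocated memory, which is an assumption made to keep ownership simple.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the replacement text for character c, or NULL if c is copied
 * unchanged. unparsed != 0 selects the rules for unparsed data
 * (e.g., <DataFromFile> contents); 0 selects the rules for parsed data. */
static const char *replacement(char c, int unparsed)
{
    switch (c) {
    case '"':  return unparsed ? "\\\"" : "&quot;";
    case '\\': return unparsed ? "\\\\" : NULL;
    case '\'': return "&apos;";
    case '&':  return "&amp;";
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    default:   return NULL;
    }
}

/* Hypothetical helper: returns a newly allocated escaped copy of str,
 * or NULL if allocation fails. The caller always frees the result. */
char *escape_for_xml(const char *str, int unparsed)
{
    size_t need = 1;                 /* room for the terminating NUL */
    const char *cp;
    char *out, *op;

    assert(str != NULL);

    /* First pass: compute the size of the escaped string. */
    for (cp = str; *cp != '\0'; cp++) {
        const char *rep = replacement(*cp, unparsed);
        need += rep ? strlen(rep) : 1;
    }

    out = op = malloc(need);
    if (out == NULL)
        return NULL;

    /* Second pass: copy, substituting escape sequences where required. */
    for (cp = str; *cp != '\0'; cp++) {
        const char *rep = replacement(*cp, unparsed);
        if (rep != NULL) {
            size_t n = strlen(rep);
            memcpy(op, rep, n);
            op += n;
        } else {
            *op++ = *cp;
        }
    }
    *op = '\0';
    return out;
}
```

For the seven-character HDF5 name a"b\c<d, escape_for_xml(name, 0) produces a&quot;b\c&lt;d (parsed data), and escape_for_xml(name, 1) produces a\"b\\c&lt;d (unparsed data).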

When reading parsed data from XML:

No additional parsing is needed.

Note, though, that reserved characters are treated as the end of a StringTokenizer stream (this is undocumented behavior). The parser must parse past these breaks. The H5gDataset::parseTextData() and H5gen 'characters()' method handle this case correctly.

When reading unparsed data from XML:

Strings and references must be parsed. They have quotes around them ("), so any quote that is part of the string must be escaped (\"). Consequently, the escape character "\" must also be escaped. The H5gDataset::parseTextData() method correctly skips the escape character, as well as handling white space between strings.
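
The reading step described above can be sketched in C as follows. This is an illustration only, not the H5gDataset::parseTextData() implementation (which is Java and is not reproduced in this note); the function name read_quoted_string() and its buffer-based interface are assumptions. The sketch also assumes that the XML parser has already replaced entity references such as &lt; and &apos; with the characters they stand for, so that only the quotes and backslash escapes remain to be handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: read one quoted string from 'in' into 'out'
 * (capacity 'outsz' bytes, including the terminating NUL).
 * Leading white space is skipped; \" and \\ are reduced to " and \.
 * Returns a pointer just past the closing quote, or NULL on malformed
 * input (no opening quote, no closing quote) or if 'out' is too small.
 */
const char *read_quoted_string(const char *in, char *out, size_t outsz)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (isspace((unsigned char)*in))   /* white space between strings */
        in++;
    if (*in != '"')
        return NULL;
    in++;                                 /* skip the opening quote */

    while (*in != '\0' && *in != '"') {
        if (*in == '\\' && in[1] != '\0')
            in++;                         /* drop the escape, keep next char */
        if (n + 1 >= outsz)
            return NULL;                  /* no room for char plus NUL */
        out[n++] = *in++;
    }
    if (*in != '"' || outsz == 0)
        return NULL;                      /* unterminated string */
    out[n] = '\0';
    return in + 1;                        /* continue after closing quote */
}
```

Calling the function repeatedly, each time passing the pointer returned by the previous call, walks through a white-space-separated list of strings. Given the text "a\"b\\c", the buffer receives a"b\c, which is the same in-memory string that the XML parser yields for the parsed form a&quot;b\c, so a reference and the name of its target compare equal.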

Appendix:

Attached below are two C routines from the h5dump.c program. These add the escape characters for the two cases of writing to XML.


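#include <stdlib.h>
#include <string.h>

/* Entity strings used by both routines. Their definitions are not part
 * of this excerpt; the values below are assumed from the way the
 * routines use them. */
static const char *quote = "&quot;";
static const char *amp   = "&amp;";
static const char *lt    = "&lt;";
static const char *gt    = "&gt;";
static const char *apos  = "&apos;";

/* for parsed data, e.g., the name of datasets or groups */
/* Returns str itself when nothing needs escaping, otherwise a newly
 * malloc'ed string; the caller must free only in the second case. */
char *
xml_escape_the_name(const char * str)
{
    int    extra;
    int    len;
    int    i;
    char * cp;
    char * ncp;
    char * rcp;

    if (str == NULL) return (char *)str;

    /* first pass: count the extra bytes the escape sequences need */
    cp = (char *)str;
    len = strlen(str);
    extra = 0;
    for (i = 0; i < len; i++) {
        if (*cp == '\"') {
            extra += (strlen(quote) - 1);
        } else if (*cp == '\'') {
            extra += (strlen(apos) - 1);
        } else if (*cp == '<') {
            extra += (strlen(lt) - 1);
        } else if (*cp == '>') {
            extra += (strlen(gt) - 1);
        } else if (*cp == '&') {
            extra += (strlen(amp) - 1);
        }
        cp++;
    }

    if (extra == 0) {
        return (char *)str;
    } else {
        /* second pass: copy, replacing reserved characters */
        cp = (char *)str;
        rcp = ncp = malloc(len + extra + 1);
        if (ncp == NULL) return NULL; /* ?? */
        for (i = 0; i < len; i++) {
            if (*cp == '\'') {
                strncpy(ncp, apos, strlen(apos));
                ncp += strlen(apos);
                cp++;
            } else if (*cp == '<') {
                strncpy(ncp, lt, strlen(lt));
                ncp += strlen(lt);
                cp++;
            } else if (*cp == '>') {
                strncpy(ncp, gt, strlen(gt));
                ncp += strlen(gt);
                cp++;
            } else if (*cp == '\"') {
                strncpy(ncp, quote, strlen(quote));
                ncp += strlen(quote);
                cp++;
            } else if (*cp == '&') {
                strncpy(ncp, amp, strlen(amp));
                ncp += strlen(amp);
                cp++;
            } else {
                *ncp++ = *cp++;
            }
        }
        *ncp = '\0';
        return rcp;
    }
}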

/* for unparsed data, e.g., String data */
/* Returns str itself when nothing needs escaping, otherwise a newly
 * malloc'ed string; the caller must free only in the second case. */
char *
xml_escape_the_string(const char * str)
{
    int    extra;
    int    len;
    int    i;
    char * cp;
    char * ncp;
    char * rcp;

    if (str == NULL) return (char *)str;

    /* first pass: count the extra bytes the escape sequences need */
    cp = (char *)str;
    len = strlen(str);
    extra = 0;
    for (i = 0; i < len; i++) {
        if (*cp == '\\') {
            extra++;
        } else if (*cp == '\"') {
            extra++;
        } else if (*cp == '\'') {
            extra += (strlen(apos) - 1);
        } else if (*cp == '<') {
            extra += (strlen(lt) - 1);
        } else if (*cp == '>') {
            extra += (strlen(gt) - 1);
        } else if (*cp == '&') {
            extra += (strlen(amp) - 1);
        }
        cp++;
    }

    if (extra == 0) {
        return (char *)str;
    } else {
        /* second pass: copy, escaping reserved characters */
        cp = (char *)str;
        rcp = ncp = malloc(len + extra + 1);
        if (ncp == NULL) return NULL; /* ?? */
        for (i = 0; i < len; i++) {
            if (*cp == '\\') {
                /* backslash becomes \\ */
                *ncp++ = '\\';
                *ncp++ = *cp++;
            } else if (*cp == '\"') {
                /* quote becomes \" */
                *ncp++ = '\\';
                *ncp++ = *cp++;
            } else if (*cp == '\'') {
                strncpy(ncp, apos, strlen(apos));
                ncp += strlen(apos);
                cp++;
            } else if (*cp == '<') {
                strncpy(ncp, lt, strlen(lt));
                ncp += strlen(lt);
                cp++;
            } else if (*cp == '>') {
                strncpy(ncp, gt, strlen(gt));
                ncp += strlen(gt);
                cp++;
            } else if (*cp == '&') {
                strncpy(ncp, amp, strlen(amp));
                ncp += strlen(amp);
                cp++;
            } else {
                *ncp++ = *cp++;
            }
        }
        *ncp = '\0';
        return rcp;
    }
}