Robert E. McGrath
March 27, 2001
Overview
The XML language itself is character oriented: objects are represented by character markup, i.e., as elements and attributes. One of the major open issues in using XML for scientific applications is the representation of "binary" data, that is, data that is not "character oriented". This includes numbers, arrays of numbers, images, and other objects. There is no default standard for representing such objects in XML. Furthermore, XML is a very general purpose technology, which is used for a wide variety of purposes. Different uses place fundamentally different requirements on the XML representation of the relevant "binary" data.
However, the most important reason for using XML is interoperability in an open system. This goal absolutely requires appropriate standards. For this reason, the issues here are not a matter of "what is the one correct way to handle binary data", but "what are the appropriate standards to employ".
The General Problem
The general problem is how to represent objects that are typically represented on computers by packed binary codes. One important class of such objects is "numbers": numbers are typically represented in memory or storage as coded bit fields, e.g., as unsigned or signed twos-complement integers, or following IEEE floating point standards.
Note, too, that numbers are not usually uninterpreted values. They are often operated on as objects. For example, there is a set of commonly applied methods used with numbers, including comparisons (e.g., equals) and arithmetic. And numbers are often used in complex aggregates: vectors, matrices, and arrays, which might in turn represent images or other objects.
The basic XML standard has no specification for how a "number" should be encoded, nor for standard operations, nor for how to encode aggregates such as arrays. This is not to say that numbers can't be represented: far from it. There are many ways to represent numbers as strings. For instance, suppose that the concept to be expressed is that a range of numbers runs from "1" to "10". Figure 1 shows some of the several ways this might be done in XML.
[Figure 1: four alternative XML markups for a range of numbers from 1 to 10]
It is important to note that all of these variations are "correct", and each has advantages and disadvantages for certain purposes.
For example, consider an indexing program that creates a searchable index of XML documents. In the first example, '<range min="1" max="10"/>', the indexer will extract a single entity with no value. In the second example, the indexer will index the value "1,10". In the third case, the indexer will be able to index 'range.min="1"' and 'range.max="10"'. And in the fourth case, the indexer might know that 'range' is a range of integers, from 1 to 10. The point is that, depending on what the XML is to be used for, different markup may be needed, even for something as simple as a pair of numbers.
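The indexer behaviour described above can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree. The attribute form is the one quoted in the text. The child-element form and the function name index_entries are assumptions made for illustration and are not necessarily the markup shown in Figure 1.

```python
import xml.etree.ElementTree as ET

# Attribute form, as quoted in the text.
ATTRIBUTE_FORM = '<range min="1" max="10"/>'
# Child-element form: a hypothetical rendering of the "range.min" / "range.max" case.
ELEMENT_FORM = '<range><min>1</min><max>10</max></range>'


def index_entries(xml_text):
    """Return (path, text) pairs the way a naive element-text indexer might."""
    root = ET.fromstring(xml_text)
    entries = []

    def walk(element, path):
        # Only element text is indexed; attributes are ignored on purpose.
        text = (element.text or "").strip()
        if text:
            entries.append((path, text))
        for child in element:
            walk(child, path + "." + child.tag)

    walk(root, root.tag)
    return entries


print(index_entries(ATTRIBUTE_FORM))  # [] : one element, no text value
print(index_entries(ELEMENT_FORM))    # [('range.min', '1'), ('range.max', '10')]
```

An indexer that also inspected attributes would recover the values from the first form as well, which is exactly why the choice of markup depends on the tools that will consume it.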
Markup for arrays, etc.
XML documents often need to describe large aggregates of numbers, such as images, which might have thousands of pixel values. In such an application, it would be possible to encode the individual pixels of an image using markup along the lines shown in Figure 1, perhaps with tags for each scan line or for blocks of pixels, or even for every pixel.
This is not commonly done, because the XML would be extremely voluminous, and the markup adds little to the standard binary encoding. On the other hand, such a mark up would make it possible for XML based tools to identify and access every data value, perhaps to index the image by its values, or to correlate XML descriptions of vectors or features with the pixels of the image.
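A rough sense of the trade-off can be had from the following Python sketch, which marks up a small synthetic image one pixel at a time. The element names image, scanline, and pixel and the function pixels_to_markup are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET


def pixels_to_markup(pixels, cols):
    """Mark up a flat list of 8-bit pixel values, one element per pixel."""
    rows = len(pixels) // cols
    image = ET.Element("image", cols=str(cols), rows=str(rows))
    for start in range(0, len(pixels), cols):
        line = ET.SubElement(image, "scanline")
        for value in pixels[start:start + cols]:
            ET.SubElement(line, "pixel").text = str(value)
    return ET.tostring(image, encoding="unicode")


# Synthetic 10 x 20 image with values in the range 0..255.
pixels = [(x * 7) % 256 for x in range(10 * 20)]
markup = pixels_to_markup(pixels, cols=10)

# Size comparison: packed bytes versus characters of markup.
print(len(bytes(pixels)), "bytes packed;", len(markup), "characters as markup")

# The benefit: every value is individually addressable by generic XML tools.
root = ET.fromstring(markup)
bright = [p for p in root.iter("pixel") if int(p.text) > 250]
print(len(bright), "pixels brighter than 250")
```

Each pixel costs at least sixteen characters of markup against one byte in the packed form, so the markup is more than ten times larger, but a generic XML tool can now select pixels by value.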
Including binary data in XML
Another possible approach is to include the 'binary' data directly in the XML document as a 'blob'. The XML tag would indicate the format of the data, and the content of the element would be the bytes themselves.
<SomeData format="JPEG" cols="10" rows="20">
asdlfjaslfdasfjasdfasdfasdfdsflasdlf as;lfjaslfjasljfsalfsdfj
</SomeData>
It is not easy to simply embed the binary data in XML, however, because the values need to be escaped. That is, if a pixel happens to have the numeric value of the Unicode character '<', the XML will break. For this reason, the blob will need to be 'escaped' or encoded in some way to ensure that it does not contain problematic values.
This approach keeps the data in the XML in a compact form, but XML tools cannot "understand" it. For example, indexing software would not be able to extract individual pixel values.
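One common way to encode such a blob is Base64, which maps arbitrary bytes onto a small alphabet of characters that are safe in XML content. The Python sketch below assumes Base64 as the encoding. The encoding attribute and the function names embed_blob and extract_blob are hypothetical, and the text does not prescribe a particular encoding.

```python
import base64
import xml.etree.ElementTree as ET


def embed_blob(data, fmt):
    """Wrap raw bytes in a <SomeData> element as Base64 text."""
    # The "encoding" attribute is an assumption, not part of any standard.
    element = ET.Element("SomeData", format=fmt, encoding="base64")
    element.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(element, encoding="unicode")


def extract_blob(xml_text):
    """Recover the original bytes from the element produced by embed_blob."""
    element = ET.fromstring(xml_text)
    return base64.b64decode(element.text or "")


# Bytes that would break naive embedding: '<', '&', a NUL byte, and 0xFF.
raw = bytes([0x3C, 0x26, 0x00, 0xFF])
xml_text = embed_blob(raw, "JPEG")
print(xml_text)
print(extract_blob(xml_text) == raw)  # True
```

Base64 output is about one third larger than the raw bytes, and the content remains opaque to XML tools, as noted above.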
Referring to data in an external file
A third approach is to 'point' to the data, e.g., with an <xlink> element. In this case, the XML describes the binary data: where it is (e.g., a URL) and how to access it (e.g., a MIME type). The XML does not encode the data itself. Figure 2 shows an example of this:
[Figure 2: a <DataObject> element that points to external data]
This approach is ideal if there is good reason for the "real data" to stay in one place, for example if:
- the data requires special access mechanisms
- the data is proprietary
- the data is too large to replicate or transport
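A consumer of such a description might read the location and type without ever touching the data. In the Python sketch below, the DataObject element name comes from the text, and the XLink namespace URI is the one defined by the W3C. The mimetype, cols, and rows attributes, the example URL, and the function describe are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical descriptor: the XML says where the data is and what it is.
DESCRIPTOR = """
<DataObject xmlns:xlink="http://www.w3.org/1999/xlink"
            xlink:href="http://example.org/data/image001.jpg"
            mimetype="image/jpeg" cols="10" rows="20"/>
"""


def describe(xml_text):
    """Extract the location, MIME type, and shape from a descriptor."""
    element = ET.fromstring(xml_text.strip())
    return {
        # ElementTree addresses namespaced attributes as "{namespace}name".
        "location": element.get("{%s}href" % XLINK_NS),
        "mimetype": element.get("mimetype"),
        "shape": (int(element.get("rows")), int(element.get("cols"))),
    }


info = describe(DESCRIPTOR)
print(info["location"], info["mimetype"], info["shape"])
```

The application can then decide whether, when, and how to fetch the bytes, which suits data that is access-controlled or too large to move.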
DTD, Schema, and RDF Schema
There are several models for defining constraints on marked up documents, and validating instances against those constraints. These include Document Type Definitions (DTDs), XML-Schema, and RDF-Schema. These standards provide different capabilities which can be used to deal with binary data in different ways.
Of the three, a DTD is the simplest mechanism, with the least built-in capability. DTDs have no typing, and cannot express sub-class relationships. Also, DTDs have no standard for representing numbers or binary encoded data.
XML-Schema has far more capabilities, including sub-classing, and standard markup for numbers. An XML schema can mark up binary data, e.g., the pixels of a raster image. Standard numeric types are supported, and non-standard types can be derived. Also, XML schema supports inclusion of multiple schemas, so a single XML 'document' may import specific markup schemes for different data objects.
Resource Description Framework (RDF) provides some of the features of XML schema, but also has a model of semantic relationships. RDF can express almost any entity relationship model in a consistent and machine readable format. In principle, RDF descriptions can be used to 'reason' about the relationships of objects. RDF does not have a specification for numbers or binary objects.
All of these mechanisms can support 'pointers' to external objects. None of them specifies a standard for including character encoded data. In either case, whether the data is included or referenced, the format of the data must be specified in the XML. This specification will indicate the correct interpretation of the data (i.e., what program to call) along with any other necessary parameters. For maximum interoperability, the XML specification should be standardized, along the lines of MIME types. There can be a variety of these standards, but they should be published for communities and implementors to share.
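The idea that a format specification selects "what program to call" can be sketched as a lookup table keyed on the declared format. In this Python sketch, the format names, the decoder functions, and the interpret function are all hypothetical. A real system would use identifiers published by the relevant community.

```python
import xml.etree.ElementTree as ET


def decode_csv(text, params):
    """Interpret the content as comma-separated integers."""
    return [int(v) for v in text.split(",")]


def decode_hex(text, params):
    """Interpret the content as hexadecimal-encoded bytes."""
    return list(bytes.fromhex(text))


# Hypothetical registry mapping a declared format to the code that reads it.
DECODERS = {
    "text/csv": decode_csv,
    "application/x-hex-bytes": decode_hex,  # made-up type for this sketch
}


def interpret(xml_text):
    """Decode an element's content according to its 'format' attribute."""
    element = ET.fromstring(xml_text)
    fmt = element.get("format")
    if fmt not in DECODERS:
        raise ValueError("no decoder registered for format %r" % fmt)
    # Any remaining attributes are passed along as extra parameters.
    params = {k: v for k, v in element.attrib.items() if k != "format"}
    return DECODERS[fmt](element.text or "", params)


print(interpret('<SomeData format="text/csv">1,10</SomeData>'))                 # [1, 10]
print(interpret('<SomeData format="application/x-hex-bytes">010a</SomeData>'))  # [1, 10]
```

Two parties interoperate only if they agree on the keys of such a table, which is why the format identifiers need to be published and shared.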
Last modified: June 26th, 2007
