HDF (Hierarchical Data Format) technologies are relevant when the data challenges being faced push the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats. Leveraging the powerful HDF products and the expertise of The HDF Group, organizations realize substantial cost savings while solving challenges that seemed intractable using other data management technologies.
Many HDF adopters have very large datasets, very fast access requirements, or very complex datasets. Others turn to HDF because it allows them to easily share data across a wide variety of computational platforms using applications written in different programming languages. Some use HDF to take advantage of the many open-source and commercial tools that understand HDF.
Similar to XML documents, HDF files are self-describing and allow users to specify complex data relationships and dependencies. In contrast to XML documents, HDF files can contain binary data (in many representations) and allow direct access to parts of the file without first parsing the entire contents.
HDF, not surprisingly, allows hierarchical data objects to be expressed in a very natural manner, in contrast to the tables of a relational database. Whereas relational databases support tables, HDF supports n-dimensional datasets, and each element in the dataset may itself be a complex object. Relational databases offer excellent support for queries based on field matching, but are not well-suited for sequentially processing all records in the database or for subsetting the data based on coordinate-style lookup.
In-house data formats are often developed by individuals or teams to meet the specific needs of their project. While the initial time to develop and deploy such a solution may be quite low, the results are often not portable, not extensible, and not high-performance. In many cases, the time devoted to extending and maintaining the data management portion of the code consumes an increasingly large percentage of the total development effort, in effect reducing the time available for the primary objectives of the project. HDF offers a flexible format and powerful API backed by over 20 years of development history. Projects can leverage HDF's capabilities and still define their own data objects and a project-specific API on top of those objects.
HDF is open-source and the software is distributed at no cost, so potential users can evaluate HDF without any financial investment. Projects that adopt HDF are assured that the technology they rely on to manage their data does not depend on a proprietary format or binary-only software whose price a vendor might dramatically increase, or whose support a vendor might drop altogether.
The HDF Group is committed to supporting and improving HDF technologies, and to ensuring the long-term accessibility of HDF-stored data. The HDF Group offers training, consulting, and customized support and development services to help users optimally apply the HDF technologies to their particular data challenges.
Detailed descriptions of some of the data challenges facing HDF adopters are given below.
Data is large
In one experiment a 53,000 x 53,000 image from a DNA sequencing analysis was stored in HDF5 and viewed. In another, using HDF5's external storage capability, a composite of 900 files from a seismic simulation was organized in HDF5 to create a terabyte-sized dataset, permitting fast subsetting. An aerospace company will archive all instrument data from a test flight, creating an HDF5 file of nearly a terabyte per test flight.
The HDF5 format and software include features specifically designed to store and access large datasets. There is no theoretical limit to the size of datasets that can be stored in an HDF5 file. HDF5 includes storage options, such as chunking, compression, and external object storage that mitigate many problems associated with large datasets. The flexible structure of HDF5 files makes it possible for applications to create composite structures, such as tables and indexes, that can provide fast random access to very large datasets. The HDF5 library includes sophisticated subsetting operations that offer very fast access to portions of very large datasets. HDF5 supports certain kinds of parallel I/O, making it possible to read and write data at very high speeds.
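A minimal sketch of the storage options mentioned above, using the h5py Python bindings (one of several HDF5 APIs); the file name, dataset name, and sizes are illustrative, not from the text:

```python
import numpy as np
import h5py

with h5py.File("large_demo.h5", "w") as f:
    # Chunked layout: the 2-D array is stored as 256x256 tiles, so a
    # partial read touches only the tiles that overlap the request.
    dset = f.create_dataset(
        "image",
        shape=(4096, 4096),
        dtype="f4",
        chunks=(256, 256),
        compression="gzip",   # each chunk is compressed independently
        compression_opts=4,
    )
    # Write one tile; untouched chunks are never allocated on disk.
    dset[0:256, 0:256] = np.arange(256 * 256, dtype="f4").reshape(256, 256)

with h5py.File("large_demo.h5", "r") as f:
    # Reads and decompresses only the chunks covering this slice.
    tile = f["image"][0:256, 0:256]
```

Because chunking decouples the logical array from its on-disk layout, the same access pattern scales to arrays far larger than memory.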
Data is complex
The Earth Observing System's HDF-EOS format, based on HDF4 and HDF5, can store swaths, grids, in-situ data, instrument metadata, and browse images in a single file, making it possible to capture the entire collection of information about a day's mission.
The grouping structure in HDF5 enables applications to organize data objects in HDF5 to reflect complex relationships among objects. The rich collection of HDF5 datatypes, including datatypes that can point to data in other objects, and including the ability for users to define their own types, lets applications build sophisticated structures that match well with complex data. The HDF5 library has a correspondingly rich set of operations that enables applications to access just those components that are important.
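The grouping and datatype features above can be sketched with h5py; the group hierarchy and compound field names here are invented for illustration:

```python
import numpy as np
import h5py

# A user-defined (compound) datatype: each dataset element is a record.
particle = np.dtype([("id", "i4"), ("x", "f8"), ("y", "f8")])

with h5py.File("complex_demo.h5", "w") as f:
    # Nested groups organize objects to reflect relationships in the data.
    run = f.create_group("experiment/run_001")
    data = np.array([(1, 0.0, 0.5), (2, 1.0, 1.5)], dtype=particle)
    run.create_dataset("particles", data=data)

with h5py.File("complex_demo.h5", "r") as f:
    # Access just the component of interest: read only the "x" field.
    xs = f["experiment/run_001/particles"]["x"]
```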
Data is heterogeneous
A government test center has more than 800,000 HDF5 files, each corresponding to a single test, where all of the information for the test is stored, including the measurements from every on-board instrument, GIS information about the location of the vehicle, browse image plots of the instrument data, and all metadata about the test run.
Because one can mix and match any kind of data within an HDF5 file, applications are able to create coherent repositories of highly heterogeneous collections. One can mix tables, images, small metadata, streams of data from instruments, and structured grids all in the same file. New objects can easily be added to an existing HDF5 file, even though the file was not originally created to handle the new objects.
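A sketch of such a mixed repository in h5py; all object names and values below are made up for the example, loosely echoing the test-center scenario:

```python
import numpy as np
import h5py

with h5py.File("hetero_demo.h5", "w") as f:
    # A browse image, an instrument table, and small metadata, in one file.
    f.create_dataset("browse/image", data=np.zeros((64, 64), dtype="u1"))
    adc = np.dtype([("t", "f8"), ("volts", "f4")])
    f.create_dataset("instruments/adc0", data=np.zeros(10, dtype=adc))
    f.attrs["test_id"] = "run-0042"              # small metadata as attributes
    f.attrs["vehicle_lat_lon"] = (34.05, -118.25)

# New objects can be added later, even though the file was not
# originally created with them in mind:
with h5py.File("hetero_demo.h5", "a") as f:
    f.create_dataset("notes", data=np.bytes_("post-test annotation"))
```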
Data is esoteric
HDF5 itself attaches no special meaning to data stored in datasets or attributes. An application can thus store any kind of data simply as a stream of bytes in a dataset or attribute. Using other features in HDF5, an application can provide metadata, indexes, and other information structures that give full meaning to the data. Storing the data in HDF5 then lets applications take advantage of other HDF5 features, such as performance features (fast I/O, compression) and access features (partial I/O).
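A sketch of this pattern in h5py, storing an opaque byte stream whose meaning comes entirely from application-level attributes; the format name in the attribute is hypothetical:

```python
import numpy as np
import h5py

payload = bytes(range(16))   # stand-in for an application-specific blob

with h5py.File("opaque_demo.h5", "w") as f:
    d = f.create_dataset(
        "blob",
        data=np.frombuffer(payload, dtype="u1"),
        compression="gzip",  # generic HDF5 features still apply to raw bytes
    )
    # Attributes supply the meaning HDF5 itself does not impose.
    d.attrs["encoding"] = "acme-telemetry-v3"  # hypothetical format name
    d.attrs["byte_order"] = "little"

with h5py.File("opaque_demo.h5", "r") as f:
    raw = f["blob"][:].tobytes()
```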
Access needs include parallel I/O
The HDF5 format and library are designed to support parallel I/O.
Access needs include random access
An application called PyTables uses HDF5 to store tables with millions of rows, then uses HDF5's subsetting capability to extract records that meet certain criteria, all in real time.
When data is read from or written to a dataset in HDF5, an application can specify the subset of elements within the dataset's array that are to be read or written. The subset can be an n-dimensional rectangular section, a set of points, or a combination of many rectangles and sets of points. A stride can be specified to take samples (e.g. every 10th row and column), and sequences of blocks can be accessed.
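These selection styles can be sketched in h5py (the dataset name and values are illustrative); h5py exposes rectangular and strided selections directly through array-style indexing:

```python
import numpy as np
import h5py

with h5py.File("subset_demo.h5", "w") as f:
    f.create_dataset("grid", data=np.arange(100).reshape(10, 10))

with h5py.File("subset_demo.h5", "r") as f:
    grid = f["grid"]
    block = grid[2:4, 3:6]      # rectangular hyperslab (2 rows x 3 cols)
    sampled = grid[::5, ::5]    # strided sample: every 5th row and column
    # Individual coordinate lookups (one element per request):
    corners = np.array([grid[0, 0], grid[9, 9]])
```

Only the selected elements cross the I/O boundary, which is what makes these operations fast on very large datasets.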
The HDF5 "reference" datatype lets applications store information about regions of interest in a dataset, then use that information when performing partial I/O operations.
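A sketch of region references in h5py: a reference to a region of interest is itself stored in the file, then dereferenced later to drive a partial read (object names are illustrative):

```python
import numpy as np
import h5py

with h5py.File("regref_demo.h5", "w") as f:
    dset = f.create_dataset("scan", data=np.arange(36).reshape(6, 6))
    roi = dset.regionref[1:3, 1:3]   # reference to a 2x2 region of interest
    # Persist the reference itself, using h5py's region-reference dtype.
    refs = f.create_dataset("regions", (1,), dtype=h5py.regionref_dtype)
    refs[0] = roi

with h5py.File("regref_demo.h5", "r") as f:
    ref = f["regions"][0]
    # f[ref] dereferences to the target dataset; indexing the dataset
    # with the same reference selects just the stored region.
    roi_data = f[ref][ref]
```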
Chunking helps optimize access to very large arrays when partial access is desired: only the chunks that contain the elements of interest are accessed. Even when datasets are compressed, only those chunks need to be read, uncompressed, or written.
The grouping structure in HDF5 enables applications to organize data in ways that make it easy to find and access those components that are of interest, ignoring the others.
Last modified: 22 June 2011