A Kind of Magic: Storing Computations in HDF5
Gerd Heber, The HDF Group
The purpose of this introduction is to highlight and celebrate a community contribution, the impact of which we are just beginning to understand. Its principal author, Mr. Lucas C. Villa Real, calls it HDF5-UDF and describes it as “a mechanism to generate HDF5 dataset values on-the-fly using user-defined functions (UDFs).” This matter-of-fact characterization is entirely accurate, but I would like to provide some context for what this means for us users of HDF5.
Google Scholar is an excellent source of HDF5 news. Search for or subscribe to ‘HDF5’, perhaps combined with your favorite other keywords, and you will see a steady flow of citations which currently amounts to about 2,000 articles a year. You are guaranteed to find subjects from science and engineering that you’d never heard of, let alone associated with the use of HDF5. One of the questions that I keep asking myself is, “Why is HDF5 such a good fit for all these different applications and scenarios?”
Humans are model thinkers. The refined models of scientists and engineers postulate certain variables and relationships. The values of these variables are also known as data. (There are no data in the absence of a model.) The HDF5 data model and its implementations offer a toolkit to precisely define decorated groupings of (array, map, …) variables that can be assigned values through a process called I/O. HDF5 containers are snapshots of model instances. That’s why!
When thinking of arrays, the values of array variables, many users probably think of rectilinear, multidimensional arrangements of cells populated with values of a certain type, the element type. It may sound pedantic, but another perspective is that the value of an array variable is the graph of a discrete or discretized function defined on a lattice given by the sites, cells, indexes, or gridpoints (pick your favorite metaphor!). The value A[i, j, k] is the value of the function A evaluated at the argument (i, j, k). On that view, the ground on which HDF5 datasets have hitherto rested gives way, and storage yields to computation or computational storage. The change brought about by Mr. Villa Real’s proposal is seismic indeed.
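The function view of an array can be made concrete with a small sketch. The Python below is purely illustrative and does not use the HDF5-UDF API; the names `StoredArray`, `ComputedArray`, and `f` are hypothetical. Both classes answer the same indexing request, one by looking up a stored value and the other by evaluating a function at the lattice point.

```python
from itertools import product


class StoredArray:
    """An array whose values are kept in memory, keyed by lattice point."""

    def __init__(self, shape, values):
        self.shape = shape
        self._values = dict(values)

    def __getitem__(self, index):
        return self._values[index]


class ComputedArray:
    """An array whose values are produced by evaluating a function on demand."""

    def __init__(self, shape, func):
        self.shape = shape
        self._func = func

    def __getitem__(self, index):
        # Reject arguments that fall outside the lattice.
        in_bounds = len(index) == len(self.shape) and all(
            0 <= i < n for i, n in zip(index, self.shape)
        )
        if not in_bounds:
            raise IndexError(index)
        return self._func(*index)


def f(i, j, k):
    """Hypothetical function whose graph is the array value."""
    return 100 * i + 10 * j + k


shape = (2, 3, 4)
lattice = list(product(*(range(n) for n in shape)))

stored = StoredArray(shape, {p: f(*p) for p in lattice})
computed = ComputedArray(shape, f)

# A reader cannot tell the two apart by indexing alone.
same = all(stored[p] == computed[p] for p in lattice)
print(same)
```

From the reader's side, `stored[i, j, k]` and `computed[i, j, k]` are indistinguishable; only the second needs no storage for its values.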
With the arrival of HDF5-UDF, we now can store functions in HDF5 files. Really?
- Does that mean actual code is stored in the file? Yes.
- Which languages are supported? C/C++, Lua, and Python.
- Do tools continue to work with HDF5-UDF? Yes.
- Can an HDF5-UDF depend on other datasets? Yes.
- Are hardware accelerators supported? Yes.
- Does it run on Windows? Yes. (macOS is also supported.)
Figure 1: HDF5-UDF Overview (The figure was created by Mr. Villa Real and is reproduced here with his kind permission.)
Cynics might argue that this mechanism is also known as a computer virus. Rest assured that HDF5-UDF was created with a security model (trust profiles) from the start.
There are several old and new use cases for HDF5-UDF. Making non-HDF5 data appear as if they were stored in HDF5 is a classic. This can be achieved by other means, but a function is perhaps a better compromise: it has fewer restrictions than, say, external layout, and it is much cheaper to implement than a VOL plugin. It also opens the door to new data sources such as (No-)SQL stores, streams, and service endpoints. Here we leave the realm of mathematical functions, since an HDF5-UDF might return a different value for the same arguments at a different invocation time. What-if analyses and richer dataset transformations are now possible. Finally, since an HDF5-UDF can consume and depend on other HDF5 datasets, we have a crucial building block for constructing and storing entire workflows in HDF5. There are limitations, but new ground has been broken in the land of HDF5.
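The idea that one dataset can depend on another, and that chains of such dependencies form a workflow, can be sketched in a few lines. This is a toy model under stated assumptions, not the HDF5-UDF API: the `Container` class and the dataset names are hypothetical, and a real HDF5 file is replaced by two dictionaries.

```python
class Container:
    """A toy stand-in for a file: names map to stored values or to functions."""

    def __init__(self):
        self._stored = {}
        self._derived = {}

    def store(self, name, values):
        self._stored[name] = list(values)

    def derive(self, name, func):
        # func receives the container so it can read other datasets.
        self._derived[name] = func

    def read(self, name):
        if name in self._stored:
            return self._stored[name]
        # Evaluated at read time, so the result tracks its inputs.
        return self._derived[name](self)


c = Container()
c.store("temperature_c", [0.0, 25.0, 100.0])

# A derived dataset that depends on a stored one.
c.derive(
    "temperature_f",
    lambda box: [9 * t / 5 + 32 for t in box.read("temperature_c")],
)
# A second derived dataset that depends on the first: a two-step workflow.
c.derive("max_f", lambda box: max(box.read("temperature_f")))

first = c.read("max_f")
c.store("temperature_c", [-40.0, 10.0])  # change the input only
second = c.read("max_f")
print(first, second)
```

Because the derived values are computed when they are read, changing the input changes every result downstream of it, which is what makes a what-if analysis cheap in this model.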
These and many other intriguing details, along with performance results, can be found in the paper “User-Defined Functions for HDF5” by Lucas C. Villa Real and Maximilien de Bayser, which was submitted to last year’s HDF User Group meeting (HUG 2021). It is my great privilege to announce its availability and draw your attention to the most current and most complete account of their work on HDF5-UDF to date.