C++ has come a long way and there’s plenty in it for users of HDF5

Steven Varga

A few years ago, I was looking for a data format with low-latency block and stream support. While protocol buffers offered streams, they lacked indexed block access. Soon, I realized I was looking for a container with file-system-like properties. When I examined HDF5, I found it was very close to what I needed to store massive financial engineering datasets. In 2011, HDF5 had good support for full and partial read and write operations on high-dimensional extendable datasets with optional compression. Also, scientific platforms such as Python, Julia, R, and MATLAB supported HDF5 and, most importantly, it worked across operating systems and processor architectures.

Machine learning/data science is an emerging field where data storage is a necessary part, but not the main attraction. Data science requires a general data store with fast block and sequential access, capable of storing the observations used for model building. HDF5 provides the basic building blocks for the role, but there is a gap between what data science needs and what HDF5 provides out of the box.

Researchers working directly with popular linear algebra libraries, the STL, or time series can benefit from the H5CPP template library’s CRUD-like, low-latency operations. Engineers who need a fast storage solution for arbitrarily complex POD struct types, often already available in C/C++ header files, benefit from the H5CPP Clang-based compiler technology.

The current HDF5 C++ approach views C++ as a different language from C and reproduces the C-API calls, adding only marginal value. Also, the existing C/C++ library lacks high-performance packet writing, seamless transformation of POD structures to HDF5 compound types, and support for popular matrix algebra libraries and STL containers. In fact, HDF5 C++ doesn’t consider C++ templates at all, whereas modern C++ is largely about templates and template meta-programming.
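As a rough illustration of what templates can add over a one-to-one wrapper of the C calls, the sketch below maps C++ element types to the names of HDF5’s predefined native datatypes at compile time, so a single function template covers any supported std::vector<T>. The names native_type, write_plan and plan_write are hypothetical and are not part of HDF5 or H5CPP; the sketch only computes what a generic write would need to know and performs no I/O.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Primary template is left undefined on purpose:
// an unsupported element type fails at compile time, not at run time.
template <class T> struct native_type;

// Each specialization names the HDF5 predefined native datatype for a C++ type.
template <> struct native_type<int>    { static constexpr const char* name = "H5T_NATIVE_INT"; };
template <> struct native_type<float>  { static constexpr const char* name = "H5T_NATIVE_FLOAT"; };
template <> struct native_type<double> { static constexpr const char* name = "H5T_NATIVE_DOUBLE"; };

// Hypothetical record of what a generic write call would have to pass down.
struct write_plan {
    std::string type;    // name of the HDF5 native datatype
    std::size_t count;   // number of elements
    std::size_t bytes;   // size of the memory buffer
};

// One template handles every element type that has a native_type specialization.
template <class T>
write_plan plan_write(const std::vector<T>& data) {
    return {native_type<T>::name, data.size(), data.size() * sizeof(T)};
}
```

Calling plan_write on a std::vector<double> selects H5T_NATIVE_DOUBLE without the caller naming a datatype, while a std::vector of an unmapped type does not compile.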

The original design criterion for H5CPP was to implement an intuitive, easy-to-use template-based library that supports most major linear algebra systems, with create, read, write and append operations. This work may be freely downloaded from the h5cpp11 page. For the past few months, however, in cooperation with Gerd Heber of The HDF Group, I’ve been engaged in the design and implementation of a new, unique interface. It builds on Gerd’s idea of having something Python-ically flexible, but instead of a dictionary-based named argument passing mechanism, I proposed a sexy EBNF grammar implemented in C++ template meta-programming. This C++ API allows you to start coding without any knowledge of the HDF5 API, yet it provides ample room for the details when you need them.
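One way to get named-argument flexibility without a dictionary is to give each option its own type and let a variadic template pick the options out by type, in any order. The sketch below assumes hypothetical property types chunk, gzip and max_dims and a hypothetical describe function; it is not the H5CPP grammar itself, only a minimal C++17 illustration of the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical property tags; illustrative names, not the H5CPP API.
struct chunk    { std::vector<std::size_t> dims; };
struct gzip     { unsigned level; };
struct max_dims { std::vector<std::size_t> dims; };

// Plain record of what would be created.
struct dataset_spec {
    std::string path;
    std::vector<std::size_t> chunk_dims;  // empty: no chunking requested
    unsigned gzip_level = 0;              // 0: no compression
    std::vector<std::size_t> max_extent;  // empty: fixed size
};

// True if T is one of Candidates...
template <class T, class... Candidates>
constexpr bool has_v = (std::is_same_v<T, Candidates> || ...);

// Return the tuple element of type T, or the fallback when T is absent.
template <class T, class... Args>
T get_or(T fallback, const std::tuple<Args...>& args) {
    if constexpr (has_v<T, Args...>)
        return std::get<T>(args);
    else
        return fallback;
}

// Optional properties may follow the path in any order.
template <class... Args>
dataset_spec describe(const std::string& path, Args&&... args) {
    static_assert((has_v<std::decay_t<Args>, chunk, gzip, max_dims> && ...),
                  "unknown property type");
    auto pack = std::make_tuple(std::forward<Args>(args)...);
    dataset_spec spec;
    spec.path       = path;
    spec.chunk_dims = get_or(chunk{}, pack).dims;
    spec.gzip_level = get_or(gzip{0}, pack).level;
    spec.max_extent = get_or(max_dims{}, pack).dims;
    assert(spec.gzip_level <= 9);  // gzip levels run from 0 to 9
    return spec;
}
```

With these definitions, describe("/prices", gzip{9}, chunk{{64, 64}}) and describe("/prices", chunk{{64, 64}}, gzip{9}) yield the same specification. Passing the same property twice is rejected at compile time, because std::get<T> requires the type to appear exactly once in the tuple.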

The type system is hidden behind templates, and I/O calls will do the right thing for most users, but you can “open the hood” and take control at any time. In addition to templates, an optional Clang-based compiler scans your project source files, detects all C/C++ POD structures referenced by H5CPP calls, and then, from the topologically sorted nodes, produces the HDF5 compound type transformations. The HDF5 DDL (Data Description Language) is required for operations with HDF5 compound datatypes, but writing it the old-fashioned way can be tedious and error-prone in a large, complex project. DDL to source code transformation has been around for decades: protocol buffers and Apache Thrift are good examples. The H5CPP compiler does the exact opposite: it takes arbitrary C/C++ source code and produces an HDF5 compound type DDL expression. This mechanism works with arbitrarily deep array types and POD struct types.
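The ordering step mentioned above can be sketched in isolation. A nested POD struct has to be described before any struct that embeds it, which amounts to a topological sort of the dependency graph. The example assumes a hypothetical in-memory table of struct names and fields (field, struct_table and definition_order are illustrative names); the actual H5CPP compiler is Clang-based and its internals are not reproduced here.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// One member of a POD struct: its name and the name of its type.
struct field { std::string name; std::string type; };

// Hypothetical model of the structs collected from the source files.
using struct_table = std::map<std::string, std::vector<field>>;

// Depth-first, post-order visit: a struct is emitted after everything it embeds.
void visit(const std::string& name, const struct_table& structs,
           std::set<std::string>& done, std::set<std::string>& active,
           std::vector<std::string>& order) {
    if (done.count(name)) return;         // already emitted
    auto it = structs.find(name);
    if (it == structs.end()) return;      // primitive such as "double": nothing to define
    assert(!active.count(name) && "a POD struct cannot contain itself by value");
    active.insert(name);
    for (const field& f : it->second)
        visit(f.type, structs, done, active, order);
    active.erase(name);
    done.insert(name);
    order.push_back(name);
}

// Names of all structs, dependencies first.
std::vector<std::string> definition_order(const struct_table& structs) {
    std::set<std::string> done, active;
    std::vector<std::string> order;
    for (const auto& entry : structs)
        visit(entry.first, structs, done, active, order);
    return order;
}
```

A generator that walks the returned order can describe each compound type using only types it has already described, which is why the sort comes before any type description is produced.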

The implementation is rounded off by additional design considerations:

In conclusion, H5CPP aims to solve real-life problems that you, as a software writer, engineer, scientist or project manager, encounter and find worth solving. Try the live demo and download the project.

Steven Varga is an independent researcher in machine learning and computational finance, providing convex approximations for combinatorial problems, modeling sequential, categorical data, and writing software for high performance computing in C++, Julia, Python and R.
