Elena Pourmal, John Mainzer, Jordan Henderson, Richard Warren, Allen Byrne, and Vailin Choi
The HDF Group
Introduction
On March 30th, we announced the release of HDF5 version 1.10.2. With this release, we accomplished all the tasks planned for the major HDF5 1.10 series. It is time for applications to migrate (or to start their migration) from HDF5 1.8 to the new major release, as we will be dropping support for HDF5 1.8 in Summer 2019.
HDF5 1.10 is now in maintenance mode, meaning that no new features will be added in future 1.10 releases. Only bug fixes, performance improvements, and new APIs requested by our customers or contributed by our community members will be provided.
With this release, users can
- create files that are HDF5 1.8 compatible,
- get better performance with HDF5 parallel applications, including writing compressed data by multiple processes, and
- read and write more than 2GB of data in a single I/O operation.
For a full description of the changes, see the RELEASE.txt file.
In this blog, we will focus only on the major new features and bug fixes in HDF5 1.10.2. Hopefully, after reading about those, you will be convinced it is time to upgrade to HDF5 1.10.2.
Forward compatibility for HDF5 1.8-based applications accessing files created by HDF5 1.10.2-based applications
In HDF5 1.8.0, we introduced a new function called H5Pset_libver_bounds. The function takes two arguments:
- “low” – the earliest version of the library that will be used for writing objects, and
- “high” – the latest version of the library that will be used for writing objects.
In HDF5 1.10.2, we added a new value for the “low” and “high” parameters, i.e., H5F_LIBVER_V18, and changed H5F_LIBVER_LATEST to be mapped to H5F_LIBVER_V110.
When H5F_LIBVER_LATEST was used in an application linked with HDF5 1.8.*, the application was able to use the new features added in HDF5 1.8.0, such as UTF-8 encoding and more efficient group storage and access. When it is used in an application linked with HDF5 1.10.0, it will enable the new chunk indexing for Single Writer/Multiple Reader (SWMR) access and Virtual Dataset storage.
What does this change mean to an HDF5 application?
When an HDF5 application linked with HDF5 1.10.2 specifies H5F_LIBVER_LATEST as a value for the “high” parameter, the application may produce files that are not compatible with the HDF5 1.8.* file format. For example, the new chunk indexing, which was not known to HDF5 1.8.*, may be used. This means that an application linked with the HDF5 1.8.* libraries may not be able to read such files.
When an HDF5 application linked with HDF5 1.10.2 specifies H5F_LIBVER_V18 as a value for the “high” parameter, the application will produce files fully compatible with HDF5 1.8.*, meaning that any application linked with the HDF5 1.8.* libraries will be able to read such files.
An example of the effect of this change is an application that uses H5Pset_libver_bounds with H5F_LIBVER_LATEST as a value for the “high” parameter and therefore produces HDF5 files that tools built with HDF5 1.8.* cannot read, for example, netCDF 4.4.0 (Feb. 2016). We recommend that such applications use H5F_LIBVER_V18 instead of H5F_LIBVER_LATEST to achieve forward compatibility with HDF5 1.8.*-based applications.
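A minimal sketch of creating a 1.8-compatible file with the new value follows; error checking is omitted and the file name is a placeholder:

hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);

/* Cap object versions at the HDF5 1.8 format so that 1.8.* readers can open the file */
H5Pset_libver_bounds(fapl_id, H5F_LIBVER_EARLIEST, H5F_LIBVER_V18);

hid_t file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);
/* ... create groups, datasets, and attributes as usual ... */
H5Fclose(file_id);
H5Pclose(fapl_id);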
For more information about this change, see “RFC – Setting Bounds for Object Creation in HDF5 1.10.0 – Update.”
Performance optimizations for HDF5 parallel applications
Historically, the parallel version of the HDF5 library has suffered from performance issues on file open, close, and flush, typically related to small metadata I/O. The small metadata I/O issue was addressed in the past, first by distributing metadata writes across processes, and later by writing metadata collectively on file close and flush and by supporting collective metadata reads in some cases. However, the problem re-appeared as the typical number of processes in HPC computations increased. While we are still working on the performance problem and on the appropriate solution to achieve scalability, we have introduced some optimizations in the HDF5 1.10.2 release to speed up open, close, and flush operations at scale.
We noticed that while the superblock read on file open is collective when collective metadata reads are enabled, all processes independently search for the superblock location—which at a minimum means that all processes independently read the first eight bytes of the file.
As this is an obvious performance bottleneck on file open for large computations, and the fix is simple and highly localized, we modified the HDF5 library so that only process 0 searches for the superblock, and broadcasts its location to all other processes.
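As a reminder, the collective metadata reads mentioned above are opt-in: an application enables collective metadata reads and writes on the file access property list. A minimal sketch, with error checking omitted and a placeholder file name, might look like this:

hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);

/* Request collective metadata reads and collective metadata writes */
H5Pset_all_coll_metadata_ops(fapl_id, 1);
H5Pset_coll_metadata_write(fapl_id, 1);

hid_t file_id = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);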
Our users also reported slow file close on some Lustre file systems. While the ultimate cause is not fully understood, the proximate cause appears to be long delays in MPI_File_set_size() calls at file close and flush. To minimize this problem pending a definitive diagnosis and fix, HDF5 has been modified to avoid MPI_File_set_size() calls when possible. This is done by comparing the library’s EOA (End of Allocation) with the file system’s EOF (End of File) and skipping the MPI_File_set_size() call if the two match.
If you still see problems, especially with the HDF5 close operation, please report them to help@hdfgroup.org and specify the system on which you run your application. Including a simple HDF5 C program that emulates your application’s I/O pattern will be very helpful to our investigation and may increase the speed at which we can provide the right solution.
For more information on the subject, please see a draft of the “File Open, Close, and Flush Performance Issues in HDF5” white paper. The paper outlines our initial findings and is a work-in-progress.
Using compression with HDF5 parallel applications
In the past, HDF5 parallel applications could read compressed data, but they could not create and write data using compression and other filters, such as Fletcher32 for raw data checksums. This restriction was removed in HDF5 1.10.2. The code snippet below shows how to enable deflate (GZIP) compression for parallel HDF5 applications, based on the “Writing by Chunk” example in the HDF5 Parallel Tutorial. Please notice that the H5Dwrite call must use collective mode to write the data, as shown in the example.
/*
 * Create chunked dataset.
 */
plist_id = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(plist_id, RANK, chunk_dims);

/* One line change to enable compression */
H5Pset_deflate(plist_id, 6);

dset_id = H5Dcreate(file_id, DATASETNAME, H5T_NATIVE_INT, filespace,
                    H5P_DEFAULT, plist_id, H5P_DEFAULT);
H5Pclose(plist_id);
H5Sclose(filespace);

/*
 * Each process defines dataset in memory and writes it to the hyperslab
 * in the file.
 */
……

/*
 * Create property list for collective dataset write.
 */
plist_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

status = H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace, plist_id, data);
This feature should still be considered experimental. There is a known issue that triggers an assertion failure in the HDF5 library under circumstances that are currently unknown. We would appreciate it if you contacted help@hdfgroup.org when you encounter the error and provided us with an example that reliably reproduces it.
Large MPI I/O transfers
In previous releases, parallel HDF5 would fail when attempting to read or write more than 2GB of data in a single I/O operation. The issue stems principally from the MPI API, whose definitions use 32-bit integers to describe the number of data elements and the datatype that MPI should use for a data transfer.
Historically, HDF5 has invoked MPI-IO with the number of elements in a contiguous buffer represented as the length of that buffer in bytes.
Resolving the issue, and thus enabling larger MPI-IO transfers, is accomplished first by detecting when a user I/O request would exceed the 2GB limit described above. Once a transfer request is identified as requiring special handling, HDF5 now creates a derived datatype consisting of a vector of fixed-size blocks, which is in turn wrapped within a single MPI_Type_struct to contain the vector and any remaining data. The newly created datatype is then used in place of MPI_BYTE and fulfills the original user request without encountering API errors.
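The sketch below illustrates the technique; it is not the library’s actual code, and the 1 GiB block size and helper name are illustrative assumptions:

#include <mpi.h>
#include <stdint.h>

#define BLOCK_SIZE (1024 * 1024 * 1024) /* 1 GiB blocks; an arbitrary choice for illustration */

/* Build a derived datatype describing nbytes contiguous bytes, for use when
 * nbytes exceeds the 2GB limit of the 32-bit count arguments in the MPI API. */
static MPI_Datatype make_big_type(int64_t nbytes)
{
    int64_t      nblocks   = nbytes / BLOCK_SIZE;
    int          remainder = (int)(nbytes % BLOCK_SIZE);
    MPI_Datatype vec_type, big_type;

    /* A vector of full-size blocks, each BLOCK_SIZE bytes long */
    MPI_Type_vector((int)nblocks, BLOCK_SIZE, BLOCK_SIZE, MPI_BYTE, &vec_type);

    /* Wrap the vector and the trailing remainder into a single struct type */
    int          lengths[2] = { 1, remainder };
    MPI_Aint     displs[2]  = { 0, (MPI_Aint)nblocks * BLOCK_SIZE };
    MPI_Datatype types[2]   = { vec_type, MPI_BYTE };
    MPI_Type_create_struct(2, lengths, displs, types, &big_type);

    MPI_Type_commit(&big_type);
    MPI_Type_free(&vec_type);

    /* The caller passes this type with a count of 1 instead of (nbytes, MPI_BYTE) */
    return big_type;
}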
Locations of the VDS source files
We introduced Virtual Datasets (VDS) in HDF5 1.10.0. The feature had a severe limitation: the source files specified at VDS creation time could not be moved. To make matters worse, the application had to run in the directory where the source files resided. Clearly, this created a problem when a file with a VDS and the corresponding source files were moved from their initial location for further processing.
In HDF5 1.10.2 we added two public APIs that set (or get) a prefix for the names of the source files via a dataset access property list (DAPL), allowing the source files to be located at an absolute path or at a path relative to the virtual file:
herr_t H5Pset_virtual_prefix(hid_t dapl_id, const char* prefix);
ssize_t H5Pget_virtual_prefix(hid_t dapl_id, char* prefix /*out*/, size_t size);
The prefix can also be set with the environment variable HDF5_VDS_PREFIX.
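A minimal sketch of opening a virtual dataset after its source files have been moved follows; the path and dataset name are placeholders, and error checking is omitted:

hid_t dapl_id = H5Pcreate(H5P_DATASET_ACCESS);

/* Look for the source files under this directory */
H5Pset_virtual_prefix(dapl_id, "/new/location/of/source/files");

hid_t dset_id = H5Dopen2(file_id, "/VDS", dapl_id);
/* ... read the virtual dataset as usual ... */
H5Dclose(dset_id);
H5Pclose(dapl_id);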
If you have any questions about the HDF5 1.10.2 release, please contact help@hdfgroup.org and/or post them on the HDF Forum. Your input is invaluable for a better and faster HDF5.
The HDF Group Developers