Agenda – HDF5 User Group 2021

Watch all the videos from this event on the YouTube playlist. The chat log is also posted.

Tuesday, October 12, 2021

Time ‑ CEST (GMT+2)

9:00‑9:15 a.m. Welcome Address – Mike Folk, The HDF Group | Video
9:20‑9:40 a.m. Our Earth Who Art in Cloud – Joe Lee, The HDF Group | Slides | Video
In this presentation, we evaluate the cost and performance of accessing NASA HDF Earthdata in the cloud using different software – MATLAB, h5py, GDAL, netCDF-Java, Apache Drill, etc. We compare several cloud storage options, such as AWS EBS, EFS, and S3, against the local file system, including LocalStack and MinIO. We also evaluate each tool's usability in cloud-based notebook platforms like Google Colab and AWS SageMaker. We will collect and present cost and performance data from past experiments performed through NASA-funded projects. This presentation will help cloud users organize and access HDF data in the cloud in either a cost-optimized or a performance-optimized way, since you can't achieve both under the principles of cloud economics.
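For illustration, here is a minimal sketch of this kind of cloud access using h5py together with the s3fs package; the bucket name and dataset path below are hypothetical, and the file is read in place rather than downloaded whole:

```python
# Read an HDF5 granule directly from S3 without downloading the whole file.
# Assumes s3fs and h5py are installed; bucket/key names are hypothetical.
import h5py
import s3fs

fs = s3fs.S3FileSystem(anon=True)  # anonymous access to a public bucket
with fs.open("example-earthdata-bucket/granule.h5", "rb") as f:
    with h5py.File(f, "r") as h5:
        dset = h5["/some/dataset"]   # hypothetical dataset path
        print(dset.shape, dset.dtype)
        subset = dset[0:100]         # only the bytes needed are fetched
```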
9:40‑10:00 a.m. A Case Study on Parallel HDF5 Dataset Concatenation for High Energy Physics Data Analysis – Sunwoo Lee, PhD, Northwestern University | Slides | Video
In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, help us better understand fundamental particles and their interactions. This data is often captured in many small files, creating a data management challenge for scientists. To better facilitate data management, transfer, and analysis on large-scale platforms, it is advantageous to aggregate the data into a smaller number of larger files. However, this translation process can consume significant time and resources, and if performed incorrectly the resulting aggregated files can be inefficient for highly parallel access during analysis on large-scale platforms. In this presentation, we present our case study on parallel I/O strategies and HDF5 features for reducing data aggregation time, making effective use of compression, and ensuring efficient access to the resulting data during analysis at scale. We focus in this case study on NOvA detector data, a large-scale HEP experiment generating many terabytes of data. The lessons learned from our case study inform the handling of similar datasets, thus expanding community knowledge related to this common data management task.
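As a rough illustration of the aggregation task (a toy serial sketch, not the authors' parallel pipeline), many small files can be appended into one chunked, compressed dataset with h5py; file and dataset names are hypothetical:

```python
# Concatenate a 1-D dataset "events" from many small HDF5 files into one
# large file with chunking and gzip compression.
import glob
import h5py

with h5py.File("aggregated.h5", "w") as out:
    dst = out.create_dataset("events", shape=(0,), maxshape=(None,),
                             dtype="f8", chunks=(1 << 20,),
                             compression="gzip")
    for path in sorted(glob.glob("small_files/*.h5")):
        with h5py.File(path, "r") as src:
            data = src["events"][...]
            dst.resize(dst.shape[0] + data.shape[0], axis=0)
            dst[-data.shape[0]:] = data  # append at the end
```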
10:00‑10:20 a.m. The Future of H5Coro – JP Swinski | Slides | Video
H5Coro is an independent implementation in C++ of a subset of the HDF5 standard that is optimized for reading static data from cloud-based storage systems. The H5Coro software is under active development by the University of Washington and NASA/Goddard Space Flight Center as a part of the SlideRule program for on-demand processing of ICESat-2 data.

This talk will discuss the future of H5Coro and propose the formation of a subset of the HDF5 standard that is optimized for cloud computing. The value of H5Coro is not in the software itself, but in its demonstration that a narrowly focused implementation of the HDF5 standard can achieve an order of magnitude better performance than the existing HDF5 library.

Rather than investing in making the existing HDF5 library suitable for all possible use-cases, this talk argues that we should be taking steps to promote the development of independent implementations of the HDF5 standard that are optimized for different use cases.
10:20‑10:35 a.m. BREAK
10:35‑10:55 a.m. HSDS – New Features – John Readey, The HDF Group | Slides | Video
HSDS (Highly Scalable Data Server) is a REST-based service for HDF data. HSDS was designed with cloud deployments in mind, but can also be used in your on-prem data center or on a laptop. This talk will discuss some of the interesting new features in the v0.7 release, including support for AWS Lambda and “Direct Access” – two different ways to enable serverless functionality.
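For example, a minimal session with h5pyd, the h5py-compatible client for HSDS, looks just like h5py; the endpoint URL and domain path below are hypothetical:

```python
# Open an HSDS "domain" (the server-side analogue of an HDF5 file) and
# read a slice; data is served chunk-by-chunk over REST.
import h5pyd

f = h5pyd.File("/home/user/data/example.h5", "r",
               endpoint="http://hsds.example.org:5101")
dset = f["/grid/temperature"]   # hypothetical dataset
print(dset.shape)
print(dset[0, :10])
f.close()
```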
10:55‑11:15 a.m. HDF5 VOL Connector to Apache Arrow – Jie Ye | Slides | Video
Apache Arrow is a popular platform for columnar in-memory data representation and for efficient data processing and transfer that has been widely adopted in the Big Data analysis and cloud computing domains. HDF5, the most widely used parallel I/O library on HPC systems, can take advantage of Apache Arrow’s capabilities, especially its in-memory data access and columnar format. However, the performance of HDF5 can be sub-optimal for various data structures, such as column-oriented accesses (arrays of structures and table-like data structures), ragged arrays, and in-memory data streaming between data producers and consumers (in-situ data processing). Apache Arrow is a good candidate for handling those data structures because it is an efficient in-memory, columnar data store. Furthermore, bridging the gap between science applications and analytic tools that use HDF5 and Apache Arrow data could bring new kinds of data together. In this presentation, we describe a VOL connector that allows applications to access Apache Arrow data through native HDF5 calls without modifying the applications. With the Arrow VOL, we evaluate how Big Data technologies that offer new capabilities work on HPC systems. We also present Arrow VOL performance for data structures with column-oriented access patterns and for ragged arrays.
11:15‑11:35 a.m. Dynamically loaded HDF5 VFDs – Jordan T. Henderson, The HDF Group | Slides | Video
A brief overview of a new HDF5 feature allowing users to create Virtual File Drivers as plugins and dynamically load them at runtime for use in an HDF5 application.
11:35‑11:55 a.m. Accelerating HDF5’s parallel I/O for Exascale using DAOS – Jerome Soumagne, The HDF Group | Slides | Video
The native HDF5 file format was originally designed around POSIX I/O and disk-based storage. With the emergence of new technologies such as non-volatile memory and SSDs, Intel’s DAOS distributed file system proposes new paradigms for storing and accessing data with low latency and high bandwidth. Currently at the release-candidate stage, the HDF5 DAOS VOL connector interfaces with Intel’s DAOS to define a new storage format that removes previous limitations of the native format. In this presentation, we will focus on the new features this connector provides and on how applications can take advantage of these new capabilities to design efficient I/O pipelines.
11:55‑12:10 p.m. BREAK
12:10‑12:30 p.m. The Story of HDF5 Usage in High Energy Physics – Marc Paterno and Saba Sehrish, Fermi National Accelerator Laboratory | Slides | Video
In recent years, Fermilab has been investigating the use of HDF5 in large-scale analysis of experimental high energy physics (HEP) data. The combination of parallel writing, efficient management of and access to columnar data, and compressed block storage matched the requirements we identified for moving from grid-based processing to running jobs at HPC facilities. We have now evaluated HDF5 in a wide range of HEP use cases, from raw detector data storage and retrieval to high-speed event selection during the later data analysis stages.

Much work has been done to bring HDF5 into HEP as a standard tool for data storage and access. Early on, we partnered with CMS to investigate Spark as a parallel analysis tool for columnar data stored in HDF5. This was followed by a laboratory research grant to further work on columnar storage of detector data and demonstrate parallel processing with MPI. Soon after, we had our first adopters of HDF5 “ntuples” within the NOvA experiment, in the form of the PandAna framework. This led to work under the HEP SciDAC project HEP on HPC, where efforts expanded to cover the use of Pythia on HPC systems and the parallelization of, and new features for, PandAna. DUNE is in the process of adopting PandAna. The HEP CCE project, utilizing the established expertise, has been evaluating integration of HDF5 with ROOT. Additional projects (e.g., Exa.TrkX) are using HDF5 in the context of machine learning.

In this talk we will present a historical view of all this work, describe its current state, and indicate what we see as useful future directions.
12:30‑12:50 p.m. The Story of HDF5 Usage in High Energy Physics – Marc Paterno and Saba Sehrish, Fermi National Accelerator Laboratory (continued)

12:50-1:00 p.m. BREAK
1:00‑1:30 p.m. Feature Requests and Community Discussion – Gerd Heber and Elena Pourmal, The HDF Group | Video

Wednesday, October 13, 2021

Time ‑ CDT
9:00‑9:15 a.m. Async VOL: Transparent Asynchronous I/O using Background Threads – Houjun Tang, Berkeley Lab | Slides | Video
This talk presents an asynchronous I/O framework that utilizes background threads for I/O task execution. It supports all types of HDF5 I/O operations, both collective and independent, requires no additional servers, and manages data dependencies transparently and automatically on behalf of users. Our asynchronous I/O implementation, packaged as an HDF5 VOL connector, demonstrates the effectiveness of hiding the I/O cost from the application with low overhead and an easy-to-use programming interface.
9:20‑9:40 a.m. Cache VOL: Efficient Parallel I/O through Caching Data on Fast Storage Layer – Huihuo Zheng, Argonne National Laboratory | Slides | Video
Many pre-exascale systems have fast storage layers, such as node-local SSDs, burst buffers, etc. We developed an external HDF5 VOL connector, Cache VOL, for caching data on the fast storage layer to improve parallel I/O efficiency. The data transfer between the fast storage layer and the parallel file system is performed asynchronously through the Async VOL, which allows the I/O cost to be hidden behind the compute. All the complexity is hidden inside the library. Existing HDF5 applications can use the Cache VOL with minimal code modifications. Cache VOL is useful for applications with heavy checkpointing I/O or with intensive repeated reads.
9:40‑10:00 a.m. HDF5 Parallel Compression Performance Factors – Scot Breitenfeld, The HDF Group | Slides | Video
HDF5 compression filters can help minimize the amount of space consumed by an HDF5 file. HDF5 version 1.10.2 introduced parallel compression. Recently, the HDF5 parallel compression feature was investigated to understand how various HDF5 parameters affect its performance. This talk will present those findings.
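For context, a minimal sketch of a parallel compressed write using h5py and mpi4py (assuming an MPI-enabled h5py build and HDF5 1.10.2 or later); because HDF5 filters require collective access in parallel, the write is wrapped in the dataset's collective context:

```python
# Run with e.g.: mpiexec -n 4 python write_compressed.py
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.rank, comm.size
n = 1 << 20  # elements per rank

with h5py.File("compressed.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("x", shape=(nprocs * n,), dtype="f8",
                            chunks=(n,), compression="gzip")
    with dset.collective:  # filters demand collective I/O in parallel
        dset[rank * n:(rank + 1) * n] = np.random.rand(n)
```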
10:00‑10:20 a.m. Data Optimization for the Cloud, continued – James Gallagher, OPeNDAP | Slides | Video
Web Object Stores have now been in use for a decade. It has become clear that using data stored in ways optimal for spinning disk can be hard when those data are now stored in the cloud using an object store. In this talk I’ll look at ways to avoid reformatting data while still being able to subset, without wholesale transfer, large data files originally organized for spinning disks. The talk will also touch on performance.
10:20‑10:35 a.m. BREAK
10:35‑10:55 a.m. HDF5 in Igor Pro – Howard Rodstein, WaveMetrics, Inc. | Slides | Video
Igor Pro is a commercial scientific graphing and data analysis program with a built-in programming environment. It runs on Macintosh and Windows.

Igor has supported import and export of HDF5 files since 2005. The current version, Igor Pro 9, adds the ability to save entire Igor Pro projects to HDF5 files and to restore them from HDF5 files.
This presentation will demonstrate the ways through which Igor Pro provides access to HDF5 files:
• Via Igor’s HDF5 Browser
• Via Igor’s programming environment
• Via Igor’s HDF5-based project files
10:55‑11:15 a.m. Mochi: an Approach to Composable Data Services – Jerome Soumagne, The HDF Group | Slides | Video
Distributed data services can enhance HPC productivity by providing storage, analysis, and visualization capabilities not otherwise present in conventional parallel file systems. Such services are difficult to develop, maintain, and deploy in a scientific workflow, however, due to the complexities of specialized HPC networks, RDMA data transfers, protocol encoding, and fault tolerance. The Mochi project is a collaboration between ANL, The HDF Group, LANL, and CMU. Mochi proposes a new approach to developing data management software through a collection of composable services that can be easily deployed and tailored to application needs. This presentation will give an overview, through a series of use cases, of how Mochi can facilitate and enable new data management workflows by transforming a largely monolithic HPC infrastructure into a growing ecosystem of data services.
11:15‑11:35 a.m. HDF5 as foundation for big data of manifold types in scientific visualization – Dr. Werner Benger, Airborne HydroMapping GmbH | Slides | Video
Scientific visualization encounters a multitude of data types generated from observations or numerical simulations, ranging from simple images and point clouds to complex hierarchical multigrid structures of unbounded complexity. Many, if not all, of these data types come with their own specific file formats. Frequently, there is even a multitude of different file formats for the same data type (e.g., for images). The wheel is re-invented over and over, as everyone has their own notion of “roundness”. By means of HDF5 this Gordian knot can be untied – not by “another file format”, but via HDF5’s capability to clearly distinguish between the semantic, syntactical, and technical-internal properties of a dataset. By dealing with data sets on a higher level of abstraction – i.e., a purely semantic level – the burdens of the lower levels (such as byte order, compression schemes, coordinate systems…) become technical details, while at the same time data can be completely self-descriptive – rather than requiring an email in addition to a data file in order to interpret its contents. This presentation demonstrates the F5 scheme for laying out scientific data, formulating their topological, geometrical, and other properties as well as the inter-relationships between them. This model covers a wide range of data types under a common abstraction scheme, utilizing HDF5 as a powerful and scalable basis.
11:35‑11:55 a.m. Exploring I/O Traces with DXT Explorer – Jean Luca Bez, Lawrence Berkeley National Laboratory | Slides | Video
I/O profiling tools, such as Darshan and Recorder, can collect detailed I/O traces from scientific applications. However, there is a lack of tools to analyze such logs and guide tuning. Existing approaches do not offer a straightforward way to explore and interactively visualize the I/O behavior reported in the Darshan DXT logs. Using a conventional static plot for such purposes is limited by the information it can present, due to space constraints and pixel resolution, possibly hiding I/O bottlenecks in plain sight. Furthermore, for HPC applications with numerous small I/O requests or those that run for hours, the collected trace can be huge, making it even more challenging to explore and visualize, and to extract meaningful information to detect possible causes of performance issues. The DXT Explorer tool adds an interactive component to Darshan trace analysis that can help researchers, developers, and end users visually inspect their applications’ I/O behavior, zoom in on areas of interest, and get a clear picture of where the I/O problem lies.
11:55‑12:10 p.m. BREAK
12:10‑12:30 p.m. rhdf5: HDF5 in the Bioconductor ecosystem – Mike Smith, European Molecular Biology Laboratory | Slides | Video
Bioconductor is a community-driven software project, rooted in the statistical programming language R and focused on the analysis of high-throughput biological data. One of the aims of the project is to provide robust and reusable software infrastructure to facilitate interoperability between analysis tools. The growth in biological datasets over time has created a need for on-disk data representations that can be used seamlessly alongside existing in-memory solutions, a niche that HDF5 is helping us fill.

This talk will introduce rhdf5 and accompanying software packages as an R interface to working with HDF5 files. rhdf5 provides an interface to much of the HDF5 C-API, as well as simplified wrapper functions for many common operations, and can be a useful tool regardless of the type of data you are working with. I will also introduce some of the specific use cases of HDF5 for biological data analysis and highlight the role rhdf5 plays in the Bioconductor software stack.
12:30‑12:50 p.m. Sparse Data in Scientific Imaging Applications – Peter Ercius, Lawrence Berkeley National Laboratory | Slides | Video
A cornerstone technique in the physical and biological sciences, transmission electron microscopy (TEM) is capable of imaging the atomic structure of materials and macromolecules. Data generation by these microscopes is accelerating at an exponential pace due to recent advances in imaging detector technology. Data sets have grown from the single-megabyte range to hundreds of gigabytes over roughly the last five years, putting huge pressure on researchers to investigate new storage and data compression methods. HDF5 is being adopted by the field due to its capabilities for storing heterogeneous data types, its features for accessing large datasets and metadata, and its interoperability across a large, open ecosystem.

A new detector installed at Lawrence Berkeley National Laboratory is capable of generating imaging data at 480 Gbit/s, leading to data sets as large as 650 GB generated in 15 seconds. This signal is known to be sparse, and thus we have implemented parallelized post-processing software on an edge-compute platform (also compatible with high performance computing) to compress the data by 10–100x using electron event localization. The sparse data is represented using a linear-index-encoded electron event representation (EER) in which each electron event is a single entry in a list. Each frame has a different number of events, and we utilize the ability of HDF5 to store “ragged arrays”, where one dataset axis can be variable in length. The resulting sparse data fits into the RAM of a typical commercial laptop, but its unique layout is incompatible with existing post-processing codes built upon traditional dense data (i.e., numpy ndarrays). This incompatibility delays and reduces the scientific output from this (and future) detectors due to the need to develop custom processing code and AI/ML algorithms. Finally, most GPUs are incapable of handling datasets tens of gigabytes in size, and further optimization of sparse imaging data using other sparse data types in HDF5 could enable rapid advanced data analysis on GPUs.

This talk will discuss the current use cases for HDF5 in data intensive scientific imaging fields with a focus on TEM. I will describe our current data reduction and processing pipelines which could greatly benefit from a native sparse dataset implementation in HDF5. Such a new type of dataset should provide access through the standard HDF5 API to retrieve and slice data in either a dense or a sparse form. The ability to rapidly decompress sparse data to a more traditional dense format will provide interoperability with the large ecosystem of codes and programs already available for image processing. Direct access to the sparse data also provides the ability to speed up some algorithms which can be implemented in the sparse domain where operations occur on each event rather than every image pixel.
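To make the ragged-array layout concrete, here is a toy h5py sketch (our illustration, with a hypothetical frame count and a hypothetical 1024x1024 detector, not the actual detector pipeline): each frame is stored as a variable-length row of linear pixel indices, and one frame is densified for tools that expect images.

```python
# Store electron-event lists as an HDF5 "ragged array": one
# variable-length row of linear pixel indices per frame.
import h5py
import numpy as np

vlen_u32 = h5py.vlen_dtype(np.uint32)
with h5py.File("eer_sparse.h5", "w") as f:
    frames = f.create_dataset("electron_events", shape=(3,), dtype=vlen_u32)
    frames[0] = np.array([12, 4097, 523000], dtype=np.uint32)  # 3 events
    frames[1] = np.array([], dtype=np.uint32)                  # empty frame
    frames[2] = np.array([77, 78], dtype=np.uint32)

# Densify one frame for codes built on dense ndarrays.
with h5py.File("eer_sparse.h5", "r") as f:
    dense = np.zeros(1024 * 1024, dtype=np.uint8)
    dense[f["electron_events"][0]] = 1
    image = dense.reshape(1024, 1024)
```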
12:50-1:10 p.m. Efficient I/O and Data Management for Exascale Earthquake Simulation and Analysis – Houjun Tang, Berkeley Lab | Slides | Video
Moving toward exascale earthquake simulations, I/O and data management becomes increasingly challenging: (1) the increased volume of input and output data significantly affects the overall simulation run time; (2) new I/O and data requirements emerge as the simulation code evolves; (3) an easy-to-access data format enables efficient analysis and data sharing; and (4) new techniques such as compression are required to enable large-scale data analysis. This talk presents efforts to address these challenges using HDF5 in the ECP EQSIM project.
1:00‑1:30 p.m. h5 Dreams, a Community Discussion | Video
Learn about the developments The HDF Group wants to pursue and what has been requested by others, followed by feedback and community discussion.

Thursday, October 14, 2021

Time ‑ CDT
9:00‑9:15 a.m. Computational storage with HDF5-UDF – Lucas C. Villa Real, IBM Research | Slides | Video
In this talk, we present an infrastructure for the HDF5 file format that enables dataset values to be populated on the fly: scripts can be attached to HDF5 files and execute only when the dataset is read by an application. We provide details on the software architecture that supports user-defined functions (UDFs) and how it integrates with hardware accelerators and computational storage. Moreover, we describe the built-in security model that limits the system resources a UDF can access. Lastly, we present several use cases that show how UDFs can be used to extend scientific datasets in ways that go beyond the original scope of this work.
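A sketch modeled on HDF5-UDF's Python UDF template may make this concrete; treat the helper names, the dataset name, and the command line below as assumptions drawn from the project's examples rather than a definitive reference:

```python
# A Python UDF stored inside the HDF5 file; it runs only when the
# dataset is read. The "lib" object is injected by the HDF5-UDF runtime,
# so this script is not executed standalone.
def dynamic_dataset():
    data = lib.getData("simulated_temperature")   # output buffer
    dims = lib.getDims("simulated_temperature")
    for i in range(dims[0]):
        data[i] = 20.0 + 0.1 * i                  # computed on read

# Attached from the shell, e.g. (hypothetical dataset spec):
#   hdf5-udf sample.h5 udf.py simulated_temperature:1000:float
```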
9:20‑9:40 a.m. HDF5 and SQL, Together At Last! – Charles Givre, CISSP, CEO & Founder: DataDistillr | Slides | Video
A considerable amount of scientific data is stored in HDF5 format. While there are tools to view this data, such as HDFView, performing exploratory data analysis of HDF5 data is challenging, mainly because it requires writing code. Apache Drill offers a different way to query and explore HDF5 data: it is the only open-source federated query engine that natively supports HDF5. With Drill you can query complex HDF5 datasets using standard ANSI SQL, as well as explore the metadata of these files. Additionally, you can join HDF5 datasets with any other data that Drill can query. Since Drill uses ANSI SQL for all operations, it is quite easy for new users to adopt and start getting value from data. What’s more, there are Python and R integrations for Drill that let you seamlessly execute a query in Drill and pipe the results into a data frame for more sophisticated analysis.

In this presentation, Mr. Givre will demonstrate how to query HDF5 files and metadata, and how to use the various Python libraries to pull data from Drill.
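As a hedged illustration of such a workflow using the pydrill client, assuming a Drill instance on localhost and a hypothetical file visible under the dfs storage plugin:

```python
# Query an HDF5 file through Apache Drill and pull results into pandas.
from pydrill.client import PyDrill

drill = PyDrill(host="localhost", port=8047)

# A top-level query returns file metadata plus dataset listings.
meta = drill.query("SELECT * FROM dfs.`/data/test.h5` LIMIT 10")

# Project a specific dataset via the format plugin's defaultPath option.
rows = drill.query(
    "SELECT * FROM table(dfs.`/data/test.h5` "
    "(type => 'hdf5', defaultPath => '/results/temperature'))"
)
df = rows.to_dataframe()   # hand off to pandas for further analysis
```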
9:40‑10:00 a.m. Selection I/O in HDF5 Virtual File Drivers – Neil Fortner, The HDF Group | Slides | Video
Currently, all I/O calls that HDF5 makes to the file-driver layer consist of a single offset and length in the file, except for the MPIO driver, which uses an undocumented, library-internal mechanism to pass an MPI datatype describing the I/O. The selection I/O feature currently under development will allow HDF5 to pass HDF5 dataspace selections or vectorized offset/length pairs to the file driver instead. This will allow file drivers other than MPIO to take advantage of any non-contiguous I/O acceleration in the underlying storage system.
10:00‑10:20 a.m. Improving NetCDF Compression – Edward Hartnett, CIRES, NOAA NCEP | Slides | Video
Since netCDF-4.0, netCDF has supported zlib compression of data. Since netcdf-c-4.7.4, netCDF has supported use of compression with parallel I/O. However, compression remains a challenging problem, due to the significant delays involved in compressing and decompressing data. Furthermore, lossless zlib compression has limited effect – by applying lossy compression, much smaller resulting datasets can be achieved. In this talk I will describe the quantization feature currently being added to the netCDF C and Fortran libraries, which will permit lossy compression. A functional prototype of this lossy compression was developed in the CCR project (https://github.com/ccr/ccr). Quantization is a feature that may also be considered for implementation in the HDF5 library.
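As a point of reference, netCDF4-python has long offered a simpler form of lossy-then-lossless compression through its least_significant_digit option, which the quantization feature described here generalizes; a minimal sketch with hypothetical file and variable names:

```python
# Quantize to ~3 decimal digits before zlib compression; the truncated
# bits compress far better than full-precision floats.
import netCDF4
import numpy as np

nc = netCDF4.Dataset("quantized.nc", "w")
nc.createDimension("x", 100000)
var = nc.createVariable("t", "f4", ("x",), zlib=True, complevel=4,
                        least_significant_digit=3)
var[:] = np.random.rand(100000).astype("f4") * 300.0
nc.close()
```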
10:20‑10:35 a.m. BREAK
10:35‑10:55 a.m. Extendable type-safe, thread-safe, asynchronous APIs for Neutron Science Data using modern C++ on top of HDF5 – William Godoy, Addi Malviya Thakur and Steven Hahn, Oak Ridge National Laboratory | Slides | Video
We present the lessons learned from the requirements of writing a domain-specific input/output (I/O) library for neutron science raw data stored using the standard NeXus hierarchical metadata-rich schema on top of HDF5. We introduce type-safe application programming interfaces (APIs) for accessing datasets and creating appropriate metadata “in-memory” indexes using C++17 template metaprogramming auto-deduction features, as well as thread-safe and asynchronous APIs, built on facilities available since C++11, to match the processing requirements of single instruction multiple data (SIMD) and concurrency for task-based parallelism. Building a domain-extendable layer on top of the HDF5 APIs also allows leveraging future roadmap features in the HDF5 library toward I/O and computation concurrency. At the same time, type-safe APIs enable consumers to catch errors early in their development, as checks are moved from runtime to compile time or just-in-time. While we focus on the neutron data produced and post-processed on a large many-core computational node at Oak Ridge National Laboratory (ORNL) facilities, SNS and HFIR, the present methodology extends to any well-defined schema that uses the standard HDF5 hierarchical data model as a basis. Overall, we leverage HDF5 with modern C++ features to build portable and convenient domain-specific APIs, which can be extended to other languages (e.g., Python), to serve domain-specific communities in their data-interaction tasks such as analysis, testing, and in-memory index construction and searching.
10:55‑11:15 a.m. Predicting and optimizing the I/O performance of HDF5 applications – Donghe Kang | Slides | Video
Many applications are increasingly becoming I/O-bound. To improve scalability, analytical models of parallel I/O performance are often consulted to determine possible I/O optimizations. However, I/O performance modeling has predominantly focused on applications that directly issue I/O requests to a parallel file system or a local storage device. A single request to an object in an HDF5 application can trigger a cascade of I/O operations to different storage blocks. The I/O performance of HDF5 applications is thus a complex function of the underlying data storage model, user-configurable parameters, and object-level access patterns. As a result, domain scientists need an analytical cost model that predicts the end-to-end execution time of HDF5 applications in order to optimize application performance.

One example of optimizing array storage with an analytic cost model is consolidating and placing small arrays on heterogeneous data stores. Two scientific pipelines, detecting supernovae and post-processing computational fluid dynamics simulations, face scalability bottlenecks when processing massive numbers of small arrays. One solution is to organize the small arrays in one big file. However, storing everything in one file does not fully leverage the heterogeneous storage capabilities of modern clusters. We implemented a system, Henosis, in the HDF5 VOL layer that intercepts data accesses and transparently redirects I/O to an in-memory Redis object store and the TileDB array store.

In this talk, I will present these two pieces of work: modeling the performance of HDF5 applications and tuning the storage layout through array consolidation and placement. In addition to modeling the I/O time, it is crucial to model the cost of transforming data to a particular storage layout (the memory-copy cost), as well as the benefit of accessing a software cache. Based on the model, we develop an integrated array consolidation and placement algorithm and build the Henosis system. Two real scientific data analysis pipelines show that consolidation with Henosis makes I/O 300× faster than directly reading small arrays.
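To make the intuition behind consolidation concrete, here is an illustrative first-order cost model (our simplification, not the speaker's exact formulation): when arrays are small, per-request latency dominates, so merging many reads into one pays off.

```python
# First-order read-time model: each request pays a fixed latency plus a
# bandwidth-proportional transfer cost. Parameter values are illustrative.
def read_time(n_requests, bytes_per_request,
              latency=1e-3, bandwidth=1e9):
    """Seconds to issue n_requests reads of bytes_per_request each."""
    return n_requests * (latency + bytes_per_request / bandwidth)

small = read_time(100_000, 4_096)        # 100k tiny reads
merged = read_time(1, 100_000 * 4_096)   # one consolidated read
print(small, merged, small / merged)     # consolidation wins handily here
```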
11:15‑11:35 a.m. MATLAB Meets HDF5 in the Cloud – Ellen Johnson, MathWorks | Slides | Demo (MP4) | Video
MATLAB has a long history of supporting HDF5 features through our rich high- and low-level library interfaces. This talk presents our latest HDF5 capabilities, including support for Single-Writer/Multiple-Reader (SWMR), Virtual Datasets (VDS), and working with HDF5 data hosted in the cloud – which is becoming increasingly important given the acceleration of cloud computing. We will review MATLAB’s HDF5 cloud I/O capabilities, including read/write support for S3 and Azure and read support for Hadoop, and demonstrate advanced workflows combining SWMR and VDS on local and cloud platforms. We will wrap up with performance considerations and our tentative roadmap for future HDF5 enhancements.
11:35‑11:55 a.m. UI Design Challenges in HDF5 – Robert Seip | Video
Displaying generalized HDF5 structure presents some unique challenges in UI design. This presentation focuses on techniques used to render the HDF5 data hierarchy and associated metadata in ways that are visually effective across the full breadth of HDF5 datastores, irrespective of the scope and complexity of the HDF5 structure.

Specifically, the talk will address the following: giving attributes special consideration; handling very long object names; expressing a fully-qualified HDF5 object with metadata in a one-line rendering; techniques for rendering groups with very large numbers of child objects; how to scroll individual HDF5 groups as well as an entire graph; and, finally, proper left- and right-justification of an HDF5 graph. In a tightly-specified data environment, standardized UI components can often suffice and offer cost-effective solutions for UI applications.

However, exciting future directions of HDF5, whether it’s the use of HSDS cloud deployments to AWS or Azure, or the use of h5cpp as a better persistent local datastore for native application development, give an indication of the ever-growing importance of displaying HDF5 structure without UI limitation.
11:55‑12:10 p.m. BREAK
12:10‑12:30 p.m. Improving HDF5 write performance with log-based storage layout – Kaiyuan Hou, Northwestern University | Slides | Video
HDF5 is a popular file format and I/O library among scientific applications. Its parallel I/O feature has demonstrated scalable performance on modern parallel computers. The HDF5 file format maintains the canonical order of multi-dimensional arrays in the file storage layout, which provides a concise representation of data that is easily understandable by users.

However, maintaining such canonical order for parallel I/O bears the expensive overhead of inter-process communication to redistribute and reorganize data before accessing the files. One solution to mitigating this cost is to store the data in a log layout and track the location of each data block relative to its canonical order. While this idea has shown significant improvement in recent literature, it breaks the HDF5 file format specification if applied to HDF5 directly.

In the recent release of HDF5 version 1.12.0, the Virtual Object Layer (VOL) was introduced to allow users to customize I/O operations for accessing HDF5 data objects. With a VOL, the data stored on disk can be in a different format from the native HDF5 format, while its metadata can still be managed through HDF5 APIs. We present the design and implementation of a log-based VOL connector that stores HDF5 datasets in a log-based storage layout. We explored several techniques to mitigate the metadata overhead, including encoding, deduplication, and compression. Storing data in a log layout increases the file size because of the additional metadata that describes the location of each logged block relative to its canonical order. This metadata can become large when applications make many fragmented I/O requests. Using an I/O kernel extracted from the E3SM simulation framework, we study the impact of various performance-tuning techniques on the overall I/O cost.
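For orientation, external VOL connectors such as this one are typically loaded through environment variables before the HDF5 library initializes; a minimal sketch, where the plugin path and the connector's registered name are assumptions:

```python
# Point HDF5 at the connector before the library starts up; with h5py
# this means setting the environment before the import.
import os

os.environ["HDF5_PLUGIN_PATH"] = "/opt/log-vol/lib"               # hypothetical
os.environ["HDF5_VOL_CONNECTOR"] = "LOG under_vol=0;under_info={}"  # assumed name

import h5py  # picks up the VOL connector at HDF5 library init

with h5py.File("output.h5", "w") as f:   # writes go through the log VOL
    f.create_dataset("x", data=range(10))
```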
12:30‑12:50 p.m. Using flexFS for cloud-native, high-throughput, cost-effective HDF data volumes – Gary Planthaber | Slides | Video
We will demonstrate how to store and analyze large volumes of HDF data in the cloud without making any changes to tooling, data formats, or workflows. With flexFS, read-write-many volumes of HDF data can easily be shared across clusters of thousands of workers and high throughput rates can typically be achieved without converting legacy data to newer cloud-optimized formats. Furthermore, flexFS provides a fully compliant POSIX implementation that supports all existing file-based tools and makes development of new tools easier and more portable by abstracting away the details and complexities of cloud infrastructure and optimization.
12:50-1:10 p.m. Paradise Lost – Moving away from HDF5 – Gerd Heber, The HDF Group | Slides | Video
In his blog post, “Moving away from HDF5,” Cyrille Rossant “… describe(s) what is HDF5 and what are the issues that made us move away from it.” In a follow-up, “Should you use HDF5?,” Mr. Rossant offers “… some further thoughts, in no particular order.” All in all, he raised several important issues. With the benefit of hindsight and a distance of 5 years, perhaps this is a good time to revisit them, ask what, if anything, has changed, and what new challenges the HDF community is facing.
1:10‑1:30 p.m. Roadmap and Features in Upcoming Releases – Elena Pourmal, The HDF Group | Slides | Video