HDF5 User Group (HUG) Meeting Agenda 2023


Wednesday, August 16, 2023

Time (EDT) Session
8:00 a.m. – 8:45 a.m. Registration and Breakfast
8:45 a.m. – 9:00 a.m. Opening Address – Suren Byna, The Ohio State University – Slide Deck | Video
9:00 a.m. – 9:45 a.m. Keynote: Computing Challenges in Unlocking the Secrets of the Universe – Marc Paterno, Fermi National Accelerator Laboratory – Slide Deck | Video

The aim of high-energy physicists is to discover the secrets of the physical universe. We do this by studying the universe at both the smallest (elementary particle) and largest (cosmological) scales. We try to answer questions like: What is the universe made of? What forces govern it? How did it become the way it is today? Scientists at the Fermi National Accelerator Laboratory (Fermilab) are engaged in investigating answers to all these questions.

The scientists who try to answer these questions use some of the most complex machines ever built (such as the Large Hadron Collider, a ring of more than 10 thousand magnets with a circumference of more than 16 miles) and some of the largest detectors (such as the NOvA detector, a 14 kiloton mostly plastic structure at which particles are shot from 500 miles away). These instruments generate enormous data sets that present significant computing challenges (the Dark Energy Survey took more than 150 thousand images at 350 million pixels per image).

Dr. Paterno has been involved with such research at Fermilab since 1989, and in the computing challenges presented in such research since 1999. Over this time, computing work in high energy physics has moved from Fortran 77 and batch programs run on 1-MIP workstations to C++ and Python programs run on thousands of distributed nodes or on multi-petaflop supercomputers.

He will give a personal view of some of these ongoing research efforts, highlighting their goals and particular data processing challenges they face.
9:45 a.m. – 10:00 a.m. Break
10:00 a.m. – 10:20 a.m. HDF5 use in Long Term Energy Modeling Systems at the U.S. Energy Information Administration – Josh Whitlinger, U.S. Energy Information Administration (EIA) – Slide Deck | Video

The U.S. Energy Information Administration (EIA) develops and runs two large, highly visible energy models for use in the Annual Energy Outlook and International Energy Outlook publications. Each of these modular systems requires careful I/O management to support increasingly long run times with increasingly complex outputs. This talk will provide highlights of EIA’s experience and challenges in converting these models to run on HDF5.

The World Energy Projection System (WEPS), used to support the International Energy Outlook, has used HDF5 files for runtime data storage since 2016. In the simpler WEPS system, each module reads shared data from the HDF5 file, processes the information, and then writes shared information back to the HDF5 file. EIA has developed expertise and tools around this system, and is working toward porting these lessons to the larger, more complex modeling system used for the domestically focused Annual Energy Outlook.

The National Energy Modeling System (NEMS), used to support the Annual Energy Outlook, presents a series of challenges beyond those encountered with WEPS. NEMS is currently built around an in-memory Fortran architecture. This can make debugging difficult, particularly because NEMS is written in multiple languages (Fortran, Python, AIMMS, GAMS, etc.). Further, NEMS iterates rapidly (often in under 10 seconds) between data-intensive modules and must complete all I/O and calculations in this span. To meet these speed goals, NEMS currently stores data in a binary file without any metadata.

To investigate the move to HDF5, EIA leveraged Python’s NumPy/F2PY package to create a new interface program between the Fortran and Python programs, with the HDF5 data file as the interlocutor. While this workaround was a successful initial implementation, EIA is working through speed- and data-size-related issues. Writing 3,728 variables to the original Fortran data file takes approximately 1 second, while writing a metadata-encoded set of pandas tables to HDF5 takes approximately 5 minutes. Part of the speed problem is the size and dimensionality of these variables (3D, 4D, even 5D). The second problem facing EIA is that the restart file is about 10x larger when stored in HDF5.
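One possible contributor to the size increase, offered here only as an assumption (the abstract does not say how the pandas tables are laid out), is storing a multidimensional variable in "long" form: one row per element, with one index column per dimension. The sketch below compares the payload of that layout with a dense N-dimensional array. The function names, the example shape, and the 8-byte item sizes are all illustrative.

```python
from math import prod

def dense_bytes(shape, value_size=8):
    """Payload of a dense N-D array: one value per element."""
    return prod(shape) * value_size

def long_table_bytes(shape, index_size=8, value_size=8):
    """Payload of a long-format table: each row holds one index per
    dimension plus the value itself."""
    return prod(shape) * (len(shape) * index_size + value_size)

shape = (4, 5, 6, 7, 8)  # hypothetical 5D variable
ratio = long_table_bytes(shape) / dense_bytes(shape)
print(f"long/dense size ratio for a {len(shape)}D variable: {ratio:.1f}")
```

Under these assumptions the overhead grows with dimensionality, since every extra dimension adds another index column to every row.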

EIA plans to continue this investigation into using HDF5 as a restart file for NEMS and looks forward to suggestions and comments from the user community on how to more efficiently and effectively do this.
10:20 a.m. – 10:40 a.m. A Workflow for Using CGNS in Parallel HPC Analyses – Gregory Sjaardema, Sandia National Laboratories – Slide Deck | Video

Sandia Laboratories uses the CGNS Database to define the model input for most analyses that involve structured mesh geometries, ranging from small-scale scoping studies to large-scale multi-physics calculations on hundreds, thousands, or tens of thousands of processors and GPUs.

The IOSS library is used in many Sandia analysis codes to read and write databases in multiple formats, including CGNS. This talk will provide a summary of the IOSS library’s features that facilitate an efficient and scalable parallel I/O interface between the analysis code and the CGNS library. The speaker will describe the parallel decomposition method developed to partition the model across processing units and the parallel results output, which can either merge the results into a single output file or write a file per rank. The resulting output database remains consistent irrespective of the number of ranks used in the simulation.
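The property that merged output does not depend on the rank count can be illustrated with a generic one-dimensional block decomposition. This is only a sketch: the abstract does not describe the IOSS decomposition method, and `block_ranges` and `run_and_merge` are hypothetical names.

```python
def block_ranges(n_cells, n_ranks):
    """Split cell indices 0..n_cells-1 into contiguous, near-equal
    blocks, one (start, stop) pair per rank."""
    base, extra = divmod(n_cells, n_ranks)
    ranges, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def run_and_merge(field, n_ranks):
    """Each rank processes its own block (standing in for a
    file-per-rank result); the pieces are then merged in global order."""
    per_rank = [[2 * v for v in field[lo:hi]]
                for lo, hi in block_ranges(len(field), n_ranks)]
    merged = []
    for piece in per_rank:
        merged.extend(piece)
    return merged

field = list(range(10))
same = run_and_merge(field, 1) == run_and_merge(field, 3) == run_and_merge(field, 4)
print("merged output independent of rank count:", same)
```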

Sandia Laboratories provides several tools to support analyst workflow needs, including tools to evaluate the parallel decomposition’s quality, merge file-per-rank outputs into a single file, convert structured meshes to unstructured meshes, and visualize results.

The CGNS library employs HDF5 as its underlying on-disk file format, with development and support provided by The HDF Group.

In summary, this talk offers insights into Sandia Laboratories’ workflow for using CGNS in parallel HPC analyses, describing the IOSS library’s features, parallel decomposition method, and tools for analysts.

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
10:40 a.m. – 11:00 a.m. H5Intent: Autotuning HDF5 with user Intent – Dr. Hariharan Devarajan, Lawrence Livermore National Laboratory – Slide Deck | Video

The complexity of data management in HPC systems stems from the diversity in I/O behavior exhibited by new workloads, multistage workflows, and multitiered storage systems. The HDF5 library is a popular interface to interact with storage systems in HPC workloads. The library manages the complexity of diverse I/O behaviors by providing user-level configurations to optimize the I/O for HPC workloads. The HDF5 library consists of hundreds of properties that provide various I/O optimizations and improve I/O performance for different use cases. However, these configurations are challenging to set for users who lack expertise in HDF5 library internals. We propose a paradigm change through our H5Intent software, where users specify the intent of I/O operations and the software sets various HDF5 properties automatically to optimize the I/O behavior. This work demonstrates several use cases that map user-defined intents to HDF5 configurations to optimize I/O. In this study, we make three observations. First, I/O intents can accurately define HDF5 configurations while managing conflicts and improving I/O performance by up to 22x. Second, I/O intents can be efficiently passed to HDF5 using the Intent Adaptor with a small footprint of 6.84 KB per node for thousands of intents per process. Third, the H5Intent VOL can dynamically map I/O intents to HDF5 configurations for various I/O behaviors exhibited by our microbenchmark and improve I/O performance by up to 8.8x. In conclusion, the H5Intent software improves the I/O performance of complex large-scale HPC workloads such as VPIC and BD-CATS by up to 11x on the Lassen supercomputer.
11:00 a.m. – 11:20 a.m. Drishti and HDF5: What is actually happening in my application? – Jean Luca Bez, Lawrence Berkeley National Laboratory – Slide Deck | Video

The current parallel I/O stack is complex and difficult to tune due to the interdependencies among multiple factors that impact data movement performance between storage and compute systems. When performance is slower than expected, end-users, developers, and system administrators rely on I/O profiling and tracing information to identify the root causes of inefficiencies. However, there is a gap between the currently available metrics, the potential bottlenecks, and the implementation of optimizations that would mitigate performance slowdowns. This talk discusses Drishti: a novel interactive, user-oriented visualization and analysis framework. Drishti seeks to aid users in pinpointing the root causes of I/O performance problems and to provide actionable recommendations for improving performance based on the observed characteristics of an application. We cover its usage in applications that rely on HDF5 to store their data by presenting some real case studies, and discuss how the community can extend Drishti with additional analyses and visualizations.

11:20 a.m. – 11:40 a.m. HDF5 in the Julia Ecosystem – Mark Kittisopikul, Ph.D., Janelia Research Campus, Howard Hughes Medical Institute – Slide Deck | Video

HDF5.jl is a Julia package for accessing HDF5 files, using the HDF5 C library maintained by The HDF Group. It provides a simple, high-level interface making it easy to save and load data, as well as a more flexible interface allowing users to take advantage of many of HDF5’s features. Although the HDF5.jl package has been around since 2012, we have recently undertaken some significant changes to improve the modularity and make available newer features. An example of this is the implementation of HDF5 filter and compression plugins in Julia to connect HDF5 with the underlying C libraries directly.

HDF5.jl serves as the basis for JLD.jl which implements the Julia data format for storing Julia variables. It also serves as a reference implementation for JLD2.jl, a pure Julia library for writing and reading a subset of HDF5.

Via Julia’s BinaryBuilder.org ecosystem, the Julia community recently implemented HDF5_jll using cross compilation of HDF5 1.14.0 across 192 platform permutations of operating system, libc implementation, gfortran compiler version, processor architecture, and MPI implementation.

Looking towards the future, the Julia community looks forward to helping improve and simplify cross compilation of HDF5 and explore multithreaded I/O access to HDF5 files.
11:40 a.m. – 12:15 p.m. Lunch
12:15 p.m. – 1:00 p.m. Lunch Talk: From XFiles to SAF: the early pedigree of HDF5 – Mark C. Miller, Lawrence Livermore National Labs – Slide Deck | Video

Some of the earliest conversations about the development of HDF5 took place as part of the Data Models and Formats (DMF) effort of the Accelerated Strategic Computing Initiative (ASCI). ASCI-DMF was a tri-lab (LLNL, SNL and LANL) effort to develop a common and flexible scientific data modeling API and file format for the emerging generation of scalable parallel, high performance computing applications starting in the late 1990s. The NCSA HDF team participated in ASCI-DMF efforts beginning with the initial development of HDF5. Early designs of HDF5 benefited from lessons learned and knowledge gained from not only its predecessor, HDF4, but also from several other tri-lab data management technologies such as XFiles, PDB, Aio, Silo, ExodusII, Nemesis, and SAF. We’ll take a walk through this early history and some of the influence it had on the early development of HDF5.
1:00 p.m. – 1:20 p.m. HyperChunking for the Cloud – John Readey, The HDF Group – Slide Deck | Video

Chunk sizes that are fine for local disk access can lead to poor performance when HDF5 files are accessed in the cloud. On the other hand, it can be impractical to rechunk datasets for large repositories in order to deploy to the cloud. To improve performance for this use case, HSDS has introduced the concept of “hyperchunking” – grouping HDF5 chunks into larger virtual chunks that optimize performance for access over the web. This talk will discuss how this works and show some real world examples.
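The effect of grouping chunks on request counts can be sketched with simple integer arithmetic. This is a one-dimensional illustration only, not the HSDS implementation; `requests_needed` and the group size of 64 are assumptions made for the example.

```python
def requests_needed(first_chunk, last_chunk, chunks_per_request=1):
    """Number of storage requests needed to read chunks
    first_chunk..last_chunk (inclusive) when each request fetches a
    group of `chunks_per_request` adjacent chunks."""
    first_group = first_chunk // chunks_per_request
    last_group = last_chunk // chunks_per_request
    return last_group - first_group + 1

plain = requests_needed(0, 999)                           # one request per chunk
grouped = requests_needed(0, 999, chunks_per_request=64)  # grouped chunks
print(plain, grouped)
```

Fewer, larger requests generally suit object stores better because each request carries a fixed latency cost.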
1:20 p.m. – 1:40 p.m. Reintroducing the REST VOL – Matt Larson, The HDF Group – Slide Deck | Video

The REST VOL is once again in active development, providing another way to interact with HDF5 stored on cloud servers. This talk will include an introduction to basic VOL concepts and an overview of how the REST VOL works, and will compare and contrast it with other tools like the ROS3 VFD.
1:40 p.m. – 2:00 p.m. Cloud-Optimized HDF5 Files – Aleksandar Jelenak, The HDF Group – Video

As more HDF5 files end up in cloud object stores, their producers and data managers need to be aware of the file properties that enable efficient cloud-native data access. “Cloud optimized” means that the internal HDF5 file structures allow the desired data to be read from a file in a cloud object store with a minimal number of requests. The talk will present the features of cloud-optimized HDF5 files and several ways to achieve them.

2:00 p.m. – 2:20 p.m. HDF and DAOS using Google Cloud – Glenn Song, The HDF Group – Slide Deck | Video

As technology trends increasingly towards the cloud, there has been a mass migration of services and applications to cloud-based platforms. With Google Cloud now offering HPC solutions, HDF5 would also like to provide the ability for users to use and test DAOS in the cloud. With Google’s HPC toolkit being created for ease of use, a blueprint has been created to help users set up a simple environment with HDF5, DAOS, and the DAOS VOL connector.

2:20 p.m. – 2:40 p.m. Zarr as HDF5 Cloud Format? – Aleksandar Jelenak, The HDF Group – Video

Zarr is a fairly recent format for multidimensional data arrays specifically targeting storage systems with a key-value interface. Some scientific communities interested in implementing scalable cloud-native data analysis are considering Zarr as their chosen data format because of its straightforward implementation in cloud object stores. The HDF Group developed its own cloud-native HDF5 format, called the HSDS schema, at about the same time as Zarr. Only software developed by The HDF Group, HSDS, currently creates data in the HSDS schema. Since Zarr and the HSDS schema share the same design approach, it would be worthwhile to consider whether Zarr could serve as the cloud-native HDF5 format. The Zarr version 3 specification, currently under development, introduces the concept of extensions as a way to add more storage features. The goal of this session is to discuss the pros and cons of using Zarr v3 to formulate a new cloud-native HDF5 format. Some technical information will be provided with the aim of opening up discussion among all attendees.
2:40 p.m. – 2:55 p.m. Break – Poster Sessions – our poster session authors will be available for discussion right outside the meeting room
2:55 p.m. – 3:15 p.m. AirMettle: A Real-Time Smart Data Lake for Accelerated In-Place Analytics of Scientific Data – Donpaul C. Stephens, AirMettle, Inc. – Slide Deck | Video

In an era characterized by an exponential surge in data generation across scientific and industrial sectors, efficiently managing and processing extensive volumes of complex HDF5 and NetCDF4 data has become a pressing challenge. It is critical to understand that many analytics tasks demand processing only specific subsets of this data, often involving extractions, aggregations, or more sophisticated operations like re-scaling. Traditional retrieval of data for external processing is viable but not optimal due to speed and cost constraints, limiting researchers in the breadth of content they can effectively analyze.

AirMettle is spearheading an innovative approach to accelerate scientific data analysis. Our Real-Time Smart Data Lake (RT-SDL) seamlessly integrates massively parallel in-storage data processing within a robust, scalable, software-defined storage (SDS) framework. In this talk, we will elucidate how AirMettle empowers parallel in-place processing of NetCDF4 data, focusing on extractions and preliminary aggregations. We will also outline our ambitious endeavors to dynamically re-scale data and broaden our in-storage analytics capabilities to encompass a wider spectrum of HDF5 data.

Our work has garnered support from NSF and NOAA to date. We are actively collaborating on new initiatives with the HDF Group and the University of Alabama in Huntsville. We invite you to join us with your big data challenges; let’s explore how AirMettle can help make your data analysis lightning fast through intelligent storage solutions.
3:15 p.m. – 3:35 p.m. Toward Multi-Threaded Concurrency in HDF5 – John Mainzer, Lifeboat, LLC – Slide Deck | Video

Machine learning frameworks such as TensorFlow and PyTorch may load the datasets using multiple threads to reduce the overall I/O overhead. Because of the global lock on the HDF5 library, such multi-threaded applications are effectively limited to single thread I/O. While converting the HDF5 library to support multi-threading is a daunting task, there is a feasible path to enhance the HDF5 library and enable multi-threaded VOL connectors that can mitigate the existing bottleneck.
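The serializing effect of a library-wide lock can be demonstrated with a small simulation. This sketch does not call HDF5 at all: `library_lock` merely stands in for a global lock, `read_dataset` is a hypothetical placeholder for a read call, and the sleep stands in for I/O.

```python
import threading
import time

library_lock = threading.Lock()  # stands in for a library-wide lock
state_lock = threading.Lock()    # protects the counters below
active = 0
max_active = 0

def read_dataset(_):
    """Simulated read: all work happens while holding the library lock."""
    global active, max_active
    with library_lock:
        with state_lock:
            active += 1
            max_active = max(max_active, active)
        time.sleep(0.01)  # stands in for the actual I/O
        with state_lock:
            active -= 1

threads = [threading.Thread(target=read_dataset, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("threads:", len(threads), "max concurrent reads:", max_active)
```

However many threads are started, at most one is ever inside the simulated read, which is the bottleneck the paragraph above describes.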

Lifeboat, LLC received DOE Phase I and Phase II SBIR grants to enhance HDF5 to support multi-threaded VOL connectors and to create a VOL connector that uses multiple threads for concurrent I/O on HDF5 files. In our talk we will report on the progress we have made towards multi-threaded concurrency in HDF5.

3:35 p.m. – 3:55 p.m. PROV-IO+: A Provenance Framework for Scientific Data on HPC Systems – Runzhou Han, Iowa State University – Slide Deck | Video

Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges in I/O intensive HPC workflows due to their incompatible provenance models and/or system implementations. In this work, we propose an HDF5-friendly provenance framework for scientific data on HPC systems by leveraging the HDF5 provenance Virtual Object Layer (VOL). We evaluated PROV-IO+ on two HPC systems with four workflows from different domains: (1) an HDF5-based acoustic sensing workflow; (2) a synthetic workflow adapted from H5Bench; (3) a Graph Neural Network (GNN) workflow; and (4) a Large Language Model (LLM) workflow. Our experiments show that PROV-IO+ can address the provenance needs of the domain scientists effectively with reasonable performance.

3:55 p.m. – 5:00 p.m. State of HDF5, New Features, and Upcoming Roadmap – Dana Robinson, The HDF Group and Neil Fortner, The HDF Group – Video

5:00 p.m. Shuttle to SpringHill Suites

Thursday, August 17, 2023

Time (EDT) Session
8:00 a.m. – 8:55 a.m. Breakfast
8:55 a.m. – 9:00 a.m. Opening Address
9:00 a.m. – 9:45 a.m. Keynote: AI for Biodiversity: AI and Humans Combatting Extinction Together – Dr. Tanya Berger-Wolf, The Ohio State University – Slide Deck | Video

We are losing the planet’s biodiversity at an unprecedented rate and in many cases, we do not even have the basic numbers. Photographs, taken by field scientists, tourists, automated cameras, and incidental photographers, are the most abundant source of data on wildlife today. AI can turn massive collections of images into a high-resolution information source about living organisms, enabling scientific inquiry, conservation, and policy decisions. This is our vision of trustworthy AI for biodiversity conservation and for the new scientific field of imageomics.

A deployed system, Wildbook, a project of the tech-for-conservation non-profit Wild Me, is an example of how a data-driven, AI-enabled decision process becomes trustworthy by opening a wide diversity of opportunities for participation, supporting community-building, addressing the inherent data and computational biases, and providing transparent measures of performance. The community becomes the decision-maker, and AI scales the community, as well as the puzzle of data and solutions, to the planetary scale. Wildbook was recently chosen by UNESCO as one of the top 100 AI projects worldwide supporting the UN Sustainable Development Goals.

9:45 a.m. – 10:00 a.m. Break
10:00 a.m. – 10:20 a.m. Community sandbox for HDF5 compression testing – Mark C. Miller, Lawrence Livermore National Lab – Slide Deck | Video

As new compression libraries become available, they are often offered to the community in the form of shell command-line tools (e.g. gzip, xz, pkzip, 7zip, bzip, pigz) and may or may not come with the quality documentation necessary to use the underlying library effectively to, for example, create an HDF5 compression plugin. Next, when a compression plugin for HDF5 is developed using one of these libraries, it isn’t necessarily easy or clear how to measure performance or compare performance with the original tools. In addition, when combined with advanced features of HDF5 such as dataset chunking, chunk caching, partial I/O, fill values, etc., it can be difficult to understand the performance within the context of HDF5, identify and correct performance issues when they are discovered, or even document for users how to avoid performance pitfalls. The HDF5 community needs a compression sandbox which includes raw data sets (i.e. not HDF5-stored data but raw, binary data files), where plugin compression performance information is regularly maintained and compression libraries used in HDF5 plugins can be compared with the performance of their baseline, shell command-line counterparts. This talk will propose a new GitHub organization, web site, and associated repositories and compression testing workflows to build such a community-focused HDF5 compression testing sandbox and invite participants to begin contributing to it.
10:20 a.m. – 10:40 a.m. Revolutionizing I/O Performance: Lossy Compression Meets HDF5 – Dingwen Tao, Indiana University – Slide Deck | Video

In this presentation, I will delve into our recent projects funded by NSF, focusing on the development of lossy compression cyberinfrastructure, encompassing software and user community expansion. We capitalize on HDF5 capabilities, such as compression filters and the Virtual Object Layer (VOL) connectors, to elevate usability. Additionally, I will offer an overview of several studies in which we utilized our lossy compression in tandem with HDF5 to significantly enhance the I/O performance of HPC applications.
10:40 a.m. – 11:00 a.m. SEEKCommons – Gerd Heber, The HDF Group – Slide Deck | Video

What can Open Science (OS) contribute to the present and future of socio-environmental research and knowledge dissemination? Why is this question worth pursuing through a distributed network of STS researchers, OS practitioners, and socio-environmental researchers working with climate-impacted communities?

In order to address these questions, the “Socio-Environmental Knowledge Commons” (SEEKCommons) project will create and consolidate a network dedicated to building pathways for horizontal collaborations. Bio- and geo-physical studies of environmental dynamics have traditionally been siloed from social research. To create conditions for meaningful interdisciplinarity around social and environmental action, socio-environmental research projects will provide concrete data problems and datasets to be curated, documented, and widely shared with OS tools, while providing novel contexts to apply OS principles. Expert OS practitioners will contribute tools, methodologies, and ethical guidance on FAIR principles that, when translated and adapted to socio-environmental action research, can be well-understood and effectively used by community partners.
11:00 a.m. – 11:20 a.m. Lightning Sessions (5 minutes each)

Accelerating Parallel Write via Deeply Integrating Predictive Lossy Compression with HDF5 – Sian Jin, Indiana University – Slide Deck | Video

Lossy compression is one of the most efficient solutions to reduce storage overhead and improve I/O performance for HPC applications. However, existing parallel I/O libraries cannot fully utilize lossy compression to accelerate parallel write due to the lack of a deep understanding of compression-write performance. To this end, we propose to deeply integrate predictive lossy compression with HDF5 to significantly improve the parallel-write performance. Specifically, we propose analytical models to predict the time of compression and parallel write before the actual compression to enable compression-write overlapping. We also introduce an extra space in the process to handle possible data overflows resulting from prediction uncertainty in compression ratios. Moreover, we propose an optimization to reorder the compression tasks to increase the overlapping efficiency.
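The idea of reserving extra space against prediction error can be sketched as follows. The abstract's analytical models are not reproduced here; the predicted and actual sizes are made-up inputs, and `plan_offsets`, `overflows`, and the 10% slack are hypothetical choices for illustration.

```python
def plan_offsets(predicted_sizes, slack_percent=10):
    """Reserve one file region per rank, sized from its predicted
    compressed size plus a slack margin (rounded up)."""
    offsets, reserved, cursor = [], [], 0
    for size in predicted_sizes:
        room = size + (size * slack_percent + 99) // 100
        offsets.append(cursor)
        reserved.append(room)
        cursor += room
    return offsets, reserved

def overflows(actual_sizes, reserved):
    """Ranks whose actual compressed size exceeds the reserved region."""
    return [rank for rank, (actual, room) in enumerate(zip(actual_sizes, reserved))
            if actual > room]

predicted = [1000, 800, 1200]  # bytes per rank, predicted before compressing
offsets, reserved = plan_offsets(predicted)
actual = [1050, 900, 1100]     # bytes per rank, known only after compressing
print(offsets, reserved, overflows(actual, reserved))
```

Because offsets are fixed from predictions alone, each rank could begin writing as soon as its own compression finishes; only ranks reported by `overflows` would need separate handling.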

Efficiently utilizing HDF5 compression filter on Adaptive Mesh Refinement simulation – Daoce Wang, Indiana University – Slide Deck | Video

As supercomputers progress towards exascale capabilities, the computational intensity and data volume requiring storage and transmission are experiencing remarkable growth. The emergence of Adaptive Mesh Refinement (AMR) presents a viable solution to these twin challenges. Similarly, error-bounded lossy compression has proven to be one of the most effective strategies to manage the data volume issue. However, there has been limited exploration into how AMR and error-bounded lossy compression can be synergistically applied. To enhance the Input/Output (I/O) performance and usability further, we propose employing the HDF5 compression filter. But there exist obstacles in integrating the HDF5 filter with the AMR application. To this end, in this brief talk, we will illustrate how to proficiently deploy the HDF5 compression filter on AMR data.

Advanced Concepts and Issues with H5Z-ZFP – Mark C. Miller, Lawrence Livermore National Lab – Slide Deck | Video

The H5Z-ZFP filter has some advanced use cases that are worth highlighting. These include dealing with endian portability even though the ZFP library already produces an endian-portable stream, using HDF5 chunking to accommodate higher than 4D datasets even though the ZFP library currently supports a maximum of only 4 dimensions, reading and writing data that is already compressed in memory as part of the ZFP library’s compressed arrays, and understanding the interplay between ZFP chunklets and HDF5 chunking including HDF5 dataset fill values and partial I/O.
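One way to read the chunking point above: for a dataset with more than four dimensions, choose a chunk shape in which at most four dimensions have an extent greater than one. The helper below is a minimal sketch of that idea; `zfp_friendly_chunk` is a hypothetical name, and collapsing the leading dimensions is just one possible choice.

```python
def zfp_friendly_chunk(shape, max_dims=4):
    """Chunk shape with at most `max_dims` non-unit dimensions:
    leading dimensions are set to 1, trailing ones keep full extent."""
    n_unit = max(len(shape) - max_dims, 0)
    return (1,) * n_unit + tuple(shape[n_unit:])

print(zfp_friendly_chunk((3, 5, 16, 16, 32, 32)))
```

Which dimensions to collapse is an application decision; dimensions along which the data varies smoothly are usually the ones worth keeping inside a chunk.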

2023 European HDF User Group Meeting Announcement – Gerd Heber, The HDF Group – Slide Deck | Video

Can’t get enough HUGs? Join us at the upcoming 2023 European HDF User Group meeting, happening 19-21 September at DESY, Hamburg, Germany! It will be all about the future of data stored using HDF technologies, and taking a deep dive into long-term data accessibility, plugin availability, and usability. Be there or be square!

11:20 a.m. – 11:40 a.m. Supporting Sparse Data in HDF5 – Elena Pourmal, Lifeboat, LLC – Slide Deck | Video

Sparse data is common in many scientific fields, including physics, neutron and X-ray scattering, and mass spectrometry. In many use cases, only 0.1% to 10% of gathered data is of interest. HDF5, due to its proven track record and flexibility, remains the data format of choice. As the amount of data produced continues to grow due to higher instrument and detector resolution and sampling rates, there is a clear demand for efficient management of sparse data in HDF5. Adding support for sparse data will simplify data processing software and widen adoption of HDF5.

In our talk we will present proposed extensions to the HDF5 File format and public APIs to support sparse data in HDF5. The proposed sparse storage is agnostic to memory structures used to represent sparse data in RAM (e.g., sparse matrix), and provides storage savings and portability between different memory formats.
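To illustrate what "agnostic to memory structures" can mean in practice, the sketch below converts between a dense in-memory matrix and a coordinate (COO) list of defined values. It does not represent the proposed file format or API, which the abstract does not detail; `dense_to_coo` and `coo_to_dense` are hypothetical helper names.

```python
def dense_to_coo(matrix, fill=0):
    """Keep only the (row, column, value) triples that differ from `fill`."""
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)
            if v != fill]

def coo_to_dense(entries, shape, fill=0):
    """Rebuild a dense matrix of the given shape from COO triples."""
    dense = [[fill] * shape[1] for _ in range(shape[0])]
    for i, j, v in entries:
        dense[i][j] = v
    return dense

matrix = [[0, 0, 7],
          [0, 0, 0],
          [5, 0, 0]]
entries = dense_to_coo(matrix)
print(entries, coo_to_dense(entries, (3, 3)) == matrix)
```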
11:40 a.m. – 12:15 p.m. Lunch
12:15 p.m. – 1:00 p.m. Lunch Talk: A brief history of HDF5 – Mike Folk – Slide Deck | Video

HDF5 has endured and grown for a quarter century thanks to countless contributions from people, organizations, and applications. We will share our perspective on those 25 years. We will visit some drivers that led to creating HDF5, highlight some early adopters and applications that helped HDF5 to gain acceptance, look at some interesting and fun uses of HDF5, call out the people and organizations that have kept HDF5 alive and healthy for a generation through their guidance and support, and describe the evolution of HDF5 over time.
1:00 p.m. – 1:20 p.m. Object serialization with HDF5 – Mark C. Miller, Lawrence Livermore National Labs – Slide Deck | Video

In the early days of scientific computing (think Fortran era), most application data took the form of large arrays. These are easily stored in any of many different file formats, including HDF5. As scientific computing applications have evolved and grown in sophistication and design, large arrays still play a key role, but so do highly complex, pointer-linked, multiply nested data structures involving abstract data types. Any developer confronted with persisting these complex data structures to disk, especially in a way that is machine portable, ultimately seeks a body of code that is simple and easily adaptable as that data structure changes over the life of the application in which it is used. When that goal is pitted against the relatively deep documentation dive developers must take to fully understand what HDF5 is capable of and how best to apply it in any given circumstance, this often leads to the very naïve use of recursive traversal of data structures where tiny HDF5 datasets, groups, and attributes are emitted. What makes perfect sense from an HDF5 API perspective turns into a horrendous disaster from an I/O performance perspective. This talk will describe the basic use case, the problems with the naïve approach, and a much improved approach which, though more complex to code, is much more I/O performant.
1:20 p.m. – 1:40 p.m. The Open Standard for Particle-Mesh Data – Axel Huebl, Lawrence Berkeley National Laboratory – Slide Deck | Video
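The contrast between per-node output and packed output can be sketched without any HDF5 calls. The abstract does not describe the improved approach, so the packing scheme below (parallel arrays of node values and parent indices) is only one assumed possibility, and all names are hypothetical.

```python
class Node:
    """A small pointer-linked tree node."""
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def naive_object_count(node):
    """Naive recursive traversal: one tiny group plus one tiny
    dataset would be emitted for every node."""
    return 2 + sum(naive_object_count(child) for child in node.children)

def pack_tree(root):
    """Flatten the tree into two flat arrays: node values and parent
    indices (-1 for the root), in depth-first order."""
    values, parents = [], []
    stack = [(root, -1)]
    while stack:
        node, parent = stack.pop()
        index = len(values)
        values.append(node.value)
        parents.append(parent)
        for child in reversed(node.children):
            stack.append((child, index))
    return values, parents

tree = Node(1.0, [Node(2.0, [Node(4.0)]), Node(3.0)])
values, parents = pack_tree(tree)
print(naive_object_count(tree), "tiny objects vs 2 arrays:", values, parents)
```

The two flat arrays could each be written as a single large dataset, so the number of file objects stays constant as the tree grows.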

The Open Standard for Particle-Mesh Data (openPMD) is a metadata standard for tabular (particle/dataframe) and structured mesh data in science and engineering. We show the basic components of openPMD, its extensions to specific domains, and applications ranging from laser-plasma physics, particle accelerators, light sources, and astrophysics to imaging.

openPMD is implemented on top of portable, hierarchical data formats, especially HDF5 and ADIOS/ADIOS2. An extensive community ecosystem enables productive workflows for developers and users alike, spanning Exascale simulations, in-transit data processing, post-processing, 3D visualization, GPU-accelerated data analytics, and AI/ML. We will present the organization of this community, benefits and experience from supporting multiple data format backends, and future directions.

[1] Axel Huebl, Remi Lehe, Jean-Luc Vay, David P. Grote, Ivo F. Sbalzarini, Stephan Kuschel, David Sagan, Christopher Mayes, Frederic Perez, Fabian Koller, and Michael Bussmann. “openPMD: A meta data standard for particle and mesh based data,” DOI:10.5281/zenodo.591699 (2015)
[2] Homepage: https://www.openPMD.org
[3] GitHub Organization: https://github.com/openPMD
[4] Projects using openPMD: https://github.com/openPMD/openPMD-projects
[5] Reference API implementation: Axel Huebl, Franz Poeschel, Fabian Koller, and Junmin Gu. “openPMD-api 0.14.3: C++ & Python API for Scientific I/O with openPMD,” DOI:10.14278/rodare.1234 (2021)
Selected earlier presentations on openPMD:
[6] Axel Huebl, Rene Widera, Felix Schmitt, Alexander Matthes, Norbert Podhorszki, Jong Youl Choi, Scott Klasky, and Michael Bussmann. “On the Scalability of Data Reduction Techniques in Current and Upcoming HPC Systems from an Application Perspective,” ISC High Performance 2017: High Performance Computing, pp. 15-29, 2017. arXiv:1706.00522, DOI:10.1007/978-3-319-67630-2_2
[7] Franz Poeschel, Juncheng E, William F. Godoy, Norbert Podhorszki, Scott Klasky, Greg Eisenhauer, Philip E. Davis, Lipeng Wan, Ana Gainaru, Junmin Gu, Fabian Koller, Rene Widera, Michael Bussmann, and Axel Huebl. “Transitioning from file-based HPC workflows to streaming data pipelines with openPMD and ADIOS2,” in Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation, SMC 2021, Communications in Computer and Information Science (CCIS), vol. 1512, 2022. arXiv:2107.06108, DOI:10.1007/978-3-030-96498-6_6

1:40 p.m. – 2:00 p.m.

Connecting HDF5 to the Proactive Data Containers – Houjun Tang, Berkeley Lab (Slide Deck | Video)

The Proactive Data Containers (PDC) software provides an object-centric API and a runtime system with a set of data object management services. These services allow placing data in the memory and storage hierarchy, performing data movement asynchronously, and providing scalable metadata operations to find data objects. In this talk, we will discuss our PDC VOL connector implementation and the performance benefits it brings to HDF5 applications.

2:00 p.m. – 2:20 p.m.

Metadata Management to Support Scientific Inquiry – Jay Lofstead, Sandia National Laboratories (Slide Deck | Video)

IO libraries such as HDF5 let users attach attributes to various components within a file. Existing attribute management approaches can make it difficult to use these attributes efficiently, prompting users to develop supplementary solutions that augment the built-in facilities. These tools have developed a life of their own, but should be more tightly linked with these external metadata management systems, particularly for the long term.

This metadata is used to identify which files, or other storage containers, may contain data of interest, accelerating scientific inquiry. Skipping data that definitely does not contain the desired feature, and potentially limiting reads to just the areas that may, eliminates unnecessary data movement and can greatly increase search speed. These systems have been under development for many years, some of them commercially. Some have focused on general capabilities, while others have been customized to better support the data features of particular domains.

This talk will explore the features of various IO library attribute systems and the major generations of external metadata management tools, and survey some of the current efforts forging new ground. It also ties these together to look at efforts on long-term data identification.
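The pruning idea in this abstract (rule out containers that definitely cannot hold the feature of interest before reading any data) can be sketched with a tiny external catalog. Everything here is an assumption made for illustration: the file names, the per-file minimum and maximum of a "temperature" variable, and the `candidate_files` helper do not come from any specific metadata management tool.

```python
from typing import Dict, List, Tuple

# Hypothetical external catalog: file name -> (min, max) of a "temperature" variable.
catalog: Dict[str, Tuple[float, float]] = {
    "run_001.h5": (250.0, 290.0),
    "run_002.h5": (300.0, 340.0),
    "run_003.h5": (280.0, 310.0),
}


def candidate_files(catalog: Dict[str, Tuple[float, float]],
                    lo: float, hi: float) -> List[str]:
    """Return the files whose recorded [min, max] range overlaps [lo, hi].

    Files outside the range definitely do not contain matching values,
    so they never need to be opened or read.
    """
    return sorted(
        name
        for name, (vmin, vmax) in catalog.items()
        if vmax >= lo and vmin <= hi
    )


hits = candidate_files(catalog, 305.0, 320.0)
```

Only the files in `hits` would need to be opened for a query on the range 305 to 320; the catalog lookup itself moves no array data.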

2:20 p.m. – 2:40 p.m.

Towards Self-contained Metadata Search Capability for Self-describing File Formats – Wei Zhang, Lawrence Berkeley National Laboratory (Slide Deck | Video)

In this talk, we journey back to 2019, when we introduced an innovative Metadata Indexing and Querying Service (MIQS), which has since significantly altered our approach to handling metadata search within self-describing data formats like HDF5 and netCDF. We will delve into how MIQS managed to successfully eliminate the need for external Database Management Systems, thereby significantly reducing the time spent on index construction and memory footprint, while also providing up to 172kx improvement in search performance.

We will conduct a detailed review of MIQS, exploring its benefits and challenges in the current context. In doing so, we hope to stimulate fresh discussions and new perspectives about metadata search in self-describing data formats, not just through the lens of our past achievements but, more importantly, through the prism of today’s demands and tomorrow’s possibilities. This conversation will underscore the continued significance of MIQS and inspire the audience to envision its future potential.

2:40 p.m. – 2:55 p.m. Break

2:55 p.m. – 3:15 p.m.

LowFive: In Situ Data Transport for High-Performance Workflows – Tom Peterka, Argonne National Laboratory (Slide Deck | Video)

We describe LowFive, a new data transport layer based on the HDF5 data model, for in situ workflows. LowFive is implemented as an HDF5 VOL plugin. Executables using LowFive can communicate in situ (using in-memory data and MPI message passing), read and write traditional HDF5 files on physical storage, or combine the two modes. Minimal and often no source-code modification is needed for programs that already use HDF5. LowFive maintains deep copies or shallow references of datasets, configurable by the user. More than one task can produce (write) data, and more than one task can consume (read) data, accommodating fan-in and fan-out in the workflow task graph. LowFive supports data redistribution from n producer processes to m consumer processes. We demonstrate the above features in a series of experiments featuring both synthetic benchmarks and a representative use case from a scientific workflow, and we also compare with other data transport solutions in the literature.
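To make the n-to-m redistribution concrete, the sketch below computes which index ranges each producer would have to send to each consumer for a one-dimensional array split into even contiguous blocks. The even block decomposition and the `block_range` and `redistribution_plan` helpers are assumptions made for illustration; this is not LowFive's implementation, and no MPI communication is performed.

```python
from typing import List, Tuple


def block_range(rank: int, nprocs: int, total: int) -> Tuple[int, int]:
    """Half-open [start, stop) range owned by `rank` in an even block split.

    The first `total % nprocs` ranks each receive one extra element.
    """
    base, extra = divmod(total, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def redistribution_plan(n: int, m: int, total: int) -> List[Tuple[int, int, int, int]]:
    """List (producer, consumer, start, stop) for every non-empty overlap."""
    plan = []
    for p in range(n):
        p_start, p_stop = block_range(p, n, total)
        for c in range(m):
            c_start, c_stop = block_range(c, m, total)
            start, stop = max(p_start, c_start), min(p_stop, c_stop)
            if start < stop:
                plan.append((p, c, start, stop))
    return plan


# 12 elements written by 3 producers and read by 2 consumers.
plan = redistribution_plan(3, 2, 12)
```

In this example the middle producer owns elements 4 through 7 and has to split them between both consumers, which is the kind of overlap bookkeeping a redistribution layer handles on the application's behalf.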

3:15 p.m. – 3:25 p.m.

Lightning Sessions (5 minutes each)

Data Reduction for Flash-X Simulations – Rajeev Jain, Argonne National Lab (Slide Deck | Video)

Scientific simulations generate vast amounts of data, posing challenges in terms of storage, processing, and analysis. Compression technologies play a crucial role in addressing these challenges by reducing the size of simulation output while preserving essential scientific information. This talk provides an overview of compression technologies employed in the context of Flash-X, a widely used simulation code. Flash-X, a scalable, parallel, and modular code developed for multi-physics simulations, produces extensive datasets that can overwhelm storage systems and hinder efficient analysis. To mitigate this issue, Flash-X incorporates various compression techniques to reduce the data footprint while maintaining scientific fidelity. The talk will highlight some recent results with SZ3 (lossy) and ZFP (lossy and lossless) compression.

Evolving role of HDF5 at the upgraded Advanced Photon Source – Tejas Guruswamy, Argonne National Laboratory (Slide Deck | Video)

The Advanced Photon Source X-ray synchrotron at Argonne National Laboratory has now paused operation for its long-planned upgrade to a fourth-generation storage ring (APS-U). As part of the project, beamlines are also receiving significant upgrades to their detector and instrument capabilities, supporting both existing and new scientific techniques, and are projected to generate significantly more data at a much higher rate. The HDF5 file format plays a key role in our computing and data strategy to handle this. I will share some of the workflows currently in use and planned at the APS for X-ray data using HDF and related tools, including heavy use of EPICS, areaDetector, and BlueSky as data sources; APS-DM (Data Management) and Globus for data management; and both local resources and Argonne Leadership Computing Facilities for analysis and long-term archiving. Finally, I will highlight some of the challenges we have encountered along the way, and our priorities for future development.

3:25 p.m. – 5:00 p.m.

Community Discussion – Dana Robinson, The HDF Group, and Neil Fortner, The HDF Group (Slide Deck | Video)

5:00 p.m. Shuttle to SpringHill Suites


6:00 p.m. Included group dinner at Bravo Italian Kitchen Lennox, 1803 Olentangy River Road, Columbus OH 43212.

Friday, August 18, 2023

Time (EDT) Session
8:00 a.m. – 9:00 a.m. Breakfast
9:00 a.m. – 10:20 a.m. Intro to HDF5 – Glenn Song, The HDF Group, and Gerd Heber, The HDF Group (Slide Deck)
10:20 a.m. – 10:40 a.m. Break
10:40 a.m. – 12:00 p.m. Advanced HDF5 – Aleksandar Jelenak, The HDF Group
12:00 p.m. – 1:00 p.m. Lunch
1:00 p.m. – 2:20 p.m. HDF5 on Cloud HPC Tutorial – Quincey Koziol, Amazon, and Scot Breitenfeld, The HDF Group
2:20 p.m. – 2:40 p.m. Break
2:40 p.m. – 4:00 p.m. Highly Scalable Data Service (HSDS) Tutorial – John Readey, The HDF Group
4:00 p.m. Shuttle to SpringHill Suites