What happens when you can’t use the HDF5 library to read a file? In this session of Call the Doctor, we dive into HDF5 Pickles—a new machine-readable specification that allows for low-level binary analysis and repair using GNU poke.
GitHub Repo: https://github.com/HDFGroup/hdf5-pickles
Discuss this session: https://forum.hdfgroup.org/t/13729
Official Documentation: https://support.hdfgroup.org/documentation/hdf5/latest/index.html
HDF5 files are usually accessed through high-level APIs, but sometimes the clearest way to understand a file is to inspect its on-disk structures directly. In this HDF Clinic, we will walk through a hands-on tutorial using HDF5 pickles with GNU poke to explore HDF5 metadata at the byte level. Starting from a sample file, we will launch poke, load the pickles, map the superblock, locate the root object header, decode object header messages, verify checksums, and follow links to dataset metadata. Along the way, we will examine how dataspaces, datatypes, and layout information are represented in the file, and how GNU poke turns raw bytes into readable, queryable structures.
The session will also show how these pickles provide an executable, machine-readable view of HDF5 that is useful for inspection, validation, debugging, and learning the format itself. We will finish by demonstrating a small write-through edit on a disposable copy of an HDF5 file and discussing the care needed when changing metadata directly, including dependent fields such as checksums. This clinic is aimed at developers, advanced users, and anyone who wants a practical introduction to working with HDF5 files from the inside out using GNU poke.
Topics Covered:
- Binary introspection of HDF5 superblocks
- GNU poke for technical communication
- Checksum verification (lookup3 hash)
- Machine-readable file format specifications
- Security fuzzing and file repair strategies
Chapter List for this Video:
00:00:00 – Introduction & HDF5 Pickles Overview
00:01:41 – The HDF5 Pickles GitHub Repository
00:03:38 – Setting up GitHub Codespaces for GNU poke
00:07:23 – Starting the Tutorial: Binary Introspection
00:12:15 – Loading the HDF5 Superblock in Poke
00:15:40 – Accessing the Root Group Object Header
00:18:50 – Manually Validating Header Checksums
00:21:40 – Decoding Link Info and Group Messages
00:25:35 – Inspecting Dataset Messages (Dataspace, Layout)
00:30:15 – Creating a New HDF5 File from Scratch in Memory
00:39:35 – Exploring the Pickle Source Code (.pk files)
00:43:50 – Use Case: Security Fuzzing & File Repair
00:46:25 – How AI (Claude) helped build these specifications
00:48:40 – NSF Support and the Future of HDF SHINES
00:49:55 – Closing Q&A
Summary
Introduction to HDF5 Pickles
The HDF Group has introduced a repository called HDF5 Pickles. These are machine-readable specifications of the HDF5 file format written for the GNU poke framework. The goal is to provide a way to analyze, parse, and encode HDF5 structures at the bit and byte level.
Environment Setup
The demonstration uses GitHub Codespaces with an Arch Linux container to ensure access to the latest version of GNU poke (version 4.3). This setup allows developers to immediately begin introspecting HDF5 files without local installation.
Inspecting the HDF5 Superblock
By loading the common, super_block, and OHDR (Object Header) pickles, a user can map the start of an HDF5 file to the super_block structure [12:15].
- Example command: super_block @ 0#B
This reveals the file signature, version, size of offsets, and the critical address of the root group’s object header.
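For readers who want to see what the pickle is mapping, here is a rough Python sketch (not from the session, which uses poke) of decoding the same version-2 superblock fields by hand. Field offsets follow the HDF5 file format specification; the function name and returned dictionary are illustrative.

```python
import struct

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def parse_superblock_v2(buf):
    """Parse a version-2 HDF5 superblock from raw bytes.

    Layout per the HDF5 file format spec: signature (8 bytes), version (1),
    size of offsets (1), size of lengths (1), file consistency flags (1),
    then four addresses of 'size of offsets' bytes each (base address,
    superblock extension, end of file, root group object header), followed
    by a 4-byte lookup3 checksum. All integers are little-endian.
    """
    if buf[:8] != HDF5_SIGNATURE:
        raise ValueError("not an HDF5 file: bad signature")
    version = buf[8]
    if version != 2:
        raise ValueError(f"expected superblock version 2, got {version}")
    size_of_offsets, size_of_lengths, flags = buf[9], buf[10], buf[11]
    addrs, off = [], 12
    for _ in range(4):
        addrs.append(int.from_bytes(buf[off:off + size_of_offsets], "little"))
        off += size_of_offsets
    (checksum,) = struct.unpack_from("<I", buf, off)
    base, ext, eof, root_ohdr = addrs
    return {
        "version": version,
        "size_of_offsets": size_of_offsets,
        "size_of_lengths": size_of_lengths,
        "flags": flags,
        "base_address": base,
        "extension_address": ext,
        "eof_address": eof,
        "root_object_header_address": root_ohdr,
        "stored_checksum": checksum,
    }
```

The poke pickle does the equivalent declaratively: mapping super_block at offset 0#B interprets these bytes as a typed struct.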
Verifying Header Integrity
HDF5 uses checksums for its internal structures to ensure data integrity. In this session, we demonstrated calculating the lookup3 hash of an object header in memory and comparing it against the checksum stored in the file [18:50]. This makes it possible to verify the file’s integrity independently of the HDF5 library’s internal logic.
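The hash in question is Bob Jenkins’s lookup3 ("hashlittle"). Below is a Python transcription of that routine (a sketch, assuming little-endian byte reads; the `_mix`/`_final` helper names are ours). The verification then amounts to hashing the header bytes that precede the checksum field and comparing against the stored 4-byte value; check the initial value convention against your own file.

```python
M = 0xFFFFFFFF  # 32-bit mask

def _rot(x, k):
    """32-bit left rotation."""
    return ((x << k) | (x >> (32 - k))) & M

def _mix(a, b, c):
    a = (a - c) & M; a ^= _rot(c, 4);  c = (c + b) & M
    b = (b - a) & M; b ^= _rot(a, 6);  a = (a + c) & M
    c = (c - b) & M; c ^= _rot(b, 8);  b = (b + a) & M
    a = (a - c) & M; a ^= _rot(c, 16); c = (c + b) & M
    b = (b - a) & M; b ^= _rot(a, 19); a = (a + c) & M
    c = (c - b) & M; c ^= _rot(b, 4);  b = (b + a) & M
    return a, b, c

def _final(a, b, c):
    c ^= b; c = (c - _rot(b, 14)) & M
    a ^= c; a = (a - _rot(c, 11)) & M
    b ^= a; b = (b - _rot(a, 25)) & M
    c ^= b; c = (c - _rot(b, 16)) & M
    a ^= c; a = (a - _rot(c, 4)) & M
    b ^= a; b = (b - _rot(a, 14)) & M
    c ^= b; c = (c - _rot(b, 24)) & M
    return c

def hashlittle(data, initval=0):
    """Jenkins lookup3 'hashlittle' over a byte string."""
    remaining = len(data)
    a = b = c = (0xDEADBEEF + remaining + initval) & M
    i = 0
    while remaining > 12:  # mix all but the last 1..12 bytes
        a = (a + int.from_bytes(data[i:i + 4], "little")) & M
        b = (b + int.from_bytes(data[i + 4:i + 8], "little")) & M
        c = (c + int.from_bytes(data[i + 8:i + 12], "little")) & M
        a, b, c = _mix(a, b, c)
        i += 12
        remaining -= 12
    if remaining == 0:  # zero-length input: no final mixing
        return c
    tail = data[i:] + b"\x00" * (12 - remaining)  # zero-pad the last block
    a = (a + int.from_bytes(tail[0:4], "little")) & M
    b = (b + int.from_bytes(tail[4:8], "little")) & M
    c = (c + int.from_bytes(tail[8:12], "little")) & M
    return _final(a, b, c)
```

Zero-padding the last block is equivalent to the byte-by-byte switch in the C reference, because missing bytes simply add zero.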
Creating HDF5 Files from Scratch
Beyond inspection, the pickles allow for the construction of HDF5 primitives in memory. By defining a memory buffer (IO space), you can initialize a super block and root group header, apply the necessary checksums, and “save” the buffer as a valid .h5 file that can be opened in standard viewers like H5Web [30:15].
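The same buffer-then-save workflow can be sketched outside poke. The fragment below (illustrative Python, not from the session) assembles the 48 bytes of a version-2 superblock in memory; the checksum argument is a placeholder, and a real file would additionally need the lookup3 hash of the first 44 bytes there, plus an actual root group object header at the advertised address.

```python
def build_superblock_v2(root_ohdr_addr, eof_addr, checksum=0):
    """Assemble the 48 bytes of a version-2 HDF5 superblock in memory.

    NOTE: `checksum` is a placeholder. For a valid file it must be the
    Jenkins lookup3 hash of the first 44 bytes, and a root group object
    header must actually exist at `root_ohdr_addr`.
    """
    undefined = 0xFFFFFFFFFFFFFFFF          # HDF5's 'undefined address' sentinel
    buf = bytearray()
    buf += b"\x89HDF\r\n\x1a\n"             # signature
    buf += bytes([2, 8, 8, 0])              # version 2, 8-byte offsets/lengths, flags 0
    buf += (0).to_bytes(8, "little")        # base address
    buf += undefined.to_bytes(8, "little")  # no superblock extension
    buf += eof_addr.to_bytes(8, "little")   # end-of-file address
    buf += root_ohdr_addr.to_bytes(8, "little")
    buf += checksum.to_bytes(4, "little")
    return bytes(buf)
```

In poke, the same result falls out of initializing the super_block struct in an in-memory IO space and issuing the save command.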
Transcript
[00:00:00] Gerd Heber: Okay, let’s get started. Welcome, everybody, to this next episode of our HDF Clinic. Before I begin, are there any questions about HDF5 in general or about using it in your applications?
[Pause]
[00:00:26] Gerd Heber: No? Okay, then feel free to interrupt me at any time should you have a question, even if it’s unrelated to what I’m about to show.
Today is going to be extra fun. This is something I’ve wanted to do for a long time, but life gets in the way. It’s even better when someone else does the work! In this case, all the credit goes to my colleague, Vailin, who did this work. I just have the privilege of presenting it to you. She is still working on it, but it is far enough along to be worth highlighting.
[00:01:41] Gerd Heber: This is a repository on GitHub under The HDF Group account called HDF5 Pickles. We might rename it in the future, but that’s where we started. I think of it as a machine-readable specification for HDF5 analysis, parsing, and encoding.
How do you tool that? In this case, there is a framework called GNU poke. The author of GNU poke named the modules “pickles,” which is why we call this repository HDF5 Pickles. We are using poke to describe on-disk binary structures corresponding to the HDF5 file format.
Today, I’m going to show you how to use GNU poke—without using the HDF5 library in any shape or form—to create a minimal, empty HDF5 file containing a super block and a root group.
[00:03:38] Gerd Heber: To make it easy, you can use GitHub Codespaces. I’ll create a container instance now. This isn’t process-intensive, so two cores are fine. I’m using Arch Linux because it has the latest version of Poke (4.3), whereas many other distros are a bit behind.
[00:07:23] Gerd Heber: Now we have a bash shell where we can invoke GNU poke. We’ll take an HDF5 file, look at it through the eyes of poke, modify it, and finally create a new file from scratch.
[00:08:56] Gerd Heber: Once poke is launched, we need to load the modules (pickles) Vailin developed.
- load common: Common functions for converting byte arrays.
- load super_block: For parsing different versions of the super block.
- load OHDR: To look at object headers.
- load lookup3: For Jenkins lookup3 hashes used in checksums.
[00:12:15] Gerd Heber: First, let’s look at the super block. In poke speak, super_block is a predefined struct. We define a variable SB and point it to the byte offset 0.
If you are fluent in the file format specification, you know an HDF5 file starts with the signature \x89HDF\r\n\x1a\n. We can see the version, size of offsets, size of lengths, and most importantly, the address of the object header of the root group.
[00:18:50] Gerd Heber: One of the most useful things we can do is check the integrity. We can calculate the lookup3 hash on a byte range of the object header and compare it to the checksum stored in the file. As we see here, they match. This verifies the integrity of the structure independently of the HDF5 library.
[00:21:40] Gerd Heber: Now, we can look at “messages.” In HDF5, data objects like groups are interpreted through these messages. We use a method called get_messages. For this group, we see a link_info message, a group message (confirming this object is a group), and an actual link message. Since this file has a dataset, the link message tells us the name of the link and the destination address of that dataset.
[00:27:00] Gerd Heber: If we jump to the dataset’s address, we see different messages: dataspace (telling us the rank and dimensions), datatype (fixed-point integer), and fill_value. We even see a pipeline message because this dataset uses Gzip compression. We are deciphering all of this at the byte level.
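The dataspace decoding step described above can be sketched in Python (illustrative only; it assumes a version-2 dataspace message and 8-byte lengths, whereas in the session the OHDR pickle does this declaratively):

```python
def parse_dataspace_v2(buf, size_of_lengths=8):
    """Decode a version-2 dataspace message: rank, type, and dimensions.

    Layout (HDF5 spec): version (1 byte), dimensionality (1), flags (1),
    type (1: 0=scalar, 1=simple, 2=null), then `dimensionality` dimension
    sizes of `size_of_lengths` bytes each; if flags bit 0 is set, the same
    number of maximum-dimension sizes follows.
    """
    version, rank, flags, space_type = buf[0], buf[1], buf[2], buf[3]
    if version != 2:
        raise ValueError(f"expected dataspace message version 2, got {version}")

    def read_dims(offset):
        vals = []
        for _ in range(rank):
            vals.append(int.from_bytes(buf[offset:offset + size_of_lengths],
                                       "little"))
            offset += size_of_lengths
        return vals, offset

    dims, off = read_dims(4)
    maxdims = None
    if flags & 0x1:                      # maximum dimensions stored?
        maxdims, off = read_dims(off)
    return {"rank": rank, "type": space_type, "dims": dims, "maxdims": maxdims}
```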
[00:30:15] Gerd Heber: Now, let’s create a file from scratch. We start poke without a file argument. We’ll use a module that constructs HDF primitives in memory. We start with a memory buffer, define a super block version 2, and manually set the offsets and lengths.
We then construct a root group by creating a message prefix and adding a link_info message. Once the structures are prepped and checksummed in memory, we use the save command to write that memory buffer to a file. That file is now a valid HDF5 file that can be opened in H5Web.
[00:43:50] Gerd Heber: This is just the beginning. Vailin is currently working on B-trees and chunked dataset indexes. Once this is complete, it becomes a powerful tool for analysis and repair. If someone has a file they can’t open, you can go in and edit individual fields to attempt a restoration.
It’s also great for security research. You can use this to “fuzz” the format—creating specifically targeted “broken” files to see if the HDF5 library parsing layer recovers correctly.
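A minimal sketch of that idea, in Python rather than poke (the helper name and defaults are invented for illustration): copy the file, flip one byte at a chosen offset, and feed the mutated copy to the parser under test.

```python
import pathlib
import shutil

def corrupt_copy(src, dst, offset, xor_mask=0xFF):
    """Copy src to dst, then XOR one byte at `offset` in the copy.

    A minimal targeted-fuzzing probe: every other byte is untouched,
    so any parser failure can be attributed to the mutated field.
    """
    shutil.copyfile(src, dst)
    p = pathlib.Path(dst)
    data = bytearray(p.read_bytes())
    data[offset] ^= xor_mask
    p.write_bytes(bytes(data))
```

Knowing the format via the pickles lets you pick `offset` precisely—say, the superblock version byte—instead of mutating blindly.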
[00:46:25] Robert Seip: Gerd, do you think in the future you’d have tooling that automatically generates these pickle type infos as the HDF5 codebase evolves?
[00:46:40] Gerd Heber: Believe it or not, Vailin used Claude (AI) to help create these. She gave Claude the file format specification and guided it to come up with the specifications. She’s an experienced library developer, so she was able to steer the AI and correct it when needed. So, AI is already part of this process.
[00:48:40] Gerd Heber: We are glad to do this under our HDF SHINES project with support from the National Science Foundation. If you want to make HDF5 safer and more secure, you need tools like this to automate auditing and analysis.
[00:49:55] Gerd Heber: Any final questions? No? All right, then. Try it out for yourself in the repository. Thanks for coming, and see you next time.