In this episode of Call the Doctor, Gerd Heber addresses a growing need in high-performance computing: verifiable data provenance. He demonstrates that HDF5 is capable of supporting tamper-evident blockchain patterns by interlinking datasets with SHA-256 hashes.
The second half of the session introduces a suite of low-level tools from the HDF5 SHINES project. h5explain and h5markers leverage machine-readable GNU poke “pickles” to map internal HDF5 structures directly onto binary data. This allows for interactive exploration and forensics—scanning for B-trees and object headers—even when the file is too corrupted for the standard HDF5 library to open. Finally, Gerd proposes a “Golden Copy” documentation workflow, where machine-readable specifications are paired with YAML sidecars to automatically generate perfectly synchronized, human-readable manuals.
Relevant Links
- HDF5 Pickles & GNU poke Specification: https://github.com/HDFGroup/hdf5-pickles
- The example used: https://github.com/HDFGroup/hdf-clinic/tree/main/2026-05-05
Topics Covered
- HDF5 Blockchain Design: Using 1D datasets, attributes, and SHA-256 hashes to create tamper-evident (though not tamper-proof) data structures.
- Machine-Readable Specifications: How the HDF5 SHINES project uses GNU poke “pickles” to formalize the HDF5 byte-level layout.
- h5explain & h5markers: Tools for low-level file forensics, allowing users to navigate objects (`ls`, `cd`, `info`) without the HDF5 library.
- Binary Forensics: Using magic markers (e.g., `TREE`, `BTHD`) to jump-start investigations into corrupt or unopenable files.
- Automated Documentation: Merging machine-readable pickles with YAML prose to generate synchronized Markdown specifications.
Chapter List
0:00 – Introduction
0:26 – Designing a Tamper-Evident Blockchain in HDF5
6:44 – Implementation: Datasets, Attributes, and Opaque Types
15:12 – Machine-Readable File Specifications (HDF5 Pickles)
23:39 – Tool Preview: h5explain Interactive Byte Explorer
26:59 – h5markers: Scanning Large Files for Metadata Signatures
37:09 – Decoding Object Headers and Attributes
44:35 – The “Golden Copy” Documentation Strategy
50:56 – Q&A: Cloud-Friendly HDF5 Detection
Summary
Gerd Heber begins by addressing a frequent technical challenge: ensuring data integrity in event logs. He demonstrates that HDF5 is well suited for this via a tamper-evident blockchain implementation. By using one-dimensional arrays for payloads and interlinking them with SHA-256 hashes stored as opaque data types, users can detect accidental or intentional modifications to their data. Gerd clarifies that while this structure is tamper-evident, it is not “tamper-proof,” as any user with write access to the file could theoretically recalculate the hashes.
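The chaining logic described above can be sketched in a few lines of pure Python. This is a minimal illustration of the tamper-evident pattern, not Gerd's actual implementation: in the episode, payloads live in 1D HDF5 datasets and the SHA-256 hashes are stored as opaque-typed data, whereas here everything is kept in memory. The function names (`chain_hash`, `build_chain`, `verify_chain`) and the all-zero genesis hash are illustrative assumptions.

```python
import hashlib

def chain_hash(prev_hash: bytes, payload: bytes) -> bytes:
    """Hash the previous block's hash together with the new payload."""
    return hashlib.sha256(prev_hash + payload).digest()

def build_chain(payloads):
    """Return the per-block hashes for a sequence of variable-size payloads."""
    hashes = []
    prev = b"\x00" * 32  # genesis: an all-zero "previous hash"
    for p in payloads:
        prev = chain_hash(prev, p)
        hashes.append(prev)
    return hashes

def verify_chain(payloads, hashes):
    """Tamper-evident check: recompute every link and compare."""
    prev = b"\x00" * 32
    for p, h in zip(payloads, hashes):
        prev = chain_hash(prev, p)
        if prev != h:
            return False  # checksum mismatch: evidence of modification
    return True

blocks = [b"event-1", b"event-2", b"event-3"]
hashes = build_chain(blocks)
assert verify_chain(blocks, hashes)
# Modifying any block breaks every subsequent link:
assert not verify_chain([b"event-1", b"EVENT-2", b"event-3"], hashes)
```

Note that this only makes tampering evident, as Gerd stresses: anyone who can rewrite the file can also rerun `build_chain` and replace the stored hashes.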
The focus then shifts to the HDF5 SHINES project’s latest milestone: a machine-readable specification of the HDF5 file format. Using GNU poke, the team has “pickled” the HDF5 format, allowing tools to understand the file’s byte-level layout without relying on the HDF5 library. This has led to the creation of h5explain, an interactive explorer that allows users to navigate the binary structure of a file using familiar commands like ls and cd.
Gerd also showcases h5markers, a high-performance tool that scans linear byte streams for “magic markers” like B-tree signatures. This is particularly useful for forensic investigations into corrupt files where the superblock may be missing or damaged. By identifying these markers, developers can establish a starting point for data recovery.
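The core idea behind this kind of scan can be sketched in pure Python. This is a hypothetical illustration, not the h5markers implementation: it looks for a few real HDF5 metadata signatures (`TREE` for version-1 B-tree nodes, `BTHD` for version-2 B-tree headers, `OHDR` for version-2 object headers, `FRHP` for fractal heap headers) in a linear byte stream, reading in chunks and keeping a small overlap so a signature straddling a chunk boundary is not missed. The function name and chunking scheme are assumptions for illustration.

```python
import io

# A few real HDF5 metadata signatures (all 4 bytes long).
MARKERS = (b"TREE", b"BTHD", b"OHDR", b"FRHP")

def scan_for_markers(stream, chunk_size=1 << 20):
    """Yield (offset, marker) pairs found in a linear byte stream."""
    overlap = max(len(m) for m in MARKERS) - 1  # carry-over between chunks
    consumed = 0        # bytes of the stream read so far
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        buf_start = consumed - len(tail)  # stream offset of buf[0]
        hits = []
        for marker in MARKERS:
            i = buf.find(marker)
            while i != -1:
                hits.append((buf_start + i, marker))
                i = buf.find(marker, i + 1)
        yield from sorted(hits)
        consumed += len(chunk)
        tail = buf[-overlap:]  # keep the last few bytes for boundary hits

sample = b"\x00" * 10 + b"TREE" + b"\x01" * 5 + b"BTHD"
print(list(scan_for_markers(io.BytesIO(sample), chunk_size=7)))
# [(10, b'TREE'), (19, b'BTHD')]
```

A real tool would then decode the structures found at those offsets; here the offsets alone give the "starting point for data recovery" the summary mentions.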
Finally, the session concludes with a look at the future of HDF5 documentation. Gerd proposes using the machine-readable pickles as the “Golden Copy.” By pairing these pickles with YAML sidecars containing human-readable prose, The HDF Group can automatically generate Markdown documentation. This ensures that the technical specification and the manual remain perfectly synchronized, reducing the risk of undocumented fields or errors in the formal specification.
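The "Golden Copy" merge step can be sketched as follows. This is a hypothetical illustration of the workflow, not The HDF Group's tooling: the field list stands in for structure extracted from a poke pickle, and the prose dictionary stands in for a parsed YAML sidecar (a real pipeline would parse actual poke and YAML sources).

```python
# Stand-in for fields extracted from a machine-readable pickle (names/types
# chosen for illustration, not taken from the actual HDF5 pickles).
spec_fields = [
    ("signature", "byte[8]"),
    ("superblock_version", "uint8"),
]

# Stand-in for a parsed YAML sidecar holding the human-readable prose.
sidecar_prose = {
    "signature": "Magic number identifying an HDF5 file.",
    "superblock_version": "Version of the superblock layout.",
}

def render_markdown(fields, prose):
    """Join spec structure with sidecar prose into a Markdown table."""
    lines = ["| Field | Type | Description |", "|---|---|---|"]
    for name, ftype in fields:
        # A missing sidecar entry is flagged rather than silently dropped,
        # which is how undocumented fields become visible.
        lines.append(f"| {name} | {ftype} | {prose.get(name, '(undocumented)')} |")
    return "\n".join(lines)

print(render_markdown(spec_fields, sidecar_prose))
```

Because the table's structure comes straight from the machine-readable source, the generated manual cannot drift out of sync with the specification.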
Transcript
-
[00:00] Gerd Heber: Welcome. Thanks for coming. Before I start, are there any HDF-related questions? No? Okay. We have two main topics today—actually, a third occurred to me that I’ll share if we have time.
-
[00:26] Gerd Heber: I want to tell you about something that came up recently in a project involving event logs and provenance data. Someone asked: “If we want to detect potential changes or tampering, how would we do that in HDF5?”
-
[01:23] Gerd Heber: Here is my cartoon image of a blockchain. You can think of these blocks as event logs. There is no assumption that these blocks must be a fixed size; they can vary.
-
[03:08] Gerd Heber: You always have the last hash—the hash of all the blocks that came before. You embed the previous hash into the new block and then calculate the hash on top of that. This interlinks the blocks.
-
[04:32] Gerd Heber: This creates a tamper-evident data structure. It will be evident if this is modified because there will be a checksum mismatch. However, this is not “tamper-proof.” It is only “tamper-evident.” Anyone with write access could theoretically recalculate the whole chain.
-
[06:44] Gerd Heber: To implement this in HDF5, we use a group for an individual chain. I opted to store both the offset and the length of the payloads for better file optics. This is how I represent that blockchain in HDF5.
-
[15:21] Gerd Heber: In our HDF SHINES project, we are working on creating a machine-readable file format specification. One of those description languages is called poke.
-
[23:39] Gerd Heber: Now that we have these “pickles” (the poke specs), we can apply these mappings to explore the contents of HDF5 files at a very low level. That is what our tool, h5explain, does.
-
[26:59] Gerd Heber: We also have h5markers. If you have a 100GB file and you don’t know where to start—maybe the superblock is corrupted—you can use this to scan the linear byte stream. It looks for magic markers like ‘TREE’ or ‘BTHD’ to find metadata entry points without needing the HDF5 library.
-
[31:31] Gerd Heber: In h5explain, our interactive byte-level explorer, we can jump to the root group, the superblock, or `cd` into a group or dataset.
-
[37:09] Gerd Heber: I am now in the “chain” group. It tells me there are eight header messages. If I run `info`, it decodes those header messages and tells us exactly what they are—attributes like schema version, hash algorithm, and block count.
-
[44:35] Gerd Heber: It would be logical to say: we created a machine-readable specification, now let’s turn that around. I want to generate the human-readable specification from the machine-readable version.
-
[46:38] Gerd Heber: The trick is the YAML sidecar. The pickle is code; it’s hard for a human to read. By joining that code with prose stored in a YAML file, we can generate a beautiful Markdown specification that is guaranteed to be technically accurate because it’s generated from the code. This is our “Golden Copy.”
-
[50:56] Gerd Heber: For cloud-optimized files, you specifically need Superblock Version 2 or 3 to support paged allocation. You can actually tell if a file is cloud-optimized just by looking at the superblock and checking the version number.
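The check Gerd describes can be sketched in a few lines. This is a minimal sketch, assuming the documented HDF5 format signature `\x89HDF\r\n\x1a\n` (which may sit at offset 0 or at offsets 512, 1024, 2048, ...) and the fact that the superblock version number is the byte immediately following it; the function names are illustrative, and a robust tool would of course validate more of the superblock than this.

```python
SIGNATURE = b"\x89HDF\r\n\x1a\n"  # 8-byte HDF5 format signature

def superblock_version(buf: bytes):
    """Return (offset, version) of the first superblock found, or None."""
    offset = 0
    while offset + len(SIGNATURE) < len(buf):
        if buf[offset:offset + 8] == SIGNATURE:
            return offset, buf[offset + 8]  # version byte follows the signature
        # The signature may only appear at byte 0 or power-of-two offsets >= 512.
        offset = 512 if offset == 0 else offset * 2
    return None

def looks_cloud_optimized(buf: bytes) -> bool:
    """Paged allocation requires superblock version 2 or 3 (per the episode)."""
    found = superblock_version(buf)
    return found is not None and found[1] in (2, 3)
```

With only the first kilobyte or so of a file in hand (e.g., a ranged GET from object storage), this is enough to decide whether the file could be cloud-optimized.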