In this “Call the Doctor” session, Aleksandar Jelenak (Senior Informatics Architect) and a user engage in a live troubleshooting discussion. They dive into new h5diff features for cleaner regression testing and share critical benchmarks on why standard S3 mount points might be costing you more than you think.
Relevant Links
- GitHub Issue #6364: Exclude attributes in h5diff
- GitHub PR #6372: New “Exclude Attribute” Option
- AWS Tech Blog: S3 File System Launch
- Industry Analysis: Why S3 is not a filesystem
- Hands-on Learning: HDF5 Tutorial Repository
Topics Covered
- Live PR Review: Walking through the “exclude attribute” feature in h5diff.
- Regression Testing: Managing volatile metadata (like timestamps) in automated tests.
- S3 Performance Realities: Benchmarking Amazon’s mount points vs. native HDF5 drivers.
- Cloud Cost Optimization: Understanding the 32KB billing minimum.
- Training Tools: Using GitHub Codespaces for zero-install HDF5 learning.
Chapter List
0:00 Introduction
0:23 h5diff: Excluding attributes for regression testing
6:10 Python vs. CLI tools for cloud access
7:43 NASA Goddard: Managing orbital granules
10:45 Benchmarking S3 mount points
12:44 The 32KB pricing trap in S3
16:46 Strategies for Cloud-Optimized HDF5 (COHDF)
20:42 Consolidating metadata with h5repack
23:37 GitHub Codespaces & Interactive Tutorials
Summary
The session began with a look at PR #6372, which introduces the ability to exclude specific attributes by name in h5diff. This is particularly useful for regression testing where files may contain volatile metadata like “current time,” which causes unnecessary diff failures.
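To make the regression-testing idea concrete, here is a minimal sketch in Python with h5py rather than h5diff itself; it does not reproduce the option added in the PR. The helper names (`collect_attrs`, `attr_differences`), the attribute name `creation_time`, and the in-memory test files are all hypothetical and chosen only for illustration.

```python
import h5py
import numpy as np

def collect_attrs(h5file, exclude_names=()):
    """Map (object_path, attr_name) -> value, skipping excluded attribute names."""
    found = {}

    def visit(path, obj):
        for name, value in obj.attrs.items():
            if name not in exclude_names:
                found[(path, name)] = value

    visit("/", h5file)  # visititems() does not call back for the root group itself
    h5file.visititems(lambda name, obj: visit("/" + name, obj))
    return found

def attr_differences(file_a, file_b, exclude_names=()):
    """Return the sorted (path, name) keys whose attributes differ or are missing."""
    a = collect_attrs(file_a, exclude_names)
    b = collect_attrs(file_b, exclude_names)
    diffs = []
    for key in sorted(set(a) | set(b)):
        if key not in a or key not in b or not np.array_equal(a[key], b[key]):
            diffs.append(key)
    return diffs

def make_file(name, timestamp):
    """Build a small in-memory file; nothing is written to disk."""
    f = h5py.File(name, "w", driver="core", backing_store=False)
    f.attrs["creation_time"] = timestamp  # hypothetical volatile attribute
    dset = f.create_dataset("data", data=np.arange(4))
    dset.attrs["units"] = "K"
    return f

fa = make_file("run_a.h5", "2024-01-01T00:00:00")
fb = make_file("run_b.h5", "2024-01-02T00:00:00")

with_volatile = attr_differences(fa, fb)
without_volatile = attr_differences(fa, fb, exclude_names={"creation_time"})
```

Because this sketch filters on the attribute name alone, it drops every attribute with that name wherever it appears in the file. That is the same trade-off raised in the session: excluding by name cannot single out one object when two objects carry an attribute of the same name.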
The conversation then shifted to the challenges of migrating large-scale scientific data, such as NASA’s orbital granules, to cloud storage. Aleksandar highlighted that while Amazon S3 mount points exist, they often perform poorly with HDF5’s random access patterns. He noted a critical “TIL” regarding S3 pricing: users are often charged for a minimum of 32KB per request, making small, frequent reads very expensive and inefficient.
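The arithmetic behind that pricing point can be sketched as follows. The 32 KB minimum is taken from the session as stated, not from a pricing document, and the request sizes reuse the 118- and 236-byte reads mentioned in the transcript. The function names are hypothetical.

```python
MIN_BILLED_BYTES = 32 * 1024  # assumption: the 32 KB minimum quoted in the session

def billed_bytes(request_sizes, minimum=MIN_BILLED_BYTES):
    """Total bytes billed when every request is charged as at least `minimum`."""
    return sum(max(size, minimum) for size in request_sizes)

def amplification(request_sizes, minimum=MIN_BILLED_BYTES):
    """Ratio of billed bytes to bytes actually requested."""
    return billed_bytes(request_sizes, minimum) / sum(request_sizes)

small_reads = [118, 236] * 500      # 1,000 tiny metadata reads
one_big_read = [sum(small_reads)]   # the same bytes fetched in a single request

print(f"small reads: {amplification(small_reads):.1f}x billed vs. requested")
print(f"one read:    {amplification(one_big_read):.1f}x billed vs. requested")
```

Under this assumption, a thousand tiny reads are billed at well over a hundred times the bytes actually needed, while the same bytes fetched in one request are billed at face value. This is the motivation for grouping metadata into a few large reads.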
Finally, the duo discussed Cloud-Optimized HDF5. By using larger page sizes and front-loading metadata (often via h5repack), developers can significantly improve performance in object storage. Aleksandar also pointed to the updated HDF5 tutorials on GitHub, which now leverage GitHub Codespaces to provide a zero-install environment for learning these advanced concepts.
Transcript
[0:00:23] Aleksandar Jelenak: Nothing really for me [prepared]… but I had a couple of posts onto the request… dealing with excluding attributes in h5diff. I put that into GitHub to be able to exclude specific attributes.
[0:00:54] Aleksandar Jelenak: I think there is a PR… #6372. “Add exclude attribute by name option to h5diff.”
[0:02:12] User: Although I don’t know if that’s… if you had the same name in your file twice but you only wanted to exclude one of them, you’re not providing a full path to the object.
[0:03:05] Aleksandar Jelenak: You can comment on this pull request to raise that specific issue… and maybe the larger context is that I need to use it for our regression testing.
[0:03:35] User: In some of our files, we insert things like the current time, and current time of course is going to change every time you run your test. So you don’t actually want to test against that in your regression testing process. You only want to test against the data which you expect to be constant.
[0:04:46] User: Sometimes you’re willing to accept differences in the low order bits… which isn’t necessarily true if you’re doing regression testing at The HDF Group where you expect bits to be exactly the same every time.
[0:06:58] User: We’re being encouraged and eventually forced to move a lot of our data files to cloud storage. One of the things we’re talking about now is how to put that into our build process so we grab those files for things like regression testing.
[0:09:11] Aleksandar Jelenak: If that storage is S3 compatible, then the ROS3 driver of the library should work. But if not S3 compatible, then the library itself has nothing right now… in h5py it’s easier to deal with these backend storage options.
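One reason h5py is more flexible here is that it can open any Python file-like object, which is how storage packages can supply a non-S3 backend. The sketch below uses `io.BytesIO` as a stand-in for such a remote object so it runs without network access. The dataset name and helper functions are hypothetical.

```python
import io

import h5py
import numpy as np

def write_to_buffer():
    """Write a small HDF5 file into memory and return its raw bytes."""
    buf = io.BytesIO()
    with h5py.File(buf, "w") as f:
        f.create_dataset("temperature", data=np.arange(10, dtype="f4"))
    return buf.getvalue()

def read_from_filelike(fileobj):
    """Open HDF5 content from any seekable, readable file-like object."""
    with h5py.File(fileobj, "r") as f:
        return f["temperature"][...]

payload = write_to_buffer()
# In practice `fileobj` would come from a storage library; BytesIO stands in here.
values = read_from_filelike(io.BytesIO(payload))
```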
[0:10:45] Aleksandar Jelenak: I had a project where I explored “mount S3” (S3 mount point) on Amazon. It wasn’t useful at all. I guess it’s useful for file formats where you go slowly and sequentially, but not for HDF5. When I did some benchmarking, it was really, really bad.
[0:12:44] Aleksandar Jelenak: There are some pricing consequences. Minimum you pay 32KB no matter how much you pull—one byte or 32, they charge you 32KB. If it’s an HDF5 file and not cloud-optimized, you could easily read 118 or 236 bytes and you pay for 32KB.
[0:16:46] User: Is there any way to take existing files and try to understand where the metadata resides to understand if they’ve… I assume cloud optimization essentially means all engineering metadata is front-loaded on the file?
[0:17:33] Aleksandar Jelenak: That’s one idea. If you cloud optimize HDF5 and you’re mostly going to access it with the library… you increase the chunk size in bytes. The library is going to read in these file pages that are now some number of megabytes. Anything in a file page comes together as a file page to the library.
[0:20:42] User: Does h5repack consolidate the metadata?
[0:20:47] Aleksandar Jelenak: Yes. That’s one of the things we’re lucky that h5repack could do immediately. Better to produce it immediately… you change only three things in your code and off it goes [as a] cloud-optimized file.
[0:23:37] Aleksandar Jelenak: I’m working on revising the HDF5 tutorial on GitHub… and I decided to do something with the notebook that talks about access to files in S3.
[0:24:43] Aleksandar Jelenak: What was nice about it—you’ve heard about GitHub Codespaces? It’s essentially a virtual machine instance that you can fire up at GitHub and immediately someone can run these notebooks. You don’t have to download it or set up your computer.