Standardizing Tabular Data: Introducing the HDF5 Extension Proposal HEP001

In this episode of “Call the Doctor,” Aleksandar Jelenak explores the HDF5 Extension Proposal HEP001. This proposal aims to modernize how we handle tabular data in HDF5 by introducing a columnar storage specification, moving beyond legacy row-based approaches to enable better interoperability, performance, and metadata flexibility.

Topics Covered

The evolution of tabular data storage in HDF5 (from H5TB to PyTables).
The motivation for a “storage specification first” approach.
Data model details: Columnar storage, attribute usage, and class conventions.
Handling missing values and search indexing.
Alignment with community standards like anndata.

Chapter List

0:00 – Introduction & Project Updates
3:45 – What is HEP001?
6:22 – Columnar vs. Row-based storage
11:48 – Data Model: Columns as Datasets
15:23 – Optional Search Indexing
18:39 – Handling Missing Values
20:21 – Key Attribute Conventions
27:23 – Interoperability with community tools
30:52 – Technical Documentation with MyST
34:30 – Q&A: Compound Types & Atomicity
38:11 – Q&A: Multi-read/write API & Future Development
42:30 – Conclusion & Community Call to Action

Cleaned Transcript

0:00 – Introduction

Aleksandar: Any questions?

Community Member: Oh, yeah, sure. I don’t think I have any at the moment. I haven’t really had a chance to investigate the material you sent me on analyzing header file sizes; I’ve been sidetracked by other aspects of the project.

Aleksandar: No worries. Whenever you’re ready. As long as you got the email, that’s the important thing.

Community Member: I did. We’re in the process of moving our configuration management into GitLab. It’s been a long process. I worked for a foundation years ago that used GitLab, so I know it’s fine, but our internal workflow details are complex, and getting those to mesh with GitLab has been the primary challenge.

Aleksandar: I understand.

3:45 – What is HEP001?

Aleksandar: Now that we have more people joining, I’ll start the presentation. Today, I’m feeling organized: I’ve actually prepared slides! I’ll start with the proposal, then we can talk. So, what is this HEP thing? It stands for “HDF5 Extension Proposal.” It’s an idea that’s been cooking at The HDF Group for a while: creating a facility where the community can propose how to store or handle things using the HDF5 library and data model.

HEP001 is the first one. I left 000 open for general rules on the process, so I’m starting at 001. During a recent project, I had a need to store tabular data, and I realized there was a better way to do it in HDF5 that hasn’t been explored as much as others. This proposal lays out how to store tabular data where each column is a separate HDF5 dataset. This columnar storage technique has been popularized over the last 10+ years by big data tools like Hadoop and Spark.

7:21 – History of Table Storage in HDF5

Aleksandar: The proposal is still a pre-draft, open to comments and criticism. Historically, HDF5 used the “HDF5 Table Specification” (H5TB), which uses a 1D dataset with a compound data type where each column is a field. Later, “PyTables” became very popular. It was software-first and storage-second, providing a query engine and a whole package. About two years ago, I stumbled upon a community project called anndata, which used columnar storage. We weren’t aware of it previously, but HEP001 aims to be reasonably aligned with it.
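To make the contrast concrete, here is a minimal sketch of the two layouts using h5py and NumPy. It only illustrates the storage idea (one dataset of records versus one dataset per column); it does not reproduce the H5TB API or its attribute conventions, and the group, dataset, and field names are hypothetical.

```python
import h5py
import numpy as np

# Compound (record) type: every column is a field of the same element.
row_dtype = np.dtype([("station", "i4"), ("temperature", "f8")])
rows = np.array([(101, 21.5), (102, 19.0), (103, 23.1)], dtype=row_dtype)

# In-memory file (core driver, no backing store), so nothing is written to disk.
with h5py.File("row_vs_column.h5", "w", driver="core", backing_store=False) as f:
    # Row-based layout: a single 1D dataset whose elements are whole records.
    row_table = f.create_dataset("row_table", data=rows)

    # Columnar layout: one 1D dataset per column under a common group.
    col_table = f.create_group("col_table")
    for name in row_dtype.names:
        col_table.create_dataset(name, data=np.ascontiguousarray(rows[name]))

    # Read the records, then pick one field in NumPy.
    from_rows = row_table[...]["temperature"]
    # Read just the one column dataset.
    from_cols = col_table["temperature"][...]

    same_values = bool(np.array_equal(from_rows, from_cols))
    row_field_names = row_table.dtype.names
```

Both layouts hold the same values; the difference is which unit (a record or a column) the file stores contiguously and can describe with its own attributes.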

11:48 – Data Model: Columns as Datasets

Aleksandar: The core of HEP001 is that each column gets its own 1D HDF5 dataset. There are three types of columns: standard, row labels, and categorical. One big advantage is that HDF5 attributes can now be attached to individual column datasets, making them much more meaningful. These sit under an HDF5 group marked with a class attribute set to column_table.
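A minimal h5py sketch of that layout might look like the following. The value column_table comes from the talk; the exact spelling of the attribute name ("class" here), the group name, and the column names are assumptions made for illustration, and the draft spec should be consulted for the real conventions.

```python
import h5py
import numpy as np

# In-memory file (core driver, no backing store), so nothing is written to disk.
with h5py.File("hep001_sketch.h5", "w", driver="core", backing_store=False) as f:
    # The group that holds the table. The attribute name "class" is an assumption.
    table = f.create_group("observations")
    table.attrs["class"] = "column_table"

    # One 1D dataset per column (hypothetical names and values).
    station = table.create_dataset(
        "station", data=np.array([101, 102, 103], dtype="i4")
    )
    temperature = table.create_dataset(
        "temperature", data=np.array([21.5, 19.0, 23.1])
    )

    # An attribute can now describe a single column rather than the whole table.
    temperature.attrs["units"] = "degC"

    column_names = sorted(table.keys())
    column_shapes = {name: table[name].shape for name in column_names}
    table_class = table.attrs["class"]
```

Only standard columns are shown; row-label and categorical columns are left out because the talk does not describe how they are encoded.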

15:23 – Search Indexing & Missing Values

Aleksandar: You can optionally include a search_index group containing datasets to speed up queries. Regarding missing values: the current spec does not allow NaN. This is debatable, and I’m open to feedback. The spec defines a “fill value” for each dataset, which serves as the indicator for missing data.
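As a rough illustration of the fill-value idea, the h5py sketch below creates a column with an explicit fill value and treats any element equal to it as missing. The sentinel -9999 and the dataset name are hypothetical choices for this example, not values taken from the proposal, and the search_index group is not shown because its layout is not described in the talk.

```python
import h5py
import numpy as np

MISSING = -9999  # hypothetical sentinel chosen for this sketch

# In-memory file (core driver, no backing store), so nothing is written to disk.
with h5py.File("missing_sketch.h5", "w", driver="core", backing_store=False) as f:
    table = f.create_group("observations")

    # Allocate five rows; rows that are never written read back as the fill value.
    count = table.create_dataset("count", shape=(5,), dtype="i4", fillvalue=MISSING)
    count[0:3] = [7, 12, 4]

    values = count[...]
    # The dataset itself records which value marks missing data.
    missing_mask = values == count.fillvalue
    valid_mean = float(values[~missing_mask].mean())
```

Because the indicator is stored with the dataset, a reader can build the mask without any out-of-band knowledge; the cost is that the sentinel can never be used as a legitimate data value in that column.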

20:21 – Attribute Conventions

Aleksandar: We are using conventions that mirror existing HDF5 practices (like title, units, and description). I’ve used uppercase for foundational/required attributes and lowercase for column-specific ones. We’ve also included attributes for valid_min and valid_max to help define data ranges.
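The sketch below shows how the lowercase, column-specific attributes named above could be attached to a column with h5py and then used to flag out-of-range rows. The attribute names are the ones mentioned in the talk; the column, its values, and the attribute contents are made up for illustration, and whether the range is inclusive is an assumption.

```python
import h5py
import numpy as np

# In-memory file (core driver, no backing store), so nothing is written to disk.
with h5py.File("attrs_sketch.h5", "w", driver="core", backing_store=False) as f:
    table = f.create_group("observations")
    temperature = table.create_dataset(
        "temperature", data=np.array([21.5, 19.0, 250.0, 23.1])
    )

    # Column-specific attributes (hypothetical contents).
    temperature.attrs["title"] = "Air temperature"
    temperature.attrs["units"] = "degC"
    temperature.attrs["description"] = "Hourly mean at 2 m above ground"
    temperature.attrs["valid_min"] = -90.0
    temperature.attrs["valid_max"] = 60.0

    values = temperature[...]
    # Assumption: valid_min and valid_max are inclusive bounds.
    in_range = (values >= temperature.attrs["valid_min"]) & (
        values <= temperature.attrs["valid_max"]
    )
    out_of_range_rows = np.flatnonzero(~in_range).tolist()
```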

27:23 – Interoperability & Future Roadmap

Aleksandar: I want this to be anndata-friendly. I’m thinking about tightening up how we handle row labels, perhaps using HDF5 object references for our internal spec while using strings for anndata compatibility. This will remain a working draft for about two months before we solidify it for our ongoing projects.

30:52 – Technical Documentation with MyST

Aleksandar: Our proposal lives in a GitHub repository using MyST (Markedly Structured Text). It’s markdown, but with serious technical capabilities like LaTeX equations and cross-references. I can’t stand PDFs for technical documentation anymore. This allows us to write first-class, web-based articles.

34:30 – Q&A: Compound Types & Atomicity

Community Member: Does the column dataset always have to be rank 1?

Aleksandar: Yes. You could use a compound type in a column, but that would sort of defeat the purpose of columnar storage.

Community Member: The separation of concerns here is really appealing. It exploits what HDF5 can do while separating table semantics.

Aleksandar: Exactly. The challenge with columnar storage is atomicity. With row-based storage, you write a row and it’s done. Here, it’s a multi-step process. But in my experience, data is read much more often than it is written, so this is a worthy trade-off.
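The multi-step nature of a columnar append can be seen in a small h5py sketch. The helper append_row below is hypothetical and not part of the proposal or of h5py: it checks that a row supplies every column before touching the file, then resizes and writes each column in turn. That check reduces the chance of a partial row, but it does not make the append atomic; a failure partway through the loop would still leave columns with different lengths.

```python
import h5py


def append_row(table, row):
    """Append one row to every column dataset in `table` (hypothetical helper).

    Each column needs its own resize and write, so this is several steps,
    not one atomic operation.
    """
    names = list(table.keys())
    if set(row) != set(names):
        raise ValueError("row must supply exactly one value per column")
    n = table[names[0]].shape[0]
    for name in names:
        dset = table[name]
        dset.resize((n + 1,))
        dset[n] = row[name]
    return n + 1


# In-memory file (core driver, no backing store), so nothing is written to disk.
with h5py.File("append_sketch.h5", "w", driver="core", backing_store=False) as f:
    table = f.create_group("observations")
    # maxshape=(None,) makes a column extendable; that requires chunked storage.
    table.create_dataset("station", shape=(0,), maxshape=(None,), dtype="i4", chunks=(1024,))
    table.create_dataset("temperature", shape=(0,), maxshape=(None,), dtype="f8", chunks=(1024,))

    append_row(table, {"station": 101, "temperature": 21.5})
    n_rows = append_row(table, {"station": 102, "temperature": 19.0})

    # A row missing a column is rejected before any dataset is modified.
    try:
        append_row(table, {"station": 103})
        rejected = False
    except ValueError:
        rejected = True

    lengths = {name: table[name].shape[0] for name in table}
```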

38:11 – Q&A: Multi-read/write API & Future Development

Aleksandar: There is a multi-read/write API in the HDF5 library, but it’s not accessible via h5py yet. I’m focused on the storage specification first, and we can develop the software tooling alongside it. I’m biased, but I think it’s a promising technology layer.

42:30 – Conclusion

Aleksandar: If people want to leave comments, please use the GitHub repository issues. We want well-meaning criticism. If you think this is interesting, let your colleagues know. Thanks for watching, and see you next month!
