2022 European HDF5 User Group Agenda

Slides and Video are archived below. You can access a full playlist of all videos.

Tuesday 31 May 2022

CEST (GMT+2) CDT (GMT‑5) Presentation
9:00-9:45 a.m. 2:00-2:45 a.m. Introduction – Gerd Heber (Video)
Keynote – Anders Wallander, head of ITER Controls Division (Slides | Video)
9:45-10:05 2:45-3:05 State of HDF5 at the new ESRF source – EBS – Andy Götz, ESRF (Slides | Video)
This talk will present the state of adopting HDF5 at the new ESRF source – the EBS. It will briefly describe the ESRF and what kind of data it produces. Then it will describe the way HDF5 has been implemented in BLISS, the new beamline control system. It will describe the tools developed for processing HDF5 files. The talk will end with a list of issues encountered adopting HDF5.
10:05-10:25 3:05-3:25 New Features for HSDS – John Readey, The HDF Group (Slides | Video)
This talk will describe some of the new features in the HSDS 0.7 release, including:
• Serverless implementation using AWS Lambda or h5pyd direct access
• Streaming support for large requests
• Fancy Indexing acceleration
• Improved dynamic scaling
• RBAC support

I’ll also describe some ways in which HSDS is used by different organizations to support large scale data analytics.
10:25-10:45 3:25-3:45 hdf5plugin – Thomas VINCENT, ESRF (Slides | Notebook | Video)
hdf5plugin is a Python package (1) providing a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and (2) enabling their use from the Python programming language with h5py, a thin, Pythonic wrapper around libHDF5.

This presentation illustrates how to use hdf5plugin for reading and writing compressed datasets from Python and gives an overview of the different HDF5 compression filters it provides.
10:45-11:05 3:45-4:05 Blosc2 & HDF5: A Proposal For Working As A Team – Francesc Alted, ironArray SL (Slides | Video)
Blosc2 is the new generation of the ultra-fast Blosc compression library. Although it is usually thought of as a compressor, Blosc should actually be regarded as a compressor orchestrator, transparently handling a combination of codecs (compressors) and filters (or preconditioners). Blosc2 comes with lots of new features, but perhaps the most interesting ones are the new 63-bit frame specification and Caterva, a new multidimensional layer on top of it.

Blosc developers have extensive experience with HDF5, and in fact, several of Blosc's main features are a consequence of our desire to optimize several parts of the compression pipeline in HDF5; this includes faster filters (shuffle using SIMD instructions) and fast multi-threaded operation. Recently, the Blosc development team learned about the direct chunk read/write operation in HDF5, where it can delegate the handling of the compression pipeline to the application. We think that opens an excellent way of enabling recent new features of Blosc2 from HDF5 applications.

In our talk, we will discuss how HDF5 and Blosc2 can work as a team thanks to this development, and how this level of cooperation can lead to new heights of performance in existing HDF5 applications.
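To make the idea of a "filter (or preconditioner)" concrete, here is a minimal sketch of a byte-shuffle step placed in front of a general-purpose codec. It is only an illustration of the concept: it uses Python's standard zlib rather than Blosc or HDF5, it is not SIMD-accelerated, and the function names and the sample data are assumptions made for this example.

```python
import struct
import zlib


def shuffle(data: bytes, itemsize: int) -> bytes:
    """Byte-shuffle: group byte 0 of every item, then byte 1, and so on."""
    if len(data) % itemsize:
        raise ValueError("data length must be a multiple of itemsize")
    return b"".join(data[i::itemsize] for i in range(itemsize))


def unshuffle(data: bytes, itemsize: int) -> bytes:
    """Inverse of shuffle(): interleave the byte planes back into items."""
    if len(data) % itemsize:
        raise ValueError("data length must be a multiple of itemsize")
    n_items = len(data) // itemsize
    out = bytearray(len(data))
    for i in range(itemsize):
        out[i::itemsize] = data[i * n_items:(i + 1) * n_items]
    return bytes(out)


# Hypothetical sample: 10,000 slowly increasing 32-bit little-endian integers.
values = list(range(10_000))
raw = struct.pack(f"<{len(values)}I", *values)

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffle(raw, 4)))

print(f"raw: {len(raw)} B, zlib: {plain_size} B, shuffle+zlib: {shuffled_size} B")
```

On smooth numeric data like this, the shuffled stream compresses noticeably better because the rarely changing high-order bytes end up next to each other. Real data may behave differently.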
11:05-11:20 4:05-4:20 15-minute break
11:20-11:40 4:20-4:40 The story of Hierarchical Semi-Sparse Cubes in HDF5 – Jiri Nadvornik, Czech Technical University in Prague (Slides | Video)
Combining sparse multi-dimensional data is always a challenge. In our case, we combine multi-modal data – images and spectra, to be able to extract additional knowledge by machine learning techniques. Combining fundamentally different datasets like these requires an efficient way of linking them together to build a coherent data cube. Our research showed that using a database-like solution will not cut it, and from the I/O middleware options HDF5 came out as the winner because of its good performance on dense data and flexibility. We built a Python framework on top of h5py called Hierarchical Semi-Sparse cube (HiSS cube) that implements such a coherent data cube.

In this talk, we show how HDF5 can be used effectively for data other than dense data, for both visualization and contiguous data access purposes. We also touch on the MPI I/O performance to show the scalability of this solution. Some nasty lessons learned on h5py and HDF5 are also included, and hopefully an answer to the usual question: “Can I stay within Python and make my solution fast?”
11:40-12:00 4:40-5:00 HDFql – the easy way to manage HDF5 data – Gerd Heber (Slides | Video)
12:00-12:20 5:00-5:20 LImA/LImA2 with HDF5, current practice and next gen DAQ evolutions – A. Homs, L. Claustre and S. Debionne, ESRF, European Synchrotron Radiation Facility (Slides | Video)
LImA, a Library for Image Acquisition widely used in the synchrotron world, saves acquired data in HDF5 following the NeXus convention. In this talk, we first discuss the current practices in terms of data compression and chunking, and the issues encountered. Then we introduce LImA2, a distributed version of LImA designed for high performance detectors, and the foreseen challenges with data aggregation, sparse data from online data reduction, and fast data access for further online data analysis.
12:20-12:40 5:20-5:40 Why did DECTRIS choose HDF5? – Rizalina Mingazheva, DECTRIS (Slides | Video)
12:40-12:45 5:40-5:45 cushion time
12:45-1:45 5:45-6:45 LUNCH
1:45-2:05 6:45-7:05 BD5: an open format for representing quantitative biological dynamics data – Koji Kyoda, RIKEN BDR (Slides | Video)
With bioimage informatics techniques, research groups worldwide are producing a huge amount of quantitative data on spatiotemporal dynamics of biological objects ranging from molecules to organisms. To facilitate the reuse of such data, we developed several open unified data formats for representing quantitative data of biological dynamics. We first developed an XML-based data format, BDML (Biological Dynamics Markup Language), that has the advantage of machine/human readability and extensibility; however, it becomes difficult to access data with a large file size because XML-based files require sequential reads. We next developed BD5, a binary data format based on HDF5, enabling fast random data access to address the issue. It allows practical data reuse for understanding biological mechanisms underlying the dynamics. We are currently developing a zarr-based format, BD-zarr, compatible with ome-zarr, the next generation file format of bioimaging data. We will report on our latest progress in the development of data formats.
2:05-2:25 7:05-7:25 MATLAB and HDF5: Compression, Cloud, and Community – Ellen Johnson, MathWorks (Slides | Video)
This talk will provide an overview of MATLAB workflows for writing and reading HDF5 datasets using dynamically loaded filters (DLFs). Writing datasets using third-party compression filters enables full write/read roundtrip DLF capabilities and continues our long history of supporting HDF5 features through our rich high- and low-level library interfaces. We will also review MATLAB’s support for HDF5 Single-Writer/Multiple-Reader (SWMR), Virtual Datasets (VDS), and working with HDF5 data on cloud locations, including read/write support for S3 and Azure, and read support for Hadoop. The talk will provide a demo showing use of dynamically loaded filter, SWMR, and VDS features on local and cloud platforms. We will wrap up with a discussion of collaborations with open-source projects/software such as our long-standing work with The HDF Group and a new collaboration with Neurodata Without Borders.
2:25-2:45 7:25-7:45 hepfile: wrapping HDF5 to give ROOT-like functionality for HEP datasets and more – Matthew Bellis, Siena College (Slides | Video)
Most file formats excel when the data exists in some simple, homogeneous n x m block structure. High Energy Physics (HEP) datasets present a challenge because of the inhomogeneous nature of the dataset: one event may have 3 jets and 2 muons and the next event may have 12 jets and no muons. The HEP community has broadly adopted the ROOT analysis library, which has its own file format. This means that users have traditionally had to import the entire ROOT ecosystem just to read the files, locking out users from other communities that do not use ROOT, though it should be acknowledged that newer modules like uproot allow for a more modular approach. We have taken a different approach with hepfile (Heterogeneous Entries in Parallel – file), which provides a data description for organizing this type of data, an API definition for interfacing with these data, and a description of how to pack this into a file. These abstract definitions have been implemented as a Python API that makes use of the HDF5 file format: a wrapper around HDF5 that gives users access to ROOT-like functionality without ROOT files, all the while making use of native Python tools. The performance of this tool and its application to both HEP and non-HEP datasets will be presented.
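The packing problem described in this abstract can be illustrated with a generic "counts plus flat array" layout, in which each variable-length field becomes two homogeneous arrays that fit naturally into a block-oriented format. This is a sketch of that general technique only. It is not the hepfile API, the abstract does not say that hepfile uses exactly this layout, and the event values and function names are made up for the example.

```python
from itertools import accumulate

# Hypothetical events with a variable number of jets and muons (energies).
events = [
    {"jets": [50.2, 31.7, 22.1], "muons": [12.4, 8.9]},
    {"jets": [44.0, 40.5, 18.3, 9.6], "muons": []},
    {"jets": [], "muons": [30.1]},
]


def pack(events, key):
    """Flatten one ragged field into (counts per event, flat value list)."""
    counts = [len(event[key]) for event in events]
    flat = [value for event in events for value in event[key]]
    return counts, flat


def unpack(counts, flat, index):
    """Recover the values of one event from the packed representation."""
    offsets = [0, *accumulate(counts)]
    return flat[offsets[index]:offsets[index + 1]]


jet_counts, jet_values = pack(events, "jets")
muon_counts, muon_values = pack(events, "muons")

print(jet_counts)                          # [3, 4, 0]
print(muon_counts)                         # [2, 0, 1]
print(unpack(jet_counts, jet_values, 1))   # [44.0, 40.5, 18.3, 9.6]
```

Both the counts and the flat values are plain one-dimensional arrays, so either could be written as an ordinary dataset while the per-event structure remains recoverable.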
2:45-3:05 7:45-8:05 TBA
3:05-3:20 8:05-8:20 15-minute break
3:20-3:40 8:20-8:40 Latest developments in data visualisation in the web with H5Web – Loïc Huder, ESRF (Slides | Video)
H5Web is a web viewer for HDF5 files. It supports the visualisation of datasets using performant WebGL-based visualisations and the exploration of HDF5 files requested from a separate back-end (e.g. HSDS) for modularity. Users can browse and visualise their data online or locally in H5Web, notably through its JupyterLab extension.

In addition to the web viewer, H5Web visualisations can serve as general-purpose data visualisation components. At ESRF, several web applications are now using these visualisations for fields ranging from X-ray diffraction to tomography reconstruction. In this talk, we will show how the development of these applications drove the latest developments in H5Web.
3:40-4:00 8:40-9:00 Native HDF5 in the browser: jsfive and h5wasm – Brian B Maranville, NIST Center for Neutron Research (Slides | Video)
In many disciplines of science, engineering and medicine, there has been strong movement toward providing data processing and visualization in a browser-native environment. End-users can then interact with their data from any device, without installing additional software. For tools that work with the data files directly, reading and writing text-based datasets such as CSV is easy; this is not true for datasets in rich structured binary formats such as HDF5. Here we present tools for handling HDF5 files directly in the browser:

In “jsfive”, the HDF5 specification is re-implemented in pure JavaScript (following the work in “pyfive”) to make a very lightweight library for reading a subset of HDF5 files (only specific datatypes supported) from a file image as bytes.

In “h5wasm”, a small library of functions using the HDF5 C API (libhdf5) is compiled to WebAssembly, and used by a higher-level JavaScript library. Using the Emscripten virtual file system, it can open “files” in read, write or read/write modes, and slice datasets to return only a subset of the data. Reading of ARRAY, COMPOUND and ENUM types is fully supported, and data is returned in native formats (TypedArray) where possible. We will show examples of this library in current use.

We also recently contributed to the addition of the “h5py” package to the “pyodide” environment, bringing the powerful Python API for HDF5 manipulation to the browser, and further lowering the barrier for using HDF5 in e.g. JupyterLite browser-based notebooks.

 

Wednesday 1 June 2022

CEST (GMT+2) CDT (GMT‑5) Presentation
9:00-9:45 2:00-2:45 FAIRmat, a German NFDI Project for FAIR Data Management – Sandor Brockhauser, FAIRmat, HU Berlin
The goal of the FAIRmat project is to establish best data management practices in Materials Science. As an NFDI project it is one of the few German flagships to build the National Research Data Infrastructure. Working with international collaborators, FAIRmat also connects Synthesis, Experiments, and Theory. Besides describing FAIRmat, the presentation also shows the recent efforts in data format standardisation, the use of NeXus, and the requirements for file formats and data storage in general. In particular, it addresses what is required of HDF5 for it to be integrated into FAIRmat’s data management infrastructure, NOMAD.
9:45-10:05 2:45-3:05 Storing EPICS process variables in HDF5 files for ITER – Rodrigo Castro, Spanish National Fusion Energy Laboratory – CIEMAT (Slides | flat_structure.txt)
EPICS is the technology for distributed control used by ITER in all its systems. EPICS uses an architecture based on a distributed protocol that allows the exchange of control data (process variables) between different elements and subsystems. In its previous versions, EPICS used “Channel Access” as the communication protocol, and its control variables had a simple structure that allowed the exchange of a very limited set of primitive data types. Starting with version 7, EPICS uses a new “PV Access” protocol, and its process variables allow for nested data types that can form very complex data structures. From a data storage point of view, both protocols have challenges to address. In the case of “Channel Access”, ITER plans to handle more than a million signals, which requires, in order to have a functional solution, the aggregation of tens of thousands of time evolution variables in a single file. In the case of “PV Access”, some control variable data types will contain thousands of fields. In this presentation, we describe the solutions for both protocols that we have implemented for ITER and that are based on HDF5 files.
10:05-10:25 3:05-3:25 Data Storage in High Energy Physics with ROOT – Dr Jakob Blomer, CERN (Slides)
Researchers in High Energy Physics (HEP), at CERN and elsewhere, need to efficiently analyze petabytes of data. Data sets at particle colliders are analyzed with statistical methods, comparing theoretical expectations of certain, often very rare physics processes with recorded data from particle detectors. As the number of particles produced in each and every collision is a priori unknown, HEP data is not naturally tabular but instead modeled by more complex hierarchical collections. This presentation introduces the HEP data model and the main data format used to store and retrieve HEP data. The data format and I/O API is provided by the ROOT toolkit. The ROOT I/O stores (nested) collections of C++ objects in columnar layout. The presentation aims at pointing out HEP specific I/O needs as well as similarities to HDF5 and Big Data formats.
10:25-10:45 3:25-3:45 Parallel HDF reading for Imaging Techniques – Paola C. Ferraz, PhD – CNPEM / Brazilian Synchrotron Light Laboratory (Slides)
Imaging techniques at Sirius, the 4th generation Brazilian synchrotron, can generate large datasets that are stored in the main datacenter using the HDF format, which is also the official file format for most of the beamlines that are under commissioning. Some beamlines were constructed in such a way that the detectors (in-house developments) directly store a block of measured images (at high rates, e.g., greater than 1000 FPS) in compressed form. The compression exists to relieve the central storage; otherwise, typical imaging datasets would consume on the order of terabytes. Hence, the post-processing of large datasets strongly depends on an important subproblem: decompressing a dataset quickly enough that the elapsed time for data consumption does not increase dramatically. A particular example of post-processing that the scientific computing group is facing is the computational support for ptychographic experiments combined with tomography for beamlines such as Cateretê and Carnaúba. Here, a two-dimensional frame is obtained through a scanning procedure of N points at a detector having dimensions of 3072×3072 pixels. After rotation through M angles, extremely large datasets can be obtained: for instance, N=500 and M=1000 give a total of O(17TB) of data to be read and processed (bit depth of 32). At the end of the scanning procedure, we are dealing with a sequence of blocks, each denoted by B, composed of N compressed images in the HDF file format, and loading each one of them takes roughly O(2 mins) without further specific strategies. We have then devised a strategy using MPI (mpi4py) with X processes, each of which reads a part of the block B. The reading of each fraction of B is done using the Python package h5py (in parallel mode). This approach then consumes T=O(15 sec) for reading a block B, which is a good time reduction for reading a single block.
Looping over M=500 files takes roughly O(2 hrs), a process that does not include further data processing (ptychography over B, for instance). Hence, data processing elapsed times are bounded by reading rates from storage (100Gbps in our case) and the HDF/MPI strategy. We have noticed that X=32 is an optimal number, running on servers equipped with 2 sockets, 18 cores per socket and 4 threads per core. Reading M files at once and using the same strategy for HDF/MPI can also reduce the total elapsed time (reading + processing) by 20%, or even 50% in some cases. The scientific computing and Tepui groups (responsible for HPC/Storage infrastructure), acting as supporting groups for beamlines at Sirius, are investigating more efficient ways to reduce the elapsed time T for a given block B, improving user experience.
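The orders of magnitude quoted in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures stated above (3072×3072 pixels, 32-bit depth, N=500, M=1000, about 2 minutes versus about 15 seconds per block, 500 files). The variable names are made up for the example, and reading "17TB" as binary terabytes (TiB) is an assumption.

```python
# Figures taken from the abstract; variable names are illustrative only.
frame_pixels = 3072 * 3072
bytes_per_pixel = 4            # bit depth of 32
n_points = 500                 # N scan points per block
m_angles = 1000                # M rotation angles

frame_bytes = frame_pixels * bytes_per_pixel
total_bytes = frame_bytes * n_points * m_angles
total_tb = total_bytes / 1e12      # decimal terabytes
total_tib = total_bytes / 2**40    # binary terabytes (assumed meaning of "17TB")

serial_s_per_block = 120       # "roughly O(2 mins)" without a parallel strategy
parallel_s_per_block = 15      # "T=O(15 sec)" with the MPI/h5py strategy
n_files = 500                  # "Looping over M=500 files"

serial_hours = n_files * serial_s_per_block / 3600
parallel_hours = n_files * parallel_s_per_block / 3600
speedup = serial_s_per_block / parallel_s_per_block

print(f"total data: {total_tb:.2f} TB = {total_tib:.2f} TiB")    # 18.87 TB = 17.17 TiB
print(f"read time: {serial_hours:.1f} h serial, {parallel_hours:.1f} h parallel")
print(f"per-block speedup: {speedup:.0f}x")                       # 8x
```

The roughly 2.1 hours obtained for 500 blocks at 15 seconds each matches the "O(2 hrs)" stated in the abstract, compared with close to 17 hours at 2 minutes per block.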
10:45-11:05 3:45-4:05 IMAS Data Model and I/O library: Status and needs – Olivier Hoenen, ITER Organization (Slides)
The ITER Organization and the ITER Members are developing the Integrated Modelling & Analysis Suite (IMAS) to support the physics modelling of plasma pulses in the ITER tokamak and detailed analysis from diagnostics measurements. At the core of IMAS lies a standardized Data Model and its associated data-access libraries that can combine various physics models and codes written in different programming languages in workflows of various complexity. While this data model is fusion domain-specific, it is agnostic to both the experimental device and the codes, and can cover simulation and experimental data seamlessly. This flexibility allows covering a wide range of use cases and has the potential to become the de-facto data standard for the fusion community. A shortcoming of this approach is that different access patterns from the various use cases make it challenging to always obtain the best I/O performance. This presentation will cover the addition of a new storage backend for the IMAS data-access library that relies on HDF5, discuss some of its advantages compared to other available backends and list areas where further investigation and improvements will be required.
11:05-11:20 4:05-4:20 15-minute break
11:20-11:40 4:20-4:40 Toward Multi-Threaded Concurrency in HDF5 – John Mainzer, Elena I Pourmal, Lifeboat, LLC (Slides)
The lack of concurrent access is a long-standing limitation of the HDF5 library that creates deployment and adoption barriers for multi-threaded applications. To date, there has been no effort to overcome this library restriction, mostly due to the difficulty of retrofitting a large code base with thread concurrency.

Current multi-threaded applications use thread-safe builds of HDF5 to access data, or read HDF5 files directly, bypassing the HDF5 library. The latter applications usually support a limited set of HDF5 features and do not provide a general solution.

The recent HDF5 developments and HDF5 library architecture prompted us to propose a strategy for concurrent read access without API changes and relatively quick delivery of the desired features when compared to a full HDF5 library rewrite.

In our talk we will discuss the proposed strategy and how it will lead to implementation of multi-threaded concurrency in HDF5.
11:40-12:00 4:40-5:00 Use of HDF5 format on board of power generators for marine and terrestrial applications – Giuseppe Giannino, CIS (Innovation and Development Center) – Isotta Fraschini Motori [ITALIA] (Slides)
Power generators are equipped with a huge number of sensors on board, which feed the automation logic that controls engine behavior, performance, warnings and alarms. The main needs that have pushed our company to develop and integrate devices able to acquire the data traveling on board a power generator are:
a. equipping the asset with an instrument ready to be used for diagnosing problems that have occurred,
b. collecting more information on assets deployed in the field to deepen our knowledge of them.
12:00-12:20 5:00-5:20 Programmable datasets with User-Defined Functions for HDF5 – Lucas Villa Real, IBM Research (Slides)
This talk will present an infrastructure to embed scripts (written in Python, C/C++, Lua, or CUDA) in HDF5 datasets. The values of such datasets are defined on-the-fly by the seamless execution of such scripts. It is possible to create data transformers, to virtualize data stored in different formats, to access external devices and sensors, and more. The presentation will cover details of the project’s infrastructure and will include use-cases that are driving its current and future development.
12:20-12:40 5:20-5:40 HDF5 ⬌ Zarr – Aleksandar Jelenak (Slides | Video)
12:40-12:45 5:40-5:45 cushion time
12:45-1:45 5:45-6:45 LUNCH
1:45-2:05 6:45-7:05 HighFive: An Easy-To-Use, Header-Only C++ Library for HDF5 – Luc Grosheintz, Nicolas Cornu, EPFL – Blue Brain Project (Slides | Video)
Portable scientific data formats are vital for reliable data storage, knowledge transfer, and long-term maintainability and reproducibility. Hierarchical Data Format (HDF) 5 is considered the de-facto standard for this purpose in various domains within academia as well as industry. While the official HDF5 library is versatile and well supported, it only provides a low-level C/C++ interface. The lack of proper high-level C++ abstractions dissuades the use of HDF5 in many scientific applications. There are a number of C++ wrapper libraries available. Many, however, are domain-specific, incomplete or not actively maintained.

To address these challenges we present HighFive, an easy-to-use, header-only C++11 library that simplifies data management in HDF5. It is designed with performance as well as ease of use in mind. Thanks to compiler inlining on header-only templates, it reaches near-zero runtime overhead. The library features: automatic C++ type-mapping, automatic memory management via RAII, parallel MPI I/O support, and adjustable data selections for partial I/O. HighFive is developed as an open-source library and can be downloaded from: github.com/BlueBrain/HighFive.
2:05-2:25 7:05-7:25 TBA
2:25-2:45 7:25-7:45 HDF5 work in the ECP project – Suren Byna (Slides)
2:45-3:05 7:45-8:05 BioSimulations: a platform for sharing and reusing biological simulations – Jonathan Karr, Icahn School of Medicine at Mount Sinai (Slides | Video)
To help investigators share, reuse, and combine biological models, we developed BioSimulations, a central repository of biological models, simulations of these models, the results of these simulations, and data visualizations of these results. To enable BioSimulations to support models for a broad range of biological systems and scales, we combined domain formats for models and simulations with HDF5 and HSDS for simulation results and the Vega format for data visualizations. This combination of HDF5, HSDS, and Vega enables interactive figures, closes crucial gaps in the provenance of figures, and opens new possibilities for sharing and reusing visualizations.
3:05-3:20 8:05-8:20 15-minute break
3:20-3:40 8:20-8:40 The State of HDF5 – Dana Robinson, The HDF Group (Slides | Video)
HDF5 Community Building – Dana Robinson, The HDF Group (Slides | Video)
3:40-4:00 8:40-9:00 HDF5 Developers Lightning Talks
Hermes – Chris Hogan, The HDF Group (Slides | Video)

Onion VFD – Songyu “Ray” Lu, The HDF Group (Video)
4:00-4:20 9:00-9:20 continued lightning talks
4:20-4:40 9:20-9:40 Wrap-up – Gerd Heber, The HDF Group (Video)

 

Thursday 2 June 2022

CEST (GMT+2) CDT (GMT‑5) Meeting Room 2113 Meeting Room 4054 Meeting Room 3046
10:00-11:30 a.m. 3:00-4:30 a.m. ITER site visit (no remote presentation)
11:30-12:45 p.m. 4:30-5:45 Problem Solving Session – A. Homs, L. Claustre and S. Debionne   hdf5plugin – Thomas Vincent
12:45-1:45 p.m. 5:45-6:45 LUNCH
1:30-3:30 p.m. 6:30-8:30 h5web – Axel Bocciarelli / Loic Huder   h5py – Aleksandar Jelenak
3:30-5:30 p.m. 8:30-10:30 HSDS – John Readey   HDF5 Test and Tune – Gerd Heber (Slides)