European HDF Users Group Summer 2021

The European HDF Users Group (HUG) Summer 2021 was held July 7-8, 2021, starting at 2:00 p.m. CEST each day. Sessions included a variety of talks from users throughout the HDF5 community on topics like new storage architectures and parallel and cloud I/O, performance and debugging issues, analysis and visualization, wrappers and VOL connectors, new uses of HDF5, and HDF5 applications in science and industry. Everyone was welcome to present and attend; the event was simply held at a time more convenient for our European community members. (Check out the Fall 2021 HUG event, with planned times more comfortable for those closer to the central time zone.)

Videos are linked individually in the agenda below. You can also start the playlist, which includes every session. The chat log, which contains a fair amount of discussion and many additional links, is also posted.

Agenda

Wednesday, July 7

Time‑CEST Time‑CST  
2:00‑2:15 7:00-7:15 Welcome Address (Video)
Mike Folk, The HDF Group
2:15-2:20 7:15-7:20 (Break)
2:20-2:35 7:20-7:35 HDFql – the easy way to manage HDF5 data (Slide Deck | Video)
Rick (Mr. HDFql), The HDFql Project and Gerd Heber, The HDF Group
In this presentation, we give an overview of HDFql, a high-level language to manage HDF5 data. Inspired by the simplicity and power of SQL, and unlike typical HDF5 language bindings, HDFql is a declarative platform-independent guest language that can be used in conjunction with many host languages (C, C++, Java, Python, C#, Fortran, R) to perform the full breadth and depth of HDF5 data management tasks. We will show how this is achieved and provide plenty of examples of HDFql in action. In HDFql, we support popular HDF5 features, such as HDF5 selections, parallel HDF5, and direct chunk I/O, but also data export/import to/from MS Excel. We will give a preview of what’s on the HDFql drawing board and hope for plenty of feedback to help us steer future development.
2:35-2:40 7:35-7:40 (Break)
2:40-2:55 7:40-7:55 ESCAPE: building exabyte-scale federated storage for ESFRI communities (Slide Deck | Video)
Paul Millar, DESY
In this presentation, we will briefly describe ESCAPE: an ongoing EU project that has established a single collaborative cluster of next generation European Strategy Forum on Research Infrastructures (ESFRI) facilities in the area of astronomy and accelerator-based particle physics in order to implement a functional link between the concerned ESFRI projects and European Open Science Cloud (EOSC).
The talk will focus on WP2 “Data Infrastructure for Open Science” (DIOS). This work package is developing a blueprint for an exabyte-scale federated store for nine ESFRI communities (CTA, MAGIC, SKA, LOFAR, LSST, EGO-Virgo, FAIR, XENON1T, KM3NeT), along with running a testbed within which the scalability and manageability of that infrastructure are being demonstrated.
In particular, this talk is an open invitation: we are looking for partners interested in investigating to what extent HDF5 (and the libraries that implement its support) may make better use of the technologies being prototyped within ESCAPE.
2:55-3:00 7:55-8:00 (Break)
3:00-3:15 8:00-8:15 User-Defined Functions for HDF5 (Slide Deck | Video Demo | Video)
Lucas C. Villa Real, IBM Research
This talk will present HDF5-UDF, an infrastructure that enables the attachment of user-defined functions (written in Python, C/C++, or Lua) to HDF5 files. Such functions are disguised as regular datasets that, once read, execute the associated code and populate the dataset contents on-the-fly. It is possible to create routines that access web services, virtualize data stored in different formats, access external devices and sensors, and much more. HDF5-UDF implements a security model that allows users to restrict the operations that user-defined functions provided by third parties can execute, as well as which parts of the file system they can access. The presentation will cover details of the project’s infrastructure and will include use cases that are driving its current and future development.
3:15-3:20 8:15-8:20 (Break)
3:20-3:35 8:20-8:35 H5Coro: The HDF5 Cloud-Optimized Read-Only Library (Slide Deck | Video)
JP Swinski, NASA Goddard Space Flight Center
NASA’s migration of science data products and services to AWS has sparked a debate on the best way to access science data stored in the cloud. Given that a large portion of NASA’s science data is in the HDF5 format or one of its derivatives, a growing number of efforts are looking at ways to efficiently access H5 files residing in S3. This presentation describes one of those efforts, H5Coro, and argues for the creation of a standardized subset of the HDF5 specification targeting cloud environments. H5Coro is an open-source C++ module written from scratch that implements a performant HDF5 reader for H5 files that reside in S3. It targets high latency/high throughput environments by minimizing the number of I/O operations through caching and intelligent range GETs. H5Coro is currently available as a C library and includes Python bindings.
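As a rough illustration of how merging nearby byte ranges can reduce the number of range GETs, the sketch below coalesces a list of requested ranges before formatting HTTP `Range` header values. It is a minimal sketch and not H5Coro's implementation: the function names and the `max_gap` threshold are assumptions made for this example.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce_ranges(ranges: List[Range], max_gap: int = 0) -> List[Range]:
    """Merge byte ranges that overlap or lie within max_gap bytes of each other.

    Hypothetical helper: fewer, larger requests trade a little extra
    transfer for fewer round trips on a high-latency object store.
    """
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if length <= 0:
            continue  # ignore empty requests
        if merged:
            last_offset, last_length = merged[-1]
            last_end = last_offset + last_length
            if offset <= last_end + max_gap:
                # Extend the previous range to cover this one.
                new_end = max(last_end, offset + length)
                merged[-1] = (last_offset, new_end - last_offset)
                continue
        merged.append((offset, length))
    return merged


def http_range_header(offset: int, length: int) -> str:
    """Format an HTTP Range header value (the end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


if __name__ == "__main__":
    wanted = [(0, 512), (4096, 1024), (600, 200), (5200, 300)]
    for offset, length in coalesce_ranges(wanted, max_gap=128):
        print(http_range_header(offset, length))
```

With a gap threshold of 128 bytes, the four requested ranges in the demo collapse into two requests. A real reader would also have to weigh the extra bytes fetched against the round trips saved.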
3:35-3:40 8:35-8:40 (Break)
3:40-3:55 8:40-8:55 OME-NGFF: scalable format strategies for interoperable bioimaging data (Slide Deck | Video)
Joshua Moore, University of Dundee
Despite significant advances in biological imaging and analysis, major informatics challenges remain unsolved: file formats are proprietary, storage and analysis facilities are lacking, as are standards for sharing image data and results. OME releases specifications and software for managing image datasets and integrating them with other scientific data. OME’s Bio-Formats is a file translator that enables scientists to open and work with imaging data in the software application of their choice. OMERO is an image database application that provides data management and sharing capabilities to imaging scientists.
Bio-Formats and OMERO are used in thousands of labs worldwide to enable discovery with imaging.
Despite these efforts, there are still inherent limits in the existing research infrastructure available for tackling the next scale of bioimaging: the cloud. As a result, OME, together with collaborators and the community, has begun defining a next-generation file format (OME-NGFF) to address these challenges.
This talk explores lessons learned over nearly two decades of supporting bioimaging scientists and their data formats, discusses our existing open file formats as well as those under development, and proposes strategies for exchanging, publishing, and re-analyzing imaging data.
The related pre-print can be found on bioRxiv: https://doi.org/10.1101/2021.03.31.437929
3:55-4:10 8:55-9:10 (15-Minute Break)
4:10-4:25 9:10-9:25 Exploring and visualizing HDF5 file contents in JupyterLab with jupyterlab-h5web (Slide Deck | Video)
HUDER Loïc, ESRF
HDF5 (with NeXus) is becoming the de facto standard in most X-ray facilities. HDF5 file viewers are needed to allow users to browse and inspect the hierarchical structure of HDF5 files, as well as visualise the datasets inside as basic plots (1D, 2D). Moreover, web-based applications such as `JupyterLab` are increasingly used because they are easy to access remotely. This presentation will focus on `jupyterlab-h5web`, a JupyterLab extension meant to open HDF5 files in `JupyterLab` notebooks. `jupyterlab-h5web` is based on the components of `h5web`, the open-source web-based viewer developed at the European Synchrotron Radiation Facility. These components, made to be used in other web applications such as `jupyterlab-h5web`, are built with React, a front-end web development library, and WebGL for performant visualisations.
4:25-4:30 9:25-9:30 (Break)
4:30-4:45 9:30-9:45 Gold Standard for macromolecular crystallography diffraction data (Slide Deck | Video)
Herbert J. Bernstein, Ronin Institute for Independent Scholarship
As reported in [Bernstein, H.J., Förster, A., Bhowmick, A., Brewster, A.S., Brockhauser, S., Gelisio, L., Hall, D.R., Leonarski, F., Mariani, V., Santoni, G. and Vonrhein, C., 2020. Gold Standard for macromolecular crystallography diffraction data. IUCrJ, 7(5)], “In the culmination of an effort dating back more than two decades, a large portion of the research community concerned with high data-rate macromolecular crystallography (HDRMX) has now agreed to an updated specification of data and metadata for diffraction images produced at synchrotron light sources and X-ray free-electron lasers (XFELs). This ‘Gold Standard’ will facilitate the processing of data sets independent of the facility at which they were collected and enable data archiving according to FAIR principles, with a particular focus on interoperability and reusability. This agreed standard builds on the NeXus/HDF5 NXmx application definition and the International Union of Crystallography (IUCr) imgCIF/CBF dictionary, and it is compatible with major data-processing programs and pipelines. Just as with the IUCr CBF/imgCIF standard from which it arose and to which it is tied, the NeXus/HDF5 NXmx Gold Standard application definition is intended to be applicable to all detectors used for crystallography, and all hardware and software developers in the field are encouraged to adopt and contribute to the standard.” We will report on updates to the CBFlib package to release 0.9.7 in support of the Gold Standard and HDF5 1.12.
4:45-4:50 9:45-9:50 (Break)
4:50-5:05 9:50‑10:05 How to leverage multi-tiered storage to accelerate I/O (Slide Deck)
Anthony Kougkas, Illinois Institute of Technology
Modern system architectures include multiple tiers of storage organized in a hierarchy. The goal is to mask the I/O gap between compute and remote storage. However, this adds complexity for the end user, resulting in under-utilization of specialized I/O resources.
In this talk, we will present Hermes, a heterogeneous-aware, multi-tiered, dynamic, and distributed I/O buffering system. Hermes aims to remove the complexities associated with multi-tiered storage environments. It offers a simple, yet powerful, buffering API that abstracts the existence of tiers of storage. Further, the Hermes adapter layer is designed to support existing legacy I/O APIs (POSIX, STDIO, MPI-IO) transparently to the user via interception.
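To make the idea of a buffering API that hides storage tiers more concrete, here is a toy sketch in which the caller only sees `put` and `get`, while placement picks the fastest tier that still has room. It is not the Hermes API: the `Tier` and `TieredBuffer` classes, the tier names, and the first-fit placement policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Tier:
    """One storage tier (hypothetical); capacity is in bytes."""
    name: str
    capacity: int
    blobs: Dict[str, bytes] = field(default_factory=dict)

    @property
    def used(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


class TieredBuffer:
    """Toy put/get interface; tiers are listed fastest first."""

    def __init__(self, tiers: List[Tier]) -> None:
        self.tiers = tiers

    def put(self, key: str, blob: bytes) -> str:
        """Store blob in the fastest tier with room; return that tier's name."""
        for tier in self.tiers:
            tier.blobs.pop(key, None)  # drop any stale copy first
        for tier in self.tiers:
            if tier.used + len(blob) <= tier.capacity:
                tier.blobs[key] = blob
                return tier.name
        raise MemoryError("no tier has room for this blob")

    def get(self, key: str) -> bytes:
        """Return the blob without the caller knowing which tier holds it."""
        for tier in self.tiers:
            if key in tier.blobs:
                return tier.blobs[key]
        raise KeyError(key)


if __name__ == "__main__":
    buf = TieredBuffer([Tier("ram", 8), Tier("nvme", 32), Tier("disk", 1024)])
    for key, size in [("a", 6), ("b", 6), ("c", 40)]:
        print(key, "->", buf.put(key, bytes(size)))
```

A production system would add eviction, data movement between tiers, and distribution across nodes. The sketch only shows how the tier choice can stay out of the caller's code.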
5:05-5:10 10:05‑10:10 (Break)
5:10-6:00 10:10-11:00 Lightning Talks:
  • Using HDF5 as a wire format for multi-dimensional data (Slide Deck | Video)
    Vijay Kartik, Deutsches Elektronen-Synchrotron
    A brief look into recent attempts at using HDF5 as a common ‘wire format’ for sending serialized multi-dimensional data between processes.
  • Idiomatic MPI for Modern C++ (Slide Deck | Video)
    Steven Varga, VargaConsulting
    In this announcement I am reaching out to HPC professionals to share a concept of a header-only MPI library for modern C++. This idea closes the gap left by the outdated C++ MPI bindings and would provide a metaprogramming-based, header-only implementation with exceptional performance, seamless interop with the underlying C MPI library, and syntax and functionality as intuitive as those of interpreted languages such as Python.
    Following up on the ISC’19 BoF presentation, the strong similarity between the MPI and HDF5 systems allows significant code and pattern reuse from the H5CPP project. While there are naming differences in concepts such as HDF5 property lists vs. MPI_Info, the main building blocks remain the same, allowing rapid development of a performant MPI library with a user experience similar to Python but with the speed of C.
    In this five-minute talk I invite you to collaborate on a small MPI example, to review its proposed syntax and functionality, and to investigate possible limitations.
    Keywords: MPI C++, HDF5, H5CPP
  • BENCHMARK: an idiomatic C++ performance measurement library for distributed computing (Slide Deck | Video)
    Steven Varga, VargaConsulting
    BENCHMARK is yet another library to help with timing and throughput measurements of your C or C++ code. What sets this header-only implementation apart from the competition is its intuitive, pythonic syntax based on template metaprogramming, its MPI capability, and the ability to measure the performance of HDF5 C API calls without boilerplate, thanks to its direct integration with H5CPP, an HDF5 C++ library from the same author.
    While C and C++ are different languages, with the exception of a few corner cases they are close enough to compile C code with a C++ compiler, opening up possibilities to use a template-metaprogramming-based approach to reduce or entirely remove boilerplate code. In this lightning talk I am going to walk you through a simple example of a performance and timing measurement of an HDF5 C API call, revealing the simplicity behind the idea: performance measurement should be easy.
    Keywords: C++17, HDF5, H5CPP
  • FAIRmat for Materials Science to follow FAIR principles (Slide Deck | Video)
    Sandor Brockhauser, Humboldt University, Berlin
    FAIRmat is a new project to enable Materials Science to follow FAIR principles and to contribute to the establishment of the National Research Data Infrastructure in Germany. FAIRmat is being built on the NOMAD Laboratory, the biggest data store in computational materials science worldwide. We are currently advancing its data infrastructure towards materials synthesis, experimental physics, theory, and data processing workflows. The new digital infrastructure and cloud services, based on leading-edge IT technologies, support Open Data and Open Science towards data-centric materials science involving Artificial Intelligence.
  • h5nuvola (Slide Deck | Video)
    Andrea Lorenzon, CERIC-ERIC
    h5nuvola is a project, born under the PaNOSC project, aimed at the realization of a web interface to browse HDF5 files in the context of photon and neutron scientific data. Written in pure Python and JavaScript, and with the idea of giving users an easy tool to quickly explore the content of their HDF5 files, it already features basic plots for 1D, 2D, and 3D data, based on Bokeh, plus plugins for custom matplotlib plots of metadata-recognized datasets, like beam distributions with marginal histograms.

Thursday, July 8

Time‑CEST Time-CST  
2:00‑2:15 7:00-7:15 HDF5 at the ESRF (Slide Deck | Video)
Wout De Nolf, ESRF
This presentation provides an overview of how HDF5 is used at the ESRF. The focus will be on saving raw data from the new beamline control software BLISS. The current situation and required HDF5 developments will be discussed.
2:15-2:20 7:15-7:20 (Break)
2:20-2:35 7:20-7:35 hdf5plugin (Slide Deck | Notebook | Video)
Thomas VINCENT, ESRF
hdf5plugin is a Python package (1) providing a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and (2) enabling their use from the Python programming language with h5py, a thin, pythonic wrapper around libHDF5.
This presentation illustrates how to use hdf5plugin for reading and writing compressed datasets from Python and gives an overview of the different HDF5 compression filters it provides.
It also illustrates how the provided compression filters can be enabled to read compressed datasets from other (non-Python) applications.
Finally, it discusses how hdf5plugin manages to distribute the HDF5 plugins for reuse with different libHDF5.
2:35-2:40 7:35-7:40 (Break)
2:40-2:55 7:40-7:55 Experiences with virtual datasets (Slide Deck | Video)
Thomas Kluyver, European XFEL
Introduced in HDF5 1.10, virtual datasets offer a combined view of data stored in several separate datasets and even separate files, presenting them as pieces of one multidimensional array. This can save copying large amounts of data into a single dataset. I’ll describe some practical uses for virtual datasets at European XFEL, in a similar context to that which motivated the design of the feature. I’ll also demonstrate how virtual datasets can be conveniently assembled in Python using the high-level API in h5py. Finally, I’ll mention some issues we have encountered with the virtual dataset machinery, both at European XFEL and from issues reported to h5py, along with workarounds and possible solutions.
2:55-3:00 7:55-8:00 (Break)
3:00-3:15 8:00-8:15 H5CPP: non intrusive persistence for Modern C++ (Slide Deck | Video)
Steven Varga, VargaConsulting
H5CPP is a novel approach to persistence: it provides high-performance sequential and block access to HDF5 containers through a modern C++ interface, with an easy-to-use API much like Python's but with the speed of C. With its LLVM-based source code transformation tool, h5cpp-compiler, the CRUD-like header-only templates are augmented with compiler-assisted reflection, helping you to persist arbitrarily complex POD types, whether homogeneous or non-homogeneous data, with contiguous/adjacent memory layout. In addition to reflection, and by exploiting structure, H5CPP supports major linear algebra systems and provides a mechanism to integrate BLAS/LAPACK-based libraries.
With a detailed long-term plan in mind, this project provides a scalable, seamless persistence framework from laptops to MPI-based clusters and supercomputers, providing a solid, flexible, low-latency data solution.
Applications: financial markets, sensor networks, science and engineering, Matlab, Julia, R, … interop with C++, custom storage systems, etc.
3:15-3:20 8:15-8:20 (Break)
3:20-3:35 8:20-8:35 Lightning Talks:
  • Experiences with GPU decompression for bitshuffle+LZ4 data (Slide Deck | Video)
    Jon Wright, ESRF
    We are producing large volumes of data which are compressed using the bitshuffle+LZ4 format and saved into HDF5 files. Thanks to the nvcomp library from NVIDIA, these data can be LZ4-decompressed inside a GPU instead of using the usual HDF5 plugin. By implementing a bitshuffle filter, the uncompressed chunks can be recovered ready for processing with GPU-based algorithms. We will look at a few benchmarks to try to find out whether this is a useful optimisation.
  • Parallel HDF5 and compression filters with synchrotron scattering data (Slide Deck | Video)
    Zdenek Matej, MAX IV Laboratory, Lund University
    Parallel HDF5 is a robust and scalable framework for handling very large scientific datasets. HDF5 is used extensively at photon light facilities such as synchrotrons and X-FELs. Compression and direct chunk write have been found very effective in reducing stored data volumes and the requirements on storage and data transfer infrastructure. We describe our experience of using compression filters with parallel HDF5 for storing large series of image-like data. Alternatives for achieving the required performance figures with serial HDF5, such as direct chunk write and virtual datasets, are known and widely used; however, such solutions impose additional constraints on dataset topology and data analysis software.
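Both lightning talks above rely on bitshuffle-style compression. As a rough picture of what a bit-level shuffle does, the sketch below transposes the bits of a block of fixed-size elements so that bit 0 of every element comes first, then bit 1, and so on, and then inverts that transpose. It is a simplified pure-Python illustration with an assumed bit ordering; it is not the on-disk layout of the bitshuffle HDF5 filter, and it has nothing to do with the GPU implementation discussed in the talk.

```python
import struct


def _element_count(data: bytes, elem_size: int) -> int:
    """Validate the block and return the number of elements it holds."""
    if elem_size < 1:
        raise ValueError("elem_size must be at least 1")
    n_elems, remainder = divmod(len(data), elem_size)
    if remainder or n_elems % 8:
        raise ValueError("need a whole number of elements, in multiples of 8")
    return n_elems


def bitshuffle(data: bytes, elem_size: int) -> bytes:
    """Group bit k of every element together (bit-plane order)."""
    n_elems = _element_count(data, elem_size)
    out = bytearray(len(data))
    for bit in range(elem_size * 8):
        byte_in_elem, bit_in_byte = divmod(bit, 8)
        for elem in range(n_elems):
            value = (data[elem * elem_size + byte_in_elem] >> bit_in_byte) & 1
            pos = bit * n_elems + elem  # bit index in the output stream
            out[pos // 8] |= value << (pos % 8)
    return bytes(out)


def bitunshuffle(data: bytes, elem_size: int) -> bytes:
    """Invert bitshuffle() and recover the original element bytes."""
    n_elems = _element_count(data, elem_size)
    out = bytearray(len(data))
    for bit in range(elem_size * 8):
        byte_in_elem, bit_in_byte = divmod(bit, 8)
        for elem in range(n_elems):
            pos = bit * n_elems + elem
            value = (data[pos // 8] >> (pos % 8)) & 1
            out[elem * elem_size + byte_in_elem] |= value << bit_in_byte
    return bytes(out)


if __name__ == "__main__":
    # Eight small 16-bit values: their high-order bit planes are all zero.
    raw = struct.pack("<8H", *range(8))
    shuffled = bitshuffle(raw, 2)
    print(shuffled.hex())
    print(bitunshuffle(shuffled, 2) == raw)
```

For small integer values, the high-order bit planes become runs of zero bytes, which is what makes a following LZ4 stage effective on detector data.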
3:35-3:40 8:35-8:40 (Break)
3:40-3:55 8:40-8:55 Live Eiger Analysis with HDF5 and SWMR at Diamond Light Source (Slide Deck | Video)
Graeme Winter, Diamond Light Source
The current state of live X-ray diffraction data analysis for macromolecular crystallography will be presented, including a brief history of how we arrived here. The focus is on the use of SWMR and HDF5 as part of a data analysis chain for the latest generation of synchrotron X-ray diffraction detectors, capable of operating at up to 560 18-megapixel frames per second.
In addition to the use of the software, the hardware necessary to support these demands will be described, as the overall system is far more than the sum of the individual parts.
3:55-4:05 8:55-9:05 (10-Minute Break)
4:05-4:20 9:05-9:20 Preliminary results of SWMR HDF5 on Spectrum Scale (Video)
Lana Abadie, ITER
We are going to present our requirements, the preliminary results of read and write tests on Spectrum Scale, and our next steps.
4:20-4:25 9:20-9:25 (Break)
4:25-4:40 9:25-9:40 Finding life beyond Earth with HDF5 (Slide Deck | Video)
Danny C. Price, International Centre for Radio Astronomy Research (ICRAR)
Are we alone? The prevalence of life beyond Earth is a deeply profound and unanswered question within astronomy and astrobiology. The Search for Extraterrestrial Intelligence (SETI) seeks to detect intelligent life via ‘technosignatures’: artificial signals indicating technologically-capable societies. Modern technosignature searches analyse billions of frequency channels across huge portions of the electromagnetic spectrum, which results in large data volumes and computational challenges. In this presentation, I will introduce the 10-year, $100M Breakthrough Listen search for intelligent life, the methods we are using to search large datasets, and how we are using HDF5 to store many petabytes of high-resolution data.
4:40-4:45 9:40-9:45 (Break)
4:45-5:00 9:45‑10:00 MATLAB Modernization on HDF5 1.10 and Support for SWMR and VDS (Slide Deck | Video Demo | Video)
Ellen Johnson, MathWorks
This talk presents our effort at MathWorks toward modernizing on HDF5 1.10.7 and adding support for the much-requested Single-Writer/Multiple-Reader and Virtual Dataset features. We will discuss our updated 1.10.7 HDF5 functionality available today for MATLAB users in the R2021b prerelease (with R2021b full release planned for September) and would like to hear early feedback from the community. We will also discuss performance and compatibility considerations plus our tentative roadmap for future HDF5 enhancements.
5:00-5:05 10:00‑10:05 (Break)
5:05-5:20 10:05-10:20 HSDS – Serverless support with AWS Lambda and direct access (Slide Deck | Video)
John Readey, The HDF Group
The HDF Server (HSDS) enables HDF data to be read and written over an HTTP connection, but sometimes setting up a server is just too much to deal with due to cost, time, or management concerns. In this talk we’ll discuss two alternative ways to use HSDS technology while leaving the server behind. The first, HSDS for AWS Lambda, supports the HDF REST API but runs entirely using Lambda functions. The second approach is “HSDS Direct Access”, a client-side library that enables HSDS-like features exclusively on the client: object storage read and write, multi-threading support, and SQL-style queries.
5:20-5:50 10:20-10:50 Update on new HDF5 VFDs: SWMR, Mirror, Onion (Slide Deck | Video)
John Mainzer and Brian Sawicki, The HDF Group
In our talk, we will give an update, including benchmarking results, on recently released VFDs. VFD SWMR is a re-implementation of SWMR, which offers “almost” full SWMR and has the potential to support NFS and SWMR on parallel computations. The Mirror VFD allows mirroring of an HDF5 file on a remote system as it is being created on a local file system. The Onion VFD allows access to previous versions of an HDF5 file on a per open/close cycle basis.
5:50-6:30 10:50-11:30 Open Discussion
Hosted by Elena Pourmal and The HDF Group staff (HDF Update Slide | Video)
Please feel free to submit a question (technical, roadmap, etc.). This session may run beyond the stated end time.