European HDF Users Group Summer 2021
The European HDF Users Group (HUG) Summer 2021 meeting was held July 7-8, 2021, starting at 2:00 p.m. CEST each day. Sessions included a variety of talks from users throughout the HDF5 community on topics such as new storage architectures, parallel and cloud I/O, performance and debugging, analysis and visualization, wrappers and VOL connectors, new uses of HDF5, and HDF5 applications in science and industry. Everyone was welcome to present and attend; the event was simply held at times more convenient for our European community members. (Check out the Fall 2021 HUG event, with planned times more comfortable for those closer to the US Central time zone.)
Videos are linked individually in the agenda below. You can also start the playlist, which includes every session. The chat log, which contains a fair amount of discussion and many additional links, is also posted.
Agenda
Wednesday, July 7
Time (CEST) | Time (CDT) | Session |
2:00‑2:15 | 7:00-7:15 | Welcome Address (Video) Mike Folk, The HDF Group |
2:15-2:20 | 7:15-7:20 | (Break) |
2:20-2:35 | 7:20-7:35 | HDFql – the easy way to manage HDF5 data (Slide Deck | Video) Rick (Mr. HDFql), The HDFql Project and Gerd Heber, The HDF Group In this presentation, we give an overview of HDFql, a high-level language for managing HDF5 data. Inspired by the simplicity and power of SQL, and unlike typical HDF5 language bindings, HDFql is a declarative, platform-independent guest language that can be used in conjunction with many host languages (C, C++, Java, Python, C#, Fortran, R) to perform the full breadth and depth of HDF5 data management tasks. We will show how this is achieved and provide plenty of examples of HDFql in action. HDFql supports popular HDF5 features, such as HDF5 selections, parallel HDF5, and direct chunk I/O, as well as data export/import to/from MS Excel. We will give a preview of what’s on the HDFql drawing board and hope for plenty of feedback to help us steer future development. (An illustrative HDFql-from-Python sketch appears below the Wednesday agenda.) |
2:35-2:40 | 7:35-7:40 | (Break) |
2:40-2:55 | 7:40-7:55 | ESCAPE: building exabyte-scale federated storage for ESFRI communities (Slide Deck | Video) Paul Millar, DESY In this presentation, we will briefly describe ESCAPE: an ongoing EU project that has established a single collaborative cluster of next-generation European Strategy Forum on Research Infrastructures (ESFRI) facilities in the area of astronomy and accelerator-based particle physics, in order to implement a functional link between the concerned ESFRI projects and the European Open Science Cloud (EOSC). The talk will focus on WP2 “Data Infrastructure for Open Science” (DIOS). This work package is developing a blueprint for an exabyte-scale federated store for nine ESFRI communities (CTA, MAGIC, SKA, LOFAR, LSST, EGO-Virgo, FAIR, XENON1T, KM3NeT), along with running a testbed within which the scalability and manageability of that infrastructure is being demonstrated. In particular, this talk is an open invitation: we are looking for partners interested in investigating to what extent HDF5 (and the libraries that implement its support) may make better use of the technologies being prototyped within ESCAPE. |
2:55-3:00 | 7:55-8:00 | (Break) |
3:00-3:15 | 8:00-8:15 | User-Defined Functions for HDF5 (Slide Deck | Video Demo | Video) Lucas C. Villa Real, IBM Research This talk will present HDF5-UDF, an infrastructure that enables the attachment of user-defined functions — written in Python, C/C++, or Lua — to HDF5 files. Such functions are disguised as regular datasets that, once read, execute the associated code and populate the dataset contents on the fly. It is possible to create routines that access web services, that virtualize data stored in different formats, that access external devices and sensors, and many more. HDF5-UDF implements a security model that allows users to restrict the operations that user-defined functions provided by third parties can execute, as well as which parts of the file system they can access. The presentation will cover details of the project’s infrastructure and will include use cases that are driving its current and future development. (An illustrative read sketch appears below the Wednesday agenda.) |
3:15-3:20 | 8:15-8:20 | (Break) |
3:20-3:35 | 8:20-8:35 | H5Coro: The HDF5 Cloud-Optimized Read-Only Library (Slide Deck | Video) JP Swinski, NASA Goddard Space Flight Center NASA’s migration of science data products and services to AWS has sparked a debate on the best way to access science data stored in the cloud. Given that a large portion of NASA’s science data is in the HDF5 format or one of its derivatives, a growing number of efforts are looking at ways to efficiently access H5 files residing in S3. This presentation describes one of those efforts, H5Coro, and argues for the creation of a standardized subset of the HDF5 specification targeting cloud environments. H5Coro is an open-source C++ module written from scratch that implements a performant HDF5 reader for H5 files that reside in S3. It targets high-latency/high-throughput environments by minimizing the number of I/O operations through caching and intelligent range GETs. H5Coro is currently available as a C library and includes Python bindings. |
3:35-3:40 | 8:35-8:40 | (Break) |
3:40-3:55 | 8:40-8:55 | OME-NGFF: scalable format strategies for interoperable bioimaging data (Slide Deck | Video) Joshua Moore, University of Dundee Despite significant advances in biological imaging and analysis, major informatics challenges remain unsolved: file formats are proprietary, storage and analysis facilities are lacking, as are standards for sharing image data and results. OME releases specifications and software for managing image datasets and integrating them with other scientific data. OME’s Bio-Formats is a file translator that enables scientists to open and work with imaging data in the software application of their choice. OMERO is an image database application that provides data management and sharing capabilities to imaging scientists. Bio-Formats and OMERO are used in thousands of labs worldwide to enable discovery with imaging. Despite these efforts, there are still inherent limits in the existing research infrastructure available for tackling the next scale of bioimaging: the cloud. As a result, OME, in collaboration with the wider community, has begun defining a next-generation file format (OME-NGFF) to address these challenges. This talk explores lessons learned over nearly two decades of supporting bioimaging scientists and their data formats, discusses our existing open file formats as well as those under development, and proposes strategies for exchanging, publishing, and re-analyzing imaging data. The related pre-print can be found on bioRxiv: https://doi.org/10.1101/2021.03.31.437929 |
3:55-4:10 | 8:55-9:10 | (15-Minute Break) |
4:10-4:25 | 9:10-9:25 | Exploring and visualizing HDF5 file contents in JupyterLab with jupyterlab-h5web (Slide Deck | Video) HUDER Loïc, ESRF HDF5 (with NeXus) is becoming the de facto standard in most X-ray facilities. HDF5 file viewers are needed to allow users to browse and inspect the hierarchical structure of HDF5 files, as well as to visualise the datasets inside as basic plots (1D, 2D). Moreover, web-based applications such as `JupyterLab` are increasingly used, as they are easily accessed remotely. This presentation will focus on `jupyterlab-h5web`, a JupyterLab extension meant to open HDF5 files in `JupyterLab` notebooks. `jupyterlab-h5web` is based on the components of `h5web`, the open-source web-based viewer developed at the European Synchrotron Radiation Facility. These components, made to be used in other web applications such as `jupyterlab-h5web`, are built with React, a front-end web development library, and WebGL for performant visualisations. |
4:25-4:30 | 9:25-9:30 | (Break) |
4:30-4:45 | 9:30-9:45 | Gold Standard for macromolecular crystallography diffraction data (Slide Deck | Video) Herbert J. Bernstein, Ronin Institute for Independent Scholarship As reported in [Bernstein, H.J., Förster, A., Bhowmick, A., Brewster, A.S., Brockhauser, S., Gelisio, L., Hall, D.R., Leonarski, F., Mariani, V., Santoni, G. and Vonrhein, C., 2020. Gold Standard for macromolecular crystallography diffraction data. IUCrJ, 7(5)], “In the culmination of an effort dating back more than two decades, a large portion of the research community concerned with high data-rate macromolecular crystallography (HDRMX) has now agreed to an updated specification of data and metadata for diffraction images produced at synchrotron light sources and X-ray free-electron lasers (XFELs). This ‘Gold Standard’ will facilitate the processing of data sets independent of the facility at which they were collected and enable data archiving according to FAIR principles, with a particular focus on interoperability and reusability. This agreed standard builds on the NeXus/HDF5 NXmx application definition and the International Union of Crystallography (IUCr) imgCIF/CBF dictionary, and it is compatible with major data-processing programs and pipelines. Just as with the IUCr CBF/imgCIF standard from which it arose and to which it is tied, the NeXus/HDF5 NXmx Gold Standard application definition is intended to be applicable to all detectors used for crystallography, and all hardware and software developers in the field are encouraged to adopt and contribute to the standard.” We will report on updates to the CBFlib package to release 0.9.7 in support of the Gold Standard and HDF5 1.12. |
4:45-4:50 | 9:45-9:50 | (Break) |
4:50-5:05 | 9:50-10:05 | How to leverage multi-tiered storage to accelerate I/O (Slide Deck) Anthony Kougkas, Illinois Institute of Technology Modern system architectures include multiple tiers of storage organized in a hierarchy. The goal is to mask the I/O gap between compute and remote storage. However, this adds complexity for the end user, resulting in under-utilization of specialized I/O resources. In this talk, we will present Hermes, a heterogeneity-aware, multi-tiered, dynamic, and distributed I/O buffering system. Hermes aims to remove the complexities associated with multi-tiered storage environments. It offers a simple, yet powerful, buffering API that abstracts the existence of tiers of storage. Further, the Hermes adapter layer is designed to support existing legacy I/O APIs (POSIX, STDIO, MPI-IO) transparently to the user via interception. |
5:05-5:10 | 10:05‑10:10 | (Break) |
5:10-6:00 | 10:10-11:00 | Lightning Talks |
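To complement the HDFql talk above, here is a minimal sketch of what driving HDF5 from Python through HDFql's SQL-like statements might look like. The module name, statement syntax, and cursor calls are assumptions based on the talk abstract and general HDFql usage, not verified against the HDFql documentation.

```python
# Illustrative sketch only: module name, statement syntax, and cursor
# functions are assumptions and may differ from the actual HDFql API.
import HDFql

# Create a file and a small integer dataset with SQL-like statements
HDFql.execute("CREATE AND USE FILE example.h5")
HDFql.execute("CREATE DATASET measurements AS INT(3)")
HDFql.execute("INSERT INTO measurements VALUES(10, 20, 30)")

# Read the values back through HDFql's cursor
HDFql.execute("SELECT FROM measurements")
while HDFql.cursor_next() == HDFql.SUCCESS:
    print(HDFql.cursor_get_int())
```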
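For the HDF5-UDF talk, the abstract notes that UDF-backed datasets are "disguised as regular datasets," so a consumer can read them with ordinary HDF5 tooling. The sketch below uses h5py; the file and dataset names are hypothetical, and it assumes the HDF5-UDF plugin is installed where libHDF5 can find it.

```python
import h5py

# "virtual_temperature" is a hypothetical dataset whose contents are produced
# by an attached user-defined function; reading it triggers the UDF, provided
# the HDF5-UDF plugin is on the HDF5 plugin path.
with h5py.File("sensors.h5", "r") as f:
    data = f["virtual_temperature"][...]
    print(data.shape, data.dtype)
```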
Thursday, July 8
Time (CEST) | Time (CDT) | Session |
2:00‑2:15 | 7:00-7:15 | HDF5 at the ESRF (Slide Deck | Video) Wout De Nolf, ESRF This presentation provides an overview of how HDF5 is used at the ESRF. The focus will be on saving raw data from the new beamline control software BLISS. The current situation and required HDF5 developments will be discussed. |
2:15-2:20 | 7:15-7:20 | (Break) |
2:20-2:35 | 7:20-7:35 | hdf5plugin (Slide Deck | Notebook | Video) Thomas VINCENT, ESRF hdf5plugin is a Python package (1) providing a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and (2) enabling their use from the Python programming language with h5py, a thin, pythonic wrapper around libHDF5. This presentation illustrates how to use hdf5plugin for reading and writing compressed datasets from Python and gives an overview of the different HDF5 compression filters it provides. It also illustrates how the provided compression filters can be enabled to read compressed datasets from other (non-Python) applications. Finally, it discusses how hdf5plugin manages to distribute the HDF5 plugins for reuse with different versions of libHDF5. (See the illustrative sketch below the Thursday agenda.) |
2:35-2:40 | 7:35-7:40 | (Break) |
2:40-2:55 | 7:40-7:55 | Experiences with virtual datasets (Slide Deck | Video) Thomas Kluyver, European XFEL Introduced in HDF5 1.10, virtual datasets offer a combined view of data stored in several separate datasets and even separate files, presenting them as pieces of one multidimensional array. This can save copying large amounts of data into a single dataset. I’ll describe some practical uses for virtual datasets at European XFEL, in a similar context to that which motivated the design of the feature. I’ll also demonstrate how virtual datasets can be conveniently assembled in Python using the high-level API in h5py (see the sketch below the Thursday agenda). Finally, I’ll mention some issues we have encountered with the virtual dataset machinery, both at European XFEL and from issues reported to h5py, along with workarounds and possible solutions. |
2:55-3:00 | 7:55-8:00 | (Break) |
3:00-3:15 | 8:00-8:15 | H5CPP: non-intrusive persistence for Modern C++ (Slide Deck | Video) Steven Varga, VargaConsulting H5CPP is a novel approach to persistence: it provides high-performance sequential and block access to HDF5 containers through a modern C++ interface, with an easy-to-use API similar in feel to Python but with the speed of C. With its LLVM-based source code transformation tool, h5cpp-compiler, the CRUD-like, header-only templates are augmented with compiler-assisted reflection, helping you persist arbitrarily complex POD types, whether homogeneous or non-homogeneous data with contiguous/adjacent memory layout. In addition to reflection, H5CPP exploits structure, supports major linear algebra systems, and provides mechanisms to integrate with BLAS/LAPACK-based libraries. With a detailed long-term plan in mind, this project provides a scalable, seamless persistence framework from laptops to MPI-based clusters and supercomputers, offering a solid, flexible, low-latency data solution. Applications: financial markets, sensor networks, science and engineering, Matlab, Julia, R, … interop with C++, custom storage systems, etc. |
3:15-3:20 | 8:15-8:20 | (Break) |
3:20-3:35 | 8:20-8:35 | Lightning Talks |
3:35-3:40 | 8:35-8:40 | (Break) |
3:40-3:55 | 8:40-8:55 | Live Eiger Analysis with HDF5 and SWMR at Diamond Light Source (Slide Deck | Video) Graeme Winter, Diamond Light Source The current state of live X-ray diffraction data analysis for macromolecular crystallography will be presented, including a brief history of how we arrived here. The focus is on the use of SWMR and HDF5 as part of a data analysis chain for the latest generation of synchrotron X-ray diffraction detectors, capable of operating at up to 560 18-megapixel frames per second. In addition to the use of the software, the hardware necessary to support these demands will be described, as the overall system is far more than the sum of the individual parts. |
3:55-4:05 | 8:55-9:05 | (10-Minute Break) |
4:05-4:20 | 9:05-9:20 | Preliminary results of SWMR HDF5 on Spectrum Scale (Video) Lana Abadie, ITER We are going to present our requirements, the preliminary results of read and write tests on Spectrum Scale, and our next steps. |
4:20-4:25 | 9:20-9:25 | (Break) |
4:25-4:40 | 9:25-9:40 | Finding life beyond Earth with HDF5 (Slide Deck | Video) Danny C. Price, International Centre for Radio Astronomy Research (ICRAR) Are we alone? The prevalence of life beyond Earth is a deeply profound and unanswered question within astronomy and astrobiology. The Search for Extraterrestrial Intelligence (SETI) seeks to detect intelligent life via ‘technosignatures’: artificial signals indicating technologically-capable societies. Modern technosignature searches analyse billions of frequency channels across huge portions of the electromagnetic spectrum, which results in large data volumes and computational challenges. In this presentation, I will introduce the 10-year, $100M Breakthrough Listen search for intelligent life, the methods we are using to search large datasets, and how we are using HDF5 to store many petabytes of high-resolution data. |
4:40-4:45 | 9:40-9:45 | (Break) |
4:45-5:00 | 9:45-10:00 | MATLAB Modernization on HDF5 1.10 and Support for SWMR and VDS (Slide Deck | Video Demo | Video) Ellen Johnson, MathWorks This talk presents our effort at MathWorks toward modernizing on HDF5 1.10.7 and adding support for the much-requested Single-Writer/Multiple-Reader and Virtual Dataset features. We will discuss our updated HDF5 1.10.7 functionality, available today for MATLAB users in the R2021b prerelease (with the R2021b full release planned for September), and would like to hear early feedback from the community. We will also discuss performance and compatibility considerations, plus our tentative roadmap for future HDF5 enhancements. |
5:00-5:05 | 10:00‑10:05 | (Break) |
5:05-5:20 | 10:05-10:20 | HSDS – Serverless support with AWS Lambda and direct access (Slide Deck | Video) John Readey, The HDF Group The HDF Server (HSDS) enables HDF data to be read and written over an HTTP connection, but sometimes setting up a server is just too much to deal with due to cost, time, or management concerns. In this talk we’ll discuss two alternative ways to utilize HSDS technology while leaving the server behind. The first, HSDS for AWS Lambda, supports the HDF REST API but runs entirely using Lambda functions. The second approach is “HSDS Direct Access”, a client-side library that enables HSDS-like features exclusively on the client: object storage read and write, multi-threading support, and SQL-style queries. (See the illustrative client sketch below the Thursday agenda.) |
5:20-5:50 | 10:20-10:50 | Update on new HDF5 VFDs: SWMR, Mirror, Onion (Slide Deck | Video) John Mainzer and Brian Sawicki, The HDF Group In our talk, we will give an update, including benchmarking results, on recently released VFDs. VFD SWMR is a re-implementation of SWMR that offers “almost” full SWMR and has the potential to support SWMR over NFS and in parallel computations. The Mirror VFD allows mirroring of an HDF5 file on a remote system as it is being created on a local file system. The Onion VFD allows access to previous versions of an HDF5 file on a per open/close cycle basis. |
5:50-6:30 | 10:50-11:30 | Open Discussion Hosted by Elena Pourmal and The HDF Group staff (HDF Update Slide | Video) Please feel free to submit a question (technical, roadmap, etc.). This session may run beyond the stated end time. |
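As a companion to the hdf5plugin talk, here is a minimal sketch of writing and reading a Blosc-compressed dataset from Python with h5py. The filter parameters shown (codec name, compression level, shuffle) are illustrative choices, not recommendations from the talk.

```python
import h5py
import hdf5plugin  # importing it registers the bundled filters with libHDF5
import numpy as np

data = np.random.random((1000, 1000))

# Write a dataset compressed with the Blosc filter shipped by hdf5plugin
with h5py.File("compressed.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        **hdf5plugin.Blosc(cname="zstd", clevel=5,
                           shuffle=hdf5plugin.Blosc.SHUFFLE),
    )

# Reading back only requires that hdf5plugin has been imported
with h5py.File("compressed.h5", "r") as f:
    restored = f["data"][...]
```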
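For the virtual datasets talk, the high-level h5py API mentioned in the abstract can assemble a virtual dataset in a few lines. This sketch stitches four hypothetical single-array files into one 2-D view; the file and dataset names are placeholders.

```python
import h5py

# Describe the combined (virtual) array: 4 source files, 100 samples each
layout = h5py.VirtualLayout(shape=(4, 100), dtype="f8")

for i in range(4):
    # Each source file is assumed to contain a 1-D dataset named "data"
    source = h5py.VirtualSource(f"part_{i}.h5", "data", shape=(100,))
    layout[i] = source

# The virtual dataset reads like a normal 4x100 dataset, without copying data
with h5py.File("combined.h5", "w") as f:
    f.create_virtual_dataset("vdata", layout, fillvalue=0)
```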
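To illustrate the HSDS talk, the sketch below reads a dataset through the HDF REST API using h5pyd, the h5py-compatible client commonly used with HSDS. The domain path and endpoint are hypothetical placeholders, and the "Direct Access" mode described in the talk is not shown here.

```python
import h5pyd  # h5py-like client that talks to HSDS over HTTP

# Domain path and endpoint are hypothetical placeholders
with h5pyd.File("/shared/example/data.h5", "r",
                endpoint="http://hsds.example.org") as f:
    dset = f["measurements"]
    print(dset.shape, dset.dtype)
    chunk = dset[0:10]  # data is fetched from the server on demand
```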