HDF5 Users Group 2020 Agenda
TUESDAY, OCTOBER 13, 2020 – HDF5 HANDS-ON TUTORIAL
This tutorial assumes attendees have some HDF5 knowledge and would like to learn more about HDF5 capabilities.
The goal of this HDF5 tutorial is to introduce our users to a diverse set of tools and techniques for working with HDF5 data and applications. These tools and techniques can be used regardless of where HDF5 data is stored, i.e., local disk, parallel file system, or Cloud. We will be using an HPC cluster in the Cloud so that participants can gain first-hand experience with HDF5’s capabilities and learn about the HDF5 ecosystem. The tutorial consists of the four sections listed below. Participants can join at any time to attend the section of interest, but we encourage everyone to join at the beginning to learn about the HPC cluster in the Cloud that will be used for the hands-on sessions.
Start Time (Central) | Topic | Speaker | Abstract |
9:00 a.m. | Welcome (Video) | Elena Pourmal, The HDF Group | |
9:00 a.m. | Introduction to H5CLUSTER (Video) | Steven Varga, independent researcher, VargaConsulting | In this section we will introduce H5CLUSTER, the HPC cluster we will use throughout the tutorial. We will provide a brief overview of techniques for writing and managing HDF5-based applications in a parallel environment. The session introduces the SLURM resource manager and the Spack package manager, and provides an opportunity to review the software tools used in the following sessions. 45-minute session, 15-minute Q&A/break. |
10:00 a.m. | HDF5 Ecosystem (Videos: rhdf5 & HDFql, Using Python locally, Using Python in the Cloud) | Gerd Heber, The HDF Group; Steven Varga, independent researcher, VargaConsulting; Aleksandar Jelenak, The HDF Group; John Readey, The HDF Group | In this section of the tutorial, you will get a glimpse of the diversity of the HDF5 ecosystem. Give your HDF5 productivity a boost, whether you are working on a local system or in the Cloud! rhdf5 and HDFql (Gerd, 30 min); H5CPP (Steven, 15 min); using Python to work with HDF5 data locally and in the Cloud (Aleksandar and John, 30 min; see the sketch below); Q&A and break, 15 min. |
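As a taste of the local-Python portion of this session, here is a minimal h5py sketch; the file name and dataset path are hypothetical examples, not files provided by the tutorial.

```python
# Open a local HDF5 file read-only and inspect a dataset.
import h5py

with h5py.File("example.h5", "r") as f:   # hypothetical file name
    dset = f["/temperature"]              # hypothetical dataset path
    print(dset.shape, dset.dtype)         # metadata only; no data read yet
    print(dset[0:10])                     # read just the first ten values
```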
11:30 a.m. | Parallel HDF5 quick start and tuning knobs (Video) | Scot Breitenfeld, The HDF Group; Quincey Koziol, Principal Data Architect, Lawrence Berkeley National Lab (LBNL) | In this part of the tutorial, we will give a quick introduction to parallel HDF5 and an overview of several tuning knobs for parallel applications (see the sketch below). Q&A and break, 15 min. |
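For a flavor of what this session covers, here is a minimal sketch of parallel HDF5 driven from Python through h5py’s MPI-IO driver; it assumes a parallel build of HDF5 and h5py, and the file and dataset names are hypothetical.

```python
# Run with, e.g.: mpiexec -n 4 python parallel_write.py
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Open the file collectively using the MPI-IO virtual file driver.
with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("data", (comm.Get_size(), 1000), dtype="f8")
    # Each rank writes its own row; collective vs. independent transfer
    # mode is one of the tuning knobs discussed in the session.
    with dset.collective:
        dset[rank, :] = rank
```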
12:30 p.m. | HDF5 Performance Troubleshooting (Video) | Gerd Heber, The HDF Group | Just as a three-legged table doesn’t wobble, the three tools presented in this part of the tutorial form a solid basis for analyzing and diagnosing HDF5 performance issues. Minimize the guesswork and maximize your chances of closing the gap between benchmark results and your application’s performance by following the figures! Q&A and adjourn, 15 min. |
Wednesday, October 14, 2020 – HDF5 Features
Questions/comments can be placed into this Google document.
HDF5 Features Video Playlist
Start Time (Central) | Topic | Speaker | Abstract |
9:00 a.m. | Welcome Address | Mike Folk, Interim Executive Director, The HDF Group | |
9:05 a.m. | The HDF Group’s Business Model | Mike Folk, Interim Executive Director, The HDF Group; Dax Rodriguez, Director of Commercial Services and Solutions, The HDF Group | The HDF Group’s mission is to ensure the sustainable development of open-source HDF technologies and the ongoing accessibility of HDF-stored data. This includes meeting the high technical standards required by our community while providing a satisfying and productive business environment for our staff. To support this mission and the organizations that depend on our technologies for their mission-critical systems, we pursue an open-source software model that offers permissive, free software licenses. In addition, we rely heavily on collaboration with our community and on funding through support subscriptions and consulting agreements. The HDF Group’s primary revenue comes from government-supported engagements, but we are gradually winning more commercial agreements and developing other ways to diversify our revenue. |
9:15 a.m. | HDF5: Toward a Community-Driven Open Source Project | Elena Pourmal, Director of Technical Services and Operations, The HDF Group | HDF5 is an open-source, high-performance technology suite that consists of an abstract data model, library, and file format used for storing and managing extremely large and/or complex data collections. The technology is used worldwide by government, industry, and academia in a wide range of science, engineering, and business disciplines. In my talk, I will present the HDF5 roadmap, including upcoming releases, new features under development and the steps The HDF Group has taken to revamp HDF5 as a community-driven Open Source project. |
9:45 a.m. | Multithreaded Concurrency Efforts | Quincey Koziol, Principal Data Architect, Lawrence Berkeley National Lab (LBNL); Chris Hogan, The HDF Group | The HDF5 library is currently thread-safe but not concurrent: only one thread is allowed to execute within the HDF5 library at a time. This presentation will discuss current efforts to allow fully concurrent execution of all HDF5 API routines from multiple threads. The technical approach will be outlined, and opportunities for community engagement and contributions presented. |
10:05 a.m. | BREAK – 10 minutes | ||
10:15 a.m. | Experimental and Observational Data and Superfacility | Quincey Koziol, Principal Data Architect, Lawrence Berkeley National Lab (LBNL); Suren Byna, Staff Scientist, Lawrence Berkeley National Lab (LBNL) | Technological advances are pushing the sizes of data produced at experimental and observational facilities past terabytes and into the petabyte realm. Management of these massive experimental and observational data (EOD) sets poses unique challenges to HDF5. To handle these challenges, we have added features to HDF5, including capabilities to mirror data between two systems, search massive datasets, capture provenance, and manage versions. We have also improved the performance of appending streaming data to existing files by 10x. This presentation discusses the EOD management challenges, our recent advances in HDF5 to address them, and the challenges that remain. |
10:35 a.m. | HDF for Cloud – HSDS server | John Readey, Senior Architect, The HDF Group | HSDS is a REST-based HDF server designed to work efficiently in the cloud or in on-premises deployments. This talk will review the features and architecture of HSDS (see the client-side sketch below). |
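For context, a hedged client-side sketch using h5pyd, the h5py-compatible Python client for HSDS; the domain path, endpoint, and dataset name are hypothetical.

```python
# h5pyd mirrors the h5py API but talks to an HSDS server over REST.
import h5pyd

f = h5pyd.File("/home/myuser/sample.h5", "r",
               endpoint="http://hsds.example.org:5101")  # hypothetical endpoint
print(list(f.keys()))      # group listing fetched from the server
dset = f["dset1"]          # hypothetical dataset name
print(dset[0:10])          # slice reads become HTTP requests to HSDS
f.close()
```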
11:15 a.m. | FIREfly – A Scientific Data Server for Acquiring and Analyzing Test Data | Mike Folk, Interim Executive Director, The HDF Group | The US Air Force expends significant resources acquiring and analyzing test data. Much of this data resides in data repositories that are isolated from each other, with limited data search capabilities within them and limited data access between them. This in turn impedes data discovery, slows down data analysis turnaround times, and constrains larger-scale data analyses. FIREfly is a demonstration of how HDF5 and the Highly Scalable Data Server can address these challenges. FIREfly is an open-source, web-based, scalable data server capable of ingesting flight data, performing queries on the data, performing server-side analysis of the data, and delivering data to be visualized, analyzed, and downloaded to a client’s workstation. The FIREfly server can be deployed in both private and public cloud environments, and can be adapted to provide the same capabilities for other scientific, engineering, and business data collections. |
11:15 a.m. | The Use of HSDS on SlideRule | JP Swinski, NASA/GSFC | The NASA/ICESat-2 program is investing in a collaboration between Goddard Space Flight Center and the University of Washington to develop a cloud-based on-demand science data processing system called SlideRule to lower the barrier of entry to using the ICESat-2 data for scientific discovery and integration into other data services. SlideRule is a server-side framework implemented in C++/Lua that provides REST APIs for processing science data and returning results. We have chosen HDF5 and HSDS as our first supported data format and service for SlideRule. In this presentation, we will discuss (1) why we chose HSDS, and (2) what direction we want HSDS to take in order to meet the future needs of our project. |
11:35 a.m. | BREAK – 10 minutes | ||
11:45 a.m. | h5web: a web-based viewer of HDF5 files | Loic Huder, ESRF | HDF5 (with NeXus) is becoming the de facto standard in most X-ray facilities. However, it is not always easy to navigate such files to get quick feedback on the data, due to the peculiar structure of NeXus files. HDF5 file viewers are one way to solve this issue: they let users browse and inspect the hierarchical structure of HDF5 files, as well as visualise the datasets they contain as basic plots (1D, 2D, 3D). This presentation will focus on `h5web`, the open-source web-based viewer we are developing at the European Synchrotron Radiation Facility. Our intent is to provide synchrotron users with an easy-to-use application and to make open-source components available for other similar web applications. `h5web` is built with React, a front-end web development library. It supports the exploration of HDF5 files, requested from a separate back-end (e.g. HSDS) for modularity, and the visualisation of datasets using performant WebGL-based visualisations. Demo at https://h5web.panosc.eu/ |
12:05 p.m. | Open Energy Data Initiative | Michael N. Rossol, National Renewable Energy Laboratory | Historically, the sharing of scientific data, both within collaborative teams and with the public, was facilitated through data repositories. In this paradigm, the data is physically delivered to the analysts. The issues with this approach are numerous (e.g., difficulty finding data stored across multiple repositories, non-standard data formats and types), but they are particularly problematic as data has become large (GB to TB in size), primarily due to the prohibitive cost of storing and managing data and the technical difficulties in providing timely access. The Open Energy Data Initiative (OEDI) aims to solve these issues through the development of a Department of Energy (DOE) “data lake” hosted on Amazon Web Services (AWS). The data-lake approach will provide a single location to store and access DOE’s open datasets within an ecosystem (AWS) that can provide the tools and computational power needed to properly access, analyze, and leverage truly “big” data. The data lake will leverage the cloud and all of the leading-edge resources available to ignite innovation, build capabilities, expand solutions, and develop new businesses. Thus, OEDI aims to advance DOE’s approach to open data in an effort to better support data access, scientific analysis, and innovation across the public, academic, and private sectors. |
12:25 p.m. | GPU Direct I/O VFD for HDF5 | John Ravi, Graduate Student, NCSU | As large-scale computing systems move towards using GPUs as the workhorses of computing, the file I/O that moves data between GPUs and storage devices becomes critical. I/O performance-optimizing technologies, such as NVIDIA’s GPUDirect Storage (GDS), are key to reducing the latency of data movement between GPUs and storage. In this presentation, we will talk about a recently developed virtual file driver (VFD) that takes advantage of GDS, allowing data transfers between GPUs and storage without using CPU memory as a “bounce buffer”. |
12:30 p.m. | New VFDs and SWMR Redesign and Re-Implementation | John Mainzer, Principal Architect, The HDF Group | A brief introduction to several new or re-implemented features in HDF5. The Mirror VFD allows mirroring of an HDF5 file on a remote system as it is being created on a local file system. VFD SWMR is a re-implementation of SWMR that offers “almost” full SWMR semantics and has the potential to support SWMR over NFS and in parallel computations. The Onion VFD allows access to previous versions of an HDF5 file on a per open/close-cycle basis. |
12:45 p.m. | Roundtable Discussion (Video) | The HDF Users Group Committee: Elena Pourmal, The HDF Group; Quincey Koziol, Principal Data Architect, Lawrence Berkeley National Lab (LBNL); Suren Byna, Staff Scientist, Lawrence Berkeley National Lab (LBNL); Lori Cooper, The HDF Group | |
Thursday, October 15, 2020 – HDF5 VOLs and Apps
Questions/comments can be placed into this Google document.
HDF5 VOLs and Apps Video Playlist
Start Time (Central) | Topic | Speaker | Abstract |
9:00 a.m. | Morning Check-in | ||
9:05 a.m. | H5Glance: Explore HDF5 files in a terminal or a notebook | Dr. Thomas Kluyver, European XFEL | There’s HDFView, ViTables, h5web, and more. Who needs yet another tool for viewing HDF5 files? Well, we did at European XFEL. We often work with HDF5 files in an SSH session or a Jupyter notebook, and we wanted to explore the files without leaving those contexts. So we made H5Glance, which works both as a shell command and as a Python library to use with h5py. I’ll show what H5Glance can do in both contexts (see the sketch below), and explain some of the specific features that help it fit into our workflows. |
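A brief sketch of the notebook usage, as we understand the H5Glance API; the file name is hypothetical.

```python
# In a Jupyter notebook, H5Glance renders an expandable tree view of an
# h5py File or Group. The shell equivalent would be: h5glance run_0042.h5
import h5py
from h5glance import H5Glance

f = h5py.File("run_0042.h5", "r")   # hypothetical file name
H5Glance(f)                         # displays the file hierarchy inline
```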
9:25 a.m. | Liberate Real-Time Data into HDF5 | Ted Selig, COO, FishEye Software, Inc. | Real-time systems generate massive flows of complex data, making it difficult to avoid data overflow without losing the critical information required for accurate analysis and clean input for machine learning. FishEye, which has been building real-time radars for over 20 years, has seen the lack of technology able to easily liberate complex machine data from real-time systems for machine learning and real-time analysis. In this presentation, FishEye will demonstrate a lambda architecture that includes high-performance data collection; a scalable architecture that dynamically curates and routes a real-time data flow to real-time machine-learning prediction, analysis, and visualization; and HDF5 for post-process machine training and multi-run analysis. |
9:45 a.m. | LOFS: A simple file system for massively parallel cloud models | Leigh Orf, Research Scientist, Space Science and Engineering Center, UW-Madison | We present an overview of the Lack Of File System (LOFS), a file-based file system composed of HDF5 files written serially but concurrently, which is being used to conduct, analyze, and visualize the world’s highest-resolution tornadic thunderstorm simulations on the NSF-sponsored Blue Waters and Frontera supercomputers. LOFS exploits two features of HDF5: the core driver, which allows HDF5 files to be created and buffered in memory, and external compression plugins such as lossy ZFP floating-point compression (see the sketch below). During each save cycle of the CM1 cloud model, modified to utilize LOFS, one file per compute node (assembled to contain a contiguous subdomain of model data spanning each node) is grown in memory, with multiple variables and multiple times organized using HDF5 groups. Periodically (every 50-100 save cycles), files are flushed to disk, reducing the latency associated with frequent disk writes. This combination of buffering and lossy compression allows data to be saved at very small time intervals, as small as the model’s time step. Post-processing, conversion, and visualization routines utilizing LOFS data will be described. We will show how the features of HDF5 enable research on tornadoes that would not otherwise be possible. |
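The two HDF5 features LOFS leans on can be sketched from Python: below, h5py’s core driver buffers the whole file in memory and only writes it to disk when the file is closed. This is a generic illustration, not LOFS itself; gzip stands in for the ZFP compression plugin, and all names are hypothetical.

```python
import numpy as np
import h5py

# The 'core' driver keeps the file in memory; backing_store=True
# flushes the buffered file to disk when it is closed.
with h5py.File("buffered.h5", "w", driver="core", backing_store=True) as f:
    g = f.create_group("00010")                 # hypothetical save-cycle group
    g.create_dataset("u", data=np.random.rand(64, 64, 64),
                     chunks=(16, 16, 16),
                     compression="gzip")        # stand-in for lossy ZFP
```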
10:05 a.m. | Leveraging HDF5 infrastructure to build interoperable and contextualized data in Pharma using semantic technology | Amnon Ptashek, Allotrope Foundation | Data interoperability is fundamental to transforming the acquisition, exchange, and management of pharmaceutical laboratory data throughout its complete lifecycle. Leveraging the HDF5 file format and infrastructure, and using World Wide Web Consortium (W3C) semantic technology, the Allotrope Data Format (ADF) and the associated Allotrope Data Model (ADM), together with a controlled vocabulary, provide a mechanism to model simple as well as rich use-case scenarios. The presentation covers the technical aspects of utilizing HDF5 and the modeling process, from the fundamental structure of the file format and models all the way to validation. It covers the technology at a high level but at the same time dives into fundamentals such as W3C RDF, the Allotrope Ontology, modeling, and W3C Shapes Constraint Language (SHACL) validation. |
10:25 a.m. | BREAK – 10 minutes | ||
10:35 a.m. | tar2h5: Small Files Packer for Machine Learning Tasks | Dawei Mu, NCSA | Nowadays, many deep learning workloads display small random-read I/O patterns. Researchers, especially those newly adopting deep learning methods, tend to save their datasets as millions of small files. Such usage can overload the storage system, and all users can suffer severe performance reduction. In this work, we introduce our data converter tool, tar2h5, developed to help deep learning users convert numerous small files into a single HDF5 file to improve I/O performance (see the sketch below). We implemented multiple data layouts to meet various requirements, including h5compactor, h5compactor-sha1, and h5shredder. According to preliminary user reports, loading data from the HDF5 file made training 5x faster. |
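A hypothetical sketch of the general technique (not the tar2h5 tool itself): pack a directory of small files into one HDF5 file so that training reads touch a single file.

```python
import os
import numpy as np
import h5py

def pack_dir(src_dir, out_path):
    """Store each small file as a byte dataset keyed by its file name."""
    with h5py.File(out_path, "w") as f:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path):
                continue                      # skip subdirectories
            with open(path, "rb") as fp:
                payload = np.frombuffer(fp.read(), dtype=np.uint8)
            f.create_dataset(name, data=payload)

pack_dir("train_images", "train_images.h5")   # hypothetical paths
```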
10:55 a.m. | The PIO Library for Scalable HPC Performance | Ed Hartnett, NOAA/CIRES | The PIO C and Fortran libraries enable high-performance, scalable I/O on HPC systems with many processors. Doing I/O from many processors at the same time causes system contention and inefficiencies. Instead, users may select a small number of processors to be responsible for all I/O. Code on the computational processors calls netCDF I/O functions as usual, but instead of writing directly to disk, the data are sent with MPI to the I/O processors, which perform the disk I/O. The PIO libraries are available in C and Fortran, and work with Unidata’s netCDF package, the parallel-netcdf package from Argonne National Laboratory, and the HDF5 library from The HDF Group. The PIO libraries are maintained and distributed by NCAR and NOAA, and are free and open software. Recent improvements in PIO include full netCDF integration, allowing users to use existing netCDF code bases with little modification. |
11:15 a.m. | Intro to VOLs and Implementing VOL Connectors | Quincey Koziol, Principal Data Architect, Lawrence Berkeley National Laboratory | The Virtual Object Layer (VOL) is a new abstraction layer within the HDF5 library that redirects I/O operations into a VOL “connector” immediately after an API routine is invoked. This talk will describe the VOL architecture, the currently implemented VOL connectors and their capabilities, and how to create new connectors that extend HDF5 with new capabilities or storage mechanisms (see the sketch below). |
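As one concrete touchpoint: since HDF5 1.12, a dynamically loaded VOL connector can be selected through environment variables, so even an unmodified Python script can be redirected. The connector name and plugin path below are hypothetical.

```python
import os

# Tell HDF5 where to find connector plugins and which connector to use.
os.environ["HDF5_PLUGIN_PATH"] = "/opt/hdf5-vols/lib"   # hypothetical path
os.environ["HDF5_VOL_CONNECTOR"] = "my_connector"       # hypothetical name

import h5py   # import after setting the variables so HDF5 sees them

with h5py.File("data.h5", "r") as f:   # I/O now flows through the connector
    print(list(f.keys()))
```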
11:35 a.m. | Async I/O VOL: Transparent asynchronous I/O using background threads | Houjun Tang, Computer Research Scientist, Lawrence Berkeley National Laboratory | This talk presents an asynchronous I/O framework that utilizes background threads for I/O task execution. It supports all types of HDF5 I/O operations, both collective and independent; requires no additional servers; manages data dependencies transparently and automatically for users; and requires minimal code modification. Our asynchronous I/O implementation, packaged as an HDF5 VOL connector, demonstrates the effectiveness of hiding the I/O cost from the application, with low overhead and an easy-to-use programming interface. |
11:55 a.m. | BREAK – 10 minutes | ||
12:05 p.m. | Caching VOL: Efficient Parallel I/O through Caching Data on Node-local Storage | Huihuo Zheng, PhD, Assistant Computer Scientist, Argonne National Laboratory | We present an approach inside the HDF5 library that incorporates node-local storage as a cache for the parallel file system to improve parallel I/O efficiency. We implemented the feature within the HDF5 Virtual Object Layer framework so that existing HDF5 applications can use it with minimal code modification. In this talk, we will present the details of our prototype design as well as an initial performance evaluation. |
12:25 p.m. | Advancing HDF5’s parallel I/O for Exascale with DAOS | Jerome Soumagne, The HDF Group | The existing parallel HDF5 API has been mostly designed around POSIX I/O, which has been a limiting factor for achieving performance at extreme scale. This presentation will give an overview of the upcoming HDF5 VOL connector that interfaces with Intel’s DAOS distributed file system. We will focus on the new features that this connector is able to provide to HDF5 and how it can steer applications in a new direction for doing parallel I/O through HDF5. |
12:45 p.m. | REST VOL and sharded data storage for HDF | John Readey, Senior Architect, The HDF Group | The REST VOL enables existing HDF5 applications to utilize the REST-based HSDS server. This talk will discuss how the REST VOL works and future plans for it. There will also be a review of the sharded storage model used by HSDS. |
1:05 p.m. | Log-structured VOL (Video) | Kai-yuan Hou | HDF5 is an I/O library widely adopted in scientific applications due to its flexible data model and a wide range of functions that allow applications to organize complex data, along with a wide variety of metadata, in a hierarchical way. However, the performance of HDF5 can suffer when the I/O pattern is complex. One contributing factor is the way HDF5 stores its datasets in the file. HDF5’s contiguous storage layout requires inter-process communication to reorder the data into canonical order. HDF5 also provides a chunked storage layout that has been shown to improve I/O performance in many applications; however, we found that it has little to no effect on I/O patterns consisting of a high volume of noncontiguous and irregular I/O requests. In this work, we introduce the Log I/O VOL, an HDF5 Virtual Object Layer (VOL) plug-in that enables a log-based storage layout for HDF5 datasets. Instead of arranging the data into canonical order, the Log I/O VOL stores the data as-is, alongside metadata that describes its logical location. In this way, the expensive overhead of rearranging the data is deferred to the reading stage. The Log I/O VOL thus provides an effective way for applications that are less sensitive to read performance to trade it for write performance. |
Friday, October 16, 2020 – The HDF5 Ecosystem
Questions/comments can be placed in this Google document.
HDF5 Ecosystem Presentations Video Playlist
Start Time (Central) | Topic | Speaker | Abstract |
9:00 a.m. | Morning Check-in | ||
9:05 a.m. | Template-Based Persistence of C++ Class Types with Non-Contiguous Memory Layout (Video) | Steven Varga, independent researcher, VargaConsulting | Abstract |
9:25 a.m. | h5py: A bridge between HDF5 and Python | Dr. Thomas Kluyver, European XFEL | HDF5 and the Python language, both popular tools in science and data analysis, are ideal partners. The popular NumPy library, which provides multidimensional arrays in Python, closely matches the core features of HDF5 datatypes & dataspaces. I’ll show how the h5py bindings, built on the HDF5 C API, let you read & write HDF5 files with a convenient, high-level interface. This includes advanced HDF5 features such as creating virtual datasets (see the sketch below). |
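As an example of the virtual dataset feature mentioned above, here is a hedged sketch that stitches four per-file sources (hypothetical names) into one logical dataset.

```python
import h5py

# Map four 100-element source datasets, one per file, into a 4x100 view.
layout = h5py.VirtualLayout(shape=(4, 100), dtype="f8")
for i in range(4):
    layout[i] = h5py.VirtualSource(f"rank{i}.h5", "data", shape=(100,))

with h5py.File("vds.h5", "w") as f:
    f.create_virtual_dataset("data", layout, fillvalue=0)
```

Reading vds.h5 then pulls data transparently from the four source files.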
9:45 a.m. | Enhancing the Performance and Scalability of Third-Party Libraries in the Sandia I/O Software Stack | Greg Sjaardema, Sandia National Laboratories | The HPC I/O software stack used by the majority of the finite element analysis codes and corresponding workflow applications at Sandia National Laboratories uses the IOSS and Exodus libraries, which are based on the NetCDF and CGNS mid-level third-party libraries (TPLs). These libraries have been available for several years and are in general use throughout the National Laboratories. Although each TPL is intended for use in MPI-parallel applications, they are not generally used at the scales encountered in the HPC environment. Both of these TPLs use the HDF5 library for on-disk storage. In this talk I will describe several HDF5-related performance enhancements that have been applied to the TPLs to improve both their serial performance and their large-scale parallel scalability. In some cases, changes have improved I/O performance by three orders of magnitude. The performance enhancements are, in most cases, obtained by using recently released HDF5 features such as collective metadata, compact storage, and others. |
10:05 a.m. | HDF5-UDF: user-defined functions in HDF5 using Lua, Python and C/C++ | Lucas C. Villa Real, IBM Research | Scientific processes related to physical simulations and to observations of real-life phenomena are known to generate large data volumes. In particular, many data pre- and post-processing tasks (e.g., dataset aggregation and filtering) create variations of the original data, causing a direct impact on storage capacity. This talk will present HDF5-UDF, a mechanism that enables datasets to be described through user-defined functions in Lua, Python, and C. Unlike a regular dataset (whose data goes to disk), UDFs are compiled into a binary form (which often takes no more than 5 KB) and embedded into the HDF5 file; once an application requests to read the corresponding dataset, HDF5-UDF executes that binary code and generates the data on the fly (see the sketch below). The talk will go through the design and implementation of HDF5-UDF and will cover several use cases that may be of interest to the HDF community. |
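A heavily hedged sketch of what a Python UDF looks like, in the style of the HDF5-UDF examples; the injected lib helper and its getData/getDims calls reflect our reading of the project’s interface, and all dataset names are hypothetical.

```python
# udf_celsius.py -- a UDF script; 'lib' is injected by HDF5-UDF at run time,
# so this is not standalone Python. Attached with something like:
#   hdf5-udf data.h5 udf_celsius.py  (see the project docs for exact syntax)
def dynamic_dataset():
    out = lib.getData("celsius")      # buffer backing the UDF-described dataset
    kelvin = lib.getData("kelvin")    # an existing dataset read from the file
    dims = lib.getDims("celsius")
    for i in range(dims[0]):
        out[i] = kelvin[i] - 273.15   # values generated on the fly at read time
```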
10:25 a.m. | BREAK – 10 minutes | ||
10:35 a.m. | FasTensor: Pain-free HDF5 data analysis at large scale | Bin Dong, PhD, Research Scientist, Lawrence Berkeley National Laboratory | HDF5 is a well-founded file container for large-scale scientific data sets in various applications. Extracting meaningful knowledge from these HDF5 files, however, remains a challenge. One root cause is the mismatch between the data model in HDF5 (the n-dimensional array) and the one that modern big-data analysis systems like MapReduce or Spark recognize. In this talk, we will introduce a new way of processing HDF5 files using FasTensor (or ArrayUDF). FasTensor allows users to run custom analysis functions directly on HDF5 files. It also frees users from data management and parallelization tasks when processing large-scale HDF5 files. We have observed over 1000x performance improvement compared to Spark and other peer systems. We will show our use cases in distributed acoustic sensing and in particle physics. |
10:55 a.m. | Status of HDF5 usage at ITER | Dr. Lana Abadie, responsible for data archiving and handling at ITER | Complex data structures need to be stored and retrieved. We explain how we model the data structure in HDF5 to achieve good write speed and efficient data retrieval. The data access layer must be able to return the full set of data for a given time range, as well as a dedicated field for a given time range. In this work, we use HDF5 1.12 with SWMR. We will also present some performance results for writing and reading HDF5 data on GPFS. |
11:15 a.m. | Neurodata Without Borders – An Ecosystem for Standardizing Diverse Neurophysiology Data | Andrew Tritt, Lawrence Berkeley National Lab | The neurophysiology of cells and tissues is monitored electrophysiologically and optically during diverse tasks in species from flies to humans. A major impediment to extracting value from neurophysiology data collection is the lack of standards for data and metadata. Here, I will describe design and implementation principles for the extensible standardization of diverse neurophysiological data. Our software architecture and ecosystem (Neurodata Without Borders, NWB) defines and modularizes the interdependent, yet separable, components of data specification language, format specification, storage format, translation, and interfaces. Our storage format of choice is HDF5. In this talk, I will discuss how we use features of HDF5 to store complex neurophysiology data within this ecosystem (see the sketch below). |
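For orientation, a minimal pynwb sketch; pynwb writes standard HDF5 under the hood, and the session details here are hypothetical.

```python
from datetime import datetime, timezone
from pynwb import NWBFile, NWBHDF5IO

# Create an NWB file object with the required session metadata.
nwbfile = NWBFile(
    session_description="example recording session",   # hypothetical
    identifier="session-0001",                          # hypothetical
    session_start_time=datetime.now(timezone.utc),
)

with NWBHDF5IO("session-0001.nwb", "w") as io:
    io.write(nwbfile)   # the resulting .nwb file is ordinary HDF5
```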
11:35 a.m. | Experiences Integrating HDF5 into DREAM.3D | Mr. Michael A. Jackson, Owner/Software Engineer, BlueQuartz Software | DREAM.3D is an open source materials science and engineering focused analytics application. DREAM.3D fosters collaboration between groups of researchers due to its use of openly available file formats, most notably, the HDF5 file format. This presentation will cover the use of HDF5 as the “native” file format for DREAM.3D files, how the use of HDF5 promoted the interchange of data between various research groups, experiences in getting instrument OEMs to support HDF5, and thoughts on how the community can help and contribute to the HDF5 project. We will offer perspectives from a small business dedicated to open source software (both use and release of) and how that impacts business activities. |
11:55 a.m. | BREAK – 10 minutes | ||
12:05 p.m. | The MACSio parallel I/O proxy application and HDF5 plugin | Mark C. Miller, Computer Scientist, Lawrence Livermore National Lab | MACSio is a Multi-purpose, Application Centric, Scalable I/O proxy application. MACSio generates test data in terms of fields defined on domain-decomposed meshes. Individual I/O plugins, each supporting a specific I/O library such as HDF5, Silo, or Exodus, then marshal MACSio’s mesh and field data to and from disk files using a variety of parallel I/O paradigms such as Multiple Independent File (MIF), Single Shared File (SSF) and File Per Processor (FPP – just a special case of MIF). This talk will motivate why MACSio is designed as it is and discuss key design features and capabilities. It will focus on the design of the HDF5 plugin and present some performance data from testing MIF and SSF I/O paradigms on parallel file systems at various LCFs including the use and impact of compression filters including H5Z-ZFP. |
12:25 p.m. | Characterizing and Understanding the Behavior of HDF5 I/O Workloads with Darshan | Shane Snyder, Argonne National Laboratory | Understanding the I/O behavior of HPC applications is critical to ensuring their efficient use of storage system resources. However, this is a challenging task given the growing depth and complexity of the I/O stack on these systems, where multiple software layers often coordinate to optimize I/O workloads for the underlying storage hardware. Darshan is a lightweight I/O characterization tool that helps users navigate this complex landscape by producing condensed summaries of an application’s I/O behavior, including total I/O operation counts, histograms of file access sizes, and cumulative timers, among other statistics. In this presentation, we introduce new Darshan extensions to enable the characterization of HDF5 file and dataset accesses, providing a better understanding of how HPC applications use HDF5 and how HDF5 utilizes underlying MPI-IO and file system interfaces. We further demonstrate how HDF5 users can leverage Darshan instrumentation to quickly diagnose and optimize inefficient I/O access patterns. |
12:45 p.m. | Closing Remarks (Video) | Elena Pourmal, The HDF Group | |