Technical Insights

Francesc Alted, Freelance Consultant, HDF guest blogger

The HDF Group has a long history of collaboration with Francesc Alted, creator of PyTables.  Francesc was one of the first HDF5 application developers to successfully employ external compression filters in an HDF5 application (PyTables). The first two compression methods registered with The HDF Group were LZO and BZIP2, both implemented in PyTables; when Blosc was later added to PyTables, it proved to be a winner.

While HDF5 and PyTables address the data organization and I/O needs of many applications, solutions like the Blosc meta-compressor presented in this post are simpler, achieve excellent I/O performance, and serve as alternatives to HDF5 when portability and data organization are not critical but compression is still desired.  Enjoy the read!

Why compression?

Compression is a hot topic in data handling. The largest database players have recently (or not so recently) added support for different kinds of compression libraries. Why is that? It’s all about efficiency: modern CPUs are so fast compared with storage write speeds that compression not only lets you store more data in less space, it can also improve effective storage bandwidth.
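To make the bandwidth argument concrete, here is a back-of-the-envelope sketch in Python (not from the original post; the disk speed, codec speed, and compression ratio are illustrative assumptions, and compression and writing are assumed to happen one after the other):

    # Illustrative numbers only: a disk that writes 500 MB/s and a codec that
    # compresses 3x while processing 2000 MB/s of uncompressed input.
    disk_write_mb_s = 500.0
    codec_throughput_mb_s = 2000.0
    compression_ratio = 3.0

    # Time to persist 1 MB of logical data: compress it, then write ~0.33 MB.
    time_per_mb = 1.0 / codec_throughput_mb_s + (1.0 / compression_ratio) / disk_write_mb_s
    effective_bandwidth = 1.0 / time_per_mb

    print(f"Effective write bandwidth: {effective_bandwidth:.0f} MB/s")  # ~860 vs. 500 uncompressed

Even under these modest assumptions, the compressed path moves logical data noticeably faster than the raw disk can, which is the efficiency argument in a nutshell.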

The HDF5 library is an excellent example of a data container that supported out-of-the-box compression in its very first release, in November 1998. Its innovation was to support compression of chunked datasets in a way that allowed the developer to apply compression to each chunk individually, resulting in reasonably fast and transparent compression using different codecs. HDF5 also introduced pluggable compression filters that allowed external developers to implement support for additional codecs. Release 1.8.11 then added the ability to discover, load and register filters at run time. More recently, release 1.8.15 (fully documented in 1.8.16) introduced a new Plugin Interface that provides complete programmatic control of dynamically loaded plugins. HDF5’s filter features now offer much-desired flexibility, giving users the freedom to choose the codec that best suits their needs.
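As a rough illustration of per-chunk compression (a sketch of my own, not code from the post), the h5py snippet below stores a chunked dataset with the built-in gzip filter; the file name, chunk shape, and the commented-out Blosc filter ID are assumptions:

    import numpy as np
    import h5py

    data = np.random.rand(1000, 1000)

    with h5py.File("example.h5", "w") as f:
        # Each 100x100 chunk is compressed independently, so a partial read
        # only decompresses the chunks it actually touches.
        f.create_dataset(
            "data",
            data=data,
            chunks=(100, 100),
            compression="gzip",      # built-in deflate filter
            compression_opts=4,
        )
        # A dynamically loaded filter is requested by its registered ID instead,
        # e.g. compression=32001 for Blosc (assuming the plugin is installed and
        # discoverable, e.g. via HDF5_PLUGIN_PATH).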

Why Blosc?

Over the last decade the trend has been to implement faster codecs at the expense of lower compression ratios. The idea is to reduce the compression/decompression time overhead.
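As a small illustration of that trade-off (my own sketch, not from the post; the array contents and parameter choices are arbitrary), python-blosc can be asked for a fast codec at a modest compression level:

    import numpy as np
    import blosc  # python-blosc; assumes the package is installed

    arr = np.arange(1_000_000, dtype=np.float64)
    raw = arr.tobytes()

    # A fast codec at a modest level: the goal is to keep (de)compression
    # cheaper than the I/O it saves, not to squeeze out the best ratio.
    packed = blosc.compress(raw, typesize=8, cname="lz4", clevel=5,
                            shuffle=blosc.SHUFFLE)
    restored = blosc.decompress(packed)

    print(len(raw), "->", len(packed), "bytes")
    assert restored == raw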

Gerd Heber, The HDF Group and Haymo Kutschbach,* ILNumerics

Metaphorically speaking, this blog post is about a frog trying to climb out of a well, a damp and unsightly corner of the HDF5 ecosystem called HDF5.NET. People who know more about its genesis tell us that it was never intended as what it has come to be perceived as: an “aspirational” .NET interface for HDF5 that would one day be complete and fully supported. Be that as it may, it’s important to ask, “What can we do today to better serve the needs of the .NET community?” We believe, as the title suggests, we need to take a step back in order to move forward.

John Readey, The HDF Group

Editor’s Note: Since this post was written in 2015, The HDF Group has developed HDF Cloud, a new product that addresses the challenges of adapting large scale array-based computing to the cloud and object storage while intelligently handling the full data management life cycle. If this is something that interests you, we’d love to hear from you.

HDF Server is a new product from The HDF Group that enables HDF5 resources to be accessed and modified using the Hypertext Transfer Protocol (HTTP).

HDF Server [1], released in February 2015, was first developed as a proof of concept that enabled remote access to HDF5 content using a RESTful API.  HDF Server version 0.1.0 wasn’t yet intended for use in a production environment since it didn’t initially provide a set of security features and controls.  Following its successful debut, The HDF Group incorporated additional planned features.  The newest version of HDF Server provides exciting capabilities for accessing HDF5 data in an easy and secure way.
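Purely as an illustrative sketch (the host, domain name, and response fields below are my assumptions; the authoritative endpoint layout is described in the HDF REST API documentation), reading data over HTTP might look roughly like this:

    import requests

    base = "http://127.0.0.1:5000"                  # placeholder server address
    headers = {"Host": "tall.data.hdfgroup.org"}    # the HDF5 "domain" being requested

    # Fetch the domain's root group, then read values from one of its datasets.
    root = requests.get(base + "/", headers=headers).json()
    print("root group UUID:", root.get("root"))

    dset_uuid = "..."  # a dataset UUID discovered by following links from the root group
    values = requests.get(f"{base}/datasets/{dset_uuid}/value", headers=headers).json()
    print(values.get("value"))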

John Readey, The HDF Group

We’re pleased to announce that The HDF Group is now a member of the Open Commons Consortium (formerly the Open Cloud Consortium), a not-for-profit that manages and operates cloud computing and data commons infrastructure to support scientific, medical, health care and environmental research.

The HDF Group will serve on the NOAA Data Alliance Working Group committee that will determine both the datasets to be hosted in the NOAA data commons and the tools to be used in the computational ecosystem surrounding it.

OSDC website

“The Open Commons Consortium (OCC) is a truly innovative concept for supporting scientific computing,” said Mike Folk, The HDF Group’s President. “Their cloud computing and data commons infrastructure supports a wide range of research, and OCC’s membership spans government, academia, and the private sector.  This is a good opportunity for us to learn about how we can best serve these communities.”

The HDF Group will also participate in the Open Science Data Cloud working group and receive resource allocations on the OSDC Griffin resource.  The HDF Group’s John Readey is working with the OCC and others to investigate ways to use Griffin effectively.  Readey says, “Griffin is a great testbed for cloud-based systems.  With access to object storage (using the AWS/S3 API) and the ability to programmatically create VMs, we will explore new methods for the analysis of scientific datasets.”
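Because Griffin exposes S3-compatible object storage, a standard AWS SDK can in principle be pointed at it. The endpoint URL, credentials, and bucket name below are placeholders of my own, included only to sketch the idea:

    import boto3

    # Placeholder endpoint and credentials for an S3-compatible object store.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://griffin-objstore.example.org",
        aws_access_key_id="...",
        aws_secret_access_key="...",
    )

    # List the objects in a (hypothetical) bucket of HDF5 files.
    resp = s3.list_objects_v2(Bucket="hdf-data")
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])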

Joel Plutchak, The HDF Group

The HDF Group’s support for and use of the Java Programming Language consists of Java wrappers for the HDF4 and HDF5 C libraries, an Object Model definition and implementation, and HDFView, a graphical file-viewing application. In this article we’ll discuss what we’re doing now with Java and look toward the future.

The screen capture shows some of the capabilities of the HDFView application. Displayed is a JPSS Mission VIIRS (Visible Infrared Imaging Radiometer Suite) Day-Night band dataset in table form and in image form with a false-color palette attached.

By the time the first public version of the Java Programming Language was released in 1995, various groups at the University of Illinois were already...

Anthony Scopatz, Assistant Professor at the University of South Carolina, HDF guest blogger

“Python is great and its ecosystem for scientific computing is world class. HDF5 is amazing and is rightly the gold standard for persistence for scientific data. Many people use HDF5 from Python, and this number is only growing due to pandas’ HDFStore. However, using HDF5 from Python has at least one more knot than it needs to. Let’s change that.”

Almost immediately when going to use HDF5 from Python you are faced with a choice between two fantastic packages with overlapping capabilities: h5py and PyTables.  h5py wraps the HDF5 API more closely using autogenerated Cython.  PyTables, while also wrapping HDF5, focuses more on a Table data structure and adds...
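To make that overlap concrete, here is a minimal sketch of my own (not from the post) writing the same array with each package; the file names are arbitrary:

    import numpy as np
    import h5py
    import tables

    arr = np.arange(10, dtype=np.float64)

    # h5py: a thin, dictionary-like mapping onto HDF5 groups and datasets.
    with h5py.File("demo_h5py.h5", "w") as f:
        f.create_dataset("x", data=arr)

    # PyTables: a higher-level object tree with Array/Table abstractions.
    with tables.open_file("demo_pytables.h5", "w") as f:
        f.create_array(f.root, "x", arr, title="same data, different API")

Both files are ordinary HDF5 and can be opened by either library, which is precisely why choosing between the two packages feels like one knot more than necessary.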

Lindsay Powers, The HDF Group

The 2015 HDF workshop held during the ESIP Summer Meeting was a great success thanks to more than 40 participants throughout the four sessions.  The workshop was an excellent opportunity for us to interact with HDF community members to better understand their needs and introduce them to new technologies. You can view the slide presentations from the workshop here.

From my perspective, the highlight of the workshop was the Vendors and Tools Session, where Ellen Johnson (MathWorks), Christine White (Esri), Brian Tisdale (NASA), and Gerd Heber (The HDF Group) talked about new and improved applications of HDF technologies.  For example:

Quincey Koziol, The HDF Group

“A supercomputer is a device for turning compute-bound problems into I/O-bound problems.” – Ken Batcher, Prof. Emeritus, Kent State University.

HDF5 grew out of a collaboration between the National Center for Supercomputing Applications (NCSA) and the US Department of Energy’s Advanced Simulation and Computing Program (ASC), so high-performance computing (HPC) I/O has been a focus of ours from the very beginning.  As we start our 20th year of development on HDF5, HPC I/O continues to be a critical driver of new features.

Los Alamos National Laboratory is home to two of the world’s most powerful supercomputers, each capable of performing more than 1,000 trillion operations per second. Here, ASC is examining the effects of a one-megaton nuclear energy source detonated on the surface of an asteroid. Image from ASC at http://www.lanl.gov/asci/

The HDF5 development team has focused on three things when serving the HPC community: performance, freedom of choice and ease of use.