h5py: A bridge between HDF5 and Python

Thomas Kluyver, European XFEL

I'm one of the maintainers of h5py, one of two Python bindings for HDF5. The other is PyTables. h5py is focused on exposing HDF5 ideas cleanly in Python, while PyTables uses HDF5 more as part of its own data model (see more about the difference).


Prelude: NumPy

NumPy arrays are a bit like HDF5 datasets in memory: multidimensional arrays with a datatype and hyperslab-style selection. They are the basis of most scientific computing in Python.

In [1]:
import numpy as np
In [2]:
a = np.arange(30).reshape((5, 6))
a
Out[2]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])
In [3]:
a / 10
Out[3]:
array([[0. , 0.1, 0.2, 0.3, 0.4, 0.5],
       [0.6, 0.7, 0.8, 0.9, 1. , 1.1],
       [1.2, 1.3, 1.4, 1.5, 1.6, 1.7],
       [1.8, 1.9, 2. , 2.1, 2.2, 2.3],
       [2.4, 2.5, 2.6, 2.7, 2.8, 2.9]])
In [4]:
a[:2, :4]
Out[4]:
array([[0, 1, 2, 3],
       [6, 7, 8, 9]])
In [5]:
a.shape
Out[5]:
(5, 6)
In [6]:
a.dtype
Out[6]:
dtype('int64')

Storing & reading data in HDF5

h5py documentation: https://docs.h5py.org/

In [7]:
import h5py
In [8]:
print(h5py.version.info)
Summary of the h5py configuration
---------------------------------

h5py    3.0.0rc1
HDF5    1.12.0
Python  3.8.5 (default, Aug 12 2020, 00:00:00) 
[GCC 10.2.1 20200723 (Red Hat 10.2.1-1)]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.18.4
cython (built with) 0.29.21
numpy (built against) 1.17.5
HDF5 (built against) 1.12.0

In the simplest case, we can assign a NumPy array into a File object, creating a dataset:

In [9]:
with h5py.File('demo.h5', 'w') as f:
    f['a'] = a
In [10]:
!h5dump demo.h5
HDF5 "demo.h5" {
GROUP "/" {
   DATASET "a" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 5, 6 ) / ( 5, 6 ) }
      DATA {
      (0,0): 0, 1, 2, 3, 4, 5,
      (1,0): 6, 7, 8, 9, 10, 11,
      (2,0): 12, 13, 14, 15, 16, 17,
      (3,0): 18, 19, 20, 21, 22, 23,
      (4,0): 24, 25, 26, 27, 28, 29
      }
   }
}
}
In [11]:
f = h5py.File('demo.h5', 'r+')
f['a']
Out[11]:
<HDF5 dataset "a": shape (5, 6), type "<i8">

Slicing the dataset like a NumPy array reads data into memory. It's easy to read all or part of a dataset:

In [12]:
f['a'][:]
Out[12]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])
In [13]:
f['a'][:2, :4]
Out[13]:
array([[0, 1, 2, 3],
       [6, 7, 8, 9]])
In [14]:
f['a'][2]
Out[14]:
array([12, 13, 14, 15, 16, 17])

The dataset object has shape and dtype properties, like a NumPy array:

In [15]:
f['a'].shape
Out[15]:
(5, 6)
In [16]:
f['a'].dtype
Out[16]:
dtype('int64')

Working with big datasets

Assigning an array as above only works for data that fits in memory. With the create_dataset method, we can create a bigger dataset without supplying any data up front. This could be a stack of 10000 square images:

In [17]:
ds = f.create_dataset('big', shape=(10000, 1024, 1024), dtype=np.uint16)
ds
Out[17]:
<HDF5 dataset "big": shape (10000, 1024, 1024), type "<u2">

To write data, we can assign to slices of the dataset - this mirrors the interface for reading:

In [18]:
ds[0] = np.random.uniform(high=1000, size=(1024, 1024))
In [19]:
ds[0]
Out[19]:
array([[439, 450, 799, ..., 138, 183, 850],
       [643, 335, 954, ..., 596, 280, 795],
       [974, 296, 695, ..., 698, 620,  98],
       ...,
       [119, 748, 238, ..., 419, 218, 884],
       [455,  17, 519, ..., 156, 752, 183],
       [180, 186, 160, ..., 340, 740, 892]], dtype=uint16)
In [20]:
ds[:5].mean(axis=(1, 2))
Out[20]:
array([499.40815258,   0.        ,   0.        ,   0.        ,
         0.        ])
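
Because the dataset lives on disk, data bigger than memory can be written piece by piece. A minimal sketch (not part of the original notebook), filling the first few frames one at a time so only a single 1024 x 1024 array is ever held in memory:

# Hedged sketch: write one frame at a time into the on-disk dataset
for i in range(5):
    ds[i] = np.random.uniform(high=1000, size=(1024, 1024))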

Chunking & compression

You can also specify chunking and compression options with create_dataset. If you use features that require chunks and don't specify a chunk shape, h5py will try to guess a reasonable chunk shape for you.

In [21]:
ds = f.create_dataset(
    'chunked',
    shape=(10000, 1024, 1024),
    dtype=np.uint16,
    # One chunk per 1024 x 1024 frame
    chunks=(1, 1024, 1024),
    compression='gzip',
    compression_opts=1,
)
ds
Out[21]:
<HDF5 dataset "chunked": shape (10000, 1024, 1024), type "<u2">
In [22]:
ds.chunks
Out[22]:
(1, 1024, 1024)
In [23]:
ds.compression
Out[23]:
'gzip'
In [24]:
ds[0] = np.random.uniform(high=1000, size=(1024, 1024))
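
As a quick illustration of the automatic guessing mentioned above (the dataset name 'auto_chunked' is made up here), passing chunks=True asks h5py to pick a chunk shape itself:

auto = f.create_dataset(
    'auto_chunked',
    shape=(10000, 1024, 1024),
    dtype=np.uint16,
    chunks=True,          # let h5py guess a reasonable chunk shape
    compression='gzip',
)
auto.chunks               # the guessed shape depends on the dataset's shape and dtype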

The hdf5plugin package is a separate project that bundles several extra compression filters for convenient use with h5py - e.g. ZFP lossy floating-point compression.

In [25]:
import hdf5plugin
In [26]:
f.create_dataset(
    'zfp',
    data=np.random.random(100),
    compression=hdf5plugin.Zfp(accuracy=0.1),
)
Out[26]:
<HDF5 dataset "zfp": shape (100,), type "<f8">

Of course, h5py can also use any plugins available to HDF5 on your system, identifying a filter with an integer ID.
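
For example (a minimal sketch; the dataset name 'lzf_by_id' is invented for illustration), the LZF filter that ships with h5py can be requested by its registered integer ID instead of by name:

f.create_dataset(
    'lzf_by_id',
    data=np.random.random(100),
    # h5py.h5z.FILTER_LZF is LZF's registered integer filter ID (32000)
    compression=h5py.h5z.FILTER_LZF,
)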

Creating virtual datasets

Virtual datasets are an HDF5 feature that can stitch data from several sources together into one big dataset, without copying it. Creating them is particularly convenient in Python code.

In [27]:
# Create 1D data in 4 source files (0.h5 to 3.h5)
for n in range(4):
    with h5py.File(f"{n}.h5", "w") as src_file:
        src_file.create_dataset("data", (100,), "i4", np.arange(0, 100) + n)
In [28]:
# Make a 2D virtual layout - this is a special h5py object to help
# with creating virtual datasets.
layout = h5py.VirtualLayout(shape=(4, 100), dtype="i4")

# Map each source dataset into the layout
for n in range(4):
    vsource = h5py.VirtualSource(f"{n}.h5", "data", shape=(100,))
    # Both layout and source can be sliced like a dataset or an array
    # Here, we're mapping every second element (::2) from the source
    # into the first 50 cells of the row for this file.
    layout[n, :50] = vsource[::2]

# Finally, turn the layout into a dataset in a file
vds_file = h5py.File("VDS.h5", "w")
vds_file.create_virtual_dataset("vdata", layout, fillvalue=-5)
Out[28]:
<HDF5 dataset "vdata": shape (4, 100), type "<i4">

Reading a virtual dataset is just like reading any other dataset:

In [29]:
vds_file['vdata'][:, :5]
Out[29]:
array([[ 0,  2,  4,  6,  8],
       [ 1,  3,  5,  7,  9],
       [ 2,  4,  6,  8, 10],
       [ 3,  5,  7,  9, 11]], dtype=int32)
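
As a small aside (not in the original notebook), a dataset object also tells you whether it is virtual and where its data is mapped from:

vds = vds_file['vdata']
vds.is_virtual            # True
# Each entry describes one mapping: source file name, source dataset, and the selections
for vs in vds.virtual_sources():
    print(vs.file_name, vs.dset_name)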

Low-level API

Everything above is h5py's high-level API, which exposes the concepts of HDF5 in convenient, intuitive ways for Python code.

Each high-level object has a .id attribute to get a low-level object. The h5py low-level API is largely a 1:1 mapping of the HDF5 C API, made somewhat 'Pythonic': functions have default parameters where appropriate, outputs are translated to suitable Python objects, and HDF5 errors are turned into Python exceptions.

In [30]:
ds = f['chunked']
In [31]:
ds.id.get_num_chunks()
Out[31]:
1
In [32]:
ds.id.get_chunk_info(0)
Out[32]:
StoreInfo(chunk_offset=(0, 0, 0), filter_mask=0, byte_offset=20971527472, size=1614347)
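
A couple more low-level calls in the same spirit (an illustrative sketch, not from the original notebook):

dcpl = ds.id.get_create_plist()          # dataset creation property list
dcpl.get_layout() == h5py.h5d.CHUNKED    # True - this dataset uses chunked storage
ds.id.get_storage_size()                 # bytes of storage allocated for the raw data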

What's new in h5py 3.0?

Performance improvements, including releasing the GIL around reading and writing data.

https://docs.h5py.org/en/latest/whatsnew/3.0.html

String conversions have been redesigned:

In [33]:
sample = h5py.File('sample.h5', 'r')
sample['group/strings'].asstr()[:]
Out[33]:
array(['some', 'variable-length', 'string', 'data'], dtype=object)
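
A hedged sketch of the new behaviour (the file and dataset names below are invented; sample.h5 isn't created in this notebook): variable-length strings can be written with h5py.string_dtype(), h5py 3.0 reads them back as bytes by default, and .asstr() decodes them to str.

with h5py.File('strings_demo.h5', 'w') as sf:
    sf.create_dataset('s', data=['some', 'strings'], dtype=h5py.string_dtype())
    sf['s'][0]            # b'some' - bytes by default in h5py 3.x
    sf['s'].asstr()[0]    # 'some'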

Chunk iteration (thanks to John Readey):

In [34]:
# Loop over selections covering each stored chunk; indexing the dataset with
# one of them reads exactly that chunk's data.
for chunk_selection in ds.iter_chunks():
    chunk_data = ds[chunk_selection]
    break