Thomas Kluyver, European XFEL
I'm one of the maintainers of h5py, one of two Python bindings for HDF5. The other is PyTables. h5py focuses on exposing HDF5's concepts cleanly in Python, while PyTables uses HDF5 as the storage layer for its own data model.
NumPy arrays are a bit like HDF5 datasets in memory: multidimensional arrays with a datatype and hyperslab-style selection. NumPy is the basis of most scientific computing in Python.
import numpy as np
a = np.arange(30).reshape((5, 6))
a
a / 10
a[:2, :4]
a.shape
a.dtype
import h5py
print(h5py.version.info)
In the simplest case, we can assign a NumPy array into a File object, creating a dataset:
with h5py.File('demo.h5', 'w') as f:
    f['a'] = a
!h5dump demo.h5
f = h5py.File('demo.h5', 'r+')
f['a']
Slicing the dataset like a NumPy array reads data into memory. It's easy to read all or part of a dataset:
f['a'][:]
f['a'][:2, :4]
f['a'][2]
The dataset has attributes like a NumPy array:
f['a'].shape
f['a'].dtype
Assigning an array as above only works for data that fits in memory. We can create a bigger dataset without data using the create_dataset method. This could be a stack of 10000 square images:
ds = f.create_dataset('big', shape=(10000, 1024, 1024), dtype=np.uint16)
ds
To write data, we can assign to slices of the dataset - this mirrors the interface for reading:
ds[0] = np.random.uniform(high=1000, size=(1024, 1024))
ds[0]
ds[:5].mean(axis=(1, 2))
You can also specify chunking and compression options with create_dataset. If you use features that require chunks and don't specify a chunk shape, h5py will try to guess a reasonable chunk shape for you.
ds = f.create_dataset(
    'chunked',
    shape=(10000, 1024, 1024),
    dtype=np.uint16,
    chunks=(1, 1024, 1024),
    compression='gzip',
    compression_opts=1,
)
ds
ds.chunks
ds.compression
ds[0] = np.random.uniform(high=1000, size=(1024, 1024))
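To see the chunk-shape guessing in action, pass chunks=True so h5py picks a shape itself (a small sketch; the file and dataset names here are made up):

```python
import numpy as np
import h5py

with h5py.File('autochunk_demo.h5', 'w') as f:  # file name is an assumption
    # chunks=True: ask h5py to guess a reasonable chunk shape
    auto = f.create_dataset('auto', shape=(10000, 1024), dtype=np.uint16,
                            chunks=True, compression='gzip')
    # A tuple chosen by h5py; each chunk dimension fits within the dataset shape
    print(auto.chunks)
```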
The hdf5plugin package is a separate project which packages several extra filters for convenient use with h5py - e.g. ZFP lossy floating-point compression.
import hdf5plugin
f.create_dataset(
    'zfp',
    data=np.random.random(100),
    compression=hdf5plugin.Zfp(accuracy=0.1),
)
Of course, h5py can also use any plugins available to HDF5 on your system, identifying a filter with an integer ID.
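As a minimal check of that (using only h5py's bundled low-level h5z module): gzip is registered under filter ID 1, and filter_avail tells you whether a given filter ID is available in the current HDF5 build.

```python
import h5py

# gzip (deflate) has the registered HDF5 filter ID 1
print(h5py.h5z.FILTER_DEFLATE)                         # 1

# Check a filter is available before relying on it
print(h5py.h5z.filter_avail(h5py.h5z.FILTER_DEFLATE))  # True on standard builds
```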
Creating virtual datasets is particularly convenient in Python code. This is an HDF5 feature that can stitch data together in one big dataset, without copying it.
# Create 1D data in 4 source files (0.h5 to 3.h5)
for n in range(4):
    with h5py.File(f"{n}.h5", "w") as src_file:
        src_file.create_dataset("data", (100,), "i4", np.arange(0, 100) + n)
# Make a 2D virtual layout - this is a special h5py object to help
# with creating virtual datasets.
layout = h5py.VirtualLayout(shape=(4, 100), dtype="i4")
# Map each source dataset into the layout
for n in range(4):
    vsource = h5py.VirtualSource(f"{n}.h5", "data", shape=(100,))
    # Both layout and source can be sliced like a dataset or an array.
    # Here, we're mapping every second element (::2) from the source
    # into the first 50 cells of the row for this file.
    layout[n, :50] = vsource[::2]
# Finally, turn the layout into a dataset in a file
vds_file = h5py.File("VDS.h5", "w")
vds_file.create_virtual_dataset("vdata", layout, fillvalue=-5)
Reading a virtual dataset is just like reading any other dataset:
vds_file['vdata'][:, :5]
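Cells of the layout that were never mapped read back as the fillvalue. A standalone sketch with one source file (file names here are made up): only the first 50 columns are mapped, so columns 50 onwards come back as -5.

```python
import numpy as np
import h5py

# One source file holding 0..99
with h5py.File('vds_src.h5', 'w') as src:
    src.create_dataset('data', data=np.arange(100, dtype='i4'))

# Map every second element into the first 50 cells of a 1x100 layout
layout = h5py.VirtualLayout(shape=(1, 100), dtype='i4')
layout[0, :50] = h5py.VirtualSource('vds_src.h5', 'data', shape=(100,))[::2]

with h5py.File('vds_demo.h5', 'w') as f:
    vds = f.create_virtual_dataset('vdata', layout, fillvalue=-5)
    # Last two mapped cells, then the unmapped fillvalue region
    print(vds[0, 48:52])  # [96 98 -5 -5]
```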
Everything above is h5py's high-level API, which exposes the concepts of HDF5 in convenient, intuitive ways for Python code.
Each high-level object has a .id attribute to get a low-level object. The h5py low-level API is largely a 1:1 mapping of the HDF5 C API, made somewhat 'Pythonic': functions have default parameters where appropriate, outputs are translated to suitable Python objects, and HDF5 errors are turned into Python exceptions.
ds = f['chunked']
ds.id.get_num_chunks()
ds.id.get_chunk_info(0)
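The same pattern works beyond datasets: files and groups also carry low-level identifiers. A minimal sketch (the file name is an assumption):

```python
import numpy as np
import h5py

with h5py.File('lowlevel_demo.h5', 'w') as f:
    f['x'] = np.arange(10, dtype='i8')
    # FileID wraps the H5F* calls, DatasetID the H5D* calls
    print(f.id.get_filesize())           # total size of the file in bytes
    print(f['x'].id.get_storage_size())  # 80: raw bytes storing 10 x int64
```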
Recent releases also include performance improvements, such as releasing the GIL during HDF5 operations.
String conversions have been redesigned:
sample = h5py.File('sample.h5', 'r')
sample['group/strings'].asstr()[:]
Chunk iteration (thanks to John Readey):
for chunk_selection in ds.iter_chunks():
    chunk_data = ds[chunk_selection]
    break
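For instance (file and dataset names here are made up), iter_chunks makes it easy to process a large dataset one chunk at a time, never holding the whole array in memory:

```python
import numpy as np
import h5py

with h5py.File('chunk_iter_demo.h5', 'w') as f:
    ds = f.create_dataset('x', shape=(100, 100), chunks=(10, 100), dtype='f8')
    ds[:] = 1.5
    # Accumulate a streaming mean, one chunk at a time
    total = 0.0
    count = 0
    for chunk_selection in ds.iter_chunks():
        block = ds[chunk_selection]
        total += block.sum()
        count += block.size
    print(total / count)  # 1.5
```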