hdf5plugin packages a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and makes them usable from the Python programming language through h5py.
h5py is a thin, pythonic wrapper around HDF5.
Presenter: Thomas VINCENT
European HDF Users Group Summer 2021, July 7-8, 2021
from h5glance import H5Glance # Browsing HDF5 files
H5Glance("data.h5")
import h5py # Pythonic HDF5 wrapper: https://docs.h5py.org/
h5file = h5py.File("data.h5", mode="r") # Open HDF5 file in read mode
data = h5file["/data"][()] # Access HDF5 dataset "/data"
plt.imshow(data); plt.colorbar() # Display data
<matplotlib.colorbar.Colorbar at 0x11410d198>
data = h5file["/compressed_data_bitshuffle_lz4"][()] # Access compressed dataset
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-5-4bb532391a0f> in <module>
----> 1 data = h5file["/compressed_data_bitshuffle_lz4"][()] # Access compressed dataset

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/venv/py37env/lib/python3.7/site-packages/h5py/_hl/dataset.py in __getitem__(self, args, new_dtype)
    760         mspace = h5s.create_simple(selection.mshape)
    761         fspace = selection.id
--> 762         self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
    763
    764         # Patch up the output for NumPy

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5d.pyx in h5py.h5d.DatasetID.read()
h5py/_proxy.pyx in h5py._proxy.dset_rw()

OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin)
hdf5plugin usage

To enable reading compressed datasets not supported by libHDF5 and h5py:
Install hdf5plugin & import it.
%%bash
pip3 install hdf5plugin
Or: conda install -c conda-forge hdf5plugin
import hdf5plugin
data = h5file["/compressed_data_bitshuffle_lz4"][()] # Access dataset
plt.imshow(data); plt.colorbar() # Display data
<matplotlib.colorbar.Colorbar at 0x115d6a5c0>
h5file.close() # Close the HDF5 file
When writing datasets with h5py, compression can be specified with h5py.Group.create_dataset:
# Create a dataset with h5py without compression
h5file = h5py.File("new_file_uncompressed.h5", mode="w")
h5file.create_dataset("/data", data=data)
h5file.close()
# Create a compressed dataset
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
"/compressed_data_bitshuffle_lz4",
data=data,
compression=32008, # bitshuffle/lz4 HDF5 filter identifier
compression_opts=(0, 2) # options: default number of elements/block, enable LZ4
)
h5file.close()
hdf5plugin provides some helpers to ease dealing with compression filters and options:
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
"/compressed_data_bitshuffle_lz4",
data=data,
    **hdf5plugin.Bitshuffle() # Or: **hdf5plugin.Bitshuffle(lz4=True)
)
h5file.close()
hdf5plugin.Bitshuffle?
H5Glance("new_file_bitshuffle_lz4.h5")
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="r")
plt.imshow(h5file["/compressed_data_bitshuffle_lz4"][()]); plt.colorbar()
h5file.close()
!ls -l new_file*.h5
-rw-r--r--  1 tvincent  staff  4278852 Jul  8 11:21 new_file_bitshuffle_lz4.h5
-rw-r--r--  1 tvincent  staff  5832257 Jul  8 11:20 new_file_uncompressed.h5
h5py

Compression filters provided by h5py:
- libHDF5: "gzip" and optionally "szip"
- h5py: "lzf"

Pre-compression filter: Byte-Shuffle
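To see why the Byte-Shuffle pre-compression filter helps, here is a pure-Python sketch (an illustration only, not the actual HDF5 filter implementation, which is written in C): shuffling stores all first bytes of the values together, then all second bytes, and so on, so slowly varying data yields nearly constant byte planes that compress much better.

```python
import struct
import zlib

# Sketch of byte-shuffle: regroup the bytes of fixed-size values into
# per-position "byte planes" before compressing. Illustration only.
values = [i // 10 for i in range(10000)]            # slowly varying uint32 data
raw = b"".join(struct.pack("<I", v) for v in values)

itemsize = 4  # uint32
shuffled = bytes(
    raw[j] for i in range(itemsize) for j in range(i, len(raw), itemsize)
)

print(len(zlib.compress(raw)), len(zlib.compress(shuffled)))
```

On this data the shuffled buffer compresses noticeably better, because three of the four byte planes are nearly constant.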
h5file = h5py.File("new_file_shuffle_gzip.h5", mode="w")
h5file.create_dataset(
"/compressed_data_shuffle_gzip", data=data, shuffle=True, compression="gzip")
h5file.close()
hdf5plugin

Additional compression filters provided by hdf5plugin: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard.
6 out of the 25 HDF5 registered filter plugins as of June 2021.
h5file = h5py.File("new_file_blosc.h5", mode="w")
h5file.create_dataset(
"/compressed_data_blosc",
data=data,
**hdf5plugin.Blosc(cname='zlib', clevel=5, shuffle=hdf5plugin.Blosc.SHUFFLE)
)
h5file.close()
- Blosc: blosclz, lz4, lz4hc, snappy (optional, requires C++11 and hdf5plugin built from source), zlib, zstd
- FciDecomp: only for (u)int8 or (u)int16 data
- ZFP: for float32, float64, (u)int32, (u)int64 data

Blosc includes pre-compression filters and compression algorithms provided by other HDF5 compression filters:
- LZ4() => Blosc("lz4", 9)
- Zstd() => Blosc("zstd", 2)
- Shuffle => Blosc with shuffle=hdf5plugin.Blosc.SHUFFLE
- Bitshuffle() => Blosc("lz4", 5, hdf5plugin.Blosc.BITSHUFFLE)
- ...

Except for OpenMP support with Bitshuffle!
Having different pre-compression filters and compression algorithms at hand offers different trade-offs between read/write speed and compression ratio (and possibly error rate, for lossy compression).
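The speed-versus-ratio trade-off can be illustrated with a single algorithm by varying its compression level; here zlib from the standard library stands in for the HDF5 filters, which expose similar level/option knobs:

```python
import time
import zlib

# Compare zlib compression levels on the same buffer: higher levels
# usually trade compression speed for smaller output.
data = b"some repetitive scientific data " * 4096

for level in (1, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level={level}: {len(compressed)} bytes in {elapsed * 1e3:.2f} ms")
```

The same principle applies when choosing, e.g., a Blosc clevel or a Zstandard level for an HDF5 dataset.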
Also keep availability/compatibility in mind: "gzip" as included in libHDF5 is the most compatible one (as is "lzf", included with h5py).
hdf5plugin filters with other applications

Note: In a notebook, prefixing a command with ! runs it in a shell.
!h5dump -d /compressed_data_bitshuffle_lz4 -s "0,0" -c "5,10" data.h5
HDF5 "data.h5" {
DATASET "/compressed_data_bitshuffle_lz4" {
   DATATYPE  H5T_STD_U8LE
   DATASPACE  SIMPLE { ( 1969, 2961 ) / ( 1969, 2961 ) }
   SUBSET {
      START ( 0, 0 );
      STRIDE ( 1, 1 );
      COUNT ( 5, 10 );
      BLOCK ( 1, 1 );
      DATA {
      }
   }
}
}
A solution: set the HDF5_PLUGIN_PATH environment variable to hdf5plugin.PLUGINS_PATH.
# Directory where HDF5 compression filters are stored
hdf5plugin.PLUGINS_PATH
# Retrieve hdf5plugin.PLUGINS_PATH from the command line
!python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"
!ls `python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"`
libh5blosc.dylib libh5fcidecomp.dylib libh5zfp.dylib libh5bshuf.dylib libh5lz4.dylib libh5zstd.dylib
# Set HDF5_PLUGIN_PATH environment variable to hdf5plugin.PLUGINS_PATH
!HDF5_PLUGIN_PATH=`python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"` h5dump -d /compressed_data_bitshuffle_lz4 -s "0,0" -c "5,10" data.h5
HDF5 "data.h5" {
DATASET "/compressed_data_bitshuffle_lz4" {
   DATATYPE  H5T_STD_U8LE
   DATASPACE  SIMPLE { ( 1969, 2961 ) / ( 1969, 2961 ) }
   SUBSET {
      START ( 0, 0 );
      STRIDE ( 1, 1 );
      COUNT ( 5, 10 );
      BLOCK ( 1, 1 );
      DATA {
      (0,0): 53, 52, 53, 54, 54, 55, 55, 56, 56, 57,
      (1,0): 49, 50, 54, 55, 53, 54, 55, 56, 56, 58,
      (2,0): 50, 51, 54, 54, 53, 55, 56, 57, 58, 57,
      (3,0): 51, 54, 55, 54, 54, 55, 56, 57, 58, 59,
      (4,0): 53, 55, 54, 54, 56, 56, 58, 57, 57, 58
      }
   }
}
}
Note: This only works for reading compressed datasets, not for writing!
For reading compressed datasets, compression filters do NOT need information from libHDF5: they work on the compressed stream.
For writing compressed datasets, some information about the dataset (e.g., the data type size) can be needed by the filter (e.g., to shuffle the data). This information is retrieved through the libHDF5 C-API (e.g., H5Tget_size).
Access to libHDF5
C-API is needed, but linking compression filters with libHDF5
is cumbersome in a dynamic environment like Python.
Symbols from dynamically loaded Python modules and libraries are accessible to others.
Register compression filter at C-level with H5Zregister
(see src/register_win32.c
)
In Python, symbols from dynamically loaded modules and libraries are NOT visible to others.
libHDF5
.Instead, provide some function wrappers to replace libHDF5
C-API and link the compression filter with those.
libHDF5
corresponding functions that are dynamically loaded at runtime.At runtime, we need to initialize the compression filter to load symbols dynamically from libHDF5
used by h5py
and use them from the function wrappers.
#include <dlfcn.h> /* dlopen, dlsym */

typedef size_t (* DL_func_H5Tget_size)(hid_t type_id);

static struct { /* Structure storing HDF5 function pointers */
    DL_func_H5Tget_size H5Tget_size;
} DL_H5Functions = {NULL};

/* Init wrappers by loading symbols from libHDF5; returns 0 on success */
int init_filter(const char* libname) {
    void *handle = dlopen(libname, RTLD_LAZY | RTLD_LOCAL); /* Load libHDF5 */
    if (handle == NULL) {
        return -1;
    }
    DL_H5Functions.H5Tget_size =
        (DL_func_H5Tget_size)dlsym(handle, "H5Tget_size");
    return DL_H5Functions.H5Tget_size != NULL ? 0 : -1;
}

/* H5Tget_size libHDF5 C-API wrapper */
size_t H5Tget_size(hid_t type_id) {
    if (DL_H5Functions.H5Tget_size != NULL) {
        return DL_H5Functions.H5Tget_size(type_id);
    }
    return 0; /* Error: symbol not loaded */
}
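The same dlopen/dlsym pattern can be illustrated from Python with ctypes (a hypothetical stand-in using the C math library, not hdf5plugin code): a shared library is loaded at runtime and a function is resolved by name, exactly what the wrapper above does for H5Tget_size.

```python
import ctypes
import ctypes.util

# Illustration of runtime symbol loading: open a shared library and
# resolve a function by name (the dlopen/dlsym pattern of the C wrapper).
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]
print(libm.sqrt(9.0))  # 3.0
```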
Should the HDF5 compression filter API evolve, it would be great to take this issue into account to ease the distribution of compression filters.
hdf5plugin license

The source code of hdf5plugin itself is licensed under the MIT license...
It also embeds the source code of the provided compression filters and libraries which are licensed under different open-source licenses (Apache, BSD-2, BSD-3, MIT, Zlib...) and copyrights.
hdf5plugin provides additional HDF5 compression filters (namely: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard), mainly, but not exclusively, for use with h5py.
Available through pip and conda.
Credits to the contributors: Thomas Vincent, Armando Sole, @Florian-toll, @fpwg, Jerome Kieffer, @Anthchirp, @mobiusklein, @junyuewang
Partially funded by the PaNOSC EU-project.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 823852.