How does HSDS work?
HSDS is implemented as a set of containers where the number of containers can be scaled up or down based on the desired performance and expected number of clients. The containers can either be run on a single server (using Docker) or run across a cluster using Kubernetes. The architecture of HSDS, explained in detail below.
There are two classes of containers, the service node and the data node.
The service nodes are the containers that receive requests from the clients. For each request, the service node, authenticates the request, verifies the action is authorized (e.g. delete an object, add an object), then then forwards the request to one or more data nodes
For their part, the data nodes, access the object storage system (e.g. S3) to read or write the requested data. Each data node maintains a cache, so that recently accessed data can be returned directly without having to go to the storage system. The data nodes maintain a virtual partition of the storage system, so that each node is responsible for a distinct subset of the objects (a files objects will typically be distributed across all partitions).
For HDF5 datasets, one object is created for each chunk in the dataset. When clients send requests to read or write data across multiple chunks, each data node can be performing IO in parallel. So while with the regular HDF5 library, these types of operations require each chunk to be processed sequentially, with HSDS the operation can be performed much faster.
In addition to providing functional equivalents to the HDF5 library, like compression, hyperslab and point selection, HSDS also supports many features not yet implemented in the library, including
- multi-reader/multi-writer support,
- compression for datasets using variable length types,
- SQL-like queries for datasets, and
- asynchronous processing.
Additionally, HSDS offers its own set of client library tools and utilities:
- REST VOL, a HDF5 library plug-in that clients can use to connect with HSDS
- H5pyd, a Python package for clients to connect with HSDS
- HS command line interface, standard Linux and HDF5 utilities for importing, exporting, listing content, etc.
Existing code can be easily repurposed to read/write data from the cloud using the REST VOL (for C/C++/Fortran applications) or h5pyd (for Python scripts) allowing organizations to easily move their data to the cloud without a large expenditure in re-writing code.
This 2020 webinar covers the basics of the Highly Scalable Data Service. Support for HSDS can be obtained through The HDF Group’s Forum (requires registration) or through a Kita subscription.