Highly Scalable Data Service (HSDS)

What is the Highly Scalable Data Service (HSDS)?

For many organizations moving their data to the cloud, major code changes and cost increases are top concerns.

HSDS is a REST-based solution for reading and writing complex binary data formats within object-based storage environments such as the cloud. Developed to make large datasets accessible in a manner that’s both fast and cost-effective, HSDS stores HDF5 files using a sharded data schema while making the functionality traditionally offered by the HDF5 library accessible to any HTTP client. HSDS is open source software, licensed under the Apache License 2.0. Managed HSDS products, support, and consulting are offered through The HDF Group’s Kita Data Products & Solutions.

Kita Data Products & Solutions provide users turnkey implementations of HSDS and customized training, support, and consulting. Our experts can provide guidance and assistance from early adoption of HSDS to implementing and deploying enterprise solutions. We can help you ensure project success and avoid missteps so you can quickly overcome your unique data challenges. Contact us to learn more.

Why use HSDS?

The benefits of the Highly Scalable Data Service (HSDS) extend over almost every need organizations have for their data, regardless of file format:

HSDS Benefits Details
Scalability
  • Store petabytes of data
  • Scale across multiple servers
  • Dynamically change the number of server nodes to meet client demands
Performance
  • Leverage smart data caching to accelerate object storage queries
  • Process single requests in parallel on the server
  • Run existing HDF5 applications faster by utilizing the automatic parallelization features of HSDS
Concurrency
  • Support multiple writers and multiple readers (even on the same file)
  • Support simultaneous use from thousands of clients
  • Applications can use multithreading
Simplicity
  • Deploy with one click
  • Access only the HDF5 data you need
  • Work with your cloud provider of choice
Compatibility
  • Rapidly shift large HDF5 files, applications, and infrastructure to the cloud
  • Compatible with any HDF5 based data (e.g. NetCDF4, Energistics, etc.)
  • Existing applications can use HSDS with minimal changes (HDF5 API and Python h5py API compatibility)
Security
  • HTTP and HTTPS supported
  • Clients don’t need access to cloud storage
  • Role-Based Access Control (RBAC) can be used to easily manage group access
  • Access Control Lists (ACLs) control which users have access to individual data files
Reliability
  • Multiple copies of each object are stored (no danger of data being lost)
  • Object updates are atomic, so no danger of files being corrupted
Portability
  • Use AWS S3, Azure Blob, OpenIO, or POSIX storage
  • Docker, Kubernetes, Azure managed Kubernetes (AKS), AWS Kubernetes (EKS), and DC/OS container management systems are all supported
  • Move from on-prem deployments to the cloud without needing any application code changes
Cost
  • Use low-cost AWS S3 storage (or other object storage systems)
  • Use GZIP or BLOSC compression to reduce storage costs
  • No proprietary software required

How does HSDS work?

HSDS is implemented as a set of containers, where the number of containers can be scaled up or down based on the desired performance and expected number of clients. The containers can either be run on a single server (using Docker) or run across a cluster using Kubernetes. The architecture of HSDS is explained in detail below.

There are two classes of containers: the service node and the data node.

The service nodes are the containers that receive requests from the clients. For each request, the service node authenticates the request, verifies that the action (e.g. deleting or adding an object) is authorized, and then forwards the request to one or more data nodes.
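To make the client side of this exchange concrete, here is a minimal sketch of the kind of HTTP request a client might prepare for a service node. It only builds the request and does not send it. The endpoint address, domain path, and dataset id are hypothetical placeholders, and the `domain` and `select` query parameters reflect our understanding of the HSDS REST API; check the API documentation for your version before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_read_request(endpoint, domain, dataset_id, select):
    """Build (but do not send) a GET request for part of a dataset."""
    # Query parameters are percent-encoded by urlencode.
    query = urlencode({"domain": domain, "select": select})
    url = f"{endpoint}/datasets/{dataset_id}/value?{query}"
    return Request(url, headers={"Accept": "application/json"}, method="GET")


# All values below are hypothetical placeholders.
req = build_read_request(
    endpoint="http://localhost:5101",
    domain="/home/demo/example.h5",
    dataset_id="d-00000000-0000-0000-0000-000000000000",
    select="[0:10,0:10]",
)
print(req.get_method(), req.full_url)
```

A real client would also attach credentials so that the service node can authenticate and authorize the request before forwarding it.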

For their part, the data nodes access the object storage system (e.g. S3) to read or write the requested data. Each data node maintains a cache, so that recently accessed data can be returned directly without having to go to the storage system. The data nodes maintain a virtual partition of the storage system, so that each node is responsible for a distinct subset of the objects (a file’s objects will typically be distributed across all partitions).
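The partitioning and caching ideas can be illustrated with a small, self-contained sketch. This is not the HSDS implementation: the hash-based assignment, the least-recently-used cache policy, and the names `node_for_object` and `DataNode` are all assumptions chosen for illustration, and a plain dictionary stands in for the object store.

```python
import hashlib
from collections import OrderedDict


def node_for_object(object_id, node_count):
    """Map an object id to one node number (assumed hash-based scheme)."""
    digest = hashlib.md5(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % node_count


class DataNode:
    """Toy data node with a small least-recently-used cache."""

    def __init__(self, storage, cache_size=2):
        self.storage = storage          # stand-in for the object store
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.storage_reads = 0          # counts trips to the object store

    def read(self, object_id):
        if object_id in self.cache:
            # Cache hit: mark as recently used, skip the object store.
            self.cache.move_to_end(object_id)
            return self.cache[object_id]
        self.storage_reads += 1
        data = self.storage[object_id]
        self.cache[object_id] = data
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used
        return data


storage = {"chunk-a": b"aaa", "chunk-b": b"bbb", "chunk-c": b"ccc"}
nodes = [DataNode(storage) for _ in range(4)]

# The same object always goes to the same node, so a repeat read hits its cache.
owner = nodes[node_for_object("chunk-a", len(nodes))]
first = owner.read("chunk-a")
second = owner.read("chunk-a")
print(owner.storage_reads)
```

Because every object id maps to exactly one node, no two nodes ever cache or write the same object, which is what makes the partition "virtual" rather than a physical split of the storage.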

For HDF5 datasets, one object is created for each chunk in the dataset. When clients send requests to read or write data across multiple chunks, the data nodes can perform I/O in parallel. With the regular HDF5 library, these types of operations require each chunk to be processed sequentially; with HSDS, the chunks are handled concurrently, so the operation can be performed much faster.
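As a rough illustration of this idea, the sketch below works out which chunks a rectangular selection touches and then fetches them concurrently. It is a simplified model under stated assumptions, not HSDS code: `read_chunk` is a hypothetical stand-in for a data node fetching one chunk object, and a thread pool plays the role of the data nodes working in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def chunks_for_selection(start, stop, chunk_shape):
    """Return the index of every chunk overlapping the half-open box [start, stop).

    Assumes stop > start in every dimension.
    """
    ranges = [
        range(s // c, (e - 1) // c + 1)
        for s, e, c in zip(start, stop, chunk_shape)
    ]
    return list(product(*ranges))


def read_chunk(chunk_index):
    # Hypothetical stand-in for a data node reading one chunk object.
    return chunk_index, f"data for chunk {chunk_index}"


def read_selection(start, stop, chunk_shape, workers=4):
    """Fetch all chunks touched by the selection concurrently."""
    chunk_ids = chunks_for_selection(start, stop, chunk_shape)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(read_chunk, chunk_ids))


# A 150 x 250 selection over 100 x 100 chunks touches a 2 x 3 grid of chunks.
result = read_selection(start=(0, 0), stop=(150, 250), chunk_shape=(100, 100))
print(sorted(result))
```

The more chunks a request spans, the more work there is to spread out, which is why large multi-chunk reads and writes benefit most from this approach.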

In addition to providing functional equivalents of HDF5 library features such as compression, hyperslab selection, and point selection, HSDS also supports many features not yet implemented in the library, including:

  • multi-reader/multi-writer support,
  • compression for datasets using variable length types,
  • SQL-like queries for datasets, and
  • asynchronous processing.

Additionally, HSDS offers its own set of client library tools and utilities:

  • REST VOL, an HDF5 library plug-in that clients can use to connect with HSDS
  • h5pyd, a Python package for clients to connect with HSDS
  • HS command line interface, standard Linux and HDF5 utilities for importing, exporting, listing content, etc.

Existing code can be easily repurposed to read and write data in the cloud using the REST VOL (for C/C++/Fortran applications) or h5pyd (for Python scripts), allowing organizations to move their data to the cloud without a large expenditure on rewriting code.
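For Python scripts, the change can be as small as swapping the package that opens the file. The following is a minimal sketch, assuming the h5pyd package is installed and an HSDS endpoint is reachable. The function name `read_first_rows` and the example domain, dataset path, and endpoint are hypothetical; credentials are assumed to come from your h5pyd configuration.

```python
def read_first_rows(domain, dataset_path, endpoint, count=10):
    """Read the first `count` rows of a dataset served by HSDS."""
    # Imported inside the function so this sketch can be loaded
    # even where h5pyd is not installed.
    import h5pyd

    # h5pyd.File mirrors h5py.File, but takes a domain path and a
    # server endpoint instead of a local file name.
    with h5pyd.File(domain, "r", endpoint=endpoint) as f:
        dset = f[dataset_path]
        return dset[0:count]


# Example call (hypothetical values, requires a running HSDS server):
# rows = read_first_rows("/home/demo/example.h5", "/temperature",
#                        "http://localhost:5101")
```

Only the requested slice is transferred, so a script can work with a small part of a very large dataset without downloading the whole file.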

This 2020 webinar covers the basics of the Highly Scalable Data Service. Support for HSDS can be obtained through The HDF Group’s Forum (requires registration) or through a Kita subscription.

Next Steps

HSDS Source Code

The source code for the Highly Scalable Data Service (HSDS) is available on GitHub. You’re welcome to access the source code under the permissive Apache License 2.0. As a non-profit, The HDF Group also offers commercial services to help with the setup, installation, and maintenance of HSDS, as well as training. These services are offered under the Kita™ umbrella.

NREL Case Study

The U.S. Energy Department’s National Renewable Energy Laboratory (NREL) released the WIND Toolkit – a portion of a 500 TB dataset accessible to the public using Kita. Among other functions, users can interact with the dataset with this interactive online visualization tool which also utilizes HSDS to quickly serve slices from the massive dataset through a web browser. Read the press release and case study.

Kita Data Products & Solutions

The HDF Group provides Kita™ to help organizations make the most of their use of HSDS, and get their services up and running quickly and efficiently. Kita is available in three variants:

Kita Lab is a managed JupyterLab environment with a command line interface. This service is available with a 30-day free trial and priced at $10/month or $100/year after the free trial. It’s great for small groups of researchers or those looking to experiment with Kita before making a larger commitment.

Kita Server for AWS Marketplace can be accessed through the marketplace for those using AWS services.

HDF Cloud Support Services. The HDF Group staff will install and customize HSDS on your existing infrastructure. Contact us for a quote.