Webinar Followup: Highly Scalable Data Service and HDF Lab
Thank you to everyone who was able to join us for our webinar on the Highly Scalable Data Service (HSDS) and the demo of HDF Lab.
The recording of the webinar is now available:
A couple of links mentioned in the presentation:
- Highly Scalable Data Service (HSDS) on GitHub: https://github.com/HDFGroup/hsds
- Written description, etc. of HSDS: https://www.hdfgroup.org/solutions/highly-scalable-data-service-hsds/
- The HDF Group’s commercial products and services based on HSDS fall under Kita: https://www.hdfgroup.org/solutions/hdf-kita
- Try out HDF Lab for free: https://www.hdfgroup.org/hdfkitalab/
There were a few questions at the end of the presentation. For your convenience, a transcript of those questions follows. If you have additional questions, please contact us.
Q: What kind of performance do you see on writing?
A: Yep, pretty much equivalent to what you see with reading. Again, if you’re writing in chunk-sized blocks, I think performance would be a little bit worse than you’d have with local files. If you’re writing with strided access across many chunks, you’ll get the acceleration of being able to harness the power of multiple containers to do your writing. This is what we did when we were creating this NREL example, a 50-terabyte file. If we had to create that content through one process it would take some time, but since with the HDF server you can have multiple writers working simultaneously, we spun up multiple machines with multiple processes and had all those processes write sections of the file simultaneously, which greatly sped up the operation. So that’s a capability you have with the HDF server that you don’t have with the HDF5 library unless you’re using the parallel HDF5 library with MPI and such.
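The multi-writer pattern described in this answer can be sketched roughly as follows. This is a minimal illustration, not the code used for the NREL file: a plain Python list stands in for the dataset, threads stand in for the separate machines and processes, and `partition_rows`, `write_section`, and `parallel_write` are hypothetical helper names. With HSDS, each writer would instead open the same domain and assign to its own slice of the dataset.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(n_rows, n_writers):
    """Split range(n_rows) into n_writers contiguous (start, stop) sections."""
    base, extra = divmod(n_rows, n_writers)
    sections = []
    start = 0
    for i in range(n_writers):
        # The first `extra` writers take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        sections.append((start, stop))
        start = stop
    return sections


def write_section(target, start, stop):
    """Fill target[start:stop]; each writer touches only its own rows."""
    # Stand-in payload (assumption): the row index squared. A real writer
    # would generate or load the data for its section here.
    target[start:stop] = [row * row for row in range(start, stop)]
    return stop - start


def parallel_write(n_rows, n_writers):
    """Run one writer per section and return the filled target."""
    target = [None] * n_rows  # stand-in for a shared dataset (assumption)
    sections = partition_rows(n_rows, n_writers)
    with ThreadPoolExecutor(max_workers=n_writers) as pool:
        futures = [pool.submit(write_section, target, a, b) for a, b in sections]
        written = sum(f.result() for f in futures)
    assert written == n_rows  # every row was covered exactly once
    return target
```

Because the sections do not overlap, the writers never need to coordinate with each other, which is what makes this kind of job easy to spread across many processes.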
Q: I’d love to hear your thoughts on the Zarr library and how it compares to HSDS, particularly with h5pyd direct blob access you were talking about?
A: It’s interesting that the HSDS schema was developed fairly contemporaneously with the Zarr format, and they’re very similar, in fact. Zarr has a similar scheme where there are JSON objects that store metadata, and chunks are stored as objects, and they even use the same type of chunk indexing scheme, except in Zarr, I think it’s zero comma zero comma zero where we have zero underscore zero underscore zero, but the same kind of structure. One difference is that in Zarr, they actually map the HDF5 hierarchy directly into the S3 storage schema, meaning that if I have a dataset with an h5path of /g1/g1.1/dset1.1.1, that gets stored in the key <filename>/g1/g1.1/dset1.1.1. You’ll notice with the HSDS schema we don’t do that. It’s basically a flatter hierarchy. You have this folder which stores all the content for that file, and it’s stored by the file ID, not by the file name. This lets us do things like renaming objects in the file without moving the contents around, or supporting multiple links to the same object.
Rather than storing objects by their HDF5 path, we have a subfolder D. There’s another subfolder called G that you don’t see here because this file doesn’t have any subgroups. The D folder has all the datasets in the file, again stored by ID. The advantage of this approach shows up in multi-link setups: in HDF5, you don’t need to have a strict tree of objects. You could have a group G1 and a group G2 and have both G1 and G2 link to the same dataset. You can even have cycles in your tree. If you’re directly mapping those H5 paths to S3 paths, you couldn’t do that.
The other advantage is again, that we can rename links or create new links without moving the contents of the datasets when we do that.
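To illustrate why storing objects by ID makes renaming and multi-linking cheap, here is a toy model. It is a sketch under stated assumptions, not the actual HSDS storage schema: `ToyStore`, `chunk_key`, the `g-`/`d-` ID prefixes, and the key layout are all hypothetical. Only the ideas of ID-based keys, links kept as metadata, and underscore-separated chunk indices come from the discussion above.

```python
import uuid


def chunk_key(dset_id, index):
    """Build a storage key from a dataset ID and a chunk index tuple."""
    return dset_id + "/" + "_".join(str(i) for i in index)


class ToyStore:
    """Toy object store keyed by ID (hypothetical, not the real HSDS layout)."""

    def __init__(self):
        self.objects = {}  # storage key -> chunk bytes
        self.groups = {}   # group ID -> {link name: target ID}

    def create_group(self):
        gid = "g-" + uuid.uuid4().hex[:8]
        self.groups[gid] = {}
        return gid

    def create_dataset(self, chunks):
        """Store chunks given as {index tuple: bytes}; return the dataset ID."""
        did = "d-" + uuid.uuid4().hex[:8]
        for index, data in chunks.items():
            self.objects[chunk_key(did, index)] = data
        return did

    def link(self, gid, name, target_id):
        # A link is only a metadata entry in the group; no chunk data is copied.
        self.groups[gid][name] = target_id

    def rename_link(self, gid, old, new):
        # Renaming touches the group's link table and nothing else.
        self.groups[gid][new] = self.groups[gid].pop(old)


store = ToyStore()
g1, g2 = store.create_group(), store.create_group()
dset = store.create_dataset({(0, 0, 0): b"aa", (0, 0, 1): b"bb"})
store.link(g1, "dset1", dset)
store.link(g2, "alias", dset)   # second link to the same dataset
keys_before = set(store.objects)
store.rename_link(g1, "dset1", "renamed")
```

After the rename, the chunk keys in `store.objects` are unchanged, and both `g1` and `g2` still resolve to the same dataset ID. In a path-keyed layout, the same rename would require moving every chunk to a new key.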
Other than that, as we move on to the direct access project, I think the performance compared to Zarr will be fairly equivalent.
The schema is similar. The difference, I think, is that Zarr is not a service; Zarr is just a way of storing content, so you won’t have the capability of accessing it through a service. Direct access is good, especially when your workers are in the cloud, close to the storage system. If you are accessing data remotely, though, the cost of downloading all the data directly from S3 over the internet will be prohibitive. With HSDS having both a server and this direct access capability, you get to choose whatever is optimal for your particular application. The other thing is that because we have the VOL, you can access HSDS-sharded data through C or Fortran applications, and I think currently there’s only a Python package for Zarr access.
Q: Does HSDS integrate with the Dask distributed compute framework?
A: Yes, it does. I think HSDS works well with Dask, especially for Dask running in a Kubernetes cluster. In the cluster, you’d have your Dask workers that need to read HDF5 data. Having workers access content through a server means that we can have parallel requests come in from the Dask workers to the server, and the only trick then is that you want to scale up the server to meet the load from the number of Dask workers you have. With HSDS and Kubernetes, there’s a single command to scale the number of pods up or down, so that should be fairly easy to do. The trick will be calibrating the respective workloads you want for the server and for the Dask workers.
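A rough sketch of the access pattern described in this answer: many workers each request one block of a dataset, reduce it locally, and the partial results are combined. A thread pool stands in for the Dask workers and a Python list stands in for a dataset served by HSDS. Both are assumptions made to keep the example self-contained, and `block_slices`, `read_and_sum`, and `parallel_sum` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def block_slices(n_rows, block_rows):
    """Yield slices that cover range(n_rows) in blocks of block_rows rows."""
    for start in range(0, n_rows, block_rows):
        yield slice(start, min(start + block_rows, n_rows))


def read_and_sum(dataset, sel):
    """One worker task: read a single block (one request) and reduce it."""
    return sum(dataset[sel])


def parallel_sum(dataset, block_rows, n_workers):
    """Sum a dataset by issuing block reads from n_workers in parallel."""
    selections = block_slices(len(dataset), block_rows)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(lambda sel: read_and_sum(dataset, sel), selections)
        return sum(partials)
```

Here `n_workers` plays the role of the number of Dask workers. The number of concurrent block reads grows with it, which is why the server side needs to be scaled to match.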
Thank you so much for joining us for this webinar, and please watch for future webinar announcements!