HSDS Streaming

John Readey, The HDF Group

Note: this is an updated and expanded version of material that originally appeared in an earlier HDF Newsletter.

A long-standing constraint with the Highly Scalable Data Service (HSDS) has been how much data could be read or written with one HTTP request. This limit is configurable, but the default has been 100 MB. If a single request tried to read or write more than that, you would run into a 413 Payload Too Large error. The max_request_size config could be increased, but you would also need to be sure that the HSDS Docker containers or Kubernetes pods had sufficient RAM; otherwise the container would die with an out-of-memory exception.
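To make the limit concrete, here is a minimal sketch of the size check a client could do before sending a request. The helper names are hypothetical, and treating the 100 MB default as 100 MiB is an assumption on my part; the actual value comes from your max_request_size setting.

```python
# Assumption: the 100 MB default is interpreted here as 100 MiB.
MAX_REQUEST_SIZE = 100 * 1024 * 1024


def request_nbytes(shape, itemsize):
    """Bytes needed to transfer a full selection of the given shape."""
    nbytes = itemsize
    for extent in shape:
        nbytes *= extent
    return nbytes


def exceeds_limit(shape, itemsize, limit=MAX_REQUEST_SIZE):
    """True if a single request for this selection would be over the limit."""
    return request_nbytes(shape, itemsize) > limit


# A 12000 x 2200 selection of 8-byte integers is well over the default limit.
print(request_nbytes((12000, 2200), 8))
print(exceeds_limit((12000, 2200), 8))
```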
A related concern is that for very large read requests, the client could time out before any bytes were received from the server, because HSDS would need time to assemble all the bytes for the response before sending any data to the client. If this took longer than 30 seconds or so, the client might close the connection, thinking the server was down.

The good news is that the max request size limit no longer applies with the latest HSDS update. In the new version, large requests are streamed back to the client as the bytes are fetched from storage. Regardless of the size of the read request, the amount of memory used by the service is limited and clients will start to see bytes coming back while the server is still processing the tail chunks in the selection. The same applies for write operations—the service will fetch some bytes from the connection, update the storage, and fetch more bytes until the entire request is complete.
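The idea can be illustrated with a plain Python generator. This is only a sketch of the general pattern, not the HSDS implementation: the storage dictionary, fetch_chunk, and stream_selection are all made-up stand-ins. The point is that each chunk is handed to the client as soon as it is fetched, so only one chunk needs to be held in memory at a time.

```python
from typing import Callable, Iterable, Iterator


def stream_selection(chunk_ids: Iterable[int],
                     fetch_chunk: Callable[[int], bytes]) -> Iterator[bytes]:
    """Yield each chunk's bytes as soon as it has been fetched."""
    for chunk_id in chunk_ids:
        yield fetch_chunk(chunk_id)  # only one chunk is held at a time


# Stand-in for object storage: four chunks of 4 bytes each.
storage = {i: bytes([i]) * 4 for i in range(4)}
events = []


def fetch_chunk(chunk_id: int) -> bytes:
    events.append(f"fetch {chunk_id}")
    return storage[chunk_id]


received = bytearray()
for block in stream_selection(range(4), fetch_chunk):
    # The "client" sees bytes before the later chunks have been fetched.
    events.append(f"send {len(block)} bytes")
    received.extend(block)

print(events)
```

The event log alternates between fetching and sending, which is the behavior described above: the first bytes go out while the tail chunks are still waiting to be read.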

A couple of provisos… First, this only works with binary requests. Reading or writing with a JSON content type will still be limited to the max_request_size value. Second, requests that use variable length types will still be limited to max_request_size regardless of whether the request is binary or JSON. (Variable length types make the calculation of how much data to read or write to the stream a bit more complicated.)
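Those two rules boil down to a small predicate. The function below is a hypothetical helper for illustration, not part of HSDS or h5pyd, and it assumes binary requests are identified by the application/octet-stream content type.

```python
def uses_streaming(content_type: str, is_variable_length: bool) -> bool:
    """Hypothetical check: True if a request is exempt from max_request_size.

    Assumes binary requests use the "application/octet-stream" content type.
    """
    is_binary = content_type == "application/octet-stream"
    return is_binary and not is_variable_length


print(uses_streaming("application/octet-stream", False))  # binary, fixed-length
print(uses_streaming("application/json", False))          # JSON is still limited
print(uses_streaming("application/octet-stream", True))   # variable length is still limited
```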

An example of this in practice can be found on GitHub. This test does the following:

Initialize an nrow x ncol array (12000 x 2200 by default, or about 201 MiB with 8-byte integers)
Write the entire array to an HDF5 dataset in one HTTP binary request
Read the array back in one HTTP request
Verify we got the same values we sent

Running on an AWS m5.2xlarge instance with a 4-node HSDS in Docker and writing to S3, I got the following output running the test with a 200 MB target (a 12000 x 2200 dataset with an 8-byte integer type):

$ python stream_test.py
testStream2D /home/test_user1/stream/bigfile.h5
dataset shape: [12000, 2200]
got dset_id: d-75afdf3e-1465316d-f8c9-dd6083-d806b2
initializing test data (211200000 bytes, 201.42 MiB)
writing...
elapsed: 2.66 s, 75.77 MB/s
Reading…
elapsed: 1.13 s, 177.93 MB/s
comparing sent vs. received
passed!
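As a sanity check on that output, the size and throughput figures can be reproduced with a little arithmetic. This is my own back-of-the-envelope sketch, not code from the test. It suggests the rates are reported in MiB per second; the small differences from the printed 75.77 and 177.93 are consistent with the elapsed times having been rounded to two decimals.

```python
MIB = 1024 * 1024

nbytes = 12000 * 2200 * 8  # rows x columns x 8-byte integer


def throughput_mib_s(nbytes, elapsed_s):
    """Transfer rate in MiB per second."""
    return nbytes / MIB / elapsed_s


print(f"{nbytes} bytes, {nbytes / MIB:.2f} MiB")  # 211200000 bytes, 201.42 MiB
print(f"write: {throughput_mib_s(nbytes, 2.66):.2f} MiB/s")
print(f"read:  {throughput_mib_s(nbytes, 1.13):.2f} MiB/s")
```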

I also tried this test on two other machines: a 2018 MacBook Pro with 6 cores and 16 GB of memory, and my (just received) HP Dev One laptop with 8 cores and 16 GB of memory. The HP Dev One is an interesting piece of hardware: it’s a laptop custom designed to run Linux and targeted (as the name implies) at developers. You can read more about it at https://hpdevone.com. For our purposes, the advantage of running Docker on Linux rather than macOS is that you don’t need a VM or Docker Desktop, as containers run directly on the hardware.

If I run the test above on the MacBook Pro (using the onboard SSD for storage), I get:

writing: 42.2 MB/s
reading: 54.3 MB/s

Now let’s try on the Dev One machine:

writing: 290.0 MB/s
reading: 276.0 MB/s

More than a 5x speedup! The AMD Ryzen 7 on the Dev One is somewhat faster than the Intel i7 on the MacBook, but I suspect the bulk of the difference is due to the extra time needed to translate I/O calls through the hypervisor layer when running Docker on macOS.

To support streaming, there were some tweaks needed on the client side as well, so for h5pyd, you’ll want to update to version 0.10.3 (you can build yourself or download from PyPI with pip install h5pyd --upgrade). You won’t need to make any changes in your h5pyd applications, but if you’ve previously been breaking up writes into smaller pieces to get around the request size limit, you can now just directly write any size selection.
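For reference, the kind of workaround that is no longer needed looks something like the sketch below. The row_blocks helper is hypothetical (it is not part of h5pyd), and the 100 MiB limit is an assumed value; it just splits a 2D write into row ranges that each stay under a byte limit.

```python
def row_blocks(nrows, ncols, itemsize, max_bytes):
    """Yield (start, stop) row ranges whose byte size stays within max_bytes."""
    row_bytes = ncols * itemsize
    rows_per_block = max(1, max_bytes // row_bytes)
    for start in range(0, nrows, rows_per_block):
        yield start, min(start + rows_per_block, nrows)


# Assumed limit of 100 MiB; a 12000 x 2200 array of 8-byte values needs 3 writes.
blocks = list(row_blocks(12000, 2200, 8, 100 * 1024 * 1024))
print(blocks)

# Old pattern (one request per block):
#     for start, stop in blocks:
#         dset2d[start:stop, :] = arr[start:stop, :]
# With streaming, a single assignment is enough:
#     dset2d[:, :] = arr
```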

Let’s look at an example to see how this works. This program creates a 2D dataset (2200 x 12000 by default), and then writes a numpy array to it in one call: dset2d[:, :] = arr[:, :]. Next, we read back the data again: arr_copy = dset2d[:, :]. Finally, we verify the original array and the copy actually are equivalent.

Also, with this program we can compare the performance of using h5py with the HDF5 library and h5pyd with HSDS (if the file path starts with “hdf5://” h5pyd is used, otherwise h5py). The program also has an option to use compression for the HDF5 dataset (run: python write_example.py --help to see all the available options).
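A condensed sketch of that flow is shown below. It is not the actual write_example.py: the function names, the dataset name, and the random test data are my own assumptions. The h5py and h5pyd calls (File, create_dataset, slice assignment) are the standard ones both packages share.

```python
def use_h5pyd(path: str) -> bool:
    """Paths starting with hdf5:// go to HSDS via h5pyd; others use h5py."""
    return path.startswith("hdf5://")


def roundtrip(path, nrow=2200, ncol=12000, compression=None):
    """Write a 2D array in one call, read it back, and compare."""
    # Imported here so use_h5pyd() works even if the packages are not installed.
    import numpy as np
    if use_h5pyd(path):
        import h5pyd as h5
    else:
        import h5py as h5

    # Assumption: random 8-byte integers stand in for the real test data.
    arr = np.random.randint(0, 1000, size=(nrow, ncol), dtype=np.int64)
    with h5.File(path, "w") as f:
        dset2d = f.create_dataset("dset2d", (nrow, ncol), dtype=np.int64,
                                  compression=compression)
        dset2d[:, :] = arr[:, :]   # one write for the whole array
        arr_copy = dset2d[:, :]    # one read for the whole array
    return np.array_equal(arr, arr_copy)


print(use_h5pyd("hdf5://home/test_user1/stream/bigfile.h5"))
print(use_h5pyd("/tmp/bigfile.h5"))
```

Calling roundtrip("/tmp/bigfile.h5", compression="gzip") would exercise the HDF5 library path, while an hdf5:// path would send the same code through HSDS.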

How did it go with the three test machines (m5.2xlarge, MacPro, and DevOne)? Here are the results:

test machine	HSDS write (MB/s)	HSDS read (MB/s)	HDF5Lib write (MB/s)	HDF5Lib read (MB/s)
m5.2xlarge	100.9	97.6	1416.8	1308.1
macpro	40.1	40.0	848.1	821.8
devone	240.5	191.6	2021.9	2071.1

Not surprisingly, the HDF5 library was much faster for this test on all three test machines: the library can pretty much go as fast as the disk will let it, while with HSDS the data has to be marshaled from the client to the service and then to the disk. Comparatively, though, the Dev One system does best at roughly 8x to 11x slower than the library, compared with ~20x slower on the Mac (due to the virtualization overhead) or ~14x slower on the EC2 instance (due to S3 having much higher latency than an EBS volume).

Let’s try the same test using gzip compression. In this case we get:

test machine	HSDS write (MB/s)	HSDS read (MB/s)	HDF5Lib write (MB/s)	HDF5Lib read (MB/s)
m5.2xlarge	104.3	81.5	50.2	305.7
macpro	35.6	42.7	76.3	245.8
devone	233.8	187.3	84.5	424.1

Here the relative performance of HSDS and HDF5Lib was much closer. On the EC2 instance, write speed (to S3!) with HSDS was twice as fast as with the HDF5 library (to an SSD drive). In this case, the overhead involved in marshaling the data to HSDS is made up for by HSDS’s ability to take advantage of multiple cores to do the data compression. The library pulled ahead on the read portion, though, with more than 3x the performance.

Comparing the MacBook Pro with the Dev One machine, we see that library performance was slightly worse on the Mac (due to generally older hardware). HSDS, however, performed much better on the Dev One than on the Mac (again mostly due to the virtualization penalty). As with the EC2 instance, HSDS writing compressed data worked quite well on the Dev One machine: almost 3x faster than with the HDF5 library.
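The ratios quoted for the gzip runs can be reproduced from the table values with a few lines of Python. This is just arithmetic on the numbers above; the dictionary layout is my own.

```python
# (HSDS write, HSDS read, HDF5Lib write, HDF5Lib read) in MB/s, gzip enabled
gzip_results = {
    "m5.2xlarge": (104.3, 81.5, 50.2, 305.7),
    "macpro": (35.6, 42.7, 76.3, 245.8),
    "devone": (233.8, 187.3, 84.5, 424.1),
}

ratios = {}
for machine, (hsds_w, hsds_r, lib_w, lib_r) in gzip_results.items():
    write_ratio = hsds_w / lib_w  # > 1 means HSDS wrote faster than the library
    read_ratio = lib_r / hsds_r   # > 1 means the library read faster than HSDS
    ratios[machine] = (write_ratio, read_ratio)
    print(f"{machine}: HSDS write {write_ratio:.1f}x the library, "
          f"library read {read_ratio:.1f}x HSDS")
```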

For this file (both with HSDS and HDF5Lib) the uncompressed size was 277 MB and the compressed size was only 39 MB. Turning on compression hardly affected performance with HSDS (though many datasets will not compress as well as this one). Compression did have a larger performance impact with HDF5, but it may not matter that much in practice (even with compression, the read throughput was quite good).

I hope this provides a helpful overview of the streaming feature and an interesting performance snapshot. Let us know your thoughts on this feature and whether you are seeing performance improvements with this release. If the performance of your HDF application is not all you would hope it to be, sharing a test case that illustrates the problem will help The HDF Group understand what might be going on. Oftentimes we can offer suggestions for simple changes that will make a big difference in performance. Other times, such a test case can point the spotlight at areas of the software that could use some optimization. In any case, the HDF forum is a good place to get the conversation going.

