Blog

HDF5 C++ Webinar Followup

As promised, here are the follow-up materials from the January 24th HDF5 C++ Webinar.

As a reminder, these were the presentations:

You can read more about the presentations in this blog post.

The webinar recording has been posted.

Questions and Answers from the session:

If you have more questions, don’t hesitate to post them on the HDF5 C++ Users Group on the forum (https://forum.hdfgroup.org/c/CPP-Users-Group), or send them to help@hdfgroup.org and we will answer them.

Q: Are you Visual Studio 2017 compatible? 

Steven Varga: As far as I know, it works on the Windows platform. I personally do not use Visual Studio; however, it has been confirmed to work.

Eugen Wintersberger: You can build the code with Visual Studio 2017; most probably you will have to use the 2015 compiler toolchain, which can be integrated quite easily. The reason is an incompatibility with the version of Boost we currently use on the Windows build. Once this is fixed… I would have to look it up. If there is a fixed Boost version, there should be no issue.

Martin Shetty: Looking forward, C++17 makes the filesystem library part of the standard, which will eliminate the need for Boost, so in fact things will become more compatible.
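For context, C++17’s std::filesystem provides the portable path handling that previously required Boost.Filesystem. A minimal sketch (the function and names below are illustrative, not part of any of the presented libraries):

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Compose an HDF5 output path portably, as boost::filesystem once provided.
fs::path h5_output_path(const std::string& dir, const std::string& stem) {
    return fs::path(dir) / (stem + ".h5");
}
```

Because path composition and queries like `extension()` are now in the standard library, a library built against C++17 no longer needs to carry a Boost dependency for this.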

Chris: Our software is built with CMake, and provided the compiler can handle the standard compliance that we require, there is no reason why not. But we are manifestly not a Windows shop, I’m afraid, so it’s not something we test.

 

Q: Are any of the projects represented here today looking at the possibility of having HDF5 I/O requests occur from different classes of memory (e.g. GPU memory)?

Steven Varga: If a linear algebra system supports automatic transfer from GPU to system memory, then it may work. The rationale: Bandicoot is an Armadillo-based linear algebra system for GPGPUs by Conrad Sanderson, similar to MAGMA. If such systems implement automatic transfer between device memory and system memory, then at some point it becomes possible to support them.

The correct way to handle this in H5CPP is to move the memory region from GPGPU memory to system memory with

cudaMemcpy(host_ptr, device_ptr, bytes, cudaMemcpyDeviceToHost);

then write from the obtained pointer to disk:

auto fd = h5::create("file_name.h5", H5F_ACC_TRUNC);

h5::write<double>(fd, "dataset_name", host_ptr, h5::current_dims{ bytes/sizeof(double) });

However, the question was probably about GPU RDMA and PCI peer-to-peer, to provide very low latency access to data. HDF5 does not support this access pattern, nor does H5CPP. It is an interesting topic and a direction for a research project, and it requires funding.

Chris Green: I think I can say that, as simple as our system is, the insert function there requires pointers, and where that memory comes from is somewhat up to the user. So provided that it can be specified as a pointer, I don’t think we care.

 

Q: Can h5cpp handle big data (load files that are bigger than RAM)? 

Eugen Wintersberger: A file bigger than RAM? There is no issue. It’s a file. So, I don’t think that’s an issue.  

Elena Pourmal: Yeah, because you can read by hyperslabs, right?  

Eugen Wintersberger: Yeah, if you remember, this was one of the big use cases we had in the synchrotron community: dealing with data far bigger than RAM.
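The hyperslab idea, reading a bounded window of a huge dataset at a time, can be sketched without HDF5 itself. The loop below uses plain binary I/O as a stand-in; real HDF5 code would instead select each window with H5Sselect_hyperslab on the dataset’s dataspace:

```cpp
#include <cstdio>
#include <vector>

// Sum a file of doubles while holding at most `window` values in RAM,
// analogous to reading an HDF5 dataset one hyperslab at a time.
double chunked_sum(const char* path, std::size_t window) {
    std::vector<double> buf(window);
    double total = 0.0;
    if (FILE* f = std::fopen(path, "rb")) {
        std::size_t n;
        // Each iteration loads one bounded window; memory use stays O(window).
        while ((n = std::fread(buf.data(), sizeof(double), window, f)) > 0)
            for (std::size_t i = 0; i < n; ++i)
                total += buf[i];
        std::fclose(f);
    }
    return total;
}
```

The key point is that peak memory is set by the window size, not the file size, which is exactly why a file far bigger than RAM poses no problem.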

 

Q: Hey, is h5cpp or H5CPP supporting multithreaded reading of files? 

Steven Varga: As far as I know, no, not in our universe of operating systems. This question requires some context, so “no” by itself doesn’t answer it. The regular way to access multiple disks would be something like RAID, where through one device you get access to the storage. I don’t really understand how the question would be applicable.

Chris Green: From my point of view, I partly agree with Steve; this is a very difficult question to answer. If I make an assumption about the use case, where you have a multithreaded application which wants to read from a single file, independent of where this file is stored, then no, we don’t provide explicit support; the API is not thread safe in that sense, or in what you would consider the textbook definition of thread safe. It usually doesn’t make too much sense to do it.

 

Q: What’s your comment on partial I/O row versus column wise? 

Marc Paterno: The library we have accepts data from the user row-wise, because that is our use case. In our particular analysis, we see one event at a time, and so want to write out the data for that event, then move on to the next event and write out the data for that event. So the user-facing part of the API has to accept the data row-wise, based on our use case.

The data, when written to the HDF5 file underneath, are written column-wise. Our analysis use cases, our style of analysis, use a system like Pandas in Python, where we operate on whole columns of data at a time. So the style of the interface is driven by our usage pattern.
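The pattern Marc describes, a row-wise user API buffering into column-wise storage, might be sketched like this (class and method names are illustrative, not the actual library’s):

```cpp
#include <map>
#include <string>
#include <vector>

// Accepts one "event" (row) at a time, but accumulates each field into its
// own contiguous column buffer, ready to be flushed column-wise to disk.
class ColumnBuffer {
public:
    void append_row(const std::map<std::string, double>& row) {
        for (const auto& [name, value] : row)
            columns_[name].push_back(value);
    }
    // One contiguous column, as a column-wise writer (or Pandas-style
    // reader) would want it.
    const std::vector<double>& column(const std::string& name) const {
        return columns_.at(name);
    }
private:
    std::map<std::string, std::vector<double>> columns_;
};
```

The design choice is that the caller never sees the transposition: rows go in, but each field lands in its own contiguous buffer, so the eventual write (and later whole-column reads) stay sequential.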

 

Q: Is there an equivalent of h5repack, built into the new API? 

Elena Pourmal: I’m not sure how to interpret this question; is it whether there are APIs to do what h5repack does? But I will let the presenters…

Eugen Wintersberger: As far as I know, this is not going to happen. I don’t know; I mean, I’m not sure. Do you have an API for this in the C code?

Elena Pourmal: Unfortunately, no. But as a rule of thumb we encourage the best practice of building tools using public interfaces; those can be high-level interfaces, meaning they do a lot of things in one call. It’s very possible, for example, if we look at the current code of our command-line tools, which are written in C, and compare what a tool does against the same code written using h5py, that there would be much less code to write; but there are reasons why we write those tools against the C APIs. So using h5cpp or another C++ API may certainly facilitate creation of tools and provide some functionality. I guess the answer is no, but people are really welcome to use those APIs to create tools; the productivity that was mentioned in the first presentation is the key.

Eugen Wintersberger: No, we don’t have any plans in this direction for now.   

 

Q: Under which licenses are these interfaces released?  

Steven Varga: MIT, which is very liberal: you can use the headers for whatever purpose you would like. The compiler, which is distributed in binary format, does have an advertising clause inherited from the LLVM project. The generated code is MIT licensed and you can use it the way you would like.

Eugen Wintersberger: We license under the LGPL, so the idea is that you can link dynamically against our library, but if you make changes or modifications to the code, we would like you to share them with others.

Chris Green: Our license is officially the Fermilab license, but I believe it essentially is the three paragraph BSD license, would you agree with that, Marc?  

Marc Paterno: Yes, it’s essentially the BSD 3 clause.  

 

Q: Does quantum computing work with the energy saving for time and consumption? For the US Department of Energy, Fermilab

Marc Paterno: I don’t quite understand this question… currently the library we’re talking about here has nothing to do with quantum computing. However, Fermilab is involved with quantum computing research and there is a Fermilab quantum science initiative, so the Fermilab website has information about that and contact points to ask questions about Fermilab’s involvement in quantum computing. 

UPDATE from Marc: Fermilab’s Quantum Science Program (https://qis.fnal.gov) is pursuing a program to leverage the power of quantum science to address problems in data analysis and theoretical physics. High-energy physicists are also extending their expertise in sensor and accelerator technology for quantum software and computing. Much more information is available at the public web site, linked above.

 

Q: Seems like we need a presentation from the C++ interface that is integrated into the core repo.  If a “different purpose”, what’s the purpose of the one integrated vs. these 3rd party libraries?  

Elena Pourmal: I will try to start answering, but I welcome the presenters to reflect on this. Our HDF5 C++ interfaces were created first in the very late 90s, I believe, when they were added to the library. At that time, we couldn’t use some powerful features of C++ because of code portability across the many compilers then in use, and after that we went into maintenance mode for the HDF5 C++ APIs. HDF5 is now more than 20 years old, and as with any legacy software, you cannot simply remove APIs; sometimes we do, and the forum rightfully calls out our mistakes and holds us responsible for what we’re doing. We cannot remove those APIs because there are applications, sometimes critical applications, that use them; people cannot simply rewrite an application, so we are committed to continuing to support what we have, enhancing it as we have time and bandwidth. We absolutely accept contributions, and if you want to contribute to the HDF5 C++ APIs, please contact me or send email to help@hdfgroup.org with your patch or suggestion, or put it on the forum. We will be glad to work with you.

As you noticed, those presentations came from different angles, and they are essentially trying to solve different problems; all of those approaches are valid. The HDF Group is not endorsing any one of them. We would like the community to really find what you need, what is working for you, and use whatever you think is the best solution for your application. But I will ask the presenters to reflect.

Steven Varga: Luckily, I can be blamed for only one of the three projects; I’ll take full responsibility. The very reason is that the C++ API provided by The HDF Group is somewhat outdated: it’s based on a coding style of the early 2000s, and things have changed over time. The approach this project took is much lighter in terms of coding and linking; it only requires the C API and the math library, I believe, so it is very lean. Linguistically it has features much like Python’s, using the C++17 standard and template metaprogramming, and this technique is missing from the original C++ library.

Eugen Wintersberger: I guess Steven has already said the important things. The original C++ wrapper is basically outdated, and C++ has changed a lot over the years; in our local C++ user group, the general notion is to consider C++11 a new language, and of course the upcoming standards as well. One reason we started this project is that the original C++ wrapper did not cover the entire C functionality. One thing led to another: I had already written a wrapper once and another fellow had written one too, so we put everything together and started over again, with maybe a bit different approach than Steven’s, but fair enough, that’s OK.

Chris Green: Our attempt at this is very narrowly focused to a very particular use case that we needed to satisfy for our users, so we could easily re-implement our interface on top of probably either of the two other systems presented here today without great issue, but we did it to satisfy the use case we had.  

 

Q: Do you have any plans to add reading functionality to Ntuple? A write-only approach will not work for me. 

Steven Varga: Ntuples can be represented as POD (plain old data) structure types, and that functionality is implemented seamlessly. You basically do operations with arrays or standard vectors currently, and use the compiler to get the necessary shims for the compound datatype descriptors, so yes.
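The POD idea can be illustrated without H5CPP’s compiler: a plain-old-data struct has a fixed byte layout, so arrays of such records round-trip through raw I/O. H5CPP’s generated descriptors extend the same idea to self-describing HDF5 compound types. The struct and field names below are made up for illustration:

```cpp
#include <cstdio>
#include <vector>

// A POD record: trivially copyable, fixed layout, no pointers.
struct Event {
    int    id;
    double energy;
};

// Write and read back a vector<Event> as raw fixed-size records; an HDF5
// compound datatype would add named, self-describing fields on top.
std::vector<Event> roundtrip(const std::vector<Event>& in, const char* path) {
    if (FILE* f = std::fopen(path, "wb")) {
        (void)std::fwrite(in.data(), sizeof(Event), in.size(), f);
        std::fclose(f);
    }
    std::vector<Event> out(in.size());
    if (FILE* f = std::fopen(path, "rb")) {
        out.resize(std::fread(out.data(), sizeof(Event), out.size(), f));
        std::fclose(f);
    }
    return out;
}
```

Because the layout is fixed, reading is symmetric with writing; this is why a POD-based ntuple representation gives read support essentially for free.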

Chris Green: We certainly don’t have any immediate plans. If there was a feature request, then we would look at it.  

 

Q: What is the latency of reading and writing? 

Steven Varga: That’s a good question. Very low latency, but this can be controlled. There are caching mechanisms, and it really depends on the intermediate cache sizes; off the top of my head, it’s definitely within the range for high-frequency trading. I don’t really know numbers for a particle collider, but it can be controlled and refined to the point that it satisfies the engineering lower and upper bounds.

 

Additional questions answered outside the Q&A period:

Q: That is yet a third C++ HDF5 interface?

Elena Pourmal: Yes, different purpose!  

 

Q: Which one is better? 🙂 

Elena Pourmal: Try both! 

 

Q: Follow up on multiple C++ projects: what about the one in the HDF5 source tree, like https://bitbucket.hdfgroup.org/projects/HDFFV/repos/hdf5/browse/c%2B%2B?at=refs%2Fheads%2Fhdf5_1_8_20

Elena Pourmal: The HDF Group will continue to support C++ APIs that come with the HDF5 library.  

 

Q: How does one read the ntuple data in C++? I understand that it is easy to read with python, but if I write with C++, I will often need to also read with C++. 

Marc Paterno: One could read the data using the C API, or perhaps h5cpp or H5CPP. We have been using only Python (h5py) for reading.

 

Q: This is used for experimental data gathering in real time, yes? What kind of data throughput were you getting? 

Marc Paterno: We do not currently have good measurements of the writing speed of the data, because our data-writing programs are very far from I/O-bound. We have been primarily concerned with the MPI-parallel reading speed for the data, which Chris will address in a later slide.

 

Q: Sorry, I didn’t understand; h5cpp and H5CPP are two projects?

Elena Pourmal: Yes, those are two different projects.

 

 
