Albert Cheng
March 6, 2001
When PHDF5 was first implemented, the requirements included that no threads could be used and that no process could be set aside as a server daemon to coordinate file structure (metadata) changes among MPI processes. Threads were ruled out because they were not commonly available at the time, so relying on them would have hindered portability to different platforms. No process could be set aside because some parallel platforms allocate compute nodes in multiples of certain numbers, e.g., powers of two, and many parallel algorithms are designed to run on a power-of-two number of compute nodes. Reserving one compute node would conflict with such algorithms. Therefore, all HDF5 function calls that involve structural changes (e.g., H5Dcreate) have to be called collectively so that all processes can synchronize their structural information about the file.
This works for the current implementation but has proved clumsy for some applications. For example, consider an application of 100 processes that opens one HDF5 file containing 100 datasets, where each process P-i wants to access only dataset DS-i. The current PHDF5 API requires each process to make 100 H5Dopen calls collectively. It also requires all 100 processes to close each dataset collectively via H5Dclose. This makes the program clumsy and costs some performance because the processes have to work in collective mode unnecessarily.
Permit independent dataset creation and opening. Each process needs to call H5Dopen only for the datasets it wants to access. In the above example, each process P-i, after opening the HDF5 file collectively, makes only one independent H5Dopen call to gain access to dataset DS-i. Each process can close its dataset independently. The processes still need to close the file collectively.
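The saving in the 100-process example can be checked with a little arithmetic. The sketch below only counts H5Dopen calls and does not call HDF5 or MPI at all; the function names are hypothetical and exist purely for illustration.

```c
#include <assert.h>

/* Total H5Dopen calls when every open is collective: each of the
 * nprocs processes must take part in opening each of the ndatasets. */
long total_collective_opens(long nprocs, long ndatasets)
{
    assert(nprocs >= 0 && ndatasets >= 0);
    return nprocs * ndatasets;
}

/* Total H5Dopen calls when opens are independent: each process opens
 * only the datasets it actually uses (used_per_process of them). */
long total_independent_opens(long nprocs, long used_per_process)
{
    assert(nprocs >= 0 && used_per_process >= 0);
    return nprocs * used_per_process;
}
```

For 100 processes and 100 datasets, the collective model costs 100 x 100 = 10,000 open calls across the job, while the independent model, with one dataset per process, costs 100.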
Two solutions are proposed to manage the file structural changes. One uses Pthreads and the other uses a set-aside process (SAP).
PHDF5 is implemented with an option to use Pthreads. For each HDF5 file opened, one of the processes that open the file spawns an internal thread. The internal thread does not return control to the user application but stays in its own loop, coordinating all future requests for file structural changes from all the processes that have opened this file. All the processes return to the user application with a file handle ID. From then on, any process can issue independent structural changes (e.g., creating new datasets or extending dataset sizes), and the internal thread coordinates all of them. In the above example, each of the 100 processes needs to make only one independent dataset open call. This makes the programming model more flexible and closer to the original algorithm, and it may gain some performance because the calls are independent. Note that this makes the function calls easier to use but does not change the overall I/O performance.
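The coordinating loop described above might look roughly like the following. This is a minimal single-process model, not PHDF5 code: ordinary threads stand in for MPI processes, a "structural change" is reduced to handing out the next dataset id, and every name (coordinator_t, request_dataset_create, MAX_PROCS, and so on) is hypothetical. In a real implementation the requests would arrive as messages from other MPI processes rather than through shared memory.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_PROCS 128  /* hypothetical bound on simulated processes */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  request_ready;  /* a request was posted or shutdown set */
    pthread_cond_t  reply_ready;    /* one or more requests were served */
    int pending[MAX_PROCS];         /* 1 while process i awaits service */
    int reply[MAX_PROCS];           /* dataset id handed to process i */
    int npending;
    int next_id;                    /* next dataset id to assign */
    int shutdown;
} coordinator_t;

/* The internal thread: it never returns to the application until
 * shutdown, and it applies all structural changes one at a time. */
void *coordinator_loop(void *arg)
{
    coordinator_t *c = arg;
    pthread_mutex_lock(&c->lock);
    for (;;) {
        while (c->npending == 0 && !c->shutdown)
            pthread_cond_wait(&c->request_ready, &c->lock);
        if (c->npending == 0)       /* shutdown requested, nothing left */
            break;
        for (int i = 0; i < MAX_PROCS; i++) {
            if (c->pending[i]) {
                c->reply[i] = c->next_id++;   /* the "structural change" */
                c->pending[i] = 0;
                c->npending--;
            }
        }
        pthread_cond_broadcast(&c->reply_ready);
    }
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/* Initialize the shared state and spawn the coordinator thread.
 * Returns 0 on success, as pthread_create does. */
int coordinator_start(coordinator_t *c, pthread_t *tid)
{
    memset(c, 0, sizeof *c);
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->request_ready, NULL);
    pthread_cond_init(&c->reply_ready, NULL);
    return pthread_create(tid, NULL, coordinator_loop, c);
}

/* Independent call made by simulated process `proc`: post a request
 * and block until the coordinator has assigned a dataset id. */
int request_dataset_create(coordinator_t *c, int proc)
{
    assert(proc >= 0 && proc < MAX_PROCS);
    pthread_mutex_lock(&c->lock);
    c->pending[proc] = 1;
    c->npending++;
    pthread_cond_signal(&c->request_ready);
    while (c->pending[proc])
        pthread_cond_wait(&c->reply_ready, &c->lock);
    int id = c->reply[proc];
    pthread_mutex_unlock(&c->lock);
    return id;
}

/* Ask the coordinator to exit once its queue is empty, then join it. */
void coordinator_stop(coordinator_t *c, pthread_t tid)
{
    pthread_mutex_lock(&c->lock);
    c->shutdown = 1;
    pthread_cond_signal(&c->request_ready);
    pthread_mutex_unlock(&c->lock);
    pthread_join(tid, NULL);
    pthread_cond_destroy(&c->reply_ready);
    pthread_cond_destroy(&c->request_ready);
    pthread_mutex_destroy(&c->lock);
}
```

Because only the coordinator thread assigns ids, requesters never need to agree among themselves, which is the property that lets each process call in independently.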
Advantages of the Pthreads Option
Disadvantages of the Pthreads Option
PHDF5 is implemented with an option to reserve an MPI process for coordinating file structural changes for ALL parallel HDF5 files of the application. A new function, say H5Parallel_Init(), is defined. An application must call this function before making any other parallel HDF5 function calls. H5Parallel_Init() reserves one of the processes (e.g., process 0 of MPI_COMM_WORLD) as the Set-Aside-Process (SAP) to coordinate structural changes for all parallel HDF5 files. (Note that this differs from the Pthreads option, in which one internal thread coordinates one file, whereas the SAP coordinates all files.) The SAP does not return to the user application space until H5Parallel_Close() has been called by all other processes of MPI_COMM_WORLD.
All processes other than the SAP create a new sub-communicator (call it H5MPI_COMM_WORLD), which is returned to the user application. H5MPI_COMM_WORLD therefore contains one fewer process than MPI_COMM_WORLD. From then on, the user application uses H5MPI_COMM_WORLD as if it were MPI_COMM_WORLD until it calls H5Parallel_Close(). After H5Parallel_Close() returns (at which point the SAP also returns from its call to H5Parallel_Init()), the application may use MPI_COMM_WORLD again.
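To make the rank bookkeeping concrete, here is a small model of how ranks in MPI_COMM_WORLD could map to ranks in H5MPI_COMM_WORLD. It is plain C with no MPI calls, and the function names and the NOT_IN_SUBCOMM constant are hypothetical. It assumes the remaining processes keep their relative order. In an actual MPI program, one way to build such a communicator would be MPI_Comm_split, with the SAP passing MPI_UNDEFINED as its color and the others using their world rank as the key.

```c
#include <assert.h>

#define NOT_IN_SUBCOMM (-1)  /* hypothetical marker for the SAP itself */

/* Size of the sub-communicator: everyone except the SAP. */
int h5_sub_size(int world_size)
{
    assert(world_size >= 2);  /* need the SAP plus one or more workers */
    return world_size - 1;
}

/* Rank of a process inside the sub-communicator, assuming the
 * remaining processes keep their relative order. The SAP has no rank. */
int h5_sub_rank(int world_rank, int world_size, int sap_rank)
{
    assert(world_size >= 2);
    assert(sap_rank >= 0 && sap_rank < world_size);
    assert(world_rank >= 0 && world_rank < world_size);

    if (world_rank == sap_rank)
        return NOT_IN_SUBCOMM;
    /* Ranks below the SAP are unchanged; ranks above shift down by one. */
    return (world_rank < sap_rank) ? world_rank : world_rank - 1;
}
```

With 8 processes and process 0 as the SAP, world ranks 1 through 7 become sub-communicator ranks 0 through 6, so an algorithm written for a power-of-two process count would see only 7 processes.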
Advantages of the SAP option
Disadvantages of the SAP option
Since the Pthreads option has more software requirements and is likely to work on fewer platforms than the SAP option, I am in favor of trying the SAP option implementation first.