Mohamad Chaarawi, The HDF Group
First in a series: parallel HDF5
What costs applications a lot of time and resources that could otherwise go to actual computation? Slow I/O. It is well known that I/O subsystems are very slow compared to other parts of a computing system. Applications use I/O to store simulation output for future use by analysis applications, to checkpoint application memory to guard against system failure, to exercise out-of-core techniques for data that does not fit in a processor’s memory, and so on. I/O middleware libraries, such as HDF5, provide application users with a rich I/O interface, letting them organize their data and store it efficiently. Such libraries invest a lot of effort in reducing or completely hiding the cost of I/O from applications.
Parallel I/O is one technique for accessing data on disk simultaneously from different application processes to maximize bandwidth and speed things up. There are several ways to do parallel I/O, and I will highlight the most popular methods in use today.
First, to leverage parallel I/O, it is very important that you have a parallel file system; otherwise the file system will most likely process the I/O requests it receives serially, yielding no real benefit from doing parallel I/O. In fact, optimizations made by middleware layers under the assumption that they are accessing a parallel file system are likely just overhead, since the supposedly parallel accesses are actually serviced serially. You might as well send all your data to one process and have that process write the data serially to disk.
Parallel file systems stripe data over multiple storage servers for high performance in parallel access. There are many types of parallel file systems (Lustre, GPFS, PVFS2, …), each with their pros and cons depending on system architectures and applications’ data access patterns. Parallel file systems and parallel I/O are utilized mostly in high performance computing (HPC) systems and the science applications running on them.
An application can do parallel I/O in a variety of ways:
A very fast way is to do I/O to one file per process. Each process creates and accesses a file that no other process touches, so file/block locking at the file system level is avoided. This technique, however, does not do well when you scale to very large process counts, where the number of files created would burden the file system’s metadata server(s). Managing the data and files created is also cumbersome, because pre/post-processing steps are needed to restart the application, or to run an analysis application, with a different number of processes than the one that originally created all the files.
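As a concrete illustration, here is a minimal file-per-process sketch in C, assuming MPI and HDF5 are available; the file name pattern and dataset layout are purely illustrative. Each rank creates and writes its own HDF5 file, so no file is ever shared:

```c
/* File-per-process sketch: each MPI rank creates its own HDF5 file,
 * so no two ranks ever touch the same file and no locking is needed. */
#include <mpi.h>
#include <hdf5.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank writes to its own file, e.g. "output.3.h5" for rank 3. */
    char fname[64];
    snprintf(fname, sizeof(fname), "output.%d.h5", rank);
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Write this rank's local data as a small 1-D dataset. */
    hsize_t dims[1] = {4};
    int     data[4] = {rank, rank, rank, rank};
    hid_t   space   = H5Screate_simple(1, dims, NULL);
    hid_t   dset    = H5Dcreate2(file, "values", H5T_NATIVE_INT, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```

Note that restarting with a different process count now requires redistributing the contents of all these files, which is exactly the post-processing burden described above.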
Another way is for all processes to access a single shared file. This is the most convenient and, frequently, the preferred approach for applications. Access to the shared file usually happens in one of two modes: independent or collective.
In independent mode, each process accesses the data directly from the file system without communicating or coordinating with the other processes. This usually works best if the application is reading or writing large contiguous blocks of data in the file with one I/O request. Parallel file systems do very well with an access pattern like that, but experience shows that this is not always the case with applications. Most applications end up issuing small, fragmented I/O accesses to the file system, resulting in performance so poor that application scientists regret ever attempting parallel I/O in the first place.
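To make the independent mode concrete, here is a minimal MPI-I/O sketch (file name and block size are illustrative) in which every rank writes one large contiguous block at its own offset, with no coordination between ranks — the access pattern parallel file systems handle best:

```c
/* Independent I/O to one shared file: each rank writes a large
 * contiguous block at its own offset, without coordinating with others. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK 1048576  /* 1 Mi ints per rank: one large contiguous write */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *buf = malloc(BLOCK * sizeof(int));
    for (int i = 0; i < BLOCK; i++)
        buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Independent write: each rank targets its own contiguous region. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(int);
    MPI_File_write_at(fh, offset, buf, BLOCK, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```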
But fear not, collective I/O is here to help. Collective I/O leverages the typically very fast network interconnect in HPC systems to move data around between processes, aggregating the small fragmented accesses to the file system into larger ones that yield much better access times. Any application could implement such a technique itself on top of a POSIX I/O interface, but fortunately, I/O middleware libraries such as MPI-I/O and HDF5 already provide a collective I/O interface for accessing data. They also do all the data movement internally, taking this burden off the application. Furthermore, the application can specify that a subset of processes, rather than all of them, be responsible for doing I/O to the shared file.
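As a sketch of what collective I/O looks like at the MPI-I/O level (again with illustrative names and sizes), the only essential change from the independent example above is calling MPI_File_write_at_all, the collective counterpart of MPI_File_write_at. Here each rank deliberately contributes a small piece that benefits from aggregation:

```c
/* Collective I/O to one shared file: all ranks call the write together,
 * so the MPI-I/O layer can aggregate their small pieces into fewer,
 * larger file system requests. */
#include <mpi.h>

#define SMALL 128  /* a small per-rank piece that benefits from aggregation */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf[SMALL];
    for (int i = 0; i < SMALL; i++)
        buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: the library may designate aggregator processes
     * and combine everyone's data before touching the file system. */
    MPI_Offset offset = (MPI_Offset)rank * SMALL * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf, SMALL, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```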
So far, we have seen either an N-1 or an N-N (number of processes to number of files) technique for doing parallel I/O. An alternative is an N-M approach, where N > M. This can be accomplished in a variety of ways, such as grouping a set of processes together to create and access one file while another group accesses a different file. The Parallel Log Structured File System (PLFS) provides an abstract way of doing this, through an MPI-I/O driver that translates what the application sees as access to a single file into M or N files underneath.
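For a hand-rolled flavor of the N-M approach (PLFS does this transparently; the group size below is purely illustrative), one can split MPI_COMM_WORLD into sub-communicators and have each group share one file:

```c
/* N-M grouping sketch: split the N processes into groups of GROUP_SIZE,
 * and let each group create and access its own shared file. */
#include <mpi.h>
#include <stdio.h>

#define GROUP_SIZE 8  /* illustrative: 8 ranks share each file */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks with the same color land in the same sub-communicator. */
    int color = rank / GROUP_SIZE;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group_comm);

    /* Each group opens its own shared file, e.g. "output.g2.dat". */
    char fname[64];
    snprintf(fname, sizeof(fname), "output.g%d.dat", color);

    MPI_File fh;
    MPI_File_open(group_comm, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* ... independent or collective I/O within the group goes here ... */

    MPI_File_close(&fh);
    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}
```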
I/O forwarding is another technique, in which I/O requests are sent to a remote server for processing. Client-side collective I/O and aggregation would not be very beneficial here; server-side optimizations yield better performance in such situations. Recent HPC system architectures and early designs for some future Exascale systems employ I/O forwarding, since compute nodes might not even have local disks to store data. They simply use a fast network interconnect to forward their I/O requests to I/O servers that handle them. Moving the data through the network can be done asynchronously, in the background of the applications running on the compute nodes.
Tiered storage architectures employing a “Burst Buffer” area of fast storage such as Solid State Drives (SSDs) are emerging and are likely to be present in most HPC systems of the future. SSDs, for example, handle small fragmented data accesses much better than magnetic hard drives, which means that current optimizations such as collective I/O might not be needed on such architectures. In fact, it is becoming more apparent that applications will have less control over the location of their data on the file system, and will instead trust middleware libraries and parallel file systems to do the right thing with their data.
The HDF Group is involved in several projects targeting current and future HPC systems. A lot of research goes into adding new features and optimizations to HDF5 for users to utilize. The DOE Exascale FastForward Storage Project is one effort, aimed at building an entire I/O stack for future Exascale systems. The ExaHDF5 project is another, aimed at current and near-future systems and dealing with HDF5, autotuning, and application optimizations for data access.
Stay tuned to the HDF Blog to learn how HDF5 supports parallel I/O.
Additional Resources:
Parallel I/O, Analysis, and Visualization of a Trillion Particle Simulation
A Framework for Auto-Tuning HDF5 Applications
The Lustre file system
The Parallel Virtual File System