From HDF5 Datasets to Apache Spark RDDs

… HDF5 and Spark: Balancing the workload among tasks is a concern in any parallel environment. However, that does not mean that all datasets have to be the same size. HDF5 can help with partial I/O: instead of reading entire datasets, one can read just hyperslabs or other selections. Sampling is…
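
As a rough illustration of the partial I/O idea, the sketch below splits a dataset's rows into contiguous ranges of near-equal size, so that each task reads one hyperslab rather than a whole dataset. The names hyperslab_ranges and read_slab are hypothetical helpers, not part of HDF5 or Spark, and the example assumes a dataset that is sliced along its first dimension and that h5py is installed on the workers.

```python
def hyperslab_ranges(n_rows, n_tasks):
    """Split n_rows into n_tasks contiguous (start, stop) row ranges.

    Range sizes differ by at most one row, so tasks get similar workloads
    even when the datasets themselves differ in size.
    """
    if n_rows < 0 or n_tasks <= 0:
        raise ValueError("n_rows must be >= 0 and n_tasks must be > 0")
    base, extra = divmod(n_rows, n_tasks)
    ranges = []
    start = 0
    for i in range(n_tasks):
        # The first `extra` tasks each take one additional row.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def read_slab(path, dataset, start, stop):
    """Read only rows [start, stop) of one dataset (hypothetical helper).

    Assumption: h5py is available where the task runs. Slicing an h5py
    dataset reads just the selected hyperslab, not the entire dataset.
    """
    import h5py  # imported lazily so the planning code runs without it

    with h5py.File(path, "r") as f:
        return f[dataset][start:stop]


# Plan the reads for a 10-row dataset shared among 3 tasks.
ranges = hyperslab_ranges(10, 3)
print(ranges)
```

In a Spark job, a list of (path, dataset, start, stop) tuples built this way could be handed to sc.parallelize and mapped through read_slab, so that each task opens the file itself and fetches only its own slab. This is one possible arrangement under the assumptions above, not necessarily the approach taken in the original article.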