Hadoop (HDFS) HDF5 Connector

Overview

The purpose of this document is to describe the process of building a version of the HDF5 library which contains a virtual file driver (VFD) that provides read-only access to HDF5 files stored in a Hadoop File System (HDFS). We list the necessary prerequisites and show how to verify build correctness.

This document is intended for users who are familiar with building libraries from source code using the GNU Autotools. The instructions are applicable to a wide variety of GNU/Linux systems.

Prerequisites

In addition to the standard prerequisites for building the HDF5 library, the build process requires the Java Development Kit (JDK), the libhdfs header and library (part of Hadoop), and the HDFS VFD source code (distributed as a patch for HDF5 1.10.4 but included in later versions).

To be specific, we show the build process step-by-step on a “vanilla” Debian GNU/Linux 9 system.

uname -a

Tools

The build process depends on a C compiler, the GNU Autotools, the JDK, and libhdfs. On a Debian 9 system, these dependencies can be installed as follows:

sudo apt-get -y update --fix-missing
sudo apt-get -y install build-essential automake default-jdk
wget http://apache.osuosl.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -zxvf hadoop-3.1.1.tar.gz

In this document, we use Java 1.8 and Hadoop 3.1.1 to build the HDFS VFD. Other combinations / versions may work equally well, but no systematic testing has been done.

gcc -v 2>&1 >&2

java -version 2>&1 >&2

Environment

The build process depends on the following environment variables:

export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_HOME=$HOME/hadoop-3.1.1
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`

Files

Currently, the HDF5 VFD for HDFS is built into the HDF5 library. (A model where VFDs can be loaded dynamically like filters is under consideration and might be available in the future.) You need a copy of the HDF5 library source code and a copy of the HDFS VFD source code, which is distributed as a patch file for HDF5 1.10.4 but included in later versions..

In this document, we assume that the HDF5 library source code is available in $HOME/hdf5-1.10.4 and that the patch (if applicable) vfds_hdf5_1_10_4.patch is located in $HOME/hdf5-1.10.4-vfds.

find . -maxdepth 1 -type d

Patching the HDF5 Library Source Code

This section is only necessary for HDF5 1.10.4. Later versions have this included. The patch command shown below should be run in the directory containing the HDF5 library source code. It should run to completion without errors.

cd $HOME/hdf5-1.10.4
patch -p 1 < $HOME/hdf5-1.10.4-vfds/vfds_hdf5_1_10_4.patch

Building the HDF5 Library

The compilation of the HDFS VFD is enabled through the --with-libhdfs option of configure. Its expected value is the Hadoop installation directory where configure is supposed to find the libhdfs header and library.

./configure --prefix $HOME/.local --with-libhdfs=$HADOOP_HOME | grep HDFS

make -j3 &> build.log
tail build.log

Testing the HDFS Support

An HDFS instance is required to verify the proper functioning of the VFD. Fortunately, HDFS can be deployed in standalone or pseudo-distributed mode in non-clustered environments for testing and development. The HDFS VFD bundle includes a set of tests and the scripts to set up such transient HDFS deployments.

The tests depend on the Hadoop/HDFS configuration stored in core-site.xml and hdfs-site.xml. After copying these files to $HADOOP_HOME/etc/hadoop, a transient HDFS instance can be created by invoking the setup.sh script
in the hadoop-testing-bundle folder of the test bundle. Note that this and the matching teardown.sh script are intended to be run from that sub-directory.

cd $HOME/hdf5-1.10.4-vfds/hadoop-testing-bundle
cp *.xml $HADOOP_HOME/etc/hadoop
bash ./setup.sh

To run all HDF5 tests, we could now run make check in $HOME/hdf5-1.10.4. This would include the HDFS-related tests. However, if only the HDFS VFD tests are needed, it is sufficient to invoke the hdfs executable in the
test sub-directory.

cd $HOME/hdf5-1.10.4/test
./hdfs

Finally, the resources associated with the transient HDFS instances should be released by invoking the teardown.sh in the hadoop-testing-bundle directory.

cd $HOME/hdf5-1.10.4-vfds/hadoop-testing-bundle
bash ./teardown.sh

Summary

In this document, we have shown how to configure, build, and test a version of the HDF5 library that includes a VFD for HDFS in GNU/Linux environments. Please see the document USERS GUIDE for information on how to use the VFD with the HDF5 tools and in your applications.

For additional help and with further questions, please contact support@hdfgroup.org.

Overview

The purpose of this document is to describe the use of the HDF5 VFD for HDFS with existing HDF5 tools and in application development. This document does not cover the installation of the HDFS VFD. Please refer to the INSTALLATION GUIDE for information on how to build and install the HDF5 VFD for HDFS.

Prerequisites

A version of the HDF5 library that was built with the HDF5 VFD for HDFS enabled.
An HDFS instance such as an HDFS cluster or a transient deployment such as described in the INSTALLATION GUIDE.
The following variables must be set for your environment: JAVA_HOME, HADOOP_HOME, CLASSPATH, LD_LIBRARY_PATH.

Example

In this document, we use the following setup.

The HDF5 library and tools are installed in $HOME/.local

ls -lR --hide=share $HOME/.local

/home/admin/.local:
total 12
drwxr-xr-x 2 admin admin 4096 Sep 27 12:07 bin
drwxr-xr-x 2 admin admin 4096 Sep 27 12:07 include
drwxr-xr-x 2 admin admin 4096 Sep 27 12:07 lib

/home/admin/.local/bin:
total 9760
-rwxr-xr-x 1 admin admin 523528 Sep 27 12:07 gif2h5
-rwxr-xr-x 1 admin admin 498256 Sep 27 12:07 h52gif
-rwxr-xr-x 1 admin admin  13281 Sep 27 12:07 h5cc
-rwxr-xr-x 1 admin admin 489184 Sep 27 12:07 h5clear
-rwxr-xr-x 1 admin admin 495952 Sep 27 12:07 h5copy
-rwxr-xr-x 1 admin admin 103704 Sep 27 12:07 h5debug
-rwxr-xr-x 1 admin admin 909592 Sep 27 12:07 h5diff
-rwxr-xr-x 1 admin admin 767808 Sep 27 12:07 h5dump
-rwxr-xr-x 1 admin admin 489848 Sep 27 12:07 h5format_convert
-rwxr-xr-x 1 admin admin 653672 Sep 27 12:07 h5import
-rwxr-xr-x 1 admin admin 493016 Sep 27 12:07 h5jam
-rwxr-xr-x 1 admin admin 582088 Sep 27 12:07 h5ls
-rwxr-xr-x 1 admin admin 481088 Sep 27 12:07 h5mkgrp
-rwxr-xr-x 1 admin admin 585632 Sep 27 12:07 h5perf_serial
-rwxr-xr-x 1 admin admin   5913 Sep 27 12:07 h5redeploy
-rwxr-xr-x 1 admin admin 785888 Sep 27 12:07 h5repack
-rwxr-xr-x 1 admin admin  45832 Sep 27 12:07 h5repart
-rwxr-xr-x 1 admin admin 540704 Sep 27 12:07 h5stat
-rwxr-xr-x 1 admin admin 482552 Sep 27 12:07 h5tools_utils
-rwxr-xr-x 1 admin admin 489248 Sep 27 12:07 h5unjam
-rwxr-xr-x 1 admin admin 509408 Sep 27 12:07 h5watch

/home/admin/.local/include:
total 544
-rw-r--r-- 1 admin admin  25488 Sep 27 12:07 H5ACpublic.h
-rw-r--r-- 1 admin admin   9720 Sep 27 12:07 H5api_adpt.h
-rw-r--r-- 1 admin admin   5471 Sep 27 12:07 H5Apublic.h
-rw-r--r-- 1 admin admin   1760 Sep 27 12:07 H5Cpublic.h
-rw-r--r-- 1 admin admin   1931 Sep 27 12:07 H5DOpublic.h
-rw-r--r-- 1 admin admin   8656 Sep 27 12:07 H5Dpublic.h
-rw-r--r-- 1 admin admin   2590 Sep 27 12:07 H5DSpublic.h
-rw-r--r-- 1 admin admin  21845 Sep 27 12:07 H5Epubgen.h
-rw-r--r-- 1 admin admin   9023 Sep 27 12:07 H5Epublic.h
-rw-r--r-- 1 admin admin   1490 Sep 27 12:07 H5FDcore.h
-rw-r--r-- 1 admin admin   1955 Sep 27 12:07 H5FDdirect.h
-rw-r--r-- 1 admin admin   1506 Sep 27 12:07 H5FDfamily.h
-rw-r--r-- 1 admin admin   3691 Sep 27 12:07 H5FDhdfs.h
-rw-r--r-- 1 admin admin   3322 Sep 27 12:07 H5FDlog.h
-rw-r--r-- 1 admin admin   2573 Sep 27 12:07 H5FDmpi.h
-rw-r--r-- 1 admin admin   2440 Sep 27 12:07 H5FDmpio.h
-rw-r--r-- 1 admin admin   1800 Sep 27 12:07 H5FDmulti.h
-rw-r--r-- 1 admin admin  17115 Sep 27 12:07 H5FDpublic.h
-rw-r--r-- 1 admin admin   3420 Sep 27 12:07 H5FDros3.h
-rw-r--r-- 1 admin admin   1339 Sep 27 12:07 H5FDsec2.h
-rw-r--r-- 1 admin admin   1368 Sep 27 12:07 H5FDstdio.h
-rw-r--r-- 1 admin admin  14540 Sep 27 12:07 H5Fpublic.h
-rw-r--r-- 1 admin admin   7233 Sep 27 12:07 H5Gpublic.h
-rw-r--r-- 1 admin admin   3274 Sep 27 12:07 H5IMpublic.h
-rw-r--r-- 1 admin admin   4934 Sep 27 12:07 H5Ipublic.h
-rw-r--r-- 1 admin admin   1362 Sep 27 12:07 H5LDpublic.h
-rw-r--r-- 1 admin admin  10160 Sep 27 12:07 H5Lpublic.h
-rw-r--r-- 1 admin admin  14182 Sep 27 12:07 H5LTpublic.h
-rw-r--r-- 1 admin admin   1775 Sep 27 12:07 H5MMpublic.h
-rw-r--r-- 1 admin admin  12103 Sep 27 12:07 H5Opublic.h
-rw-r--r-- 1 admin admin 116154 Sep 27 12:07 H5overflow.h
-rw-r--r-- 1 admin admin   1528 Sep 27 12:07 H5PLextern.h
-rw-r--r-- 1 admin admin   2402 Sep 27 12:07 H5PLpublic.h
-rw-r--r-- 1 admin admin  28815 Sep 27 12:07 H5Ppublic.h
-rw-r--r-- 1 admin admin   3861 Sep 27 12:07 H5PTpublic.h
-rw-r--r-- 1 admin admin  20413 Sep 27 12:07 H5pubconf.h
-rw-r--r-- 1 admin admin  12256 Sep 27 12:07 H5public.h
-rw-r--r-- 1 admin admin   3794 Sep 27 12:07 H5Rpublic.h
-rw-r--r-- 1 admin admin   7526 Sep 27 12:07 H5Spublic.h
-rw-r--r-- 1 admin admin   8419 Sep 27 12:07 H5TBpublic.h
-rw-r--r-- 1 admin admin  27251 Sep 27 12:07 H5Tpublic.h
-rw-r--r-- 1 admin admin  22479 Sep 27 12:07 H5version.h
-rw-r--r-- 1 admin admin  11250 Sep 27 12:07 H5Zpublic.h
-rw-r--r-- 1 admin admin   3467 Sep 27 12:07 hdf5.h
-rw-r--r-- 1 admin admin   1543 Sep 27 12:07 hdf5_hl.h

/home/admin/.local/lib:
total 64204
-rw-r--r-- 1 admin admin 43972818 Sep 27 12:07 libhdf5.a
-rw-r--r-- 1 admin admin   973650 Sep 27 12:07 libhdf5_hl.a
-rwxr-xr-x 1 admin admin     1122 Sep 27 12:07 libhdf5_hl.la
libhdf5_hl.so.100.1.1
libhdf5_hl.so.100.1.1
-rwxr-xr-x 1 admin admin   581688 Sep 27 12:07 libhdf5_hl.so.100.1.1
-rwxr-xr-x 1 admin admin     1067 Sep 27 12:07 libhdf5.la
-rw-r--r-- 1 admin admin     3982 Sep 27 12:07 libhdf5.settings
libhdf5.so.103.0.0
libhdf5.so.103.0.0
-rwxr-xr-x 1 admin admin 20195896 Sep 27 12:07 libhdf5.so.103.0.0

An HDFS namenode instance runs on localhost at port 8020.

The environment is configured as follows:

export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_HOME=$HOME/hadoop-3.1.1
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`

We’ve copied a sample HDF5 file t.h5 into the /tmp directory on the HDFS instance.

$HADOOP_HOME/bin/hdfs dfs -ls /tmp/*.h5

-rw-r--r--   1 admin supergroup      18528 2018-09-27 12:06 /tmp/t.h5

HDF5 Tool Support

Several HDF5 tools support the use of alternative VFDs. Typically, the tools that do have a command-line option to select the desired VFD. Unfortunately, the naming of these options varies between tools. In this document, we use
h5ls and h5dump as examples.

`h5ls`

This tool uses the --vfd=DRIVER option to select an alternative VFD.

$HOME/.local/bin/h5ls --vfd=hdfs -r hdfs://localhost/tmp/t.h5

/                        Group
/Dataset1                Dataset {128, 32}

Additional HDFS options, such as non-default host name and port number can be specified via the --hdfs-attrs option.

`h5dump`

This tool uses the --filedriver=DRIVER option to select an alternative VFD.

$HOME/.local/bin/h5dump --filedriver=hdfs -pBH hdfs://localhost/tmp/t.h5

HDF5 "hdfs://localhost/tmp/t.h5" {
SUPER_BLOCK {
   SUPERBLOCK_VERSION 0
   FREELIST_VERSION 0
   SYMBOLTABLE_VERSION 0
   OBJECTHEADER_VERSION 0
   OFFSET_SIZE 8
   LENGTH_SIZE 8
   BTREE_RANK 16
   BTREE_LEAF 4
   ISTORE_K 32
   FILE_SPACE_STRATEGY H5F_FSPACE_STRATEGY_FSM_AGGR
   FREE_SPACE_PERSIST FALSE
   FREE_SPACE_SECTION_THRESHOLD 1
   FILE_SPACE_PAGE_SIZE 4096
   USER_BLOCK {
      USERBLOCK_SIZE 0
   }
}
GROUP "/" {
   DATASET "Dataset1" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 128, 32 ) / ( 128, 32 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 16384
         OFFSET 2144
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_LATE
      }
   }
}
}

$HOME/.local/bin/h5dump --filedriver=hdfs -d /Dataset1 -s "64,8" -k "10,10" hdfs://localhost/tmp/t.h5

HDF5 "hdfs://localhost/tmp/t.h5" {
DATASET "/Dataset1" {
   DATATYPE  H5T_STD_I32LE
   DATASPACE  SIMPLE { ( 128, 32 ) / ( 128, 32 ) }
   SUBSET {
      START ( 64, 8 );
      STRIDE ( 1, 1 );
      COUNT ( 1, 1 );
      BLOCK ( 10, 10 );
      DATA {
      (64,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (65,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (66,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (67,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (68,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (69,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (70,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (71,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (72,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (73,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2
      }
   }
}
}

Additional HDFS options, such as non-default host name and port number can be specified via the --hdfs-attrs option.

Application Development

The run-time behavior of the HDF5 library is highly configurable, and the VFD used to access an HDF5 file can be specified as a property in a file access property list. This way the changes to existing applications
to accommodate other VFDs are minimal.

HDFS VFD C-API

The HDFS VFD C-API consists of a structure which holds parameters that control the interaction with an HDFS instance (e.g., port, credentials, etc.) and a function that adds a matching file access property to an
existing file access property list.

typedef struct H5FD_hdfs_fapl_t {
  int32_t version;
  char    namenode_name[H5FD__HDFS_NODE_NAME_SPACE + 1];
  int32_t namenode_port;
  char    user_name[H5FD__HDFS_USER_NAME_SPACE + 1];
  char    kerberos_ticket_cache[H5FD__HDFS_KERB_CACHE_PATH_SPACE + 1];
  int32_t stream_buffer_size;
} H5FD_hdfs_fapl_t;

herr_t H5Pset_fapl_hdfs(hid_t fapl_id, H5FD_hdfs_fapl_t *fa);

The use of this API is illustrated below.

A Sample Program

The following program opens an HDF5 file and uses the HDF5 library to:

Determine its size (H5Fget_filesize)
Determine the address and number of attributes of the root group (H5Oget_info)
Determine the number of links in the root group (H5Gget_info)

Notice that the only “deviation” from doing the same for an HDF5 file stored in a POSIX file system is the fapl initialization (lines 22 to 28).

 1: #include "hdf5.h"
 2: 
 3: #include <assert.h>
 4: #include <string.h>
 5: 
 6: #define NAMENODE_NAME "localhost"
 7: #define NAMENODE_PORT 8020
 8: #define STREAM_BUFFER_SIZE 4096
 9: #define FILE_NAME "hdfs://localhost/tmp/t.h5"
10: 
11: void main(int argc, char** argv)
12: {
13:   H5FD_hdfs_fapl_t param;
14:   hid_t            fapl, file;
15:   hsize_t          size;
16:   H5G_info_t       ginfo;
17:   ssize_t          obj_count;
18:   H5O_info_t       oinfo;
19: 
20:   /* Create and initialize a file access property list. */
21: 
22:   assert((fapl = H5Pcreate(H5P_FILE_ACCESS)) >= 0);
23: 
24:   param.version = 1;
25:   strcpy(param.namenode_name, NAMENODE_NAME);
26:   param.namenode_port = NAMENODE_PORT;
27:   param.stream_buffer_size = STREAM_BUFFER_SIZE;
28:   assert(H5Pset_fapl_hdfs(fapl, &param) >= 0);
29: 
30:   /* Open the file in read-only mode with the access property list. */
31: 
32:   assert((file = H5Fopen(FILE_NAME, H5F_ACC_RDONLY, fapl)) >= 0);
33: 
34:   /* Make HDF5 API calls as usual. */
35: 
36:   assert(H5Fget_filesize(file, &size) >= 0);
37:   printf("File size\t\t: %llu [bytes]\n", size);
38: 
39:   assert(H5Oget_info(file, &oinfo) >= 0);
40:   printf("Root group address\t: %llu\n", oinfo.addr);
41:   printf("Root group attr. count\t: %llu\n", oinfo.num_attrs);
42: 
43:   assert(H5Gget_info(file, &ginfo) >= 0);
44:   printf("Root group link count\t: %llu\n", ginfo.nlinks);
45: 
46:   /* Release resources. */
47: 
48:   assert(H5Fclose(file) >= 0);
49:   assert(H5Pclose(fapl) >= 0);
50:   return;
51: }

The code can be build with h5cc, which links it against libhdfs.

$HOME/.local/bin/h5cc -show hdfs-vfd-sample.c -o doit

gcc -I/home/admin/.local/include -I/home/admin/hadoop-3.1.1/include -c hdfs-vfd-sample.c
gcc -I/home/admin/hadoop-3.1.1/include hdfs-vfd-sample.o -o doit -L/home/admin/.local/lib /home/admin/.local/lib/libhdf5_hl.a /home/admin/.local/lib/libhdf5.a -L/home/admin/hadoop-3.1.1/lib/native -L/usr/lib/jvm/default-java/jre/lib/ -L/usr/lib/jvm/default-java/jre/lib//server -lhdfs -ldl -lm -Wl,-rpath -Wl,/home/admin/.local/lib

$HOME/.local/bin/h5cc hdfs-vfd-sample.c -o doit

Running the executable yields the expected result:

./doit

File size		: 18528 [bytes]
Root group address	: 96
Root group attr. count	: 0
Root group link count	: 1

Summary

In this document, we have shown how to use the HDF5 VFD for HDFS with existing HDF5 tools and in application development. Because of the ease with which the run-time behavior of the HDF5 library can be configured, changes to existing applications are minimal.

For additional help and with further questions, please contact support@hdfgroup.org.

Hadoop (HDFS) HDF5 Connector

Demo Video

Video Script

Introduction

Prerequisites

Environment

Copying an HDF5 file to HDFS

Listing the HDF5 file structure via h5ls

Printing detailed information with h5dump

Extracting data from a dataset with h5dump

Use Hadoop Streaming to collect data from multiple HDF5 files

Summary

Additional Documentation

Installation Guide

Overview

Prerequisites

Tools

Environment

Files

Patching the HDF5 Library Source Code

Building the HDF5 Library

Testing the HDFS Support

Summary

User's Guide

Overview

Prerequisites

Example

HDF5 Tool Support

h5ls

h5dump

Application Development

HDFS VFD C-API

A Sample Program

Summary

Latest Posts

​Release of HDF5 1.14.4 (Newsletter #202)

Updates on portal.hdfgroup.org

Release of HDF 4.2.16-2, a patch release (Newsletter #195)

Latest Tweets

Connect

Get Started

`h5ls`

`h5dump`

Release of HDF5 1.14.4 (Newsletter #202)