Cloud Storage (Amazon S3) HDF5 Connector

The HDF5 read-only virtual file driver (VFD) for the Amazon Web Services (AWS) Simple Storage Service (S3). In this video, we give an overview of the HDF5 Virtual File Layer, show how to use the HDF5 command-line tools h5ls and h5dump with the S3 VFD, and show how you can use the S3 VFD in your own C programs.

Demo Video

Video Script

For your convenience, the video is divided into the following sections. Expand the sections below for text and code samples.

Introduction

This demonstration will show you how to:

  1. Copy an HDF5 file into an S3 bucket
  2. List the file structure by running h5ls
  3. Print detailed information by running h5dump
  4. Extract data from a dataset with h5dump

Before you proceed, please review the following prerequisites.

Prerequisites
  • A version of the HDF5 library (HDF5 1.10.6 or later) with the S3 Virtual File Driver (VFD) enabled
  • A version of the AWS Command Line Interface (CLI)
  • An S3 bucket containing one or more HDF5 or NetCDF-4 files
  • Write permission on an S3 bucket, if you want to copy an HDF5 file into it

With these prerequisites in place, make sure that your environment is configured, i.e., that your AWS Access Key ID and Secret Access Key are set in your profile (~/.aws/config).
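
One way to set this up is with the AWS CLI's interactive aws configure command; a sketch is shown below (the values are placeholders taken from AWS's documented example key pair, not working credentials):

aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json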

Environment

The HDF5 VFD for S3, whether used in your own application or with HDF5 command line tools, depends on certain environment variables and shared libraries.

The S3 VFD depends on the following shared libraries: libhdf5.so, libssl.so, libcurl.so. Please locate these dependencies and configure the LD_LIBRARY_PATH accordingly. An example is shown below:

export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH

For convenience, we add the directories containing the HDF5 command line tools to our PATH environment variable.

export PATH=$HOME/.local/bin:$PATH

A simple test is to list accessible S3 buckets:

aws s3 ls
2018-05-31 06:58:07 my-pile-of-files
2018-05-31 06:41:18 pile-of-files
2018-03-02 07:38:18 raw-bar
2018-03-05 08:07:56 raw-bar-one

If you receive an error message, your setup/environment may need adjustments.

Copying an HDF5 file to an S3 Bucket

If you have an existing HDF5 or NetCDF-4 file, you can skip this step. If you don’t have an HDF5 file, it’s easy to create one via h5mkgrp:

h5mkgrp -p sample.h5 /A/few/groups
ls -al sample.h5
-rw-r--r--. 1 gheber hdf 3896 Nov  5 07:29 sample.h5

Copy your example file or sample.h5 to an S3 bucket where you have write permissions:

aws s3 cp --acl public-read sample.h5 s3://pile-of-files
Completed 3.8 KiB/3.8 KiB (12.4 KiB/s) with 1 file(s) remaining
upload: ./sample.h5 to s3://pile-of-files/sample.h5
aws s3 ls s3://pile-of-files/sample.h5
2018-11-05 07:32:47       3896 sample.h5

In this example, we’ve made sample.h5 available for public read via the --acl public-read option. Do not do that with your own files!

If you do not have the permission to write to the bucket, you will see an error message.

Listing the HDF5 file structure via h5ls


The h5ls command line tool lists information about objects in an HDF5 file. Its behavior is the same whether the HDF5 file is stored in a local file system or in an S3 bucket. There is currently one additional required argument, --vfd=ros3 (read-only S3), which tells h5ls to use the S3 VFD instead of the default POSIX VFD. Unless you are dealing with a public S3 bucket, you must also supply AWS credentials (see note 2 below).

h5ls --vfd=ros3 -r https://s3.amazonaws.com/pile-of-files/sample.h5
/                        Group
/A                       Group
/A/few                   Group
/A/few/groups            Group

  1. Unlike the AWS CLI, the HDF5 command line tools currently do not support the s3:// scheme, and we must specify the URL with the https scheme.
  2. The S3 bucket pile-of-files is readable by the public. If that is not the case for your bucket, it is necessary to specify the AWS credentials with the --s3-cred option: --s3-cred="(<aws-region>,<access-id>,<access-key>)", as shown in the sketch below.
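
For a non-public bucket, the invocation might look like the following sketch. The bucket name is a placeholder, and the credentials shown are AWS's documented example key pair, not working values:

h5ls --vfd=ros3 \
     --s3-cred="(us-east-1,AKIAIOSFODNN7EXAMPLE,wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY)" \
     -r https://s3.amazonaws.com/my-private-bucket/sample.h5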

A more complex example is shown below. Depending on the network latency between your host and the Amazon cloud, running this example may take a while (e.g., ~40 sec). It will perform better, for example, when run on an EC2 instance in the Amazon cloud. It is not the size of the file (S3 object) that matters, but the number of HDF5 objects (groups, datasets, etc.) in the file.

h5ls --vfd=ros3 -r https://s3.amazonaws.com/pile-of-files/efitOut.nc | tail -n 50
/output/numericalDetails/degreesOfFreedom/offsetPressure Dataset {22845}
/output/numericalDetails/degreesOfFreedom/offsetRotationalPressure Dataset {22845}
/output/numericalDetails/degreesOfFreedom/pPrimeDim Dataset {2}
/output/numericalDetails/degreesOfFreedom/pfCircuitDim Dataset {10}
/output/numericalDetails/degreesOfFreedom/pfCurrents Dataset {22845, 10}
/output/numericalDetails/degreesOfFreedom/pprimeCoeffs Dataset {22845, 2}
/output/numericalDetails/degreesOfFreedom/wPrimeDim Dataset {0/Inf}
/output/numericalDetails/degreesOfFreedom/wprimeCoeffs Dataset {22845, 0/Inf}
/output/numericalDetails/finalChiSquared Dataset {22845}
/output/numericalDetails/finalIronSegmentCurrentsError Dataset {22845}
/output/numericalDetails/finalPoloidalFluxError Dataset {22845}
/output/numericalDetails/ironSegmentCurrentsError Dataset {22845}
/output/numericalDetails/iterationCount Dataset {22845}
/output/numericalDetails/maximumIterationCount Dataset {30}
/output/numericalDetails/poloidalFluxError Dataset {22845, 30}
/output/profiles2D       Group
/output/profiles2D/jphi  Dataset {22845, 33, 33}
/output/profiles2D/poloidalFlux Dataset {22845, 33, 33}
/output/profiles2D/r     Dataset {22845, 33}
/output/profiles2D/rgrid Dataset {33}
/output/profiles2D/z     Dataset {22845, 33}
/output/profiles2D/zgrid Dataset {33}
/output/radialProfiles   Group
/output/radialProfiles/Br Dataset {22845, 33}
/output/radialProfiles/Bt Dataset {22845, 33}
/output/radialProfiles/Bz Dataset {22845, 33}
/output/radialProfiles/jphi Dataset {22845, 33}
/output/radialProfiles/normalizedPoloidalFlux Dataset {22845, 33}
/output/radialProfiles/q Dataset {22845, 33}
/output/radialProfiles/r Dataset {22845, 33}
/output/radialProfiles/radialCoord Dataset {33}
/output/radialProfiles/staticPressure Dataset {22845, 33}
/output/separatrixGeometry Group
/output/separatrixGeometry/boundaryClassification Dataset {22845}
/output/separatrixGeometry/boundaryCoords Dataset {22845, 105}
/output/separatrixGeometry/boundaryCoordsDim Dataset {105}
/output/separatrixGeometry/boundaryType Type
/output/separatrixGeometry/elongation Dataset {22845}
/output/separatrixGeometry/geometricAxis Dataset {22845}
/output/separatrixGeometry/limiterCoords Dataset {22845}
/output/separatrixGeometry/lowerTriangularity Dataset {22845}
/output/separatrixGeometry/minorRadius Dataset {22845}
/output/separatrixGeometry/strikepointCoords Dataset {22845, 4}
/output/separatrixGeometry/strikepointDim Dataset {4}
/output/separatrixGeometry/upperTriangularity Dataset {22845}
/output/separatrixGeometry/xpointCoords Dataset {22845, 2}
/output/separatrixGeometry/xpointCount Dataset {22845}
/output/separatrixGeometry/xpointDim Dataset {2}
/time                    Dataset {22845}
/unityDim                Dataset {1}

Printing detailed information with h5dump


The h5dump command line tool prints detailed information about objects in an HDF5 file. Its behavior is the same whether the HDF5 file is stored in a local file system or in an S3 bucket. There is currently one additional required argument, --filedriver=ros3, which tells h5dump to use the S3 VFD instead of the default POSIX VFD.

h5dump --filedriver=ros3 \
       -pB https://s3.amazonaws.com/pile-of-files/sample.h5
HDF5 "https://s3.amazonaws.com/pile-of-files/sample.h5" {
SUPER_BLOCK {
   SUPERBLOCK_VERSION 0
   FREELIST_VERSION 0
   SYMBOLTABLE_VERSION 0
   OBJECTHEADER_VERSION 0
   OFFSET_SIZE 8
   LENGTH_SIZE 8
   BTREE_RANK 16
   BTREE_LEAF 4
   ISTORE_K 32
   FILE_SPACE_STRATEGY H5F_FSPACE_STRATEGY_FSM_AGGR
   FREE_SPACE_PERSIST FALSE
   FREE_SPACE_SECTION_THRESHOLD 1
   FILE_SPACE_PAGE_SIZE 4096
   USER_BLOCK {
      USERBLOCK_SIZE 0
   }
}
GROUP "/" {
   GROUP "A" {
      GROUP "few" {
         GROUP "groups" {
         }
      }
   }
}
}

A more elaborate example is shown below:

h5dump --filedriver=ros3 \
       -pBH https://s3.amazonaws.com/pile-of-files/efitOut.nc | head -n 50
HDF5 "https://s3.amazonaws.com/pile-of-files/efitOut.nc" {
SUPER_BLOCK {
   SUPERBLOCK_VERSION 2
   FREELIST_VERSION 0
   SYMBOLTABLE_VERSION 0
   OBJECTHEADER_VERSION 0
   OFFSET_SIZE 8
   LENGTH_SIZE 8
   BTREE_RANK 16
   BTREE_LEAF 4
   ISTORE_K 32
   FILE_SPACE_STRATEGY H5F_FSPACE_STRATEGY_FSM_AGGR
   FREE_SPACE_PERSIST FALSE
   FREE_SPACE_SECTION_THRESHOLD 1
   FILE_SPACE_PAGE_SIZE 4096
   USER_BLOCK {
      USERBLOCK_SIZE 0
   }
}
GROUP "/" {
   ATTRIBUTE "Conventions" {
      DATATYPE  H5T_STRING {
         STRSIZE 20;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "codeVersion" {
      DATATYPE  H5T_STRING {
         STRSIZE 11;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "pulseNumber" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   }
   DATATYPE "coords" H5T_COMPOUND {
      H5T_IEEE_F64LE "R";
      H5T_IEEE_F64LE "Z";
   }
   DATASET "equilibriumStatus" {
      DATATYPE  "/equilibriumStatusType"
      DATASPACE  SIMPLE { ( 22845 ) / ( 22845 ) }
      STORAGE_LAYOUT {

Extracting data from a dataset with h5dump

For a full list of h5dump options, run h5dump -h. The h5dump command to extract a 10-by-10 block (-s "0,0" -k "10,10") of elements from a two-dimensional dataset is shown below:

h5dump --filedriver=ros3 \
       -d /output/fluxFunctionProfiles/poloidalFluxArea \
       -s "0,0" -k "10,10" \
       https://s3.amazonaws.com/pile-of-files/efitOut.nc
HDF5 "https://s3.amazonaws.com/pile-of-files/efitOut.nc" {
DATASET "/output/fluxFunctionProfiles/poloidalFluxArea" {
   DATATYPE  H5T_IEEE_F64LE
   DATASPACE  SIMPLE { ( 22845, 33 ) / ( 22845, 33 ) }
   SUBSET {
      START ( 0, 0 );
      STRIDE ( 1, 1 );
      COUNT ( 1, 1 );
      BLOCK ( 10, 10 );
      DATA {
      (0,0): 0, 0.105177, 0.211194, 0.317949, 0.425508, 0.53394, 0.643315,
      (0,7): 0.753705, 0.865191, 0.97786,
      (1,0): 0, 0.102888, 0.206772, 0.311557, 0.417309, 0.524091, 0.631973,
      (1,7): 0.741024, 0.851326, 0.962959,
      (2,0): 0, 0.104089, 0.209173, 0.315166, 0.422121, 0.53011, 0.639202,
      (2,7): 0.749467, 0.860991, 0.973854,
      (3,0): 0, 0.10498, 0.210948, 0.317801, 0.425605, 0.534427, 0.644339,
      (3,7): 0.755412, 0.867731, 0.981378,
      (4,0): 0, 0.10605, 0.213064, 0.320959, 0.429793, 0.539632, 0.65055,
      (4,7): 0.76262, 0.875928, 0.990556,
      (5,0): 0, 0.108181, 0.217261, 0.327153, 0.437919, 0.549636, 0.66237,
      (5,7): 0.776199, 0.891211, 1.00749,
      (6,0): 0, 0.107151, 0.215293, 0.324331, 0.43433, 0.545359, 0.65749,
      (6,7): 0.7708, 0.885371, 1.00129,
      (7,0): 0, 0.110211, 0.22127, 0.33308, 0.445711, 0.559235, 0.673728,
      (7,7): 0.789269, 0.905944, 1.02384,
      (8,0): 0, 0.112895, 0.226538, 0.340842, 0.455874, 0.571715, 0.688438,
      (8,7): 0.806128, 0.924873, 1.04476,
      (9,0): 0, 0.112122, 0.225054, 0.338708, 0.453155, 0.568472, 0.68473,
      (9,7): 0.802015, 0.920413, 1.04001
      }
   }
   ATTRIBUTE "DIMENSION_LIST" {
      DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): (DATASET 9769 /time ),
      (1): (DATASET 240402 /output/fluxFunctionProfiles/normalizedPoloidalFlux )
      }
   }
   ATTRIBUTE "_FillValue" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): nan
      }
   }
   ATTRIBUTE "title" {
      DATATYPE  H5T_STRING {
         STRSIZE 16;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "poloidalFluxArea"
      }
   }
   ATTRIBUTE "units" {
      DATATYPE  H5T_STRING {
         STRSIZE 3;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "m^2"
      }
   }
}
}

Summary

The HDF5 VFD for S3 provides transparent read access to HDF5 files stored in S3 buckets. No code changes are required other than loading the S3 VFD and linking against an updated version of the HDF5 library. This applies to the HDF5 command line tools as well as to existing applications. You can use this VFD to bulk-process HDF5 (and NetCDF-4) files stored in S3 with frameworks such as Hadoop Streaming.
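
For example, a minimal C sketch of opening the sample.h5 file from the bucket used above might look as follows. It assumes an HDF5 build with the ROS3 VFD enabled, uses anonymous (non-authenticating) access, and reduces error handling to asserts:

#include <assert.h>
#include "hdf5.h"

int main(void)
{
    /* anonymous, non-authenticating ROS3 configuration */
    H5FD_ros3_fapl_t fa = { 1, 0, "", "", "" };

    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    assert( fapl_id >= 0 );
    assert( H5Pset_fapl_ros3(fapl_id, &fa) >= 0 );

    /* open the S3-hosted file read-only, just as if it were local */
    hid_t file_id = H5Fopen("https://s3.amazonaws.com/pile-of-files/sample.h5",
                            H5F_ACC_RDONLY, fapl_id);
    assert( file_id >= 0 );

    /* ... read groups, datasets, and attributes as usual ... */

    H5Fclose(file_id);
    H5Pclose(fapl_id);
    return 0;
}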

Additional Documentation

Getting Started Guide

This guide describes how to install and start using the Read-Only S3 (ROS3) VFD for HDF5. The Read-Only S3 VFD transparently accesses HDF5-format files hosted remotely on Amazon's Simple Storage Service (S3), supplying bytes to the HDF5 library through the AWS REST API.

Supported Operating Systems

This VFD is currently supported only on Linux. Support for Windows and macOS will be added in a later release.

Software Prerequisites

The following libraries and tools are required prior to installing the Read-Only S3 VFD:

  • openssl
  • cURL
  • Autotools

Installation Instructions

The following describes the process to install the Read-Only S3 VFD and the associated unit and regression tests.

Step 1. Extract source tarball or download the source

tar -zxf hdf5-1.11.0-of20180219.tar.gz

cd hdf5-1.11.0-of20180219/

or

If you were not provided with a source tarball, clone the source from Bitbucket. This route requires Autotools to prepare the build.

git clone https://bitbucket.hdfgroup.org/scm/hdffv/hdf5.git hdf5_ros3

cd hdf5_ros3

./autogen.sh

Step 2. Modify environment variables to include new libraries

Depending on where OpenSSL and libcurl are installed, it may be necessary to set the CPPFLAGS and LDFLAGS environment variables manually, e.g.:

export CPPFLAGS="-I/usr/local/opt/openssl/include -I/usr/local/opt/curl/include"
export LDFLAGS="-L/usr/local/opt/openssl/lib -L/usr/local/opt/curl/lib"

 

Step 3. (optional) Set up S3 test bucket and credentials

  • Obtain one or more HDF5 test files
  • Upload them to an S3 bucket
  • Set the environment variable that points the tests at the bucket URL
  • Set your AWS credentials

If this step is not done, or not done properly, some of the S3 tests will be skipped during make check. A rough sketch of this step is shown below.
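
The bucket and file names below are placeholders, and the exact environment variable consulted by the test suite is an assumption; check the ROS3 test sources in your snapshot for the authoritative name:

aws s3 cp sample.h5 s3://my-hdf5-test-bucket/
export HDF5_ROS3_TEST_BUCKET_URL="https://s3.amazonaws.com/my-hdf5-test-bucket"
# credentials come from ~/.aws/credentials or the usual AWS_* environment variables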

Step 4. Configure and build HDF5 with the Read-Only S3 VFD

Modify the HDF5 build flags as appropriate. The --enable-ros3-vfd flag is required to enable Read-Only S3 VFD features.

./configure --enable-ros3-vfd --enable-shared --enable-java
make
make check
make install

HDF5 APIs for use with the Read-Only S3 VFD

The following new APIs were added to HDF5 for use with the Read-Only S3 VFD: H5Pget_fapl_ros3() and H5Pset_fapl_ros3(). Man pages for these can be found in ros3_api_reference.txt. Further usage information can be found in usage.txt.

H5Pget_fapl_ros3()

This retrieves the configuration of the Read-Only S3 VFD. The information in the fapl fapl_id is copied into the H5FD_ros3_fapl_t structure pointed to by fa. Returns a non-negative value if successful and a negative value otherwise.

 

herr_t H5Pget_fapl_ros3(hid_t fapl_id, H5FD_ros3_fapl_t *fa)

 

Parameters:

 

hid_t fapl_id        IN: File access property list identifier.

H5FD_ros3_fapl_t *fa OUT: H5FD_ros3_fapl_t structure destination

 

Example:

 

/* assumes fapl_id has been created and set with H5Pset_fapl_ros3() */
H5FD_ros3_fapl_t fa;
fa.authenticate = 16; /* bogus value: neither TRUE (1) nor FALSE (0) */
assert( H5Pget_fapl_ros3(fapl_id, &fa) >= 0 );
assert( fa.authenticate == TRUE || fa.authenticate == FALSE );

 

H5Pset_fapl_ros3()

This sets up the Read-Only S3 VFD: it sets the file access property list fapl_id to use the Read-Only S3 VFD. In addition to requiring very different underlying operations, files on S3 may have restricted access, so attempts to access and read them must provide credentials that authenticate the requester and protect message integrity. The structure H5FD_ros3_fapl_t contains a flag indicating whether this authentication is to take place, as well as fields for supplying credentials to the virtual file driver.

 

If the configuration structure is set to _not_ authenticate, e.g., fa.authenticate == (hbool_t)FALSE, then the credential fields aws_region, secret_id, and secret_key are ignored.

 

If the configuration structure is set to authenticate, e.g., fa.authenticate == (hbool_t)TRUE, then the credential fields must be populated with null-terminated strings. Each field is an array of characters whose size is determined by a constant in H5FDros3.c, e.g., H5FD__ROS3_MAX_REGION_LEN. If a string exceeds the defined length, an error has likely occurred and the behavior is undefined.

 

herr_t H5Pset_fapl_ros3(hid_t fapl_id, H5FD_ros3_fapl_t *fa)

 

Parameters:

 

hid_t fapl_id                   IN: File access property list identifier.

H5FD_ros3_fapl_t *fa   IN: Structure containing fapl configuration information.

 

Example:

 

hid_t fapl_id = -1;

/* default, non-authenticating, "anonymous" fapl info */
H5FD_ros3_fapl_t fa = { 1, 0, "", "", "" };

#if AUTHENTICATE_STATIC_VARS

/* fapl info with authentication credentials provided statically */
fa = (H5FD_ros3_fapl_t){
    1,                                          /* version           */
    1,                                          /* authenticate      */
    "us-east-2",                                /* aws_region        */
    "AKIAIMC3D3XLYXLN5COA",                     /* access_key_id     */
    "ugs5aVVnLFCErO/8uW14iWE3K5AgXMpsMlWneO/+"  /* secret_access_key */
};

#elif AUTHENTICATE_DYNAMIC_VARS

/* fapl info populated dynamically
 * Assumes variables `should_authenticate`, `the_region`,
 * `the_access_key_id`, and `the_secret_access_key` have been set somewhere
 */
fa.authenticate = should_authenticate; /* 0 (FALSE) or 1 (TRUE) */
strncpy(fa.aws_region, the_region, H5FD__ROS3_MAX_REGION_LEN);
strncpy(fa.secret_id, the_access_key_id, H5FD__ROS3_MAX_SECRET_ID_LEN);
strncpy(fa.secret_key, the_secret_access_key, H5FD__ROS3_MAX_SECRET_KEY_LEN);

#endif /* set authenticating fapl info statically or dynamically */

/* create the fapl and set it to use the Read-Only S3 VFD */
fapl_id = H5Pcreate(H5P_FILE_ACCESS);
assert( fapl_id >= 0 );
assert( H5Pset_fapl_ros3(fapl_id, &fa) >= 0 );

 

Using HDF5 Tools with the Read-Only S3 VFD

The following tools have been modified to use the Read-Only S3 VFD. See demo_tools.txt for case examples.

 

  • h5dump
  • h5ls
  • h5stat

 

A new command-line argument has been added to h5dump, h5ls, and h5stat for accepting AWS credentials:

--s3-cred=(<aws_region>,<access_key_id>,<secret_key>)

 

Escape the parentheses as appropriate for your shell; in Bash, for example, wrap the entire tuple in quotation marks:

--s3-cred="(…)"

 

Please read the tools' help messages for further details. These arguments interact correctly with the existing flags for each command. Credentials, via --s3-cred, may be omitted for anonymous access. A concrete example follows the usage summaries below.

 

h5dump

 

h5dump [ -f ros3 | --filedriver=ros3 ] [ --s3-cred="(…)" ]

 

h5ls

 

h5ls [ --vfd=ros3 ] [ --s3-cred="(…)" ]

 

h5stat

 

h5stat [ --s3-cred="(…)" ]
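
For example, dumping the superblock of a file in a non-public bucket might look like the following sketch. The bucket name is a placeholder, and the credentials are AWS's documented example key pair, not working values:

h5dump --filedriver=ros3 \
       --s3-cred="(us-east-1,AKIAIOSFODNN7EXAMPLE,wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY)" \
       -pB https://s3.amazonaws.com/my-private-bucket/sample.h5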

 

 

Known Issues

 

Anonymous access with an authenticating FAPL. An authenticating fapl can be used to open an anonymously-accessible file, but incurs some overhead in the application: authentication is performed when creating requests to S3, but the authentication information is ignored by the server.

 

API subject to change. The API calls and tool command-line interfaces for the Read-Only S3 VFD may change when this VFD is made available in a future release as a plug-in VFD module.

 

 

Technical Support

 

For assistance with this product, please contact The HDF Group’s customer support team at help@hdfgroup.org.