Hadoop (HDFS) HDF5 Connector

The Hadoop Distributed File System (HDFS) HDF5 Connector is a virtual file driver (VFD) that allows you to use HDF5 command line tools to extract metadata and raw data from HDF5 and NetCDF-4 files on HDFS, and to use Hadoop streaming to collect data from multiple HDF5 files. The HDFS HDF5 Connector is available with Enterprise Support. Watch the demo video for more information; an index of the commands is listed after the video.

Demo Video

Video Script

For your convenience, the video is divided into the following sections, each with text and code samples.

Introduction

This demonstration will show you how to:

  1. Copy an HDF5 file into an HDFS file system
  2. List the file structure by running h5ls
  3. Print detailed information by running h5dump
  4. Extract data from a dataset with h5dump
  5. Use Hadoop Streaming to collect data from multiple HDF5 files

Before you proceed, please review the following prerequisites.

Prerequisites

  • A version of the HDF5 library (HDF5 1.10.4 or later) with the HDFS Virtual File Driver (VFD) enabled
  • A compatible version of Hadoop (e.g., 3.1.1)
  • An HDFS file system (a cluster or local installation)

With these prerequisites in place, make sure that your environment is configured for the dependencies (Java, Hadoop).

Environment

The HDF5 VFD for HDFS, whether used in your own application or with HDF5 command line tools, depends on certain environment variables and shared libraries.

Please ensure that JAVA_HOME, HADOOP_HOME, and CLASSPATH reflect your environment. An example is provided below:

export JAVA_HOME=/usr/lib/jvm/java-openjdk
export HADOOP_HOME=$HOME/work/hadoop-3.1.1
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`

The HDFS VFD depends on the following shared libraries: libhdf5.so, libhdfs.so, and libjvm.so. Please locate these dependencies and configure LD_LIBRARY_PATH accordingly. An example is shown below:

export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64/server:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH

For convenience, we add the directories containing the Hadoop and HDF5 command line tools to our PATH environment variable.

export PATH=$HADOOP_HOME/bin:$HOME/.local/bin:$PATH
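
Before running any of the tools, it can help to confirm that the variables above are set and that the shared libraries are reachable. The following is a minimal sketch, not part of the connector; the function names missing_env_vars and find_shared_lib are hypothetical helpers introduced here for illustration, and the check only looks at the directories listed in LD_LIBRARY_PATH.

```shell
# Print the name of every required variable that is unset or empty.
missing_env_vars() {
  for name in JAVA_HOME HADOOP_HOME CLASSPATH LD_LIBRARY_PATH; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Print the full path of a shared library found in LD_LIBRARY_PATH.
# Returns non-zero if no listed directory contains the library.
find_shared_lib() {
  lib=$1
  old_ifs=$IFS
  IFS=:
  for dir in ${LD_LIBRARY_PATH:-}; do
    if [ -e "$dir/$lib" ]; then
      IFS=$old_ifs
      printf '%s\n' "$dir/$lib"
      return 0
    fi
  done
  IFS=$old_ifs
  return 1
}

# Example use (commented out so that the block only defines functions):
#   missing_env_vars
#   for lib in libhdf5.so libhdfs.so libjvm.so; do
#     find_shared_lib "$lib" || echo "not found: $lib"
#   done
```

An empty result from missing_env_vars and a path for each of the three libraries suggest that the environment is ready; the dynamic loader may still find libraries elsewhere (for example in system directories), so a "not found" here is a hint rather than proof of a problem.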

To get started, you need to know the host name and port of the HDFS namenode. In the examples below, the host name is jelly.ad.hdfgroup.org and the port number is 8020. Verify the availability of an HDFS file system via the hdfs dfs -ls command:

hdfs dfs -ls hdfs://jelly.ad.hdfgroup.org:8020/
Found 5 items
drwxrwxrwx   - hdfs  supergroup          0 2016-04-05 21:26 hdfs://jelly.ad.hdfgroup.org:8020/benchmarks
drwxr-xr-x   - hbase supergroup          0 2016-04-05 21:26 hdfs://jelly.ad.hdfgroup.org:8020/hbase
drwxrwxrwt   - hdfs  supergroup          0 2018-10-25 09:55 hdfs://jelly.ad.hdfgroup.org:8020/tmp
drwxr-xr-x   - hdfs  supergroup          0 2018-10-10 16:14 hdfs://jelly.ad.hdfgroup.org:8020/user
drwxr-xr-x   - hdfs  supergroup          0 2016-04-05 21:27 hdfs://jelly.ad.hdfgroup.org:8020/var

Copying an HDF5 file to HDFS

If you have an existing HDF5 or NetCDF-4 file, you can skip this step. If you don’t have an HDF5 file, it’s easy to create one via h5mkgrp:

h5mkgrp -p sample.h5 /A/few/groups
ls -al sample.h5
-rw-r--r--. 1 gheber hdf 3896 Oct 25 10:22 sample.h5

Copy your example file or sample.h5 to HDFS as follows:

hdfs dfs -copyFromLocal sample.h5 hdfs://jelly.ad.hdfgroup.org/tmp
hdfs dfs -ls hdfs://jelly.ad.hdfgroup.org/tmp/sample.h5
-rw-r--r--   3 gheber supergroup       3896 2018-10-25 10:23 hdfs://jelly.ad.hdfgroup.org/tmp/sample.h5

If you do not have the permission to write to HDFS’s /tmp directory, you will see an error message and must choose a target directory, such as your home directory (in HDFS), where you have write permission.
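
The fallback described above can be scripted. The sketch below assumes that your Hadoop version's hdfs dfs -test command supports the -w (write permission) check, as documented for Hadoop 3.x; pick_writable_dir is a hypothetical helper name, and the candidate directories in the usage comment are only examples.

```shell
# Print the first candidate HDFS directory that exists and is writable.
# Assumption: `hdfs dfs -test -w <path>` returns 0 when the path exists
# and the current user has write permission.
pick_writable_dir() {
  for dir in "$@"; do
    if hdfs dfs -test -w "$dir" 2>/dev/null; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Example use (commented out so that the block only defines a function):
#   NAMENODE=hdfs://jelly.ad.hdfgroup.org:8020
#   target=$(pick_writable_dir "$NAMENODE/tmp" "$NAMENODE/user/$USER") &&
#     hdfs dfs -copyFromLocal sample.h5 "$target"
```

If no candidate is writable, the function returns a non-zero status and the copy is skipped, which is easier to diagnose than a failed copy in the middle of a longer script.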

Listing the HDF5 file structure via h5ls

The h5ls command line tool lists information about objects in an HDF5 file. The behavior of h5ls is the same whether the HDF5 file is stored in a local file system or in HDFS. There is currently one additional required argument, --vfd=hdfs, which tells h5ls to use the HDFS VFD instead of the default POSIX VFD. Unless the HDFS file system is running on localhost, it is also necessary to specify the host name and port number of the HDFS namenode via the --hdfs-attrs option.

h5ls --vfd=hdfs --hdfs-attrs="(hdfs://jelly.ad.hdfgroup.org,8020,,,)" \
     -r /tmp/sample.h5
/                        Group
/A                       Group
/A/few                   Group
/A/few/groups            Group
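
Because the same --hdfs-attrs value recurs in every command, it may be convenient to build it in one place. This is a small sketch under the assumption that only the namenode host and port need to be filled in, with the three trailing fields left empty exactly as in the examples in this document; hdfs_attrs is a hypothetical helper name.

```shell
# Build the tuple passed to --hdfs-attrs from a host name and a port.
# The remaining three fields are left empty, as in the examples above.
hdfs_attrs() {
  printf '(hdfs://%s,%s,,,)' "$1" "$2"
}

# Example use (commented out so that the block only defines a function):
#   h5ls --vfd=hdfs --hdfs-attrs="$(hdfs_attrs jelly.ad.hdfgroup.org 8020)" \
#        -r /tmp/sample.h5
```

Keeping the host and port in one helper means that pointing the commands at a different namenode is a one-line change.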

A slightly more complex example is shown below:

h5ls --vfd=hdfs --hdfs-attrs="(hdfs://jelly.ad.hdfgroup.org,8020,,,)" \
     -r /tmp/efitOut.nc | head -n 50
/                        Group
/coords                  Type
/equilibriumStatus       Dataset {22845}
/equilibriumStatusInteger Dataset {22845}
/equilibriumStatusType   Type
/errorMessageDim         Dataset {20}
/errorMessages           Dataset {22845, 20}
/input                   Group
/input/bVacRadiusProduct Group
/input/bVacRadiusProduct/values Dataset {22845}
/input/codeControls      Group
/input/codeControls/alpgamSwitch Dataset {1}
/input/codeControls/computeChi2WithWeights Dataset {1}
/input/codeControls/computeConstraintsWithDz Dataset {1}
/input/codeControls/fcurrtFit Dataset {1}
/input/codeControls/fitDzAlgorithm Dataset {1}
/input/codeControls/lcfsTol Dataset {1}
/input/constraints       Group
/input/constraints/diamagneticFlux Group
/input/constraints/diamagneticFlux/computed Dataset {22845}
/input/constraints/diamagneticFlux/sigma Dataset {22845}
/input/constraints/diamagneticFlux/target Dataset {22845}
/input/constraints/diamagneticFlux/weights Dataset {22845}
/input/constraints/fluxLoops Group
/input/constraints/fluxLoops/computed Dataset {22845, 36}
/input/constraints/fluxLoops/fluxLoopDim Dataset {36}
/input/constraints/fluxLoops/fluxLoopElementDim Dataset {2}
/input/constraints/fluxLoops/id Dataset {36}
/input/constraints/fluxLoops/rValues Dataset {36, 2}
/input/constraints/fluxLoops/sigmas Dataset {22845, 36}
/input/constraints/fluxLoops/target Dataset {22845, 36}
/input/constraints/fluxLoops/toroidalAngleBegin Dataset {36, 2}
/input/constraints/fluxLoops/toroidalAngleEnd Dataset {36, 2}
/input/constraints/fluxLoops/weights Dataset {22845, 36}
/input/constraints/fluxLoops/zValues Dataset {36, 2}
/input/constraints/ironModel Group
/input/constraints/ironModel/absoluteError Dataset {1}
/input/constraints/ironModel/fittingAbsoluteError Dataset {1}
/input/constraints/ironModel/fittingRelativeError Dataset {1}
/input/constraints/ironModel/geometry Group
/input/constraints/ironModel/geometry/Dang Dataset {1, 96}
/input/constraints/ironModel/geometry/Dang2 Dataset {1, 96}
/input/constraints/ironModel/geometry/Eang Dataset {1, 96}
/input/constraints/ironModel/geometry/Eang2 Dataset {1, 96}
/input/constraints/ironModel/geometry/boundaryCoordsR Dataset {1, 96}
/input/constraints/ironModel/geometry/boundaryCoordsZ Dataset {1, 96}
/input/constraints/ironModel/geometry/boundaryIntervalCount Dataset {1}
/input/constraints/ironModel/geometry/boundaryLength Dataset {1}
/input/constraints/ironModel/geometry/boundaryNodeCount Dataset {1}
/input/constraints/ironModel/geometry/boundaryNodeDim Dataset {96}
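
Since h5ls writes one object per line, its output is easy to post-process with standard text tools. The sketch below filters a recursive listing for datasets with a given number of dimensions. It assumes that object paths contain no whitespace and that each dimension appears as its own whitespace-separated field, as in the listing above; datasets_of_rank is a hypothetical helper name.

```shell
# Read `h5ls -r` output on stdin and print the paths of all datasets
# whose shape has the requested number of dimensions.
# Assumption: paths contain no whitespace, so field 1 is the path,
# field 2 is the object type, and the remaining fields are dimensions.
datasets_of_rank() {
  awk -v rank="$1" '$2 == "Dataset" && NF - 2 == rank { print $1 }'
}

# Example use (commented out so that the block only defines a function):
#   h5ls --vfd=hdfs --hdfs-attrs="(hdfs://jelly.ad.hdfgroup.org,8020,,,)" \
#        -r /tmp/efitOut.nc | datasets_of_rank 2
```

Applied to the listing above, a rank of 2 would select entries such as /errorMessages and /input/constraints/fluxLoops/computed.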

Printing detailed information with h5dump

The h5dump command line tool lists detailed information about objects in an HDF5 file. The behavior of h5dump is the same whether the HDF5 file is stored in a local file system or in HDFS. There is currently one additional required argument, --filedriver=hdfs, which tells h5dump to use the HDFS VFD instead of the default POSIX VFD. Unless the HDFS file system is running on localhost, it is also necessary to specify the host name and port number of the HDFS namenode via the --hdfs-attrs option.

h5dump --filedriver=hdfs \
       --hdfs-attrs="(hdfs://jelly.ad.hdfgroup.org,8020,,,)" \
       -pB \
       /tmp/sample.h5
HDF5 "/tmp/sample.h5" {
SUPER_BLOCK {
   SUPERBLOCK_VERSION 0
   FREELIST_VERSION 0
   SYMBOLTABLE_VERSION 0
   OBJECTHEADER_VERSION 0
   OFFSET_SIZE 8
   LENGTH_SIZE 8
   BTREE_RANK 16
   BTREE_LEAF 4
   ISTORE_K 32
   FILE_SPACE_STRATEGY H5F_FSPACE_STRATEGY_FSM_AGGR
   FREE_SPACE_PERSIST FALSE
   FREE_SPACE_SECTION_THRESHOLD 1
   FILE_SPACE_PAGE_SIZE 4096
   USER_BLOCK {
      USERBLOCK_SIZE 0
   }
}
GROUP "/" {
   GROUP "A" {
      GROUP "few" {
         GROUP "groups" {
         }
      }
   }
}
}

A slightly more complex example is shown below:

h5dump --filedriver=hdfs \
       --hdfs-attrs="(hdfs://jelly.ad.hdfgroup.org,8020,,,)" \
       -pBH /tmp/efitOut.nc | head -n 50
HDF5 "/tmp/efitOut.nc" {
SUPER_BLOCK {
   SUPERBLOCK_VERSION 2
   FREELIST_VERSION 0
   SYMBOLTABLE_VERSION 0
   OBJECTHEADER_VERSION 0
   OFFSET_SIZE 8
   LENGTH_SIZE 8
   BTREE_RANK 16
   BTREE_LEAF 4
   ISTORE_K 32
   FILE_SPACE_STRATEGY H5F_FSPACE_STRATEGY_FSM_AGGR
   FREE_SPACE_PERSIST FALSE
   FREE_SPACE_SECTION_THRESHOLD 1
   FILE_SPACE_PAGE_SIZE 4096
   USER_BLOCK {
      USERBLOCK_SIZE 0
   }
}
GROUP "/" {
   ATTRIBUTE "Conventions" {
      DATATYPE  H5T_STRING {
         STRSIZE 20;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "codeVersion" {
      DATATYPE  H5T_STRING {
         STRSIZE 11;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "pulseNumber" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   }
   DATATYPE "coords" H5T_COMPOUND {
      H5T_IEEE_F64LE "R";
      H5T_IEEE_F64LE "Z";
   }
   DATASET "equilibriumStatus" {
      DATATYPE  "/equilibriumStatusType"
      DATASPACE  SIMPLE { ( 22845 ) / ( 22845 ) }
      STORAGE_LAYOUT {

Extracting data from a dataset with h5dump

For a full list of h5dump options, run h5dump -h. The h5dump command to extract a 10 by 10 block of elements (-s "0,0" -k "10,10", that is, start offset 0,0 and block size 10,10) from a two-dimensional dataset is shown below:

h5dump --filedriver=hdfs \
       --hdfs-attrs="(hdfs://jelly.ad.hdfgroup.org,8020,,,)" \
       -d /output/fluxFunctionProfiles/poloidalFluxArea \
       -s "0,0" -k "10,10" \
    /tmp/efitOut.nc
HDF5 "/tmp/efitOut.nc" {
DATASET "/output/fluxFunctionProfiles/poloidalFluxArea" {
   DATATYPE  H5T_IEEE_F64LE
   DATASPACE  SIMPLE { ( 22845, 33 ) / ( 22845, 33 ) }
   SUBSET {
      START ( 0, 0 );
      STRIDE ( 1, 1 );
      COUNT ( 1, 1 );
      BLOCK ( 10, 10 );
      DATA {
      (0,0): 0, 0.105177, 0.211194, 0.317949, 0.425508, 0.53394, 0.643315,
      (0,7): 0.753705, 0.865191, 0.97786,
      (1,0): 0, 0.102888, 0.206772, 0.311557, 0.417309, 0.524091, 0.631973,
      (1,7): 0.741024, 0.851326, 0.962959,
      (2,0): 0, 0.104089, 0.209173, 0.315166, 0.422121, 0.53011, 0.639202,
      (2,7): 0.749467, 0.860991, 0.973854,
      (3,0): 0, 0.10498, 0.210948, 0.317801, 0.425605, 0.534427, 0.644339,
      (3,7): 0.755412, 0.867731, 0.981378,
      (4,0): 0, 0.10605, 0.213064, 0.320959, 0.429793, 0.539632, 0.65055,
      (4,7): 0.76262, 0.875928, 0.990556,
      (5,0): 0, 0.108181, 0.217261, 0.327153, 0.437919, 0.549636, 0.66237,
      (5,7): 0.776199, 0.891211, 1.00749,
      (6,0): 0, 0.107151, 0.215293, 0.324331, 0.43433, 0.545359, 0.65749,
      (6,7): 0.7708, 0.885371, 1.00129,
      (7,0): 0, 0.110211, 0.22127, 0.33308, 0.445711, 0.559235, 0.673728,
      (7,7): 0.789269, 0.905944, 1.02384,
      (8,0): 0, 0.112895, 0.226538, 0.340842, 0.455874, 0.571715, 0.688438,
      (8,7): 0.806128, 0.924873, 1.04476,
      (9,0): 0, 0.112122, 0.225054, 0.338708, 0.453155, 0.568472, 0.68473,
      (9,7): 0.802015, 0.920413, 1.04001
      }
   }
   ATTRIBUTE "DIMENSION_LIST" {
      DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): (DATASET 9769 /time ),
      (1): (DATASET 240402 /output/fluxFunctionProfiles/normalizedPoloidalFlux )
      }
   }
   ATTRIBUTE "_FillValue" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): nan
      }
   }
   ATTRIBUTE "title" {
      DATATYPE  H5T_STRING {
         STRSIZE 16;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "poloidalFluxArea"
      }
   }
   ATTRIBUTE "units" {
      DATATYPE  H5T_STRING {
         STRSIZE 3;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "m^2"
      }
   }
}
}
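
In the output above, h5dump wraps each row of the block over two lines, each prefixed with the index of its first element. For further processing it can be handy to join these pieces into one line per row. The following is a rough sketch that handles only two-dimensional index prefixes of the form (i,j): as shown above; flatten_h5dump_rows is a hypothetical helper name.

```shell
# Read h5dump output on stdin and print one line per dataset row,
# with the values separated by single spaces.
# Only lines that start with a two-dimensional index such as "(3,7):"
# are used; attribute data such as "(0): nan" is ignored.
flatten_h5dump_rows() {
  awk '
    /^[ \t]*\([0-9]+,[0-9]+\):/ {
      idx = $1                      # e.g. "(3,7):"
      sub(/^\(/, "", idx)           # drop the opening parenthesis
      sub(/,.*/, "", idx)           # keep only the row index
      line = $0
      sub(/^[ \t]*\([0-9,]+\):[ \t]*/, "", line)   # strip the index prefix
      sub(/,[ \t]*$/, "", line)                    # strip a trailing comma
      gsub(/,[ \t]*/, " ", line)                   # commas to spaces
      if (!seen || idx != row) {
        if (seen) print buf
        row = idx; buf = line; seen = 1
      } else {
        buf = buf " " line
      }
    }
    END { if (seen) print buf }
  '
}

# Example use (commented out so that the block only defines a function):
#   h5dump --filedriver=hdfs \
#          --hdfs-attrs="(hdfs://jelly.ad.hdfgroup.org,8020,,,)" \
#          -d /output/fluxFunctionProfiles/poloidalFluxArea \
#          -s "0,0" -k "10,10" /tmp/efitOut.nc | flatten_h5dump_rows
```

For the 10 by 10 block above, this would yield ten lines of ten values each.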

Use Hadoop Streaming to collect data from multiple HDF5 files

For more information on Hadoop streaming see the official documentation. Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

We have created a simple example of a mapper and reducer written in C which together determine the number of HDF5 objects (groups, datasets, datatypes) in each file in a collection of HDF5 (and NetCDF-4) files. The source code can be obtained from GitHub.

The HDF5 files to be examined are listed in two text files, input1 and input2. (The reason we use two files is to create more than one split to be processed. The input text files are just too small for Hadoop to create more than one split on its own.)

$HADOOP_HOME/bin/hdfs dfs -cat hdfs://jelly.ad.hdfgroup.org:8020/tmp/input*
/tmp/GSSTF_NCEP.3.1987.12.07.he5
/tmp/GSSTF_NCEP.3.1987.12.08.he5
/tmp/GSSTF_NCEP.3.1987.12.09.he5
/tmp/GSSTF_NCEP.3.1987.12.10.he5
/tmp/GSSTF_NCEP.3.1987.12.11.he5
/tmp/GSSTF_NCEP.3.1987.12.12.he5
/tmp/GSSTF_NCEP.3.1987.12.13.he5
/tmp/GSSTF_NCEP.3.1987.12.14.he5
/tmp/GSSTF_NCEP.3.1987.12.15.he5
/tmp/GSSTF_NCEP.3.1987.12.16.he5
/tmp/GSSTF_NCEP.3.1987.12.17.he5
/tmp/GSSTF_NCEP.3.1987.12.18.he5
/tmp/GSSTF_NCEP.3.1987.12.19.he5
/tmp/GSSTF_NCEP.3.1987.12.20.he5
/tmp/GSSTF_NCEP.3.1987.12.21.he5
/tmp/GSSTF_NCEP.3.1987.12.22.he5
/tmp/GSSTF_NCEP.3.1987.12.23.he5
/tmp/GSSTF_NCEP.3.1987.12.24.he5
/tmp/GSSTF_NCEP.3.1987.12.25.he5
/tmp/GSSTF_NCEP.3.1987.12.26.he5
/tmp/GSSTF_NCEP.3.1987.12.27.he5
/tmp/GSSTF_NCEP.3.1987.12.28.he5
/tmp/GSSTF_NCEP.3.1987.12.29.he5
/tmp/GSSTF_NCEP.3.1987.12.30.he5
/tmp/GSSTF_NCEP.3.1987.12.31.he5
/tmp/foo.h5
/tmp/sample.h5
/tmp/t.h5
/tmp/efitOut.nc
/tmp/GSSTF_NCEP.3.1987.12.01.he5
/tmp/GSSTF_NCEP.3.1987.12.02.he5
/tmp/GSSTF_NCEP.3.1987.12.03.he5
/tmp/GSSTF_NCEP.3.1987.12.04.he5
/tmp/GSSTF_NCEP.3.1987.12.05.he5
/tmp/GSSTF_NCEP.3.1987.12.06.he5
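
One way to produce several small list files like input1 and input2 is to distribute a single list of paths over a fixed number of files. The sketch below does this round-robin, which differs from the uneven split shown above but serves the same purpose of giving Hadoop more than one input file; split_file_list is a hypothetical helper name.

```shell
# Read one HDFS path per line on stdin and distribute the non-empty
# lines round-robin over <parts> files named <prefix>1 .. <prefix>N.
split_file_list() {
  prefix=$1
  parts=$2
  awk -v prefix="$prefix" -v parts="$parts" '
    NF {
      out = prefix ((n++ % parts) + 1)   # e.g. "input1", "input2", ...
      print > out
    }
  '
}

# Example use (commented out so that the block only defines a function):
#   split_file_list input 2 < all-files.txt
#   hdfs dfs -copyFromLocal input1 hdfs://jelly.ad.hdfgroup.org:8020/tmp
#   hdfs dfs -copyFromLocal input2 hdfs://jelly.ad.hdfgroup.org:8020/tmp
```

Here all-files.txt stands for any local text file with one HDFS path per line; it is a placeholder, not a file from the demonstration.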

The mapper, implemented in hdfs-vfd-mapper.c and wrapped in mapper.sh, generates key-value pairs of the form

<FILENAME> [G,D,T]
...

where the codes represent groups (G), datasets (D), or datatypes (T). The reducer, implemented in hdfs-vfd-reducer.c, just counts the number of codes in each category and presents the final result as records of the form

<FILENAME> G #G D #D T #T
...
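
To make the data flow concrete, the two stages can be imitated with standard text tools. This is only an illustration of the record formats described above, not the C implementation from the repository: emit_pairs and count_codes are hypothetical names, the mapper stage here derives its pairs from h5ls -r output rather than from the HDF5 library, and object paths are assumed to contain no whitespace. The reducer stage relies on the fact that Hadoop streaming delivers reducer input sorted by key, so all records for one file are adjacent.

```shell
# Mapper stage: turn `h5ls -r` output (stdin) for one file into
# "<FILENAME><TAB><CODE>" records, one per object.
emit_pairs() {
  awk -v file="$1" '
    $2 == "Group"   { print file "\tG" }
    $2 == "Dataset" { print file "\tD" }
    $2 == "Type"    { print file "\tT" }
  '
}

# Reducer stage: read key-sorted "<FILENAME> <CODE>" records on stdin
# and print one "<FILENAME> G #G D #D T #T" record per file.
count_codes() {
  awk '
    function emit() {
      if (seen) printf "%s G %d D %d T %d\n", key, c["G"], c["D"], c["T"]
    }
    NF < 2 { next }                       # skip blank or malformed lines
    !seen || $1 != key {                  # a new file name starts here
      emit()
      key = $1; seen = 1
      c["G"] = c["D"] = c["T"] = 0
    }
    { c[$2]++ }
    END { emit() }
  '
}

# Example use (commented out so that the block only defines functions):
#   h5ls -r sample.h5 | emit_pairs /tmp/sample.h5 | count_codes
```

Whether the actual mapper counts the root group or handles links in the same way as h5ls is not stated here, so the numbers from this sketch may differ from those of the C programs.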

Hadoop streaming can be invoked as follows:

HDFS_DIR=hdfs://jelly.ad.hdfgroup.org:8020/tmp
INPUT1=$HDFS_DIR/input1
INPUT2=$HDFS_DIR/input2
OUTPUT=$HDFS_DIR/hdfs-vfd-output

MAPPER=./mapper.sh
REDUCER=./hdfs-vfd-reducer
MAPTASKS=2
REDTASKS=3

# Delete output from previous runs
$HADOOP_HOME/bin/hdfs dfs -rm $OUTPUT/*
$HADOOP_HOME/bin/hdfs dfs -rmdir $OUTPUT

$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-*streaming*.jar \
  -D mapred.map.tasks=$MAPTASKS  \
  -D mapred.reduce.tasks=$REDTASKS \
  -input $INPUT1  -input $INPUT2 \
  -output $OUTPUT \
  -mapper $MAPPER \
  -reducer $REDUCER

$HADOOP_HOME/bin/hdfs dfs -cat $OUTPUT/part-*
Deleted hdfs://jelly.ad.hdfgroup.org:8020/tmp/hdfs-vfd-output/_SUCCESS
Deleted hdfs://jelly.ad.hdfgroup.org:8020/tmp/hdfs-vfd-output/part-00000
Deleted hdfs://jelly.ad.hdfgroup.org:8020/tmp/hdfs-vfd-output/part-00001
Deleted hdfs://jelly.ad.hdfgroup.org:8020/tmp/hdfs-vfd-output/part-00002
2018-10-25 10:28:31,885 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2018-10-25 10:28:31,935 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2018-10-25 10:28:31,935 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2018-10-25 10:28:31,948 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2018-10-25 10:28:32,508 INFO mapred.FileInputFormat: Total input files to process : 2
2018-10-25 10:28:32,542 INFO mapreduce.JobSubmitter: number of splits:2
2018-10-25 10:28:32,562 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2018-10-25 10:28:32,563 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2018-10-25 10:28:32,628 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local970602474_0001
2018-10-25 10:28:32,629 INFO mapreduce.JobSubmitter: Executing with tokens: []
2018-10-25 10:28:32,708 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2018-10-25 10:28:32,709 INFO mapreduce.Job: Running job: job_local970602474_0001
2018-10-25 10:28:32,711 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2018-10-25 10:28:32,713 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
2018-10-25 10:28:32,719 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2018-10-25 10:28:32,719 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-10-25 10:28:32,836 INFO mapred.LocalJobRunner: Waiting for map tasks
2018-10-25 10:28:32,840 INFO mapred.LocalJobRunner: Starting task: attempt_local970602474_0001_m_000000_0
2018-10-25 10:28:32,872 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2018-10-25 10:28:32,872 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-10-25 10:28:32,888 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2018-10-25 10:28:32,898 INFO mapred.MapTask: Processing split: hdfs://jelly.ad.hdfgroup.org:8020/tmp/input1:0+825
2018-10-25 10:28:32,917 INFO mapred.MapTask: numReduceTasks: 3
2018-10-25 10:28:32,956 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2018-10-25 10:28:32,956 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2018-10-25 10:28:32,956 INFO mapred.MapTask: soft limit at 83886080
2018-10-25 10:28:32,956 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2018-10-25 10:28:32,956 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
MapOutputBuffer
2018-10-25 10:28:32,965 INFO streaming.PipeMapRed: PipeMapRed exec [/mnt/wrk/gheber/Bitbucket/ghorg/ESE/././mapper.sh]
2018-10-25 10:28:32,970 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
2018-10-25 10:28:32,970 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
2018-10-25 10:28:32,971 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
2018-10-25 10:28:32,971 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2018-10-25 10:28:32,972 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
2018-10-25 10:28:32,972 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
2018-10-25 10:28:32,972 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
2018-10-25 10:28:32,972 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2018-10-25 10:28:32,972 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
2018-10-25 10:28:32,973 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
2018-10-25 10:28:32,973 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
2018-10-25 10:28:32,973 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
/usr/bin/bash: ml: line 1: syntax error: unexpected end of file
/usr/bin/bash: error importing function definition for `BASH_FUNC_ml'
/usr/bin/bash: module: line 1: syntax error: unexpected end of file
/usr/bin/bash: error importing function definition for `BASH_FUNC_module'
2018-10-25 10:28:33,057 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:33,058 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:33,713 INFO mapreduce.Job: Job job_local970602474_0001 running in uber mode : false
reduce 0%
2018-10-25 10:28:36,212 INFO streaming.PipeMapRed: Records R/W=25/1
2018-10-25 10:28:38,643 INFO streaming.PipeMapRed: MRErrorThread done
2018-10-25 10:28:38,644 INFO streaming.PipeMapRed: mapRedFinished
2018-10-25 10:28:38,648 INFO mapred.LocalJobRunner:
2018-10-25 10:28:38,648 INFO mapred.MapTask: Starting flush of map output
2018-10-25 10:28:38,648 INFO mapred.MapTask: Spilling map output
2018-10-25 10:28:38,648 INFO mapred.MapTask: bufstart = 0; bufend = 11375; bufvoid = 104857600
2018-10-25 10:28:38,648 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26213100(104852400); length = 1297/6553600
2018-10-25 10:28:38,664 INFO mapred.MapTask: Finished spill 0
2018-10-25 10:28:38,678 INFO mapred.Task: Task:attempt_local970602474_0001_m_000000_0 is done. And is in the process of committing
2018-10-25 10:28:38,683 INFO mapred.LocalJobRunner: Records R/W=25/1
2018-10-25 10:28:38,683 INFO mapred.Task: Task 'attempt_local970602474_0001_m_000000_0' done.
2018-10-25 10:28:38,692 INFO mapred.Task: Final Counters for attempt_local970602474_0001_m_000000_0: Counters: 22
        File System Counters
                FILE: Number of bytes read=176593
                FILE: Number of bytes written=688995
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=825
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=7
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=1
        Map-Reduce Framework
                Map input records=25
                Map output records=325
                Map output bytes=11375
                Map output materialized bytes=12043
                Input split bytes=96
                Combine input records=0
                Spilled Records=325
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=0
                Total committed heap usage (bytes)=1547698176
        File Input Format Counters
                Bytes Read=825
2018-10-25 10:28:38,692 INFO mapred.LocalJobRunner: Finishing task: attempt_local970602474_0001_m_000000_0
2018-10-25 10:28:38,693 INFO mapred.LocalJobRunner: Starting task: attempt_local970602474_0001_m_000001_0
2018-10-25 10:28:38,694 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2018-10-25 10:28:38,694 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-10-25 10:28:38,695 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2018-10-25 10:28:38,696 INFO mapred.MapTask: Processing split: hdfs://jelly.ad.hdfgroup.org:8020/tmp/input2:0+251
2018-10-25 10:28:38,699 INFO mapred.MapTask: numReduceTasks: 3
reduce 0%
2018-10-25 10:28:38,734 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2018-10-25 10:28:38,734 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2018-10-25 10:28:38,734 INFO mapred.MapTask: soft limit at 83886080
2018-10-25 10:28:38,734 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2018-10-25 10:28:38,734 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
MapOutputBuffer
2018-10-25 10:28:38,740 INFO streaming.PipeMapRed: PipeMapRed exec [/mnt/wrk/gheber/Bitbucket/ghorg/ESE/././mapper.sh]
/usr/bin/bash: ml: line 1: syntax error: unexpected end of file
/usr/bin/bash: error importing function definition for `BASH_FUNC_ml'
2018-10-25 10:28:38,750 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:38,750 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
/usr/bin/bash: module: line 1: syntax error: unexpected end of file
/usr/bin/bash: error importing function definition for `BASH_FUNC_module'
2018-10-25 10:28:41,126 INFO streaming.PipeMapRed: Records R/W=10/1
2018-10-25 10:28:42,275 INFO streaming.PipeMapRed: MRErrorThread done
2018-10-25 10:28:42,276 INFO streaming.PipeMapRed: mapRedFinished
2018-10-25 10:28:42,277 INFO mapred.LocalJobRunner:
2018-10-25 10:28:42,277 INFO mapred.MapTask: Starting flush of map output
2018-10-25 10:28:42,277 INFO mapred.MapTask: Spilling map output
2018-10-25 10:28:42,277 INFO mapred.MapTask: bufstart = 0; bufend = 9096; bufvoid = 104857600
2018-10-25 10:28:42,277 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26212668(104850672); length = 1729/6553600
2018-10-25 10:28:42,281 INFO mapred.MapTask: Finished spill 0
2018-10-25 10:28:42,283 INFO mapred.Task: Task:attempt_local970602474_0001_m_000001_0 is done. And is in the process of committing
2018-10-25 10:28:42,287 INFO mapred.LocalJobRunner: Records R/W=10/1
2018-10-25 10:28:42,287 INFO mapred.Task: Task 'attempt_local970602474_0001_m_000001_0' done.
2018-10-25 10:28:42,288 INFO mapred.Task: Final Counters for attempt_local970602474_0001_m_000001_0: Counters: 22
        File System Counters
                FILE: Number of bytes read=176804
                FILE: Number of bytes written=699055
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1076
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=1
        Map-Reduce Framework
                Map input records=10
                Map output records=433
                Map output bytes=9096
                Map output materialized bytes=9980
                Input split bytes=96
                Combine input records=0
                Spilled Records=433
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=0
                Total committed heap usage (bytes)=1547698176
        File Input Format Counters
                Bytes Read=251
2018-10-25 10:28:42,288 INFO mapred.LocalJobRunner: Finishing task: attempt_local970602474_0001_m_000001_0
2018-10-25 10:28:42,288 INFO mapred.LocalJobRunner: map task executor complete.
2018-10-25 10:28:42,295 INFO mapred.LocalJobRunner: Waiting for reduce tasks
2018-10-25 10:28:42,296 INFO mapred.LocalJobRunner: Starting task: attempt_local970602474_0001_r_000000_0
2018-10-25 10:28:42,307 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2018-10-25 10:28:42,307 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-10-25 10:28:42,308 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2018-10-25 10:28:42,314 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@439c1391
2018-10-25 10:28:42,317 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2018-10-25 10:28:42,348 INFO reduce.MergeManagerImpl: The max number of bytes for a single in-memory shuffle cannot be larger than Integer.MAX_VALUE. Setting it to Integer.MAX_VALUE
2018-10-25 10:28:42,348 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=20041957376, maxSingleShuffleLimit=2147483647, mergeThreshold=13227692032, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2018-10-25 10:28:42,352 INFO reduce.EventFetcher: attempt_local970602474_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
1 about to shuffle output of map attempt_local970602474_0001_m_000000_0 decomp: 3850 len: 3854 to MEMORY
2018-10-25 10:28:42,391 INFO reduce.InMemoryMapOutput: Read 3850 bytes from map-output for attempt_local970602474_0001_m_000000_0
map-output of size: 3850, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->3850
1 about to shuffle output of map attempt_local970602474_0001_m_000001_0 decomp: 964 len: 968 to MEMORY
2018-10-25 10:28:42,396 INFO reduce.InMemoryMapOutput: Read 964 bytes from map-output for attempt_local970602474_0001_m_000001_0
map-output of size: 964, inMemoryMapOutputs.size() -> 2, commitMemory -> 3850, usedMemory ->4814
2018-10-25 10:28:42,397 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
2018-10-25 10:28:42,398 INFO mapred.LocalJobRunner: 2 / 2 copied.
2018-10-25 10:28:42,398 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
2018-10-25 10:28:42,406 INFO mapred.Merger: Merging 2 sorted segments
2018-10-25 10:28:42,406 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 4744 bytes
2018-10-25 10:28:42,409 INFO reduce.MergeManagerImpl: Merged 2 segments, 4814 bytes to disk to satisfy reduce memory limit
2018-10-25 10:28:42,409 INFO reduce.MergeManagerImpl: Merging 1 files, 4816 bytes from disk
2018-10-25 10:28:42,410 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2018-10-25 10:28:42,410 INFO mapred.Merger: Merging 1 sorted segments
2018-10-25 10:28:42,411 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 4777 bytes
2018-10-25 10:28:42,411 INFO mapred.LocalJobRunner: 2 / 2 copied.
2018-10-25 10:28:42,415 INFO streaming.PipeMapRed: PipeMapRed exec [/mnt/wrk/gheber/Bitbucket/ghorg/ESE/././hdfs-vfd-reducer]
2018-10-25 10:28:42,417 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2018-10-25 10:28:42,418 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2018-10-25 10:28:42,548 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:42,549 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:42,551 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:42,554 INFO streaming.PipeMapRed: MRErrorThread done
2018-10-25 10:28:42,557 INFO streaming.PipeMapRed: Records R/W=130/1
2018-10-25 10:28:42,557 INFO streaming.PipeMapRed: mapRedFinished
2018-10-25 10:28:42,660 INFO mapred.Task: Task:attempt_local970602474_0001_r_000000_0 is done. And is in the process of committing
2018-10-25 10:28:42,663 INFO mapred.LocalJobRunner: 2 / 2 copied.
2018-10-25 10:28:42,663 INFO mapred.Task: Task attempt_local970602474_0001_r_000000_0 is allowed to commit now
2018-10-25 10:28:42,700 INFO output.FileOutputCommitter: Saved output of task 'attempt_local970602474_0001_r_000000_0' to hdfs://jelly.ad.hdfgroup.org:8020/tmp/hdfs-vfd-output
reduce
2018-10-25 10:28:42,701 INFO mapred.Task: Task 'attempt_local970602474_0001_r_000000_0' done.
2018-10-25 10:28:42,702 INFO mapred.Task: Final Counters for attempt_local970602474_0001_r_000000_0: Counters: 29
        File System Counters
                FILE: Number of bytes read=189972
                FILE: Number of bytes written=703871
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1076
                HDFS: Number of bytes written=452
                HDFS: Number of read operations=14
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Map-Reduce Framework
                Combine input records=0
                Combine output records=0
                Reduce input groups=10
                Reduce shuffle bytes=4822
                Reduce input records=130
                Reduce output records=10
                Spilled Records=130
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=0
                Total committed heap usage (bytes)=1547698176
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Output Format Counters
                Bytes Written=452
2018-10-25 10:28:42,702 INFO mapred.LocalJobRunner: Finishing task: attempt_local970602474_0001_r_000000_0
2018-10-25 10:28:42,703 INFO mapred.LocalJobRunner: Starting task: attempt_local970602474_0001_r_000001_0
2018-10-25 10:28:42,705 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2018-10-25 10:28:42,705 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-10-25 10:28:42,705 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2018-10-25 10:28:42,705 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@578316ce
2018-10-25 10:28:42,706 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2018-10-25 10:28:42,707 INFO reduce.MergeManagerImpl: The max number of bytes for a single in-memory shuffle cannot be larger than Integer.MAX_VALUE. Setting it to Integer.MAX_VALUE
2018-10-25 10:28:42,707 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=20041957376, maxSingleShuffleLimit=2147483647, mergeThreshold=13227692032, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2018-10-25 10:28:42,708 INFO reduce.EventFetcher: attempt_local970602474_0001_r_000001_0 Thread started: EventFetcher for fetching Map Completion Events
2 about to shuffle output of map attempt_local970602474_0001_m_000000_0 decomp: 3850 len: 3854 to MEMORY
2018-10-25 10:28:42,713 INFO reduce.InMemoryMapOutput: Read 3850 bytes from map-output for attempt_local970602474_0001_m_000000_0
map-output of size: 3850, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->3850
2 about to shuffle output of map attempt_local970602474_0001_m_000001_0 decomp: 8008 len: 8012 to MEMORY
2018-10-25 10:28:42,716 INFO reduce.InMemoryMapOutput: Read 8008 bytes from map-output for attempt_local970602474_0001_m_000001_0
map-output of size: 8008, inMemoryMapOutputs.size() -> 2, commitMemory -> 3850, usedMemory ->11858
2018-10-25 10:28:42,717 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
2018-10-25 10:28:42,717 INFO mapred.LocalJobRunner: 2 / 2 copied.
2018-10-25 10:28:42,717 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
2018-10-25 10:28:42,719 INFO mapred.Merger: Merging 2 sorted segments
2018-10-25 10:28:42,719 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 11788 bytes
2018-10-25 10:28:42,721 INFO reduce.MergeManagerImpl: Merged 2 segments, 11858 bytes to disk to satisfy reduce memory limit
2018-10-25 10:28:42,722 INFO reduce.MergeManagerImpl: Merging 1 files, 11860 bytes from disk
2018-10-25 10:28:42,722 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2018-10-25 10:28:42,722 INFO mapred.Merger: Merging 1 sorted segments
2018-10-25 10:28:42,722 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 11821 bytes
2018-10-25 10:28:42,722 INFO mapred.LocalJobRunner: 2 / 2 copied.
2018-10-25 10:28:42,726 INFO streaming.PipeMapRed: PipeMapRed exec [/mnt/wrk/gheber/Bitbucket/ghorg/ESE/././hdfs-vfd-reducer]
reduce 33%
2018-10-25 10:28:42,766 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:42,766 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:42,767 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:42,773 INFO streaming.PipeMapRed: MRErrorThread done
2018-10-25 10:28:42,774 INFO streaming.PipeMapRed: Records R/W=483/1
2018-10-25 10:28:42,775 INFO streaming.PipeMapRed: mapRedFinished
2018-10-25 10:28:42,832 INFO mapred.Task: Task:attempt_local970602474_0001_r_000001_0 is done. And is in the process of committing
2018-10-25 10:28:42,835 INFO mapred.LocalJobRunner: 2 / 2 copied.
2018-10-25 10:28:42,835 INFO mapred.Task: Task attempt_local970602474_0001_r_000001_0 is allowed to commit now
2018-10-25 10:28:42,857 INFO output.FileOutputCommitter: Saved output of task 'attempt_local970602474_0001_r_000001_0' to hdfs://jelly.ad.hdfgroup.org:8020/tmp/hdfs-vfd-output
reduce
2018-10-25 10:28:42,858 INFO mapred.Task: Task 'attempt_local970602474_0001_r_000001_0' done.
2018-10-25 10:28:42,859 INFO mapred.Task: Final Counters for attempt_local970602474_0001_r_000001_0: Counters: 29
        File System Counters
                FILE: Number of bytes read=215100
                FILE: Number of bytes written=715731
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1076
                HDFS: Number of bytes written=984
                HDFS: Number of read operations=19
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=5
        Map-Reduce Framework
                Combine input records=0
                Combine output records=0
                Reduce input groups=13
                Reduce shuffle bytes=11866
                Reduce input records=483
                Reduce output records=13
                Spilled Records=483
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=0
                Total committed heap usage (bytes)=1547698176
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Output Format Counters
                Bytes Written=532
2018-10-25 10:28:42,859 INFO mapred.LocalJobRunner: Finishing task: attempt_local970602474_0001_r_000001_0
2018-10-25 10:28:42,859 INFO mapred.LocalJobRunner: Starting task: attempt_local970602474_0001_r_000002_0
2018-10-25 10:28:42,861 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2018-10-25 10:28:42,861 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-10-25 10:28:42,862 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2018-10-25 10:28:42,862 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@6aff7da3
2018-10-25 10:28:42,862 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2018-10-25 10:28:42,863 INFO reduce.MergeManagerImpl: The max number of bytes for a single in-memory shuffle cannot be larger than Integer.MAX_VALUE. Setting it to Integer.MAX_VALUE
2018-10-25 10:28:42,863 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=20041957376, maxSingleShuffleLimit=2147483647, mergeThreshold=13227692032, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2018-10-25 10:28:42,864 INFO reduce.EventFetcher: attempt_local970602474_0001_r_000002_0 Thread started: EventFetcher for fetching Map Completion Events
3 about to shuffle output of map attempt_local970602474_0001_m_000000_0 decomp: 4331 len: 4335 to MEMORY
2018-10-25 10:28:42,886 INFO reduce.InMemoryMapOutput: Read 4331 bytes from map-output for attempt_local970602474_0001_m_000000_0
map-output of size: 4331, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->4331
3 about to shuffle output of map attempt_local970602474_0001_m_000001_0 decomp: 996 len: 1000 to MEMORY
2018-10-25 10:28:42,891 INFO reduce.InMemoryMapOutput: Read 996 bytes from map-output for attempt_local970602474_0001_m_000001_0
map-output of size: 996, inMemoryMapOutputs.size() -> 2, commitMemory -> 4331, usedMemory ->5327
2018-10-25 10:28:42,892 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
2018-10-25 10:28:42,893 INFO mapred.LocalJobRunner: 2 / 2 copied.
2018-10-25 10:28:42,894 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
2018-10-25 10:28:42,895 INFO mapred.Merger: Merging 2 sorted segments
2018-10-25 10:28:42,895 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 5257 bytes
2018-10-25 10:28:42,897 INFO reduce.MergeManagerImpl: Merged 2 segments, 5327 bytes to disk to satisfy reduce memory limit
2018-10-25 10:28:42,897 INFO reduce.MergeManagerImpl: Merging 1 files, 5329 bytes from disk
2018-10-25 10:28:42,898 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2018-10-25 10:28:42,898 INFO mapred.Merger: Merging 1 sorted segments
2018-10-25 10:28:42,898 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 5290 bytes
2018-10-25 10:28:42,899 INFO mapred.LocalJobRunner: 2 / 2 copied.
2018-10-25 10:28:42,903 INFO streaming.PipeMapRed: PipeMapRed exec [/mnt/wrk/gheber/Bitbucket/ghorg/ESE/././hdfs-vfd-reducer]
2018-10-25 10:28:42,934 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:42,934 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:42,934 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2018-10-25 10:28:42,936 INFO streaming.PipeMapRed: MRErrorThread done
2018-10-25 10:28:42,937 INFO streaming.PipeMapRed: Records R/W=145/1
2018-10-25 10:28:42,938 INFO streaming.PipeMapRed: mapRedFinished
2018-10-25 10:28:42,982 INFO mapred.Task: Task:attempt_local970602474_0001_r_000002_0 is done. And is in the process of committing
2018-10-25 10:28:42,985 INFO mapred.LocalJobRunner: 2 / 2 copied.
2018-10-25 10:28:42,985 INFO mapred.Task: Task attempt_local970602474_0001_r_000002_0 is allowed to commit now
2018-10-25 10:28:43,024 INFO output.FileOutputCommitter: Saved output of task 'attempt_local970602474_0001_r_000002_0' to hdfs://jelly.ad.hdfgroup.org:8020/tmp/hdfs-vfd-output
reduce
2018-10-25 10:28:43,025 INFO mapred.Task: Task 'attempt_local970602474_0001_r_000002_0' done.
2018-10-25 10:28:43,026 INFO mapred.Task: Final Counters for attempt_local970602474_0001_r_000002_0: Counters: 29
        File System Counters
                FILE: Number of bytes read=225924
                FILE: Number of bytes written=721060
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1076
                HDFS: Number of bytes written=1505
                HDFS: Number of read operations=24
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=7
        Map-Reduce Framework
                Combine input records=0
                Combine output records=0
                Reduce input groups=12
                Reduce shuffle bytes=5335
                Reduce input records=145
                Reduce output records=12
                Spilled Records=145
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=18
                Total committed heap usage (bytes)=1560805376
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Output Format Counters
                Bytes Written=521
2018-10-25 10:28:43,026 INFO mapred.LocalJobRunner: Finishing task: attempt_local970602474_0001_r_000002_0
2018-10-25 10:28:43,026 INFO mapred.LocalJobRunner: reduce task executor complete.
reduce 100%
2018-10-25 10:28:43,741 INFO mapreduce.Job: Job job_local970602474_0001 completed successfully
2018-10-25 10:28:43,769 INFO mapreduce.Job: Counters: 35
        File System Counters
                FILE: Number of bytes read=984393
                FILE: Number of bytes written=3528712
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=5129
                HDFS: Number of bytes written=2941
                HDFS: Number of read operations=73
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=17
        Map-Reduce Framework
                Map input records=35
                Map output records=758
                Map output bytes=20471
                Map output materialized bytes=22023
                Input split bytes=192
                Combine input records=0
                Combine output records=0
                Reduce input groups=35
                Reduce shuffle bytes=22023
                Reduce input records=758
                Reduce output records=35
                Spilled Records=1516
                Shuffled Maps =6
                Failed Shuffles=0
                Merged Map outputs=6
                GC time elapsed (ms)=18
                Total committed heap usage (bytes)=7751598080
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1076
        File Output Format Counters
                Bytes Written=1505
2018-10-25 10:28:43,770 INFO streaming.StreamJob: Output directory: hdfs://jelly.ad.hdfgroup.org:8020/tmp/hdfs-vfd-output
[gheber@jelly ESE]$ /tmp/GSSTF_NCEP.3.1987.12.02.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.05.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.08.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.11.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.14.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.17.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.20.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.23.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.26.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.29.he5	G 8	 D 5	 T 0
/tmp/GSSTF_NCEP.3.1987.12.03.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.06.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.09.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.12.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.15.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.18.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.21.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.24.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.27.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.30.he5	G 8	D 5	T 0
/tmp/efitOut.nc	G 35	D 305	T 7
/tmp/sample.h5	G 4	D 0	T 0
/tmp/t.h5	G 1	 D 1	 T 0
/tmp/GSSTF_NCEP.3.1987.12.01.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.04.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.07.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.10.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.13.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.16.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.19.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.22.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.25.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.28.he5	G 8	D 5	T 0
/tmp/GSSTF_NCEP.3.1987.12.31.he5	G 8	D 5	T 0
/tmp/foo.h5	G 2	 D 0	 T 0
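
The per-file counts above can be totaled with a short pipeline. The following is a minimal sketch and not part of the connector: summarize_counts is a hypothetical helper name, and the sketch assumes that every record has the form shown above (a path followed by the G, D, and T counts).

```shell
#!/bin/sh
# Hypothetical helper: sum the G/D/T columns of the streaming job output read from stdin.
# Default awk field splitting tolerates the mixed tab/space separators seen in the output.
summarize_counts() {
  awk '{ files++; g += $3; d += $5; t += $7 }
       END { printf "files=%d G=%d D=%d T=%d total=%d\n", files, g, d, t, g + d + t }'
}

# Example with three records copied from the output above.
printf '/tmp/efitOut.nc\tG 35\tD 305\tT 7\n/tmp/sample.h5\tG 4\tD 0\tT 0\n/tmp/t.h5\tG 1\t D 1\t T 0\n' |
  summarize_counts
# prints: files=3 G=40 D=306 T=7 total=353
```

Applied to all 35 records listed above, the totals are G=290, D=461, and T=7, which add up to 758, the same number the job log reports for Map output records. To feed it the real job output, a command such as `$HADOOP_HOME/bin/hdfs dfs -cat /tmp/hdfs-vfd-output/part-* | summarize_counts` should work, assuming the reducers wrote the usual part-* files into the output directory named in the log.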

Summary

The HDF5 VFD for HDFS provides transparent read access to HDF5 files stored in a Hadoop file system. No code changes are required other than loading the HDFS VFD and linking against an updated version of the HDF5 library. This applies to the HDF5 command line tools as well as existing applications. You can use this VFD to bulk process HDF5 (and NetCDF-4) files stored in HDFS with frameworks such as Hadoop streaming.

Additional Documentation

Installation Guide

Overview

The purpose of this document is to describe the process of building a version of the HDF5 library that contains a virtual file driver (VFD) providing read-only access to HDF5 files stored in a Hadoop Distributed File System (HDFS). We list the necessary prerequisites and show how to verify build correctness.

This document is intended for users who are familiar with building libraries from source code using the GNU Autotools. The instructions are applicable to a wide variety of GNU/Linux systems.

Prerequisites

In addition to the standard prerequisites for building the HDF5 library, the build process requires the Java Development Kit (JDK), the libhdfs header and library (part of Hadoop), and the HDFS VFD source code (distributed as a patch).

To be specific, we show the build process step-by-step on a “vanilla” Debian GNU/Linux 9 system.

uname -a

Tools

The build process depends on a C compiler, the GNU Autotools, the JDK, and libhdfs. On a Debian 9 system, these dependencies can be installed as follows:

sudo apt-get -y update --fix-missing
sudo apt-get -y install build-essential automake default-jdk
wget http://apache.osuosl.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -zxvf hadoop-3.1.1.tar.gz

In this document, we use Java 1.8 and Hadoop 3.1.1 to build the HDFS VFD. Other combinations / versions may work equally well, but no systematic testing has been done.

gcc -v 2>&1
java -version 2>&1

Environment

The build process depends on the following environment variables:

export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_HOME=$HOME/hadoop-3.1.1
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
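
Before building, it can help to confirm that these variables are actually set in the current shell. The following is a minimal sketch under stated assumptions: check_hdfs_vfd_env is a hypothetical helper name, and it only checks that the four variables are non-empty and that JAVA_HOME and HADOOP_HOME point to directories. It does not validate the contents of CLASSPATH or LD_LIBRARY_PATH.

```shell
#!/bin/sh
# Hypothetical helper: report missing pieces of the environment described above.
# Returns non-zero if any check fails.
check_hdfs_vfd_env() {
  status=0
  for name in JAVA_HOME HADOOP_HOME CLASSPATH LD_LIBRARY_PATH; do
    eval "value=\${$name-}"          # look up the variable whose name is in $name
    if [ -z "$value" ]; then
      echo "not set: $name" >&2
      status=1
    fi
  done
  for name in JAVA_HOME HADOOP_HOME; do
    eval "value=\${$name-}"
    if [ -n "$value" ] && [ ! -d "$value" ]; then
      echo "not a directory: $name=$value" >&2
      status=1
    fi
  done
  return "$status"
}

check_hdfs_vfd_env || echo "environment is incomplete" >&2
```

The function writes one line per problem to standard error, so it can be dropped into a build script ahead of the configure step.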

Files

Currently, the HDF5 VFD for HDFS is built into the HDF5 library. (A model where VFDs can be loaded dynamically like filters is under consideration and might be available in the future.) You need a copy of the HDF5 library source code and a copy of the HDFS VFD source code, which is distributed as a patch file.

In this document, we assume that the HDF5 library source code is available in $HOME/hdf5-1.10.4 and that the patch vfds_hdf5_1_10_4.patch is located in $HOME/hdf5-1.10.4-vfds.

find . -maxdepth 1 -type d

Patching the HDF5 Library Source Code

The patch command shown below should be run in the directory containing the HDF5 library source code. It should run to completion without errors.

cd $HOME/hdf5-1.10.4
patch -p 1 < $HOME/hdf5-1.10.4-vfds/vfds_hdf5_1_10_4.patch

Building the HDF5 Library

The compilation of the HDFS VFD is enabled through the --with-libhdfs option of configure. Its expected value is the Hadoop installation directory where configure is supposed to find the libhdfs header and library.

./configure --prefix $HOME/.local --with-libhdfs=$HADOOP_HOME | grep HDFS
make -j3 &> build.log
tail build.log
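
Because configure looks for the libhdfs header and library under the directory passed to --with-libhdfs, a quick check of that directory before configuring can save a failed build. The following is a sketch only: check_libhdfs is a hypothetical helper name, and the relative paths include/hdfs.h and lib/native/libhdfs.so are an assumption based on the layout of the Hadoop binary distribution used in this document.

```shell
#!/bin/sh
# Hypothetical helper: verify that a Hadoop installation directory contains
# the libhdfs header and shared library (assumed layout of the binary tarball).
check_libhdfs() {
  hadoop_dir=$1
  missing=0
  for path in include/hdfs.h lib/native/libhdfs.so; do
    if [ ! -e "$hadoop_dir/$path" ]; then
      echo "not found: $hadoop_dir/$path" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_libhdfs "${HADOOP_HOME:-/nonexistent}"; then
  echo "libhdfs header and library found"
else
  echo "fix HADOOP_HOME before running configure" >&2
fi
```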

Testing the HDFS Support

An HDFS instance is required to verify the proper functioning of the VFD. Fortunately, HDFS can be deployed in standalone or pseudo-distributed mode in non-clustered environments for testing and development. The HDFS VFD bundle includes a set of tests and the scripts to set up such transient HDFS deployments.

The tests depend on the Hadoop/HDFS configuration stored in core-site.xml and hdfs-site.xml. After copying these files to $HADOOP_HOME/etc/hadoop, a transient HDFS instance can be created by invoking the setup.sh script in the hadoop-testing-bundle folder of the test bundle. Note that this script and the matching teardown.sh script are intended to be run from that sub-directory.

cd $HOME/hdf5-1.10.4-vfds/hadoop-testing-bundle
cp *.xml $HADOOP_HOME/etc/hadoop
bash ./setup.sh

To run all HDF5 tests, we could now run make check in $HOME/hdf5-1.10.4. This would include the HDFS-related tests. However, if only the HDFS VFD tests are needed, it is sufficient to invoke the hdfs executable in the test sub-directory.

cd $HOME/hdf5-1.10.4/test
./hdfs

Finally, the resources associated with the transient HDFS instance should be released by invoking the teardown.sh script in the hadoop-testing-bundle directory.

cd $HOME/hdf5-1.10.4-vfds/hadoop-testing-bundle
bash ./teardown.sh
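
The setup, test, and teardown steps above can be combined so that teardown.sh runs even when the test fails. The following is a minimal sketch, not part of the test bundle: run_with_transient_hdfs is a hypothetical helper name, and it assumes that the configuration files have already been copied to $HADOOP_HOME/etc/hadoop as shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: run a command between setup.sh and teardown.sh.
# Usage: run_with_transient_hdfs BUNDLE_DIR COMMAND [ARGS...]
run_with_transient_hdfs() {
  bundle_dir=$1
  shift
  # setup.sh and teardown.sh are meant to be run from inside the bundle directory.
  ( cd "$bundle_dir" && bash ./setup.sh ) || return 1
  test_status=0
  "$@" || test_status=$?
  # Always attempt the teardown, but report the status of the command itself.
  ( cd "$bundle_dir" && bash ./teardown.sh ) || echo "warning: teardown.sh failed" >&2
  return "$test_status"
}

# Example with the paths used in this document (not executed here):
#   run_with_transient_hdfs "$HOME/hdf5-1.10.4-vfds/hadoop-testing-bundle" \
#     sh -c 'cd "$HOME/hdf5-1.10.4/test" && ./hdfs'
```

The command runs in the caller's working directory, which is why the example changes into the test sub-directory explicitly.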

Summary

In this document, we have shown how to configure, build, and test a version of the HDF5 library that includes a VFD for HDFS in GNU/Linux environments. Please see the User's Guide for information on how to use the VFD with the HDF5 tools and in your applications.

For additional help or further questions, please contact support@hdfgroup.org.

User's Guide

Overview

The purpose of this document is to describe the use of the HDF5 VFD for HDFS with existing HDF5 tools and in application development. This document does not cover the installation of the HDFS VFD. Please refer to the Installation Guide for information on how to build and install the HDF5 VFD for HDFS.

Prerequisites

  1. A version of the HDF5 library that was built with the HDF5 VFD for HDFS enabled.
  2. An HDFS instance, such as an HDFS cluster or a transient deployment as described in the Installation Guide.
  3. The following variables must be set for your environment: JAVA_HOME, HADOOP_HOME, CLASSPATH, LD_LIBRARY_PATH.

Example

In this document, we use the following setup.

The HDF5 library and tools are installed in $HOME/.local

ls -lR --hide=share $HOME/.local
/home/admin/.local:
total 12
drwxr-xr-x 2 admin admin 4096 Sep 27 12:07 bin
drwxr-xr-x 2 admin admin 4096 Sep 27 12:07 include
drwxr-xr-x 2 admin admin 4096 Sep 27 12:07 lib

/home/admin/.local/bin:
total 9760
-rwxr-xr-x 1 admin admin 523528 Sep 27 12:07 gif2h5
-rwxr-xr-x 1 admin admin 498256 Sep 27 12:07 h52gif
-rwxr-xr-x 1 admin admin  13281 Sep 27 12:07 h5cc
-rwxr-xr-x 1 admin admin 489184 Sep 27 12:07 h5clear
-rwxr-xr-x 1 admin admin 495952 Sep 27 12:07 h5copy
-rwxr-xr-x 1 admin admin 103704 Sep 27 12:07 h5debug
-rwxr-xr-x 1 admin admin 909592 Sep 27 12:07 h5diff
-rwxr-xr-x 1 admin admin 767808 Sep 27 12:07 h5dump
-rwxr-xr-x 1 admin admin 489848 Sep 27 12:07 h5format_convert
-rwxr-xr-x 1 admin admin 653672 Sep 27 12:07 h5import
-rwxr-xr-x 1 admin admin 493016 Sep 27 12:07 h5jam
-rwxr-xr-x 1 admin admin 582088 Sep 27 12:07 h5ls
-rwxr-xr-x 1 admin admin 481088 Sep 27 12:07 h5mkgrp
-rwxr-xr-x 1 admin admin 585632 Sep 27 12:07 h5perf_serial
-rwxr-xr-x 1 admin admin   5913 Sep 27 12:07 h5redeploy
-rwxr-xr-x 1 admin admin 785888 Sep 27 12:07 h5repack
-rwxr-xr-x 1 admin admin  45832 Sep 27 12:07 h5repart
-rwxr-xr-x 1 admin admin 540704 Sep 27 12:07 h5stat
-rwxr-xr-x 1 admin admin 482552 Sep 27 12:07 h5tools_utils
-rwxr-xr-x 1 admin admin 489248 Sep 27 12:07 h5unjam
-rwxr-xr-x 1 admin admin 509408 Sep 27 12:07 h5watch

/home/admin/.local/include:
total 544
-rw-r--r-- 1 admin admin  25488 Sep 27 12:07 H5ACpublic.h
-rw-r--r-- 1 admin admin   9720 Sep 27 12:07 H5api_adpt.h
-rw-r--r-- 1 admin admin   5471 Sep 27 12:07 H5Apublic.h
-rw-r--r-- 1 admin admin   1760 Sep 27 12:07 H5Cpublic.h
-rw-r--r-- 1 admin admin   1931 Sep 27 12:07 H5DOpublic.h
-rw-r--r-- 1 admin admin   8656 Sep 27 12:07 H5Dpublic.h
-rw-r--r-- 1 admin admin   2590 Sep 27 12:07 H5DSpublic.h
-rw-r--r-- 1 admin admin  21845 Sep 27 12:07 H5Epubgen.h
-rw-r--r-- 1 admin admin   9023 Sep 27 12:07 H5Epublic.h
-rw-r--r-- 1 admin admin   1490 Sep 27 12:07 H5FDcore.h
-rw-r--r-- 1 admin admin   1955 Sep 27 12:07 H5FDdirect.h
-rw-r--r-- 1 admin admin   1506 Sep 27 12:07 H5FDfamily.h
-rw-r--r-- 1 admin admin   3691 Sep 27 12:07 H5FDhdfs.h
-rw-r--r-- 1 admin admin   3322 Sep 27 12:07 H5FDlog.h
-rw-r--r-- 1 admin admin   2573 Sep 27 12:07 H5FDmpi.h
-rw-r--r-- 1 admin admin   2440 Sep 27 12:07 H5FDmpio.h
-rw-r--r-- 1 admin admin   1800 Sep 27 12:07 H5FDmulti.h
-rw-r--r-- 1 admin admin  17115 Sep 27 12:07 H5FDpublic.h
-rw-r--r-- 1 admin admin   3420 Sep 27 12:07 H5FDros3.h
-rw-r--r-- 1 admin admin   1339 Sep 27 12:07 H5FDsec2.h
-rw-r--r-- 1 admin admin   1368 Sep 27 12:07 H5FDstdio.h
-rw-r--r-- 1 admin admin  14540 Sep 27 12:07 H5Fpublic.h
-rw-r--r-- 1 admin admin   7233 Sep 27 12:07 H5Gpublic.h
-rw-r--r-- 1 admin admin   3274 Sep 27 12:07 H5IMpublic.h
-rw-r--r-- 1 admin admin   4934 Sep 27 12:07 H5Ipublic.h
-rw-r--r-- 1 admin admin   1362 Sep 27 12:07 H5LDpublic.h
-rw-r--r-- 1 admin admin  10160 Sep 27 12:07 H5Lpublic.h
-rw-r--r-- 1 admin admin  14182 Sep 27 12:07 H5LTpublic.h
-rw-r--r-- 1 admin admin   1775 Sep 27 12:07 H5MMpublic.h
-rw-r--r-- 1 admin admin  12103 Sep 27 12:07 H5Opublic.h
-rw-r--r-- 1 admin admin 116154 Sep 27 12:07 H5overflow.h
-rw-r--r-- 1 admin admin   1528 Sep 27 12:07 H5PLextern.h
-rw-r--r-- 1 admin admin   2402 Sep 27 12:07 H5PLpublic.h
-rw-r--r-- 1 admin admin  28815 Sep 27 12:07 H5Ppublic.h
-rw-r--r-- 1 admin admin   3861 Sep 27 12:07 H5PTpublic.h
-rw-r--r-- 1 admin admin  20413 Sep 27 12:07 H5pubconf.h
-rw-r--r-- 1 admin admin  12256 Sep 27 12:07 H5public.h
-rw-r--r-- 1 admin admin   3794 Sep 27 12:07 H5Rpublic.h
-rw-r--r-- 1 admin admin   7526 Sep 27 12:07 H5Spublic.h
-rw-r--r-- 1 admin admin   8419 Sep 27 12:07 H5TBpublic.h
-rw-r--r-- 1 admin admin  27251 Sep 27 12:07 H5Tpublic.h
-rw-r--r-- 1 admin admin  22479 Sep 27 12:07 H5version.h
-rw-r--r-- 1 admin admin  11250 Sep 27 12:07 H5Zpublic.h
-rw-r--r-- 1 admin admin   3467 Sep 27 12:07 hdf5.h
-rw-r--r-- 1 admin admin   1543 Sep 27 12:07 hdf5_hl.h

/home/admin/.local/lib:
total 64204
-rw-r--r-- 1 admin admin 43972818 Sep 27 12:07 libhdf5.a
-rw-r--r-- 1 admin admin   973650 Sep 27 12:07 libhdf5_hl.a
-rwxr-xr-x 1 admin admin     1122 Sep 27 12:07 libhdf5_hl.la
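lrwxrwxrwx 1 admin admin       21 Sep 27 12:07 libhdf5_hl.so -> libhdf5_hl.so.100.1.1
lrwxrwxrwx 1 admin admin       21 Sep 27 12:07 libhdf5_hl.so.100 -> libhdf5_hl.so.100.1.1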
-rwxr-xr-x 1 admin admin   581688 Sep 27 12:07 libhdf5_hl.so.100.1.1
-rwxr-xr-x 1 admin admin     1067 Sep 27 12:07 libhdf5.la
-rw-r--r-- 1 admin admin     3982 Sep 27 12:07 libhdf5.settings
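lrwxrwxrwx 1 admin admin       18 Sep 27 12:07 libhdf5.so -> libhdf5.so.103.0.0
lrwxrwxrwx 1 admin admin       18 Sep 27 12:07 libhdf5.so.103 -> libhdf5.so.103.0.0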
-rwxr-xr-x 1 admin admin 20195896 Sep 27 12:07 libhdf5.so.103.0.0

An HDFS namenode instance runs on localhost at port 8020.

The environment is configured as follows:

export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_HOME=$HOME/hadoop-3.1.1
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`

We’ve copied a sample HDF5 file t.h5 into the /tmp directory on the HDFS instance.

$HADOOP_HOME/bin/hdfs dfs -ls /tmp/*.h5
-rw-r--r--   1 admin supergroup      18528 2018-09-27 12:06 /tmp/t.h5

HDF5 Tool Support

Several HDF5 tools support the use of alternative VFDs. The tools that do typically provide a command-line option to select the desired VFD. Unfortunately, the naming of these options varies between tools. In this document, we use h5ls and h5dump as examples.

h5ls

This tool uses the --vfd=DRIVER option to select an alternative VFD.

$HOME/.local/bin/h5ls --vfd=hdfs -r hdfs://localhost/tmp/t.h5
/                        Group
/Dataset1                Dataset {128, 32}

Additional HDFS options, such as a non-default host name and port number, can be specified via the --hdfs-attrs option.

h5dump

This tool uses the --filedriver=DRIVER option to select an alternative VFD.

$HOME/.local/bin/h5dump --filedriver=hdfs -pBH hdfs://localhost/tmp/t.h5
HDF5 "hdfs://localhost/tmp/t.h5" {
SUPER_BLOCK {
   SUPERBLOCK_VERSION 0
   FREELIST_VERSION 0
   SYMBOLTABLE_VERSION 0
   OBJECTHEADER_VERSION 0
   OFFSET_SIZE 8
   LENGTH_SIZE 8
   BTREE_RANK 16
   BTREE_LEAF 4
   ISTORE_K 32
   FILE_SPACE_STRATEGY H5F_FSPACE_STRATEGY_FSM_AGGR
   FREE_SPACE_PERSIST FALSE
   FREE_SPACE_SECTION_THRESHOLD 1
   FILE_SPACE_PAGE_SIZE 4096
   USER_BLOCK {
      USERBLOCK_SIZE 0
   }
}
GROUP "/" {
   DATASET "Dataset1" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 128, 32 ) / ( 128, 32 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 16384
         OFFSET 2144
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_LATE
      }
   }
}
}
$HOME/.local/bin/h5dump --filedriver=hdfs -d /Dataset1 -s "64,8" -k "10,10" hdfs://localhost/tmp/t.h5
HDF5 "hdfs://localhost/tmp/t.h5" {
DATASET "/Dataset1" {
   DATATYPE  H5T_STD_I32LE
   DATASPACE  SIMPLE { ( 128, 32 ) / ( 128, 32 ) }
   SUBSET {
      START ( 64, 8 );
      STRIDE ( 1, 1 );
      COUNT ( 1, 1 );
      BLOCK ( 10, 10 );
      DATA {
      (64,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (65,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (66,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (67,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (68,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (69,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (70,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (71,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (72,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2,
      (73,8): 3, 4, 0, 1, 2, 3, 4, 0, 1, 2
      }
   }
}
}
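
The SUBSET block in the output above follows the usual hyperslab arithmetic: in each dimension, the selection begins at START and ends at START + (COUNT - 1) * STRIDE + BLOCK - 1. The following is a small sketch of that calculation; hyperslab_extent is a hypothetical helper name introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the index range and element count that a
# start/stride/count/block selection covers in one dimension.
hyperslab_extent() {
  start=$1 stride=$2 count=$3 block=$4
  last=$(( start + (count - 1) * stride + block - 1 ))
  echo "$start..$last ($(( count * block )) elements)"
}

hyperslab_extent 64 1 1 10   # first dimension of the example: prints 64..73 (10 elements)
hyperslab_extent 8 1 1 10    # second dimension of the example: prints 8..17 (10 elements)
```

This matches the ten rows (64 through 73) and ten values per row shown in the dump. Stride and count default to 1 when only -s and -k are given, as the output reports; h5dump accepts -S and -c to set them explicitly.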

Additional HDFS options, such as a non-default host name and port number, can be specified via the --hdfs-attrs option.

Application Development

The run-time behavior of the HDF5 library is highly configurable, and the VFD used to access an HDF5 file can be specified as a property in a file access property list. This way, the changes needed in existing applications to accommodate other VFDs are minimal.

HDFS VFD C-API

The HDFS VFD C-API consists of a structure that holds parameters controlling the interaction with an HDFS instance (e.g., port, credentials) and a function that adds a matching file access property to an existing file access property list.

typedef struct H5FD_hdfs_fapl_t {
  int32_t version;
  char    namenode_name[H5FD__HDFS_NODE_NAME_SPACE + 1];
  int32_t namenode_port;
  char    user_name[H5FD__HDFS_USER_NAME_SPACE + 1];
  char    kerberos_ticket_cache[H5FD__HDFS_KERB_CACHE_PATH_SPACE + 1];
  int32_t stream_buffer_size;
} H5FD_hdfs_fapl_t;

herr_t H5Pset_fapl_hdfs(hid_t fapl_id, H5FD_hdfs_fapl_t *fa);

The use of this API is illustrated below.

A Sample Program

The following program opens an HDF5 file and uses the HDF5 library to:

  1. Determine its size (H5Fget_filesize)
  2. Determine the address and number of attributes of the root group (H5Oget_info)
  3. Determine the number of links in the root group (H5Gget_info)

Notice that the only “deviation” from doing the same for an HDF5 file stored in a POSIX file system is the fapl initialization (lines 23 to 31).

 1: #include "hdf5.h"
 2: 
 3: #include <assert.h>
 4: #include <stdio.h>
 5: #include <string.h>
 6: 
 7: #define NAMENODE_NAME "localhost"
 8: #define NAMENODE_PORT 8020
 9: #define STREAM_BUFFER_SIZE 4096
10: #define FILE_NAME "hdfs://localhost/tmp/t.h5"
11: 
12: int main(int argc, char** argv)
13: {
14:   H5FD_hdfs_fapl_t param;
15:   hid_t            fapl, file;
16:   hsize_t          size;
17:   H5G_info_t       ginfo;
18:   H5O_info_t       oinfo;
19: 
20:   /* Create and initialize a file access property list. */
21:   /* assert() keeps the sample short; do not compile with -DNDEBUG, which would remove the calls. */
22: 
23:   assert((fapl = H5Pcreate(H5P_FILE_ACCESS)) >= 0);
24: 
25:   /* Zero the structure so that the unused string fields are empty rather than uninitialized. */
26:   memset(&param, 0, sizeof(param));
27:   param.version = 1;
28:   strcpy(param.namenode_name, NAMENODE_NAME);
29:   param.namenode_port = NAMENODE_PORT;
30:   param.stream_buffer_size = STREAM_BUFFER_SIZE;
31:   assert(H5Pset_fapl_hdfs(fapl, &param) >= 0);
32: 
33:   /* Open the file in read-only mode with the access property list. */
34: 
35:   assert((file = H5Fopen(FILE_NAME, H5F_ACC_RDONLY, fapl)) >= 0);
36: 
37:   /* Make HDF5 API calls as usual. */
38: 
39:   assert(H5Fget_filesize(file, &size) >= 0);
40:   printf("File size\t\t: %llu [bytes]\n", (unsigned long long)size);
41: 
42:   assert(H5Oget_info(file, &oinfo) >= 0);
43:   printf("Root group address\t: %llu\n", (unsigned long long)oinfo.addr);
44:   printf("Root group attr. count\t: %llu\n", (unsigned long long)oinfo.num_attrs);
45: 
46:   assert(H5Gget_info(file, &ginfo) >= 0);
47:   printf("Root group link count\t: %llu\n", (unsigned long long)ginfo.nlinks);
48: 
49:   /* Release resources. */
50: 
51:   assert(H5Fclose(file) >= 0);
52:   assert(H5Pclose(fapl) >= 0);
53:   return 0;
54: }

The code can be built with h5cc, which links it against libhdfs.

$HOME/.local/bin/h5cc -show hdfs-vfd-sample.c -o doit
gcc -I/home/admin/.local/include -I/home/admin/hadoop-3.1.1/include -c hdfs-vfd-sample.c
gcc -I/home/admin/hadoop-3.1.1/include hdfs-vfd-sample.o -o doit -L/home/admin/.local/lib /home/admin/.local/lib/libhdf5_hl.a /home/admin/.local/lib/libhdf5.a -L/home/admin/hadoop-3.1.1/lib/native -L/usr/lib/jvm/default-java/jre/lib/ -L/usr/lib/jvm/default-java/jre/lib//server -lhdfs -ldl -lm -Wl,-rpath -Wl,/home/admin/.local/lib
$HOME/.local/bin/h5cc hdfs-vfd-sample.c -o doit

Running the executable yields the expected result:

./doit
File size		: 18528 [bytes]
Root group address	: 96
Root group attr. count	: 0
Root group link count	: 1

Summary

In this document, we have shown how to use the HDF5 VFD for HDFS with existing HDF5 tools and in application development. Because of the ease with which the run-time behavior of the HDF5 library can be configured, changes to existing applications are minimal.

For additional help or further questions, please contact support@hdfgroup.org.