Citations for HDF Data and Software
Ted Habermann, The HDF Group
The topic of software citation has been discussed in many forums recently, and several major discovery repositories (e.g. Zenodo and DataCite) support metadata for software in addition to datasets and other resource types. HDF5 straddles the boundary between the dataset and software worlds. It is most commonly thought of and referred to as a data format but, as with any format, data written in the HDF formats cannot be read without HDF software. So the answer to the question “is it a format or is it software?” is clearly both.
In this case the specific question is: how do open source software developers (commercial or academic) help their users get recognition and credit for their contribution through software citations?
We explore this question in two ways using DataCite metadata. First: What do we think is the best way to cite HDF as a format and as software? and second: How is it done now?
The DataCite Metadata Schema, supported and used by DataCite and Zenodo, includes an element named Format that is defined as “Technical format of the resource”. The values in this field are free text with the recommendation “Use file extension or MIME type where possible, e.g., PDF, XML, MPG or application/pdf, text/xml, video/mpeg”.
For datasets, the most common type of resource cataloged in DataCite, this is clearly the correct place for listing HDF as the format for a published dataset. This element is available in all versions of the DataCite Metadata Schema and records that use it will be discovered in generic searches. The recommended content is application/x-hdf5 for data in HDF5 or application/x-hdf for data in earlier versions.
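The format recommendation above can be sketched in a few lines of Python. This is only an illustration: the function name and the version argument are assumptions made for the example, while the formats/format element names and the two MIME types come from the DataCite schema and the text above.

```python
import xml.etree.ElementTree as ET

# MIME types recommended in the text above.
HDF5_MIME = "application/x-hdf5"  # data in HDF5
HDF_MIME = "application/x-hdf"    # data in earlier versions (e.g. HDF4)


def hdf_formats_element(hdf_major_version: int) -> ET.Element:
    """Build a DataCite <formats> element holding the HDF MIME type.

    hdf_major_version is a hypothetical argument: 5 for HDF5,
    anything lower for earlier versions of HDF.
    """
    mime = HDF5_MIME if hdf_major_version >= 5 else HDF_MIME
    formats = ET.Element("formats")
    ET.SubElement(formats, "format").text = mime
    return formats


xml_text = ET.tostring(hdf_formats_element(5), encoding="unicode")
print(xml_text)
```

Because format is repeatable, a record could carry the MIME type alongside any other format strings a publisher already uses.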
Using the format element ensures that datasets written in HDF4 or HDF5 will be discovered when questions like “What data are available in HDF5?” or “What is the format of this dataset?” are asked. Other options help users answer these questions even when they are not asked. For example, referencing HDF in the dataset description ensures that format information appears in lists of search results, as shown in this example.
Another option is to include HDF in the title of the dataset, again ensuring that HDF appears in a list of search results as shown here.
The remaining question is how to cite HDF as software in this framework. Several changes were made in the most recent revision of the DataCite Metadata Schema (V4.1) specifically to facilitate software citation. In particular, the relationTypes “IsRequiredBy” and “Requires” were added to indicate software dependencies. As described above, data that are written in HDF (with any of the many community conventions written on top of HDF) require HDF software, so the “Requires” relationType seems appropriate. This would be written as:
<relatedIdentifier relatedIdentifierType="DOI" relationType="Requires"> 10.11578/dc.20180330.1 </relatedIdentifier>
where 10.11578/dc.20180330.1 is a recently minted DOI for HDF5 software.
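For publishers who generate metadata programmatically, the same element could be produced and read back roughly as follows. This is a minimal sketch using Python's standard library; the function names are assumptions made for the example, and only the element name, attributes, and DOI come from the text above.

```python
import xml.etree.ElementTree as ET

HDF5_SOFTWARE_DOI = "10.11578/dc.20180330.1"  # DOI quoted in the text above


def requires_hdf5() -> ET.Element:
    """Build a <relatedIdentifiers> block stating that a dataset requires HDF5."""
    related = ET.Element("relatedIdentifiers")
    rel = ET.SubElement(
        related,
        "relatedIdentifier",
        {"relatedIdentifierType": "DOI", "relationType": "Requires"},
    )
    rel.text = HDF5_SOFTWARE_DOI
    return related


def required_dois(root: ET.Element) -> list:
    """List the identifiers of every 'Requires' relation under root."""
    return [
        (el.text or "").strip()  # tolerate whitespace around the DOI
        for el in root.iter("relatedIdentifier")
        if el.get("relationType") == "Requires"
    ]


print(required_dois(requires_hdf5()))
```

Stripping whitespace on the way back in means the reader also accepts the snippet exactly as written above, with spaces around the DOI.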
Background: HDF in Existing DataCite Metadata
A generic DataCite search covers content in many fields. Searches for HDF, HDF5, and HDF4 yield 990, 922, and 32 records, respectively. The first search also returns the records found by the other two (and a variety of other records, see discussion); those overlaps have been removed from these counts.
These strings occur in nine metadata fields (Table 1). A single metadata record can have occurrences in multiple fields. The total number of occurrences is 990 for HDF, 1002 for HDF5 and 39 for HDF4.
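A count of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the procedure used for this study: records are represented as plain dictionaries mapping a field name to a list of strings, and the sample records are invented for illustration.

```python
import re
from collections import Counter

# Matches HDF, HDF4, or HDF5 as a whole word, case-insensitively.
# "HDF-EOS" and "application/x-hdf" count as plain HDF.
HDF_PATTERN = re.compile(r"\bHDF([45])?\b", re.IGNORECASE)


def hdf_labels(text: str) -> set:
    """Return the labels ('HDF', 'HDF4', 'HDF5') that occur in text."""
    return {"HDF" + (m.group(1) or "") for m in HDF_PATTERN.finditer(text)}


def count_occurrences(records: list) -> Counter:
    """Count (field, label) pairs, once per field value that mentions the label."""
    counts = Counter()
    for record in records:
        for field, values in record.items():
            for value in values:
                for label in hdf_labels(value):
                    counts[(field, label)] += 1
    return counts


# Invented sample records, not taken from DataCite.
records = [
    {"title": ["Example swath product (HDF-EOS)"], "format": ["application/x-hdf"]},
    {"description": ["Simulation output stored as HDF5 files."],
     "subject": ["HPC, parallel I/O, HDF5"]},
    {"title": ["Example HDF4 granule"],
     "description": ["Data in HDF4.", "Daten in HDF4."]},
]

counts = count_occurrences(records)
for (field, label), n in sorted(counts.items()):
    print(field, label, n)
```

Counting per field value rather than per record is what allows occurrences to exceed records, as happens when a description is given in more than one language.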
Table 1. Number of occurrences of HDF, HDF5, and HDF4 in fields in DataCite records. Obligations are (M)andatory, (R)ecommended, (O)ptional.
The most common location of the HDF* (i.e. HDF or HDF5 or HDF4) occurrences is the description field, with 506 HDF5 occurrences in 503 records (three records have descriptions in multiple languages). This field is described as the “most important” recommended field in the DataCite schema.
The records that include HDF or HDF4 are different. In those cases, the title is the most common location (40%/54%) for the HDF/HDF4 strings, with description a distant second (9%/26%). Table 2 shows the publishers of the metadata records included in this study. Note that significant numbers of the HDF records came from several NASA Centers (GSFC). These are records with HDF or HDF4 in the title. Many of these titles include “HDF-EOS”, “HDF File”, or “HDF Binary File” and were, therefore, missed by the HDF4 and HDF5 searches.
The sources that include HDF5 are much more diverse but are dominated by Zenodo, which is a community repository with less centralized governance than GSFC. In that case, references to HDF5 are typically in titles, descriptions, or subjects (the Zenodo interface does not support the format field).
|NASA Langley Atmospheric Science Data Center DAAC||342||61||22|
|European Space Agency (ESA) Gaia mission and Gaia Data Processing and Analysis Consortium (DPAC)||0||47|
|NASA Langley Atmospheric Science Data Center||35|
|Dryad Digital Repository||20||4||1|
|NASA NSIDC DAAC||0||15|
|JACoW, Geneva, Switzerland||0||13|
|NASA National Snow and Ice Data Center DAAC||0||10|
|ORNL Distributed Active Archive Center||9|
|UCAR/NCAR – Research Data Archive||0||6|
|NASA DAAC at the National Snow and Ice Data Center||0||6|
|Apollo – University of Cambridge Repository||0||5|
|Interdisciplinary Earth Data Alliance (IEDA)||1||4|
Table 2. Publishers of DataCite records with HDF, HDF5, and HDF4.
Another common location for HDF5 is the format field, with 288 occurrences in 312 records (format is an optional, repeatable field). Format is free text and the values in this field are shown in Table 3. The DataCite recommendation for format is “Use file extension or MIME type where possible, e.g., PDF, XML, MPG or application/pdf, text/xml, video/mpeg.” A significant majority (79%/71%) of the HDF/HDF5 records that include format follow this recommendation (application/x-hdf and application/x-hdf5 are the correct MIME types), but over 20% do not (the conventional file extension for HDF is .h5). Format is much less common in the HDF4 set (6 occurrences) and none of those occurrences include the recommended content.
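One way to check free-text format values against this recommendation is sketched below. The classification rules are assumptions made for illustration: a value counts as following the recommendation only when the whole string is one of the two MIME types or a bare file extension. Only .h5 is named in the text; the other two extensions in the set are added as assumptions.

```python
RECOMMENDED_MIME = {"application/x-hdf", "application/x-hdf5"}
# "h5" comes from the text; "hdf" and "hdf5" are assumed extras.
HDF_EXTENSIONS = {"h5", "hdf", "hdf5"}


def classify_format(value: str) -> str:
    """Classify a format string as 'mime', 'extension', or 'other'."""
    v = value.strip().lower()
    if v in RECOMMENDED_MIME:
        return "mime"
    if v.lstrip(".") in HDF_EXTENSIONS:
        return "extension"
    return "other"


# The first two values are taken from Table 3; the last two are bare values.
samples = [
    "HDF: Hierarchical Data Format (HDF) (unknown version) (application/x-hdf)",
    ".xlsx, .isprp, .HDF5, .ISSD, .ISVI",
    "application/x-hdf5",
    ".h5",
]
for s in samples:
    print(classify_format(s), "<-", s)
```

Under these rules a value that embeds the MIME type inside a longer description is still classified as "other", since a machine reading the field would not see a clean MIME type.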
|HDF: Hierarchical Data Format (HDF) (unknown version) (application/x-hdf)||2|
|NCL supports the following data formats – NetCDF, HDF4, HDF5, HDF-EOS2, HDF-EOS5, GRIB1, GRIB2, shapefiles.||1||1|
|.xlsx, .isprp, .HDF5, .ISSD, .ISVI||1|
|HDF5: Hierarchical Data Format version 5 (HDF) (application/x-hdf)||1|
Table 3. Values in the format field that include HDF*.
The final location with significant occurrences of HDF5 is subject, a free text field defined as “Subject, keyword, classification code, or key phrase describing the resource”. This field is not used for HDF4. The values in the subject field that include HDF5 are shown in Table 4.
|HPC, parallel I/O, HDF5||8|
|Photon-HDF5, smFRET, simulation||2|
|simulation, hybrid, Hypre library, Curie Tier-0, POSIX VTK, HDF5 I/O||2|
|DG-comp, Tier-0, HDF5-formatted||2|
|I/O, MPI-IO, HDF5||2|
Table 4. Values in the subject field that include HDF5.
Table 1 includes the obligations (Mandatory, Recommended, and Optional) of the elements that include the HDF4 and HDF5 strings (mandatory fields are bold). The title and resourceIdentifier elements are mandatory and must have content, whereas the resourceType has a mandatory attribute (@resourceTypeGeneral) which comes from a controlled list, but the content of the element is free text.
As shown in Table 1, the vast majority of HDF* occurrences are in description, title and format fields which are recommended, mandatory and optional respectively. The description and format field content suggests that the creators of these metadata records are interested in helping users by providing information that goes beyond the bare minimum required by DataCite. At the same time, the vast majority of the records that include HDF* include it in only one location. Table 5 shows that 94% and 84% of the records that include HDF5 and HDF4 have it in one location. This means that metadata providers that want to indicate that the format of their data is HDF* do it once, i.e. at one location.
|Field Count||HDF (%)||Field Count||HDF (%)|
Table 5. Percentage of records that include HDF in some number of fields.
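Percentages of the kind reported in Table 5 could be derived as in the sketch below. The record layout (a dictionary mapping each field name to its text) and the sample data are assumptions made for illustration, not the data used in this study.

```python
from collections import Counter


def field_count_distribution(records: list, term: str) -> dict:
    """Map 'number of fields containing term' to the percentage of matching records."""
    per_record = [
        sum(term in text for text in record.values()) for record in records
    ]
    per_record = [n for n in per_record if n > 0]  # keep records that mention term
    total = len(per_record)
    return {n: 100 * c / total for n, c in sorted(Counter(per_record).items())}


# Invented sample: three records mention HDF5 in one field, one in two fields.
records = [
    {"title": "Example dataset", "description": "Stored as HDF5."},
    {"title": "Example HDF5 dataset", "description": "No format given."},
    {"title": "Another dataset", "format": "application/x-HDF5"},
    {"title": "HDF5 output", "description": "HDF5 files from a simulation."},
    {"title": "Unrelated record", "description": "CSV tables."},
]

distribution = field_count_distribution(records, "HDF5")
print(distribution)
```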
Exploring this in more detail, it is interesting to note that in all cases, formats and titles are completely exclusive. Of the 158 records that include HDF in the title and 274 records that include it in the format, none include it in both locations. The same is true for the HDF4 case although the counts are smaller.
The same is nearly true for the description and format fields. Of the 503 records that include HDF5 in the description and 274 records that include HDF5 in the format, only four records have both. This is not true in the HDF4 case, where the same numbers are 10, 6, and 5, i.e. five of the records that include HDF4 in the format also include it in the description.
The HDF format is based on a very flexible data model that many communities build on with community-specific domain models to customize their applications for their users. We have identified two such community applications in this study. Fifty records related to the Photon-HDF5 conventions were recognized in the HDF5 set and 169 records related to the HDF-EOS conventions were identified in the HDF set. Developing these conventions and APIs is a significant amount of work that the developers should get credit for. Thus, in situations where multiple layers are used in the data access stack, the metadata records should include related identifiers for each of those layers.
Software citations are an important part of the network of connections that help people understand scientific results as well as a critical mechanism for providing credit to the people that enable science through software development. The HDF format and software are heavily used across many scientific disciplines, so they provide a good case for understanding current and recommending future practice. We explored how HDF was referenced in 954 metadata records from DataCite and found a variety of different approaches.
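The overlap between two fields reduces to a set intersection over record identifiers. The sketch below assumes records keyed by an identifier, each mapping field names to text; the identifiers and contents are invented for illustration.

```python
def field_overlap(records: dict, term: str, field_a: str, field_b: str) -> tuple:
    """Return (records with term in field_a, in field_b, in both)."""
    in_a = {rid for rid, rec in records.items() if term in rec.get(field_a, "")}
    in_b = {rid for rid, rec in records.items() if term in rec.get(field_b, "")}
    return len(in_a), len(in_b), len(in_a & in_b)


# Invented sample records keyed by a hypothetical identifier.
records = {
    "r1": {"description": "Granules in HDF4.", "format": "HDF4"},
    "r2": {"description": "Converted from HDF4."},
    "r3": {"title": "Example HDF4 product"},
}

overlap = field_overlap(records, "HDF4", "description", "format")
print(overlap)
```

The three numbers returned correspond to the triples quoted above, such as 10, 6, and 5 for the HDF4 description and format fields.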
Moving forward we recommend citing the HDF format from dataset metadata records by:
- including the MIME type for HDF (application/x-hdf5 for data in HDF5 or application/x-hdf for data in earlier versions) in the Format field of the metadata. This option is available in all versions of the DataCite schema but not all interfaces;
- including HDF5/HDF4 in the title and/or description of the dataset;
- including HDF5 as a keyword/subject;
and citing HDF software using the relatedIdentifier field with relatedIdentifierType="DOI", relationType="Requires", and the value 10.11578/dc.20180330.1. This relation is available in the current version of the DataCite Metadata Schema V4.1.
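These recommendations can be turned into a simple check that a publisher might run before registering a record. The sketch below assumes a record laid out as a dictionary loosely modeled on DataCite JSON; the key names and the sample record should be treated as assumptions rather than a definitive rendering of the schema.

```python
HDF_MIME_TYPES = {"application/x-hdf", "application/x-hdf5"}
HDF5_SOFTWARE_DOI = "10.11578/dc.20180330.1"  # DOI quoted in the text above


def check_hdf_citation(record: dict) -> dict:
    """Report which of the four recommendations a record satisfies."""

    def texts(key: str, subkey: str) -> list:
        return [item.get(subkey, "") for item in record.get(key, [])]

    def mentions_hdf(values: list) -> bool:
        return any("HDF" in value.upper() for value in values)

    return {
        "format_mime_type": any(
            f.strip().lower() in HDF_MIME_TYPES for f in record.get("formats", [])
        ),
        "title_or_description": mentions_hdf(
            texts("titles", "title") + texts("descriptions", "description")
        ),
        "subject": mentions_hdf(texts("subjects", "subject")),
        "requires_hdf_software": any(
            r.get("relationType") == "Requires"
            and r.get("relatedIdentifierType") == "DOI"
            and r.get("relatedIdentifier", "").strip() == HDF5_SOFTWARE_DOI
            for r in record.get("relatedIdentifiers", [])
        ),
    }


# Invented sample record: format and title are present, subject and relation are not.
record = {
    "titles": [{"title": "Example ocean model output (HDF5)"}],
    "formats": ["application/x-hdf5"],
}

report = check_hdf_citation(record)
for name, ok in report.items():
    print(name, "yes" if ok else "no")
```

Reporting each recommendation separately, rather than a single pass or fail, shows a publisher which audience (human readers or machines) a record does not yet serve.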
A number of metadata publishers are already implementing some of these recommendations, including:
- five publishers (AraGWAS Catalog, iCFDdatabase, Interdisciplinary Earth Data Alliance (IEDA), TU Delft, University of Illinois at Urbana-Champaign) include application/x-hdf in 194 metadata records,
- thirty-seven (37) publishers include HDF in descriptions of metadata records,
- eighteen (18) publishers include HDF in titles of metadata records, and
- three (3) publishers include HDF as a subject in a metadata record.
As noted above, however, very few of the metadata records include HDF in multiple locations. In fact, most of the publishers (74%) consistently use only one location for the format information. This limits the utility of the references, as different metadata fields have different primary target audiences. For example, the description and title fields are aimed at human audiences and the format field, with a MIME type, is aimed primarily at machine readability. Clearly both of these audiences are important and both can be served using the current DataCite schema.
The current metadata is very weak at accomplishing the important goal of using DOIs to make connections that facilitate credit for software development. None of the current records take advantage of the “Requires” relationType even though many are written with version 4.1 of the DataCite schema.