
About BioHDF

The BioHDF project is a collaborative effort to address the bioinformatics data deluge problem. Based on the established open-source HDF5 binary data storage technology, BioHDF strives to help biologists come to terms with the flood of data that the latest instrumentation can produce.

As we envision it, there are three key parts to BioHDF:

  1. The data model and file organization.
     This determines which data will be stored, how it will be arranged in the data file and how it will be queried. Data will be stored as fundamental building blocks such as "sequences", "alignments" and "MS/MS spectra". Unlike most file formats, which are set in stone, BioHDF files will be self-describing, flexible and extensible because they are based on HDF5.

  2. The C application programming interface (API) and library.
     This library will provide the basic means of manipulating the data stored in a BioHDF file. Interoperability with existing bioinformatics tools will be provided by functions which import and export data from and to existing bioinformatics file formats.

  3. Wrappers around the C API for popular bioinformatics languages like Perl.
     C is a useful language for the basic BioHDF API since it allows for easy interfacing with the HDF5 API, can be ported easily to many operating systems and can interoperate with most higher-level languages. Much bioinformatics work is done in higher-level languages, however, and we intend to make the BioHDF API easily wrappable for these languages using packages like SWIG and XS.
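
To make the import/export idea in the C API description above more concrete, here is a minimal sketch in plain C of a "sequence" building block together with functions that read and write it as FASTA text. It is an illustration only: the names `bio_sequence`, `bio_import_fasta`, `bio_export_fasta` and `bio_sequence_free` are hypothetical and are not part of BioHDF or HDF5, and the sketch keeps the record in memory rather than storing it in an HDF5 file.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "sequence" building block (illustrative, not the BioHDF model). */
typedef struct {
    char *id;       /* header text after '>' up to the first whitespace */
    char *residues; /* sequence letters with line breaks removed */
} bio_sequence;

/* Release the memory owned by a record and reset its fields. */
void bio_sequence_free(bio_sequence *seq)
{
    free(seq->id);
    free(seq->residues);
    seq->id = NULL;
    seq->residues = NULL;
}

/* Hypothetical import: parse the first FASTA record found in `text`.
 * Returns 0 on success, -1 on malformed input or allocation failure. */
int bio_import_fasta(const char *text, bio_sequence *out)
{
    const char *p, *id_end;
    size_t id_len, n = 0;

    out->id = NULL;
    out->residues = NULL;
    if (text == NULL || text[0] != '>')
        return -1;

    /* Header line: the identifier ends at the first whitespace. */
    p = text + 1;
    id_end = p;
    while (*id_end != '\0' && !isspace((unsigned char)*id_end))
        id_end++;
    id_len = (size_t)(id_end - p);
    if (id_len == 0)
        return -1;

    out->id = malloc(id_len + 1);
    out->residues = malloc(strlen(text) + 1); /* upper bound on residue count */
    if (out->id == NULL || out->residues == NULL) {
        bio_sequence_free(out);
        return -1;
    }
    memcpy(out->id, p, id_len);
    out->id[id_len] = '\0';

    /* Skip the rest of the header line (the free-text description). */
    p = id_end;
    while (*p != '\0' && *p != '\n')
        p++;

    /* Body: copy residues, dropping whitespace; stop at the next '>'. */
    for (; *p != '\0' && *p != '>'; p++) {
        if (!isspace((unsigned char)*p))
            out->residues[n++] = *p;
    }
    out->residues[n] = '\0';
    return 0;
}

/* Hypothetical export: write the record as FASTA, `width` residues per line.
 * Returns the number of characters written, or -1 if `buf` is too small. */
int bio_export_fasta(const bio_sequence *seq, size_t width,
                     char *buf, size_t buflen)
{
    size_t len = strlen(seq->residues);
    size_t pos, i;
    int w;

    if (width == 0)
        return -1;
    w = snprintf(buf, buflen, ">%s\n", seq->id);
    if (w < 0 || (size_t)w >= buflen)
        return -1;
    pos = (size_t)w;

    for (i = 0; i < len; i += width) {
        size_t chunk = (len - i < width) ? len - i : width;
        if (pos + chunk + 2 > buflen) /* chunk + newline + terminator */
            return -1;
        memcpy(buf + pos, seq->residues + i, chunk);
        pos += chunk;
        buf[pos++] = '\n';
    }
    buf[pos] = '\0';
    return (int)pos;
}
```

A caller would pass FASTA text to `bio_import_fasta`, work with the resulting `bio_sequence`, write it back out with `bio_export_fasta`, and release it with `bio_sequence_free`. Keeping the in-memory record separate from any one text format is the design point: the same building block could then be written to an HDF5 file or exported to other formats by adding functions alongside these.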

In these BioHDF web pages you will find documentation and links to some early attempts at storing biological data in HDF5 files. In particular, the command line tools and data model created to help support Geospiza's products come close to our vision for BioHDF: They have a data model which consists of fundamental building blocks and an "API" of command-line tools which can import, export and manipulate data. Over the coming months we intend to build on that foundation, expanding and refining the data model and constructing a C API/library out of the existing tools while adding new features to support data storage and operations required by the community.

We believe that a key factor in the success of BioHDF is the participation of interested parties in the development of the data model and API. If you are drowning in data and would like to take part in the development of BioHDF, we encourage you to follow our progress on this website and to subscribe to our forum (link on the left). We welcome your input!


For more information, click the links on the left.

Last modified: November 18th 2009