The HDF Bioinformatics project is a collaborative effort to develop portable, scalable bioinformatics data storage technologies in HDF5. HDF5 is an open-source binary file storage technology for managing large or complex scientific and engineering data.
The project is led by Geospiza, Inc. and The HDF Group. In his blog post "The case for HDF," Geospiza CEO Todd Smith describes what the project is all about.
The first phase of the project was successfully completed in summer 2006 and was funded by NIH SBIR Phase I Grant 1R41HG3792-1. A Phase II proposal has been submitted to NIH and is awaiting review.
During Phase I, Geospiza and The HDF Group demonstrated the feasibility of HDF5 for genotyping applications by defining the requirements for genotyping software applications and using those requirements to develop a scalable data model in HDF5. Additionally, the groups collaborated with the National Center for Biotechnology Information (NCBI) to further demonstrate HDF5's utility in bioinformatics by building an additional data model that can be used to study linkage disequilibrium (LD) in HapMap data sets.
The project demonstrated that the HapMap data for chromosome 22, which contains nearly 53,000 SNPs, could easily be stored, accessed, and visualized in HDF5. It also showed that long-range LD matrices spanning entire chromosomes could be generated efficiently on a parallel computing cluster and then stored and accessed easily in HDF5.
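To make the idea of an LD matrix concrete, here is a minimal sketch in Python that computes pairwise r² values, one common LD measure, for a handful of SNPs. It is an illustration only and not the project's code: the function names `r_squared` and `ld_matrix` and the toy haplotype data are assumptions, and the project's actual LD statistic, data model, and parallel implementation are not described here. The resulting square matrix is the kind of dense two-dimensional array that could be written to a single HDF5 dataset.

```python
def r_squared(a, b):
    """Return r^2 between two biallelic SNPs.

    a and b are equal-length lists of 0/1 alleles, one entry per
    phased haplotype (hypothetical input format for this sketch).
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("SNP vectors must be non-empty and equal in length")
    p_a = sum(a) / n                                 # frequency of allele 1 at SNP a
    p_b = sum(b) / n                                 # frequency of allele 1 at SNP b
    p_ab = sum(x & y for x, y in zip(a, b)) / n      # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                          # undefined for a monomorphic SNP
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    return d * d / denom


def ld_matrix(snps):
    """Return the symmetric matrix of pairwise r^2 values as nested lists."""
    m = len(snps)
    matrix = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            value = r_squared(snps[i], snps[j])
            matrix[i][j] = value
            matrix[j][i] = value                     # r^2 is symmetric
    return matrix


# Toy data: three SNPs observed on four haplotypes (made up for illustration).
toy_snps = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],   # identical to the first SNP, so r^2 = 1
    [0, 1, 0, 1],   # independent of the first SNP, so r^2 = 0
]
toy_ld = ld_matrix(toy_snps)
```

The number of SNP pairs grows quadratically with the number of SNPs, which is why chromosome-wide matrices call for parallel computation and a storage format that supports reading sub-blocks.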
Perl HDF5 wrappers are available to demonstrate how one might store and access genomic sequence data from the FASTA format in HDF5. This software and documentation can be obtained from the HDF Bioinformatics Software page.
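As a rough illustration of how FASTA records might be laid out for array-based storage, the Python sketch below packs all sequences into one contiguous byte buffer and keeps a small index of offsets and lengths. This is not the Perl wrapper API, which is not shown here; the layout and the names `parse_fasta`, `pack_sequences`, and `fetch` are assumptions made for the example. In an HDF5 file, the buffer would plausibly map to a one-dimensional dataset and the index to a small table.

```python
def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""   # first word of the header is the ID
            chunks = []
        elif name is not None:
            chunks.append(line)                  # sequence lines may wrap
    if name is not None:
        yield name, "".join(chunks)


def pack_sequences(records):
    """Concatenate sequences into one byte buffer plus an offset index."""
    data = bytearray()
    index = {}
    for name, seq in records:
        index[name] = (len(data), len(seq))      # (offset, length) in the buffer
        data.extend(seq.encode("ascii"))
    return bytes(data), index


def fetch(data, index, name, start, end):
    """Return bases [start, end) of the named sequence without scanning the rest."""
    offset, length = index[name]
    if not 0 <= start <= end <= length:
        raise IndexError("requested range lies outside the sequence")
    return data[offset + start:offset + end].decode("ascii")


# Toy input, made up for illustration.
toy_fasta = ">seq1 demo record\nACGT\nAC\n>seq2\nGGTTA\n"
toy_data, toy_index = pack_sequences(parse_fasta(toy_fasta))
toy_slice = fetch(toy_data, toy_index, "seq2", 1, 4)
```

Keeping the offsets separate from the bases means a reader can pull any subsequence with one slice, which is the access pattern that array-oriented formats such as HDF5 are designed to serve.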
Last modified: November 4, 2008
