Linkage disequilibrium (LD) is the non-random association of alleles at different loci in a population. Computing LD has complexity O(n^2 m), where n is the number of loci considered and m is the size of the population studied. Because of the large number of SNPs, genome-level LD analysis is very time consuming. On a single processor the computation would take months, but it could be done on a large-scale supercomputer in a matter of days. Our experiments showed that the whole LD analysis for chromosome 22 could be done within 4 minutes on a 32-node cluster (a small cluster).
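To make the O(n^2 m) cost concrete, here is a minimal serial sketch of a pairwise LD computation. It assumes the r^2 metric, biallelic loci, and phased haplotypes coded as 0/1; the function names `r_squared` and `ld_matrix` are hypothetical and are not taken from the project's code. Each of the roughly n^2/2 locus pairs needs one pass over the m haplotypes.

```python
from itertools import combinations


def r_squared(a, b):
    """r^2 between two biallelic loci, given 0/1 allele codes per haplotype.

    Assumption: a[k] and b[k] are the alleles carried by haplotype k.
    """
    m = len(a)
    if m == 0 or len(b) != m:
        raise ValueError("both loci must cover the same non-empty haplotype set")
    p_a = sum(a) / m                                # frequency of allele 1 at locus A
    p_b = sum(b) / m                                # frequency of allele 1 at locus B
    p_ab = sum(x * y for x, y in zip(a, b)) / m     # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                         # monomorphic locus: r^2 is undefined
    d = p_ab - p_a * p_b                            # disequilibrium coefficient D
    return d * d / denom


def ld_matrix(loci):
    """Symmetric n x n matrix of r^2 values: O(n^2) pairs, O(m) work per pair."""
    n = len(loci)
    ld = [[0.0] * n for _ in range(n)]
    for i in range(n):
        ld[i][i] = 1.0                              # convention: a polymorphic locus vs itself
    for i, j in combinations(range(n), 2):
        ld[i][j] = ld[j][i] = r_squared(loci[i], loci[j])
    return ld


if __name__ == "__main__":
    # Three loci observed on four haplotypes (toy data, not real SNPs).
    toy = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
    for row in ld_matrix(toy):
        print(row)
```

Because each pair is independent of every other pair, the pair loop is the natural place to divide work among the nodes of a cluster.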
Moreover, the storage requirement for the linkage array is n^2/2 elements. Storing, exchanging, and visualizing such large arrays requires technology that can compress them and display them efficiently. The HDF technologies, through the HDF file API, can help with the storage and visualization of large LD arrays. Our experiments showed that the LD array for chromosome 22 can be stored in a compressed matrix of 4.5 MB and that, with a hierarchy of lower-resolution datasets present, visualization of the LD array was memory efficient and interactive.
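The n^2/2 figure presumably reflects the symmetry of the LD matrix: the value for loci (i, j) equals the value for (j, i), so only the n(n-1)/2 pairs with i < j are distinct. The sketch below shows one way to address those pairs in a flat array. It is only an illustration of the count; the helper names are hypothetical, and nothing here implies that the HDF5 files use this layout.

```python
def pair_count(n):
    """Number of distinct locus pairs (i < j) among n loci."""
    return n * (n - 1) // 2


def packed_index(i, j, n):
    """Position of pair (i, j) in a row-major packing of the strict upper triangle."""
    if i == j:
        raise ValueError("the diagonal is not stored in this packing")
    if i > j:
        i, j = j, i                      # symmetry: (j, i) maps to the same slot as (i, j)
    if i < 0 or j >= n:
        raise IndexError("locus index out of range")
    # Rows 0..i-1 hold (n-1) + (n-2) + ... + (n-i) entries in total.
    return i * (2 * n - i - 1) // 2 + (j - i - 1)


if __name__ == "__main__":
    n = 4
    for i in range(n):
        for j in range(i + 1, n):
            print((i, j), "->", packed_index(i, j, n))
```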
The main contributions of this work are:
- Proposing parallelization of the LD algorithm so that results can be generated quickly on large supercomputers.
- Proposing compression and chunking for storing the entire array.
- Proposing a hierarchy of images of decreasing resolution to allow efficient interactive visualization.
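The idea of a resolution hierarchy can be sketched as follows. This is a minimal illustration, assuming each coarser level is produced by averaging non-overlapping square blocks of the level above it; the reduction actually used for the datasets is not stated here, and the names `downsample` and `build_pyramid` are hypothetical.

```python
def downsample(matrix, factor):
    """Average non-overlapping factor x factor blocks (edge blocks may be smaller)."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    rows, cols = len(matrix), len(matrix[0])
    out = []
    for r0 in range(0, rows, factor):
        out_row = []
        for c0 in range(0, cols, factor):
            block = [
                matrix[r][c]
                for r in range(r0, min(r0 + factor, rows))
                for c in range(c0, min(c0 + factor, cols))
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def build_pyramid(matrix, factor, levels):
    """Return [full resolution, coarser, coarser still, ...] with `levels` entries."""
    pyramid = [matrix]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1], factor))
    return pyramid


if __name__ == "__main__":
    full = [
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5],
        [0.0, 0.0, 0.5, 0.5],
    ]
    for level in build_pyramid(full, factor=2, levels=3):
        print(len(level), "x", len(level[0]))
```

A viewer can then load the coarsest level first and read the full-resolution array only for the small region a user selects, which is what keeps memory use low.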
Examples of LD data in HDF5 files
The LD_22.h5 file contains the LD values for chromosome 22, calculated using the r^2 metric. The file has 3 datasets. The Chromosome22 dataset contains the entire LD matrix, whereas the other 2 contain the matrix at lower resolutions. We plan to add functionality to HDFView that lets scientists make selections in the lower-resolution datasets and zoom directly into the higher-resolution datasets. The file LD_19.h5 follows a similar data structure.
Use HDFView to look at the files after downloading them to your computer.
CAUTION: Attempting to open any dataset except the lowest-resolution one might crash the viewer. For now, the only way to access the higher-resolution images is to use the "Open As" functionality and subset the data manually. Although the stored size of the dataset is small (achieved through compression), the actual number of elements is still very large. DO NOT CLICK on the Chromosome22 dataset to see the data after opening the file with HDFView. Instead, right-click the dataset and choose "Open As" from the menu; a dialog window with a small image will appear. You may wish to choose "Image" to display the dataset; use the left mouse button to select a small region.
Last modified: February 27th 2008
