Would you like SQL with your HDF5?
Gerd Heber, The HDF Group
In the face of naysayers, the SQL abides. Read in our latest blog post how HDF5 and ODBC tie the room together.
When I order my HDF5 at Moe’s Data Diner, I usually ask for extra napkins. It’s a meal where you need both hands, and it can be messy before you get to the juicy bits. At least that’s the way it used to be. It’s easier to dissect with h5py , but what’s the hungry stranger to do, who is just coming through town and who is clueless? When asked the other night, I didn’t fall off my chair nor did I choke, but the question got my head going. What would it look like, that SQL on-the-side thing?
Say, I have an HDF5 dataset at /group1/A/dset2 and would like to select a few elements like so:
SELECT * FROM /group1/A/dset2 WHERE value > -999.0
Nice, but how do I get the result into my favorite analytics tool? Isn’t there some standard pipe or conduit that helps me over that last mile? Well, it’s kind of embarrassing to admit, but it’s been there since the early 1990s and is called Open Database Connectivity (ODBC) .
Think of ODBC as the “USB of data sources.” If you have a USB driver for a device, it’s game on. If you have an ODBC driver for your data source, then it’s SQL, milk, and honey from here. A growing number of applications come with some module or package for accessing data stored in HDF5 files, but you can be almost certain that your tool of choice has an ODBC client built in.
To build an HDF5/ODBC driver, we need a splash of the “secret sauce” shown in the figure above, and this is the subject of this blog post. Spoiler Alert: We are not giving away the recipe.
Instead, we describe its taste and flavors in somewhat bland technical terms. Notwithstanding that, let’s start with a hands-on example.
Excel’s Other Data Source(s)
Most of us have probably loaded data from a text or CSV file into Excel, but its data import capabilities don’t stop there. On the Excel Ribbon, the “ODBC goodness” is hidden under the Data tab in the From Other Sources menu.
Select From Data Connection Wizard and pick ODBC DSN as the kind of remote data source you want to connect. After hitting Next, you’ll be presented with a dialog similar (the actual list of ODBC data sources might be different for your system) to the one shown below. Select the HiFiveDSII data source and you’ll be prompted for the name of an HDF5 file.
The Data Connection Wizard will then present you with a list of tables from which you can import data, or against which may run more sophisticated SQL queries. See the figure below for a sample table list from an HDF-EOS5 file.
This is essentially a list of HDF5 path names with the dataset names as table names. Pick one, hit Import, and see the data delivered to your doorstep (worksheet). Caution: If you’ve picked an HDF5 dataset whose element count exceeds the maximal number of rows in an Excel worksheet (1,048,576), you’ll be presented with a warning and a choice, to continue with a truncated subset or to abort the import.
Happy days! Why wasn’t it always this easy? Consider this the “Hello, World!” of HDF5/ODBC. Now, what do you say after you say hello?
List of Ingredients and Nutritional Information
Warning: The material presented in this section is subject to change. It gives you an idea of what we have implemented to date, but the final product might be different.
HDF5 Datasets as ODBC Tables
HDF5 datasets seem to be the natural candidates for making an appearance as tables on the relational side. Leaving aside more sophisticated storage layouts , the closest thing to a table in HDF5 is probably a one-dimensional dataset of a compound type, such as a time series, with the fields of the compound type representing table columns. (See the figure below.)
The ODBC standard supports an assortment of scalar types (integers, floating-point numbers, strings, etc.) and thereby delineates the subset of HDF5 datatypes that can be faithfully mapped to SQL types. Currently, if an HDF5 datatype cannot be converted to an SQL type, it is suppressed by the driver, i.e., the corresponding dataset is not exposed at all, or the corresponding field in a compound type is unavailable. You are probably aware that the values of HDF5 datasets are (logically) dense rectilinear arrays. However, the relational model has no notion of table dimensionality; a table holds a set of rows. Most RDBMS vendors support row identifiers, but this is not a relational concept either. It’s unclear what most users would find beneficial. For now, the HDF5/ODBC driver provides tables as row sets of dataset elements with an additional column, CPOS, which for each row contains the linearized C-array index of the underlying dataset element. (See the figure below.)
If this is not what you’d expect, let us know what you think should be the default behavior! As we’ll see in a moment, it’s quite easy to support different behaviors.
The current version of the HDF5/ODBC driver is a one-way street, it’s read-only. Yes, the ODBC standard covers modifying operations such as INSERTs and UPDATEs, but the first version of our driver will be a read-only affair.
One HDF5 File = One ODBC Resource (for now)
To keep this discussion simple, let’s assume there is a one-to-one correspondence between databases (=ODBC data sources) and HDF5 files. (I know, you have 10,000 HDF5 files, and you don’t want 10,000 separate databases. You could get around this limitation, for example, using mechanisms such as external links and virtual datasets , and create a single HDF5 file view of your collection. In the release version of the driver, we might support more convenient methods, but let’s not get carried away.)
Let’s think about file location for a moment. The HDF5 file might just sit on your desktop or in a file system attached your machine. Let’s call this the standaloneconfiguration of the ODBC driver deployment. In the figure below, this is shown on the left-hand side. In this mode, the ODBC driver is just a shared library running in the context of your client application processes.
If the file is stored on a remote server, you will get the same appearance as long as you can connect to the remote host via TCP. In this case, there will be an ODBC server process running on the remote machine, handling requests from ODBC clients. Let’s call this the client/server configuration of the ODBC driver deployment, which is shown on the right in the figure above. The HDF5/ODBC driver supports both configurations.
(Default) Schema on Read
Since HDF5 datasets are not usually created with SQL in mind, there is no prescribed schema for SQL to use to interpret the data. We resolve this by employing so-called schema on read. This refers to an environment’s ability to impose a schema at read time, regardless of how the data is stored.
A table name is a three part construct:
For this discussion, think of CATALOG as synonymous with ‘database.’ By default, in HDF5/ODBC, the CATALOG name is HDF5. SCHEMA represents an individual user’s (or a group of users) view of the data stored in the catalog, including access permissions, etc. The TABLE part is what we usually associate with table names when writing queries. Thus, in our case, TABLE typically represents an HDF5 dataset.
For example, let /group1/A/dset2 be an HDF5 path name of a dataset. By default, the HDF5/ODBC driver would present this dataset to an application as the table
(In SQL, names containing non-alphanumeric characters must be included in double quotation marks.)
What about HDF5 attributes?! The default view exposes HDF5 attributes in a table called HDF5.VRTL.Attributes. This table has three string-valued columns, ObjectPath, AttributeName,
AttributeValue, which represent the HDF5 path name of the attributee, the attribute name, and attribute value, respectively.
For some of you the default view might be of limited use. It’s just a “relational HDF5 baseline.” You can customize the HDF5/ODBC driver’s behavior with an XML-based configuration file. It gives you total control over the catalog name, the schemata, and table names. You choose which datasets to show (or to hide), and which fields of a compound type to expose, and what the column names should be. At the moment, the configuration file creation is a manual process, but there might be some helper tools included in the release version.
Ok, enough technical detail already. Let’s see a more elaborate example!
Make Sure You Get Enough R
If you’ve read the previous sections and had some exposure to R , then the following script should be easy to follow. Otherwise, just read the comments.
# load the RODBC module and create a connection to the ODBC resource “Demo”
# print the table names in schema “/tickdata/”
readline(“Press <return to continue”)
# run a query against an HDFEOS dataset and calculate avg., med. and stdev.
paste(“select VALUE from HDF5.\”/eos/HDFEOS/SWATHS/HIRDLS/Data Fields/\”.O3″,
“where VALUE > -999.0”))
readline(“Press <return to continue”)
# read and plot a time series
data<- sqlQuery(con, “select * from HDF5.\”/tickdata/\”.\”18-09-2011\””)
# count the number of HDF5 attributes and print the distinct attribute names
print(sqlQuery(con, “select count(*) from HDF5.VRTL.Attributes”))
data<-sqlQuery(con, “select distinct AttributeName from HDF5.VRTL.Attributes”)
# close the connection
Personally, I’m not fond of the R syntax, but I could say the same about celery stalks, and look at those nutrition facts!
A Cautionary Tale of Two Bridges
In the book ZeroMQ  , a conversation between two engineers ends like this:
The first engineer was silent. “Funny thing,” he said, “my bridge was demolished about 10 years after we built it. Turns out it was built in the wrong place and no one wanted to use it. Some guys had thrown a rope across the gorge, a few miles further downstream, and that’s where everyone went.”
Without bridges to other ecosystems, HDF5 is just another island with all the joys and eccentricities of island life. At present, HDF5/ODBC is a floating bridge because it might be misplaced, and some of its assumptions might be flawed. Hence this invitation to join the discussion and help us find the right place for that more permanent arch, beam, or suspension bridge to take its place.
We are currently planning for a Q2 2016 release of the product. In the meantime, we are working with a few early adopters on finalizing the initial feature set. If you have additional questions about HDF5/ODBC, or if you would like to become an early adopter, please contact us by sending an email to email@example.com.
ZeroMQ by Pieter Hintjens, O’Reilly 2013
An Introduction to Database Systems by C.J. Date, Pearson 2004
ODBC 3.5 Developer’s Guide by Roger E. Sanders, McGraw-Hill 1999
Python and HDF5 by Andrew Collette, O’Reilly 2013