Liberating Real-Time Data via HDF5: The Fastest Approach for Exposing Embedded Data for Analysis, Machine Learning, and Cloud-Enabled Services

FishEye Software, Inc.

Executive Summary

The HDF Group’s technical mission is to provide rapid, easy, and permanent access to complex data. FishEye’s vision is to synthesize the world’s real-time data. This white paper is intended for embedded system users, software engineers, integrators, and testers who use or want to use HDF5 to access, collect, use, and analyze machine data. FishEye has developed an innovative application, the Pelagic Real-Time Data Platform, which provides the most efficient method to expose data from embedded systems and liberates that data for real-time analysis, machine learning, and cloud-enabled services.

This paper describes FishEye’s approach to accessing the wide variety of complex data schemes and formats found in embedded systems by leveraging the high-performance, massively scalable HDF5 file format. FishEye’s patented Metadata Injection™ process allows heterogeneous data sources to be homogenized into a single standard to facilitate real-time analysis and machine learning.

Introduction

Real-time embedded systems generate massive flows of complex data that make it difficult to ensure that critical data is identified and captured while avoiding data overflow. This critical data is needed for the analysis of anomalies, for training machine learning systems, and for cloud-enabled services.

Over a 20-year history of building real-time radars, FishEye has seen the many challenges of evaluating performance, debugging, and validating complex real-time embedded systems. We observe that the engineers who build and use these systems continuously require access to data for analysis. However, analysis requires access to the appropriate system data without perturbing the system under assessment. Typically, it is not practical to perform compute-intensive conversion to open interfaces or standard data formats. The ability to analyze systems while they are running can lead to tremendous productivity gains.

Another common problem is that data flows can easily be overloaded with irrelevant data. FishEye has worked to create software that can address this problem by performing data curation while systems are ingesting the data, which allows for massive productivity gains.

More recently, there has been a drive to harness cloud services and machine learning to gain new insights and capabilities from complex systems. The gains afforded by these processes can put companies on the bleeding edge, but such capabilities require a combination of features: non-intrusive access to internal data, curation to keep only the most relevant data, efficient processing algorithms, and visualization tools.

In HDF5, FishEye found a powerful ally for overcoming data complexities in real-time embedded systems. HDF5 is a high-performance data management and storage technology designed for fast I/O processing and storage. It was born out of the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. HDF5 began as an architecture-independent software library and file format to address the need to move scientific data among the many different computing platforms in use at NCSA at that time. Today, the purpose of HDF5 is to provide rapid, easy, and permanent access to complex data.

Embedded Systems are Challenging

Real-time embedded systems are challenging to develop, integrate, and sustain. These system challenges span their entire lifecycle and include:

- Too Much Data – Systems try to save it all to ensure access to “the right” data
- Difficult to Access, Share, and Exploit Data – Key data is often hidden, unavailable, or in formats difficult to use
- Slow to Debug – Operation is in real-time, but the analysis is in batch
- Difficult to Compare over Time – Data structures evolve over time causing differences in format
- Expensive to Maintain – System-specific data is complex and costly to process, configure, access, convert, and manage
- Incomplete – Data capture and analysis are not the primary objectives when developing these systems so key data is often not

A Non-Traditional Approach

FishEye’s approach grew out of the innate challenges of developing software for phased-array radars, where measured speeds approach the speed of light and 4-D data easily overwhelms computing resources. A key part of this approach is Metadata Injection™, which flips the typical order of work by doing some homework before run-time. This upfront effort maximizes the performance of the run-time data pipeline by avoiding data conversion in the real-time processing chain. Metadata is staged upfront so that data can be piped directly to HDF5 without manipulation and later accessed and curated on demand, allowing performance analysis in real-time. HDF5’s strengths include handling self-described heterogeneous data and allowing easy cross-platform sharing.
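The idea of staging metadata upfront and piping unconverted bytes can be illustrated with a minimal sketch. It is not Pelagic's implementation and it does not use HDF5: it uses only Python's standard library, and the record layout (`TRACK_LAYOUT`), its field names, and the helper functions are all hypothetical. The layout is written once ahead of the data, the real-time path copies bytes untouched, and decoding happens only when an analyst reads the stream.

```python
import io
import json
import struct

# Hypothetical record layout, described once before run-time ("staged" metadata).
TRACK_LAYOUT = {
    "name": "Track",
    "byte_order": "<",  # little-endian, as on the assumed host
    "fields": [["time_s", "d"], ["range_m", "f"], ["track_id", "I"]],
}


def layout_format(layout):
    """Build the struct format string from the staged layout."""
    return layout["byte_order"] + "".join(code for _, code in layout["fields"])


def write_header(stream, layout):
    """Store the layout once, ahead of the data, as length-prefixed JSON."""
    blob = json.dumps(layout).encode("utf-8")
    stream.write(struct.pack("<I", len(blob)))
    stream.write(blob)


def append_raw(stream, raw_record):
    """Real-time path: copy the bytes as they are, with no per-field work."""
    stream.write(raw_record)


def read_records(stream):
    """Analysis path: decode on demand using the layout stored in the stream."""
    (size,) = struct.unpack("<I", stream.read(4))
    layout = json.loads(stream.read(size).decode("utf-8"))
    fmt = layout_format(layout)
    names = [name for name, _ in layout["fields"]]
    record_size = struct.calcsize(fmt)
    while True:
        chunk = stream.read(record_size)
        if len(chunk) < record_size:
            break
        yield dict(zip(names, struct.unpack(fmt, chunk)))


# Simulate an application handing over records in its native binary form.
fmt = layout_format(TRACK_LAYOUT)
stream = io.BytesIO()
write_header(stream, TRACK_LAYOUT)
append_raw(stream, struct.pack(fmt, 0.5, 1200.0, 7))
append_raw(stream, struct.pack(fmt, 1.0, 1150.0, 7))

stream.seek(0)
records = list(read_records(stream))
```

In this sketch the cost of interpreting fields is paid by the reader, not by the producer, which mirrors the "favor egress over ingress" principle listed below.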

This approach is driven and maintained by following several guiding principles.

- Favor Egress over Ingress – Data extraction performance over insertion
- Embrace Machine Data – Use data as it exists; zero conversions
- Expect Overflow – Too much data to collect countered by proactive curation
- Simplify Real-Time – Make real-time data collection easy and unintrusive
- Create Information in Real-Time – Gain insights as soon as possible

Figure 1 – FishEye’s Pelagic Real-Time Data Platform

FishEye’s Pelagic Real-Time Data Platform is a set of containerized microservices that collectively provide the capabilities shown in Figure 1. Data flows from left to right. Pelagic captures metadata directly from the code or from the data communication protocol. It captures data from a wide variety of sources, including stored files, sniffed network traffic using protocols such as Data Distribution Service (DDS), and an API called from within the application code. Capture and curation are both configurable, with the added flexibility of Complex Event Processing (CEP), and analysis can be performed in real-time. Data can be curated using filters based on time, on number of samples, or on the data itself through configurable logical rules that trigger events in FishEye’s streaming data analytics CEP. Data can be published in HDF5, CSV, text, or any other desired format through transformation services driven by Metadata Injection™. Data can be stored in a data lake or a cloud service. Data stakeholders can then use the data as input to Artificial Intelligence and Machine Learning (AI/ML) for training, prediction, and prognostics.
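The three kinds of curation filter named above (time, number of samples, and rules on the data itself) can be sketched as follows. This is an illustrative toy, not Pelagic's CEP engine: the `Rule` and `Curator` classes, the `snr_db` field, and the thresholds are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Sample = Dict[str, float]


@dataclass
class Rule:
    """A named logical condition evaluated against each incoming sample."""
    name: str
    condition: Callable[[Sample], bool]


@dataclass
class Curator:
    """Keeps samples inside a time window, up to a cap, and logs rule events."""
    start_s: float
    end_s: float
    max_samples: int
    rules: List[Rule] = field(default_factory=list)
    kept: List[Sample] = field(default_factory=list)
    events: List[str] = field(default_factory=list)

    def offer(self, sample: Sample) -> bool:
        # Rules run on every sample so that events are not missed
        # even when the sample itself is discarded.
        for rule in self.rules:
            if rule.condition(sample):
                self.events.append(f"{rule.name}@{sample['time_s']}")
        in_window = self.start_s <= sample["time_s"] <= self.end_s
        has_room = len(self.kept) < self.max_samples
        if in_window and has_room:
            self.kept.append(sample)
            return True
        return False


# Hypothetical configuration: keep at most 2 samples between t=1 s and t=3 s,
# and raise an event whenever the signal-to-noise ratio drops below 10 dB.
curator = Curator(
    start_s=1.0,
    end_s=3.0,
    max_samples=2,
    rules=[Rule("low_snr", lambda s: s["snr_db"] < 10.0)],
)

for t, snr in [(0.5, 20.0), (1.0, 8.0), (2.0, 15.0), (2.5, 5.0), (4.0, 12.0)]:
    curator.offer({"time_s": t, "snr_db": snr})

kept_times = [s["time_s"] for s in curator.kept]
```

With this configuration the sample at 2.5 s is dropped by the sample cap, yet its low-SNR event is still recorded, which is the kind of separation between storing data and detecting events that rule-based curation allows.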

How Metadata Injection Liberates Embedded Data

FishEye’s Pelagic Real-Time Data Platform is an automated high-performance bridge between Java, C++, or Ada applications and HDF5. It is a simple way to manage all the complexity of real-time systems while retaining all the associated HDF5 advantages.

Pelagic automatically extracts metadata and provides a high-performance buffered data pipe directly into HDF5. Updates to applications and internal schema are automatically amended and inserted into HDF5 without program maintenance.

FishEye’s MetaGen Toolset automatically inspects the program’s data and extracts data structure information, thus simplifying the process of accessing data. This automation eliminates data dictionary maintenance as the system evolves.

Figure 3 – The Metadata Injection™ Process

Metadata Injection describes Pelagic’s capability to automatically extract metadata from source code, executables, or even directly from a network through protocols such as DDS. This capability substantially reduces development and continuing software maintenance costs by eliminating the burden of keeping tools and software in sync with data formats.

On startup or at run-time, data structure information can be captured and used to collect and record data as well as to provide the metadata that is stored with the data in the HDF5 file. Data that was captured at various times with evolving data structures can be accessed without a single conversion through the self-describing HDF5 file. Changes to data structures do not require changes to the data collection nor do metadata files need to be maintained in synchrony with the application.
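A small sketch can show why storing the structure description with the data removes the need for conversion when structures evolve. It uses plain Python rather than HDF5, and the two message versions, their field names, and `read_column` are hypothetical. Each batch carries its own layout, so a reader can pull one field across batches recorded with different structure versions.

```python
import struct

# Two hypothetical versions of the same message; the second added a field.
# Each batch keeps its own layout next to its raw bytes.
BATCHES = [
    {
        "fields": [("time_s", "d"), ("range_m", "f")],
        "data": struct.pack("<df", 0.0, 900.0) + struct.pack("<df", 1.0, 905.0),
    },
    {
        "fields": [("time_s", "d"), ("range_m", "f"), ("doppler_hz", "f")],
        "data": struct.pack("<dff", 2.0, 910.0, 12.5),
    },
]


def read_column(batches, wanted):
    """Collect one field across batches, using each batch's own layout."""
    values = []
    for batch in batches:
        names = [name for name, _ in batch["fields"]]
        if wanted not in names:
            continue  # this version of the structure never had the field
        fmt = "<" + "".join(code for _, code in batch["fields"])
        index = names.index(wanted)
        for row in struct.iter_unpack(fmt, batch["data"]):
            values.append(row[index])
    return values


ranges = read_column(BATCHES, "range_m")
dopplers = read_column(BATCHES, "doppler_hz")
```

Neither batch is rewritten when the structure changes; the reader simply follows whichever description was stored with the bytes it is reading.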

The HDF5 technologies provide a seamless transfer of data between heterogeneous computing platforms. HDF5’s ability to store data in a host system’s “natural” binary format is aligned with FishEye’s philosophy of never converting data, which maximizes performance and minimizes impact on operations. HDF5 is also well suited to storing big data because it does not limit the size or number of data objects, which adds flexibility. Embedded real-time systems place all of these demands on their data infrastructure. The Pelagic Real-Time Data Platform maintains a minimal footprint so that data capture can occur without impacting normal system performance.

Application Programming Interface (API)

Figure 4 – Example of Real-Time Metadata Injection

Figure 4 shows an example of Pelagic’s API, one of the mechanisms available for data capture. Simple commands instruct Pelagic to capture the applicable metadata, define an ID for an FDA (the term for a data field), and then start capturing the data. Finally, the application shuts down capture after the data has been ingested.
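Since the figure is not reproduced here, the sequence it describes (capture metadata, define a field ID, capture data, shut down) can be sketched with a stand-in class. Everything in this block is hypothetical: `CaptureSession` and its method names are invented for illustration and are not Pelagic's actual API.

```python
class CaptureSession:
    """Hypothetical stand-in for a capture API; not Pelagic's interface."""

    def __init__(self):
        self.metadata = {}    # type name -> list of field names
        self.field_ids = {}   # numeric ID -> (type name, field name)
        self.records = []     # captured (field ID, value) pairs
        self.active = False

    def register_metadata(self, type_name, fields):
        """Step 1: record the structure description before any data flows."""
        self.metadata[type_name] = list(fields)

    def define_field_id(self, type_name, field_name):
        """Step 2: assign an ID to one data field of a registered type."""
        if field_name not in self.metadata[type_name]:
            raise KeyError(f"{type_name} has no field {field_name}")
        field_id = len(self.field_ids) + 1
        self.field_ids[field_id] = (type_name, field_name)
        return field_id

    def start(self):
        """Step 3: begin accepting data."""
        self.active = True

    def capture(self, field_id, value):
        """Store one value for a previously defined field ID."""
        if not self.active:
            raise RuntimeError("capture has not been started")
        if field_id not in self.field_ids:
            raise KeyError(f"unknown field ID {field_id}")
        self.records.append((field_id, value))

    def shutdown(self):
        """Step 4: stop capture once the data has been ingested."""
        self.active = False


session = CaptureSession()
session.register_metadata("Track", ["time_s", "range_m"])
range_id = session.define_field_id("Track", "range_m")
session.start()
session.capture(range_id, 1200.0)
session.capture(range_id, 1150.0)
session.shutdown()
```

The point of the sketch is the ordering: metadata and IDs are settled before the first value arrives, so the capture call itself stays small.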

This API is only one of the many approaches that can be used in managing data capture with Pelagic. Data capture through network protocols, as an alternative, is accomplished without any modification of the application.

Analysis + HDF5 in Real-Time

Figure 5 – Real-Time Analysis and HDF5 Data Archive

The Pelagic Real-Time Data Platform allows for analysis and visualization of data in real-time as it is being collected. The figure above is a snapshot of a demonstration where data is collected and stored in HDF5 (right-hand pane) and displayed as it is collected (left-hand pane). For example, this capability can be used by an operator to determine whether a run is successful or if it should be terminated early, saving valuable test time. It can be used by testers to validate that the system is performant as it is actively under test. An integrator can see that data is moving between subsystems correctly and on time.

Benefits of Liberating Data in Real-Time

Liberating data from embedded applications can have many benefits. Eliminating data conversion, providing real-time analysis, and reducing the amount of collected data through careful curation increase system performance and efficiency across the board.

- Reductionless Data Increases Test Efficiency – 4x tempo at the same cost
- Innovation and Productivity Increased as much as 144x – What would be 12-hour analytics is reduced to 5 minutes
- Data Size Cut Down by 50x – 100 GB logged into 2 GB

Other benefits include:

- Accelerated Analysis – No post-processing is required, and analysis can be performed in real-time
- Lowering Data and Computing Costs – Reduced storage and reduced maintenance through self-describing data
- Enabling Machine Learning – Capturing essential data and using self-described data allows data captured over long periods to be used for training
- Enabling Cloud Services – Streamline data exchange between the system and stakeholders, analyze over a history of data collection, and build multi-system machine learning training data sets
- Standardizing Real-Time Data Infrastructure – A system that includes heterogeneous applications that differ in their data approaches can be made homogeneous using HDF5

For More Information

We would like to hear about your challenges with real-time embedded systems. Please reach out to contact@fisheye.net. You can also learn more about FishEye’s Pelagic Real-Time Data Platform at FishEyeSoftware.com/Pelagic.