Liberating Real-Time Data via HDF5: The Fastest Approach for Exposing Embedded Data for Analysis, Machine Learning, and Cloud-Enabled Services
FishEye Software, Inc., www.FishEye.net/HDF
The HDF Group’s technical mission is to provide rapid, easy and permanent access to complex data. FishEye’s vision is “Synthesizing the world’s real-time data”. This white paper is intended for embedded system users, software engineers, integrators, and testers who use or want to use HDF5 to access, collect, use, and analyze machine data. FishEye has developed an innovative process that provides the most efficient method of exposing data from embedded systems, simplifying access to that data and liberating it for real-time analysis, machine learning, and cloud-enabled services.
This paper describes FishEye’s approach to liberating the variety of complex data schemes and formats found in embedded systems through a simple approach that leverages the high-performance and massively scalable HDF5 data file format. FishEye’s process allows heterogeneous data sources to be homogenized into a single standard to facilitate real-time analysis and machine learning.
Real-time embedded systems generate massive flows of complex data that make it difficult to ensure that critical data is identified and captured while avoiding data overflow. This critical data is needed for analysis of anomalies, for training machine learning systems, and for cloud-enabled services.
Over a 20-year history of building real-time radars, FishEye has seen the many challenges in evaluating performance, debugging, and validating complex real-time embedded systems. We observe that engineers building and using these systems continually need access to data for analysis. But analysis requires access to the appropriate system data without perturbing the system under assessment. This typically means it’s not practical to perform compute-intensive conversion to open interfaces or standard data formats.
Data flows can easily be overloaded with irrelevant data. The ability to analyze systems while they are running can lead to tremendous productivity gains. More recently, there is a drive to harness cloud services and machine learning to gain new insight and new capabilities from complex systems. But such capabilities require a combination of features: non-intrusive access to internal data, curation to keep only the most relevant data, efficient processing algorithms, and visualization tools.
FishEye found a powerful ally in overcoming data complexities in real-time embedded systems with a technology called HDF5. HDF5 is a high-performance data management and storage technology designed for fast I/O processing and storage. It was born out of the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. HDF5 began as an architecture-independent software library and file format to address the need to move scientific data among the many different computing platforms in use at NCSA at that time. Today, the purpose of HDF5 is to provide rapid, easy and permanent access to complex data.
Embedded Systems are Challenging
Real-time embedded systems are challenging to develop, integrate, and sustain. These system challenges span their entire lifecycle and include:
- Too Much Data – Systems save it all to be sure to have the right data
- Difficult to Access, Share, and Exploit Data – Key data is often hidden, unavailable, or in formats difficult to use
- Slow to Debug – Operation is in real-time but analysis is in batch
- Difficult to Compare over Time – Data structures evolve over time so older data has a different format
- Expensive to Maintain – With system-specific data, it is a complex process to configure, access, convert, and manage data
- Incomplete – Data capture and analysis are not the primary objective when developing these systems so key data is often not available
A Traditional Approach
Most systems save every piece of data that might ever be needed. The archived data is accompanied by data dictionaries that must be maintained separately or by tools that are tightly coupled to the system being analyzed. System modifications, the need for regression analysis, project-specific tools and interfaces, and data dictionary updates make the analysis process complex, turn configuration into a nightmare, and create a costly, cascading maintenance effort.
Complex System Big Fast Data
FishEye developed an innovative approach out of the innate challenges of developing software for phased-array radars, where measured speeds approach the speed of light and 4-D data easily overflows computing resources. One part of this is “Metadata Injection,” which flips the typical approach by doing some homework before run-time. This upfront effort maximizes the throughput of the run-time data pipeline by avoiding data conversion in the real-time processing chain. The approach stages metadata upfront so that data can be piped directly to HDF5 without manipulation and later accessed and curated on demand, allowing performance analysis in real time. HDF5’s strengths include handling self-describing heterogeneous data and easy cross-platform sharing.
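The idea of staging metadata before run-time can be illustrated with a minimal sketch. It uses only Python's standard `struct` module and an in-memory buffer as a stand-in for the HDF5 sink. The record layout, the field names, and the `capture` and `read_record` helpers are assumptions made for illustration and are not FishEye's implementation.

```python
import io
import struct

# Hypothetical track record; field names and type codes are illustrative only.
FIELDS = [("time_s", "d"), ("range_m", "f"), ("azimuth_deg", "f")]

# Homework done before run-time: the layout description is built once.
LAYOUT = struct.Struct("<" + "".join(code for _, code in FIELDS))


def capture(sink, raw_record):
    """Hot path: append the machine-format bytes untouched, with no conversion."""
    sink.write(raw_record)


def read_record(buffer, index):
    """Egress path: decode one record on demand using the staged layout."""
    values = LAYOUT.unpack_from(buffer, index * LAYOUT.size)
    return dict(zip((name for name, _ in FIELDS), values))


sink = io.BytesIO()
# Stand-in for bytes the embedded application would already hold in memory.
for i in range(3):
    capture(sink, LAYOUT.pack(0.5 * i, 1000.0 + i, 45.0))

second = read_record(sink.getvalue(), 1)
print(second)
```

The capture path does no per-field work; interpretation is deferred until a reader asks for a specific record.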
This approach is driven and maintained by following several guiding principles.
- Favor Egress over Ingress – Prioritize data extraction performance over insertion
- Embrace Machine Data – use data as it exists; don’t convert
- Expect Overflow – There is too much data to collect, so curate proactively
- Simplify Real-Time – Make real-time data collection easy and unintrusive
- Create Information in Real-Time – Show insights as soon as possible
The FishEye Real-Time Platform is a set of micro-services that collectively provide the capabilities shown in the figure above. Data flows from left to right. The Platform captures metadata directly from the code or from the data communication protocol. It captures data from a wide variety of sources, including stored files, sniffed network traffic using varied protocols such as Data Distribution Service (DDS), or an API called from within the application code. Capture and curation are both configurable, with the added flexibility of complex event processing, and analysis can be performed in real time. Data can be curated using various filters, including filters based on time, on number of samples, or on the data itself through configurable logical rules that trigger events through FishEye’s streaming data analytics Complex Event Processing (CEP). Data can be published in varying formats, including HDF5, CSV, text, or any other desired format, through transformation services driven by Metadata Injection. Data can be stored in a Data Lake or a Cloud service. Data stakeholders can access the data as input to Artificial Intelligence and Machine Learning (AI/ML) for training, prediction, and prognostics.
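The three kinds of curation filter named above (time, number of samples, and data-driven rules) can be sketched in a few lines. The `Curator` class, its parameters, and the signal-to-noise rule are hypothetical and only illustrate how such filters might combine; they do not describe the Platform's CEP engine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Curator:
    """Hypothetical curation filter: time window, sample budget, and a data rule."""
    start_s: float
    end_s: float
    max_routine: int                    # budget for samples that trigger no rule
    rule: Callable[[Dict], bool]        # logical rule evaluated on the data itself
    kept: List[Dict] = field(default_factory=list)
    events: List[Dict] = field(default_factory=list)
    routine_count: int = 0

    def offer(self, sample: Dict) -> bool:
        """Return True if the sample is kept, False if it is curated away."""
        if not (self.start_s <= sample["time_s"] <= self.end_s):
            return False                # outside the time window
        if self.rule(sample):
            self.events.append(sample)  # rule fired: always keep and record the event
        elif self.routine_count >= self.max_routine:
            return False                # routine sample, but the budget is spent
        else:
            self.routine_count += 1
        self.kept.append(sample)
        return True


snr_values = [5, 5, 5, 20, 5, 5, 5, 5, 25, 5]
curator = Curator(start_s=1.0, end_s=8.0, max_routine=2,
                  rule=lambda s: s["snr_db"] > 15)
for t, snr in enumerate(snr_values):
    curator.offer({"time_s": float(t), "snr_db": snr})

kept_times = [s["time_s"] for s in curator.kept]
event_times = [s["time_s"] for s in curator.events]
print(kept_times, event_times)
```

In this sketch a sample that fires the rule is never dropped for budget reasons, which matches the stated goal of keeping the most relevant data when the flow overflows.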
How Metadata Injection Liberates Embedded Data
FishEye’s Real-Time Platform serves as an automated high-performance bridge between Java, C++, or Ada applications and HDF5. It is a simple way to manage all the complexity of real-time systems while retaining all the associated HDF5 advantages.
FishEye’s RTTK automatically extracts metadata and provides a high-performance buffered data pipe directly into HDF5. Changes to an application and its internal schema are picked up automatically and reflected in HDF5 without program maintenance.
FishEye’s MetaGen automatically inspects the program data and extracts data structure information, thus simplifying the process of accessing data. This automation eliminates data dictionary maintenance as the system evolves.
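As a rough analogy for extracting data structure information automatically, the sketch below introspects a C-style structure declared with Python's standard `ctypes` module and derives a field table from it. The `TrackReport` structure and the `describe` helper are hypothetical; how MetaGen itself inspects Java, C++, or Ada programs is not described in this paper.

```python
import ctypes


class TrackReport(ctypes.Structure):
    """Hypothetical application structure; fields ordered to avoid padding."""
    _fields_ = [
        ("time_s", ctypes.c_double),
        ("track_id", ctypes.c_uint32),
        ("position_m", ctypes.c_float * 3),
    ]


def describe(struct_type):
    """Derive a data dictionary (name, byte offset, byte size) from the type itself."""
    schema = []
    for name, _ctype in struct_type._fields_:
        descriptor = getattr(struct_type, name)  # ctypes field descriptor
        schema.append({
            "name": name,
            "offset": descriptor.offset,
            "size": descriptor.size,
        })
    return schema


schema = describe(TrackReport)
for entry in schema:
    print(entry)
```

Because the table is derived from the declaration, adding or reordering a field changes the schema without any hand-maintained data dictionary.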
Typically, complex systems produce huge amounts of data that overflow data collection capabilities. Curation selectively filters out less important data to ensure that important data is kept.
Metadata Injection describes RTTK’s capability to automatically extract metadata from source code, executables, or directly from a network through protocols such as DDS. This capability substantially reduces development and continuing software maintenance costs by eliminating the burden of keeping tools and software in sync with data formats.
On startup or at run-time, data structure information can be captured and used to collect and record data as well as to provide the metadata that is stored with the data in the HDF5 file. Data that was captured at various times with evolving data structures can be accessed without a single conversion through the self-describing HDF5 file. Changes to data structures do not require changes to the data collection nor do metadata files need to be maintained in synchrony with the application.
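The benefit of storing metadata with the data can be shown with a small stand-in. The container below is a plain length-prefixed format invented for this sketch, not the HDF5 file format, and the field names are assumptions. Each block carries its own layout, so records written before and after a structure change are read by the same code without conversion.

```python
import io
import json
import struct


def write_block(stream, fields, rows):
    """Write one self-describing block: length-prefixed JSON metadata, then raw rows."""
    layout = struct.Struct("<" + "".join(code for _, code in fields))
    header = json.dumps({"fields": fields, "count": len(rows)}).encode("utf-8")
    stream.write(struct.pack("<I", len(header)))
    stream.write(header)
    for row in rows:
        stream.write(layout.pack(*row))


def read_blocks(stream):
    """Read every block using only the metadata stored alongside the data."""
    records = []
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            break  # end of stream
        (header_len,) = struct.unpack("<I", prefix)
        meta = json.loads(stream.read(header_len).decode("utf-8"))
        names = [name for name, _ in meta["fields"]]
        layout = struct.Struct("<" + "".join(code for _, code in meta["fields"]))
        for _ in range(meta["count"]):
            values = layout.unpack(stream.read(layout.size))
            records.append(dict(zip(names, values)))
    return records


stream = io.BytesIO()
# Older structure: two fields.
write_block(stream, [("time_s", "d"), ("range_m", "f")], [(0.0, 100.0)])
# Evolved structure: a field was added later in the system's life.
write_block(stream, [("time_s", "d"), ("range_m", "f"), ("snr_db", "f")],
            [(1.0, 110.0, 12.5)])

stream.seek(0)
records = read_blocks(stream)
print(records)
```

The reader has no compiled-in knowledge of either structure version, which is the property that removes the need to keep separate metadata files in step with the application.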
The HDF5 technologies provide a seamless transfer of data between heterogeneous computing platforms. HDF5’s ability to store data in a host system’s “natural” binary format is aligned with FishEye’s philosophy of never converting data, which maximizes performance and minimizes impact on operations. HDF5 is also well suited to storing big data. All of these demands are found in embedded real-time systems. The FishEye Real-Time Platform maintains a minimal footprint so that data capture can occur without impact on normal system performance.
Application Programming Interface (API)
Figure 4 above shows an example of the Platform’s API, one of the mechanisms available for data capture. Simple commands instruct the Real-Time Platform to capture applicable metadata, define an ID for an FDA (the term for a data field), and then start capturing the data. Finally, the application shuts down capture.
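The four-step sequence described above (capture metadata, define an FDA ID, start capture, shut down) can be mimicked with a small in-memory stand-in. Every name here, including `CaptureSession` and its methods, is invented for illustration and is not the Real-Time Platform's actual API, which this paper shows only in its figure.

```python
class CaptureSession:
    """Hypothetical stand-in for a capture API; not the FishEye Real-Time Platform."""

    def __init__(self):
        self.metadata = {}    # structure name -> list of field names
        self.fda_ids = {}     # structure name -> assigned ID
        self.records = []     # (fda_id, raw bytes) pairs
        self.active = False

    def capture_metadata(self, name, fields):
        """Step 1: register the description of a data structure."""
        self.metadata[name] = list(fields)

    def define_fda(self, name):
        """Step 2: assign an ID to a described data item."""
        if name not in self.metadata:
            raise KeyError(f"no metadata captured for {name!r}")
        fda_id = len(self.fda_ids) + 1
        self.fda_ids[name] = fda_id
        return fda_id

    def start(self):
        """Step 3: begin accepting data."""
        self.active = True

    def record(self, fda_id, payload):
        """Store the raw bytes as given, without conversion."""
        if not self.active:
            raise RuntimeError("capture is not running")
        self.records.append((fda_id, bytes(payload)))

    def shutdown(self):
        """Step 4: stop capture."""
        self.active = False


session = CaptureSession()
session.capture_metadata("track_report", ["time_s", "range_m"])
track_fda = session.define_fda("track_report")
session.start()
session.record(track_fda, b"\x00" * 12)
session.shutdown()
print(track_fda, len(session.records), session.active)
```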
The API is only one of the many approaches that can be used in managing data capture. Data capture through network protocols, as an alternative, is accomplished without modification of the application.
Analysis + HDF5 in Real-Time
The FishEye Real-Time Platform allows analysis and visualization of data in real time as it is being collected. The screenshot above is a snapshot of a demonstration where data is collected and stored in HDF5 (right-hand pane) and displayed as it is collected (left-hand pane). This capability can be used, for example, by an operator to determine whether a run is successful or should be terminated early, thus saving test time. It can be used by testers to validate that the system is performant while it is actively under test. An integrator can see that data is moving between subsystems correctly and on time.
Benefits of Liberating Data in Real-Time
Liberating data from embedded applications can lead to many benefits. Increased system performance and efficiency come from eliminating data conversion, providing real-time analysis, and reducing the amount of collected data and the storage required to keep it.
- Reductionless data increases test efficiency: 4x tempo, same cost
- Innovation and Productivity increased as much as 144x: 12-hour analytics reduced to 5 mins
- Data size cut 50x: 100GB logged into 2GB
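The ratios quoted in the list above follow directly from the figures given with them, as this short check shows. The variable names are chosen here for illustration.

```python
# Analysis time: 12 hours reduced to 5 minutes.
analysis_before_min = 12 * 60
analysis_after_min = 5
speedup = analysis_before_min / analysis_after_min

# Data size: 100 GB logged, 2 GB kept.
logged_gb = 100
kept_gb = 2
reduction = logged_gb / kept_gb
storage_saved_pct = 100 * (logged_gb - kept_gb) / logged_gb

print(speedup)            # 144.0
print(reduction)          # 50.0
print(storage_saved_pct)  # 98.0
```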
Other benefits include:
- Accelerate Analysis – No post-processing is required and analysis can be performed in real-time
- Lower Data and Computing Costs – Reduced storage, reduced maintenance through self-describing data
- Enable Machine Learning – Capturing essential data and using self-described data allows data captured over long periods of time to be used for training.
- Enable Cloud Services – Streamline data exchange between the system and stakeholders, analyze across a history of collected data, and build multi-system machine learning training data sets.
- Standardize Real-Time Data Infrastructure – A system that includes heterogeneous applications that differ in their data approaches can be made homogeneous in using HDF5.
For More Information
We would like to hear about your challenges with real-time embedded systems. Please reach out at FishEye.net/Ted. You can find more information about HDF at FishEye.net/HDF and about FishEye’s Real-Time Platform at FishEye.net/RTTK.