When producing quality research, research assets should ideally speak for themselves. Research is not an error-free process; mistakes can happen, and they may lead to incorrect conclusions. Half-baked research is rarely a strong basis for derivative work. To ensure discipline in scientific reporting, both the data collection procedures and the data-driven workflows that generate results should be transparent, reproducible, and thereby verifiable.
What can we do about it?
There are several strategies to ensure the reusability and reproducibility of research assets. For instance, documenting usage instructions and commenting hard-to-understand areas of code improves reusability. Moreover, documenting the research process well and implementing workflows in a modular fashion improves reproducibility. In fact, improving these aspects has been a broad subject of research. Research contributions in this field can be categorized into three areas: Metadata, Scientific Workflows, and Visual Programming.
Metadata
By definition, metadata is "data that provide information about other data". Its purpose is to summarize basic information about data, which, in turn, simplifies how we work with data. Imagine being given a pile of data without any knowledge of what it is. You may spend some time inspecting the data just to get an idea of it. Even then, your understanding may be incomplete, or worse, different from what the data actually means. This is where metadata serves its purpose; having metadata lets others know exactly what the creator wants them to know, and lets them build upon that.
Depending on the problem being investigated, researchers may collect fresh data (e.g., user studies, application logs, sensor readings), reuse already-collected data (e.g., publicly available datasets), or do both. In any case, the data being collected or reused should carry metadata that conveys what the data means and, preferably, how it was collected. In such scenarios, having quality metadata helps to ensure reusability and reproducibility.
Scientific Workflows
Documenting the research process is critical to creating reproducible research assets. However, research processes can be fairly complex, and thereby painstaking to document in detail. In such cases, verifying the integrity of results becomes even more difficult. This is where 'scientific workflows' help; a scientific workflow, by definition, is "the description of a process for accomplishing a scientific objective, usually expressed in terms of tasks and their dependencies". Scientific Workflow Management Systems (e.g., Kepler, Pegasus) let users design workflows either visually, e.g., using data flow diagrams, or programmatically, using a domain-specific language. This makes it easy to share workflows that are runnable, thereby enabling others to verify both the research process and the results obtained.
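To give a rough sense of the "tasks and their dependencies" idea, here is a minimal sketch that models a tiny workflow as a dependency graph and runs its tasks in order. The task names and logic are hypothetical, and real systems like Kepler or Pegasus offer far richer semantics (scheduling, provenance, distributed execution).

```python
# A minimal, hypothetical sketch of a scientific workflow expressed as
# tasks and their dependencies, executed in dependency order.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (illustrative names only).
dependencies = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "extract_features": {"clean_data"},
    "train_model": {"extract_features"},
    "generate_report": {"train_model"},
}

def run_task(name: str) -> None:
    # Placeholder; a real workflow would invoke actual scripts or tools here.
    print(f"running task: {name}")

# Execute tasks so that every dependency runs before its dependents.
for task in TopologicalSorter(dependencies).static_order():
    run_task(task)
```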
Visual Programming
Visual programming is "a type of programming language that lets humans describe processes using illustration". It allows scientists to develop applications based on visual flowcharting and logic diagramming. Visual programming is beneficial for several reasons; primarily, applications built using visual programming are more accessible to novices. Moreover, having a visual representation makes it easy to understand applications at a conceptual level, analogous to how data is described via metadata.
Software such as Node-RED, Orange, NeuroPype, and Orchest provide visual-programming interfaces for designing data-driven workflows. Orange is geared towards exploratory data analysis and interactive data visualization, while NeuroPype is geared towards neuroimaging applications. Node-RED and Orchest, however, are more generic, and allow users to build data-driven applications with ease.
StreamingHub
First, I'll introduce StreamingHub. Next, I'll explain how we built scientific workflows for different domains using it, and share the lessons we learned in doing so.
Stream processing is inevitable; the need for stream processing stems from real-time applications such as stock prediction, fraud detection, self-driving cars, and weather prediction. For applications like these, latency is a critical factor that governs their practical use. After all, what good is a self-driving car if it cannot detect and avert hazards in real-time?
StreamingHub is a stream-oriented approach for data analysis and visualization. It consists of four components.
- Data Description System (DDS)
- Data Mux
- Workflow Designer
- Operations Dashboard
Data Description System (DDS)
DDS is a collection of metadata schemas for describing data streams, data sets, and data analytics. It provides three schemas: 1) Datasource schema, 2) Dataset schema, and 3) Analytic schema. Each schema is a blueprint of the fields needed to provide data-level insights.
Metadata created from these schemas may look as follows.
Datasource Metadata (Example)
Dataset Metadata (Example)
Analytic Metadata (Example)
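As a rough, hypothetical illustration of the kind of information such metadata captures, the snippet below sketches a datasource description. The field names are assumptions made for illustration, not the actual DDS schema; please refer to the DDS examples above for the real structure.

```python
# A hypothetical illustration of datasource metadata; field names are
# assumptions for illustration, not the actual DDS schema.
import json

datasource_metadata = {
    "device": {
        "model": "Example Eye Tracker",
        "manufacturer": "ACME",
        "category": "eye_tracker",
    },
    "streams": [
        {
            "name": "gaze",
            "frequency": 60,            # sampling frequency (Hz)
            "channels": ["x", "y"],     # normalized gaze coordinates
        }
    ],
}

print(json.dumps(datasource_metadata, indent=2))
```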
Data Mux
The data mux operates as a bridge between connected sensors, datasets, and data streams. It uses DDS metadata to create the data streams needed for a task, and streams both data and DDS metadata.
The data mux provides three modes of execution:
- Replay - stream data from a dataset
- Simulate - generate and stream simulated data from a dataset
- Live - stream real-time data from connected sensors
In replay mode, the Data Mux reads and streams data (files) at their recorded (sampling) frequency. In simulate mode, it generates and streams synthetic data (guided by test cases) at their expected sampling frequency. In live mode, it connects to (live) sensory data sources and streams their output. It utilizes DDS dataset/analytic metadata in the replay and simulate modes, and DDS datasource metadata in the live mode. When generating analytic outputs from data streams, the metadata from both the input source(s) and analytic process(es) is propagated into the output data, minimizing the manual labor required to make analytics reusable, reproducible, and thus verifiable.
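To make the replay mode concrete, here is a minimal sketch of the underlying idea: reading recorded samples and emitting them at their original sampling frequency. This is not the actual Data Mux implementation or API; the record format, the `emit` hand-off, and the 60 Hz rate are assumptions for illustration.

```python
# A minimal sketch of replay-mode streaming: emit recorded samples at
# their original sampling frequency (assumed to be 60 Hz here).
import time
from typing import Iterable

def emit(record: dict) -> None:
    # Hypothetical downstream hand-off; a real system would publish to a stream.
    print(record)

def replay(records: Iterable[dict], frequency_hz: float) -> None:
    interval = 1.0 / frequency_hz
    for record in records:
        emit(record)
        time.sleep(interval)  # pace the stream at the recorded rate

# Hypothetical recorded gaze samples (3 seconds at 60 Hz).
recorded = [{"t": i / 60.0, "x": 0.5, "y": 0.5} for i in range(180)]
replay(recorded, frequency_hz=60.0)
```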
Workflow Designer
The workflow designer is built on Node-RED. The image above shows an eye movement analysis workflow we designed with it. Depending on need, users may create sub-flows that implement custom logic; these can later be reused within more complex flows, or even shared with others. Here, the nodes labeled Stream Selector, IVT Filter, and Synthesizer are eye-movement-specific sub-flows that we implemented ourselves.
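As an illustration of what such a sub-flow might compute, the sketch below implements the basic velocity-threshold (I-VT) idea commonly used to separate fixations from saccades. It is a simplified stand-in rather than our actual IVT Filter sub-flow, and the sample values and threshold are assumptions.

```python
# A simplified sketch of velocity-threshold (I-VT) classification:
# samples whose point-to-point velocity exceeds a threshold are labeled
# saccades; the rest are labeled fixations.
import math

def ivt_classify(samples, velocity_threshold=1.0):
    """samples: list of (t, x, y) tuples; returns one label per sample."""
    labels = ["fixation"]  # the first sample has no preceding velocity
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = (t1 - t0) or 1e-9
        velocity = math.hypot(x1 - x0, y1 - y0) / dt
        labels.append("saccade" if velocity > velocity_threshold else "fixation")
    return labels

# Hypothetical gaze samples: (time in seconds, x, y in normalized coordinates)
samples = [(0.00, 0.10, 0.10), (0.02, 0.11, 0.10), (0.04, 0.40, 0.45), (0.06, 0.41, 0.46)]
print(ivt_classify(samples))
```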
Operations Dashboard
The operations dashboard is also generated using Node-RED. It provides interactive visualizations and data-stream control actions. Users can visualize the data generated at any point in the analytics workflow. All visualizations are dynamic, and update in real time as new data arrives. Moreover, the available visualization options are determined by the data type. The image above shows the operations dashboard that we created for an eye movement analysis task.
The video linked below shows a quick end-to-end demo of StreamingHub in action. Here, I show how a dataset can be replayed using the StreamingHub Data Mux, and subsequently processed/visualized via a domain-specific workflow in Node-RED.
In the future, we plan to include five data-stream control actions in the operations dashboard: start, stop, pause, resume, and seek. By doing this, we hope to enable users to inspect data streams temporally and perform visual analytics; a particularly useful feature when analyzing high-frequency, high-dimensional data. If you're interested in learning more, please refer to my paper titled "StreamingHub: Interactive Stream Analysis Workflows" at JCDL 2022 [Preprint]. Also check out my presentation on StreamingHub at the 2021 WS-DL Research Expo below.