If you have read the previous sections, you have seen some of the benefits of DataTurbine as a “black box” system: it separates the sources from the sinks and handles heterogeneous data types in a unified system. However, the primary reason to use DataTurbine is the ability to interact with data in real-time or near real-time.
DataTurbine is built around this capability, and its limitations for historical data are a direct consequence of its strength and speed at working with streaming real-time data.
In addition to working with live data, DataTurbine can stream archived data as if it were live, reusing common data viewers and infrastructure for post-test data analysis and review.
What is Real-Time Data
Real-time data refers to data that is delivered as soon as it is collected, with no delay in the availability of the information. This is in contrast to an archival system, which stores data until a later date.
DataTurbine can handle data sampled millions of times a second or as infrequently as once a century. In practice, many uses fall somewhere in between, with data sampled every second, minute, or hour.
As many remote sites can have drastic communication delays and do not require a strict time constraint, it would be more correct to refer to those systems as providing near real-time data, but for the sake of simplicity they are often also grouped into the real-time category.
Also note that when we talk about real-time, we are focusing on the availability of data. This is not to be confused with real-time computing, which focuses on guaranteed response within strict time constraints.
Benefits of Real-time Data
- Failure: The most direct benefit of real-time data is the ability to respond to problems on the fly. If a sensor goes bad, the system registers it immediately and the sensor can be fixed (before potentially months of data are ruined).
- Important Event: If an event of importance occurs, a team can be dispatched immediately to gather additional samples and observe the occurrence first hand.
- Sampling: With a real-time system, it is possible to change sampling rates and to activate and deactivate sensors based on the data received.
Example: If one sensor detects an important event, perhaps the sensors in that region need to increase their sampling rate temporarily, or a camera needs to be activated.
- Analysis: A lot of analysis can be performed on real-time data, and in certain cases this is actually the more efficient route. Averages, correlations, and mathematical operations can be performed in real-time with ease. The derived data can be put back into DataTurbine and further utilized. The end result is that summary and analytic data are available on the fly, giving an overview of the health of the system and the experiment.
- Public Consumption: Real-time also gives added value to the data. Data can be published publicly as it is gathered. The same sensor network that is monitoring an ecosystem for scientific research can display the tides and temperature of the water, the wind speed and direction, even a video feed showing the view of the forest.
- Portable: Streaming data is very portable. Adding destinations or applications is easy and transparent. Since data is represented as tuples (time, value, source), it is easy for any system to accept it, and doing so requires significantly less overhead than trying to read from a rigid structure such as a database. Once a streaming system is set up, raw data, automated analysis, and quality assurance and quality control (QA/QC) results are available to any application and destination that the provider specifies the second they are produced. Any additional analysis (which could take weeks or months) can then be added later.
- Funding Compliance: There is increasing pressure from funding agencies for data providers to publish data publicly in a timely manner. A real-time system can help satisfy that requirement.
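To make the Analysis and Portable items above more concrete, here is a minimal sketch in Python of a derived stream computed from (time, value, source) tuples. It does not use the DataTurbine API: the `Sample` type, the `running_mean` function, the source names, and the readings are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque, namedtuple

# Hypothetical sample type mirroring the (time, value, source) tuples.
Sample = namedtuple("Sample", ["time", "value", "source"])


def running_mean(stream, window=3):
    """Yield one derived Sample per input: the mean of the last `window` values of its source."""
    recent = defaultdict(lambda: deque(maxlen=window))
    for sample in stream:
        values = recent[sample.source]
        values.append(sample.value)  # the deque drops the oldest value automatically
        yield Sample(sample.time, sum(values) / len(values), sample.source + "/mean")


# Made-up readings from two hypothetical sources, interleaved as they might arrive.
readings = [
    Sample(0, 10.0, "temp"),
    Sample(0, 2.0, "wind"),
    Sample(1, 12.0, "temp"),
    Sample(1, 4.0, "wind"),
    Sample(2, 14.0, "temp"),
    Sample(3, 16.0, "temp"),
]

derived = list(running_mean(readings, window=3))
for d in derived:
    print(d.time, d.source, d.value)
```

Because the derived values are themselves (time, value, source) tuples, they could in principle be fed back into the same stream and consumed like any other source, which is the idea behind putting derived data back into DataTurbine.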
Limitations of Real-Time Data
- Not a Replacement: A real-time data system would ideally be an addition to, not a replacement for, an archival system. It should add to a system, but it makes a poor replacement for operations that are best suited to an archive such as a relational database.
- Data Quality: Data coming directly from sensors will have inherent imperfections, which have to be cleaned away before consumption. Unlike an archival system, which often provides only the cleanest, most annotated data, a real-time system would ideally have multiple levels of progressively cleaner data.
- Automated Cleaning: Automated QA/QC can be performed on a real-time stream to identify obvious inconsistencies and potentially problematic parts of the data.
- Levels of Assurance: Different applications require different levels of assurance. For example, a local weather site could use nearly raw data, while an intricate carbon dioxide absorption experiment would use manually cleaned and validated data.
- Different Paradigm: While traditional analysis would still work on archived data, taking advantage of the real-time aspect of data often requires a different approach than analysis of archived data.
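As a rough illustration of the automated cleaning and data levels described above, the following Python sketch tags each sample in a stream instead of discarding it, so that a raw level and a cleaner level can coexist. It is not part of DataTurbine: the `auto_qc` function, the flag names, the thresholds, and the readings are assumptions made for this example, and a real deployment would choose checks suited to each sensor.

```python
from collections import namedtuple

# Hypothetical types: a raw sample and the same sample with a QA/QC flag attached.
Sample = namedtuple("Sample", ["time", "value", "source"])
Flagged = namedtuple("Flagged", ["sample", "flag"])


def auto_qc(stream, low, high, max_step):
    """Tag each sample as 'ok', 'out_of_range', or 'spike' without dropping any data."""
    last_good = None  # most recent value that passed both checks
    for sample in stream:
        if not (low <= sample.value <= high):
            flag = "out_of_range"
        elif last_good is not None and abs(sample.value - last_good) > max_step:
            # Simplification: a genuine lasting jump would keep being flagged
            # and would need manual review.
            flag = "spike"
        else:
            flag = "ok"
            last_good = sample.value
        yield Flagged(sample, flag)


# Made-up water temperature readings with one impossible value and one sudden jump.
raw = [Sample(t, v, "water_temp") for t, v in
       enumerate([15.1, 15.3, 99.9, 15.2, 30.0, 15.4])]

flagged = list(auto_qc(raw, low=-5.0, high=50.0, max_step=5.0))
flags = [f.flag for f in flagged]

level0 = raw                                            # raw data, as collected
level1 = [f.sample for f in flagged if f.flag == "ok"]  # automatically screened data
```

In this sketch a consumer such as a public weather display might read `level1` directly, while the flagged samples remain available for the manual cleaning and validation that a more demanding experiment would require.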