Mixing real-time and batch data processing

Diego Klabjan, Partner

Want to process real-time data? Web data, anyone? Or data from IoT or the industrial internet? You can fire up spouts and bolts in Storm and get an alert such as “whoops, one of our assembly machines is about to experience problems!”

For decades, companies have been able to use RDBMSs and data cubes to find out how much revenue would be lost if an assembly machine went down for two hours at normal throughput. Lately, Hadoop has become a de facto standard for this step.

But what if management wants to know the impact of a sputtering machine that is likely to break down at its current throughput? Or what if a plant manager wants an up-to-date health status of the machine, i.e., one that is current past the last batch data run, so that the assessment requires both the results of that run and all sensor readings since then? A few years ago this was a utopia, but computer and data scientists knew what was brewing in some pots.

The lambda architecture enables exactly this (for now, with an army of data scientists) by combining several technologies. It requires views computed over the batch data and a streaming process for the data that has arrived since the last batch run. Typical implementations use the open source stack of Kafka, Hadoop, Druid, and Storm, but to a certain extent Spark can also serve the purpose. An interesting perspective and experience is provided by Jay Kreps from LinkedIn (who clearly advocates using only Kafka, which was pioneered by LinkedIn).
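To make the idea concrete, here is a minimal sketch in plain Python of how a query can merge a batch view with a streaming layer. It is an illustration only, not how Storm, Hadoop, or Druid are actually wired together: the names BatchView, SpeedLayer, and mean_reading are hypothetical, everything lives in memory, and the machine "health status" is reduced to the mean of its sensor readings.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BatchView:
    """Aggregates precomputed by the last batch run (hypothetical schema)."""
    as_of: int            # timestamp of the last batch run
    reading_count: dict   # machine_id -> number of readings in the batch
    reading_sum: dict     # machine_id -> sum of those readings


class SpeedLayer:
    """Incremental aggregates for readings that arrived after the batch run."""

    def __init__(self, batch_as_of):
        self.batch_as_of = batch_as_of
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def ingest(self, machine_id, timestamp, value):
        # Readings at or before the batch cutoff are already in the batch
        # view, so they are skipped here to avoid double counting.
        if timestamp <= self.batch_as_of:
            return
        self.count[machine_id] += 1
        self.total[machine_id] += value


def mean_reading(machine_id, batch, speed):
    """Query-time merge: combine the batch view with the streaming layer."""
    n = batch.reading_count.get(machine_id, 0) + speed.count.get(machine_id, 0)
    if n == 0:
        return None  # no data for this machine in either layer
    s = batch.reading_sum.get(machine_id, 0.0) + speed.total.get(machine_id, 0.0)
    return s / n
```

A caller would build the BatchView from the output of the last batch job, feed every new sensor reading to SpeedLayer.ingest, and call mean_reading whenever an up-to-date figure is needed. When the next batch run finishes, the speed layer is discarded and restarted from the new cutoff.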

While successful deployments are found at the web giants, which use them to process web data, there are definitely use cases outside of Silicon Valley. Possible manufacturing cases are outlined above.