Spark and in-memory databases: Tachyon leading the pack

Diego Klabjan Partner
Read Time: 2 minutes apprx.
big data data processing data science

The biggest grunts about Hadoop is its batch processing focus and the fact that iterative algorithms cannot be written efficiently. For this reason it is mostly used in data lakes for storing huge datasets together with its ETL/ELT capabilities and for running ad-hoc queries with map reduce.

In-memory database on the other hand offer great response time but are limited in their capacity by physical memory. The market is embracing several solutions from Hana by SAP, to VoltDB, memSql, Redis, and other.

Then came Spark with its brilliant idea of resilient distributed datasets (RDDs) which allow it to mimic map reduce but holding the data in (persistent) cache. While a single map reduce process is not much faster in Spark over Hadoop’s map reduce, algorithms iterating on the same dataset are greatly more efficient since data is stored in memory cache for continuous access through iterations.

Spark being a processing framework is not a database or filesystem, albeit offering drivers to many databases and filesystems. Its memory oriented cache offers great computational speed but no storage capabilities. So combining its speed with quick access of in-memory databases is the holy grail of computational efficiency and storage.

As an example, memSQL announced a driver for Spark. Functionality of Spark is not readily accessible on top of data residing in the memSQL in-memory database. Real-time use cases such as fraud detection are sure to benefit from the marriage of the two.

A step further is Tachyon developed at Berkeley. It offers in-memory storage with a seamless integration with Spark. If several Spark jobs are accessing the same dataset stored in Tachyon, the dataset is not replicated but loaded only once. This is definitely ultimate efficiency of storage and computation.

As Hadoop will never supplant RDMS (at least in the foreseeable future), Spark with Tachyon (or any other in-memory database) will not make the two extinct. Huge data sets are unlikely to economically fit in memory and thus the three roommates will continue to dance together and occasionally bounce into each other.