Discovery vs Production with Big Data

Diego Klabjan, Partner

It has been noted that big data technologies are predominantly used for ETL and data discovery. ETL has been around for decades and is well understood, with a mature market. Data discovery is much newer and less well understood. Wikipedia’s definition reads:

“Data discovery is a business intelligence architecture aimed at interactive reports and explorable data from multiple sources.”

Data lakes based on Hadoop are springing up at many companies, with the predominant purpose of data discovery from multiple explorable sources. It is easy to simply dump files from all over the place into a data lake, so the “multiple sources” requirement in the definition is met. But what about the part on “interactive reports”? According to dictionaries, the verb “to discover” means “to learn of, to gain sight or knowledge of,” which has little in common with interactive reports. Indeed, in business, data discovery is much more aligned with the dictionary definition than with Wikipedia’s. Data discovery as used with big data and data lakes really means “to gain knowledge of data, in order to ultimately derive business value, by using explorable data from multiple sources.”

The vast majority of big data applications conduct data discovery in the sense of learning from the data. The knowledge gained does not by itself provide business value, so the insights are operationalized separately in more established architectures (read: EDW, RDBMS, BI). A good example is customer behavior derived from many data sources, e.g., transactional data, social media data, and credit performance. This clearly calls for data discovery in a data lake, with the resulting insights written into a relational database and put into production by other systems used in marketing or pricing.
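The split described above can be illustrated with a minimal sketch. It assumes three tiny in-memory extracts stand in for files in a data lake and an in-memory SQLite database stands in for the relational store. All names here (`discover_customer_insights`, `publish_insights`, the `customer_insights` table, and the sample records) are hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw extracts, standing in for files dumped into a data lake.
transactions = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 15.0},
]
social_mentions = [
    {"customer_id": 1, "sentiment": 0.6},
    {"customer_id": 2, "sentiment": -0.2},
    {"customer_id": 2, "sentiment": -0.4},
]
credit_performance = [
    {"customer_id": 1, "late_payments": 0},
    {"customer_id": 2, "late_payments": 3},
]


def discover_customer_insights(transactions, social_mentions, credit_performance):
    """Discovery step: combine the sources into one summary row per customer."""
    spend = defaultdict(float)
    for row in transactions:
        spend[row["customer_id"]] += row["amount"]

    sentiments = defaultdict(list)
    for row in social_mentions:
        sentiments[row["customer_id"]].append(row["sentiment"])

    late = {row["customer_id"]: row["late_payments"] for row in credit_performance}

    insights = []
    for customer_id in sorted(spend):
        scores = sentiments.get(customer_id, [])
        avg_sentiment = sum(scores) / len(scores) if scores else 0.0
        insights.append(
            (customer_id, spend[customer_id], avg_sentiment, late.get(customer_id, 0))
        )
    return insights


def publish_insights(conn, insights):
    """Operationalization step: write the summary into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_insights ("
        "customer_id INTEGER PRIMARY KEY, total_spend REAL, "
        "avg_sentiment REAL, late_payments INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO customer_insights VALUES (?, ?, ?, ?)", insights
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
publish_insights(
    conn,
    discover_customer_insights(transactions, social_mentions, credit_performance),
)

# A downstream marketing or pricing system queries the relational table, not the lake.
low_risk = conn.execute(
    "SELECT customer_id FROM customer_insights "
    "WHERE late_payments = 0 ORDER BY customer_id"
).fetchall()
```

The point of the sketch is the boundary: the exploratory aggregation happens over the raw sources, while production systems only ever see the small relational table that results.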

Outside of ETL, there are very few cases of big data solutions actually being used in production. Large companies directly connected with the web have successfully deployed big data technologies in production (Google for page ranking, Facebook for friend recommendations), but outside of this industry big data solutions in production are rarely observed.

It is evident that today big data is used predominantly for data discovery and not in production. I suspect that as the technologies mature further and become more self-service, the boundary will gradually shift towards production, assuming that business value can be derived from such opportunities. For now, the Wikipedia definition based on interactive reports is mostly an illusion, and it is better to stick with the plain English meaning: gaining knowledge of the data.