Diego Klabjan
Mar 3rd, 2015

Many sites and portals offer text content on web pages. For example, news aggregators such as The Huffington Post or Google News let users browse news stories; membership-based portals focused on a specific industry, e.g., constructionsupport.terex.com for construction, offer members a one-stop page for the latest and greatest updates in a particular domain; in the service domain, DirectEmployers provides the site www.my.jobs with job listings for members to explore. A challenge faced by these site providers is to distinguish users that simply browse the site from those that are actively searching with an end goal; for DirectEmployers this means distinguishing users that actively seek a job from those only exploring the portal. The former can then be targeted with marketing campaigns to provide higher business value.

While this can traditionally be accomplished through web analytics by following page views without considering the actual textual content of the pages, that is no longer satisfactory: modern sites use HTML5, which enables collecting data on users' interactions with the textual content. By recording user clicks in JavaScript, new data streams collect and combine user ids with click streams and the text content viewed. For example, DirectEmployers records the user id together with the job description viewed. This should conceptually enable the company to identify which users are merely browsing the portal and which are actively searching for a job.

To achieve this, relevant information needs to be extracted from each text description; next, a measure of proximity between two pieces of extracted information is needed; and finally a single 'dispersion' metric is computed for each user. The higher the metric, the more exploratory the user's behavior. The workflow requires substantial data science and engineering using several tools.
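As a concrete illustration of the dispersion metric, one natural choice is the mean pairwise distance between the items a user viewed. The sketch below uses a placeholder Euclidean distance on hypothetical feature vectors; in the full pipeline the distance comes from the LDA topics extracted later.

```python
import math
from itertools import combinations

def distance(a, b):
    # Placeholder distance between two feature vectors; in the full
    # pipeline this would be a topic-based distance derived from LDA.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dispersion(viewed_vectors):
    """Mean pairwise distance over all items a user viewed.
    Higher values suggest exploratory browsing; lower values
    suggest focused, goal-directed search."""
    pairs = list(combinations(viewed_vectors, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

# A focused user views similar items; an exploratory user does not.
focused = [(0.9, 0.1), (0.85, 0.15), (0.8, 0.2)]
exploratory = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]
```

A user who views only near-identical descriptions scores close to zero, while a user hopping between unrelated content scores high.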

Hadoop's schema-on-read approach is well suited for the bulk of the analysis. Its easy-to-load model makes it adequate to simply dump textual descriptions, click data, and user information to the filesystem.

To extract relevant information from each text description, Latent Dirichlet Allocation (LDA) can be performed. The process starts by removing stop words from the text, which is easily accomplished in Hadoop's map-reduce paradigm. Instead of writing raw Java, the scripting language Pig can be used in combination with user-defined functions (UDFs) to accomplish this task in a few lines of code. Next, the document-term matrix is constructed. This is again simple to perform in Pig with a single pass through the text descriptions while fully exploiting concurrency.
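For illustration, the same two steps can be sketched in plain Python (at scale they run as Pig scripts with UDFs; the stop-word list and sample descriptions here are made up):

```python
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a full one.
STOP_WORDS = {"the", "a", "an", "and", "for", "in", "of", "to", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def document_term_matrix(docs):
    """One pass over the descriptions yields a term-count row per
    document, mirroring the single Pig pass described above."""
    vocab = sorted({w for doc in docs for w in tokenize(doc)})
    matrix = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        matrix.append([counts.get(w, 0) for w in vocab])
    return vocab, matrix

docs = ["The senior welder for the construction site",
        "A junior welder in training"]
vocab, dtm = document_term_matrix(docs)
```

In Pig, each description would be one tuple, the tokenization would live in a UDF, and the counting would be a GROUP BY over (document, term) pairs, so the pass parallelizes naturally.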

LDA, which takes the document-term matrix as input, is hard to execute in Hadoop's map-reduce framework, so it is more common to export the matrix and perform LDA in either R or Python, since both offer excellent support for LDA. The resulting topics, mapped back to the original text content, can then be exported back to Hadoop for subsequent steps.
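In Python, one way to run this step is scikit-learn's LDA implementation (the toy matrix below is made up; in practice it is the matrix exported from Hadoop):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term matrix: rows are job descriptions, columns are
# terms, values are term counts, as exported from Hadoop.
dtm = np.array([
    [3, 0, 2, 0],
    [2, 0, 3, 0],
    [0, 3, 0, 2],
    [0, 2, 0, 3],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)  # per-document topic distributions
```

Each row of `doc_topics` is a probability distribution over topics; keyed by document id, these rows are what gets exported back to Hadoop for the distance computations.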

The calculation of distances between text descriptions, based on the topics provided by LDA, can be efficiently executed in Hadoop by using a self-join in Pig with the help of UDFs. Finally, the score for each user is computed by joining user data with clicks and the pair-wise distances of text descriptions.
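These last two steps can be sketched in Python on toy data. The choice of Hellinger distance between topic distributions is an assumption (Jensen-Shannon divergence is another common option), and the document ids, topic vectors, and click log below are made up; the dictionary of pairwise distances stands in for the output of the Pig self-join.

```python
import math
from itertools import combinations

def hellinger(p, q):
    """Hellinger distance between two topic distributions, in [0, 1]."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

# Topic distributions per job description (output of LDA).
topics = {
    "job1": [0.9, 0.1],
    "job2": [0.8, 0.2],
    "job3": [0.1, 0.9],
}

# Pairwise distances -- the self-join step done in Pig at scale.
dist = {frozenset(p): hellinger(topics[p[0]], topics[p[1]])
        for p in combinations(topics, 2)}

# Click log: user id -> descriptions viewed (the join with user data).
clicks = {"alice": ["job1", "job2"], "bob": ["job1", "job3"]}

def user_score(viewed):
    """Dispersion score: mean distance over pairs of viewed items."""
    pairs = [frozenset(p) for p in combinations(set(viewed), 2)]
    if not pairs:
        return 0.0
    return sum(dist[p] for p in pairs) / len(pairs)

scores = {u: user_score(v) for u, v in clicks.items()}
```

Here "bob", who viewed two topically distant descriptions, scores higher than "alice", whose viewed descriptions share a topic, which is exactly the exploratory-versus-focused signal the metric is meant to capture.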

All of these steps can be accomplished in Pig (and select steps are more elegant in Hive), with only a limited amount of Java code hidden in UDFs and the assistance of R or Python.

Without Hadoop's capability to handle the size and variety of the data, this analysis would be confined to user clicks alone, and the value provided would be very limited.