Information Gain based feature selection in Spark’s MLlib

Diego Klabjan Partner
Read Time: 4 minutes apprx.
big data classification data science machine learning recommendation systems sampling spark web data

We recently worked on a client engagement that included web data and other customer specific information. It was a propensity analysis type project where recommendations were required for each individual client based on his or her past actions on the web. Each item recommended has many features and clients belong to organizations, which creates interactions among them.

Recommendations should be personalized and take into account linkage through organizations. To handle the latter, we selected to use organization related features as features in the model (instead of a possible alternative approach of having organization-level and individual customer-level models which are then combined).

These characteristics led to a personalized model for each client with each model having more than 2,000 features. To avoid over-fitting, feature selection had to be performed at customer level and thus due to a large number of customers in an automated fashion.

A possible line of attack is to apply PCA to each model. The problem with this approach is that the resulting features are linear combinations of original features. The project managers at the client were consistently asking which features are important and what is the importance of each feature (think weights in logistic regression). For this reason PCA was not a viable option.

Instead we opted to go with feature selection based on information gain. The concept is based on the information gain between a feature vector and the label vector. The goal is to select a subset of features that maximize the information gain between them and the label (reveal as much information about the labels as possible) and minimize the information gain among the features themselves (minimize the redundancy of the features among themselves). The goal is to find set S that

equation for 2-15-15 blog

where ig is the information gain between two vectors. It is common to solve this problem, called maximum relevance minimum redundancy, greedily.

Due to a large amount of data across all customers, we implemented the entire system in Spark. Spark’s MLlib unfortunately offers only feature selection based on PCA but not based on information gain. For this reason we implemented our own package. The package is available at for public peruse.

For our purpose we had to also customize MLlib with other extensions (that are rather short and thus not put in a public repository).

Our model was heavily unbalanced (one class represented less than 1% of the records). We undersampled the large class but we had also used different weights for different records in the model calibration phase. This also allowed us to put different weights for different events of customers (for example, a misclassification of a purchase event has more weight than a misclassification of an item that the user recommended to someone else in the same organization). We achieved this by subclassing LabelPoint with a weight of the record. We also had to customize the classes for gradient computation (in logistic and SVM since these were the most competitive models) in order to account for weights.

The second enhancement was the use of a quadratic kernel in SVM and logistic regression. We implemented the quadratic kernel since it yields a linear model in a finite higher dimensional space and thus can reuse a lot of the linear machinery in MLlib. To this end, we created a class that extends the input rdd with the kernel function and then exposes the standard API of linear models. We have also extended the model class to automatically take into account the same kernel function when a test vector is applied for predictions.

Overall, Spark definitely met the needs of this project. It is robust and definitely scalable. On the downside, our problem was not a textbook model and thus it needed several enhancements to MLlib.