Blog Series: Demystifying Data Science

If you are like me, you hear terms like machine learning, data mining, algorithms, etc. talked about as needed solutions all the time. For years I would hear these terms and think to myself, “Yeah, I think I know what that means” or “Am I the only person in the room who doesn’t really know what that is?” It’s funny how these terms can create such a divide and can even get people blindly hopping on the buzzword hype train. While these terms are both complex and exciting, they can often be misunderstood and misrepresented. The goal of this series is to help translate what these terms mean, what these various algorithms can do, and most importantly help you understand how to apply them in your business processes to drive value.

As the title implies, the purpose of this series is to provide an understanding of common data science topics. I am fortunate to work alongside a team of very talented data scientists who will help me tackle these topics and provide examples to help break down these confusing and complicated terms into more easily understood clusters.

…And with that, let’s make a transition into the first blog in this series to understand a heavily used machine learning algorithm that helps major companies like Google, Netflix, and Amazon make decisions and recommendations every day.

Clustering: What is Cluster Analysis and How Does it Work?

Clustering. What is it? For starters, it is a form of machine learning that falls under the category of unsupervised learning. Unsupervised learning is the academic way of saying that the algorithm sorts a dataset into groups that were not defined in advance. It does this through a sequence of math problems that form an algorithm. At its core, clustering is a statistical analysis designed to segment data into groups of its most similar members without adhering to any preconceived notion of what those groups should be. With March Madness looming, I’m going to use a basketball-related example to help illustrate my point.

Image Source: https://www.wired.com/playbook/wp-content/uploads/2012/05/NBA-analytics-graph-01-660×478.gif

The picture above is a visualization of an award-winning clustering analysis performed by a Stanford student named Muthu Alagappan. The objective of this analysis was to identify how many positions actually exist when looking at the data for the top 452 NBA players.

Traditionally, basketball is viewed as having five positions: Point Guard, Shooting Guard, Small Forward, Power Forward, and Center. The positions come with common traits and responsibilities. For example, a point guard is usually a smaller, quicker player who plays a role similar to a quarterback, calling plays and ensuring ball movement on offense. In contrast, a center is usually the biggest person on the team and is viewed as a huge asset on defense and in rebounding.

Muthu felt that this classification may not be well suited to the game as it is played today, and he wanted to test that hypothesis using clustering. His analysis concluded that the current classifications are oversimplified and, in some cases, flat-out incorrect. LeBron James, for example, is listed as a forward, but he typically brings the ball up the court and runs the offense the way a point guard would.

The analysis identified 13 total positions by looking at player statistics such as points, rebounds, assists, steals, blocks, turnovers, and fouls, all normalized to per-minute values. The clustering algorithm then grouped players whose function and performance were similar. For example, LeBron James lands in an elite classification called NBA 1st Team, while fellow forward Blake Griffin is classified as a Scoring Paint Protector. Both players have the same traditional position of forward, but this analysis categorizes them more specifically. Ultimately, this classification gives a better description of what to expect from each player.
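To make the "normalized to per-minute values" step concrete, here is a minimal sketch in Python. The two players and all of their numbers are invented for illustration, and the function name per_minute is my own; this is not Muthu's actual code or data. The point is only that dividing by minutes played lets a starter and a reserve with the same style of play look alike to a clustering algorithm.

```python
# The seven statistics named in the analysis above.
STATS = ("points", "rebounds", "assists", "steals", "blocks", "turnovers", "fouls")


def per_minute(season_totals):
    """Convert raw season totals into per-minute rates."""
    minutes = season_totals["minutes"]
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return {stat: season_totals[stat] / minutes for stat in STATS}


# Hypothetical players (made-up numbers): the reserve plays half the minutes
# of the starter but produces at exactly the same rate.
starter = {"minutes": 3000, "points": 1800, "rebounds": 600, "assists": 450,
           "steals": 120, "blocks": 60, "turnovers": 240, "fouls": 180}
reserve = {"minutes": 1500, "points": 900, "rebounds": 300, "assists": 225,
           "steals": 60, "blocks": 30, "turnovers": 120, "fouls": 90}

print(per_minute(starter))
print(per_minute(reserve))
```

On raw totals these two players look very different. On per-minute rates they are identical, which is what you want if the goal is to group players by role rather than by playing time.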

Now that you have a grounding in what clustering is, let’s discuss how it works. That is a bit tricky to answer, because there are many algorithms to choose from and the data going into the analysis can look very different. The density, shape, and linearity of the dataset all affect which algorithms will cluster it effectively. The illustrations below will help explain my point.

Image Source: http://commons.apache.org/proper/commons-math/images/userguide/cluster_comparison.png

In the visuals above, you are looking at various potential distributions of data and the corresponding clustering results from two different algorithms. Let’s assume we are looking at Facebook data where the x-axis is age and the y-axis is time spent on the site.

The two left columns use an algorithm called k-means, one of the most popular clustering algorithms. It iteratively repositions a chosen number of centroids (cluster centers), and each data point is grouped with its nearest centroid. Note that the first column of illustrations always has two sets (red and blue) and the second column always has three (red, blue, and green), because the algorithm was told to find two and three clusters respectively. Hence the name k-means: k is the number of centroids you ask for, and each centroid is the mean of the points assigned to it. Note that in the illustration with three groups of data points (second from the bottom), k-means with k=2 merges two of the groups into a single red cluster.

The third column uses a different algorithm called DBSCAN (short for the more confusing Density-Based Spatial Clustering of Applications with Noise), which determines how many clusters to find on its own. You will note that in the top two rows, the circles and the banana-shaped patterns are not clustered accurately by k-means because of the shape of the data; DBSCAN, however, maps these very well. Conversely, DBSCAN isn’t effective when dealing with uniform datasets like the one on the bottom row. It essentially doesn’t cluster anything at all!

There are more than just two clustering algorithms, but these are two of the most widely used and are great places to start your clustering journey.
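If you want to see the difference for yourself, the sketch below implements both ideas in plain Python and runs them on two concentric rings, similar to the circles in the top row. Treat it as a teaching sketch under stated assumptions, not as the code behind the illustration: the ring data, the eps and min_pts values, and the deterministic "farthest-first" starting centroids are all my own choices to keep the example small and repeatable.

```python
import math


def kmeans(points, k, max_iters=100):
    """Basic k-means (Lloyd's algorithm). Returns a label 0..k-1 per point."""
    # Assumption: deterministic "farthest-first" seeding so the demo is
    # repeatable; production libraries typically use randomized seeding.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # empty cluster: leave it in place
        if updated == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = updated
    return labels


def dbscan(points, eps, min_pts):
    """Basic DBSCAN. Returns labels 0, 1, 2, ... per point, or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i (including point i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be claimed later as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is also a core point: keep growing
    return labels


def ring(radius, n=40):
    """n evenly spaced points on a circle centered at the origin."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]


points = ring(1.0) + ring(5.0)  # first 40 points: inner ring; last 40: outer ring
km = kmeans(points, k=2)
db = dbscan(points, eps=1.0, min_pts=3)
print("k-means labels found on the outer ring:", sorted(set(km[40:])))
print("DBSCAN labels found on the outer ring: ", sorted(set(db[40:])))
```

On this data, k-means puts points from the outer ring into both of its clusters, because two centroids can only divide the plane with a straight line. DBSCAN follows the dense chain of neighboring points around each ring and gives each ring its own label, without being told how many clusters to look for.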

One final point about clustering is that it’s extremely useful for finding relationships and insights across multiple variables. If you recall the basketball example above, that analysis evaluated players across seven statistical categories. Traditional methods of segmentation usually get overwhelmed beyond two or three variables because you run out of axes on which to plot the data. Clustering methods, on the other hand, can identify relationships across several variables and in extreme cases can even handle hundreds of variables.
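The reason clustering does not run out of axes is that it works on distances between records, and a distance can be computed across any number of variables. The sketch below shows this for seven variables at once. The four players and their per-minute numbers are invented, and rescaling each variable to z-scores first is my own assumption (a common choice so that no single statistic dominates the distance), not something taken from the basketball analysis.

```python
import math
import statistics


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has a standard deviation of 0; fall back to 1.0
    # so we never divide by zero.
    spreads = [statistics.pstdev(col) or 1.0 for col in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, spreads))
            for row in rows]


def nearest(rows, i):
    """Index of the row closest to rows[i] by straight-line distance."""
    others = (j for j in range(len(rows)) if j != i)
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))


# Hypothetical per-minute rates, in this order:
# points, rebounds, assists, steals, blocks, turnovers, fouls
players = [
    (0.60, 0.10, 0.25, 0.05, 0.01, 0.10, 0.06),  # player 0: scorer and passer
    (0.62, 0.11, 0.24, 0.05, 0.01, 0.09, 0.06),  # player 1: similar to player 0
    (0.40, 0.35, 0.05, 0.02, 0.08, 0.05, 0.10),  # player 2: rebounder and shot blocker
    (0.42, 0.33, 0.06, 0.02, 0.07, 0.05, 0.11),  # player 3: similar to player 2
]

scaled = standardize(players)
for i in range(len(scaled)):
    print("player", i, "is most similar to player", nearest(scaled, i))
```

Nobody could plot these seven variables on a single chart, but the distance calculation treats them all the same way, and that distance is what algorithms like k-means and DBSCAN build on.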

So, now that you know at a high level what clustering is and how it works, what should you do with it? Well, this is where I would say let the data lead you. Data-driven insights are what clustering is all about. This is why Netflix is so effective at recommending what movies you would like to see next. In simplified terms, they see what movies you watch and cluster you with similar viewers. They then look at movies those similar viewers watched that you haven’t, and recommend those movies to you. Similarly, this analysis lays the foundation for how many companies perform targeted marketing. They don’t rely solely on predefined demographics, but instead identify which customers are buying which products, at which times, and in which quantities, so they can group customers by behavior and market specifically to that behavior. Furthermore, clustering can handle a lot of data and variables. And I mean A LOT of data. It can automatically group thousands of patterns into simple, digestible groups of similar patterns, which simplifies decision making. This is another reason why it has become a favored approach at many companies more recently, especially as they begin tapping into their big data infrastructure.
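The recommendation logic described above can be sketched in a few lines of Python. To be clear, this is a toy illustration of the idea and not how Netflix actually builds its recommendations: the viewers, the titles, and the cluster assignments are all made up, and I am assuming the clustering step has already been run and handed us a cluster number for each viewer.

```python
from collections import Counter


def recommend(user, watched, cluster_of, top_n=3):
    """Suggest titles that viewers in the user's cluster watched but the user has not."""
    peers = [u for u in watched
             if u != user and cluster_of[u] == cluster_of[user]]
    # Count how many peers watched each title the user has not seen yet.
    counts = Counter(title
                     for peer in peers
                     for title in watched[peer]
                     if title not in watched[user])
    return [title for title, _ in counts.most_common(top_n)]


# Hypothetical viewing histories.
watched = {
    "ana": {"Movie A", "Movie B"},
    "ben": {"Movie A", "Movie B", "Movie C"},
    "cho": {"Movie B", "Movie C", "Movie D"},
    "dee": {"Movie X", "Movie Y"},
}
# Hypothetical output of an earlier clustering step: viewer -> cluster number.
cluster_of = {"ana": 0, "ben": 0, "cho": 0, "dee": 1}

print(recommend("ana", watched, cluster_of))
```

Here ana shares a cluster with ben and cho, so she is offered the titles they watched that she has not, with the title more of her peers watched listed first. Nothing from dee is suggested, because dee sits in a different cluster.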

Clustering shouldn’t be limited to uncovering marketing insights. It is equally impactful in supply chain and operations. For example, it can help define product segmentation to support better forecasting and inventory management as well as SKU rationalization efforts. This MIT research white paper is a great example of this application. The benefits are not confined to inventory management either, as clustering could offer insights to improve forecasting, production planning, sourcing decisions, profit optimization, root cause analysis, and pretty much any other area that requires decision making.