Technical Note on Calculating Many Distances for Analytics Projects

Sam Hillis Data Scientist
Read Time: 3 minutes apprx.
mapping network design supply chain visualization

Distance matrices of different varieties have applications throughout the field of analytics; though the computation varies depending on the application, they always serve the same purpose: providing a pairwise measure of difference between all data points.  Some applications require geographical distance matrices; these can prove computationally intensive if the distances between geographies is on the analyst to calculate.

If latitude and longitude are not given, several simple shortcuts exist such as creating a database of distances between agglomerated areas (zip codes, cities, etc.). Intuitively, some accuracy is lost here.  If the analyst knows the latitudes and longitudes of the data points, the Haversine formula is often preferred. However, at Opex Analytics, we have found that this method does not scale well for creating distance matrices as magnitude increases (note that for road distances in the US, we have found that Radical Logistics is a very good and scaleable solution).

Instead, we have chosen to convert latitudes and longitudes to Earth Centered Earth Fixed (ECEF) coordinates. ECEF converts latitudes, longitudes, and height (if desired) into a Cartesian coordinate system, allowing the analyst to perform much quicker calculations by minimizing the use of trigonometric functions.

The graph below shows the output of a series of time trials carried out using Python’s NumPy and SciPy libraries.  Distance matrices were created from 10, 100, 1,000 and 10,000 randomly generated points throughout the continental United States: the Haversine method is shown in blue and ECEF in green.  When the number of data points is small, the difference in time may be acceptable; with a 100 point dataset, it is a difference of fractions of a second.  However, datasets in the tens of thousands are not uncommon.  Datasets containing 10,000 data points averaged over 20 minutes using the Haversine formula and just over one second using ECEF.

Haversine vs ECEF Distance

So why isn’t ECEF the standard? Because the Earth is round of course.  If the Earth is approximately spherical, the city of Perth in Western Australia is almost exactly on the opposite side of the globe from the Opex  Analytics office in Evanston, IL.  The Haversine formula takes into account the curvature of the Earth (spherical law of cosines), so the distance across the surface of the sphere is used.  ECEF coordinates are Cartesian in nature so the distance through the body of the sphere is used.

This realization may lead you to question the validity of all ECEF results, but consider this: in our tests of 10,000 random data points in the US, the difference in distance calculated using both methods was less than 0.5%; however, the difference in calculated distance between Perth and Evanston is nearly 30%. Clearly, ECEF coordinates should not be used to calculate distances between points on the other side of the world, but even on the scale of a large country the difference may be acceptable.

If you are in a line of business where the accuracy of measurement is the most important factor, the ECEF coordinate system would not be acceptable.  If you are performing a spatial analysis of 10’s of thousands of data points that are relatively close and a small margin of error is acceptable, ECEF is definitely worth a consideration.