Needless to say, there can be no analytics project without data. A project starts by identifying the data, cleansing it, performing the analytics, and then conveying the results or solutions. The rule of thumb is that 70 to 80 percent of the total timeframe of an analytics project is spent on data preparation, and a much smaller portion on actually conducting the analytics.
There are many reasons why analytics software projects routinely miss deadlines and overrun their budgets, particularly in the data preparation phase, where leaders kick projects off without first nailing down the prerequisites:
- Many analytics projects begin as software engagements, and software projects are notorious for delays. Commonly, the development process lacks a specification, there is no source control in place, and there are no incremental delivery requirements.
- Developers are human, with all the usual proclivities to over-promise, overestimate their skills, and give in to pressure from management.
- The inherent nature of development does not help either: bugs can be tough to fix, software libraries do not link, the implementation must work on a wide variety of computing architectures, and so on.
Analytics projects are subject to all these quirks. They also rely heavily on data, which makes the delays even longer and more frequent. And it is not all about data cleansing, even though cleansing takes the majority of the time. If the data is readily available, then cleansing is time-consuming but doable.
Far too often in an analytics project, significant delays accumulate before the first line of code or the first query has been written. It all starts with data availability and collection. It is not uncommon for companies to start analytics projects out of a belief in the potential business value of analytics, a value that depends on utopian data sets. Poor data quality eats into the realized value, but it is even worse if the data is not available or accessible to the team before the project starts.
I have been involved in several projects where the initial assumption was that the business users would provide some data and the miracles of analytics would then be performed. This can happen, but only after a multi-year delay in execution because the data was not available from the beginning. If the business users own the data, then making it accessible is not a top priority for them. After all, the business has been running for years without this new analytics nuisance.
After the data has been collected and passed to the analytics team, data quality issues arise and missing data sets are identified. At this point, a back-and-forth begins with the data owners, and the best outcome is that the data is ready after a significant delay in project execution. While many companies use enterprise data warehouses (EDWs) that ease data access, for the foreseeable future there will be spreadsheets scattered around corporations holding valuable information that is not accessible to everyone.
Corporations that handle data availability well can execute a project more quickly because they have well-established analytics project leadership. These leaders understand that a project cannot start until the data is ready.