I recently read Ten Simple Rules for Effective Statistical Practice by Kass et al. and was inspired to share it, as it elegantly captures a practical philosophy for how best to analyze data. I’ll summarize the rules here, and I highly recommend reading the journal article itself.
The 10 rules below are quoted directly from the paper, each followed by my comments.
- “Rule 1: Statistical Methods Should Enable Data to Answer Scientific Questions” Kass et al. wrote the article with academic researchers in mind, but this rule is valid for any question you’re trying to answer with data. The essence of this first point is that the starting point for any data analysis should be to consider the link between the available data and the question. The trap to avoid is receiving the data and immediately thinking about the tools that could be applied to it. First understand the question, and then consider which tools will best identify the answer.
- “Rule 2: Signals Always Come with Noise” A no-brainer, right? There is always variation in the results. The point being made here is to keep the probability distribution of outcomes in view. The bigger the data, the noisier it is likely to be, so make sure to understand how that noise behaves.
- “Rule 3: Plan Ahead, Really Ahead” Collect data with great care. Think about the answers you expect and be very deliberate in making sure the data can actually address the question. This will make the subsequent analysis simpler and more rigorous.
- “Rule 4: Worry about Data Quality” Expect to spend a lot of time exploring and cleaning the data. A solid exploratory analysis will give an intuitive sense of what can be expected from the data and will also highlight unexpected results caused by missing or incorrectly recorded values. Discover data issues early!
- “Rule 5: Statistical Analysis Is More Than a Set of Computations” Be clear and specific about the statistical tests you are going to run and why they are appropriate. Methodology should be carefully planned, not ad hoc.
- “Rule 6: Keep it Simple” Start with the simplest approach and only add complexity as needed. This will have the added benefit of making the communication of the result easier.
- “Rule 7: Provide Assessments of Variability” The bigger and more complex the data source, the greater the risk of underestimating the uncertainty, particularly when the data has a lot of dependencies between variables.
- “Rule 8: Check Your Assumptions” Assumptions of independence or linearity are common and may not be justified. Both the data itself and the nature of the variables in question should be inspected closely before applying assumptions.
- “Rule 9: When Possible, Replicate!” In the world of academia, this means performing your analysis in a way that can be replicated by another researcher with an independent data set. In the world of business this is unlikely. But in any modeling exercise there is always the need to separate the data into training and test sets. Make sure to do this in a way that preserves the independence of the two datasets.
- “Rule 10: Make Your Analysis Reproducible” This last rule also comes out of the scientific method but is just as important for industry. Many analyses need to be repeated periodically, and carefully documenting the way the work is done dramatically increases the ease with which subsequent analyses can be performed. It also has the beneficial effect of standardizing the approach, allowing the results to be compared more rigorously over time.
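To make Rules 9 and 10 a little more concrete, here is a minimal Python sketch of one way to split data into training and test sets while keeping the two sets independent. It is my own illustration, not something from the paper. It assumes the rows are grouped by some key (here a hypothetical `customer` field) and that rows sharing a key are correlated, so each group is sent entirely to one side of the split. The function name `group_split`, its parameters, and the toy data are all made up for the example, and the fixed seed is there so the same split can be reproduced later.

```python
import random


def group_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows into train/test so that no group appears on both sides.

    Hypothetical helper for illustration: rows is a list of dicts and
    group_key names the field that identifies related rows.
    """
    # Sort the distinct groups so the result does not depend on set ordering.
    groups = sorted({row[group_key] for row in rows})

    # A seeded generator makes the split repeatable (Rule 10).
    rng = random.Random(seed)
    rng.shuffle(groups)

    # Assign whole groups, not individual rows, to the test set (Rule 9).
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])

    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


# Toy data: 8 hypothetical customers with 5 purchases each.
rows = [{"customer": f"c{i % 8}", "spend": 10 * i} for i in range(40)]
train, test = group_split(rows, "customer")

train_customers = {row["customer"] for row in train}
test_customers = {row["customer"] for row in test}
print(len(train), len(test), train_customers & test_customers)
```

A plain row-level shuffle would likely put purchases from the same customer on both sides, which can make a model look better on the test set than it would on genuinely new customers. Whether grouping is needed, and on which key, depends on how the data was collected.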
In short, know the question, know the data, keep it simple, be mindful of the variability and the assumptions, and document the work well for ease of reproduction in future. Keeping this list of rules on hand and reviewing them before beginning any new analysis will lead to a more efficient process and rigorous result – exactly what we’re all striving for with every project we undertake.