Data: Lost in translation
We too often treat data as a source of indisputable “truth”, yet fail to ask key questions of the analysis behind it, and so end up blindly following “data-driven truths”.
This is not to say that analysis should be scrutinised down to the last line of code or row, but that we should probe to understand how fit-for-purpose the data insight is. For instance, you wouldn’t make consequential business decisions from a statistical model with a small sample size (statistically significant or not). How is one to know without asking or being informed?
We obviously don’t want to be manually reviewing raw data, so why do things go wrong?
From a young age, we are all taught to read graphs a certain way to the point where we can interpret one in seconds, and the best graphs should do just that. However, this is one of the most obvious ways in which data can be misinterpreted (or, unfortunately, manipulated).
Look at the graphs below. Both use the same data, but by amending the y-axis we can tell a remarkably different story. Neither is incorrect, but both are extremes and leave the interpreter to come to their own conclusion. Is the graph on the left too blasé about a decline in profit, or is the graph on the right too alarming? The reader needs more context and data to discern which is the more appropriate visualisation.
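To put a number on the y-axis effect, here is a minimal sketch in Python. The profit figures and axis limits are invented for illustration and are not the data behind the graphs; `apparent_drop` is a hypothetical helper that measures how much of the chart height the decline appears to cover.

```python
def apparent_drop(values, y_min, y_max):
    """Fraction of the chart height covered by the fall from first to last value."""
    return (values[0] - values[-1]) / (y_max - y_min)


# Hypothetical profit figures, invented for illustration only.
profit = [100.0, 99.0, 98.0, 97.0]

# Same data, two different y-axis ranges (both ranges are assumptions).
full_axis = apparent_drop(profit, y_min=0.0, y_max=110.0)
zoomed_axis = apparent_drop(profit, y_min=96.5, y_max=100.5)

print(f"axis from zero: drop covers {full_axis:.0%} of the chart height")
print(f"zoomed axis:    drop covers {zoomed_axis:.0%} of the chart height")
```

The same 3% fall in profit occupies a sliver of one chart and most of the other, which is why the axis range should always be part of the conversation.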
The right tool
Danger often strikes when we don’t understand how a tool works, or when we use it purely for the sake of using it. But it’s using our data and outputting some results, so how wrong can it be? Again, the results aren’t wrong, but is this the best tool for the problem at hand? Selecting the right tool and understanding its caveats is pivotal to making informed decisions.
The map below (left) shows London Fire Brigade (LFB) data on incidents within Camden from 2009 to 2017. One useful insight for the LFB would be to locate high concentrations of incidents so that it could allocate resources to nearby stations. We can apply a k-means algorithm (some machine learning, because everyone is doing it these days!) to find clusters of fire incidents.
Even though we’ve applied machine learning and used a sufficiently large dataset, is this enough analysis to pass on to someone else? Probably not. Only by understanding how the algorithm works would you know that you have to specify how many clusters you expect to find (in this case six), which is itself a source of bias. There is also a random element in how the algorithm finds clusters, so if we run the process twice more, we arrive at the two different outputs below. So which one is correct? All of them, but none is fit-for-purpose. However complex the algorithm may be, it’s clear that we are using the wrong tool for the job.
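The two caveats above (the number of clusters is chosen up front, and the starting points are random) can be seen in a bare-bones version of the algorithm. This is a sketch written for illustration, not the code used for the maps: it runs on synthetic points scattered uniformly at random rather than the LFB data, and the function name, seeds and iteration cap are all assumptions.

```python
import random


def kmeans(points, k, seed, iterations=50):
    """Plain Lloyd's algorithm; returns the k centroids, sorted for easy comparison."""
    rng = random.Random(seed)
    # Random element: the starting centroids are k points picked at random.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if it has none).
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


# Synthetic points with no real cluster structure (not the LFB data).
data_rng = random.Random(0)
points = [(data_rng.random(), data_rng.random()) for _ in range(300)]

# k = 6 is our choice, not something the data told us.
runs = {seed: kmeans(points, k=6, seed=seed) for seed in (1, 2, 3)}
for seed, centroids in runs.items():
    print(seed, [(round(x, 2), round(y, 2)) for x, y in centroids])
```

On data like this, the centroids will typically move from one seed to the next, yet every run still reports six tidy “clusters”. Fixing the seed makes a run repeatable, but it does not make the answer any more meaningful.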
Also known as p-hacking, data-dredging is a common pitfall when working with data, especially in academia. Data-dredging is when one searches the data for any statistically significant result and selectively pursues whatever turns up, as opposed to testing a single hypothesis stated in advance. It often occurs under pressure from employers or funders to publish statistically significant research.
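A small simulation shows why dredging finds “results”. In this sketch every experiment compares two groups drawn from the same population, so there is nothing real to discover. The sample size, number of tests, seed and the simple z-test (variance assumed known) are all illustrative assumptions, and the Bonferroni threshold is shown only as one common safeguard.

```python
import random
from statistics import NormalDist, fmean


def null_experiment_p_value(rng, n=50):
    """Two-sided z-test p-value for two groups drawn from the SAME N(0, 1) population."""
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    group_b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Variance is known to be 1, so the difference in means has standard error sqrt(2 / n).
    z = (fmean(group_a) - fmean(group_b)) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


rng = random.Random(42)
n_tests = 100  # number of hypotheses "dredged" from the data
alpha = 0.05   # conventional significance threshold

p_values = [null_experiment_p_value(rng) for _ in range(n_tests)]

# Naive approach: report anything below alpha.
naive_hits = sum(p < alpha for p in p_values)
# Bonferroni correction: divide alpha by the number of tests performed.
corrected_hits = sum(p < alpha / n_tests for p in p_values)

print(f"'significant' results with no real effect: {naive_hits} of {n_tests}")
print(f"after Bonferroni correction: {corrected_hits} of {n_tests}")
```

With a 5% threshold, roughly five in every hundred such tests come out “significant” by chance alone, so reporting only the hits gives a badly distorted picture.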
A common phrase is “correlation does not imply causation”. Just because a relationship is statistically significant at a 99% confidence level does not mean that one thing causes the other. For example, the chart below shows the correlation between consumption of mozzarella cheese and the number of civil engineering doctorates awarded. We could (blindly) conclude that having more civil engineers causes consumption of cheese to increase. This is unlikely to be true.
However, let’s consider the underlying factors of both. The more affluent one is, the more disposable income one has to buy cheese, and the more likely one is to attend university. So there may be a connection between engineering degrees and cheese consumption, but the causal relationship might lie with affluence, not with whatever the data and a statistical test appear to say.
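The affluence argument can be sketched with synthetic numbers. Here a made-up “affluence” score drives both cheese consumption and doctorates, with no direct link between the two. All variables, coefficients and the seed are invented for illustration. The partial correlation is the standard first-order formula, which measures what is left of the relationship once affluence is accounted for.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs (first-order partial correlation)."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(7)

# Synthetic data: affluence drives both outcomes; they do not affect each other.
affluence = [rng.gauss(0.0, 1.0) for _ in range(2000)]
cheese = [a + rng.gauss(0.0, 1.0) for a in affluence]
degrees = [a + rng.gauss(0.0, 1.0) for a in affluence]

raw = pearson(cheese, degrees)
adjusted = partial_correlation(cheese, degrees, affluence)

print(f"cheese vs degrees, raw correlation:      {raw:.2f}")
print(f"cheese vs degrees, affluence controlled: {adjusted:.2f}")
```

In this setup the raw correlation is clearly positive, while the adjusted figure sits close to zero: the apparent link between cheese and doctorates is carried entirely by the shared driver.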
So, what questions should you be asking?
How was the data obtained?
Why are we doing this type of analysis?
How does the tool work?
What are the limitations of this analysis?
What’s the theory or rationale behind the numbers?
Next time, ask us about our analysis. We’d love to prove it.