Statistical Model: A Map is Not the Territory

“A good analogy is that a model is like a map, rather than the territory itself. And we all know that some maps are better than others: a simple one might be good enough to drive between cities, but we need something more detailed when walking through the countryside.”

[The Art of Statistics, David Spiegelhalter]

When you visit Disneyland, you do not need a detailed map built from satellite imagery; a simple cartoon map showing the relative locations of the attractions is enough. If you are a secret agent on an investigation, however, you need something far more detailed. A statistical model (or any data-driven model) is the same: the fidelity it needs depends entirely on the purpose it serves and on the quality of the data fed into it.

When building a statistical model, there is a general trade-off between bias and variance (the bias-variance dilemma). If we reduce the variance, say by choosing a simpler model, the model may fail to capture the underlying ground truth (high bias). If we reduce the bias with a more flexible model, on the other hand, the model becomes vulnerable to noise and again fails to approximate the ground truth (overfitting, high variance). This dilemma shows that we cannot build a perfect model from data. A map is a map; it is not the territory. A menu is a menu; it is not the food. A statistical model is a model; it is not the ground truth. So we should neither overestimate nor underestimate the power of a statistical model. Just as a map is still useful for finding the right path, a statistical model is useful for understanding and predicting a system.
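To see the dilemma in a few lines of code, here is a minimal sketch using only NumPy. The ground-truth curve, the noise level, and the polynomial degrees are all illustrative assumptions, not anything from the book: a degree-1 fit is too rigid to follow the curve (high bias), while a high-degree fit chases the noise in the small training sample (high variance), and an intermediate degree typically does best on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground truth: a smooth curve observed with additive noise.
def ground_truth(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 20)
y_train = ground_truth(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 200)

for degree in (1, 3, 9):
    # Fit a polynomial of the given degree to the noisy sample.
    coeffs = np.polyfit(x_train, y_train, degree)
    y_pred = np.polyval(coeffs, x_test)
    # Error against the (normally unknown) ground truth:
    # degree 1 underfits (high bias), degree 9 chases noise (high variance).
    mse = np.mean((y_pred - ground_truth(x_test)) ** 2)
    print(f"degree {degree}: test MSE = {mse:.3f}")
```

Neither extreme wins: the too-simple map misses the road, while the overly detailed one traces every pothole of the training noise.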

Signal and Noise: How Do We Understand Data?

“In the statistical world, what we see and measure around us can be considered as the sum of a systematic mathematical idealized form plus some random contribution that cannot yet be explained.”

[The Art of Statistics, David Spiegelhalter]

Nate Silver's famous book, The Signal and the Noise, discusses how to find the signal in the noise. Since the levels of the signal and the noise depend entirely on the quality of the data, it is very hard to separate them perfectly; doing so also requires prior knowledge, intuition, and experience with the data. Following Spiegelhalter's framing above, every statistical model has two components: a (deterministic) mathematical formulation and a (stochastic) residual error. Hence, when we build a statistical model to analyze data, we need to distinguish what we know (the mathematical form) from what we do not know (the randomness). The name “residual error” may sound like the mark of a bad model, but it is not. Of course, a large residual error may stem from a bad choice of model, but it often stems instead from the limits of our knowledge, a lack of data, or the way the data were acquired.
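Here is a short sketch of that decomposition, with made-up data (the linear form, the coefficients, and the noise level are all assumptions for illustration): we fit a straight line as the deterministic component and treat whatever is left over as the stochastic residual.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observations: a systematic linear form plus unexplained noise.
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, x.size)  # truth: intercept 2.0, slope 0.5

# Deterministic component: the fitted mathematical formulation.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

# Stochastic component: the residual error the formulation cannot explain.
residuals = y - fitted
print(f"fitted form: y = {intercept:.2f} + {slope:.2f} * x")
print(f"residual mean = {residuals.mean():.3f}, std = {residuals.std():.3f}")
```

Here the residual spread simply reflects the injected randomness, not a failure of the model; if the residuals instead showed a clear pattern, that would hint that the mathematical form is missing something we could still learn.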

When we analyze data, we do not need to build a perfect model (in fact, it is impossible, for the reasons above). If we try to build an errorless model, we will end up struggling with overfitting, producing a useless model without any significant finding. Instead, we provide both a mathematical formulation and the corresponding residual error. That is all a statistician can do. Our lives are the same. We should not flagellate ourselves too much when a life plan falls through; often it is not our mistake but randomness in our lives. If the failure did come from our own mistake, we will fail several times in a row, and then we can review what we did and adjust our plan (or mindset). If not, randomness may hand us a successful comeback next time. Hence, whatever the reason a plan fails, we should keep trying for success in our lives.