Statistical Model: A Map is Not the Territory


“A good analogy is that a model is like a map, rather than the territory itself. And we all know that some maps are better than others: a simple one might be good enough to drive between cities, but we need something more detailed when walking through the countryside.”

[The Art of Statistics, David Spiegelhalter]

When you visit Disneyland, you do not need a detailed map built from satellite imagery; a simple cartoon map showing the relative locations of the attractions is enough. If you were a secret agent on a mission there, you would need a far more detailed map. A statistical model (or any data-driven model) is the same: the fidelity it needs depends entirely on the purpose it serves and on the quality of the data that has been fed into it.

When building a statistical model, there is a general trade-off between bias and variance (the bias-variance dilemma). If we reduce the variance, the model may fail to approximate the underlying ground truth (high bias). If we reduce the bias, on the other hand, the model becomes vulnerable to noise, which again spoils the approximation (overfitting and high variance). This dilemma shows that we cannot build a perfect model from data. A map is a map; it is not the territory. A menu is a menu; it is not the food. A statistical model is a model; it is not the ground truth. So we should neither overestimate nor underestimate the power of a statistical model. Just as a map is still useful for finding the right path, a statistical model is useful for understanding and predicting a system.
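For readers who like to see the dilemma in symbols, the standard bias-variance decomposition of the expected squared prediction error (the textbook identity behind this trade-off, not a quotation from the book) is, for data generated as y = f(x) + ε with noise variance σ² and a fitted model f̂:

```latex
\underbrace{\mathbb{E}\big[(y - \hat{f}(x))^2\big]}_{\text{expected prediction error}}
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Reducing one of the first two terms typically inflates the other, and no choice of model can remove the last term.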

Signal and Noise: How to Understand Data?

“In the statistical world, what we see and measure around us can be considered as the sum of a systematic mathematical idealized form plus some random contribution that cannot yet be explained.”

[The Art of Statistics, David Spiegelhalter]

Nate Silver's famous book, The Signal and the Noise, is about how to separate the signal from the noise. Since the levels of signal and noise depend entirely on the quality of the data, it is very hard to distinguish the two perfectly; doing so also requires prior knowledge, intuition, and experience with the data. Accordingly, every statistical model has two components: a (deterministic) mathematical formulation and a (stochastic) residual error. When we build a statistical model to analyze data, we therefore need to separate what we know (the mathematical form) from what we do not know (the randomness). The name “residual error” may sound like a sign of a bad model, but it is not. A large residual error may indeed stem from a poor choice of model, but it often stems from a lack of knowledge, a lack of data, or the way the data were acquired.
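As a small illustration of this “mathematical form plus residual error” view, here is a minimal Python sketch with simulated data (the numbers are made up, not from the book): we posit a straight line as the deterministic part and treat whatever it cannot explain as the stochastic part.

```python
import numpy as np

# Simulated "signal + noise" data: a known straight line plus random noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)        # the deterministic mathematical form
residuals = y - (slope * x + intercept)           # the stochastic residual error

print(f"fitted signal : y ≈ {slope:.2f} x + {intercept:.2f}")
print(f"residual std  : {residuals.std():.2f} (close to the true noise level of 1.5)")
```

The residuals are not a failure of the model; they record exactly how much of the data the chosen mathematical form does not explain.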

When we analyze data, we do not need to build a perfect model (in fact, it is impossible, for the reasons above). If we try to build an error-free model, we end up struggling with overfitting and produce a poor model with no significant findings. Instead, we report both the mathematical formulation and the corresponding residual error; that is all a statistician can do. Our lives are the same. We should not flagellate ourselves too much when a life plan falls through; often it is not our mistake but randomness at work. If the failure really did come from our own mistake, we will fail several times in a row, and then we can examine what we did and adjust our plan (or mindset). If not, randomness may hand us a comeback next time. So whatever the reason a plan fails, we keep trying.

Trade-off: The Quality or The Quantity of Data for Better Statistics


“When we want to use the data to draw broader conclusions about what is going on around us, then the quality of the data becomes paramount, and we need to be alert to the kind of systematic biases that can jeopardize the reliability of any claims.”

[The Art of Statistics, David Spiegelhalter]

In the age of Big Data, we can collect tremendous amounts of data from many sources of differing quality (e.g. accuracy, resolution, or fidelity). Using all of these data, we can easily compute statistics about what we measured, and comparing them with previous results helps us understand what is going on. However, if we want a deep understanding of hidden patterns for accurate prediction of the future (statistical inference), the quality of the data becomes the main factor: the higher the quality, the higher the accuracy. Collecting data, though, involves a general trade-off between quality and quantity. Highly accurate data are expensive to acquire (e.g. costly measurements, or fine-scale simulations that need more computing resources), while less accurate data are relatively cheap to obtain.

As the book notes, a data-driven predictive model depends entirely on the quality (and quantity) of the data. A natural first step is to check the accuracy (or fidelity) of the data and use only the high-fidelity data to build a model for decision-making or prediction. Because of the trade-off above, however, we typically have only a few high-fidelity data points and/or many low-fidelity ones. How, then, should we build a data-driven model? A few high-fidelity data points provide only partial information, so it is hard to build a model that is accurate globally. Many low-fidelity data points let us build a global model, but one with an inherent systematic bias that leads to wrong predictions. Hence, many researchers in data science have focused on multi-fidelity data fusion, which builds an accurate global model from both high- and low-fidelity data, chasing both quality and quantity.
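To make the idea concrete, here is a minimal toy sketch of one possible fusion strategy (an additive correction; this illustrates the general idea, not any specific published method). The abundant low-fidelity data supply the global trend, and the few high-fidelity points are used only to learn a discrepancy correction. All functions and data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def high_fidelity(x):      # the expensive "ground truth" we can rarely afford to evaluate
    return np.sin(2 * np.pi * x)

def low_fidelity(x):       # cheap but systematically biased approximation
    return 0.8 * np.sin(2 * np.pi * x) + 0.3

x_lo = rng.uniform(0, 1, 200)              # many cheap low-fidelity samples
x_hi = np.array([0.1, 0.4, 0.7, 0.9])      # only a few expensive high-fidelity samples

# Step 1: fit a global trend to the abundant low-fidelity data.
trend = np.polynomial.Polynomial.fit(x_lo, low_fidelity(x_lo), deg=5)

# Step 2: learn the discrepancy between the high-fidelity data and that trend,
# keeping the correction low-order because we have so few points.
discrepancy = high_fidelity(x_hi) - trend(x_hi)
correction = np.polynomial.Polynomial.fit(x_hi, discrepancy, deg=1)

def fused(x):
    # Fused model = global low-fidelity trend + high-fidelity correction.
    return trend(x) + correction(x)

x_test = np.linspace(0, 1, 101)
rmse = lambda pred: np.sqrt(np.mean((pred - high_fidelity(x_test)) ** 2))
print("low-fidelity RMSE:", rmse(low_fidelity(x_test)))
print("fused model RMSE :", rmse(fused(x_test)))
```

The correction step is what removes the systematic bias of the cheap data while the cheap data keep the model accurate away from the few expensive samples.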

Are There a Few Magic Numbers for Describing Complex Systems?


“Large collections of numerical data are routinely summarized and communicated using a few statistics of location and spread, (…), these can take us a long way in grasping an overall pattern.”

[The Art of Statistics, David Spiegelhalter]

Can we grasp every (fine-scale) pattern in a massive data set? A genius might keep track of all of them, but for the rest of us it is practically impossible. That is why we use statistics to understand and analyze large data sets and to predict or estimate the future from statistical results (e.g. population, economic growth, the unemployment rate, or stock prices). For example, to build a business aimed at children, it is much easier to look at the average birth rate in a region than to count the children in my neighborhood. Statistical approaches reduce a complex system to just a few numbers. This simplification lets us build a simple (predictive) model, leading to efficient and streamlined analysis.

I agree that a few numbers make a complex system simpler, and I have found that such a simple representation can point us toward better decisions. Then what is a good “number” (statistic) for the massive data in our hands? The average? Well, the book also says: “there is no substitute for simply looking at data properly.” So we should be careful about describing a complex system with only a few statistics. Some statistics, such as the average, are vulnerable to outliers. We can even draw dinosaur-shaped patterns with a prescribed mean and variance (please see my previous post). Nowadays, data-driven approaches via statistical learning (machine learning) may provide good numbers for describing a complex system effectively, yet we still need to scrutinize every statistic such models produce. Even so, I expect that in the near future a data-driven AI model may find a good reparameterization of a massive data set for better understanding.
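A tiny, hypothetical example of that vulnerability: a single extreme value drags the mean far away while the median barely moves.

```python
import numpy as np

# Hypothetical data: nine typical household incomes (in $1,000s) plus one extreme outlier.
incomes = np.array([32, 35, 36, 38, 40, 41, 43, 45, 47])
with_outlier = np.append(incomes, 2_000)

print("mean   without / with outlier:", incomes.mean(), with_outlier.mean())
print("median without / with outlier:", np.median(incomes), np.median(with_outlier))
# The mean jumps from ~39.7 to ~235.7, while the median only moves from 40 to 40.5:
# one data point can dominate some summary statistics but not others.
```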

Framing: Statistics Can Manipulate Our Thinking


“The examples in this chapter have demonstrated how the apparently simple task of calculating and communicating proportions can become a complex matter.”

[The Art of Statistics, David Spiegelhalter]

Thanks to you, the number of followers increased by 22% in November! When you read that sentence, you may think this emerging blog is growing rapidly and that there must be reasons for its success. If I had written "4 people started following my blog in November", you might feel differently. Both statements are true: my blog had 18 followers in October and has 22 now. Different representations of the same statistic change the impact of an observation; we call this positive (or negative) framing. There are many examples. A pharmaceutical company would rather say a new medicine has a 95% survival rate than a 5% mortality rate (positive framing). An investigative journalist would rather say that 3,000,000 people are suspected of tax evasion every year than that 1% of people are (negative framing). Framing also appears in graphs. Suppose we need to draw a bar chart with two bars whose values are 95 and 98. If we draw the y-axis from 0 to 100, the two bars look similar; if we draw it from 90 to 100, the bars look completely different.
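To see both framings in numbers and in pictures, here is a minimal sketch using the follower counts above and the 95-versus-98 bar chart example (the plot is just one possible illustration):

```python
import matplotlib.pyplot as plt

# The same growth, framed two ways: an absolute count versus a relative change.
followers_before, followers_after = 18, 22
print("absolute change :", followers_after - followers_before)   # 4 new followers
print("relative change : {:.0%}".format(
    (followers_after - followers_before) / followers_before))    # ~22%

# The same two values, framed two ways on a bar chart by choosing the y-axis range.
values = [95, 98]
fig, (ax_full, ax_zoom) = plt.subplots(1, 2, figsize=(8, 3))
ax_full.bar(["A", "B"], values)
ax_full.set_ylim(0, 100)      # full scale: the bars look almost identical
ax_full.set_title("Axis from 0 to 100")
ax_zoom.bar(["A", "B"], values)
ax_zoom.set_ylim(90, 100)     # truncated scale: the difference looks dramatic
ax_zoom.set_title("Axis from 90 to 100")
plt.tight_layout()
plt.show()
```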

How can we escape this framing? Information providers should offer alternative representations (different graphs, raw data, tables) so that we can form a balanced view by examining the data ourselves. We should also stay skeptical whenever we see data: first check who published the statistics, and why; data do not lie, but presenters may. None of this means that statistics are mere trickery. Statistics remain a powerful way to understand, analyze, and visualize data, and in the age of Big Data, statistical knowledge is fast becoming the main tool for handling big data correctly. In short, statistics are a double-edged sword; their power depends on us.

[Wrap up] Book Review: Humble Pi: A Comedy of Maths Errors

In our lives, mathematics is essential for logical, evidence-based thinking built on rigorous analysis. Especially when we predict something new, the power of mathematics overwhelms instinct or heuristics. When mathematics is used improperly, however, catastrophic results await. In this book, the author, Matt Parker, describes this important role of mathematics and recounts disasters stemming from mathematical errors through exhilarating stories drawn from his own experience.

Then what is the role of a human in mathematics? We should use mathematics when deciding something important, and we should check for every kind of mathematical error to avoid disaster. Let me quote his final paragraph: “Our modern world depends on mathematics and, when things go wrong, it should serve as a sobering reminder that we need to keep an eye on the hot cheese but also remind us of all the maths which works faultlessly around us.”

The following are links to some quotations from the book, together with my thoughts.

(1) What Number Is a Really Big Number?

(2) Please Give Math More Time to Pick up the Pieces

(3) I Don’t Count on You When You Count Numbers

(4) More Approximations, More Problems in Your Life

(5) Probably, We Are Not Independent

(6) Searching for Average Man

(7) Sometimes, Simple Mathematics is Better than Our Experiences

What Number Is a Really Big Number?


“As humans, we are not good at judging the size of large numbers. And even when we know one is bigger than another, we don’t appreciate the size of the difference.”

[Humble Pi: A Comedy of Maths Errors, Matt Parker]

In the Stone Age, a hundred might have been enough to count a herd of deer for hunting or a pile of gathered nuts. In the early and mid 20th century, a million was enough to label the rich "millionaires", but now it is far too small to count Mark Zuckerberg's net worth (a million is still BIG money for me, by the way). In the age of Big Data, what number counts as a really big number? In the 1980s, Bill Gates, a pioneer who ushered in the computer age, is often (perhaps apocryphally) quoted as saying: “for computer memory, 640K ought to be enough for anybody.” Nobody can predict how big the big numbers will get, and that is human nature.

However, we still need to estimate certain big numbers for a data-driven model, for our businesses, or for our blogs. Since the smartphone arrived, data acquisition has become extremely fast, ushering in the age of AI and Big Data. Nowadays, when we build a model, we must consider its capacity to handle tremendous amounts of data (beyond a trillion records). The recent rise of the Internet of Things (IoT) and autonomous vehicles will generate countless data points every second, so we need to keep rethinking what counts as a big number. That is why I am preparing an event for the billionth visitor to my blog. Do you think that number is still small? It depends on you: please visit my blog more!