What Number Is a Really Big Number?

humble pi

“As humans, we are not good at judging the size of large numbers. And even when we know one is bigger than another, we don’t appreciate the size of the difference.”

[Humble Pi: A Comedy of Maths Errors, Matt Parker]

In the Stone Age, a hundred might have been a sufficient number to count a herd of deer for hunting or a pile of gathered nuts. In the early (and mid) 20th century, a million was enough to earn the rich the title ‘millionaire’, but now it is far too small to count Mark Zuckerberg’s net worth (a million is still BIG money for me, by the way). In the age of Big Data, what number is a really big number? In the 1980s, Bill Gates, a pioneer who ushered in the computer age, is often quoted as saying: “for computer memory, 640K ought to be enough for anybody.” Nobody can predict the big number correctly, and this is human nature.

However, we still need to estimate a certain big number for a data-driven model, for our business, or for our blogs. Since the unveiling of the smartphone, data acquisition has become extremely fast, leading to the age of AI and Big Data. Nowadays, when we build a model, we consider its capacity to deal with tremendous amounts of data (beyond a trillion). The recent introduction of the Internet of Things (IoT) and the autonomous vehicle will generate countless data points every second. So we need to keep rethinking the big number again and again. That is why I am preparing for the first event for the billionth visitor to my blog. Do you think this number is still small? It depends on your actions. Please visit my blog more often!

Make Your Problem Harder!

How not to be wrong

“Instead, we turn to the other strategy, which is the one Barbier used: make the problem harder. That doesn’t sound promising. But when it works, it works like a charm.”

[How not to be wrong, Jordan Ellenberg]

When a friend is struggling with a difficult problem, we often say: “Don’t make it complex; just start with a simple problem.” This is because experience has taught us that simplification often provides clues for solving the difficult problem. This is what mathematicians actually do every day: when proving a statement, they start from the simplest case and expand it to the target problem. Sometimes, however, making the problem harder suggests a simple alternative way to solve your real problem effectively.

Many data scientists have focused only on reducing the number of features to make a data-driven model simpler. However, this approach does not always give the simplest model. The projection onto a lower dimension (fewer features) may make the data structure more complicated, so that the model fails to spot the hidden pattern. Hence, sometimes they need to add features to make a model simpler: with more features, the pattern can become easier to describe. This alternative thinking (adding more features) embodies the trade-off between a simple model with many features and a complicated model with few features.
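
A minimal sketch of this “make the problem harder” trick (my own toy example, not taken from the book): in one dimension the label below depends on |x|, so no single threshold on x can separate the classes, but adding x² as an extra feature makes one linear boundary enough.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1-D toy data: the class depends on |x|, so no single threshold on x separates it.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = (np.abs(x) > 1.5).astype(int)

# Fewer features: a linear model on the raw feature performs poorly (near chance).
simple = LogisticRegression().fit(x.reshape(-1, 1), y)
print("1 feature :", simple.score(x.reshape(-1, 1), y))

# "Harder" problem: add x**2 as a second feature.
# In this higher-dimensional space a single linear boundary works almost perfectly.
X2 = np.column_stack([x, x ** 2])
harder = LogisticRegression().fit(X2, y)
print("2 features:", harder.score(X2, y))
```

Kernel methods in machine learning are built on exactly this kind of deliberate dimension-raising.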

Justice: What’s the Right Thing to Do in Data Science?

“The model is optimized for efficiency and profitability, not for justice or the good of the “team”. This is, of course, the nature of capitalism.”

[Weapons of Math Destruction, Cathy O’Neil]

Michael J. Sandel’s magnum opus, Justice: What’s the Right Thing to Do?, called our attention to justice (and fairness) in a period of capitalist prosperity. Data science behaves in a fashion similar to capitalism: more data (money) means more power, and efficiency (profitability) is the most important factor for success. Hence, in data science, we should consider how to make fairness and efficiency (and profitability) compatible.

To take fairness into consideration in data-driven models, we need to think about what we can actually do. First, we should double-check that our data are unbiased. In particular, historical data are often biased because they reflect the different circumstances of their time, so when combining data that span a long history, we need delicate effort to eliminate hidden bias. Second, we can add “fairness” directly to the main objectives of a data-driven model. Here we run into the problem of how to quantify fairness (and justice, and morality). Building a fair model is therefore still challenging, but it is not impossible.
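
A minimal sketch of the quantification problem (the numbers and the metric choice are my own illustration, not from the book): one common proxy for fairness is the demographic parity gap, the difference in positive prediction rates between two groups.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Gap in positive-prediction rates between two groups (0 = 'fair' by this metric)."""
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return abs(rate_a - rate_b)

# Toy predictions for 10 applicants from two groups (illustrative numbers only).
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0])  # 1 = model says "accept"
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # group membership

print(demographic_parity_gap(y_pred, group))  # |0.8 - 0.2| = 0.6, a large gap
```

Of course, this single number is only one of many competing definitions of fairness, which is exactly why quantifying fairness remains hard.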

Don’t Put Me in, Data

“More times than not, birds of a feather do fly together. … Investors double down on scientific systems that can place thousands of people into what appear to be the correct buckets.”

[Weapons of Math Destruction, Cathy O’Neil]

To enhance computational efficiency, data-driven models often create subgroups and predict the future behavior of those subgroups (not of every single person). They work under the premise that people with similar characteristics tend to make similar decisions on specific problems (like a doppelgänger search). Predicting at the subgroup level leads to an efficient and simple predictive model.

There are still some important questions about this efficient model. Are we in the correct subgroups? If so, is it true that all the people in the same subgroup always make the same (or a very similar) decision? Data scientists should keep these questions in mind, and then check whether their predictions really can be organized into a finite set of subgroups and whether that number of subgroups is enough to make the right prediction. Someone may want to say, “Don’t put me in any subgroup. I am so independent!”
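
A minimal sketch of subgroup-level prediction (the data, the use of k-means, and the number of subgroups are all assumptions for illustration): cluster people by their features and predict each person’s behavior with the average of their subgroup.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: each row is a person's features; `spend` is the behaviour to predict.
rng = np.random.default_rng(1)
features = rng.normal(size=(300, 4))
spend = features @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=300)

# Put people into k subgroups, then predict with each subgroup's average behaviour.
k = 5
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
group_means = np.array([spend[labels == g].mean() for g in range(k)])
predictions = group_means[labels]

# The questions from the text: is k large enough, and is everyone in the "right" bucket?
print("mean absolute error with", k, "subgroups:", np.abs(predictions - spend).mean())
```

Raising k reduces the error but erodes the efficiency that motivated the subgroups in the first place.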

The Whole is Different from the Sum of its Parts

“When a whole body of data displays one trend, yet when broken into subgroups, the opposite trend comes into view for each of those subgroups.”

[Weapons of Math Destruction, Cathy O’Neil]

In statistics, Simpson’s paradox describes a difference in trends between the whole group and its subgroups: the trend in the aggregate can even be the opposite of the trend in every subgroup. This means we can easily misread the statistical output of a data-driven model and arrive at the wrong causal story for some phenomenon in our world.
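
A minimal numerical sketch of the paradox (the admission counts are invented for illustration, in the spirit of the classic Berkeley admissions example): within each department women are admitted at a higher rate than men, yet the pooled numbers show the opposite.

```python
# (applicants, admitted) for two departments -- invented numbers for illustration
men_dept_A, women_dept_A = (80, 60), (20, 18)   # 75% vs 90% admitted
men_dept_B, women_dept_B = (20, 4),  (80, 24)   # 20% vs 30% admitted

def rate(*groups):
    applied = sum(g[0] for g in groups)
    admitted = sum(g[1] for g in groups)
    return admitted / applied

print(rate(men_dept_A), rate(women_dept_A))      # 0.75 vs 0.90 -> women higher
print(rate(men_dept_B), rate(women_dept_B))      # 0.20 vs 0.30 -> women higher
print(rate(men_dept_A, men_dept_B),
      rate(women_dept_A, women_dept_B))          # 0.64 vs 0.42 -> trend reverses
```

The reversal happens because the two groups apply to the departments in very different proportions, which the pooled number silently ignores.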

We should pay particular attention, when building a model, to avoiding this misconception. A general model often fails to predict the behavior of subgroups. Conversely, we CANNOT guarantee that a combination of specific models predicts the coarse-scale behavior of the whole system effectively. Hence, there should be a proper balance in the use of Big Data.

Am I the Same Person as I Was Yesterday?

“Mathematical models, by their nature, are based on the past, and on the assumption that patterns will repeat.”

[Weapons of Math Destruction, Cathy O’Neil]

We always promise ourselves, ‘I will not be the same person as I was yesterday.’ But can we keep that promise? It is really hard to change our daily routine, and we repeat the same mistakes over and over. Hence, we are highly predictable.

Data-driven models find patterns in the past (or the near present) to predict future behavior, under the belief that history always repeats itself. Hence, a data-driven model uses our repeated patterns to enhance its prediction accuracy. So, if we do not want the model to predict our future, we need to break from our routines and become outliers.

How to Make My Model Unspoiled?

“Without feedback, however, a statistical engine can continue spinning out faulty and damaging analysis while never learning from its mistakes.”

[Weapons of Math Destruction, Cathy O’Neil]

To deal with a spoiled child effectively, parents (or teachers) should observe the child consistently. When the child becomes rude, parents should give feedback immediately and teach responsibility, appreciation, and respect for others.

Data-driven models also require the right feedback so that they are updated in the right direction. To this end, data-driven models use their mistakes to adjust themselves. However, this feedback loop commonly includes only one side of the mistakes. For example, an AI HR-screening model only has data about the poor performance of the applicants it chose; it can never have data about the successful careers of the applicants it did not choose.
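
A minimal simulation sketch of this one-sided feedback (the scores, the hiring rule, and all numbers are my own assumptions, not from the book): outcomes are only ever observed for the applicants the model accepts, so the rejected side of the story never reaches the feedback loop.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical applicant pool: a true quality the screener never sees directly,
# and the noisy score the screening model actually uses.
true_quality = rng.normal(size=10_000)
screen_score = true_quality + rng.normal(scale=1.0, size=10_000)

# The model "hires" the top 20% by its noisy score.
hired = screen_score >= np.quantile(screen_score, 0.8)

# One-sided feedback: job performance is only ever observed for hired applicants.
observed = true_quality[hired]      # enters the feedback loop
never_seen = true_quality[~hired]   # never enters the feedback loop

print("mean quality of hired (observed)        :", round(observed.mean(), 2))
print("mean quality of rejected (never seen)   :", round(never_seen.mean(), 2))
print("share of rejected who were above average:", round((never_seen > 0).mean(), 2))
```

The model keeps being judged only on the people it already liked, which is precisely how it can “continue spinning out faulty and damaging analysis” without ever noticing.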