The Whole is Different from the Sum of its Parts

“When a whole body of data displays one trend, yet when broken into subgroups, the opposite trend comes into view for each of those subgroups.”

[Weapons of Math Destruction, Cathy O’Neil]

In statistics, Simpson’s paradox describes a trend in the whole group that reverses within each of its subgroups. It shows how easily we can misread the statistical results of a data-driven model and infer the wrong cause for a phenomenon in our world.

We must pay particular attention to this misconception when building a model. A general model often fails to predict the behavior of subgroups. Conversely, we CANNOT guarantee that a combination of subgroup-specific models effectively predicts the coarse-scale behavior of the whole system. Hence, there should be a proper balance in the use of Big Data.
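To make the reversal concrete, here is a minimal Python sketch. The two options ("A" and "B"), the subgroup names, and the counts are assumptions chosen for illustration; they are not taken from the book.

```python
# Illustrative (successes, trials) counts per option and subgroup.
# These numbers are an assumption of this sketch, not data from the book.
counts = {
    "A": {"easy": (81, 87), "hard": (192, 263)},
    "B": {"easy": (234, 270), "hard": (55, 80)},
}


def subgroup_rates(table):
    """Success rate of each option within each subgroup."""
    return {
        option: {group: s / n for group, (s, n) in groups.items()}
        for option, groups in table.items()
    }


def overall_rates(table):
    """Success rate of each option after pooling all subgroups."""
    rates = {}
    for option, groups in table.items():
        successes = sum(s for s, _ in groups.values())
        trials = sum(n for _, n in groups.values())
        rates[option] = successes / trials
    return rates


by_group = subgroup_rates(counts)
pooled = overall_rates(counts)

for group in ("easy", "hard"):
    print(group, {opt: round(by_group[opt][group], 3) for opt in counts})
print("pooled", {opt: round(rate, 3) for opt, rate in pooled.items()})
```

With these counts, option A has the higher success rate in both subgroups, yet option B looks better once the subgroups are pooled. The reversal comes from the uneven mix: most of A's cases are "hard" ones, while most of B's are "easy" ones.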

Some Numbers to Represent You

“When you can create a model from proxies, it is far simpler for people to game it. This is because proxies are easier to manipulate than the complicated reality they represent.”

[Weapons of Math Destruction, Cathy O’Neil]

This is NOT a story about “Numerology” (though it is somewhat related). Can you believe that a few numbers could represent your whole life? In data science, we call these numbers “proxies”. To keep a model simple, we often use a few numbers in place of your actual characteristics, which makes predictions about you and your future efficient.

An AI model for college admission may screen on SAT scores (and the number of completed APs) alone. An AI model for personal loans may use only the five digits of your zip code. However, we should keep in mind that these proxies (numbers) cannot represent our lives correctly.

Am I the Same Person as I Was Yesterday?

“Mathematical models, by their nature, are based on the past, and on the assumption that patterns will repeat.”

[Weapons of Math Destruction, Cathy O’Neil]

We always promise ourselves, ‘I will not be the same person I was yesterday’. But can we keep that promise? It is really hard to change our daily routines, and we repeat the same mistakes over and over. Hence, we are very predictable.

Data-driven models find patterns in the past (or the near present) to predict future behavior, under the belief that history always repeats itself. Hence, a data-driven model uses our repeated patterns to improve its prediction accuracy. So, we need to break from our routines and become outliers if we do not want the model to predict our future.

How to Make my Model Unspoiled?

“Without feedback, however, a statistical engine can continue spinning out faulty and damaging analysis while never learning from its mistakes.”

[Weapons of Math Destruction, Cathy O’Neil]

To deal with a spoiled child effectively, parents (or teachers) should observe the child consistently. When the child is rude, they should give feedback immediately and teach responsibility, appreciation, and respect for others.

Data-driven models also require the right feedback so that they are updated in the right direction. To this end, a data-driven model uses its mistakes to adjust itself. However, this feedback loop commonly captures only one side of the mistakes. For example, an AI HR-screening model only has data about the poor performance of applicants it chose. It cannot have data about the successful careers of applicants it rejected.
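The one-sided feedback loop can be sketched in a few lines of Python. The applicants, their scores, the `THRESHOLD` value, and the `performs_well` field are all hypothetical; the point is only that outcomes exist for hired applicants and for nobody else.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    name: str
    score: float         # hypothetical screening score from the model
    performs_well: bool  # ground truth, observable only if hired


# Hypothetical applicant pool, made up for this sketch.
applicants = [
    Applicant("a", 0.9, True),
    Applicant("b", 0.8, False),
    Applicant("c", 0.7, True),
    Applicant("d", 0.4, True),
    Applicant("e", 0.3, False),
    Applicant("f", 0.2, True),
]

THRESHOLD = 0.5  # assumed cut-off for this sketch


def screen(pool, threshold):
    """Split the pool into hired and rejected applicants."""
    hired = [a for a in pool if a.score >= threshold]
    rejected = [a for a in pool if a.score < threshold]
    return hired, rejected


def observable_feedback(hired):
    """The only outcomes the model ever sees: those of hired applicants."""
    return [(a.name, a.performs_well) for a in hired]


hired, rejected = screen(applicants, THRESHOLD)
feedback = observable_feedback(hired)

# Mistakes the model can learn from: hired applicants who performed poorly.
false_positives = [name for name, ok in feedback if not ok]

# Mistakes the model never sees: rejected applicants who would have done well.
hidden_false_negatives = [a.name for a in rejected if a.performs_well]

print("visible mistakes:", false_positives)
print("invisible mistakes:", hidden_false_negatives)
```

In this toy pool the model can correct itself on one wrongly hired applicant, while the two good applicants it rejected never appear in its feedback at all.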

What Ingredients Do We Need for Yummy Data Soup?

“To create a model, then, we make choices about what’s important enough to include, simplifying the world into a toy version that can be easily understood.”

[Weapons of Math Destruction, Cathy O’Neil]

Imagine you are making chicken soup for dinner. What ingredients do you need for a delicious soup? Chicken, absolutely, and maybe celery, onion, and more; it depends on your mother’s recipe. Organic ingredients will be much better for your health.

Successful data-driven models commonly require enough of the important data. However, we often do not know which data is important, so we just put all of it into the model. Fortunately, a data-driven model can often find out which data is salient (via feature selection or dimensionality reduction) and make its own recipe. Also, unbiased (like organic) data will be much better for an accurate model.
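As one simple instance of feature selection, the sketch below ranks candidate features by the absolute value of their Pearson correlation with the target (a basic filter-style method). The feature names and values are made up for illustration, and real pipelines typically use richer criteria than a single correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort feature names by |correlation with target|, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: two informative features and one irrelevant one.
target = [1, 2, 3, 4, 5, 6]
features = {
    "signal": [2, 4, 6, 8, 10, 12.5],
    "inverse_signal": [12, 10, 8, 6, 4, 2],
    "noise": [3, 1, 4, 1, 5, 2],
}

ranked = rank_features(features, target)
selected = [name for name, _ in ranked[:2]]  # keep the two strongest features

for name, score in ranked:
    print(f"{name}: {score:.3f}")
print("selected:", selected)
```

Taking the absolute value matters here: `inverse_signal` moves opposite to the target, but it is just as informative as `signal`, while `noise` ends up at the bottom of the ranking.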

Make Your Crystal Ball Shine

“These mathematical models were opaque, their workings invisible to all but the highest priests in their domain: mathematicians and computer scientists. Their verdicts, even when wrong or harmful, were beyond dispute or appeal.”

[Weapons of Math Destruction, Cathy O’Neil]

Fortune-tellers show us our future with their crystal balls. We don’t know how the crystal ball works; we simply believe its prediction (or not). In fact, the fortune-tellers don’t know either. The only thing they can do is make their crystal balls shine.

Most predictive models based on machine learning are black boxes, like crystal balls, so we cannot know what happens inside. Some people worry about this opacity, but it may also remove human prejudice and bias from the decision. The one thing we can do is feed these models unbiased and accurate data.

More Data, More Simple

“Single molecules are far too random. The balloon, along with the air it contains, follows a predictable pattern, but only when considered in aggregate.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

A dot-to-dot puzzle has scattered, numbered points, and we connect the dots in numerical order until the hidden figure appears. If the dots carry no numbers, we may not draw the right figure. But what if we have more and more dots? Then we can recognize the silhouette of the figure and draw it easily, with no numbers required.

More data likewise reveals the silhouette of the underlying pattern in Big Data. After spotting the hidden pattern, we no longer need the detailed data (like the numbers in a dot-to-dot). The result can be a clear and simple model even in a large-scale data domain where many features are intertwined. Don’t worry about adding more features and data to your model. More data, more simple.
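One narrow way to see the "predictable only in aggregate" idea is to average more and more random draws and watch the averages settle down. The sketch below is only an illustration of that aggregate effect under an assumed uniform distribution; the function name `spread_of_averages` and its parameters are hypothetical.

```python
import random
import statistics


def spread_of_averages(sample_size, repeats=200, seed=0):
    """Standard deviation of the sample mean across repeated experiments.

    Each experiment averages `sample_size` uniform random draws in [0, 1).
    """
    rng = random.Random(seed)
    averages = [
        statistics.fmean([rng.random() for _ in range(sample_size)])
        for _ in range(repeats)
    ]
    return statistics.stdev(averages)


for size in (10, 100, 1000):
    print(size, spread_of_averages(size))
```

Each individual draw is unpredictable, yet the average of many draws varies less and less as the sample grows, which is why the aggregate can be described by a much simpler model than its parts.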

Data Avengers Save the World

“By working together, teams of scientists, humanists, and engineers can create shared resources of extraordinary power. … Big humanities is waiting to happen.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

The Avengers team up to fight the villain. Heroes and heroines with different special abilities create great synergy and finally save the world. Facing Big Data (which is not a villain), researchers from many different fields team up to find important patterns in the data. This is called (data-driven) interdisciplinary research.

This interdisciplinary research will give us better (and less biased) solutions to the complex problems the world is facing (e.g. poverty, inequality, health, world peace …). Because these problems are too complex to understand from a single perspective, we need various tools from statistics, machine learning, scientific theory, social science, psychology, and economics.

Does Big Data Steal Your Soul?

“As we just saw, having just a single picture of someone can give you a form of power over that person. Will big data steal your soul outright?”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

When photography was first unveiled, some people believed that the scary box captured their spirit, leaving them to live the rest of their lives without a soul. That sounds silly today, but we now worry about our privacy instead of our souls.

Today, big social media companies collect data from every single user. Even when they try to anonymize the data by replacing user identifiers with random numbers, Big Data is always exposed to the risk of a privacy violation. Hence, data scientists should take great care to protect people’s privacy when building a data-driven model.
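The following Python sketch shows why replacing identifiers with random numbers is not enough on its own. The records, field names, and the choice of `zip` plus `birth_year` as quasi-identifiers are assumptions made for illustration only.

```python
import secrets
from collections import Counter

# Hypothetical user records, made up for this sketch.
records = [
    {"user_id": "alice", "zip": "02139", "birth_year": 1990},
    {"user_id": "bob", "zip": "02139", "birth_year": 1990},
    {"user_id": "carol", "zip": "94305", "birth_year": 1985},
]


def pseudonymize(rows):
    """Replace each user_id with a random token; keep all other fields."""
    tokens = {}
    result = []
    for row in rows:
        if row["user_id"] not in tokens:
            tokens[row["user_id"]] = secrets.token_hex(8)
        cleaned = {key: value for key, value in row.items() if key != "user_id"}
        cleaned["token"] = tokens[row["user_id"]]
        result.append(cleaned)
    return result


def rows_at_risk(rows, keys=("zip", "birth_year")):
    """Rows whose combination of quasi-identifiers is unique in the data."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if combos[tuple(row[k] for k in keys)] == 1]


anonymized = pseudonymize(records)
at_risk = rows_at_risk(anonymized)

print("rows without user_id:", len(anonymized))
print("rows with a unique (zip, birth_year):", len(at_risk))
```

Even though no `user_id` survives, one row is the only record with its zip code and birth year, so anyone who knows those two facts about a person could still single that row out.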