weapons of math destruction – Everybody makes DATA

[Wrap up] Book Review: Weapons of Math Destruction

September 20, 2019 / October 11, 2019 | APMA

NOW, we live in the age of Big Data, and nobody can stop the invasion of artificial intelligence (AI) into our world. Many books about data science have shown the effectiveness, accuracy, and efficiency of data-driven models in fields such as economics, social science, and engineering. Do you agree that data-driven models built on mathematics, computer science, and machine learning will bring a prosperous future?

THE author, Cathy O’Neil, drawing on her academic and industrial experience, shows us the dark side of mathematical (or data-driven) models and calls them ‘Weapons of Math Destruction’ (WMDs). According to the author, WMDs have three distinguishing characteristics: (1) Opacity; (2) Scale; and (3) Damage. Through several examples in which these characteristics cause harm, the author calls our attention to ‘fairness’ in Big Data and AI.

The following links lead to some quotations from the book, along with my thoughts.

(1) Make Your Crystal Ball Shine

(2) What Ingredients Do We Need for Yummy Data Soup?

(3) How to Make my Model Unspoiled?

(4) Am I the Same Person as I Was Yesterday?

(5) Some Numbers to Represent You

(6) The Whole is Different from the Sum of its Parts

(7) Don’t Put Me in, Data

(8) Justice: What’s the Right Thing to Do in Data Science?

(9) As Human Beings, We are Flawed but We Learn

As Human Beings, We are Flawed but We Learn

September 18, 2019 | APMA

“But human decision making, while often flawed, has one chief virtue. It can evolve. As human beings learn and adapt, we change, and so do our processes.”

[Weapons of Math Destruction, Cathy O’Neil]

We are flawed and vulnerable. We are sometimes blinded by prejudice. We are often apt to be emotional and fail to make the right decision. Yes, we are human beings. However, we have learned from our mistakes. We accepted the Copernican system. We changed our minds after Martin Luther King’s “I Have A Dream” speech. When we realize that something is wrong, we can change all at once.

Automated systems, by contrast, CANNOT change their model immediately. The only thing they can do is improve the model by adding more parameters and correlations (like the eccentrics and epicycles of the Ptolemaic system). This only makes the model more complicated (not the right direction!). This shows the main role of human beings in the age of Big Data. Only we, human beings, can stop and change a data-driven model immediately when it goes wrong.

Justice: What’s the Right Thing to Do in Data Science?

September 16, 2019 / September 17, 2019 | APMA

“The model is optimized for efficiency and profitability, not for justice or the good of the “team”. This is, of course, the nature of capitalism.”

[Weapons of Math Destruction, Cathy O’Neil]

Michael J. Sandel’s magnum opus, Justice: What’s the Right Thing to Do?, called our attention to justice (and fairness) in a period when capitalism prospers. Data science behaves much like capitalism. More data (money) means more power, and efficiency (profitability) is the most important factor for success. Hence, in data science, we should look for ways to make fairness compatible with efficiency (and profitability).

To take fairness into consideration in data-driven models, we need to think about what we can do. First, we should double-check that our data are unbiased. Historical data in particular are often biased because they were collected under different historical circumstances, so when combining data that span a long period, we need careful effort to eliminate hidden bias. Second, we can add “fairness” directly to the main objectives of data-driven models. Here, we face the problem of how to quantify fairness (and also justice and morality). Hence, building a fair model is still challenging, but it is not impossible.
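One possible way to put a number on fairness is the "demographic parity gap": the difference between groups in how often the model says yes. The sketch below is only an illustration under that assumption; the function names, the toy predictions, and the `weight` parameter are all hypothetical, and demographic parity is just one of several fairness measures one could choose.

```python
def positive_rate(predictions):
    """Share of predictions that are positive (1)."""
    return sum(predictions) / len(predictions)


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive rate between any two groups."""
    by_group = {}
    for prediction, group in zip(predictions, groups):
        by_group.setdefault(group, []).append(prediction)
    rates = [positive_rate(ps) for ps in by_group.values()]
    return max(rates) - min(rates)


def accuracy(predictions, labels):
    """Share of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def fairness_aware_score(predictions, labels, groups, weight=1.0):
    """Accuracy minus a penalty for unequal treatment of groups.

    `weight` (hypothetical) sets how much fairness counts against accuracy.
    """
    gap = demographic_parity_gap(predictions, groups)
    return accuracy(predictions, labels) - weight * gap


# Toy data: eight people, two groups, made up for illustration.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
biased_predictions = [1, 1, 1, 0, 0, 0, 1, 0]  # says yes to group "a" more often

print("gap:", demographic_parity_gap(biased_predictions, groups))
print("score:", fairness_aware_score(biased_predictions, labels, groups))
```

With a score like this, a model that is accurate but treats the groups very differently can rank below a slightly less accurate but more even-handed one. How large `weight` should be is exactly the open question of how much fairness is worth.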

Don’t Put Me in, Data

September 13, 2019 / September 23, 2019 | APMA

“More times than not, birds of a feather do fly together. … Investors double down on scientific systems that can place thousands of people into what appear to be the correct buckets.”

[Weapons of Math Destruction, Cathy O’Neil]

To enhance computational efficiency, data-driven models often create subgroups and predict the future behavior of these subgroups rather than of every single person. They do so under the premise that people with similar characteristics will make similar decisions on a given problem (like a doppelganger search). Predicting at the subgroup level leads to an efficient and simple predictive model.

There are still some important questions about this efficient model. Are we in the correct subgroups? If so, is it true that all the people in the same subgroup always make the same (or a very similar) decision? Data scientists should keep these questions in mind. They should then check that the predictions can really be divided into a finite number of subgroups, and that there are enough subgroups to make the right prediction. Someone may want to say, “Don’t put me in any subgroup. I am so independent!”
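One simple way to check the second question is to measure how often the members of a subgroup actually agree with their subgroup's majority decision. The following is a minimal sketch, assuming we already have (subgroup, decision) pairs; the subgroup names and decisions are made up, and `subgroup_agreement` is a hypothetical helper, not a standard function.

```python
from collections import Counter, defaultdict


def subgroup_agreement(records):
    """For each subgroup, return the share of members who made the
    subgroup's most common decision (1.0 means everyone agrees)."""
    decisions = defaultdict(list)
    for subgroup, decision in records:
        decisions[subgroup].append(decision)

    agreement = {}
    for subgroup, made in decisions.items():
        majority_count = Counter(made).most_common(1)[0][1]
        agreement[subgroup] = majority_count / len(made)
    return agreement


# Made-up records: which subgroup a person was put in, and what they did.
records = [
    ("students", "buy"), ("students", "buy"),
    ("students", "buy"), ("students", "skip"),
    ("retirees", "buy"), ("retirees", "skip"),
    ("retirees", "buy"), ("retirees", "skip"),
]

for subgroup, share in subgroup_agreement(records).items():
    print(subgroup, share)
```

A subgroup whose agreement is close to a coin flip is a hint that the bucket is too coarse, or that its members are more independent than the model assumes.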

The Whole is Different from the Sum of its Parts

September 11, 2019 / September 17, 2019 | APMA

“When a whole body of data displays one trend, yet when broken into subgroups, the opposite trend comes into view for each of those subgroups.”

[Weapons of Math Destruction, Cathy O’Neil]

In statistics, Simpson’s paradox describes a trend in the whole group that reverses within its subgroups. It shows how easily we can misread the statistical results of a data-driven model and infer the wrong causes for phenomena in our world.

We must pay particular attention to this misconception when building a model. A general model often fails to predict the behavior of subgroups. Likewise, we CANNOT guarantee that a combination of specific models will effectively predict the coarse-scale behavior of the whole system. Hence, there should be a proper balance in the use of Big Data.
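A small numerical example may make the paradox easier to see. The counts below are illustrative only, patterned on the kidney-stone example often used in textbooks; the treatment and subgroup labels are placeholders. Treatment "A" does better in each subgroup, yet looks worse when the subgroups are pooled.

```python
# Illustrative (successes, trials) counts per treatment and subgroup.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def subgroup_rates(counts):
    """Success rate of each treatment within each subgroup."""
    return {
        treatment: {group: s / n for group, (s, n) in groups.items()}
        for treatment, groups in counts.items()
    }


def overall_rates(counts):
    """Success rate of each treatment after pooling all subgroups."""
    rates = {}
    for treatment, groups in counts.items():
        successes = sum(s for s, _ in groups.values())
        trials = sum(n for _, n in groups.values())
        rates[treatment] = successes / trials
    return rates


by_group = subgroup_rates(counts)
overall = overall_rates(counts)

for group in ("small", "large"):
    print(group, {t: round(by_group[t][group], 3) for t in counts})
print("overall", {t: round(rate, 3) for t, rate in overall.items()})
```

The reversal comes from the unequal subgroup sizes: "A" was mostly tried on the harder "large" cases, which drags its pooled rate down even though it wins in each subgroup.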

Some Numbers to Represent You

September 9, 2019 / September 17, 2019 | APMA

“When you can create a model from proxies, it is far simpler for people to game it. This is because proxies are easier to manipulate than the complicated reality they represent.”

[Weapons of Math Destruction, Cathy O’Neil]

This is NOT a story about “Numerology” (though it is somewhat related). Can you believe that a few numbers could represent your whole life? In data science, we call these numbers “proxies”. To make a model simpler, we often use a few numbers in place of your actual characteristics, which leads to an efficient prediction about you and your future.

An AI model for college admission may use only SAT scores (and the number of completed APs) for screening. An AI model for personal loans may use only the five digits of your zip code. However, we should keep in mind that these proxies (numbers) cannot represent our lives correctly.

Am I the Same Person as I Was Yesterday?

September 6, 2019 / October 2, 2019 | APMA

“Mathematical models, by their nature, are based on the past, and on the assumption that patterns will repeat.”

[Weapons of Math Destruction, Cathy O’Neil]

We always promise ourselves, ‘I will not be the same person as I was yesterday’. But can we do that? It is really hard to change our daily routines, and we repeat the same mistakes over and over. Hence, we are very predictable.

Data-driven models look for patterns in the past (or the near present) to predict future behavior, under the belief that history always repeats itself. The data-driven model thus uses our repeated patterns to enhance its prediction accuracy. So, if we do not want the model to predict our future, we need to break from our routines and become outliers.

How to Make my Model Unspoiled?

September 4, 2019 / September 17, 2019 | APMA

“Without feedback, however, a statistical engine can continue spinning out faulty and damaging analysis while never learning from its mistakes.”

[Weapons of Math Destruction, Cathy O’Neil]

To deal with a spoiled child effectively, parents (or teachers) should observe the child consistently. When the child becomes rude, they should give feedback immediately and teach responsibility, appreciation, and respect for others.

Data-driven models also require the right feedback so that they are updated in the right direction. To this end, data-driven models use their mistakes to adjust themselves. However, this feedback loop commonly captures only one side of the mistakes. For example, an AI HR-screening model only has data about the poor performance of applicants it chose. It cannot have data about the successful careers of applicants it rejected.
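The one-sidedness of this feedback can be shown with a toy simulation. Everything below is hypothetical: the applicants, their screening scores, the threshold, and the "would succeed" flag, which in reality is known only for the people who were hired.

```python
# Hypothetical applicants: (screening_score, would_succeed_if_hired).
# In real life the second value is only ever observed for hired people.
applicants = [
    (0.9, True), (0.8, False), (0.7, True),
    (0.4, True), (0.3, False), (0.2, True),
]
THRESHOLD = 0.5  # assumed cut-off used by the screening model


def split_by_decision(applicants, threshold):
    """Separate applicants into those the model hires and those it rejects."""
    hired = [a for a in applicants if a[0] >= threshold]
    rejected = [a for a in applicants if a[0] < threshold]
    return hired, rejected


hired, rejected = split_by_decision(applicants, THRESHOLD)

# Mistake type 1: hired someone who performs badly. The company sees this.
visible_mistakes = sum(1 for _, succeeds in hired if not succeeds)

# Mistake type 2: rejected someone who would have done well.
# Only this simulation can count them; the real model never gets this signal.
hidden_mistakes = sum(1 for _, succeeds in rejected if succeeds)

print("mistakes the model can learn from:", visible_mistakes)
print("mistakes the model never hears about:", hidden_mistakes)
```

In this made-up case the model receives feedback on one bad hire but never learns about the two good applicants it turned away, so retraining on its own outcomes cannot correct that side of its errors.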

What Ingredients Do We Need for Yummy Data Soup?

September 2, 2019 / September 22, 2019 | APMA

“To create a model, then, we make choices about what’s important enough to include, simplifying the world into a toy version that can be easily understood.”

[Weapons of Math Destruction, Cathy O’Neil]

Imagine you are making chicken soup for dinner. What ingredients do you need for a delicious soup? Chicken, absolutely, and maybe celery, onion, and more; it depends on your mother’s recipe. Organic ingredients will be much better for your health.

Successful data-driven models commonly require enough of the important data. However, we do not know in advance which data are important, so we just put all the data into the model. Fortunately, a data-driven model may be able to identify which data are salient (via feature selection or dimensionality reduction) and make its own recipe. Also, unbiased (like organic) data will be much better for an accurate model.
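As a rough illustration of feature selection, the sketch below ranks ingredients (features) by how strongly each one correlates with the thing we want to predict. This is only one simple "filter" approach, assumed here for illustration; the feature names and numbers are made up, and real projects would more likely rely on a library than on this hand-written helper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(features, target):
    """Feature names sorted from most to least correlated with the target."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up data: three candidate "ingredients" and one target to predict.
target = [1, 2, 3, 4, 5]
features = {
    "shoe_size": [7, 9, 7, 9, 8],       # weakly related
    "income": [10, 20, 29, 41, 50],     # strongly related
    "constant": [1, 1, 1, 1, 1],        # carries no information
}

print(rank_features(features, target))
```

Note that a high correlation says nothing about whether an ingredient is fair to use; a biased feature can rank first just as easily as a harmless one.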

Make Your Crystal Ball Shine

August 30, 2019 / September 17, 2019 | APMA

“These mathematical models were opaque, their workings invisible to all but the highest priests in their domain: mathematicians and computer scientists. Their verdicts, even when wrong or harmful, were beyond dispute or appeal.”

[Weapons of Math Destruction, Cathy O’Neil]

Fortune-tellers show us our future with their crystal balls. We don’t know how the crystal ball works; we just believe its prediction (or not). In fact, the fortune-tellers don’t know either. The only thing they can do is make their crystal balls shine.

Most predictive models based on machine learning are black boxes, like crystal balls, so we cannot know what happens inside. Some people worry about this opacity, but it may also eliminate prejudice and bias. The one thing we can do is feed these models unbiased and accurate data.