November 2019 – Everybody makes DATA

Happy Thanksgiving, Happy Thanksreading

November 29, 2019 APMA

Happy Thanksgiving, Happy Thanksreading.

This year, I started reading many books on mathematics, statistics, and data science intensively, and I also started writing reviews on my blog. Reading and writing became a big turning point in my life. I hope to keep writing good reviews on this blog.
I would like to thank you for visiting my blog this year!

Trade-off: The Quality or The Quantity of Data for Better Statistics

November 27, 2019 APMA

“When we want to use the data to draw broader conclusions about what is going on around us, then the quality of the data becomes paramount, and we need to be alert to the kind of systematic biases that can jeopardize the reliability of any claims.”

[The Art of Statistics, David Spiegelhalter]

In the age of Big Data, we can collect tremendous amounts of data from many sources of differing quality (e.g. accuracy, resolution, or fidelity). Using all the data, we can easily compute statistics about what we measured. These statistical results help us understand what is going on by comparison with previous results. However, if we want a deep understanding of hidden patterns for accurate prediction (statistical inference), the quality of the data becomes the main factor: the higher the quality, the higher the accuracy. Collecting data, however, involves a general trade-off between quality and quantity. Highly accurate data require expensive acquisition (e.g. expensive measurements or fine-scale simulations that use more computing resources), while less accurate data are relatively cheap to obtain.

As the book mentions, a data-driven predictive model depends entirely on the quality (and the quantity) of the data. Ideally, we would check the accuracy (or fidelity) of the data and use only high-fidelity data to build a data-driven model for decision or prediction. Because of the aforementioned trade-off, however, we generally have a few high-fidelity data points and/or many low-fidelity ones. How, then, do we build a data-driven model? Since a few high-fidelity data points provide only partial information, it is hard to build a model that is accurate globally. Using many low-fidelity data points lets us build a global model, but that model carries an inherent systematic bias, leading to wrong predictions. Hence, many researchers in data science have focused on multi-fidelity data fusion, which lets us build an accurate global model from both high- and low-fidelity data, chasing both quality and quantity.
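To make the fusion idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration and comes from neither the book nor a particular research method: a linear "truth", 200 cheap measurements with a constant bias of +0.5, and three accurate measurements in a narrow range. The low-fidelity data give the global trend, and the few high-fidelity points are used only to estimate and remove the bias.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def truth(x):
    """Hypothetical ground truth, unknown in practice."""
    return 2.0 * x + 1.0

rng = random.Random(0)

# Many cheap measurements over the whole range: noisy AND systematically biased.
x_low = [rng.uniform(0.0, 10.0) for _ in range(200)]
y_low = [truth(x) + 0.5 + rng.gauss(0.0, 0.3) for x in x_low]

# A few expensive measurements in a narrow range: almost exact.
x_high = [4.0, 4.5, 5.0]
y_high = [truth(x) + rng.gauss(0.0, 0.02) for x in x_high]

# Step 1: global trend from the low-fidelity data only.
slope, intercept = fit_line(x_low, y_low)

def low_model(x):
    return slope * x + intercept

# Step 2: estimate the bias where high-fidelity data exist and remove it.
correction = sum(y - low_model(x) for x, y in zip(x_high, y_high)) / len(x_high)

def fused_model(x):
    return low_model(x) + correction

x_test = 10.0  # far away from the high-fidelity points
low_error = abs(low_model(x_test) - truth(x_test))
fused_error = abs(fused_model(x_test) - truth(x_test))
print(f"low-fidelity model error at x=10: {low_error:.3f}")
print(f"fused model error at x=10:        {fused_error:.3f}")
```

A constant additive correction is the simplest possible choice; it only works here because the assumed bias does not depend on x.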

Are There a Few Magic Numbers for Describing Complex Systems?

November 25, 2019 APMA

“Large collection of numerical data are routinely summarized and communicated using a few statistics of location and spread, (…), these can take us a long way in grasping an overall pattern.”

[The Art of Statistics, David Spiegelhalter]

Can we understand all the (fine-scale) patterns in a massive data set? If you were a genius, you might keep track of all of them, but for most of us it is (almost) impossible to analyze everything. That is why we use statistics to understand and analyze a large data set and to predict or estimate the future from statistical results (e.g. population, economic growth, the unemployment rate, or stock prices). For example, to build a business model aimed at kids, it is much easier to look at the average birthrate in a region than to count the children in my neighborhood. Statistical approaches provide just a few numbers to describe a complex system. This simplification lets us build a simple (predictive) model, leading to efficient and optimized analytics.

I agree that a few numbers make a complex system simple, and in my experience this simple representation points us in the right direction for better decisions. Then what is a good “number” (statistic) for the massive data in our hands? The average? Perhaps, but the book also says: “there is no substitute for simply looking at data properly.” Hence, we should be careful about understanding a complex system through only a few statistics. Some statistics, such as the average, are vulnerable to outliers. Also, we can draw a dinosaur pattern that has a given mean and variance (please see my previous post). Nowadays, data-driven approaches via statistical learning (machine learning) may provide optimal numbers to describe a complex system effectively. Still, we need to scrutinize all the statistics that data-driven models provide. Even so, I expect that in the near future a data-driven AI model may find a good reparameterization of a massive data set for better understanding.
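As a small illustration of how an outlier affects the average, here is a sketch in Python. The numbers are made up for this example: five ordinary values plus one wildly wrong entry.

```python
import statistics

values = [30, 32, 35, 38, 40]           # hypothetical, well-behaved sample
values_with_outlier = values + [1000]   # the same sample plus one wild entry

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(values_with_outlier)
median_after = statistics.median(values_with_outlier)

print(f"without outlier: mean={mean_before}, median={median_before}")
print(f"with outlier:    mean={mean_after:.1f}, median={median_after}")
```

One bad entry drags the mean from 35 to about 196, while the median only moves from 35 to 36.5.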

Framing: Statistics Can Manipulate Our Thought

November 22, 2019 APMA

“The examples in this chapter have demonstrated how the apparently simple task of calculating and communicating proportions can become a complex matter.”

[The Art of Statistics, David Spiegelhalter]

Thanks to you, the number of followers increased by 22% in November! When you see this sentence, you may think that this emerging blog is growing rapidly and that there are reasons for this success. If I wrote “4 people started to follow my blog in November”, you might have a different feeling. But both are true: my blog had 18 followers in October and now has 22. Different representations of the same statistic can change the impact of an observation; this is called positive (or negative) framing. There are many examples. Pharmaceutical companies prefer to say that a new medicine has a 95% survival rate rather than a 5% mortality rate (positive framing). Investigative journalists prefer to say that 3,000,000 people are suspected of tax evasion every year rather than 1% of people (negative framing). Framing also appears in graphs. Suppose we need to draw a bar chart with two bars whose values are 95 and 98. If the axis runs from 0 to 100, the two bars look similar. However, if the axis runs from 90 to 100, the bars look totally different.
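The arithmetic behind both framings can be checked in a few lines of Python. The follower counts and bar values are the ones from this post; the helper names are my own, chosen just for this sketch.

```python
def relative_change_percent(before, after):
    """Change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0

followers_october = 18
followers_november = 22

absolute_change = followers_november - followers_october
relative_change = relative_change_percent(followers_october, followers_november)
print(f"absolute framing: +{absolute_change} followers")
print(f"relative framing: +{relative_change:.0f}%")

def visible_height_ratio(small, large, axis_start):
    """How many times taller the larger bar LOOKS when the axis starts at axis_start."""
    return (large - axis_start) / (small - axis_start)

ratio_full_axis = visible_height_ratio(95, 98, axis_start=0)
ratio_truncated_axis = visible_height_ratio(95, 98, axis_start=90)
print(f"axis from 0:  larger bar looks {ratio_full_axis:.2f}x taller")
print(f"axis from 90: larger bar looks {ratio_truncated_axis:.2f}x taller")
```

The same four followers read as "+4" or "+22%", and the same two bars look 1.03 times or 1.6 times apart depending only on where the axis starts.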

How can we escape from this framing? Information providers should offer alternative representations of the data (different graphs, raw data, tables) so that we can get a balanced view by examining the raw data. Also, we should always be skeptical when we see data. First, we should check who published the statistics, and why; data do not lie, only presenters may lie. However, this does not mean that statistics are nothing but crafty tricks. Statistics is still a powerful way to understand, analyze, and visualize data effectively. Moreover, in the age of Big Data, statistical knowledge is fast becoming the main tool for dealing with big data correctly. That is, statistics is a double-edged sword; its power depends on us.

[Wrap up] Book Review: Hello World: Being Human in the Age of Algorithms

November 20, 2019 APMA

There’s an old African proverb that says “If you want to go quickly, go alone. If you want to go far, go together.” Big Data and artificial intelligence based on algorithms herald a new era for us. Then what is the main role of humans in the age of algorithms? How can we go far together with algorithms? The author, Hannah Fry, considers the pros and cons of the age of algorithms through various examples. She also provides a balanced view of the AI utopia and dystopia, enabling readers to think about the real future.

This book is a modern version of the fable “The lame man and the blind man.” The author emphasizes that both algorithms and humans are flawed, like the lame man and the blind man. Hence, we should go together with algorithms toward a better world and try to use their power properly. Then what can we do? The author says in the last paragraph: “By questioning their decisions; scrutinizing their motives; acknowledging our emotions; demanding to know who stands to benefit; holding them accountable for their mistakes; and refusing to become complacent.”

The following posts pair quotations from the book with my thoughts.

(1) Who Does Make It a Rule? Human? or Machine?

(2) Who Is our Future AI and What Is our Role?

(3) Digging Data in the New Wild West

(4) Can Artificial Intelligence Be the New Judge in the Future?

(5) As Algorithms Becoming Intelligent, We May Become Unintelligent

(6) Finding the Cause from the Effect in the Age of Big Data

(7) Can Algorithms Make a Way for Creativity?

Can Algorithms Make a Way for Creativity?

November 18, 2019 APMA

“Similarity works perfectly well for recommendation engines. But when you ask algorithms to create art without a pure measure for quality, that’s where things start to get interesting. Can an algorithm be creative if its only sense of art is what happened in the past?”

[Hello World: Being Human in the Age of Algorithms, Hannah Fry]

Experiments in Musical Intelligence (EMI) by David Cope produced music that was similar (but not identical) to the pieces in its music database. This was a new algorithmic way to compose pieces that keep a composer’s typical style. In October 1997, such an algorithm composed a new piece in the style of Johann Sebastian Bach. The audience could not distinguish this music from genuine Bach; the algorithm could compose a new (qualitatively) Bach masterpiece without any composing skills or inborn musical talent. Nowadays, there are more sophisticated AI models that generate (I would rather not say “compose” here) songs we will like, using famous and popular songs from the past. Can we say these algorithms are creative?

Because definitions are vague and preferences vary, it is really hard to measure popularity (and beauty) correctly. So AI models mimic the success of previous masterpieces and generate similar (a vague word, but I mean “without infringing copyrights”) music, paintings, drawings, and even novels. Many people think that this is just mimicry of previous artworks, but Pablo Picasso is often quoted as saying: “Good artists borrow; great artists steal”. Nobody makes creative things out of nothing. All artists are inspired by previous masterpieces and build their artworks on this inspiration, as the AI models do. However, in 1917 Marcel Duchamp introduced his magnum opus “Fountain”, a readymade sculpture made from a porcelain urinal. Can algorithms also make this kind of creative art? If this artwork reflected the philosophy of art of its time, can an algorithm find the current philosophy of art in Big Data and make a creative artwork?

Finding the Cause from the Effect in the Age of Big Data

November 14, 2019 APMA

“Just as it would be difficult to predict where the very next drop of water is going to fall, (…). But once the water has been spraying for a while and many drops have fallen, it’s relatively easy to observe from the pattern of the drops where the lawn sprinkler is likely to be situated.”

[Hello World: Being Human in the Age of Algorithms, Hannah Fry]

In science, inverse problems form a research field that extracts the hidden law (or mathematical formula) from observations (data). That is, an inverse problem is to find the “cause” from the “effect”. The concept is similar to profiling a serial killer in criminology: from all the data about the victims, we infer the character of the killer. We agree that more victims would make the profile more accurate, BUT we don’t want more victims. So the important part of an inverse problem is to find an appropriate formulation from a small data set. However, as the quote shows, it is really hard to estimate something accurately from small data. This issue has been a bottleneck in the development of inverse problems.
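The lawn sprinkler in the quote can be turned into a toy inverse problem. The sketch below is only an illustration under my own assumptions: drops scatter around the sprinkler with Gaussian noise (a real sprinkler has a different spray pattern), and the "inverse" step is simply averaging the drop positions.

```python
import math
import random

TRUE_SOURCE = (3.0, -2.0)  # hypothetical sprinkler position, the hidden "cause"

def simulate_drops(n, rng, spread=1.0):
    """Forward problem: generate n drop positions (the "effect") around the source."""
    return [
        (rng.gauss(TRUE_SOURCE[0], spread), rng.gauss(TRUE_SOURCE[1], spread))
        for _ in range(n)
    ]

def estimate_source(drops):
    """Inverse problem: guess the source as the average drop position."""
    n = len(drops)
    return (sum(x for x, _ in drops) / n, sum(y for _, y in drops) / n)

def estimation_error(n, rng):
    """Distance between the estimated and the true source for n observed drops."""
    est_x, est_y = estimate_source(simulate_drops(n, rng))
    return math.hypot(est_x - TRUE_SOURCE[0], est_y - TRUE_SOURCE[1])

rng = random.Random(0)
for n in (5, 50, 5000):
    print(f"{n:5d} drops -> error {estimation_error(n, rng):.3f}")
```

The error typically shrinks roughly like one over the square root of the number of drops, which is why a small data set makes the "cause" so hard to pin down.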

In the age of Big Data, on the other hand, we collect massive data sets from individuals, autonomous systems, efficient measurements, and websites, which can lead to accurate prediction of the cause. So many people think that it is easy to solve inverse problems using massive data. That is somewhat true, and there are many research achievements in data-driven modeling that find the underlying laws or governing equations (or a black-box model) describing cause and effect directly from data. However, inverse problems now struggle with another issue: finding the “right” causality. In big data, improbable things happen all the time, which may lead to wrong conclusions about the causality between input and output data. For example, a correlation between two variables may stem from mere coincidence, but an algorithm cannot distinguish this coincidence from real causality. Hence, humans must check the data-driven causality in a rigorous way. That is why fundamental mathematics and statistics are becoming important in the age of Big Data.
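A quick way to see "improbable things happen all the time" is to search many unrelated variables for the one that correlates best with a target. In the sketch below everything is random noise by construction (the sample size of 10 and the 1000 candidate variables are arbitrary choices of mine), so any strong correlation it finds is pure coincidence.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)
n_samples = 10        # few observations per variable
n_candidates = 1000   # many unrelated variables to search through

# The target and every candidate are independent noise: no causal link exists.
target = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
candidates = [
    [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    for _ in range(n_candidates)
]

correlations = [pearson(candidate, target) for candidate in candidates]
best_abs_r = max(abs(r) for r in correlations)
print(f"strongest correlation found among pure noise: |r| = {best_abs_r:.2f}")
```

An algorithm that simply picks the best-correlated variable would report a strong "relationship" here, which is exactly the kind of result a human needs to check with statistical reasoning.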

As Algorithms Becoming Intelligent, We May Become Unintelligent

November 11, 2019 (updated November 15, 2019) APMA

“There’s a hidden danger in building an automated system that can safely handle virtually every issue its designers can anticipate. (…) So they’ll have very little experience to draw on to meet the challenge of an unanticipated emergency.”

[Hello World: Being Human in the Age of Algorithms, Hannah Fry]

I was driving from my home in the U.S. to Quebec, Canada, using Google Maps, for a late summer vacation with my family. Just after crossing the border, I realized that my phone did not work, and of course Google Maps stopped working, too. I made a desperate attempt to drive with only road signs, as my dad used to do. Our world is fast becoming intelligent through recent developments in smart devices, algorithms, automated systems, and AI. We don’t need to remember our friends’ phone numbers and physical addresses anymore. Moreover, we don’t need to memorize the exact spelling of long words; Google search can show the correct results even from a misspelling. Can we say that we (not the world) are becoming intelligent?

Large autonomous systems will inevitably become widespread. For example, autonomous cars will be popular in the near future. So the next generation may not know (or experience) how to correct a slide on an icy road. This lack of experience may lead to a nasty accident when the autonomous system is not working. Technologies do more; we do less (e.g. thinking or experiencing). However, there are two sides to every story. Since the invention of the calculator (or the computer), we have developed new research fields such as numerical analysis, scientific computing, and computational biology, resulting in an enormous expansion of knowledge. I hope that the advent of large autonomous systems provides not only answers to the problems we face now but also a vision for a better future.

Can Artificial Intelligence Be the New Judge in the Future?

November 8, 2019 APMA

“The algorithm will always give exactly the same answer when presented with the same set of circumstances. (…). There is another key advantage: the algorithm also makes much better predictions.”

[Hello World: Being Human in the Age of Algorithms, Hannah Fry]

When we consider the dark side of algorithms and AI models, we always think about justice; an AI model is usually optimized for efficiency and profitability, not for justice (see also my previous post: Justice: What’s the Right Thing to Do in Data Science?). What is justice, by the way? Justice is the quality of being just or fair. Then what is “just” or “fair”? Defining “justice” or “just” is still a contested issue in our society. In a slightly different context, we may talk about “fair” instead. Fairness, in a narrow sense, requires consistency: the same input should produce the same output. For example, if you and I write down the same answer on a test, we should get the same score. That is the starting point for discussing fairness.

(AI) algorithms, which encapsulate detailed mathematical formulas, have this kind of fairness inherently, leading to consistent outcomes (same input, same output). This is a big advantage of an algorithm for judging someone’s guilt consistently. Furthermore, its predictions are much more accurate than a human’s. However, consistency may also produce consistent errors until the algorithm is adjusted. Humans, on the other hand, have their own models in their minds for judging someone’s guilt. These are not based on mathematics (or rigorous reasoning), leading to inconsistent outcomes for the same circumstances. Also, it is really hard to correct human bias and prejudice, while an algorithm’s parameters are easy to adjust. Then who is more righteous in terms of fairness?

Digging Data in the New Wild West

November 6, 2019 APMA

“We do well to remember that there’s no such thing as a free lunch. (…). Data and algorithms don’t just have the power to predict our shopping habits. They also have the power to rob someone of their freedom”

[Hello World: Being Human in the Age of Algorithms, Hannah Fry]

There are many FREE apps for tracking routines such as walking, jogging, eating, reading, shopping, or studying. Thanks to these productivity apps, we can check our daily routine and change it for better performance. By the way, how do these free apps make money? There is no free lunch in the world. They make their profit from the data you record. In the age of Big Data, data is the new gold, and many companies are now digging for that gold in our daily routines. We might say we live in the new Wild West.

Someone might think that data is just data. That is true, but an AI model can spot important (hidden) patterns in massive data effectively. It digs gold from the mine with efficient tools. Moreover, it makes precise categories for people’s behaviors, leading to accurate prediction (classification) for new customers. Hence, AI models become more sophisticated as the amount of data they collect increases. Amazon and other online retailers provide irresistible deals and coupons every day. Netflix and other streaming services recommend the movies we will like best, so we cannot help clicking on the next movie. These days, we cannot blame a shopaholic, because (internet) shopping addiction is caused not by a lack of self-control but by a sophisticated AI model. That is why I purchased more books on Amazon today (Don’t blame me!).