Make Your Crystal Ball Shine

“These mathematical models were opaque, their workings invisible to all but the highest priests in their domain: mathematicians and computer scientists. Their verdicts, even when wrong or harmful, were beyond dispute or appeal.”

[Weapons of Math Destruction, Cathy O’Neil]

Fortune-tellers show us our future in their crystal balls. We don’t know how the crystal ball works; we simply believe its prediction (or not). In fact, the fortune-tellers don’t know either. The only thing they can do is make their crystal balls shine.

Most predictive models based on machine learning are black boxes, like crystal balls, so we cannot see what happens inside. Some people worry about this opacity, but it may also eliminate prejudice and bias. The one thing we can do is feed the models unbiased and accurate data.

More Data, More Simple

“Single molecules are far too random. The balloon, along with the air it contains, follows a predictable pattern, but only when considered in aggregate.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

A dot-to-dot puzzle has scattered points, each labeled with a number, and we connect the dots in numerical order until the hidden figure appears. If the dots have no numbers, however, we may not draw the right figure. What if we have more and more dots? Then the silhouette of the figure emerges and we can draw it easily, with no numbers required.

More data likewise reveals the silhouette of the underlying pattern in Big Data. Once we have spotted the hidden pattern, we no longer need the detailed data (like the numbers in a dot-to-dot). The result can be a clear and simple model, even in a large-scale data domain intertwined with many features. Don’t worry about adding more features and data to your model. More data, more simple.

Data Avengers Save the World

“By working together, teams of scientists, humanists, and engineers can create shared resources of extraordinary power. … Big humanities is waiting to happen.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

The Avengers team up to fight the villain. Different heroes and heroines with different special abilities create great synergy and finally save the world. Faced with Big Data (which is not a villain, of course), researchers from different fields team up to find important patterns in the data. This is called (data-driven) interdisciplinary research.

This interdisciplinary research will give us better (and unbiased) solutions to the complex problems the world is facing (e.g. poverty, inequality, health, world peace …). Because these problems are too complex to understand from a single direction, we need various tools from statistics, machine learning, scientific theory, social science, psychology, and economics.

Does Big Data Steal Your Soul?

“As we just saw, having just a single picture of someone can give you a form of power over that person. Will big data steal your soul outright?”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

When photography was first unveiled, some people believed that the scary box captured their spirit and that they would have to live the rest of their lives without a soul. That seems quite silly nowadays, but we now worry about our privacy instead of our souls.

Today, big social media companies collect data from every single person. Even when they try to anonymize the data by replacing user identifiers with random numbers, Big Data is always exposed to the risk of privacy violations. Hence, data scientists should take great care to protect people’s privacy when building a data-driven model.
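To make the point concrete, here is a minimal sketch of replacing user identifiers with random tokens, and of why that alone may not be enough. The field names (user_id, zip, birth_year), the sample records, and both helper functions are hypothetical, invented purely for illustration. The idea is that the remaining attributes can still single a person out even after the identifier is gone.

```python
import secrets
from collections import defaultdict


def pseudonymize(records, id_field="user_id"):
    """Replace each distinct identifier with a random token.

    The same original identifier always maps to the same token,
    so records of one person stay linked to each other.
    """
    mapping = {}
    result = []
    for record in records:
        original = record[id_field]
        if original not in mapping:
            mapping[original] = secrets.token_hex(8)
        result.append({**record, id_field: mapping[original]})
    return result


def unique_combinations(records, quasi_identifiers, id_field="user_id"):
    """Return attribute combinations that belong to exactly one pseudonym."""
    groups = defaultdict(set)
    for record in records:
        key = tuple(record[field] for field in quasi_identifiers)
        groups[key].add(record[id_field])
    return [key for key, ids in groups.items() if len(ids) == 1]


# Hypothetical records: the names are gone after pseudonymization,
# but zip code and birth year are left untouched.
records = [
    {"user_id": "alice", "zip": "02139", "birth_year": 1990},
    {"user_id": "bob", "zip": "02139", "birth_year": 1985},
    {"user_id": "carol", "zip": "02139", "birth_year": 1985},
    {"user_id": "alice", "zip": "02139", "birth_year": 1990},
]

anonymized = pseudonymize(records)
at_risk = unique_combinations(anonymized, ["zip", "birth_year"])
print(at_risk)
```

In this toy data set, only one person has the combination of that zip code and the birth year 1990, so anyone who already knows those two facts about her can pick out her records despite the random token.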

Finding the Signal in the Noise

“Big news travels fast – but big ideas don’t.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

Breaking news spreads rapidly (and widely) through the community via the internet and social media, so you know immediately what happens on the other side of the world. Ideas, however, move more slowly than news, and may even need incubation time before people realize their usefulness and effectiveness.

If we don’t have a creative idea of our own, we should be a first follower: we need to discover the hidden idea quickly and take advantage of it. In Big Data, the idea usually hides behind noise (insignificant data). Data scientists try to find the signal in the noise with cutting-edge machine learning tools.
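As a toy illustration of pulling a signal out of noise, the sketch below buries a smooth wave under random noise and then recovers most of it with a moving average. A moving average is a deliberately simple stand-in here, not one of the cutting-edge tools mentioned above. The wave, the noise level, the window size, and the function names are all assumptions chosen for the example.

```python
import math
import random


def moving_average(values, window=9):
    """Smooth a sequence by averaging each point with its neighbors.

    Near the edges the window is truncated, so the output has the
    same length as the input.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def mean_squared_error(a, b):
    """Average squared difference between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "idea": a slow sine wave. Noise: Gaussian jitter on top.
signal = [math.sin(2 * math.pi * t / 100) for t in range(500)]
noisy = [s + rng.gauss(0, 0.5) for s in signal]
smoothed = moving_average(noisy)

print("error before smoothing:", mean_squared_error(noisy, signal))
print("error after smoothing: ", mean_squared_error(smoothed, signal))
```

Averaging works in this setting because the noise changes from point to point while the signal changes slowly, so neighboring noise values largely cancel and the wave remains.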

Thinking about Hidden Figures in Data Science

“The millions of voices reflected in books tell a long and fascinating story about our culture and our history. But not everyone’s voice is recorded on our bookshelves. And sometimes the silence of the missing voices can drown out everything else.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

The movie “Hidden Figures” shows that our history has not properly included its hidden figures and unsung heroes. This intentional (or sometimes inadvertent) omission gives us only partial knowledge of our world, leading to misconceptions and bias.

Data scientists should take this sparsity structure of data, the so-called “unknown unknowns,” into account to understand the whole picture. For example, we cannot fully understand spatiotemporal human behavior in a city from smartphone data alone, because people who don’t have a smartphone are excluded from the data.

Make a Model like a Police Sketch in Big Data

“Their measurements of the performance of airfoils in a wind tunnel were a simplification, an imperfect simulacrum of the actual performance of an actual wing on an actual plane in actual flight. But they reasoned, data is better than no data.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

A police sketch artist draws a rough sketch of the suspect based on one or more eyewitnesses’ memories of a face. Even though it carries less information than a photograph or video, it still helps to find the suspect. Yes, data is better than no data.

In data science, we sometimes build a simple model to describe the behavior of data and use it to predict future behavior. A simple model cannot make perfect predictions, but it still provides qualitatively reliable ones, giving us a sketch-level understanding of Big Data.
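One way to picture such a sketch-like model is a straight line fitted to data that actually curve. The example below is a minimal sketch using ordinary least squares. The data (a gently curving series) and the function names fit_line and predict are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: the true relationship bends upward (y = x + 0.1 * x^2),
# so a straight line can only be a rough sketch of it.
xs = list(range(10))
ys = [x + 0.1 * x * x for x in xs]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Predict y for a new x with the fitted line."""
    return slope * x + intercept


print("slope:", slope, "intercept:", intercept)
print("prediction at x = 12:", predict(12), "true value:", 12 + 0.1 * 12 * 12)
```

The line misses the curvature, so its numbers are off, especially outside the observed range. It still gets the qualitative story right: the quantity rises as x grows, which is the kind of sketchy but useful knowledge a police sketch provides.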