More Data, More Simple

“Single molecules are far too random. The balloon, along with the air it contains, follows a predictable pattern, but only when considered in aggregate.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

A dot-to-dot puzzle has scattered, numbered points, and we connect the dots in numerical order to reveal the hidden figure. Without the numbers, we may not draw the right figure. But what if we keep adding more and more dots? Eventually the silhouette of the figure emerges and we can draw it easily, with no numbers required.

More data likewise reveals the silhouette of the pattern underlying Big Data. Once we have spotted the hidden pattern, we no longer need the detailed data (like the numbers in a dot-to-dot). Even in a large-scale data domain intertwined with many features, the result can be a clear and simple model. Don’t worry about adding more features and data to your model. More data, more simple.

Data Avengers Save the World

“By working together, teams of scientists, humanists, and engineers can create shared resources of extraordinary power. … Big humanities is waiting to happen.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

The Avengers team up to fight the villain. Heroes and heroines with different special abilities create great synergy and finally save the world. Facing Big Data (which is not a villain, of course), researchers from different fields team up to find important patterns in the data. This is called (data-driven) interdisciplinary research.

This interdisciplinary research can give us better (and unbiased) solutions to the complex problems the world is facing (e.g. poverty, inequality, health, world peace). Because these problems are too complex to understand from a single perspective, we need a variety of tools from statistics, machine learning, scientific theory, social science, psychology, and economics.

Does Big Data Steal Your Soul?

“As we just saw, having just a single picture of someone can give you a form of power over that person. Will big data steal your soul outright?”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

When photography was first unveiled, some people believed that this scary box captured their spirit and that they would have to live the rest of their lives without a soul. That seems quite silly nowadays, but today we worry about our privacy instead of our souls.

Today, big social media companies collect data on every single person. Even when they try to anonymize the data by replacing user identifiers with random numbers, Big Data is always exposed to the risk of privacy violations. Hence, data scientists should take great care to protect people’s privacy when building data-driven models.
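To make the risk concrete, here is a minimal sketch of replacing user identifiers with random tokens. The records, the field names (user_id, zip, birth_year), and both helper functions are made up for illustration. It only hints at why random identifiers alone may not be enough: a combination of ordinary attributes can still single a person out.

```python
import secrets
from collections import Counter

def pseudonymize(records):
    """Replace each user_id with a random token (same user, same token)."""
    mapping = {}
    out = []
    for rec in records:
        uid = rec["user_id"]
        if uid not in mapping:
            mapping[uid] = secrets.token_hex(8)
        out.append({**rec, "user_id": mapping[uid]})
    return out

def count_unique_rows(records, keys):
    """Count records whose combination of the given attributes is unique."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)

# Hypothetical records, invented for this example only.
records = [
    {"user_id": "alice", "zip": "02138", "birth_year": 1980},
    {"user_id": "bob", "zip": "02138", "birth_year": 1980},
    {"user_id": "carol", "zip": "02139", "birth_year": 1975},
]

anonymized = pseudonymize(records)
exposed = count_unique_rows(anonymized, ["zip", "birth_year"])
print(f"{exposed} of {len(anonymized)} records are still unique by zip + birth year")
```

In this toy data the names are gone, yet one record remains the only one with its zip code and birth year, so anyone who knows those two facts could pick it out.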

Finding the Signal in the Noise

“Big news travels fast – but big ideas don’t.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

Breaking news spreads rapidly (and widely) through the community via the internet and social media, so you know immediately what is happening on the other side of the world. Ideas, however, move more slowly than news, and may even need incubation time before people realize their usefulness and effectiveness.

If we don’t have a creative idea ourselves, we should be a first follower. Hence, we need to discover the hidden idea quickly and take advantage of it. In Big Data, the idea is usually hiding behind noise (insignificant data). Data scientists try to find the signal in the noise with cutting-edge machine learning tools.
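As a toy illustration of pulling a signal out of noise, the sketch below uses a plain moving average. This is a deliberately simple stand-in, not one of the cutting-edge machine learning tools mentioned above. The sine-wave "idea", the noise level, and the window size are all assumptions chosen for the example.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "idea": a slow sine wave. Hypothetical "noise": Gaussian jitter.
signal = [math.sin(2 * math.pi * t / 100) for t in range(500)]
noisy = [s + random.gauss(0, 0.5) for s in signal]

def moving_average(values, window=11):
    """Average each point with its neighbours (the window shrinks at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

smoothed = moving_average(noisy)
mse_raw = mse(noisy, signal)
mse_smooth = mse(smoothed, signal)
print(f"error before smoothing: {mse_raw:.3f}")
print(f"error after smoothing:  {mse_smooth:.3f}")
```

Averaging neighbouring points cancels much of the random jitter while keeping the slow trend, so the smoothed series should sit closer to the underlying signal than the raw one.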

Thinking about Hidden Figures in Data Science

“The millions of voices reflected in books tell a long and fascinating story about our culture and our history. But not everyone’s voice is recorded on our bookshelves. And sometimes the silence of the missing voices can drown out everything else.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

The movie “Hidden Figures” shows that our history has not properly included its hidden figures, or unsung heroes. This intentional (or sometimes inadvertent) omission gives us only partial knowledge of our world, leading to misconceptions or bias.

Data scientists should consider this sparsity structure of data, called “unknown unknowns,” to understand the whole picture of the data. For example, we cannot fully understand spatiotemporal human behavior in a city from smartphone data alone, because people who don’t have a smartphone are excluded from the data.
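The smartphone example can be sketched with a few made-up numbers. The population below, with its daily trip counts and smartphone flags, is entirely hypothetical. It only shows how an average computed from the people we can observe may drift away from the average over everyone.

```python
from statistics import mean

# Hypothetical population: (daily_trips, has_smartphone). Invented numbers.
population = [
    (4, True), (5, True), (3, True), (4, True), (5, True),
    (1, False), (2, False), (1, False),
]

# What a smartphone-based dataset would contain.
observed = [trips for trips, has_phone in population if has_phone]
# What we would see if nobody were missing.
everyone = [trips for trips, _ in population]

print(f"mean trips, smartphone owners only: {mean(observed):.2f}")
print(f"mean trips, whole population:       {mean(everyone):.2f}")
```

In this toy case the people without smartphones travel less, so the smartphone-only average overstates how much the city moves. The missing voices change the answer.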

Make a Model like a Police Sketch in Big Data

“Their measurements of the performance of airfoils in a wind tunnel were a simplification, an imperfect simulacrum of the actual performance of an actual wing on an actual plane in actual flight. But they reasoned, data is better than no data.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

A police sketch artist draws a rough sketch of a suspect based on one or more eyewitnesses’ memories of a face. Even though it carries less information than a photograph or video, it still helps to find the suspect. Yes, data is better than no data.

In data science, we sometimes build a simple model to describe the behavior of the data and use it to predict future behavior. A simple model cannot make perfect predictions, but it still provides qualitatively reliable ones, giving us a sketch-level knowledge of Big Data.
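A straight-line fit is about the simplest "police sketch" of a dataset. The sketch below fits one by ordinary least squares to made-up data that follows a linear trend with a small zigzag on top. The data, the helper names fit_line and predict, and the choice of a linear model are all assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: a trend of 2x + 1 with a zigzag the line cannot capture.
xs = list(range(10))
ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs]

slope, intercept = fit_line(xs, ys)

def predict(x):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction at x = 12: {predict(12):.2f}")
```

The line misses the zigzag entirely, so no single prediction is exact, but it recovers the overall trend well enough to say roughly where the next points will fall.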

Plato’s Cave in Big Data

“Though shadowing is more art than science, it’s crucial to making progress when working on big data. … But if you choose exactly the right angle, it’s possible to obscure the legally and ethically sensitive parts of the original dataset while retaining much of its extraordinary power.”

[Uncharted, Erez Aiden & Jean-Baptiste Michel]

In Plato’s cave, the prisoners cannot see the original things, only their shadows on the wall. Hence, the prisoners take these shadows for reality. Depending on the location of the fire, the prisoners may see shadows that are close to reality or a distorted image.

In Big Data, we sometimes cannot access the original (raw) data because of ethical and legal issues. Hence, data scientists must always consider where to put the light to cast an “appropriate” shadow. These data shadows may yield many discoveries about our world without invading anyone’s privacy.