Small Data Like a Diamond

“In fact, it was just as I was learning that many Big Data operations use small data to fill in the holes that I showed up in Ocala, Florida, to meet Jeff Seder.”

[Everybody Lies, Seth Stephens-Davidowitz]

In the Big Data era, many people think that larger data sets reveal hidden patterns accurately, leading to reliable (or even perfect) predictions. However, Big Data alone still leaves plenty of gaps in our understanding of complex patterns.

Hence, to fill these gaps, we still need small data, such as small surveys and interviews. These data add value to Big Data. Small data is like a diamond in a piece of jewelry: it is much smaller than the gold and silver, but the combination of diamond, gold, and silver is what makes the jewelry beautiful and makes it shine. That is, we human beings still have an important role in a data-driven, AI world.

Meet a Doppelgänger in Big Data

“But what else can these searches reveal? For one thing, doppelgänger searches have been used by many of the biggest internet companies to dramatically improve their offerings and user experience.”

[Everybody Lies, Seth Stephens-Davidowitz]

There is an urban legend about the doppelgänger, a biologically unrelated look-alike or double of a living person: if you meet your doppelgänger, you will die. We know this is fiction, but nobody wants to meet his or her doppelgänger.

In data science, however, meeting a doppelgänger helps us understand and predict. With Big Data, we assume that similar inputs, like doppelgängers, have similar outputs. Hence, if we can find doppelgängers of a target, we can predict its output effectively (e.g., by averaging the outputs of the doppelgängers, as in k-nearest neighbors, or kNN). For example, Amazon and Netflix figure out what you might like from your doppelgängers in their databases.
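The doppelgänger idea can be sketched in a few lines of Python. This is only a minimal illustration, not how Amazon or Netflix actually build their recommendations: the function name `knn_predict`, the two-number "taste" vectors, and the ratings are all made up for the example, and Euclidean distance is just one possible way to measure similarity.

```python
import math

def knn_predict(target, examples, k=3):
    """Predict an output for `target` by averaging the outputs of its
    k nearest neighbors (its "doppelgängers") in `examples`.

    `examples` is a list of (features, output) pairs; `features` and
    `target` are equal-length tuples of numbers.
    """
    if k < 1 or not examples:
        raise ValueError("need k >= 1 and at least one example")
    # Rank every example by Euclidean distance to the target.
    ranked = sorted(examples, key=lambda ex: math.dist(ex[0], target))
    nearest = ranked[:k]
    # Average the outputs of the closest matches.
    return sum(output for _, output in nearest) / len(nearest)

# Hypothetical data: (taste vector, rating given to some movie).
users = [
    ((1.0, 1.0), 4.0),
    ((1.2, 0.9), 5.0),
    ((0.9, 1.1), 3.0),
    ((8.0, 9.0), 1.0),
    ((9.0, 8.5), 2.0),
]

# A new user whose tastes resemble the first three users.
prediction = knn_predict((1.0, 1.0), users, k=3)
print(prediction)
```

Here the three closest users are the ones with similar taste vectors, so the prediction is the average of their ratings and the two very different users are ignored.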

Juxtaposition of Data

“This is the third power of Big Data: Big Data allows us to meaningfully zoom in on small segments of a dataset to gain new insights on who we are.”

[Everybody Lies, Seth Stephens-Davidowitz]

The artworks of Georges-Pierre Seurat and modern color printers use color juxtaposition to render mixed colors effectively without loss of brightness. For example, a pattern of intertwined red and blue dots produces purple. Depending on the size of the color dots (or the resolution of the printer), the result can look more vivid and realistic.

In data science, small data sets or few observations (like a low-resolution printer) provide only “purple” data, which loses the brightness. Big Data, however, like an enormous number of color dots, enables us to see the small but important patterns of “red” and “blue”, leading to a better and deeper understanding of our world.

The Blind Man and the Lame Man in Data Science

“But combining both the flawed government data with the imperfect night light data gives a better estimate than either source alone provide.”

[Everybody Lies, Seth Stephens-Davidowitz]

In the well-known fable “The Blind Man and the Lame Man”, the two can cross the bridge together: ‘A blind man carried a lame man on his back, lending him his feet and borrowing from him his eyes’. This story tells us how collaboration overcomes difficulties.

In the Big Data era, the same kind of collaboration is needed to analyze Big Data and extract useful information from it. Generally, highly accurate data takes a long time and is costly to obtain, while less accurate data is cheap and fast to access. We therefore need to blend these two kinds of data (a small amount of highly accurate data and a large amount of less accurate data) to analyze them better, leading to more accurate forecasts and predictions.
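One simple way to blend two such sources is inverse-variance weighting, sketched below in Python. This is not the method used in the book; it is a textbook technique swapped in for illustration, and it assumes the sources are independent, unbiased, and come with known error variances. The function name `blend_estimates` and all the numbers are hypothetical.

```python
def blend_estimates(estimates):
    """Combine independent estimates of the same quantity.

    `estimates` is a list of (value, variance) pairs. Each value is
    weighted by 1 / variance, so more accurate sources count for more.
    Returns (blended_value, blended_variance).
    """
    weights = [1.0 / variance for _, variance in estimates]
    total_weight = sum(weights)
    blended_value = sum(
        w * value for w, (value, _) in zip(weights, estimates)
    ) / total_weight
    # Under the independence assumption, the blended variance is
    # smaller than the variance of any single source.
    blended_variance = 1.0 / total_weight
    return blended_value, blended_variance

# Hypothetical numbers: an accurate but costly source and a cheap, noisy one.
accurate = (10.0, 1.0)   # (value, variance)
noisy = (14.0, 4.0)

blended_value, blended_variance = blend_estimates([accurate, noisy])
print(blended_value, blended_variance)
```

The blended value sits closer to the accurate source, and its variance is lower than that of either input, which is the sense in which the two sources help each other.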

Beyond Pride and Prejudice

“First, and perhaps most important, if you are going to try to use new data to revolutionize a field, it is best to go into a field where old methods are lousy.”

[Everybody Lies, Seth Stephens-Davidowitz]

Our decisions are often based on prejudice, which leads to bad outcomes. Specifically, our prejudice stems from wrong or weak cause-and-effect reasoning that depends entirely on our limited experience, pseudoscience, or popular misconceptions.

Big Data helps us escape from bias, reveals the right cause-and-effect relationships, and finally suggests the optimal choice. For example (in this book), pedigree has commonly been regarded as the primary factor in choosing a racehorse, but it is NOT. Big Data shows that the size of the heart (specifically the left ventricle) is a much more important factor. But we should keep in mind that data science and Big Data are not always perfect. Biased and incomplete data can produce prejudice and misconceptions of their own, this time data-driven.

David Hume as a Data Scientist

“Hume believed that we can’t be absolutely certain about anything that is based only on traditional beliefs, testimony, habitual relationships, or cause and effect. In short, we can rely only on what we learn from experience.”

[The Theory That Would Not Die, McGrayne, Sharon B.]

Empiricism, as put forth by David Hume, claims that observation and investigation are the correct way to extend our cognitive capacities. However, individual experience is often incomplete and biased because it is limited. Hence, in practice, it is doubtful that we can gain certainty based only on personal experience, observation, and investigation.

Nowadays, in the Big Data era, we have enormous amounts of data collected from people all over the world. Integrated data can provide unbiased and common observations about human nature. That is, it is time to recall Hume’s empiricism. In fact, the fundamental philosophy of machine learning is based entirely on Hume’s empiricism, and this could bring us closer to “Truth”. If Hume were still alive, he would be a Googler.