“Large collection of numerical data are routinely summarized and communicated using a few statistics of location and spread, (…), these can take us a long way in grasping an overall pattern.”
[The Art of Statistics, David Spiegelhalter]
Can we understand all (fine-scale) patterns from a massive data set? If you were a genius, you may keep track of all the patterns. But, it is (almost) impossible to analyze all. That’s is why we employ statistics to understand and analyze a large data set and predict/estimate the future from statistical results (e.g. population, economic growth, the unemployment rate, or stock price). For example, to make a business model for kids, it is much easier to see the average birthrate in some regions rather than count the number of children in my neighborhood. Statistical approaches always provide just a few numbers to describe the complex systems. This simplification enables us to make a simple (predictive) model, leading to an efficient and optimized analytics.
I agree that a few numbers make the complex system simple and I have experienced that this simple representation gives us the proper direction to make a better decision. Then, what is the good “number (statistic)” for massive data in our hands? The average? well, but the book also said: “there is no substitute for simply looking at data properly.” Hence, we should be careful to understand the complex system using only a few statistics. Some statistics are venerable to outliers such as average. Also, we can draw the dinosaur patterns using the given mean and variance (please see my previous post). Nowadays, data-driven approaches via statistical learning (machine learning) may provide optimal numbers to describe the complex system effectively. Yet, we need to scrutinize all the statistics the data-driven models provide. However, I do expect that a data-driven AI model may find a good reparameterization of the massive data set for a better understanding in the near future.