By: Sastry G. Pantula, Dean, College of Science
Articles about Big Data are as ubiquitous as Big Data themselves are. However, as 2015 gets underway, I encourage you to consider reading the following two recent articles.
The first is Tim Harford’s “Big data: A big mistake?” in the December issue of Significance. The second is “The Enormous implications of Facebook indexing 1 trillion of our posts,” which appeared December 28 on TechCrunch.com.
Harford makes an excellent point about the importance of using fundamental statistical principles in drawing conclusions and making proper inferences. Large sets of “found data” are not a solution to everything; worse, they can lead to biased data, spurious correlations and false discoveries.
To quote Harford:
Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends:
- That data analysis produces uncannily accurate results;
- That every single data point can be captured, making old statistical sampling techniques obsolete;
- That it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and
- That scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves.”
Unfortunately, these four articles of faith are at best optimistic oversimplifications.
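The danger of letting correlation “speak for itself” can be seen in a small simulation. The sketch below is only an illustration, not taken from Harford’s article: the function names, sample sizes and seed are assumptions chosen for the example. It draws a target made of pure noise, compares it with 1,000 equally random and unrelated variables, and reports the strongest correlation it finds.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strongest_spurious_correlation(n_obs=50, n_predictors=1000, seed=0):
    """Largest |correlation| between a noise target and unrelated noise predictors.

    All values are hypothetical: every series is independent random noise,
    so any correlation found here is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(predictor, target)))
    return best


if __name__ == "__main__":
    r = strongest_spurious_correlation()
    print(f"Strongest correlation among 1000 unrelated variables: {r:.2f}")
```

Searching through enough candidates turns up a correlation that looks meaningful even though none of the variables is related to the target. This is why a model and a plan for handling multiple comparisons still matter.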
Also, as with the Census, there is a fascination with “N=All”, a misconception that we are observing everything on everyone, without realizing what we are missing or that we are undercounting certain groups. This article not only gives excellent examples of the traps and pitfalls of blind use of Big Data, but also illustrates what statisticians have spent the past couple of centuries figuring out about the proper and efficient use of appropriate data. This is exactly why, as the College of Science develops a new master’s program in Data Analytics, we are basing it on solid statistical principles and then bringing in collaborations with the statistical, mathematical and computational sciences.
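The “N=All” trap can also be made concrete with a toy simulation. The sketch below rests entirely on made-up assumptions: a population of 100,000 in which 30 percent belong to a group with a lower average value, a “found data” source that captures 90 percent of the majority group but only 10 percent of the undercounted group, and a simple random sample of just 1,000 people. The function name and all numbers are hypothetical.

```python
import random
from statistics import fmean


def compare_found_data_with_random_sample(seed=1):
    """Compare a large but biased 'found' sample with a small random sample.

    Returns (true_mean, found_mean, found_size, random_mean, random_size).
    Every parameter below is an illustrative assumption.
    """
    rng = random.Random(seed)
    population = []  # list of (is_undercounted_group, value)
    for _ in range(100_000):
        undercounted = rng.random() < 0.30
        value = rng.gauss(30, 10) if undercounted else rng.gauss(50, 10)
        population.append((undercounted, value))

    true_mean = fmean(v for _, v in population)

    # "Found data": most of the majority group, few of the undercounted group.
    found = [v for g, v in population if rng.random() < (0.10 if g else 0.90)]

    # Small simple random sample: every person equally likely to be chosen.
    sample = [v for _, v in rng.sample(population, 1000)]

    return true_mean, fmean(found), len(found), fmean(sample), len(sample)


if __name__ == "__main__":
    true_mean, found_mean, n_found, sample_mean, n_sample = (
        compare_found_data_with_random_sample()
    )
    print(f"True population mean:            {true_mean:.1f}")
    print(f"Found data   (n={n_found}): {found_mean:.1f}")
    print(f"Random sample (n={n_sample}):  {sample_mean:.1f}")
```

Under these assumptions the found data set is dozens of times larger than the random sample, yet its average lands farther from the truth, because its size does nothing to correct for who is missing.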
I mention the second article only to indicate how some of our data are becoming permanent records of our lives. There are certainly many privacy, ethical and moral issues related to how our own data could be used against us. Bits of biased information—as with the misconception of “N=All” and without proper context—can be extremely harmful and have a significant negative impact on people’s lives. These biased bits should be labeled: “Handle with care!”
Fast-forwarding from President Kennedy’s inaugural address of January 1961, the question today is: “Ask not what Big Data can do for you, but what good you can do for society with Big Data.”
With New Year’s Eve a fading memory, don’t be intoxicated by the charm of Big Data. Instead, leave the keys in the hands of a good statistician to derive proper conclusions.