I’ve worked with medium, large, and occasionally big data. What are big data? One line of demarcation would be data sets so large that they cannot be processed in reasonable time with conventional hardware and software. Data sets like that force the analyst to take non-traditional approaches to process them.
But most of the hype about big data isn’t about such analytic distinctions. The excitement over big data is that we now have the ability to gather truly large data sets on phenomena that previously were measurable only in small samples. It used to be that you had to survey people to get their opinions about X, Y, or Z; now we can gather vast amounts of data on people’s opinions by monitoring the twitterverse or blogosphere.
Gathering large data where previously we had only small data is what excited people about Google’s flu prediction system, which was based on the search terms people were using. A recent paper, however, shows convincingly that Google’s flu predictions seriously overestimated the actual incidence of flu as tracked by records of physician office visits for influenza-like illness, which is the exact measure Google was trying to predict from search terms.
This commentary by Lazer et al. in the journal Science includes many important reminders that using big data for statistics doesn’t much change the rules of good statistical practice and inference. I often tell my colleagues the same thing, and I consider myself one of the relatively few people in a position to judge this personally, since I have conducted studies on data sets as small as six capuchin monkeys and as large as millions of lines of healthcare claims.
Sometimes big data are a real advantage. In particular:
1) Large datasets make testing model fit easier,
2) You have more data to burn through corrections for multiple testing if you are searching for a model with little theory to guide you (a concrete sketch of this follows the list),
3) Big data are useful when you are predicting something very, very rare, like most cancers or terrorism.
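To make the multiple-testing point in 2) concrete, here is a minimal sketch in Python. The simulation setup, sample sizes, and variable names are all my own illustrative assumptions, not anything from the flu work: a Bonferroni correction shrinks the per-test significance threshold as the number of candidate predictors grows, and a large sample is what keeps a real but weak effect detectable against that shrunken threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 10_000        # observations: a large (if not huge) sample
m = 500           # candidate predictors screened with little theory
alpha = 0.05
alpha_bonf = alpha / m   # Bonferroni-corrected per-test threshold

# One predictor carries a weak real effect; the other 499 are pure noise.
X = rng.standard_normal((n, m))
y = 0.05 * X[:, 0] + rng.standard_normal(n)

# Screen each predictor with a simple correlation test.
p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(m)])

print(f"corrected threshold: {alpha_bonf:.0e}")
print("predictors passing it:", np.flatnonzero(p_values < alpha_bonf))
# With n = 10,000 the weak true effect (column 0) typically survives the
# corrected threshold; with n in the hundreds it almost never would.
```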
The particular usefulness of big data pretty much stops there, though. Big data don’t much help you discover important and general features of social behavior, for example. Folks, if you can’t find an effect of one variable on another in several hundred or a thousand data points, then it probably isn’t an important or general effect.
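As a rough, hedged check on that claim, here is a quick power calculation. The setup is my assumption, not anything from the Lazer et al. paper: a two-sided, two-sample comparison of a “small” standardized effect (Cohen’s d = 0.2) at alpha = 0.05, using the usual normal approximation.

```python
from math import sqrt
from scipy.stats import norm

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal
    approximation) for a standardized effect size d."""
    z_crit = norm.ppf(1 - alpha / 2)
    noncentrality = d * sqrt(n_per_group / 2)
    return norm.sf(z_crit - noncentrality) + norm.cdf(-z_crit - noncentrality)

# A "small" effect (d = 0.2) with 500 observations per group,
# i.e. about a thousand data points in total:
print(f"power: {two_sample_power(0.2, 500):.2f}")  # roughly 0.88
```

So if an effect of at least that size were real and general, a sample of a thousand would usually reveal it; failing to find it in that much data is itself informative.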
There is something more going on with the Google flu flub, however, than just missteps in statistical best practice. What the Googlers, Twitterers, and other such conspecifics are used to measuring (and they are great at it) are systems where the outcome of interest is more or less the thing being measured itself.
If you want to know what is trending about your company on Twitter, then that is by definition what people are tweeting about. If you want to direct people to the most popular pages for a given search term (something Google excels at), then by definition you want to measure where people click after using that same search term.
But doing those things is doing math, not doing statistics. Statistics is an offshoot of mathematics specific to the task of making predictions about things you haven’t measured from observations of other things you did measure. This principle applies even to basic statistical inference problems such as finding the average height of a population, which doesn’t proceed by measuring everyone in the population. If you just measure everyone, then you have counted; you have used some math by calculating the average, but you haven’t done any statistics. Statistically inferring the mean height of a population would proceed by measuring only a smaller sample of that population, and from that deriving an inference of what the average of the whole population likely is, along with some measure of your confidence in that prediction. If you have measured everyone, then you have an observation of the population average, not a prediction of it, and you don’t need a measure of confidence.
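To put that distinction in code, here is a minimal sketch in Python (the population and sample are simulated, and all the numbers are illustrative assumptions of mine) contrasting the census computation, which is just arithmetic, with the statistical inference, which predicts the unmeasured population average from a sample and attaches a confidence statement.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A simulated population of heights in cm (illustrative numbers only).
population = rng.normal(loc=170, scale=10, size=1_000_000)

# Math, not statistics: measure everyone and compute the average.
# This is an observation of the population mean, with no uncertainty.
census_mean = population.mean()

# Statistics: measure a small sample and infer the population mean,
# with a 95% confidence interval expressing uncertainty in the inference.
sample = rng.choice(population, size=100, replace=False)
est = sample.mean()
sem = stats.sem(sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=est, scale=sem)

print(f"census (math):     {census_mean:.2f} cm")
print(f"inference (stats): {est:.2f} cm, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```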
That’s why estimating the incidence of flu from search terms is not like Google’s bread-and-butter work. The incidence of flu is an out-of-sample prediction from search terms because the flu’s incidence is not itself determined just by what people think, and therefore search, about it. Surely flu incidence is partly a social construct; a group struck by apprehension over the flu, for example, will avoid social contact and thereby slow the flu’s spread. But the flu also has causes outside our thinking and searching the internet about it. Ambient temperature and humidity affect flu transmission, and mutation of the genetic material of the flu virus is a function of properties mostly not constructed by our own sociality.
My contention is that the day-to-day work of Google and many other tech companies that use big data is not out-of-sample prediction in this way. Instead, they are able to directly measure the thing of interest to their advertisers, because the latter are intrinsically interested in the behavior of people within the self-creating system that is the internet.
I think when Google set out to predict the incidence of flu from search terms, they may not have realized they were stepping outside the realm of measuring a self-creating system (like internet searches) and into the realm of predicting unobserved phenomena from measurements of other phenomena.
This realm is that of statistics, and it is well trod by practicing scientists from many fields. Yes, these travelers of the statistical realm usually have used small data. Some of them travel accompanied by a cartload of models and information-theoretic Bayesian livestock. Others are more modest practitioners but highly adept with a particular tried and true beast of burden, such as linear regression or K-means clustering.
Regardless, Google and others may do well to consult some of these conventional statistical vagabonds the next time they venture into analytics that are truly about predicting things not measured. Knowing the paths through a landscape can be even more important if you are carrying big data with you, which can make the effects of navigational errors all the larger.