
Big data enthusiasts discover statistics is about what you didn’t measure

6/4/2014

I’ve worked with medium, large, and occasionally big data. What are big data? One line to demarcate them would be data sets so large that they cannot be processed in reasonable time by any computing hardware. Data sets like that force the analyst to take non-traditional approaches to process them.

But most of the hype about big data isn’t about such analytic distinctions. The excitement over big data is that we now have the ability to gather truly large datasets on phenomena that previously were measurable only in small datasets. So, it used to be that you had to survey people to get their opinions about X, Y, or Z, but now we can gather vast amounts of data on people’s opinions from monitoring the twitterverse or blogosphere.

Gathering large data where previously we only had small data was what excited people about Google’s flu prediction system, which was based on the search terms people were using. A recent paper, however, shows convincingly that Google’s flu predictions seriously overestimated the actual incidence of flu as tracked by records of physician office visits for influenza-like illness, which is the exact measure Google was trying to predict from search terms.

This commentary by Lazer et al. in the journal Science includes many important reminders of how using big data for statistics doesn’t much change the rules of good statistical practice and inference. I often tell my colleagues the same thing. I consider myself one of the relatively few people in a personal position to judge this, since I have conducted studies on data sets as small as six capuchin monkeys and as large as millions of lines of healthcare claims.

Sometimes big data are a real advantage. In particular:

1) Large datasets make testing model fit easier,

2) You have more data to burn through corrections for multiple testing if you are searching for a model with little theory to guide you,

3) Big data are useful when you are predicting something very, very rare, like most cancers or terrorism.

Any particular usefulness to big data pretty much stops there though. Big data don’t much help you to discover important and general features of social behavior, for example. Folks, if you can’t find an effect of one variable on another in several hundred or a thousand data points – then probably it isn’t an important or general effect.

There is something more going on with the Google flu flub, however, than just missteps in statistical best practice. What the Googlers, Twitterers, and other such conspecifics are used to measuring--and they are great at it--are systems where the outcome of interest is more or less the thing being measured itself.

If you want to know what is trending about your company on Twitter, then that is by definition what people are tweeting about. If you want to direct people to the most popular pages from a given search term (what Google excels at) then you by definition want to measure where people click after using the same search term.

But doing those things is doing math and not doing statistics. Statistics is an offshoot of mathematics that is specific to the task of making predictions about things you haven’t measured from observations of other things you did measure. This axiom applies even to basic statistical inference problems such as finding the average height of a population, which doesn’t proceed by measuring everyone in the population. If you just measure everyone then you have counted, you have used some math by calculating the average, but you didn’t do any statistics. Statistically inferring the mean height of a population would proceed by measuring only a smaller sample of that population, and from that deriving an inference of what the average of the whole population likely is, and some measure of your confidence in that prediction. If you have measured everyone, then you have an observation of the population average, not a prediction of it, and you don’t need a measure of confidence.
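To make the distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the population is simulated with made-up parameters (mean 170 cm, standard deviation 10 cm), `estimate_mean` is a hypothetical helper, and the interval uses the usual normal approximation without a finite population correction.

```python
import math
import random
import statistics


def estimate_mean(sample, z=1.96):
    """Return (estimate, lower, upper): the sample mean and an approximate 95% interval."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean, computed from the sample standard deviation.
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean, mean - z * se, mean + z * se


# Hypothetical population: 100,000 simulated heights in cm (made-up parameters).
rng = random.Random(42)
population = [rng.gauss(170.0, 10.0) for _ in range(100_000)]

# Measuring everyone: counting and arithmetic, so no measure of confidence is needed.
population_mean = statistics.fmean(population)

# Measuring only 500 people: a prediction of the population mean plus its uncertainty.
sample = rng.sample(population, 500)
estimate, lower, upper = estimate_mean(sample)

print(f"population mean (observed): {population_mean:.2f}")
print(f"sample estimate: {estimate:.2f} (95% interval {lower:.2f} to {upper:.2f})")
```

The first number is an observation; the second is an inference about something not measured, which is why only the second comes with an interval.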

That’s why estimating the incidence of flu from search terms is not like Google’s bread and butter work. The incidence of flu is an out-of-sample prediction from search terms because the flu’s incidence is not itself determined just from what people think and therefore search about it. Surely flu incidence is partly a social construct. For example, a group struck by apprehension over the flu will avoid social contact and thereby slow the flu’s spread. But the flu also has other causes outside our thinking and searching the internet about it. Ambient temperature and humidity affect flu transmission, and mutation of the genetic material of the flu virus is a function of properties mostly not constructed by our own sociality.

My contention is the day-to-day work of Google and many tech companies that use big data is not out-of-sample prediction in this way. Instead, they are able to directly measure the thing of interest to their advertisers because the latter are intrinsically interested in the behavior of people within the self-creating system that is the internet.

I think when Google set out to predict the incidence of flu from search terms, they may not have realized they were stepping outside the realm of measurement of a self-creating system (like internet searches) and stepping into the realm of predicting an unobserved phenomenon from measurements of another phenomenon.

This realm is that of statistics, and it is well trod by practicing scientists from many fields. Yes, these travelers of the statistical realm usually have used small data. Some of them travel accompanied by a cartload of models and information-theoretic Bayesian livestock. Others are more modest practitioners but highly adept with a particular tried and true beast of burden, such as linear regression or K-means clustering.

Regardless, Google and others may do well to consult some of these conventional statistical vagabonds the next time they venture into analytics that are truly about predicting things not measured. Knowing the paths through a landscape can be even more important if you are carrying big data with you, which can make the effects of navigational errors all the larger.


The evolutionary analysis of Little Red Riding Hood

12/7/2013


There is a new paper out by Jamie Tehrani on the evolution of the Little Red Riding Hood fairy tale that is getting some much-deserved attention.

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0078871

There are two points about this paper that are just fantastic.

First, Dr. Tehrani applies phylogenetic methods to identify both inheritance and diffusion processes. To accomplish this, he supplements his phylogenetic analysis with some network algorithms (neighbornet) and with detailed ethnographic knowledge of these fairy tale variants. The results are a wonderful illustration of how applying phylogenetic methods does not lock the researcher into the assumption that culture evolves by inheritance rather than by diffusion. Instead, the phylogenetic results actively support a reasonable ethnographic argument that cultural diffusion of the story elements was extensive in China, while the story elements were conserved and thus inherited in Europe. Papers like Dr. Tehrani’s move us well beyond the now sterile debate about whether culture is inherited or diffused (folks, sometimes it’s one, and sometimes it’s the other, and sometimes it’s a mix, so deal with it). I think studies like this one are a clear model for the future growth of quantitative cross-cultural analysis.

I would also point out a prior paper by Dr. Tehrani, myself, and others, that showed how phylogenetic methods could also detect other types of cultural diffusion – specifically when different functionally or socially linked blocks of cultural traits move together from one population to another.

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0014810

To understand the basic point of this paper, think of the way new languages or religions can be adopted wholesale by a population, whether by choice or conquest. Such events result in cultural diffusion from the viewpoint of the population of people (they adopted a new set of traits), but from the viewpoint of the cultural elements the process is one of inheritance of a ‘cultural core’, because conversion can occur without blending of the elements within the core, e.g. the story elements of a single fairy tale, or the ritual elements of a religious denomination. This is why whole populations can be converted and languages or religions moved about across the globe, and yet the characteristics of these same languages or religions can still, at least sometimes, evolve through tree-like inheritance. Our paper provides a method to detect and model such circumstances.

The second fantastic point about Dr. Tehrani’s paper is that it shows the power of phylogenetic and network methods to construct empirically rigorous but quantitative models for global-scale cultural phenomena. In the past I have often thought of language and religion as the two systems of human culture that are ubiquitous, variable, and ancient, but Dr. Tehrani’s paper makes a strong case that traditions of folklore and fairy tale may also fit these criteria. With the analytical methods we now have available, it is just a matter of motivation and funding for all of us to have quantitative global maps of the lineages (inheritance pathways) and linkages (diffusion pathways) among languages, religions, and folktales. Indeed, it is just such a quantified and global model of the landscape of human culture that I believe was the original, pre-Boasian, mandate of anthropology.


What is anthropology?

9/24/2012


I think my experience is not unusual among anthropologists in that I am often asked what anthropology is. The question usually implies another: what makes anthropology a distinct discipline? That is, what makes it different from evolutionary biology or sociology?

Through teaching and multiple interactions with colleagues, I've found the best answer to this question of what anthropology is involves understanding a little about how anthropology came about. Anthropology is essentially a natural science discipline. It came about at a time when many natural history type disciplines arose. As we discovered the incredible diversity of natural life in the 18th and 19th centuries, scientists began to specialize on particular taxonomic groups of related organisms. Thus we started to have mammalogists and herpetologists and so forth. After Darwin we had a fully viable mechanism for how all the diversity of life could be linked together in a single unbroken and absolutely continuous history. Anthropology was born of the realization that this same unbroken character of evolutionary history must apply to humans; that all the incredible biological and cultural diversity of our species arose from a less diverse origin, and that it came about through evolutionary processes. Since we are humans, it seemed reasonable that there should be a natural history science about ourselves. Hence anthropology.

Because anthropology was born of the mindset of naturalists pursuing science, it made sense to early anthropologists like Edward Tylor and Lewis H. Morgan that anthropology would pursue both a survey of extant cultural diversity and an investigation of the archeological and fossil record of human existence as part of one discipline. This is, after all, exactly how a natural historian of the time would attempt to understand the evolutionary diversification of a related group of fish or rodents. You would want to know the existing diversity, about which you can of course have much more detailed information, but then also link that as much as possible to the direct evidence of past evolutionary change from the fossil record. Comparison and comparative methods have always been key to the study of evolution in any set of organisms. Indeed, comparison across many geographic scales, comparison among extant species, and comparison to the fossil record were all key sources of evidence for Darwin's insights on natural selection and descent with modification of species.

Given the role of language in human social life, and its magnificent inheritance properties, it was sensible that linguistics would be brought into anthropological science at least in part. The addition of primatology to the field was also a logical broadening of the comparative basis for understanding our species' evolution.

This is what anthropology was founded to do: to be the natural history science of humankind. Such an endeavor does not encompass the study of all of human life, and the sciences of sociology and psychology had very different origins. I will touch briefly on sociology, which is often the most difficult to disambiguate from anthropology. In contrast to anthropology, sociology was not founded with the fundamental goal of understanding how human social life diversified to what it is today from a series of past mechanistic causes (evolution). Sociology established itself as strictly the science of social causes for human social life and behavior. Thus, sociology studied a type of causation that exists at a particular emergent level, just as chemists study the causation of interactions among atoms, and community ecologists study causal interactions among species, etc. Sociology even today tends to model human social interactions as analogous to particle interactions of physics and with little interest in reducing causal sequences to psychology, biology or physics through a chain of causation; rather, the social causes themselves are of primary interest. Sociology was always a level of analysis type science, and in that sense more similar to much of modern science (E. O. Wilson has written well on the contrast of natural history science and science practiced at a single emergent level).

So, that is why anthropology made sense as a distinct discipline. Does this characterize anthropology today? Not really. There are anthropologists (like me) who still are motivated principally by this vision of anthropology as the natural history science of humanity. I think there has always been at least a small core of anthropologists with this view throughout its 100-150 year existence as a distinct discipline. However, ever since Franz Boas, many and perhaps most anthropologists have not seen anthropology in this way. It was Boas who first made popular within anthropology the idea that cultural diversity just springs spontaneously from people's heads, and that this construction of culture by ourselves had scarcely anything to do with our biological heritage. Once anthropologists accepted that, the linkage of fossil diggers, archeologists, and ethnographers in one discipline started to seem incoherent. This was exacerbated by the increasing popularity of nonscientific methods of cultural analysis that rejected any reconstruction of historical diversification and rejected quantitative methods. Most of Boas' highly influential students, like Margaret Mead, helped push the discipline in this direction, further splintering anthropology.

What will happen now to anthropology? I'm not sure, but I am sure that the comparative naturalist science of humanity will be conducted, whether by anthropologists, or some other group of researchers like cultural neuroscientists or psychologists. This is because there is a real academic discipline at the heart of anthropology. As we discover more and more about how genes affect our behavior, and even which genetic changes are responsible for our impressive cultural capabilities, the Boasian wall of separation between biological and cultural evolution will become more and more obviously false. So, someone will take up the effort, because there is a lot of science still to do.


Author

This is my personal blog. The views expressed on this page are my own. My views should not be taken to represent the views of my mentors, employer, or any person or group other than myself.
