"Big data" is a very 21st-century buzzword, one that loosely invokes the idea of using large sets of data to draw computer-assisted conclusions about trends, patterns and correlations, often about people and their behavior.
But if you wanted to trace the origin of using big data for health research, you'd have to go back — way back, to 17th-century England.
There, you'll find a haberdasher by the name of John Graunt, who undertook a peculiar project. He began to study so-called bills of mortality, death records kept during the plague-riddled times, and compiled death details into tables, noting age, gender, cause, location and time.
This vital statistics research was later published as a 1662 tome. It marked a seminal moment in demography, the statistical study of populations, and also in epidemiology, the study of what causes diseases and how they spread among different groups of people.
"It was totally groundbreaking for its time. It was a much larger scale of looking at trends in disease than anyone had looked at previously," says Stephen Mooney, an epidemiologist at Columbia University's Mailman School of Public Health.
"At some point you have to think about what it means to put together a table and look at patterns in year-over-year," he says. "For the time, that was big data."
Of course, the groundbreaking big data of today is a far cry from hand-crafted tables. It allows researchers to use super-fast computers to query billions of digital records we leave in our wake on social media, on our wearable devices, in our search history — our "digital exhaust," as Boston Children's Hospital Chief Innovation Officer John Brownstein puts it.
And isn't that a good thing?
The promise of big data for modern health is much extolled. This week came the latest feat. Scientists at Microsoft published a study showing that Web search queries (on Microsoft's Bing search engine) may hold clues to a future diagnosis of pancreatic cancer, one of the fastest-moving and most fatal cancers.
In essence, the Microsoft researchers did this: They studied millions of anonymized searches on Microsoft's Bing to find queries suggestive of a user's recent diagnosis, such as "Why did I get cancer in pancreas" or "Just diagnosed with pancreatic cancer." They then backtracked the digital footprints left by the same computer to locate earlier searches about symptoms of the disease, and used those to build a statistical model that they say could predict 5 percent to 15 percent of the ultimate diagnoses based on earlier search activity, with a low rate of false positives.
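To see why catching 5 percent to 15 percent of cases and keeping false positives low are separate questions, it helps to run the arithmetic. The sketch below is only an illustration: the population size, disease prevalence and false-positive rate are hypothetical numbers chosen for round figures, not values from the Microsoft study, and the function name is made up for this example.

```python
def screening_outcomes(population, prevalence, recall, false_positive_rate):
    """Count who a screening flag catches and who it flags by mistake.

    recall: share of real cases the model flags (the article's 5 to 15 percent).
    false_positive_rate: share of healthy people the model flags anyway.
    Returns (true_positives, false_positives, precision).
    """
    cases = population * prevalence
    non_cases = population - cases
    true_positives = cases * recall
    false_positives = non_cases * false_positive_rate
    flagged = true_positives + false_positives
    # Precision: of everyone flagged, what share actually has the disease.
    precision = true_positives / flagged if flagged else 0.0
    return true_positives, false_positives, precision


# Hypothetical inputs, NOT taken from the study.
tp, fp, precision = screening_outcomes(
    population=1_000_000,        # searchers
    prevalence=0.0001,           # 1 in 10,000 will be diagnosed
    recall=0.10,                 # model catches 10 percent of real cases
    false_positive_rate=0.0001,  # 1 in 10,000 healthy people flagged
)
print(f"correctly flagged: {tp:.0f}")
print(f"wrongly flagged:   {fp:.0f}")
print(f"share of flags that are real cases: {precision:.1%}")
```

With these made-up inputs, the wrongly flagged outnumber the correctly flagged by roughly 10 to 1, even though only 1 in 10,000 healthy people is flagged. When a disease is rare, a false-positive rate that sounds tiny can still account for most of the alerts.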
"My take is that it's exciting but preliminary," says Mooney, who has studied the use of big data in public health. "The potential benefit is huge," he says, but "it would be easy to naively assume we know more about this than we do." It's one thing to detect early digital clues to a diagnosis, but another to actually prevent or delay a death.
The Microsoft scientists themselves acknowledge this in the study. "Clinical trials are necessary to understand whether our learned model has practical utility, including in combination with other screening methods," they write.
Therein lies the crux of this big data future: It's a logical progression for the modern hyper-connected world, but one that will continue to require the solid grounding of a traditional health professional. That grounding is needed to steer data toward usefulness, to avoid unwarranted anxiety or even unnecessary testing, and to zero in on actual causes, not just correlations within particular health trends.
"That's why I think, if you talk to a lot of epidemiologists, they may be suspicious of some of these big data-type approaches," says Mooney, "because they'd be concerned that there's a loss of attention to causation."
The most high-profile lesson in failed causation was Google Flu Trends.
In 2008, Google researchers decided to measure flu activity, in real time, based on users' Web searches. It was a headline-grabbing project and worked well, for a while. David Lazer and Ryan Kennedy, academic researchers who later did a postmortem on the project, wrote in Wired magazine:
"GFT failed — and failed spectacularly — missing at the peak of the 2013 flu season by 140 percent. ...
"While Google's efforts in projecting the flu were well meaning, they were remarkably opaque in terms of method and data — making it dangerous to rely on Google Flu Trends for any decision-making.
"For example, Google's algorithm was quite vulnerable to overfitting to seasonal terms unrelated to the flu, like 'high school basketball.' ... There were bound to be searches that were strongly correlated by pure chance, and these terms were unlikely to be driven by actual flu cases or predictive of future trends."
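The point about search terms that correlate "by pure chance" can be demonstrated with a toy simulation. The sketch below is not Google's algorithm, which the researchers describe as opaque. It simply generates a random "flu" series and a few thousand unrelated random "search term" series, picks the term that tracks flu best over the first 20 weeks, and then checks how that same term does over the next 20. The function names, series lengths and term count are all assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_match(n_terms, n_weeks, holdout_weeks, seed=0):
    """Pick the random 'term' that best tracks a random 'flu' series.

    Every series is pure noise, so any correlation is coincidence.
    Returns (correlation on the weeks used for selection,
             correlation of that same term on the later holdout weeks).
    """
    rng = random.Random(seed)
    total = n_weeks + holdout_weeks
    flu = [rng.gauss(0, 1) for _ in range(total)]
    best_r, best_term = 0.0, None
    for _ in range(n_terms):
        term = [rng.gauss(0, 1) for _ in range(total)]
        r = pearson(term[:n_weeks], flu[:n_weeks])
        if abs(r) > abs(best_r):
            best_r, best_term = r, term
    holdout_r = pearson(best_term[n_weeks:], flu[n_weeks:])
    return best_r, holdout_r


in_sample, holdout = best_chance_match(n_terms=2000, n_weeks=20, holdout_weeks=20)
print(f"best term, weeks used to pick it: r = {in_sample:+.2f}")
print(f"same term, following weeks:       r = {holdout:+.2f}")
```

Because the winning term is chosen from thousands of candidates, its correlation over the selection weeks is bound to look impressive. Nothing ties it to the flu series afterward, which is the trap the postmortem describes.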
The project's failure, however, does not negate the promise of big data in health. Beyond analyses of large-scale trends, capturing passively created data on people's sentiments and mental ups and downs, things you may never think to bring up with your physician, can be "very powerful," Brownstein says. (With proper privacy and security protections in mind, of course.)
"It's not data that can be used in a silo, it's one gear in the system," he says, "so it's not like this holy grail. It's just data that can be used, that can be harnessed, in conjunction with other types of information strains."
To Mooney, Google Flu Trends was a case of a hype cycle, "this concept that technologies get overhyped and then are disappointing, but sometime after the disappointment, can often return to a sort of plateau of usefulness."
And that holds a lesson about big data in health: It deserves both enthusiasm and caution.
"Ideally, I'd like people to embrace both of them," says Mooney, "to recognize that it's exciting and concerning at the same time. Because the world is messy and it's possible to be exciting and concerning at the same time."