Some of the most important, most interesting, and most difficult open problems in the modern world involve amazingly complex systems. Consider problems like: How do we develop artificially intelligent machines that can think and behave like humans? How do we design a vaccine that prevents cancer? And: How will a new economic policy influence human behavior? The systems and processes that these problems concern (the brain, the immune system, and the economy) involve a staggering number of moving pieces and variables. Science is gaining ground, but our understanding of these systems remains limited. Even so, unprecedented progress has been made recently on these problems, largely driven by one factor: data.
As it turns out, the approach of gathering massive data sets and searching them for underlying patterns has been a surprisingly effective way to solve problems involving complex systems that we don’t yet fully understand. We might build an artificial intelligence, for example, by gathering billions of pieces of text written by humans and “training” a computer to recognize and mimic the patterns in human writing. Or we might develop a vaccine for cancer by gathering large amounts of data on the environments and genetics of people who seem to be less susceptible to the disease and searching that data for clues to what makes them so. While the use of data has long been a key part of science, using data at this scale has only recently been made possible by developments such as the Internet, the proliferation of cheap sensors, and fast computers.
In short, data science is the study of the mathematical techniques capable of finding meaningful patterns in messy, real-world data and the computational methods that enable their use on huge data sets. But it is also the study of the ethical implications of these methods and how to use them responsibly for the benefit of everyone.