Fall 2012 Magazine
The Data Age: Turning Information into Insight
Researchers mine masses of data for new solutions and understanding
When they look back at this age, historians may comment that the currency of choice—what we manufactured and traded in—was, for the first time, information. The amount of data created each year requires terms that the layperson can’t fathom. One study puts annual data creation and replication at 1.8 zettabytes; a single zettabyte equals 1 billion terabytes. (Today’s home computer hard drives typically hold anywhere from 500 gigabytes to 1 terabyte.)
What’s unique about this currency is that it doesn’t age, doesn’t degrade, and can self-replicate infinitely. We’ve figured out how to store it; anyone with a few hundred dollars can buy a few terabytes of hard drive. But as information accumulates, we lack the time and the capacity to search through it, so it becomes less and less valuable. How can we use it to find not only the answers we seek but also insights we never imagined?
The answer is data mining and analytics—designing models and algorithms that can parse through terabytes and petabytes of data to find the gems leading to new solutions and understandings in fields as varied as business, medicine, and science. Several McCormick professors are at the forefront of mining large data sets and are teaching the next generation a set of skills that could make them extremely valuable in the workplace.
“I’m getting calls from firms that see the value in big data, but they don’t know how to extract it,” says analytics expert Diego Klabjan, professor of industrial engineering and management sciences. “It’s definitely a very, very hot area. Everyone’s looking for expertise. We’ve had tremendous interest from companies. These days every company needs analytics. They need to hire a workforce that is capable of analyzing data.”
To that end, McCormick recently developed a master of science program in analytics. The inaugural class of the 15-month program is learning data warehousing techniques, the science behind analytics, and the business aspects of analytics. Directed by Klabjan, the program has its own computing cluster to take on big-data problems, and students will each do a summer internship. They will learn to identify patterns and trends, interpret and gain insights from vast quantities of structured and unstructured data, and communicate their findings in business terms.
“Our students will be hired by excellent companies,” Klabjan says.
Giving Computers the Data to Solve Great Problems
While industry is getting on board, Doug Downey has already spent years thinking about big data and its possibilities. “I’ve always seen it as an important challenge and a key enabler,” says the assistant professor of electrical engineering and computer science.
When Downey arrived at McCormick in 2008, he decided to use data as a resource to take another tack with a huge engineering challenge: artificial intelligence.
“I thought that there are huge amounts of data that could make artificial intelligence problems we considered really hard in fact really easy,” he says. Researchers might eventually teach computers to solve the world’s greatest problems—such as curing cancer. Downey began his research by building systems that can automatically extract information from web data. With graduate student Brent Hecht, he designed an “exploratory search” program called Atlasify, a visual search engine that combines cartography with web mining to provide a graphic tool for exploring concepts and correlations. Making the program user-ready took months of work. The researchers had to build software tools that can determine correlations between information on disparate Wikipedia pages, and they had to make the system quick enough to respond to queries in real time.
“This gives users a way to automatically tie together a bunch of very different sources of information and come up with insights that could help them understand concepts,” Downey says.
Atlasify can provide maps and charts with any sort of reference system, from the relatedness of Elvis Presley and Russia to the correlation of rock and hip-hop based on Grammy Award winners. A concept such as nuclear power, for instance, is visualized with a world map of nuclear capabilities and a US Senate floor map showing senators’ support for nuclear power legislation.
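One simple way to get at the kind of concept relatedness Atlasify visualizes is to compare which pages link to each concept. The sketch below is only illustrative; the concepts and their inbound links are invented, and the real system uses far more sophisticated measures than this simple set overlap.

```python
def relatedness(links_a, links_b):
    """Jaccard overlap of two concepts' inbound-link sets, in [0, 1]."""
    a, b = set(links_a), set(links_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Invented inbound Wikipedia links for three concepts.
inlinks = {
    "Nuclear power": {"Uranium", "Reactor", "France", "Energy"},
    "Solar power":   {"Photovoltaics", "Energy", "France"},
    "Elvis Presley": {"Rock and roll", "Memphis"},
}
print(relatedness(inlinks["Nuclear power"], inlinks["Solar power"]))    # → 0.4
print(relatedness(inlinks["Nuclear power"], inlinks["Elvis Presley"]))  # → 0.0
```

Scores like these, computed between a query concept and every entry in a reference system (countries on a map, seats on a Senate floor plan), are what drive the shading in such a visualization.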
Downey hopes such tools as Atlasify ultimately do more than just satisfy a single user’s thirst for knowledge. “It could make human life phenomenally better,” he says. “When you get a ton of data, it changes the kinds of problems you can solve. That’s what I get most excited about.”
While Dirk Brockmann’s computations aren’t yet curing diseases, they are making headway in understanding how diseases spread. Brockmann’s research group is building models from large data sets on human mobility and on past outbreaks of disease.
“We want to eventually be able to predict the spread of emerging infectious disease—just like forecasting the weather,” says the associate professor of engineering sciences and applied mathematics.
Brockmann’s group found that complex networks—from disease outbreaks to global air traffic—share similar backbones. By stripping each network down to its essential nodes and links, the researchers found that each possesses a skeleton and that the skeletons have common features. This reduction of complexity should make it easier to predict the spread of pandemics, Brockmann says.
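One simple way to extract such a skeleton (a sketch only, not Brockmann's actual method) is to keep a maximum spanning tree of the weighted network, so every node stays reachable through the strongest links. The cities and traffic weights below are invented.

```python
def skeleton(edges):
    """edges: list of (weight, u, v); returns maximum-spanning-tree edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    kept = []
    for w, u, v in sorted(edges, reverse=True):  # strongest links first
        ru, rv = find(u), find(v)
        if ru != rv:  # keep the edge only if it joins two components
            parent[ru] = rv
            kept.append((u, v))
    return kept

# Toy mobility network: nodes are cities, weights are traffic volume.
links = [(9.0, "CHI", "NYC"), (7.0, "CHI", "LAX"),
         (2.0, "NYC", "LAX"), (5.0, "NYC", "ATL"),
         (1.0, "LAX", "ATL")]
print(skeleton(links))  # → [('CHI', 'NYC'), ('CHI', 'LAX'), ('NYC', 'ATL')]
```

The weak, redundant links drop out while the network stays connected—the same intuition, in miniature, behind reducing a mobility network to the backbone along which an outbreak is most likely to travel.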
Now the group is inverting its network theory techniques to remap the enterohemorrhagic E. coli (EHEC) outbreak in Europe last year to determine the geographic source of the epidemic. “It’s a new idea to turn the problem around and ask not where the disease spread but where it first began,” Brockmann says. “This has never been done before.”
Brockmann is also using his network techniques to determine the important links in cancer gene networks. Using large data sets of a human’s 23,000 genes and how they are expressed in cancer, Brockmann, working with professor of engineering sciences and applied mathematics Bill Kath, hopes to whittle that network down to the most important nodes and links. “Which connections are really important?” he says. “We hope to narrow it down so the experimentalists can focus on malfunctioning areas.”
This sort of network research could have implications in other academic realms, too. Brockmann hopes to use the same techniques to determine the origin of farming technology during the Neolithic period.
“There’s a big discussion of how farming spread into central Europe and replaced the hunter-gatherer mode of life,” he says. “We could use our network techniques with archaeologists’ large data sets of sites to determine the origin of this culture.”
Brockmann’s research attracts undergraduates eager to delve into big data; even though they cannot recall a time before the Internet, they are still fascinated with the magnitude of data available for research. Brockmann is fascinated as well, even after more than 10 years as a professor.
“I didn’t know we would ever have access to this much data,” he says. “That makes this field exciting, but it’s also a challenge. You have to evolve quickly to be able to devise models that can deal with this amount of data.”
From Big Data to Broad Data
For Noshir Contractor, mashing data sets together is the future of data research.
“What is really exciting is ‘broad data’: getting data from different sources and being able to combine them to find new insights,” says Contractor, the Jane S. and William White Professor of Behavioral Science, who has appointments in industrial engineering and management sciences, communication studies, and management and organizations. “That’s much trickier than big data.”
Contractor, known for using large-scale computing to test and develop social network theories, used his mashing skills to help the National Cancer Institute develop its PopSciGrid Community Health Portal, which allows users to visualize and analyze data from disparate sources. For example, a user could mash together data sets of smokers and cigarette taxation in each state.
“It offers a new way to look at the data, to inspire discussion and analysis among a much larger community,” Contractor says. “We’re developing a tapestry of broad data rather than big data.”
These days Contractor and his SONIC Lab (Science of Networks in Communities) are most interested in using broad data sets to determine how to best assemble teams. He’s studying scientific teams—such as those supported by the Northwestern University Clinical and Translational Sciences (NUCATS) Institute at the intersection of basic, clinical, and translational sciences—and teams brought together to perform a specific task. Using a tailored computer war game where participants must work together to bring humanitarian aid to a certain site, Contractor’s group is collecting data on each action and interaction to determine how players develop trust. Gamers are each assigned a task (such as looking for improvised explosive devices or making sure the convoy doesn’t get blown up by insurgents) and must provide information to their teammates through voice or text chat. Researchers investigate how communication structures and trust affect team performance.
The gigabytes of server data created by each two-hour game allow the researchers to test hypotheses. “We’ve found that in some cases, having too much trust is not necessarily a good thing,” Contractor says. “If people talk only to the few people they trust, their performance is worse.”
Another theory Contractor is developing concerns the importance of studying ecosystems of teams. Previous team-theory studies focused on only the team at hand—not on whether team members were simultaneously involved with other teams in the ecosystem, as most people are. Contractor is focusing on how team performance is influenced by overlapping memberships with other teams in the ecosystem.
“We’ve never had the ability to study these things at scale,” he says. “Previously social sciences have had to focus on doing small surveys or experiments. Now that we have big data at scale, we can look to make claims that we never could before with only a small sample.”
Predicting Extreme Climate Events
When it comes to weather, data at a large scale have been available for decades—we just didn’t have the tools to mine them for insights.
Alok Choudhary, John G. Searle Professor of Electrical Engineering and Computer Science, is focusing on understanding climate change based on the wealth of observational weather data from the past 150 years. Data-mining techniques are allowing Choudhary and colleagues from five other universities to look for predictors of extreme events such as hurricanes and droughts. Current climate-change models work well for predicting big developments, such as global temperature change over the next 50 years, but not for predicting extreme events that result from climate change.
“With traditional data-mining techniques, it’s like taking only a sample of the haystack to find the needle,” Choudhary says. “We are designing new ways to analyze the entire haystack, to discover interesting and actionable information without having to ask specific questions.”
For the past two decades Choudhary has used up to 50,000 processors to mine petabytes (a petabyte is about 1 million gigabytes) of data, searching for answers in climate change, social networks, astrophysics, and chemistry. A member of the International Exascale Software Project, which is working to design systems and hardware that process information more quickly, Choudhary has developed algorithms that evolve with growing data sets.
Choudhary and his group are transforming weather data sets into climate networks in search of patterns that are predictive of events. The information could be helpful in deploying the right resources. “If you see humidity at a certain level for a few months in certain regions of Africa, then you can predict that there will be a meningitis outbreak,” Choudhary says. “Then you can start to plan a vaccination.”
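The climate-network idea can be sketched in a few lines: treat each weather station as a node and connect stations whose measurement histories are strongly correlated. The station names, readings, and 0.9 threshold below are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length measurement series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def climate_network(series, threshold=0.9):
    """series: {station: [humidity readings]}. Returns edges between
    stations whose histories correlate above the threshold."""
    names = sorted(series)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(series[a], series[b])) >= threshold]

# Invented monthly humidity readings at three stations.
readings = {
    "station_A": [0.61, 0.70, 0.82, 0.91, 0.85, 0.74],
    "station_B": [0.60, 0.72, 0.80, 0.93, 0.84, 0.75],  # tracks A closely
    "station_C": [0.30, 0.28, 0.35, 0.31, 0.29, 0.33],  # unrelated
}
print(climate_network(readings))  # → [('station_A', 'station_B')]
```

At scale, patterns in which links appear or vanish in such a network—rather than any single station's readings—are the kind of signal mined for precursors of extreme events.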
Choudhary also uses data-mining algorithms that can identify and segment networks by what people say and whom they follow. The information could help companies target potential customers; for example, it might be that a person who is actively engaged with Amazon.com and likes University of Wisconsin football is a target customer for McDonald’s. “It’s a new way of thinking beyond broad-based demographics,” Choudhary says.
Locating Car-Charging Stations, Research Sites
Diego Klabjan, the electric car–driving professor of industrial engineering and management sciences, has extended his interest in sustainable energy to his large-data research: he has used analytics to create models of where to put electric car–charging stations. That involved culling information from several sources to find where and how people drive, where potential drivers of electric vehicles might live, and where these drivers might want to spend their time while their cars charge. Now Klabjan is analyzing data from the usage of charging stations to determine correlations with local demographics and economic activity.
Working with Irina Dolinskaya, Junior William A. Patterson Professor in Transportation and assistant professor of industrial engineering and management sciences, and a graduate student, Klabjan is also developing a smart routing system for electric vehicles. If an electric car driver doesn’t have enough charge to get from A to B, the system will route him or her to a charging station. “Because the remaining [driving] range is affected by so many factors—outside temperature, inside temperature, how fast you accelerate—there are a lot of data to consider,” he says. “It’s a fascinating project from a practical and a research perspective.”
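A bare-bones version of such range-aware routing (an illustration only, with an invented road network, not the team's actual system) can be written as a shortest-path search whose states track both location and remaining range, with charging stations restoring the range to full.

```python
import heapq

def route(graph, stations, start, goal, full_range):
    """graph: {node: [(neighbor, miles), ...]}. Returns (total_miles, path)
    or None if the trip is infeasible even with recharging."""
    pq = [(0.0, start, full_range, [start])]  # assume we leave fully charged
    best = {}
    while pq:
        dist, node, rng, path = heapq.heappop(pq)
        if node == goal:
            return dist, path
        if best.get(node, -1) >= rng:
            continue  # already reached here with at least as much charge
        best[node] = rng
        for nxt, miles in graph.get(node, []):
            if miles <= rng:  # only drive legs the battery can cover
                nrng = full_range if nxt in stations else rng - miles
                heapq.heappush(pq, (dist + miles, nxt, nrng, path + [nxt]))
    return None

# Invented network: the direct A→B hop needs 90 miles of range,
# but a charging station S sits on a longer detour.
city = {
    "A": [("B", 90), ("S", 50)],
    "S": [("B", 60)],
}
print(route(city, {"S"}, "A", "B", full_range=80))
# → (110.0, ['A', 'S', 'B'])  — the detour wins because 90 > 80
```

The real problem is harder precisely for the reasons Klabjan gives: the miles-per-charge figure is not a constant but a function of temperature, driving style, and more.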
Klabjan’s other research took him all the way to Greenland to study the logistics of National Science Foundation research sites. He traveled on a military plane and spent days on the ice sheet that covers most of the country. Using a sophisticated logistics operations model, he hopes to find how many research sites the NSF should have over the next decade. “There’s such a big variety of research projects that it is very challenging to estimate demand,” he says.
The Skills Needed to Keep Up with the Data
For years Dan Apley, associate professor of industrial engineering and management sciences, has performed statistical analyses for industrial quality control. In the past, data were often limited, sometimes coming from manual measurement of manufactured parts. When lasers began to do the measuring, Apley became a large-data researcher.
“Instead of employees taking a few measurements with calipers, or a machine taking dozens of measurements, a laser scanner can measure millions of points per part,” he says. “There were suddenly huge amounts of data to help understand what’s happening during the manufacturing process.”
Apley later began using his skills in the service sector—most recently, analyzing customer data for a credit card company. The company’s database contains thousands of variables (application and credit bureau data, monthly spending and payment data, etc.) for millions of customers. Apley was asked to narrow that information down to a small combination of variables that could best indicate whether the customer was high risk.
“To handle these large data sets—far larger than what people have looked at in the past—you need a broad skill set,” he says. “You need to understand statistical modeling and computer science concepts and be able to code algorithms that are computationally feasible. And you must be able to interact with the domain experts to pick the brains of the people who know what problems are important, what solutions may be useful, and what variables may potentially contain relevant information. There is no magic-bullet data mining method that applies universally.”
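The variable-narrowing step Apley describes can be caricatured in a few lines: score each candidate variable by how well it separates high-risk from low-risk customers, then keep the top scorers. The customers, variables, and scoring rule below are invented; real work would normalize the variables and use proper statistical models rather than raw mean gaps.

```python
def risk_separation(values, labels):
    """Gap between the mean value for high-risk (label 1)
    and low-risk (label 0) customers."""
    hi = [v for v, y in zip(values, labels) if y == 1]
    lo = [v for v, y in zip(values, labels) if y == 0]
    return abs(sum(hi) / len(hi) - sum(lo) / len(lo))

def top_variables(table, labels, k=2):
    """Keep the k variables that best separate the two groups."""
    scores = {name: risk_separation(col, labels) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Four toy variables for six customers; 1 = defaulted.
labels = [0, 0, 0, 1, 1, 1]
table = {
    "utilization":    [0.2, 0.3, 0.1, 0.9, 0.8, 0.95],
    "late_payments":  [0, 1, 0, 4, 5, 3],
    "age_of_account": [5, 7, 6, 6, 5, 7],
    "monthly_spend":  [1.1, 0.9, 1.0, 1.2, 1.0, 1.1],
}
print(top_variables(table, labels))  # → ['late_payments', 'utilization']
```

Going from thousands of candidate variables to a handful that a risk analyst can actually reason about is the point; the statistical machinery for doing it defensibly is where the broad skill set comes in.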
Apley plans to use his expertise in collaboration with materials science and other engineering professors to learn how microscale characteristics affect macroscale material properties and to discover new key microscale characteristics. They have recently submitted a number of proposals as part of President Obama’s Materials Genome Initiative for Global Competitiveness, which seeks to build a national infrastructure for data sharing and analysis by scientists and engineers designing new materials.
“Years ago the limitation was the data, because methodologies for analyzing data had evolved faster than the structure of the data sets,” Apley says. “Now the structure and availability of data—the size, the richness, the complexity—are advancing faster than the algorithms and methods of analysis that we have, and the data mining and statistical learning communities are challenged to keep pace.”