Data Analysis at Massive Scales

Extracting information and insights from massive datasets; "big data"; "data mining"

With the goal of developing analysis techniques for massive data sets, DARPA rolled out the Topological Data Analysis (TDA) program, which ran from 2004 to 2008. Like many other DARPA programs, this one spawned a commercial firm, in this case a software company that was still in business when this timeline was posted in 2018.
DARPA held a multi-program performer meeting at which researchers heard presentations on the latest innovations and promising approaches in Big Data and data analytics. Speakers during the day-long event included representatives from the White House, the FBI, universities from across the country, and leading private-sector companies focused on the potential efficiencies and advantages that can be gained from Big Data.
Today's web searches use a centralized, one-size-fits-all approach that searches the Internet with the same set of tools for all queries. While that model has been wildly successful commercially, it does not work well for many government use cases. For example, searching remains a largely manual process that does not save sessions, requires nearly exact input with one-at-a-time entry, and does not organize or aggregate results beyond a list of links. Moreover, common search practices miss information in the deep web (the parts of the web not indexed by standard commercial search engines) and ignore shared content across pages.
The chikungunya virus (CHIKV) is quickly spreading through the Western Hemisphere; as of May 15, 2015, the Pan-American Health Organization (PAHO) had tallied close to 1.4 million suspected cases and more than 33,000 confirmed cases since the virus’ first appearance in the Americas in December 2013. Spread by mosquitoes, chikungunya is rarely fatal but can cause debilitating joint and muscle pain, fever, nausea, fatigue and rash, and poses a growing public health and national security risk. Governments and health organizations could take more effective proactive steps to limit the spread of CHIKV if they had accurate forecasts of where and when it would appear. But such predictions for CHIKV and other emerging infectious diseases remain beyond the reach of current modeling capabilities.
Popular search engines are great at finding answers for point-of-fact questions like the elevation of Mount Everest or current movies running at local theaters. They are not, however, very good at answering what-if or predictive questions—questions that depend on multiple variables, such as “What influences the stock market?” or “What are the major drivers of environmental stability?” In many cases that shortcoming is not for lack of relevant data. Rather, what’s missing are empirical models of complex processes that influence the behavior and impact of those data elements.