Important Points Made by Individual Speakers
The fifth session of the workshop was chaired by Deepak Agarwal (LinkedIn Corporation). The session had four speakers: Christopher Ré (Stanford University), Bill Cleveland (Purdue University), Ron Brachman (Yahoo Labs), and Mark Ryland (Amazon Corporation).
CAN KNOWLEDGE BASES HELP ACCELERATE SCIENCE?
Christopher Ré, Stanford University
Christopher Ré focused on a single topic related to data science: knowledge bases. He first discussed Stanford University’s experience with knowledge bases. Ré explained that in general scientific discoveries are published and lead to the spread of ideas. With the advent of electronic books, the scientific idea base is more accessible than ever before. However, he cautioned, people are still limited by their eyes and brain power. In other words, the entire science knowledge base is accessible but not necessarily readable.
Ré noted that today’s science problems require macroscopic knowledge and large amounts of data. Examples include health, particularly population health; financial markets; climate; and biodiversity. Ré used the latter as a specific example: broadly speaking, biodiversity research involves assembling information about Earth across various disciplines to make estimates of species extinction. He explained that this is “manually constructed” data: a researcher must input the data by examining and collating information from individual studies. Manually constructed databases are time-consuming to produce; with today’s data sources, the construction exceeds the time frame of the typical research grant. Ré posited that the use of sample-based data and their synthesis constitute the only way to address many important questions in some fields. A system that synthesizes sample-based data could “read” journal articles and automatically extract the relevant data from them. He stated that “reading” machines are emerging in the popular domain (from such Web companies as IBM, Google, Bing, and Amazon). The concept of these machines could be extended to work in a specific scientific domain. That would require high-quality reading, of a higher quality than a popular-domain application needs, because mistakes are more harmful in a scientific database.
Ré described a system that he has developed, PaleoDeepDive,1 a collaborative effort with geologist Shanan Peters of the University of Wisconsin–Madison. The goal of PaleoDeepDive is to build a higher-coverage fossil record by extracting paleobiologic facts from research papers. The system considers every character, word, and text fragment of a research paper to be a variable and then conducts statistical inference on billions of variables defined from the research papers to develop relationships between biologic and geologic research. PaleoDeepDive has been in operation for about 6 months, and preliminary results for the occurrence relations it extracts show a precision of around 93 percent; Ré indicated that this is a very high-quality score.
______________
1 See the DeepDive website, http://deepdive.stanford.edu/, accessed June 9, 2014, for more information.
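The precision that Ré reported can be read as the fraction of machine-extracted facts that survive an expert check. A minimal sketch, with invented (taxon, formation) occurrence tuples standing in for PaleoDeepDive's actual output:

```python
# Hedged sketch: how an extraction precision score like PaleoDeepDive's
# reported ~93 percent can be computed. The fact tuples and the "gold"
# expert-checked set below are invented for illustration.

def precision(extracted, gold):
    """Fraction of extracted facts that the expert-checked set confirms."""
    if not extracted:
        return 0.0
    true_positives = sum(1 for fact in extracted if fact in gold)
    return true_positives / len(extracted)

# Hypothetical (taxon, geologic unit) occurrence relations.
extracted = {
    ("Tyrannosaurus", "Hell Creek Formation"),
    ("Triceratops", "Hell Creek Formation"),
    ("Triceratops", "Morrison Formation"),   # an extraction error
}
gold = {
    ("Tyrannosaurus", "Hell Creek Formation"),
    ("Triceratops", "Hell Creek Formation"),
}

print(round(precision(extracted, gold), 2))  # → 0.67
```

The complementary measure, recall, would ask how much of the known fossil record the system recovers; the precision score alone says only how trustworthy each extracted fact is.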
Ré then described challenges related to synthetic knowledge bases, both for the domain scientists who use them and for the computer scientists who build them.
A participant noted that noise, including misspelled words and words that have multiple meanings, is a standard problem for optical character recognition (OCR) systems. Ré acknowledged that OCR can be challenging and that even state-of-the-art OCR systems make many errors. PaleoDeepDive uses statistical inference and has a package that improves OCR by federating open-source OCR tools and using probabilistic inputs. Ré indicated that Stanford would be releasing tools to assist with OCR.
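The federating idea can be illustrated with a toy character-level vote across several OCR readings of the same line. The engine outputs below are invented, and real systems must first align readings of unequal length:

```python
# Hedged sketch of federating several OCR readings of one line of text:
# a character-level majority vote whose vote share doubles as a
# probabilistic input for downstream statistical inference.
from collections import Counter

def fuse_ocr(readings):
    """Majority-vote each character position across equal-length readings."""
    fused, confidence = [], []
    for chars in zip(*readings):
        votes = Counter(chars)
        best, count = votes.most_common(1)[0]
        fused.append(best)
        confidence.append(count / len(chars))  # vote share at this position
    return "".join(fused), confidence

# Two engines read the "o" correctly; one confuses it with a zero.
readings = ["Tricerat0ps", "Triceratops", "Triceratops"]
text, conf = fuse_ocr(readings)
print(text)  # → Triceratops
```

Rather than keeping only the winning character, a system in the spirit of PaleoDeepDive would carry the vote shares forward as probabilities, so that low-confidence characters can be revisited during inference.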
DIVIDE AND RECOMBINE FOR LARGE, COMPLEX DATA
Bill Cleveland, Purdue University
Bill Cleveland explained the goals of divide and recombine for big data analysis.
Cleveland then described the divide and recombine method. He explained that a division method is used first to divide the data into subsets. The subsets are then treated with one of two categories of analytic methods, and the per-subset outputs are recombined.
Cleveland described several specific methods of division. In the first, conditioning-variable division, the researcher divides the data on the basis of subject matter regardless of the size of the subsets. That is a pragmatic approach that has been widely used in statistics, machine learning, and visualization. In a second type of division, replicate division, observations are exchangeable, and no conditioning variables are used; the division is done statistically rather than by subject matter. Cleveland stated that the statistical division and recombination methods have an immense effect on the accuracy of the divide and recombine result. The statistical accuracy is typically somewhat less than that of a method applied directly to all of the data at once. However, Cleveland noted that this is a small price to pay for the simplicity in computation; the statistical computation touches each subset no more than once. Cleveland clarified that the process is not MapReduce; statistical methods in divide and recombine reveal the best way to separate the data into subsets and put them back together.
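A minimal sketch of replicate division and recombination, using the subset mean as the analytic method (the data and subset counts are invented; the real method applies to far richer statistics, for which recombination is only approximate):

```python
# Hedged sketch of replicate division and recombination: the data are
# split at random into equal subsets, a statistic (here the mean) is
# computed once per subset, and the per-subset results are averaged
# back together.
import random
from statistics import mean

random.seed(0)
data = [random.gauss(10, 2) for _ in range(900)]

# Replicate division: shuffle, then cut into r equal subsets.
random.shuffle(data)
r = 30
subsets = [data[i::r] for i in range(r)]

# Analytic method applied independently to each subset (parallelizable).
subset_means = [mean(s) for s in subsets]

# Recombination: average the per-subset results.
recombined = mean(subset_means)

print(abs(recombined - mean(data)) < 1e-9)  # → True
```

For the mean with equal-sized subsets, the recombined value matches the all-data answer essentially exactly; for estimators such as regression coefficients it does not, which is the small loss of statistical accuracy Cleveland describes.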
Cleveland explained that the divide and recombine method uses R for the front end, which makes programming efficient. R saves the analyst time, although it is slower than other options. It has a large support and user community, and statistical packages are readily available. On the back end, Hadoop is used to enable parallel computing. The analyst specifies, in R, the code to do the division computation with a specified structure. Analytic methods are applied to each subset or each sample. The recombination method is also specified by the analyst. Cleveland explained that Hadoop schedules the microprocessors effectively. Computation is done by the mappers, each with an assigned core for each subset. The same is true for the reducers; reducers carry out the recombination. The scheduling possibilities are complex, Cleveland said. He also noted that this technique is quite different from the high-performance computing systems that are prevalent today. In a high-performance computing application, time is reserved for batch processing; this works well for simulations (in which the sequence of steps is known ahead of time and is independent of the data), but it is not well suited to sustained analyses of big data (in which the process is iterative and adaptive and depends on the data).
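The mapper/reducer correspondence can be sketched with a toy in-memory stand-in for Hadoop; the keys and subsets are invented, and the R and Hadoop tool chain Cleveland describes would distribute the same two functions across cluster cores:

```python
# Hedged sketch of how a divide and recombine job maps onto MapReduce:
# each mapper applies the analytic method to one subset, a "shuffle"
# phase groups mapper output by key, and a reducer recombines the
# per-subset results.
from statistics import mean

def mapper(subset):
    # One mapper per subset: emit (key, per-subset statistic).
    yield ("estimate", mean(subset))

def reducer(key, values):
    # Recombination step: average the per-subset statistics.
    return key, mean(values)

subsets = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# Shuffle phase: group mapper output by key.
grouped = {}
for subset in subsets:
    for key, value in mapper(subset):
        grouped.setdefault(key, []).append(value)

results = dict(reducer(k, v) for k, v in grouped.items())
print(results)  # → {'estimate': 5.0}
```

The point of the structure is that each mapper touches only its own subset, so the work parallelizes cleanly, and only the small per-subset results ever travel to the reducer.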
Cleveland described three divide and recombine software components between the front and back ends. They enable communication between R and Hadoop to make the programming easy and insulate the analyst from the details of Hadoop. They are all open-source.
Cleveland explained that divide and recombine methods are best suited to analysts who are conducting deep data examinations. Because R is the front end, R users are the primary audience. Cleveland emphasized that the complexity of the data set is more critical to the computations than the overall size; however, size and complexity are often correlated.
In response to a question from the audience, Cleveland stated that training students in these methods, even students who are not very familiar with computer science and statistics, is not difficult. He said that the programming is not complex; however, analyzing the data can be complex, and that tends to be the biggest challenge.
______________
2 See Purdue University, Department of Statistics, “Divide and Recombine (D&R) with RHIPE,” http://www.datadr.org/, accessed June 9, 2014, for more information.
3 See GitHub, Inc., “R and Hadoop Integrated Programming Environment,” http://github.com/saptarshiguha/RHIPE/, accessed June 9, 2014, for more information.
4 See Tessera, “datadr: Divide and Recombine in R,” http://hafen.github.io/datadr/, accessed June 9, 2014, for more information.
5 See Tessera, “Trelliscope: Detailed Vis of Large Complex Data in R,” http://hafen.github.io/trelliscope/, accessed June 9, 2014, for more information.
YAHOO’S WEBSCOPE DATA SHARING PROGRAM
Ron Brachman, Yahoo Labs
Ron Brachman prefaced his presentation by reminding the audience of a 2006 incident in which AOL released a large data set containing 20 million search queries for public access and research. Unfortunately, personally identifiable information was present in many of the searches, and this allowed the identification of individuals and their Web activity. Brachman said that in at least one case, an outside party was able to identify a specific individual by cross-referencing the search records with externally available data. AOL withdrew the data set, but the incident sent shockwaves through the Internet industry. Yahoo was interested in creating data sets for academics around the time of the AOL incident, and the AOL experience got Yahoo off to a slow start. Yahoo persisted, however, working on important measures to ensure privacy, and has developed the Webscope6 data sharing program. Webscope is a reference library of interesting and scientifically useful data sets. Use of the data requires a license agreement; the agreement is not burdensome, but it includes terms whereby the data user agrees not to attempt to reverse-engineer the data to identify individuals.
Brachman said that Yahoo has just released its 50th Webscope data set. Data from Webscope have been downloaded more than 6,000 times, and Webscope offers a variety of data categories.
______________
6 See Yahoo! Labs, “Webscope,” http://webscope.sandbox.yahoo.com, accessed May 20, 2014, for more information.
Brachman explained that in many cases there is a simple click-through agreement for accessing the data, which can then be downloaded over the Internet. However, downloads are becoming impractical as database sizes increase. Yahoo had been asking for hard drives to be sent through the mail; now, however, it is hosting some of its databases on Amazon Web Services (AWS).
In response to questions, Brachman explained that each data set is accompanied by a file explaining its content and format. He also indicated that the data provided by Webscope are often older (around a year or two old), and this is one of the reasons that Yahoo is comfortable with their use for academic research purposes.
Brachman was asked whether any models fit between the two extremes of Webscope (with contracts and nondisclosure agreements) and open-source. He said that the two extremes are both successful models and that the middle ground between them should be explored. One option is to use a trusted third party to hold the data, as is the case with the University of Pennsylvania’s Linguistic Data Consortium data.7
RESOURCE SHARING WITH AMAZON WEB SERVICES
Mark Ryland, Amazon Corporation
Mark Ryland explained that resource sharing can mean two things: technology capabilities that allow sharing (such as cloud resources) and economic and cost sharing, that is, how to do things less expensively by sharing. AWS is a system that does both. AWS is a cloud computing platform that consists of remote computing, storage, and other services. It holds a large array of data sets offered as three types of product. The first is public, freely available data sets. These data sets consist of freely available data of broad interest to the community and include Yahoo Webscope data, Common Crawl data gathered by the open-source community (240 TB), Earth-science satellite data (40 TB of data from NASA), 1000 Genomes data (350 TB of data from NIH), and many more. Ryland stated that before the genome data were publicly stored in the cloud, fewer than 20 researchers worked with those data sets. Now, more than 200 are working with the genome data because of the improved access.
______________
7 See University of Pennsylvania, Linguistic Data Consortium, “LDC Catalog,” http://catalog.ldc.upenn.edu/, accessed May 14, 2014, for more information.
A second type of AWS data product is requester-pays data. This is a form of cost sharing in which data access is charged to the user’s account but data storage is charged to the data owner’s account. It is fairly popular but perhaps not as successful as AWS would like it to be, and AWS is looking to broaden the program.
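Mechanically, on Amazon S3 (the AWS storage service most associated with shared data sets), a requester signals acceptance of the charges with a dedicated request header. The sketch below builds such a request without sending it; the bucket and key names are invented:

```python
# Hedged sketch of what "requester pays" means at the request level on
# Amazon S3: the caller opts in to paying for the request and the data
# transfer by adding the x-amz-request-payer header. The bucket and key
# names are invented, and no network call is made here.
def build_requester_pays_request(bucket, key):
    """Assemble the pieces of a requester-pays GET request."""
    return {
        "method": "GET",
        "url": f"https://{bucket}.s3.amazonaws.com/{key}",
        "headers": {
            # Signals that the requester, not the bucket owner, pays
            # for this request and the data transfer out.
            "x-amz-request-payer": "requester",
        },
    }

req = build_requester_pays_request("example-shared-data", "genomes/sample.vcf")
print(req["headers"]["x-amz-request-payer"])  # → requester
```

Without that header, S3 rejects requests against a requester-pays bucket, which is how the cost-sharing split between data owner and data user is enforced.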
The third type of AWS data product is community and private. AWS may not know what data are shared in this model. The data owner controls data access. Ryland explained that AWS provides identity control and authentication features, including Web Identity Federation. He also described a science-oriented data service (Globus8), which provides cloud-based services to conduct periodic or episodic data transfers. He explained that an ecosystem is developing around data sharing.
Sharing is also taking place in computation. Ryland noted that people are developing Amazon Machine Images with tools and data “prebaked” into them, and he provided several examples, including the Neuroimaging Informatics Tools and Resources Clearinghouse and Scientific Linux tools. Ryland indicated that there are many big data tools, some of which are commercial and some of which are open-source. Commercial tools can be cost-effective when accessed via AWS in that a user can access a desired tool via AWS and pay only for the time used. Ryland also pointed out that AWS is not limited to single compute nodes and includes cluster management, cloud formation, and cross-cloud capabilities. AWS also uses spot pricing, which allows people to bid on excess computational capacity. That gives users access to computing resources cheaply, but the resource is not reliable; if someone else bids more, the capacity can be taken away and redistributed. Ryland cautioned that such projects must therefore be batch-oriented and must checkpoint their own progress. For instance, MapReduce is designed so that computational nodes can appear and disappear.
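The batch discipline that spot pricing imposes can be sketched as a job that records its progress after each unit of work, so a replacement instance can resume where the reclaimed one stopped (the checkpoint file name and "chunk" work items are invented):

```python
# Hedged sketch: a checkpointed batch job suited to spot capacity. The
# per-item computation is a stand-in; only the resume logic matters.
import json
import os

CHECKPOINT = "progress.json"

if os.path.exists(CHECKPOINT):
    os.remove(CHECKPOINT)  # start the demo from a clean slate

def load_done():
    """Read the set of completed work items, if any prior run got this far."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def run_batch(items, interrupt_after=None):
    """Process items, persisting progress; stop early to mimic a reclaim."""
    done = load_done()
    for n, item in enumerate(i for i in items if i not in done):
        if interrupt_after is not None and n >= interrupt_after:
            return done  # spot capacity reclaimed mid-run
        done.add(item)  # stand-in for the real per-item computation
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)  # checkpoint after every item
    return done

items = [f"chunk-{i}" for i in range(6)]
first = run_batch(items, interrupt_after=3)  # run cut short by a reclaim
second = run_batch(items)                    # replacement instance resumes
print(len(first), len(second))  # → 3 6
```

This is the sense in which MapReduce fits spot capacity naturally: because per-subset results are written out as they complete, a node that disappears costs only its in-flight work.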
Ryland explained that AWS offers a number of other managed services and provides higher-level application program interfaces. These include Kinesis9 (massive-scale data streaming), Data Pipeline10 (managed data-centric workflows), and Redshift11 (data warehousing).
______________
8 See the Computation Institute, University of Chicago, and Argonne National Laboratory, “Globus,” https://www.globus.org/, accessed May 14, 2014, for more information.
9 See Amazon Web Services, “Amazon Kinesis,” http://aws.amazon.com/kinesis/, accessed June 9, 2014, for more information.
10 See Amazon Web Services, “AWS Data Pipeline,” http://aws.amazon.com/datapipeline/, accessed June 9, 2014, for more information.
11 See Amazon Web Services, “Amazon Redshift,” http://aws.amazon.com/redshift/, accessed June 9, 2014, for more information.
AWS has a grants program to benefit students, teachers, and researchers, Ryland said. It is eager to participate in the data science community to help build an education base and realize the resulting benefits. Ryland reported that AWS funds a high percentage of student grant applications and a fair number of teaching grants but few research grants; some of the research grants are high in value, however. In addition to grants, AWS provides spot pricing, volume discounting, and institutional cooperative pricing. In the last, members of a cooperative receive shared pricing; AWS intends to expand its cooperative pricing program.
Ryland explained that AWS provides education and training in the form of online training videos, papers, and hands-on, self-paced laboratories. AWS recently launched a fee-based training course. Ryland indicated that AWS is interested in working with the community to be more collaborative and to aggregate open-source materials and curricula. In response to a question, Ryland clarified that the instruction provided by Amazon covers how to use Amazon’s tools (such as Redshift) and is product-oriented, although the concepts are somewhat general. He said that the instruction is not intended to be revenue-generating, and AWS would be happy to collaborate with the community on the most appropriate coursework.
A workshop participant posited that advanced tools, such as the AWS tools, enable students to use systems to do large-scale computation without fully understanding how it works. Ryland responded that this is a pattern in computer science: a new level of abstraction develops, and a compiled tool is developed. The cloud is an example of that. Ryland posited that precompiled tools should be able to cover 80 percent or more of the use cases although some researchers will need more profound access to the data.