Moving Forward with Ecological Informatics and Reproducibility
Over the last year it has become increasingly apparent to me that ecologists and environmental scientists must take a more active role in providing access to both data and the analytical techniques used to analyze those data. As our studies become increasingly broad, our analytical capabilities must also expand and, perhaps more importantly, we should be able to more easily share and reproduce complex analyses. My awareness of this need for change is, in part, due to a recent exchange in BioScience that discussed the pros and cons of reproducibility and repeatability in ecology in a framework of intellectual property rights (Cassey and Blackburn 2006; Parr 2007).
Also during this time, I have learned about reproducible research (Gentleman and Lang 2004), attended an ecoinformatics training offered by the Science Environment for Ecological Knowledge (SEEK), and had several discussions about reproducibility in ecology (Duke 2006, 2007; Hollister and Walker 2007). The sum of these experiences is an appreciation of the vast array of tools currently available, but also a feeling of being a bit overwhelmed and confused as to the best way forward. The purpose of this contribution is to provide an abbreviated list of these tools and make a few suggestions on how ecologists can move toward a reproducible and repeatable field.
The list I provide here is biased by my own limited experience during the last year. I have broken the list into two broad categories: Ecological Informatics and Reproducible Research.
1.) Ecological Informatics – I define this broadly as an area that uses existing (e.g., from bioinformatics) and new methods to store, document, access, integrate, and analyze large, complex, and distributed ecological datasets. The new journal Ecological Informatics encompasses this broad definition and includes articles on metadata standards, database technologies, and statistical methodologies. Also, ecoinformatics.org provides open access to relevant software tools. Three examples of Ecological Informatics projects are the Knowledge Network for Biocomplexity, the Kepler Project, and the Analytic Web Project.
- Knowledge Network for Biocomplexity – The Knowledge Network for Biocomplexity (KNB) is a network designed to facilitate the discovery and analysis of distributed ecological and environmental datasets. The KNB accomplishes this through the use of Ecological Metadata Language (EML), a metadata standard for the ecological sciences; Morpho, user-friendly EML management software; and Metacat, a metadata server system (Andelman et al. 2004).
- Kepler – The Kepler Project aims to develop a system that facilitates discovery of pertinent datasets, automates integration of those datasets, and streamlines the analysis and reporting of results derived from those data (Pennington and Michener 2005). All of this is accomplished within a single, consistent interface that handles many distributed datasets (e.g., locally stored data and data from the KNB) and analytical tools (e.g., GIS, R, and other ecoinformatics tools).
- Analytic Webs – Analytic webs are similar to Kepler in that they aim to document and automate much of the scientific endeavor; however, they appear to be distinctive in their explicit consideration of “process metadata” (Ellison et al. 2006; Osterweil et al. 2006; Boose et al. In Press). Furthermore, output from both Kepler and the Analytic Web can be integrated with EML (Ellison et al. 2006).
2.) Reproducible Research – Reproducible research is quite similar to Ecological Informatics, but I list it separately because it was developed independently. Although the idea of reproducibility dates back to Karl Popper, the current definition of “reproducible research” was developed originally in the computer sciences (Knuth 1992) and is based on the idea that publications (i.e., results, figures, and tables) are merely advertisements of our work; access to the data, analytical techniques, and code is also needed. Reproducible research tools and infrastructure are evolving, but come mostly from the Open Source world. For instance, by combining R and LaTeX, users can embed data and code with text to create inherently reproducible documents. For more information see the Stanford Exploration Project and the Bioconductor Project. Additionally, Peng et al. (2006) propose basic standards for reproducibility.
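To make the idea of embedding data and code in text concrete, here is a minimal sketch of the "weaving" step. It is written in Python rather than the R and LaTeX combination mentioned above, and it is only an illustration of the concept, not how any of the cited tools work. The `weave` function, the `{{ ... }}` chunk syntax, and the plot counts are all hypothetical.

```python
import re
import statistics


def weave(template: str, namespace: dict) -> str:
    """Replace each {{ expression }} chunk in the text with its computed value.

    The chunk syntax is an assumption made for this sketch; real literate
    programming tools define their own delimiters.
    """
    def render(match: re.Match) -> str:
        expression = match.group(1).strip()
        # eval() is acceptable for a trusted, self-authored document;
        # only names supplied in `namespace` are visible to the chunk.
        return str(eval(expression, {"__builtins__": {}}, namespace))

    return re.sub(r"\{\{(.*?)\}\}", render, template)


# Hypothetical data that would travel with the document.
counts = [12, 15, 9, 14]

# The manuscript text holds the code that produces each number,
# so the reported values cannot drift away from the data.
template = (
    "We surveyed {{ len(counts) }} plots; "
    "mean count was {{ round(mean(counts), 1) }}."
)

report = weave(
    template,
    {"counts": counts, "len": len, "round": round, "mean": statistics.mean},
)
print(report)
```

Changing a value in `counts` and rerunning the script regenerates the sentence, which is the property that makes such a document reproducible: the text, the data, and the code that connects them are distributed together.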
Many of these tools overlap and practicing ecologists are left with a decision on how best to move forward. Towards that end, I make three modest suggestions.
First, I suggest that ecologists embrace the open exchange of data and the reproducibility of analytical methods. Doing so will benefit our discipline by providing tools and datasets appropriate to the broad-scale questions we must address. It will also benefit us individually with new and unexpected research questions and collaborations. Furthermore, embracing these ideas need not be complicated and may be as simple as providing a website with reprints, data, and the statistical code used to generate your results. While this approach may not attain the ultimate goals of reproducible research and ecological informatics, it is a step in the right direction and does not preclude future changes that are in line with larger aspirations. In addition, ESA provides the ability to publish data through Ecological Archives and the ESA Data Registry.
Second, I suggest that ecologists support the various ecological informatics research and development efforts. Be willing to collaborate, to test new software tools, and to use current and future research projects as case studies. I would even suggest that the tools we use go beyond those listed above. Tools of the information age (e.g. wikis and blogs) provide excellent opportunities for collaboration and communication (Byrnes 2006). Without a rich community of users and successful examples of how these tools can be used, it will be difficult to reach a consensus.
Finally, I’d encourage patience and a willingness to change. Until these tools become more commonplace, a certain level of trial and error is to be expected.
If our community is willing to accept the premise of ecological informatics and reproducible research (i.e., open access to data and methods), then the diversity of tools is an asset, because the combined approaches will bring us closer to a more open, more collaborative, and more capable discipline.
Contributed by Jeffrey W. Hollister, U.S. Environmental Protection Agency, Office of Research and Development, National Health and Environmental Effects Research Lab, Atlantic Ecology Division
Note: I certainly take the blame for any outrageous and egregious statements made in this contribution; also, I must defer credit for any of the good ideas as they are not solely mine. Specifically, I have discussed these ideas with Stephen Hale and Henry Walker of the USEPA’s Atlantic Ecology Division, Aaron Ellison of Harvard Forest, and William Michener at the Long Term Ecological Research Network Office. This letter has not been subjected to Agency-level review; therefore, it does not necessarily reflect the views of the agency. This is contribution number AED-07-097 of the Atlantic Ecology Division, National Health and Environmental Effects Research Laboratory.
Andelman S.J., Bowles C.M., Willig M.R. and Waide R.B. 2004. Understanding Environmental Complexity through a Distributed Knowledge Network. BioScience 54: 240-246.
Boose E.R., Ellison A.M., Osterweil L.J., Podorozhny R., Clarke L.A., Wise A., Hadley J.L. and Foster D.R. In Press. Ensuring reliable datasets for environmental models and forecasts. Ecological Informatics.
Byrnes J. 2006. Embracing blogs and other tools of the information age.
Cassey P. and Blackburn T.M. 2006. Reproducibility and Repeatability in Ecology. BioScience 56: 958-959.
Duke C.S. 2006. Data: share and share alike. Frontiers in Ecology and the Environment 4: 395.
Duke C.S. 2007. Reply to: Beyond Data: Reproducible Research in Ecology and Environmental Sciences. Frontiers in Ecology and the Environment 5: 67.
Ellison A.M., Osterweil L.J., Clarke L., Hadley J.L., Wise A., Boose E., Foster D.R., Hanson A., Jensen D., Kuzeja P., Riseman E. and Schultz H. 2006. Analytic Webs Support the Synthesis of Ecological Data Sets. Ecology 87: 1345-1358.
Gentleman R. and Lang D.T. 2004. Statistical analysis and reproducible research. Bioconductor Project working paper 2.
Hollister J.W. and Walker H.A. 2007. Beyond Data: Reproducible Research in Ecology and Environmental Science. Frontiers in Ecology and the Environment 5: 11-12.
Knuth D.E. 1992. Literate programming. Center for the Study of Language and Information, Stanford, CA.
Osterweil L.J., Wise A., Clarke L.A., Ellison A.M., Hadley J.L., Boose E. and Foster D.R. 2006. Process Technology to Facilitate the Conduct of Science. In Li M., Boehm B. and Osterweil L. J. (eds.), Unifying the Software Process Spectrum, pp. 403-415. Springer, Berlin.
Parr C.S. 2007. Open Sourcing Ecological Data. BioScience 57: 309-310.
Peng R.D., Dominici F. and Zeger S.L. 2006. Reproducible Epidemiologic Research. American Journal of Epidemiology 163: 783-789.
Pennington D.D. and Michener W.K. 2005. The EcoGrid and the Kepler Workflow System: a new platform for conducting ecological analyses. Bulletin of the Ecological Society of America 86: 169-176.