Moving Forward with Ecological Informatics and Reproducibility

Over the last year it has become increasingly apparent to me that ecologists and environmental scientists must take a more active role in providing access to both their data and the analytical techniques used to analyze those data. As our studies become increasingly broad, our analytical capabilities must also expand and, perhaps more importantly, we should be able to share and reproduce complex analyses more easily. My awareness of this need for change is due, in part, to a recent exchange in BioScience that discussed the pros and cons of reproducibility and repeatability in ecology within a framework of intellectual property rights (Cassey and Blackburn 2006; Parr 2007).

Also during this time, I have learned about reproducible research (Gentleman and Lang 2004), attended ecoinformatics training offered by the Science Environment for Ecological Knowledge (SEEK), and had several discussions about reproducibility in ecology (Duke 2006, 2007; Hollister and Walker 2007). The sum of these experiences is an appreciation of the vast array of tools currently available, coupled with a feeling of being a bit overwhelmed and unsure of the best way forward. The purpose of this contribution is to provide an abbreviated list of these tools and to make a few suggestions on how ecologists can move toward a reproducible and repeatable field.

Existing Tools
The list I provide here is biased by my own limited experience during the last year. I have broken the list into two broad categories: Ecological Informatics and Reproducible Research.

1.) Ecological Informatics – I define this broadly as an area that uses existing (e.g., from bioinformatics) and new methods to store, document, access, integrate, and analyze large, complex, and distributed ecological datasets. The new journal Ecological Informatics encompasses this broad definition and includes articles on metadata standards, database technologies, and statistical methodologies. Also, ecoinformatics.org provides open access to relevant software tools. Three examples of Ecological Informatics projects are the Knowledge Network for Biocomplexity, the Kepler Project, and the Analytic Web Project.

  • Knowledge Network for Biocomplexity – The Knowledge Network for Biocomplexity (KNB) is a network designed to facilitate the discovery and analysis of distributed ecological and environmental datasets. The KNB accomplishes this through the use of Ecological Metadata Language (EML), a metadata standard for the ecological sciences; Morpho, user-friendly EML management software; and Metacat, a metadata server system (Andelman et al. 2004).
  • Kepler – The Kepler Project aims to develop a system that facilitates discovery of pertinent datasets, automates integration of those datasets, and streamlines the analysis and reporting of results derived from those data (Pennington and Michener 2005). All of this is accomplished within a single, consistent interface that handles many distributed datasets (e.g., locally stored data and data from the KNB) and analytical tools (e.g., GIS, R, and other ecoinformatics tools).
  • Analytic Webs – Analytic webs are similar to Kepler in that they aim to document and automate much of the scientific endeavor; however, they appear to be distinct in their explicit consideration of “process metadata” (Ellison et al. 2006; Osterweil et al. 2006; Boose et al. In Press). Furthermore, output from both Kepler and the Analytic Web can be integrated with EML (Ellison et al. 2006).

2.) Reproducible Research – Reproducible research, while quite similar to Ecological Informatics, is listed separately because it developed independently. Although the idea of reproducibility dates back to Karl Popper, the current definition of “reproducible research” originated in the computer sciences (Knuth 1992) and is based on the idea that publications (i.e., results, figures, and tables) are merely advertisement of our work; access to the data, analytical techniques, and code is also needed. Reproducible research tools and infrastructure are still evolving, but come mostly from the Open Source world. For instance, by combining R and LaTeX, users can embed data and code with text to create inherently reproducible documents. For more information see the Stanford Exploration Project and the Bioconductor Project. Additionally, Peng et al. (2006) propose basic standards for reproducibility.
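
To make this concrete, here is a minimal sketch of such a document (my own illustrative example, not taken from the papers cited above; the file name is hypothetical and the trees dataset ships with R). A Sweave file interleaves LaTeX text with executable R chunks:

    % example.Rnw -- LaTeX prose with embedded, executable R chunks
    \documentclass{article}
    \begin{document}

    We model tree height as a function of trunk girth.

    <<model, echo=TRUE>>=
    fit <- lm(Height ~ Girth, data = trees)
    summary(fit)$coefficients
    @

    The fitted slope is \Sexpr{round(coef(fit)[2], 2)} feet per inch of girth.

    \end{document}

Running Sweave("example.Rnw") in R executes the chunks and writes example.tex, so every number and table in the compiled document is recomputed directly from the data; the publication can no longer drift away from the analysis that produced it.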

Modest Suggestions:
Many of these tools overlap, and practicing ecologists are left to decide how best to move forward. Toward that end, I make three modest suggestions.

First, I suggest that ecologists embrace the open exchange of data and the reproducibility of analytical methods. Doing so will benefit our discipline by providing tools and datasets appropriate to the broad-scale questions we must address. It will also benefit us individually through new and unexpected research questions and collaborations. Embracing these ideas need not be complicated; it may be as simple as providing a website with reprints, data, and the statistical code used to generate your results. While this approach may not attain the ultimate goals of reproducible research and ecological informatics, it is a step in the right direction and does not preclude future changes in line with larger aspirations. Furthermore, ESA provides the ability to publish data through Ecological Archives and the ESA Data Registry.

Second, I suggest that ecologists support the various ecological informatics research and development efforts. Be willing to collaborate, to test new software tools, and to offer current and future research projects as case studies. I would even suggest that the tools we use go beyond those listed above: tools of the information age (e.g., wikis and blogs) provide excellent opportunities for collaboration and communication (Byrnes 2006). Without a rich community of users and successful examples of how these tools can be used, it will be difficult to reach a consensus on best practices.

Finally, I encourage patience and a willingness to change. Until these tools become more commonplace, a certain amount of trial and error is to be expected.

If our community is willing to accept the premise of ecological informatics and reproducible research (i.e., open access to data and methods), then the diversity of tools is an asset: combined, these diverse approaches will bring us closer to a more open, more collaborative, and more capable discipline.

Contributed by Jeffrey W. Hollister, U.S. Environmental Protection Agency, Office of Research and Development, National Health and Environmental Effects Research Lab, Atlantic Ecology Division

Note: I certainly take the blame for any outrageous and egregious statements made in this contribution; also, I must defer credit for any of the good ideas, as they are not solely mine. Specifically, I have discussed these ideas with Stephen Hale and Henry Walker of the USEPA’s Atlantic Ecology Division, Aaron Ellison of Harvard Forest, and William Michener at the Long Term Ecological Research Network Office. This letter has not been subjected to Agency-level review; therefore, it does not necessarily reflect the views of the agency. This is contribution number AED-07-097 of the Atlantic Ecology Division, National Health and Environmental Effects Research Laboratory.

References

Andelman S.J., Bowles C.M., Willig M.R. and Waide R.B. 2004. Understanding Environmental Complexity through a Distributed Knowledge Network. BioScience 54: 240-246.

Boose E.R., Ellison A.M., Osterweil L.J., Podorozhny R., Clarke L.A., Wise A., Hadley J.L. and Foster D.R. In Press. Ensuring reliable datasets for environmental models and forecasts. Ecological Informatics.

Byrnes J. 2006. Embracing blogs and other tools of the information age.

Cassey P. and Blackburn T.M. 2006. Reproducibility and Repeatability in Ecology. BioScience 56: 958-959.

Duke C.S. 2006. Data: share and share alike. Frontiers in Ecology and the Environment 4: 395.

Duke C.S. 2007. Reply to: Beyond Data: Reproducible Research in Ecology and Environmental Sciences. Frontiers in Ecology and the Environment 5: 67.

Ellison A.M., Osterweil L.J., Clarke L., Hadley J.L., Wise A., Boose E., Foster D.R., Hanson A., Jensen D., Kuzeja P., Riseman E. and Schultz H. 2006. Analytic Webs Support the Synthesis of Ecological Data Sets. Ecology 87: 1345-1358.

Gentleman R. and Lang D.T. 2004. Statistical analysis and reproducible research. Bioconductor Project working paper 2.

Hollister J.W. and Walker H.A. 2007. Beyond Data: Reproducible Research in Ecology and Environmental Science. Frontiers in Ecology and the Environment 5: 11-12.

Knuth D.E. 1992. Literate Programming. Center for the Study of Language and Information, Stanford, CA.

Osterweil L.J., Wise A., Clarke L.A., Ellison A.M., Hadley J.L., Boose E. and Foster D.R. 2006. Process Technology to Facilitate the Conduct of Science. In Li M., Boehm B. and Osterweil L. J. (eds.), Unifying the Software Process Spectrum, pp. 403-415. Springer, Berlin.

Parr C.S. 2007. Open Sourcing Ecological Data. BioScience 57: 309-310.

Peng R.D., Dominici F. and Zeger S.L. 2006. Reproducible Epidemiologic Research. American Journal of Epidemiology 163: 783-789.

Pennington D.D. and Michener W.K. 2005. The EcoGrid and the Kepler Workflow System: a new platform for conducting ecological analyses. Bulletin of the Ecological Society of America 86: 169-176.


2 Comments

  1. Scientists in diverse fields of study are increasingly recognizing the value of transparency and reproducibility. By making data available and computational methods transparent and reproducible, one can anticipate more rapid research advances. It also makes it possible for peer reviewers to examine the data and computations in more detail during the review process, reducing the likelihood of publication of fraudulent results, a journal’s worst nightmare (Laine et al. 2007). The approach is gaining traction in the biomedical field. In addition to researchers learning the practice, for the approach to succeed more broadly, major journals will need to embrace it.

    Henry A. Walker, U.S. EPA, National Health and Environmental Effects Research Laboratory, Atlantic Ecology Division, Narragansett, R.I.

    Laine C. et al. 2007. Reproducible Research: Moving toward Research the Public Can Really Trust. Annals of Internal Medicine 146: 450-453. http://www.annals.org/cgi/content/full/146/6/450

  2. Dr. Hollister: With a bit more participation this could become a very interesting blog. I’m sure there are interesting examples of Ecological Informatics and Reproducible Research out there.

    In addition to KNB, EML, Kepler, and Analytic Webs, there is the basic business of referencing and sharing data.

    I’ve recently come across the “Dataverse Network” (King 2007), which points out the possibility of citing the data used in a paper using a Universal Numeric Fingerprint (UNF). The R package UNF by Micah Altman computes a UNF based on the data. People can then search for the UNF if they want to obtain the identical data set used in a paper, even if it moves to a different URL. After downloading a dataset, one can regenerate the UNF to be sure it is the same data used in the paper.
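
    A rough sketch of the idea in R (assuming the UNF package is installed from CRAN; the data frame below is purely illustrative):

        # fingerprint a dataset, then verify that a downloaded copy is
        # identical (UNF package; the data here are made up)
        library(UNF)

        original <- data.frame(site = c("A", "B", "C"),
                               density = c(12.1, 8.4, 15.0))

        fingerprint <- unf(original)  # depends on the data values, not on
        print(fingerprint)            # the file's name or location

        # after retrieving a copy from wherever the data now live,
        # recompute the UNF; matching fingerprints mean identical data
        downloaded <- original        # stand-in for the downloaded copy
        identical(unf(downloaded), fingerprint)

    A paper would cite the printed UNF string, and readers could recompute it to confirm they hold exactly the data the authors analyzed.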

    Recent open source R packages also help make it possible to share data, scripts, and even cached R computations: Eckel and Peng (2006), Peng (2007), Peng and Eckel (2007).
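
    As a minimal sketch of the stashR workflow (the database directory and keys below are hypothetical; the trees dataset ships with R):

        # stash analysis objects in a local key-value database that can
        # later be served remotely (stashR; names here are hypothetical)
        library(stashR)

        db <- new("localDB", dir = "reproducible-analysis",
                  name = "reproducible-analysis")

        fit <- lm(Height ~ Girth, data = trees)  # the analysis to share
        dbInsert(db, "fit", fit)                 # store the fitted model
        dbInsert(db, "data", trees)              # store the data beside it

        # a reader pointed at the same database retrieves the exact objects
        same.fit <- dbFetch(db, "fit")

    Peng (2007) and Peng and Eckel (2007) build on this with cached computations, so readers can re-run or inspect an analysis piece by piece without recomputing everything.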

    Approaches involving UNFs for data, open source scripts, and cached computations for the analysis and reproduction of published figures and tables will facilitate:
    (1) more critical peer review of research results,
    (2) technology transfer of computational methods that others can adapt, and
    (3) a shift from advertisement of research results and advocacy arguments over alternative interpretations of data, to quantitative weight-of-evidence approaches based on information theory (Burnham and Anderson 2002).

    We can make more credible advances in ecology and ecoinformatics based on (1) more effective data sharing and (2) the adoption of reproducible research approaches.

    References:

    King G. 2007. Dataverse Network. http://gking.harvard.edu/talks/dvn-nsfP.pdf and http://thedata.org/index.html

    Eckel S.P. and Peng R.D. 2006. Interacting with Local and Remote Data Repositories Using the stashR Package. Johns Hopkins University Dept. of Biostatistics Working Paper 127. http://www.bepress.com/cgi/viewcontent.cgi?article=1127&context=jhubiostat

    Peng R.D. 2007. A Reproducible Research Toolkit for R. Johns Hopkins University Dept. of Biostatistics Working Paper 142.

    Peng R.D. and Eckel S.P. 2007. Distributed Reproducible Research Using Cached Computations. Johns Hopkins University Dept. of Biostatistics Working Paper 147. http://www.bepress.com/cgi/viewcontent.cgi?article=1148&context=jhubiostat

    Burnham K.P. and Anderson D.R. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York.

    ————————————————-

    Henry A. Walker, PhD
    EPA ORD NHEERL Atlantic Ecology Division
    Narragansett, R.I. 02882
