8.1: Open Data

[Establish] open data as a standard practice across all of the sciences

Other Information:

With the massive growth in data and the increased ease of making it available, calls for open data as a standard practice are occurring across all of the sciences (Freese, 2007; King, 2006, 2007; Schofield et al., 2009; Stodden, 2011; Wicherts, 2011; Wicherts & Bakker, 2012)... Arguments for open data cite the ability to confirm, critique, or extend prior research (Smith, Budzieka, Edwards, Johnson, & Bearse, 1986; Wicherts, Borsboom, Kats, & Molenaar, 2006; Wolins, 1962); the opportunity to reanalyze prior data with new techniques (Bryant & Wortman, 1978; Hedrick, Boruch, & Ross, 1978; Nosek & Bar-Anan, 2012; Poldrack et al., 2011; Stock & Kulhavy, 1989); the increased ability to aggregate data across multiple investigations for improved confidence in research findings (Hrynaszkiewicz, 2010; Rothstein, Sutton, & Borenstein, 2006; Yarkoni, Poldrack, Van Essen, & Wager, 2010); the opportunity for novel methodologies and insights through aggregation and big data (Poldrack et al., 2011); and the fact that openness and transparency increase the credibility of science and its findings (Vision, 2010)...

The rate of errors in published research is unknown, but a study by Bakker and Wicherts (2011) is breathtaking. They reviewed 281 articles and found that 15% contained statistical conclusions that were incorrect: reporting a significant result (p < .05) that was not, or vice versa. Their investigation could catch only statistical errors that were detectable in the articles themselves. Errors can also occur in data coding, data cleaning, data analysis, and result reporting, none of which can be detected from the summary report alone. For example, a study of sample mix-ups in genome-wide association studies found evidence that every original data set examined had at least one sample mix-up error, that the total error rate was 3%, and that the worst-performing paper, published in a highly prestigious outlet, had 23% of its samples categorized erroneously (Westra et al., 2011). Further, correcting these errors substantially improved the sensitivity of identifying markers in the data sets.

Making data openly available increases the likelihood of finding and correcting errors, ultimately improving reported results. At the same time, it improves the potential for aggregating raw data for research synthesis (Cooper, Hedges, & Valentine, 2009), it opens opportunities to apply the same data in ways the original authors may not have pursued, and it creates a new avenue for citation credit and reputation building (Piwowar, 2011; Piwowar, Day, & Fridsma, 2007). Researchers who create useful data sets can be credited for that contribution beyond their own uses of the data.
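
Bakker and Wicherts could detect these reporting errors because a reported test statistic, its degrees of freedom, and its p-value must be mutually consistent. The following Python sketch illustrates that kind of consistency check for a t-test; the function name, tolerance, and example numbers are illustrative assumptions, not their actual procedure.

    # Recompute a two-tailed p-value from a reported t-statistic and degrees
    # of freedom, then flag cases where the reported p-value, or the reported
    # significance decision at alpha, disagrees with the recomputation.
    # Hypothetical helper; the tolerance and example values are assumptions.
    from scipy import stats

    def check_reported_t(t_value, df, reported_p, alpha=0.05, tol=0.005):
        recomputed_p = 2 * stats.t.sf(abs(t_value), df)
        return {
            "recomputed_p": recomputed_p,
            # Reported and recomputed p differ by more than rounding error.
            "p_mismatch": abs(recomputed_p - reported_p) > tol,
            # The discrepancy flips the conclusion at the alpha level.
            "decision_error": (reported_p < alpha) != (recomputed_p < alpha),
        }

    # Example: an article reports t(28) = 2.02, p = .03. Recomputation gives
    # p of roughly .053, so the reported "significant" result is flagged.
    print(check_reported_t(t_value=2.02, df=28, reported_p=0.03))

A check like this is limited to what the article reports; errors in data coding, cleaning, or analysis remain invisible without the underlying data, which is precisely the argument for openness.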

Stakeholder(s):

  • Sciences

  • Human Genome Project: For example, the Human Genome Project acknowledges its principle of rapid, unrestricted release of prepublication data as a major factor in its enormous success in spurring scientific publication and progress (Lander et al., 2001).

  • Psychologists: The concerns about credibility may be well founded. In one study, only 27% of psychologists shared at least some of their data upon request to confirm the original results, even though APA ethics policies required data sharing in such circumstances (Wicherts et al., 2006; see also Pienta, Gutmann, & Lyle, 2009). Further, Wicherts et al. (2011) found that reluctance to share published data was associated with weaker evidence against the null hypothesis and more apparent errors in statistical analysis, particularly errors that made a difference for statistical significance. This illustrates the conflict between personal interests and scientific progress: the short-term benefit of avoiding identification of one's errors dominated the long-term cost of those errors remaining in the scientific literature.

  • Research Infrastructure Projects: Movement toward open data is occurring rapidly, and many infrastructure projects are making it easier to share data. There are field-specific options for neuroimaging data, such as OpenfMRI (http://www.openfmri.org/; Poldrack et al., 2011), INDI (http://fcon_1000.projects.nitrc.org/), and OASIS (http://www.oasis-brains.org/), as well as field-general options, such as the Dataverse Network Project (http://thedata.org/) and Dryad (http://datadryad.org/).

  • OpenfMRI: http://www.openfmri.org/

  • INDI: http://fcon_1000.projects.nitrc.org/

  • OASIS: http://www.oasis-brains.org/

  • Dataverse Network Project: http://thedata.org/

  • Dryad: http://datadryad.org/

  • Journals: Some journals are beginning to require data deposit as a condition of publication (Alsheikh-Ali, Qureshi, Al-Mallah, & Ioannidis, 2011).

  • Funding Agencies: Likewise, funding agencies and professional societies are encouraging or requiring data availability postpublication (National Institutes of Health, 2003; National Science Foundation, 2011; PLoS ONE, n.d.).

  • Professional Societies

  • National Institutes of Health

  • National Science Foundation

  • PLoS ONE

  • Researchers: Of course, although some barriers to sharing, such as concerns that others might identify errors, are difficult to justify, others are reasonable (Smith et al., 1986; Stodden, 2010; Wicherts & Bakker, 2012). Researchers may not have a strong ethic of data archiving for past research; the data may simply no longer be available. Often, data that are available are not formatted for easy comprehension and sharing, and preparing data takes additional time (though much less so if the researcher plans to share the data from the outset of the project; see the sketch following this list).

  • Research Participants: Further, there are exceptions to blanket openness, such as the inability to ensure confidentiality of participant identities, legal barriers (e.g., copyright), and occasions in which it is reasonable to delay openness, for instance when data collection effort is intense and the data set is to be the basis for multiple research projects (American Psychological Association, 2010; National Institutes of Health, 2003; National Science Foundation, 2011). The key point is that these are exceptions: default practice can shift to openness while guidelines are developed for justifying keeping data closed or delaying their release (Stodden, 2010).

  • American Psychological Association
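
As noted in the Researchers entry above, sharing is far cheaper when data are prepared for it from the outset. One minimal sketch of that practice, assuming nothing beyond the Python standard library, is to pair a tidy data file with a machine-readable codebook so that others can interpret the variables without contacting the original authors. The file names, variables, and descriptions below are hypothetical, not drawn from any of the cited sources.

    # Write a tidy CSV alongside a machine-readable codebook describing
    # each variable. All names, values, and descriptions are hypothetical.
    import csv
    import json

    rows = [
        {"participant_id": 1, "condition": "open", "score": 42},
        {"participant_id": 2, "condition": "closed", "score": 37},
    ]

    codebook = {
        "participant_id": "Anonymized identifier (protects confidentiality).",
        "condition": "Experimental condition: 'open' or 'closed'.",
        "score": "Total score on the outcome measure, 0-60.",
    }

    with open("study1_data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

    with open("study1_codebook.json", "w") as f:
        json.dump(codebook, f, indent=2)

Documenting variables at collection time also makes it easier to honor the exceptions above, such as stripping identifiers before deposit to protect participant confidentiality.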

Indicator(s):