8.1: Open Data
[Establish] open data as a standard practice across all of the sciences

Other Information:
With the massive growth in data and increased ease of making it available, calls for open data as a standard practice are
occurring across all of the sciences (Freese, 2007; King, 2006, 2007; Schofield et al., 2009; Stodden, 2011; Wicherts, 2011;
Wicherts & Bakker, 2012)... Arguments for open data cite the ability to confirm, critique, or extend prior research (Smith,
Budzieka, Edwards, Johnson, & Bearse, 1986; Wicherts, Borsboom, Kats, & Molenaar, 2006; Wolins, 1962), the opportunity to
reanalyze prior data with new techniques (Bryant & Wortman, 1978; Hedrick, Boruch, & Ross, 1978; Nosek & Bar-Anan, 2012; Poldrack
et al., 2011; Stock & Kulhavy, 1989), increased ability to aggregate data across multiple investigations for improved confidence
in research findings (Hrynaszkiewicz, 2010; Rothstein, Sutton, & Borenstein, 2006; Yarkoni, Poldrack, Van Essen, & Wager,
2010), the opportunity for novel methodologies and insights through aggregation and big data (Poldrack et al., 2011), and
that openness and transparency increase the credibility of science and its findings (Vision, 2010)...

The rate of errors in published research is unknown, but a study by Bakker and Wicherts (2011) is breathtaking. They reviewed 281 articles and found that
15% contained a statistical conclusion that was incorrect: a result reported as statistically significant (p < .05) that was not, or vice versa.
Their investigation could only catch statistical errors that were detectable in the articles themselves. Errors can also occur
in data coding, data cleaning, data analysis, and result reporting. None of these can be detected from the summary report alone.
For example, a study of sample mix-ups in genome-wide association studies found evidence that every original
data set examined had at least one sample mix-up error, that the total error rate was 3%, and that the worst-performing paper, published
in a highly prestigious outlet, had 23% of its samples categorized erroneously (Westra et al., 2011). Further, correcting these
errors substantially improved the sensitivity of identifying markers in the data sets. Making data openly available
increases the likelihood of finding and correcting errors and ultimately improving reported results. Simultaneously, it improves
the potential for aggregation of raw data for research synthesis (Cooper, Hedges, & Valentine, 2009), it presents opportunities
for applications with the same data that may not have been pursued by the original authors, and it creates a new opportunity
for citation credit and reputation building (Piwowar, 2011; Piwowar, Day, & Fridsma, 2007). Researchers who create useful
data sets can be credited for the contribution beyond their own uses of the data.
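The consistency check that Bakker and Wicherts performed is mechanical enough to sketch in code. The Python fragment below, which assumes SciPy is available, recomputes a two-tailed p value from a reported t statistic and its degrees of freedom, then flags results whose reported significance (p < .05) disagrees with the recomputed value. The example records are hypothetical stand-ins for values extracted from published articles, not data from any actual study.

```python
# Minimal sketch of a statistical-reporting consistency check:
# recompute a two-tailed p value from a reported t statistic and
# degrees of freedom, then flag reported/recomputed disagreements.
from scipy import stats

def recomputed_p(t_value: float, df: int) -> float:
    """Two-tailed p value for a t statistic with df degrees of freedom."""
    return 2 * stats.t.sf(abs(t_value), df)

reported_results = [
    # (label, t, df, reported p) -- hypothetical values for illustration
    ("Study A", 2.20, 30, 0.036),  # consistent with the reported p
    ("Study B", 1.70, 30, 0.040),  # reported significant; recomputes as not
]

for label, t_value, df, p_reported in reported_results:
    p_check = recomputed_p(t_value, df)
    mismatch = (p_reported < 0.05) != (p_check < 0.05)
    print(f"{label}: reported p={p_reported:.3f}, "
          f"recomputed p={p_check:.3f}, "
          f"{'INCONSISTENT' if mismatch else 'consistent'}")
```

Checks like this catch only errors visible in the summary statistics themselves; errors in coding, cleaning, or analysis can be found only with access to the underlying data.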
Stakeholder(s):
- Sciences
- Human Genome Project: For example, the Human Genome Project credits its principle of rapid, unrestricted release of prepublication data as
a major factor in its enormous success in spurring scientific publication and progress (Lander et al., 2001).
- Psychologists: The concerns about credibility may be well founded. In one study, only 27% of psychologists shared at least some of their
data when it was requested to confirm the original results, even though APA ethics policy required data sharing in such circumstances
(Wicherts et al., 2006; see also Pienta, Gutmann, & Lyle, 2009). Further, Wicherts et al. (2011) found that reluctance to
share published data was associated with weaker evidence against the null hypothesis and with more apparent errors in statistical
analysis, particularly errors that made a difference for statistical significance. This illustrates the conflict between personal
interests and scientific progress: the short-term benefit of avoiding identification of one's errors dominated the long-term
cost of those errors remaining in the scientific literature.
- Research Infrastructure Projects: Movement toward open data is occurring rapidly, and many infrastructure projects are making it easier to share data. There are
field-specific options such as OpenfMRI (http://www.openfmri.org/; Poldrack et al., 2011), INDI (http://fcon_1000.projects.nitrc.org/),
and OASIS (http://www.oasis-brains.org/) for neuroimaging data, and there are field-general options such as the Dataverse
Network Project (http://thedata.org/) and Dryad (http://datadryad.org/).
- OpenfMRI: http://www.openfmri.org/
- INDI: http://fcon_1000.projects.nitrc.org/
- OASIS: http://www.oasis-brains.org/
- Dataverse Network Project: http://thedata.org/
- Dryad: http://datadryad.org/
- Journals: Some journals are beginning to require data deposit as a condition of publication (Alsheikh-Ali, Qureshi, Al-Mallah, & Ioannidis,
2011).
- Funding Agencies: Likewise, funding agencies and professional societies are encouraging or requiring data availability postpublication (National
Institutes of Health, 2003; National Science Foundation, 2011; PLoS ONE, n.d.).
- Professional Societies
- National Institutes of Health
- National Science Foundation
- PLoS ONE
- Researchers: Of course, although some barriers to sharing are difficult to justify (such as concerns that others might identify errors), others
are reasonable (Smith et al., 1986; Stodden, 2010; Wicherts & Bakker, 2012). Researchers may not have a strong ethic of archiving
data from past research; the data may simply no longer be available. Often, data that are available are not formatted
for easy comprehension and sharing. Preparing data takes additional time, though much less so if the researcher plans to share
the data from the outset of the project (the sketch following this list illustrates one lightweight way to do so).
- Research Participants: Further, there are exceptions to blanket openness, such as the inability to ensure confidentiality of participant identities,
legal barriers (e.g., copyright), and occasions when it is reasonable to delay openness, such as when data collection effort
is intense and the data set is to be the basis for multiple research projects (American Psychological Association, 2010; National
Institutes of Health, 2003; National Science Foundation, 2011). The key point is that these are exceptions. Default practice
can shift to openness while guidelines are developed for justifying keeping data closed or delaying their release (Stodden,
2010).
- American Psychological Association
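To make the Researchers item above concrete, here is a minimal sketch, in Python using only the standard library, of packaging a data set for deposit in a repository such as Dataverse or Dryad. It drops direct identifiers and replaces participant IDs with salted one-way hashes (addressing the confidentiality concern in the Research Participants item), then writes a small codebook and a checksum manifest alongside the data. All file names, column names, and the salt handling are hypothetical; real de-identification and deposit requirements depend on the IRB, applicable law, and the target repository's standards.

```python
# Minimal sketch: prepare a raw participant CSV for open deposit.
# File and column names below are hypothetical placeholders.
import csv
import hashlib
import json
from pathlib import Path

RAW = Path("raw/participants.csv")       # hypothetical input file
OUT = Path("share")                      # directory for the deposit package
DIRECT_IDENTIFIERS = {"name", "email"}   # columns to drop entirely
SALT = "replace-with-a-secret-salt"      # keep private; never deposit this

def pseudonymize(participant_id: str) -> str:
    """Replace a participant ID with a salted, one-way hash."""
    return hashlib.sha256((SALT + participant_id).encode()).hexdigest()[:12]

OUT.mkdir(exist_ok=True)
with RAW.open(newline="") as src, (OUT / "data.csv").open("w", newline="") as dst:
    reader = csv.DictReader(src)
    kept = [c for c in reader.fieldnames if c not in DIRECT_IDENTIFIERS]
    writer = csv.DictWriter(dst, fieldnames=kept)
    writer.writeheader()
    for row in reader:
        row = {c: row[c] for c in kept}
        # Assumes a "participant_id" column exists in the raw file.
        row["participant_id"] = pseudonymize(row["participant_id"])
        writer.writerow(row)

# A codebook so reusers can interpret the columns (contents hypothetical).
codebook = {"participant_id": "salted hash of the original participant ID",
            "condition": "experimental condition (1 = treatment, 0 = control)"}
(OUT / "codebook.json").write_text(json.dumps(codebook, indent=2))

# Checksums let anyone verify that the deposited files are intact.
manifest = {p.name: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(OUT.iterdir()) if p.name != "MANIFEST.json"}
(OUT / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))
```

Doing this at the outset of a project, rather than retroactively, is what makes the preparation cost small: the codebook and packaging script grow alongside the data instead of being reconstructed years later.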