What do we mean by open data?

There is a growing need for, and growth of, open data in biomedicine. But what do we actually mean by open data? The term can mean different things to different people. With increasing numbers of ‘open data’ initiatives across the world wide web, particularly in governmental data, we risk confusing the concept of public accessibility (free access to data) with that of interoperability and integration, and ensuring data are reusable and redistributable1.

This is analogous to the challenge faced – arguably, still faced – by the open access movement. The Berlin Declaration on Open Access to Scientific Knowledge2 stipulated that open access articles should not only be freely and permanently available, but also be free for others to reuse, redistribute and make derivative works from. But many publishers continue to assert that content is open access when there are a variety of restrictions on reuse, particularly regarding commercial use3.

The Open Knowledge Foundation4, Creative Commons5, Panton Principles for Open Data in Science6, and the open access publisher BioMed Central7, have all expressed that open data should mean more than data being freely accessible. Science depends on reproducibility of results, and being able to build on previous findings and reuse data to drive new discoveries without legal or technical impediments. This ideally requires data to be placed explicitly in the public domain by the application of an appropriate licence or waiver of rights specific to data, such as the Creative Commons CC0 waiver8.

Why does public domain dedication of data matter? Licences that legally require attribution for reuse, such as Creative Commons attribution licences, can lead to unmanageable attribution requirements for data gathered from multiple sources9. Imagine all of the thousands of contributors to the data in the Human Genome Project asserting legal rights of attribution or transfer agreements each time a post-doc researcher runs a query on the database, and you begin to see how this could become problematic.
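The way attribution obligations accumulate can be sketched in a few lines of Python. This is a minimal illustration only: the source names, contributor counts and the simplified two-value licence field are all hypothetical, and real licence terms are more nuanced than a single label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    licence: str        # simplified label: "CC-BY" (attribution required) or "CC0" (rights waived)
    contributors: int   # people who would have to be credited if attribution is required


def attribution_burden(sources):
    """Count the contributors a reuser would be legally obliged to credit."""
    return sum(s.contributors for s in sources if s.licence != "CC0")


# Hypothetical sources, invented for illustration only.
sources = [
    DataSource("sequencing centre A", "CC-BY", 1200),
    DataSource("sequencing centre B", "CC-BY", 850),
    DataSource("curated annotations", "CC0", 300),
]

print(attribution_burden(sources))
```

Under these assumptions, every additional attribution-licensed source adds to the list of credits that must accompany each reuse, whereas sources released under a waiver add nothing to the legal burden (although crediting them remains good scholarly practice).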

Drivers for open data in biomedicine

A number of reports10,11 and surveys12 have identified benefits of data sharing in the life sciences for the public good, economy and for the advancement of knowledge. Sharing detailed research data has been associated with increased citations13, but substantial empirical evidence of the benefits and rewards of data sharing and publication for the individual scientist is still being gathered. Meanwhile, policies and mandates from funding agencies, institutions and journals are key drivers for change in researcher (author) behaviour.

A growing number of biomedical research funding agencies have data sharing policies (see Table 1). And in January 2011, 17 major international health research funding agencies, including the World Health Organization and the Bill and Melinda Gates Foundation, committed to working together to support data sharing14. This is logical: data are the main product of the investment of research funding agency grants, which are often funded by public money, so preserving and archiving raw data in a reusable form maximizes their value15.

Creating at least one published, citable article about a scientific research project – preferably in a high-ranking journal – remains an essential part of the research lifecycle. Journals and their submission or publication requirements can influence author behaviour, as authors endeavour to meet the demands of the editors of their preferred publication. Of course, journal editorial policies are usually consensus driven, and motivated by meeting the needs of the scientific communities and audiences they serve. Journal policies have proven to be effective in changing author behaviour, such as requiring the prospective registration of clinical trials16.

The leading life-science journal Nature requires that, as a condition of publication, its authors ‘make materials, data and associated protocols promptly available to readers without undue qualifications in material transfer agreements’ and that supporting data be available to editors and peer reviewers after submission. It also specifies how it deals with infringements of the policy, which includes publishing corrections or refusing publication17. But Nature is a high-impact journal with substantial resources, so how do less well-established publications treat this issue? BioMed Central, which publishes more than 200 journals across biology and medicine, requires that authors confirm on submission that they will provide data to other scientists on request18, and the Public Library of Science author information, even more strongly, states that ‘publication is conditional upon the agreement of authors to make freely available any materials and information associated with their publication’19.

The high-ranking clinical medical journals Annals of Internal Medicine20 and the BMJ21 take an alternative approach. They require a statement as to the availability of supporting data rather than implying data sharing as a condition of submission or publication.

Journal policies have been associated with increased sharing of genetic sequence data (where a number of well-established repositories for the data exist)22, but studies of policies covering other data types have found low compliance rates (25% from 141 published articles in psychology23 and one in ten from a sample of ten published clinical trials24).

A solution has been proposed in the form of the Joint Data Archiving Policy, which has been signed by a consortium of journals in ecology and evolutionary biology and requires that supporting data sets be archived in ‘an appropriate public repository’ and a link to the supporting data set(s) be included in the published article25. The Dryad repository is one such appropriate repository, which will host myriad data file types (unlike highly structured databases such as GenBank), promotes data citation by assigning digital object identifiers (DOIs) to data sets, and promotes reusability by requiring CC0 as its default waiver for published data sets26.

The need for open data in biomedical publishing

Reporting bias and distortion of the evidence base

The mission statements of medical journals often aspire to improving clinical decision-making or human health and wellbeing, but lack of access to data underlying publications, and data generated during clinical trials, can have the opposite outcome. Suppression of data potentially relevant to human health for monetary gain by pharmaceutical companies is indefensible, but – albeit inadvertently or subconsciously – incremental contributions of editors, journals and peer reviewers27 during the publication process may also be distorting the clinical evidence base and, consequently, having deleterious effects on human health.

In autumn 2010, the widely prescribed antidepressant reboxetine was found to be ineffective or potentially harmful when previously unpublished data, from an alarming 74% (3033/4098) of patients from 13 clinical trials, were included in a systematic review and meta-analysis28. This is the latest in a number of high-profile cases (including celecoxib29 and rosiglitazone30) of opacity in raw clinical data leading to reporting bias. Reporting bias is the phenomenon whereby articles reporting results favouring the medical intervention being studied (i.e. positive rather than negative results) are more likely to be published – and more likely to be published quickly, in high-impact journals and published multiple times. (For a review, see McGauran et al.31.) Open data in medicine will enable journals and publishers to better fulfil their aims of advancing science and medicine by enabling more balanced and transparent reporting of research which will, ultimately, benefit human health.

In non-human biology the content of published articles may seem less immediately able to impact human health, but across science the sharing and publication of raw data would logically be predicted to reduce the potential for error and fraud32.

The unique challenge of human subjects' research

Open medical data has much potential, but publishing data that have arisen from the doctor–patient relationship inherently carries risks to individual privacy, unless explicit consent for publication has been obtained. This is an issue for publishers, editors and journals, and indeed all those involved in the data acquisition and dissemination process, given the implications under privacy and data protection laws. In the age of open access and open data, de-identification of personal data for publication (where consent has not been obtained) is challenging. Published guidelines for authors, editors and peer reviewers33 of clinical data sets recommend that data sets including three or more indirect identifiers, such as gender or ethnicity, should be independently reviewed to assess the risk of patients being identified (see Table 2).

Ethnicity          Occupation    Place of treatment
White British      Doctor        London (England)
Black Caribbean    Judge         Paisley (Scotland)

Table 2. Is the second hypothetical patient anonymous with certainty? The example shows how just three indirect identifiers, which in isolation would be no cause for concern, could potentially put privacy at risk when associated with an individual.
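The risk illustrated in Table 2 can be made concrete with a short Python sketch that counts how many records in a data set share a patient's indirect identifiers, first one identifier at a time and then in combination. The six records below are hypothetical and invented for illustration (only the identifier values are borrowed from Table 2), and this is a simplified check, not the review procedure recommended by the published guidelines.

```python
FIELDS = ("ethnicity", "occupation", "place_of_treatment")

# Hypothetical records, invented for illustration only.
records = [
    {"ethnicity": "White British", "occupation": "Doctor", "place_of_treatment": "London (England)"},
    {"ethnicity": "White British", "occupation": "Doctor", "place_of_treatment": "London (England)"},
    {"ethnicity": "White British", "occupation": "Judge", "place_of_treatment": "London (England)"},
    {"ethnicity": "Black Caribbean", "occupation": "Doctor", "place_of_treatment": "London (England)"},
    {"ethnicity": "Black Caribbean", "occupation": "Judge", "place_of_treatment": "Paisley (Scotland)"},
    {"ethnicity": "White British", "occupation": "Doctor", "place_of_treatment": "Paisley (Scotland)"},
]


def count_matches(records, target, fields):
    """Count records that agree with `target` on every field in `fields`."""
    return sum(all(r[f] == target[f] for f in fields) for r in records)


target = records[4]  # the second patient from Table 2

# Each identifier on its own is shared with at least one other record...
for field in FIELDS:
    print(field, count_matches(records, target, [field]))

# ...but the combination of all three matches only the target record itself.
print("all three", count_matches(records, target, FIELDS))
```

In this invented data set each identifier taken alone matches two records, but the three together match only one, which is why combinations of indirect identifiers, rather than individual fields, are what need to be reviewed before publication.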

The same principle of privacy protection is true of publishing medical case reports, which now usually requires explicit consent for publication from living individuals described in cases34.

Lessons in open data from (genome) biology

Inter-disciplinary and international research, and research conducted jointly by academia and industry, is growing, in part facilitated by the web and open access. The Human Genome Project was ‘a watershed moment’ for open sharing of scientific data across boundaries, as many pharmaceutical companies backed this collaborative effort instead of their proprietary projects35. Despite human genome data driving new, commercially valuable, drug targets and discoveries (as well as discoveries in ecology, agriculture and beyond), participating commercial entities recognize that data are just the beginning of the drug-discovery process. They further recognize there is more to be gained from sharing without exerting intellectual property or patents in early stages of data collection. Indeed, scientific web services – machines – depend on immediate and unfettered access to data, as exemplified by the GenBank database. Furthermore, the genomics community, via the Bermuda Principles, has agreed built-in temporal latencies that set out when data should be released, and when rights restricting use are removed, allowing researchers defined periods (e.g. 12 months) for exclusive use of data for their projects – and papers36.

The Sage Bionetworks initiative hopes to transfer the access principles ingrained in the Human Genome Project to human disease biology and biological networks (the study of changes at the molecular level linked to disease symptoms and traits). Sage Bionetworks aims to ‘be the steward of the data and associated systems’ and produce networked models of disease (from genomic, proteomic, metabolomic and clinical data), for several currently fragmented fields with no common repository for data. Importantly, this initiative is committed to ensuring all data are in the public domain by waiving all database and other rights to ensure reusability without restrictions37.

Exploring the role of publishers in open data

Online publishers are service providers, who must respond to the needs of today's scientists to facilitate rapid dissemination and transfer of knowledge and, invariably, to stay in business. So changes in scientists' behaviour, such as those set out above, are important for publishers. For example, there are growing numbers of institutional, funder and scientific subject-specific repositories for data. Publishers such as BioMed Central are responding to this and developing links to data from published articles, integrating data viewing software with their content, and participating in initiatives to agree best practices for data publication and citation38. Some journal publishers are also, effectively, data publishers, by hosting online supplementary data files. Publishing supplementary material has been the source of much debate in recent months, as some journals have claimed it puts too many demands on peer reviewers, or that it moves important material from the article to non-printed supplements. However, I would argue it is unrealistic to expect every reviewer of an article to reanalyze supplementary or repository-held data, and for online open access journals space is virtually unlimited. Furthermore, many biomedical sub-domains are yet to routinely post data in a repository (as is common in genomics) or indeed have a repository in which to deposit their data, making online supplementary files an important interim venue for data39.

Gaining academic credit (in the form of citations) for data sharing remains a challenge. Publishers such as BioMed Central and the Ecological Society of America are addressing this issue by offering publication of ‘data notes’ (in BMC Research Notes) or ‘data papers’ (in Ecological Archives). These articles put a biomedical data set or database at the core of the publication, with the peer-reviewed journal article acting more as a wrap-around for the data, such that it is discoverable, indexable and citable via standard scholarly search engines and databases. BMC Research Notes has taken this concept further by offering to publish, as educational articles, biomedical domain-specific data standards (agreed ways of presenting and formatting biomedical data so that it is readily reusable and machine-readable)40.

Via their interactions with different biomedical specialities, publishers are in a good position to share best practices across disciplinary boundaries, and identify – and work with – scientists with interests in open data. Novel open access journals such as Trials (http://www.trialsjournal.com), which puts a special emphasis on data sharing and publishing of all clinical trial results regardless of outcome (whether positive or negative), and BioData Mining (http://www.biodatamining.org/), which focuses on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic and metabolomic data, are products of such a strategy.

The future of scholarly communication

With increasing availability of raw data, the scientific record itself – albeit slowly – is changing. New platforms for sharing, publishing and linking data to publications are being developed, making data more integral to the scientific record – traditionally a collection of documents (journal articles). A thought-provoking essay, in the open access book on data-intensive science, The Fourth Paradigm, envisages instantaneous translation of new medical discoveries into clinical practice – a ‘healthcare singularity’ – by around 202541. Gillam and colleagues envisage doctors accessing, via their smartphones, data in real time that are generated from patients' electronic health records, linked to clinical evidence databases, genomic information, drug resistance and availability data, which could further be linked to records of ongoing clinical trials42. All of which, combined, will inform more effective and personalized treatment. A fantastical concept? Perhaps, but the US Department of Health and Human Services Open Government strategy to expand health data access is already calling for patient data to be available in standardized, reusable formats43. Moreover, personal health record platforms such as Microsoft HealthVault and Google Health, and platforms for sharing such as patientslikeme.com, are already enabling patients to control sharing of their health information in secure ‘clouds’, potentially making a data-driven scholarly record a reality sooner than might, at first glance, seem plausible.