Open Data

This page provides information on BMC’s policy on Open Data, which came into effect on 3rd September 2013 following our 2012 public consultation on Open Data. The outcome of that consultation is reported in this article in BMC Research Notes.

Legal restrictions and uncertainties surrounding scientific data are a barrier to efficient data sharing and reuse, and ultimately slow the pace of research. Copyright in particular is problematic for data. It is often unclear whether data are protected by copyright, and the law differs greatly internationally. All open access articles published by BMC, unless otherwise stated, are published under the Creative Commons Attribution License, CC-BY 4.0. But where copyright does not apply, neither does the Creative Commons Attribution License.

Our Open Data policy aims to clarify the legal (copyright) status of data published in our open access journals and to maximize the potential for reuse of published science, such as data and text mining. For society to gain the full benefit from scientific data, the data need to be available in a form that can be reused, scrutinized and built upon with the minimum of barriers – in accordance with the Panton Principles for Open Data in Science. This means enabling reuse of data without needing special permission from its original creators by waiving copyright and related rights in published data. To achieve this, unless otherwise stated in an individual article’s license, data included in BMC’s published open access articles are distributed under the Creative Commons CC0 1.0 Public Domain Dedication waiver. Anyone reusing data published in BMC journals must, wherever possible, cite the source(s) of the data in a derivative work, although this is not a legal requirement. The Creative Commons CC0 waiver applies to data included in the article, its reference list(s) and its additional files. We have described in detail the case for using Creative Commons CC0 for data in a 2012 article in BMC Research Notes.

What content does the Open Data policy apply to?

The Open Data policy and Creative Commons CC0 waiver apply, unless otherwise stated, to data. A number of file types obviously pertain to data, but comprehensively defining them is not currently feasible. Our Open Data policy allows re-users of data in our journals (humans and machines) to interpret the license according to their own – likely very good – understanding of what counts as data in their area of research. In the future, technology will further enhance our process of attaching licenses to different parts of published articles. The table below is a guide for those seeking definitions of data, and it will evolve over time.

To recommend updates and additions, please contact us.

	Table: Examples of data published in journal articles and their additional files
File/content type	Explanation
Material submitted as additional files
XML	XML is a widely used standard for data transfer in science, with many domain specific extensions forming data standards and exchange formats, such as Gating-ML in flow cytometry experiments.
CSV	CSV is an open file format used commonly for data tables and spreadsheets.
XLS/XLSX	XLS is a proprietary spreadsheet file format which opens with Microsoft Excel, but is widely available and used for publishing scientific data.
RDF	Resource Description Framework (RDF) is a W3C standard for encoding data and metadata on the web.
Material contained within the full text of papers
Tables	Individual data elements, predominantly numbers, organized in columns and rows are a representation of facts and should be considered data.
Bibliographic data	Factual information which identifies a scientific publication including authors, titles, publication date and identifiers should be considered data. Applies to individual articles and their reference lists.
Graphs and graphical data points	Software can harvest data points underlying graphs and charts, and graphs and other figures are often visual representations of data.
Frequency of specific words, names and phrases in article text and their association to others	This information is frequently identified through text mining, for example the frequency of particular gene and protein names and their potential associations with one another.
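
As an illustration of the last row of the table, the sketch below counts how often selected terms appear in a set of sentences and how often pairs of terms share a sentence. It is a minimal example, not a description of any particular text mining tool: the function name `term_statistics`, the sample sentences and the list of gene names are all assumptions made for this illustration.

```python
import re
from collections import Counter
from itertools import combinations


def term_statistics(sentences, terms):
    """Count term occurrences and sentence-level co-occurrence of term pairs.

    Hypothetical helper for illustration only; real text mining pipelines
    typically add tokenization, synonym handling and entity recognition.
    """
    frequency = Counter()
    cooccurrence = Counter()
    # Whole-word, case-insensitive pattern for each term of interest.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    for sentence in sentences:
        present = []
        for term, pattern in patterns.items():
            hits = len(pattern.findall(sentence))
            if hits:
                frequency[term] += hits
                present.append(term)
        # Each unordered pair of terms found in the same sentence counts once.
        for pair in combinations(sorted(present), 2):
            cooccurrence[pair] += 1
    return frequency, cooccurrence


# Made-up sample sentences standing in for article text.
sentences = [
    "BRCA1 interacts with TP53 in the DNA damage response.",
    "Loss of TP53 is common in many tumours.",
    "BRCA1 and BRCA2 mutations raise breast cancer risk.",
]
frequency, cooccurrence = term_statistics(sentences, ["BRCA1", "BRCA2", "TP53"])
```

The resulting counts are facts derived from the text rather than the text itself, which is why this policy treats them as data.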

Frequently Asked Questions (FAQs) about Open Data in peer-reviewed journals

These FAQs are adapted from this article in BMC Research Notes.

Question: Will commercial organizations benefit from use of public domain data?
Response: It is already possible for commercial organizations to use content published in open access journals under the CC-BY license for their own benefit. BMC, and many other open access publishers, use CC-BY as the default license for journal articles and their supplementary material (additional files, which can include data). The Open Access Scholarly Publishers Association (OASPA) strongly recommends use of CC-BY by all its members. Using CC0 for data contained in published articles does not change the already existing potential for commercial uses of the published work.
Moreover, permitting commercial use of open access content enables all reuses, including sharing of content on Wikipedia (whose CC BY-SA license permits commercial use) and preservation of content by commercial organizations, which could prove valuable in the event of a publisher going out of business. The UK Government has recognized the benefits to the wider economy and, ultimately, taxpayers of making publicly funded data openly available to stimulate business innovation, and funded the start-up of the Open Data Institute, which launched in 2012. Applying CC0 to data published in journals is not intended to change the numerous community or journal data availability policies. Authors and editors remain in control of what data they choose to publish, unless they are subject to a community-specific requirement for data release.

Question: Will plagiarism increase?
Response: Plagiarism (unattributed copying) and the potential for plagiarism have increased with digital access to content, independent of content licenses. In scholarly publishing, plagiarism usually occurs when text, rather than data, is reused without permission or attribution. The license, CC-BY, under which the narrative text of articles is published is unchanged. If data published in journals are available under CC0, re-users of the data should still cite their sources whenever it is technically possible to do so. Software such as CrossCheck exists to detect plagiarism, and peer reviewers can also detect it. Both peer review and plagiarism detection software are agnostic to content licenses. Creative Commons has rightly described plagiarism as “a completely orthogonal issue to copyright infringement”.

Question: Do authors need to publish more data than they publish already?
Response: We are not requiring authors to publish more of their data. The Open Data policy only affects data authors choose to submit to our journals for open access publication, and does not require release of any other data or a change in license of any data not submitted to the journal. Therefore, authors, editors and their communities remain in control of what content they publish. CC0 is the default term for data which are already being or will be made available open access. However, BMC supports data sharing and release from all areas of research, where this is possible.

Question: What if authors are not allowed, by their funders or employers, to use CC0 for any of their published work?
Response: Where legitimate reasons exist for authors to be unable to apply CC0 to their published data, it is possible to opt out and use a non-standard license. This process already happens in journal publishing. Commonly, figures, tables or charts are reproduced, with permission, in journal articles from sources which are licensed differently to the secondary publisher’s terms, and statements to this effect are included in the articles. When submitting work to journals, authors should read the publisher’s standard copyright and license agreement and, if they cannot agree to the terms, query these before submission or publication. Some scientists funded by the World Health Organization, UK Government, and US Government already have agreements with publishers to use a non-standard copyright statement in their open access articles.

Question: Will patient privacy be put at risk?
Response: Protecting human subjects’ right to privacy is a core principle of ethical research and of the laws of many countries. However, changing the license of published content does not affect what human subjects data are submitted for publication and does not change the accessibility of any anonymized human data which are published. Nor does it affect processes and laws relating to informed consent, privacy, and consent for publication. These proposals are only about licensing of freely available data rather than publishing more or less data than is already happening.

Question: Will articles receive fewer citations?
Response: Applying the CC0 waiver to published data means that legally there is no requirement for attribution of the original author(s) if the data are copied, redistributed or reused. However, anyone reusing data should, whenever technically feasible, still cite the original author(s). Attribution is a legal requirement of copyright law, whereas citation is a cultural norm in scholarship which ensures scientists receive credit for their work. The two concepts are different and often confused. Citation has persisted for centuries in the absence of any legal requirement for it. Attribution and citation can sometimes be achieved in the same manner but the practices serve different purposes (see the table in Hrynaszkiewicz & Cockerill for practical examples). Attribution does not always equal citation, and credit in scholarship is assigned by the latter.
Placing data or any other content in the public domain is not incompatible with those who generated it specifying conditions for its reuse. For example, the International Stroke Trial investigators published a large clinical trial dataset under CC-BY and additionally requested that "any publications arising from the use of this dataset acknowledges the source of the dataset, its funding and the collaborative group that collected the data." Two other research groups have since reused the data.
We are aware of no empirical evidence that applying CC0 to published data results in scientists receiving fewer citations or less credit for their work. In fact, the limited evidence available on citation share for published articles which provide full access to supporting data compared to articles with no supporting data suggests that publishing data with journal articles and enabling reuse increases the number of citations. This has been found in microarray research, astronomy and the marine sciences, although these studies did not evaluate different content licenses – only accessibility.
Furthermore, the attribution requirement is only waived for published data. Data in additional files and within journal articles will be part of an article which retains a CC-BY license.

Question: What incentive is there for the original author(s) to use CC0 instead of CC-BY?
Response: The impact of different licenses for data on citation of datasets and related scholarly works has not yet been established. But because public domain dedication maximizes the potential for data discovery and reuse, we might reasonably hypothesize that open licensing increases individual credit and citations. There is evidence that sharing of research data underlying journal articles increases citation share and increases reproducibility of results. A lack of datasets which can be readily shared and combined – that is, datasets in the public domain or under an open data license – has been identified as hampering progress in evolutionary magnetic resonance imaging (evoMRI) research. In fields facing this problem, placing the data that support publications in the public domain promotes collaboration between research teams and furthers progress.

Question: Why do we need to use CC0 if copyright already does not apply to data?
Response: We are part of a global research and open access publishing enterprise, and whether copyright applies to data varies depending on the legal jurisdiction. In the US this concern may be valid, as copyright does not apply to facts (and data are numerical representations of facts), only to the way in which they are presented. However, in Australia copyright could apply to data, as the focus of the law is on originality rather than creativity. Furthermore, public domain dedication is not just about copyright. Applying CC0 aims to remove legal barriers to sharing and reuse of content by waiving not just copyright but also related and neighboring rights, such as database rights, which maximizes the potential for reuse.

Another important reason for implementing explicit and clear open data licensing is to remove ambiguity. For data reuse to be efficient, humans and machines need content to be clearly licensed. The alternative, making case-by-case assessments and checking with individual data publishers and authors about the license or copyright status of individual data packages, does not scale. Being clear about licensing also reduces the risk that an individual or organization publishing or reusing data in good faith becomes involved in an unintended legal dispute in the future.
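
To illustrate why a clear default scales better than case-by-case checks, the sketch below shows how a re-user's software might pick the license to assume for an additional file. It is a simplified example built on assumptions: the function `default_license`, the `article_override` parameter and the fallback to CC-BY for unrecognized file types are hypothetical, and the extension list only mirrors the example file types in the table on this page, which is not exhaustive.

```python
from pathlib import PurePosixPath

CC0 = "CC0 1.0"
CC_BY = "CC-BY 4.0"

# Extensions taken from the example file types in the policy's table.
# The policy itself says a comprehensive definition is not feasible.
DATA_EXTENSIONS = {".xml", ".csv", ".xls", ".xlsx", ".rdf"}


def default_license(filename, article_override=None):
    """Return the license a re-user might assume for one additional file.

    Hypothetical helper: `article_override` stands for a non-standard
    license stated in the individual article, which takes precedence.
    """
    if article_override is not None:
        return article_override
    suffix = PurePosixPath(filename).suffix.lower()
    # Assumption: files not recognized as data fall back to the article license.
    return CC0 if suffix in DATA_EXTENSIONS else CC_BY


csv_license = default_license("Additional_file_1.CSV")
pdf_license = default_license("figure_s1.pdf")
custom_license = default_license("data.xlsx", article_override="Non-standard license")
```

A rule like this needs no contact with authors or publishers, which is the practical benefit of stating the license explicitly.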

Question: Will data storage problems be created for the publisher or authors?
Response: Our Open Data policy is purely about changing the license for data published in BMC journals. There are no plans to increase the maximum additional file size or the number of files which can be published (a virtually unlimited number of files of up to 20 MB per file). Therefore data storage is unaffected by the policy.