Everybody talks about open science, but what is it exactly, and how can researchers get involved concretely? We asked Professor Katrin Beyer, chair of the Open Science Strategic Committee at EPFL, and Luc Henry, scientific advisor to Martin Vetterli, President of the same institution, to help us separate the wheat from the chaff. In the following paragraphs, they provide some background, some practical advice and their motivations for addressing this important but complex topic.
What is open science? And why has it become so important over the past couple of years?
Open science is based on the assumption that the discovery process should be carefully described and shared with everyone so that people can critically assess the results and build on them to correct them or make new discoveries – “Nanos gigantum humeris insidentes” (dwarfs standing on the shoulders of giants), to use a popular metaphor attributed to the 12th-century French philosopher Bernard of Chartres. But isn’t that just how science should be? [1]
Starting over two decades ago with open access, researchers have launched initiatives that address the technological, social or cultural issues that prevent knowledge from being transparent and shared: research articles being locked up behind paywalls by private companies [2]; under- or misreporting leading to a reproducibility crisis [3]; insufficient skills and infrastructures to cope with the ever-increasing importance of computers and digital data in research; and many more. In response, new practices are emerging in the research community, together with new expectations from journals and funding agencies, but also from the public.
Much like the rest of society, the academic environment is in transition, and to maintain high standards in the way research is carried out and disseminated, openness has become a necessity.
How much and how quickly should scientists share their data? Who benefits from sharing data?
There is much more to open science than simply sharing data. Actually, researchers increasingly realize that sharing data is not enough and should not be a goal in itself. The high-energy physics community has been particularly aware of this issue [4], but every discipline needs to explore what, how and when to share data in order to increase the reproducibility and reusability of research results.
Ultimately, identifying the benefits will clarify things: is it about creating trust within the scientific community? Opening new research avenues?
Starting with the last question, the best answer is “it depends”. Virtually everyone can benefit from weather or traffic data, while highly specialized information will only speak to experts. But value only materializes if data is shared in a way that makes it understandable to others. Investing the time and resources to make data useful is the best way to prevent it from being lost, and a prerequisite for sharing – together, these good practices are called “data management”. To return to the example of high-energy physics – CERN in this case – the main motivation behind having an open data strategy was that “data and the knowledge needed to interpret them are more likely to survive in the long term if many people outside an experiment are constantly trying to make sense of them.” [5] The data management plan (DMP) now required alongside any research proposal by a majority of funding sources – the Swiss National Science Foundation and the European Commission included – should not be seen as “yet another administrative tick box” but as a commitment to producing the best possible science.
“How much” depends on what is valuable. Is it the raw data that contains the crucial information? Or is processed data sufficient? The answer will vary between disciplines. Sometimes data is easier and cheaper to collect than to store and share. In this case, it may be the code used for the data analysis that is valuable to others. Some disciplines measure events that will never happen again, e.g. the daily temperature. The uniqueness of such data calls for careful preservation.
A reasonable answer to “how quickly” can be expressed as follows: data supporting a scientific claim should be shared with the reviewers at the time of submitting a manuscript, so that they can validate the results. More and more journals require it. The data should then be made available to anyone when the article becomes public. There are of course reasonable and perfectly valid exceptions, for privacy or legal reasons, especially when the protection of individuals or intellectual property rights is at stake. Speed can matter when there is an emergency. The recent Zika and Ebola outbreaks were situations in which there was a moral imperative for scientists to agree to share information quickly and openly [6]. But that does not mean they could compromise on quality. It is important to stress that for most research projects, quantity and speed should come second to quality when sharing data.
Should PhD students and postdocs be educated in Open Access and Open Data management?
Yes, absolutely! “Educated” may be too strong a word, but everyone in academia – and not just PhD students and postdocs – should take the time to learn more about the tools and best practices that can help them manage and share their data in the best possible way. The Open Science Strategic Committee recently made a series of recommendations. One of them stresses how important it is for an institution like EPFL to provide infrastructure, but also to support training programs for scientists so that they can acquire the skills necessary to implement open science in practice.
What are the practical challenges for researchers when sharing data?
Besides the potential technical challenges of very large or complex data sets, there are ethical and legal aspects that need to be addressed. They are not necessarily complex, but even choosing a license for the data can seem a daunting task. Sharing therefore requires a commitment in time and resources. Investing in careful data management and documentation requires some organization, and in order for others to understand and reuse information, it needs to follow standards. As a general rule, the best strategy is to understand and follow the state of the art in one’s discipline, using the tools and repositories that have gathered consensus. The Registry of Research Data Repositories (re3data) can be useful, as it provides detailed information on more than 2,000 research data repositories and can also be queried programmatically, as sketched below. Researchers can also ask for advice before making a choice. The staff at the university library can typically help with general guidelines.
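For readers who like to explore such registries from a script, here is a minimal sketch in Python. It assumes re3data’s public XML API at https://www.re3data.org/api/v1/repositories; the endpoint and the XML element names used below are taken from the public API documentation at the time of writing and may change, so treat this as an illustration rather than a reference.

```python
# A minimal sketch: search the re3data registry for data repositories
# whose name matches a keyword. The endpoint and XML element names are
# assumptions based on re3data's public API and may change over time.
import requests
import xml.etree.ElementTree as ET

# Fetch the full list of registered repositories (returned as XML).
response = requests.get("https://www.re3data.org/api/v1/repositories", timeout=30)
response.raise_for_status()
root = ET.fromstring(response.content)

# Print the identifier and name of every repository matching the keyword.
keyword = "seismology"  # example keyword, pick one from your own discipline
for repo in root.iter("repository"):
    name = repo.findtext("name", default="")
    if keyword.lower() in name.lower():
        print(repo.findtext("id"), "-", name)
```

Each identifier returned this way points to a detailed registry record describing the repository’s scope, terms of use and licenses, which is exactly the information needed before committing data to it.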
Do you see any relative merits of for-profit versus non-profit repository solutions?
Categorizing repositories – or any scholarly infrastructure – as “for-profit” versus “non-profit” is not always useful. Both models can be perfectly valid when providing a service to the research community. The motivations, governance structure and transparency of an initiative are more important, as they will determine whether it will ultimately serve the interests of the scientists, or those of shareholders.
Important aspects to take into consideration when choosing where to submit publications, data, or any other research output are, for example, whether the infrastructure is actually open – can anyone access the information free of charge and without any unnecessary restriction? Is it possible for anyone not just to access, but also to reuse the information for any purpose? Or are there limitations? Is the organization that maintains the platform – or system – asking contributors for a copyright transfer? The latter can lead to a situation where the information is locked behind a paywall in the future. Library staff can provide useful guidance in this case as well. The open access movement is an attempt to break free from an unsustainable rise in the costs of accessing the scholarly literature, which originated in questionable strategic decisions made by the scientific community in the past. The same mistakes should not be repeated with data.
Why do you think preprints have become so fashionable?
Maybe it is worth reminding readers who may not be familiar with the concept that a preprint is a manuscript ready for submission, but prior to formal peer review. While the physics community, for example, could not imagine a world without them, not everybody agrees that information should be shared publicly at this stage – to be fair, a large majority of scientists are still not on board – but the fact that preprints are increasingly popular shows that those who give them a try see the benefits.
Today, preprints are typically shared on an online repository, but informal correspondence between scientists to collect feedback on their work has always existed. One development that allowed the phenomenon to grow in scale was the creation of mailing lists in the 1960s: anyone could sign up to receive a physical copy of the latest research submitted to the list by authors. In that era of paper and stamps, the system quickly became overloaded and lost its reason to exist.
The only technology that could bring the concept of the preprint to its full potential was the internet, and since the world wide web was invented to share scientific information among physicists, it is no surprise that arXiv.org has been so successful since its creation in 1991. What may surprise many is that the biology community also experimented with preprint mailing lists in the 1960s. There is an interesting paper describing the history of this movement and why it did not succeed at the time [7].
The recent surge in the popularity of preprints, in particular in biology and chemistry, may also be related to an increase in interdisciplinary projects over the past ten years. Researchers trained as physicists or mathematicians often contribute to research in genetics and genomics, systems biology, biophysics, molecular simulations and other subfields. They are used to publishing preprints to disseminate their results, and they are now bringing this culture to biology and chemistry. Biology preprints were initially submitted to a sub-section of arXiv – called q-bio, for “quantitative biology” – before the creation of bioRxiv.
There are many reasons why researchers share preprints. It gives the authors a chance to get feedback before submitting to a journal, improving the quality of the manuscript ahead of formal peer review. It also provides an official time stamp for a discovery. One concern that biologists often express is the risk of being scooped if they share information before an article is published. In reality, if the science was done properly, making a preprint public proves that the authors were the first to report the result. In the end, it is a cultural matter. Some disciplines only release results in the form of preprints, often without even bothering to publish – astrophysics, for example. Biologists [8] and chemists [9] still need to figure out what is in it for them.
Could you recommend some websites, reading, training, and documentation to someone interested in learning more about open data, or open science in general?
For open science, one initiative that is not very well known yet, but potentially really useful, is the Ask Open Science website. It is like Reddit, or Quora, but for questions related to open science topics. Social media, and Twitter in particular, are great platforms to participate in the conversation, for example using the #OpenScience hashtag. There is also an Open Science MOOC in the making. Not every aspect is covered yet, but the team of volunteers behind the project is doing a great job. If you want to learn more about the policy implications of open science, two reports stand out: one from the Royal Society in the UK [10], the other from the National Academies of Sciences, Engineering, and Medicine in the USA [11].
Regarding Open Access, the most relevant and up-to-date resource for researchers active in Switzerland is probably the website maintained by the Swiss National Science Foundation.
For open data, an important thing to realize is that the concept was not born in an academic context. Governments and private companies have been adopting open data strategies for well over a decade now. Although the motivations and incentives can be very different, understanding why and how other communities have put standards in place for sharing data openly is an interesting exercise. They have developed technical solutions that can be useful to academic research institutions.
Starting from the Open Definition that was proposed by the Open Knowledge Foundation, you can then explore the 5-star Open Data deployment scheme that was proposed by Tim Berners-Lee, the inventor of the world wide web.
Information that is freely accessible on the web is, however, not necessarily open. For it to also be freely reusable, the author must explicitly allow others to redistribute it under clearly stated conditions. The Creative Commons licenses were invented exactly for that purpose and are now widely used for open access scientific publications. Licensing data is a little more complicated, but there are a number of resources, including the very good “How to license research data” report [12].
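To make the two ideas above concrete – the 5-star scheme and machine-readable licensing – here is a minimal sketch of how a small dataset could be packaged for sharing. The file layout and the metadata field names are illustrative conventions, not a formal standard; “CC-BY-4.0” is the SPDX identifier for the Creative Commons Attribution 4.0 license.

```python
# A minimal, illustrative way to package a small dataset for sharing:
# a non-proprietary CSV file (three stars on the 5-star scheme) plus a
# metadata file that states the license explicitly and machine-readably.
import csv
import json
from pathlib import Path

dataset = Path("my_dataset")
dataset.mkdir(exist_ok=True)

# Structured data in an open, machine-readable format (CSV).
with open(dataset / "temperatures.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "temperature_celsius"])
    writer.writerow(["2018-06-01", 21.4])
    writer.writerow(["2018-06-02", 19.8])

# Without an explicit license, "freely accessible" is not "open":
# state the reuse terms where both humans and machines can find them.
metadata = {
    "title": "Daily temperature readings (illustrative example)",
    "creator": "Jane Doe",  # hypothetical author
    "license": "CC-BY-4.0",  # SPDX identifier for CC Attribution 4.0
    "license_url": "https://creativecommons.org/licenses/by/4.0/",
}
(dataset / "metadata.json").write_text(json.dumps(metadata, indent=2))
```

The exact convention matters less than the habit: whichever repository ultimately hosts the data will map fields like these onto its own metadata schema, and an explicit license is what turns accessible data into reusable data.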
Katrin Beyer is an Associate Professor of Earthquake Engineering at the Ecole Polytechnique Fédérale de Lausanne (EPFL) in Switzerland. After her undergraduate studies at the Federal Institute of Technology in Zurich (ETHZ), she worked for two years for the consulting firm ARUP in London, UK, on projects related to structural dynamics, impact and seismic analysis. She received her PhD from the Rose School in Pavia, Italy. Her research interests include large-scale structural tests, the seismic behaviour of non-rectangular RC walls and unreinforced masonry structures, and the torsional response of asymmetric buildings subjected to seismic excitation.
Luc Henry is a Scientific Advisor to the President of EPFL. His mission is to help elaborate an open science strategy for the institution and to support its implementation. In 2016, he worked at the Swiss National Science Foundation on a similar mission. Luc studied chemistry at EPFL and Uppsala Universitet in Sweden. He then obtained his DPhil (PhD) in chemical biology from the University of Oxford, UK. Over the ten years he spent carrying out research in Switzerland, Sweden, England and Germany, he became increasingly interested in science policy and science communication, which eventually led him to his current role.
Cover picture: “Blue padlock with a key” created by D3images – Freepik.com