Tuesday, September 26, 2017

Sustaining Open Resources

Cambridge University Office of Scholarly Communication's Unlocking Research blog has an interesting trilogy of posts, by Lauren Cadwallader, David Carr and Dave Gerrard, looking at how open access research resources can be sustained for the long term.
Below the fold I summarize each of their arguments and make some overall observations.

Lauren Cadwallader

From the researcher's perspective, Dr. Cadwallader uses the example of the Virtual Fly Brain, a domain-specific repository for the connections of neurons in Drosophila brains. It was established by UK researchers 8 years ago and is now used by about 10 labs in the UK and about 200 worldwide. It was awarded a 3-year Research Council grant, which was not renewed. The Wellcome Trust then awarded a further 3-year grant, which ends this month. As of June:
it is uncertain whether or not they will fund it in the future. ... On the one hand funders like the Wellcome Trust, Research Councils UK and National Institutes of Health (NIH) are encouraging researchers to use domain specific repositories for data sharing. Yet on the other, they are acknowledging that the current approaches for these resources are not necessarily sustainable.
Clearly, this is a global resource not a UK one, but there is no global institution funding research in Drosophila brains. There is a free rider problem; each individual national or charitable funder depends on the resource but would rather not pay for it, and there is no penalty for avoiding paying until it is too late and the resource has gone.

David Carr

From the perspective of the Open Research team at the Wellcome Trust, Carr notes that:
Rather than ask for a data management plan, applicants are now asked to provide an outputs management plan setting out how they will maximise the value of their research outputs more broadly.

Wellcome commits to meet the costs of these plans as an integral part of the grant, and provides guidance on the costs that funding applicants should consider. We recognise, however, that many research outputs will continue to have value long after the funding period comes to an end. We must accept that preserving and making these outputs available into the future carries an ongoing cost.
Wellcome has been addressing these on-going costs by providing:
significant grant funding to repositories, databases and other community resources. As of July 2016, Wellcome had active grants totalling £80 million to support major data resources. We have also invested many millions more in major cohort and longitudinal studies, such as UK Biobank and ALSPAC. We provide such support through our Biomedical Resource and Technology Development scheme, and have provided additional major awards over the years to support key resources, such as PDB-Europe, Ensembl and the Open Microscopy Environment.
However, these are still grants with end dates, like the one the Virtual Fly Brain now faces:
While our funding for these resources is not open-ended and subject to review, we have been conscious for some time that the reliance of key community resources on grant funding (typically of three to five years’ duration) can create significant challenges, hindering their ability to plan for the long-term and retain staff.
Clearly funders have difficulty committing funds for the long term. And if their short-term funding is successful, they are faced with a "too big to fail" problem. The repository says "pay up now or the entire field of research gets it". Not where a funder wants to end up. Nor is the necessary brinkmanship conducive to "their ability to plan for the long-term and retain staff".

An international workshop of data resources and major funders in the life sciences:
resulted in a call for action (reported in Nature) to coordinate efforts to ensure long-term sustainability of key resources, whilst supporting resources in providing access at no charge to users.  The group proposed an international mechanism to prioritise core data resources of global importance, building on the work undertaken by ELIXIR to define criteria for such resources.  It was proposed national funders could potentially then contribute a set proportion of their overall funding (with initial proposals suggesting around 1.5 to 2 per cent) to support these core data resources.
A voluntary "tax" of this kind may be the least bad approach to funding global resources.
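To give a sense of scale for the proposed set proportion, here is a minimal Python sketch. Only the 1.5 to 2 per cent range comes from the workshop proposal; the funder names and budgets are purely hypothetical, and `levy_range` is a name I made up for illustration.

```python
from decimal import Decimal

# Range proposed at the workshop: 1.5 to 2 per cent of overall funding.
LOW_RATE = Decimal("0.015")
HIGH_RATE = Decimal("0.02")


def levy_range(overall_funding):
    """Return the (low, high) contribution for one funder's overall funding."""
    funding = Decimal(overall_funding)
    return funding * LOW_RATE, funding * HIGH_RATE


# HYPOTHETICAL annual budgets in a single currency; not real figures.
hypothetical_budgets = {"Funder A": 1_000_000_000, "Funder B": 250_000_000}

for name, budget in hypothetical_budgets.items():
    low, high = levy_range(budget)
    print(f"{name}: between {low:,.0f} and {high:,.0f} per year")
```

The sketch says nothing about whether a funder would actually pay; it only shows what the set proportion would amount to if it did.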

Dave Gerrard

From the perspective of a Technical Specialist Fellow from the Polonsky-Foundation-funded Digital Preservation at Oxford and Cambridge project, Gerrard argues that there are two different audiences for open resources. I agree with him about the impracticality of the OAIS concept of Designated Community:
The concept of Designated Communities is one that, in my opinion, the OAIS Reference Model never adequately gets to grips with. For instance, the OAIS Model suggests including explanatory information in specialist repositories to make the content understandable to the general community.

Long term access within this definition thus implies designing repositories for Designated Communities consisting of what my co-Polonsky-Fellow Lee Pretlove describes as: “all of humanity, plus robots”. The deluge of additional information that would need to be added to support this totally general resource would render it unusable; to aim at everybody is effectively aiming at nobody. And, crucially, “nobody” is precisely who is most likely to fund a “specialist repository for everyone”, too.
Gerrard argues that the two audiences need:
two quite different types of repository. There’s the ‘ultra-specialised’ Open Research repository for the Designated Community of researchers in the related domain, and then there’s the more general institutional ‘special collection’ repository containing materials that provide context to the science, ... Sitting somewhere between the two are publications – the specialist repository might host early drafts and work in progress, while the institutional repository contains finished, published work. And the institutional repository might also collect enough data to support these publications
Gerrard is correct to point out that:
a scientist needs access to her ‘personal papers’ while she’s still working, so, in the old days (i.e. more than 25 years ago) the archive couldn’t take these while she was still active, and would often have to wait for the professor to retire, or even die, before such items could be donated. However, now everything is digital, the prof can both keep her “papers” locally and deposit them at the same time. The library special collection doesn’t need to wait for the professor to die to get their hands on the context of her work. Or indeed, wait for her to become a professor.
This works in an ideal world because:
A further outcome of being able to donate digitally is that scientists become more responsible for managing their personal digital materials well, so that it’s easier to donate them as they go along.
But in the real world this effort to "keep their ongoing work neat and tidy" is frequently viewed as a distraction from the urgent task of publishing not perishing. The researcher bears the cost of depositing her materials, while the benefits accrue to other researchers in the future. Not a powerful motivation.

Gerrard argues that his model clarifies the funding issues:
Funding specialist Open Research repositories should be the responsibility of funders in that domain, but they shouldn’t have to worry about long-term access to those resources. As long as the science is active enough that it’s getting funded, then a proportion of that funding should go to the repositories that science needs to support it.
university / institutional repositories need to find quite separate funding for their archivists to start building relationships with those same scientists, and working with them to both collect the context surrounding their science as they go along, and prepare for the time when the specialist repository needs to be mothballed. With such contextual materials in place, there don’t seem to be too many insurmountable technical reasons why, when it’s acknowledged that the “switch from one Designated Community to another” has reached the requisite tipping point, the university / institutional repository couldn’t archive the whole of the specialist research repository, describe it sensibly using the contextual material they have collected from the relevant scientists as they’ve gone along, and then store it cheaply
This sounds plausible but both halves ignore problems:
  • The resource's value will outlast many grants, but funders are constrained to award only short-term grants. A voluntary "tax" on these grants would diversify the repository's income, but voluntary "taxes" are subject to the free-rider problem. To recruit staff and minimize churn, the repository needs reserves, so the tax needs to exceed the running cost, which reinforces the free-rider's incentives.
  • These open research repositories are a global resource. Once the "tipping point" happens, which of the many university or institutional repositories gets to bear the cost of ingesting and preserving the global resource? All the others get to free-ride. Or does Gerrard envisage disaggregating the domain repository so that each researcher's contributions end up in their institution's repository? If so, how are contributions handled from (a) collaborations between labs, and (b) a researcher's career that spans multiple institutions? Or does he envisage the researcher depositing everything into both the domain and the institutional repository? The researcher's motivation is to deposit into the domain repository. The additional work to deposit into the institutional repository is just make-work to benefit the institution, to which these days most researchers have little loyalty. The whole value of domain repositories is the way they aggregate the outputs of all researchers in a field. Isn't it important to preserve that value for the long term?
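The first problem can be made concrete with a small Python sketch. It rests on assumptions of my own that none of the three posts makes: the repository's target income is its running cost plus a percentage margin for reserves, and that target is split equally among the funders who actually pay. All the numbers are hypothetical.

```python
def share_per_payer(running_cost, reserve_percent, funders, free_riders):
    """Equal annual share owed by each funder that actually pays.

    ASSUMPTION: the target is the running cost plus reserve_percent on top
    (20 means collect 120% of the running cost), split equally among payers.
    """
    payers = funders - free_riders
    if payers <= 0:
        raise ValueError("no funder is paying, so the resource is unfunded")
    target = running_cost * (100 + reserve_percent) / 100
    return target / payers


# HYPOTHETICAL: running cost of 1,000,000 a year, 20% margin, 10 funders.
for free_riders in (0, 3, 6, 9):
    share = share_per_payer(1_000_000, 20, 10, free_riders)
    print(f"{free_riders} free riders: each payer owes {share:,.0f}")
```

Under these assumptions each additional free rider raises the bill for the funders who remain, so the incentive to stop paying grows as others stop.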

1 comment:

David. said...

"65 out of the 100 most cited papers are paywalled", write Josh Nicholson and Alberto Pepe:

"There are 1,088,779 citations of the Open Access articles, so, if they cost the same on average as the Paywalled articles and were paid for individually, they would cost a total of: $35,199,108.44 – that’s 14 Bugatti Veyrons, or enough to buy everyone in New York City a Starbucks Tall coffee and chocolate chip cookie. In comparison, the total amount for the paywalled articles, assuming everyone bought the paywalled articles individually, is $54,722,252.80."

Tip of the hat to Cory Doctorow.
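
As a rough check on the quoted figures, this Python sketch backs out the average per-article price implied by the Open Access total, and compares the two totals. The variable names are mine. It assumes the citation count is 1,088,779 and takes the dollar figures exactly as quoted.

```python
from decimal import Decimal

# Figures as quoted from Nicholson and Pepe.
open_access_citations = 1_088_779
open_access_total = Decimal("35199108.44")
paywalled_total = Decimal("54722252.80")

# Average price per article implied by the Open Access total.
implied_price = open_access_total / open_access_citations

# How much larger the paywalled total is than the Open Access total.
ratio = paywalled_total / open_access_total

print(f"implied average price per article: ${implied_price:.2f}")
print(f"paywalled total / Open Access total: {ratio:.2f}")
```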