Sunday, April 5, 2015

The Mystery of the Missing Dataset

I was interviewed for an upcoming news article in Nature about the problem of link rot in scientific publications, based on the recent Klein et al paper in PLoS One. The paper is full of great statistical data but, as would be expected in a scientific paper, lacks the personal stories that would improve a news article.

I mentioned the interview over dinner with my step-daughter, who was featured in the very first post to this blog when she was a grad student. She immediately said that her current work is hamstrung by precisely the kind of link rot Klein et al investigated. She is frustrated because the dataset from a widely cited paper has vanished from the Web. Below the fold, a working post that I will update as the search for this dataset continues.


My step-daughter works on sustainability and life-cycle analysis. Here is her account of the background to her search:
The data was originally recommended to me by one of our scientific advisors at [a previous company] for use in the software we were developing and for our use in our consulting work. On their recommendation I googled "impact2002+" and found my way to the download page. I originally downloaded it in summer 2011.

It is a model for characterizing environmental flows into impacts. This is incredibly useful when looking at hundreds of pollutants and resource uses across a supply chain to understand how they roll-up into impacts to human health, ecosystem quality, and resources. For example it estimates the disability adjusted life years (impact to human life expectancy) associated with a release of various pollutants to air/land/soil. Another example is the estimate of the ecosystem quality loss (biodiversity loss) associated with various chemical emissions. Another example is the estimate of the future energy required to extract an incremental amount of additional minerals or energy resources (e.g. coal).

I looked for it again in summer 2014 when I noticed it was gone. I always assumed that by just searching "Impact2002+" I'd be able to find the data again - how wrong I was!

I reached out to the webmaster listed on the University of Michigan site and actually got a response but after a couple emails requesting the data with no luck I stopped pursuing that path. I ended up purchasing a dataset that has some of the Impact2002+ data embedded in it but there are still some pieces of my analysis that are limited by not having the original dataset.
Here is where the search starts. In 2003, Olivier Jolliet et al published IMPACT 2002+: A new life cycle impact assessment methodology:
The new IMPACT 2002+ life cycle impact assessment methodology proposes a feasible implementation of a combined midpoint/damage approach, linking all types of life cycle inventory results (elementary flows and other interventions) via 14 midpoint categories to four damage categories. ... The IMPACT 2002+ method presently provides characterization factors for almost 1500 different LCI-results, which can be downloaded at http://www.epfl.ch/impact
In its field, this is an extremely important paper. Google Scholar finds 810 citations to it. Unfortunately, this isn't a paper for which Springer provides article-level metrics. The International Journal of Life Cycle Assessment, in which the paper was published, is ranked 8th in the Sustainable Development field by Google's Scholar Metrics. Its h5-median index is 54, so a paper with 810 citations is vastly more cited than the papers it typically publishes.

The authors very creditably provided their data, the 1500 characterization factors, for download from the specified URL. That link, http://www.epfl.ch/impact, now redirects to http://www.riskscience.umich.edu/jolliet/downloads.htm, which returns a 404 Not Found error, so it has unambiguously rotted. The Wayback Machine does not have that page, although it has over 1000 URLs from http://www.riskscience.umich.edu/, nor does the Memento Time Travel service. So not merely has the link rotted, but there don't appear to be any archived versions of the data supporting the paper.
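
The archive check described above can be repeated programmatically. Below is a minimal sketch in Python, assuming the Internet Archive's public availability endpoint at https://archive.org/wayback/available and the JSON response shape shown in the comments. The function names are mine, invented for illustration. An empty "archived_snapshots" object corresponds to the situation described here: no capture at all.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the query URL asking for the capture of `url` closest to `timestamp`."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD, optionally followed by hhmmss
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return (timestamp, snapshot_url) from an API response, or None if unarchived.

    Assumed response shape:
    {"archived_snapshots": {"closest": {"available": true, "url": ..., "timestamp": ...}}}
    """
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(url, timestamp=None):
    """Query the live API. Requires network access."""
    with urlopen(availability_query(url, timestamp), timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Calling lookup("http://www.riskscience.umich.edu/jolliet/downloads.htm") would return None for a page that was never captured. The parsing is kept separate from the network call so it can be exercised against saved responses.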

The bookmark my step-daughter had for the dataset was http://www.earthshift.com/software/simapro/impact2002, which links to http://www.epfl.ch/impact, which redirects to the broken http://www.riskscience.umich.edu/jolliet/downloads.htm.
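
The redirect chain above can be traced mechanically. Here is a rough sketch, not a production link checker. The fetch function is injected so the logic can be run without a network, and the table of observed responses is a stand-in I made up to model the chain described in this post. In particular the 302 status is an assumption, since all I know is that the EPFL URL redirects.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def trace(url, fetch, max_hops=10):
    """Follow redirects from `url`, returning a list of (url, status) hops.

    `fetch(url)` must return (status_code, location_header_or_None) and must
    not follow redirects itself.
    """
    hops = []
    seen = set()
    while url not in seen and len(hops) < max_hops:
        seen.add(url)
        status, location = fetch(url)
        hops.append((url, status))
        if status in REDIRECT_STATUSES and location:
            url = urljoin(url, location)  # the Location header may be relative
        else:
            break
    return hops


def is_rotted(hops):
    """Treat a chain as rotted unless its last hop is a 2xx response."""
    final_status = hops[-1][1]
    return not (200 <= final_status < 300)


# Hypothetical stand-in for the network, modelling the chain described above.
OBSERVED = {
    "http://www.epfl.ch/impact":
        (302, "http://www.riskscience.umich.edu/jolliet/downloads.htm"),
    "http://www.riskscience.umich.edu/jolliet/downloads.htm":
        (404, None),
}


def fake_fetch(url):
    return OBSERVED[url]


hops = trace("http://www.epfl.ch/impact", fake_fetch)
for hop_url, status in hops:
    print(status, hop_url)
print("rotted:", is_rotted(hops))
```

A redirect loop, or a chain longer than max_hops, ends on a 3xx hop and so also counts as rotted. A real fetch function could be built on any HTTP client that can be told not to follow redirects automatically.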

The Wayback Machine has 11 captures of http://www.epfl.ch/impact between February 11, 2002 and July 7, 2014. The most recent is actually a capture of the page it redirected to at the University of Michigan's School of Public Health, which now returns 404. That page said:
In order to access the IMPACT 2002+ model we ask that you provide us with your name, affiliation and email address at the bottom of this page. You do not have to be affiliated with the Center for Risk Science and Commnication or the University of Michigan to access the IMPACT 2002 model. Your information will only be used to notify you of any updates concerning the model. Your data will be kept strictly confidential.
This is the explanation for the lack of any archived versions of the dataset. Web crawlers, such as the Internet Archive's Heritrix, are unable to fill out Web forms without site-specific knowledge, which in this case was obviously not available.

Similarly, in 2005 the Internet Archive captured pages from the EPFL site before the move to Michigan. They included this page describing the IMPACT2002+ method, which used a form to ask for:
your name, affiliation and your email-address, which will will enable us to keep you informed about important updates from time to time. None of your data will be transmitted to anyone else. Then you can download the following files concerning the IMPACT 2002+ method ... Your data are not used to control or restrict the download, but will help us to keep you informed about updates concerning the IMPACT 2002+ methodology.
Again, archiving of the freely downloadable data was prevented.
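
To make concrete why a form gate defeats a crawler, here is a small sketch using Python's standard html.parser. It separates what a link-following crawler can enqueue (the href targets of anchors) from what it cannot reach without supplying field values (form actions). The sample page is hypothetical, written to resemble the pages quoted above rather than copied from them, and this is in no way how Heritrix is implemented.

```python
from html.parser import HTMLParser


class PageSurvey(HTMLParser):
    """Collect what a link-following crawler can and cannot reach from a page."""

    def __init__(self):
        super().__init__()
        self.links = []         # href targets a crawler can simply enqueue
        self.form_actions = []  # targets that require submitted field values

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "form":
            self.form_actions.append(attrs.get("action") or "")


def survey(html):
    """Return (links, form_actions) found in an HTML document."""
    parser = PageSurvey()
    parser.feed(html)
    parser.close()
    return parser.links, parser.form_actions


def crawlable_downloads(html, extensions=(".zip",)):
    """Links to data files that a crawler could fetch without filling in a form."""
    links, _ = survey(html)
    return [href for href in links if href.lower().endswith(extensions)]


# Hypothetical page resembling the gated download pages described above.
GATED = """
<p>Please provide your name, affiliation and email address.</p>
<form action="/register" method="post">
  <input name="name"><input name="email"><input type="submit">
</form>
<a href="/about.html">About</a>
"""

links, forms = survey(GATED)
print("links:", links)
print("forms:", forms)
print("downloads a crawler could reach:", crawlable_downloads(GATED))
```

On a page like this the only path to the data runs through the form, so a crawler that just follows links never learns that the files exist.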

One obvious lesson from this is that authors should be strongly discouraged from requiring researchers to supply information, such as names and e-mail addresses, before they can download data that is meant to be freely available. As this case shows, the likely result is that with the ravages of time the data will become totally unavailable. It seems likely that this dataset became unavailable as a side effect of the Risk Science Center migrating to its own website rather than remaining part of the School of Public Health's website.

Another lesson is the completely inadequate state of Institutional Repositories. The University of Michigan's IR, Deep Blue, contains only 6 of the 76 "Selected Publications" from Olivier Jolliet's Michigan home page, although it does have full-text PDFs for those 6. Infoscience, the EPFL IR, lists 58 publications with Olivier Jolliet as an author, including the paper in question, but for that it says:
There is no available fulltext. Please contact the lab or the authors.
and:
The IMPACT 2002+ method presently provides characterization factors for almost 1500 different LCI-results, which can be downloaded at http://www.epfl.ch/impact
which is no longer the case. Note that ResearchGate claims to know about 177 publications from Olivier Jolliet.

1 comment:

Michael L. Nelson said...

From the 2005 page, I see a couple of direct links to zip files and don't see evidence that you had to fill in a form. For example:

http://web.archive.org/web/20040115145234/http://gecos.epfl.ch/lcsystems/Fichiers_communs/impact2002/Impact2002+_%28Version1.0%29_AppendixCF_1d.zip

and

http://web.archive.org/web/20040115145234/http://gecos.epfl.ch/lcsystems/Fichiers_communs/impact2002/IMPACT2002-EuropeSingleZone-public1.2.zip

The problem here is that 10+ years ago archives did not "prefer" to crawl .zip and other binary files, so they did not get archived. When (?) that changed, the zip files were hidden behind a form (which, as you point out, is a terrible idea). Our collective archiving rate is much worse if you start including these kinds of leaf nodes in the graph.

I'm also guessing the original data set (+sw, etc.) suffered from its own success. impactmodeling.org appears to be the "free" version, but that site looks to be abandoned. quantis-intl.com seems to be the commercial version.