Wednesday, February 14, 2018

Preserving Government Information

Spotlight on Digital Government Information Preservation: Examining the Context, Outcomes, Limitations, and Successes of the DataRefuge Movement examines the issues around preserving access to government information through the lens of the DataRefuge movement. Below the fold, some commentary.

What the authors write is generally accurate as regards the process of ingesting government information, especially via the Data Rescue process. But they pretty much ignore the issue of preserving the ingested content, and I have problems with their sketchy treatment of dissemination.

I've written many times about the importance of basing preservation on a realistic threat model. The motivation for the DataRefuge movement, and its predecessor in Canada, was the threat that a government would use its power to destroy information in its custody. Clearly, DataRefuge's ingest goal of getting a copy into non-governmental hands is an essential first step. But it does not address the threats to preservation that are unique to government information, namely the government's legal powers over those non-governmental hands. The history of US government information on paper in the Federal Depository Library Program (FDLP) shows that the federal government is very willing to use those powers:
The important property of the FDLP is that in order to suppress or edit the record of government documents, the administration of the day has to write letters, or send US Marshals, to a large number of libraries around the country. It is hard to do this without attracting attention ... Attracting attention to the fact that you are attempting to suppress or re-write history is self-defeating. This deters most attempts to do it, and raises the bar of desperation needed to try. It also ensures that, without really extraordinary precautions, even if an attempt succeeds it will not do so without trace.
I wrote that more than a decade ago. The FDLP shows the importance of not just geographic diversity but also administrative diversity in the preservation of government information. But, especially in the era of National Security Letters, it isn't enough. This is why the LOCKSS Program's effort with Canadian librarians to rescue government information from the depredations of the Harper administration implemented jurisdictional diversity by ensuring that some nodes of the LOCKSS network preserving the at-risk information were outside the Harper administration's jurisdiction.
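To make the diversity argument concrete, here is a minimal sketch, not LOCKSS code, with node attributes and thresholds that are purely illustrative, of how a preservation network could check that an item's replicas span enough distinct regions, administrations, and jurisdictions:

```python
# Illustrative sketch only: checks that an item's replicas satisfy
# geographic, administrative, and jurisdictional diversity. Node
# attributes and thresholds are hypothetical, not LOCKSS internals.

REPLICAS = [
    {"node": "lib-us-east", "region": "us-east", "admin": "University A", "jurisdiction": "US"},
    {"node": "lib-us-west", "region": "us-west", "admin": "University B", "jurisdiction": "US"},
    {"node": "lib-ca",      "region": "ca-on",   "admin": "University C", "jurisdiction": "CA"},
]

def diversity(replicas, attribute):
    """Count the distinct values of one attribute across replicas."""
    return len({r[attribute] for r in replicas})

def sufficiently_diverse(replicas, min_regions=2, min_admins=2, min_jurisdictions=2):
    """True if replicas span enough regions, administrations, and jurisdictions."""
    return (diversity(replicas, "region") >= min_regions
            and diversity(replicas, "admin") >= min_admins
            and diversity(replicas, "jurisdiction") >= min_jurisdictions)

print(sufficiently_diverse(REPLICAS))  # True: two US nodes plus one Canadian node
```

The point of the jurisdictional check is that no single government's legal process, a National Security Letter included, can reach every replica.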

The authors understand that funding for efforts to preserve government information is massively inadequate even when it exists:
There is a need for more institutional support through organized, well-funded programs and tasking the GPO with perpetual archiving and access to all public government data and websites. With this in mind, there is also a need for advocating for adequate funding for GPO to do this work.
But they don't seem to realize that the whole point of the efforts they describe was that government, and thus the GPO, is not trusted to "do this work"!

They understand that the lack of funding for current ingest efforts means that:
Furthermore, only a fraction of government data was harvested. EDGI reported 63,076 pages were seeded during DataRescue events to the Internet Archive using their Chrome extension and workflow, with 21,798 of these pages containing datasets or other dynamic content. While this is positive at a surface level, over 194 million unique URLs were captured for the EOT 2012 through human-contributed URLs and web crawlers that automatically snapshot the pages. It would be nearly impossible for humans to go through every agency webpage looking for dynamic content or datasets that need to be specially harvested for preservation.
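The scale problem is clear from those numbers: human-driven DataRescue events seeded tens of thousands of pages against the hundreds of millions of URLs a crawl must cover. As a minimal sketch of what the automated end of that pipeline looks like, assuming only the Internet Archive's public "Save Page Now" endpoint and an illustrative seed list, something like this submits pages for capture:

```python
# Minimal sketch, not DataRescue's actual tooling: submit a seed list
# to the Internet Archive's public "Save Page Now" endpoint. The seed
# URLs are illustrative; rate limiting, retries, and the authenticated
# SPN2 API are all omitted.
import requests

SEEDS = [
    "https://www.epa.gov/climate-indicators",  # example agency pages
    "https://www.noaa.gov/climate",
]

for url in SEEDS:
    resp = requests.get("https://web.archive.org/save/" + url, timeout=60)
    print(url, resp.status_code)
```

A capture like this preserves only what a crawler sees; the datasets and dynamic content the authors worry about sit behind forms and APIs, which is exactly why they needed special harvesting.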
But, like Kalev Leetaru, they respond to the inadequacy of ingest by advocating requirements for much better metadata and documentation, which would massively increase the cost of ingest:
The most glaring downside of the DataRefuge initiative and DataRescue events was the questionability of the accomplishment of long term preservation of government data. The main goal of DataRescue was to save government data for the future if it ever disappears. However, viewing the datasets indexed and archived in the DataRefuge repository through a lens of data curation for reuse and long term usability finds the metadata and documentation generally lacking for preservation purposes.
It isn't like the original content on agency websites had high-quality metadata and documentation making it easy to find and re-use. In the rare case where it did, the Web crawls probably got it. These incessant demands for expensive metadata are making the perfect the enemy of the good, and are extraordinarily unhelpful. Increasing ingest cost will mean less content is ingested, which might be a good thing as there would then be less content for the almost completely lacking funds for preservation to cover.
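For contrast, here is a rough sketch (field names are my assumptions, not any standard or the authors' schema) of the metadata a crawl generates essentially for free, next to the kind of curation record the authors want, every extra field of which needs paid human attention per dataset:

```python
# Illustrative contrast only: field names are assumptions, not a
# standard or the paper's schema. A crawl yields the first record
# almost for free; each added field in the second needs human effort.
crawl_metadata = {
    "url": "https://www.agency.example.gov/data/dataset.csv",  # hypothetical URL
    "capture_time": "2017-02-04T12:00:00Z",
    "content_type": "text/csv",
    "sha256": "d2a8...",  # fixity computed automatically at ingest
}

curation_metadata = dict(
    crawl_metadata,
    title="",                 # needs a human who understands the data
    description="",
    provenance="",
    variable_definitions="",
    reuse_license="",
)
```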

1 comment:

David. said...

"All of Congress’ research would be made available to the public for free under the government spending bill released Wednesday night, which would be a victory for transparency advocates and a boon to members of the public interested in governance.

The fiscal 2018 omnibus spending bill includes a provision that would require Congressional Research Service reports be made available to the public, through a website set up by the Librarian of Congress." reports Joseph Lawler at the Washington Examiner.