Comments on DSHR's Blog: You get what you get and you don't get upset

Kalev Leetaru continues his examination of the Int...

2016-01-19T17:27:04.464-08:00

Kalev Leetaru continues his examination of the Internet Archive in The Internet Archive Turns 20: A Behind The Scenes Look At Archiving The Web and figures out:

- That the Wayback Machine provides access to only a part of the Internet Archive's holdings.

- That these holdings are the result of running a wide variety of evolving collection policies in parallel, some driven by the Archive, some by external donors, some by Archive-It partners, some in reaction to world events.

- And that the result of the partial view and the range of policies is that the view presented by the Wayback Machine is much messier than scholars would like.

Welcome to the real world, in which the Archive has to sell services and accept donations to fund its activities. It would have been wonderful if, at the birth of the Web, some far-sighted billionaire had provided an archive with an endowment sufficient to ensure that every version of every Web page was preserved for posterity. Or even to run a single consistent collection policy.

Kalev Leetaru continues to want a less messy world...

2015-11-25T10:32:11.551-08:00

Kalev Leetaru continues to want a less messy world, more to his and other scholars' liking. But, as before, he doesn't suggest any way the inadequate budgets of web archives could be increased to fund the work he wants done, or show that this would be the best use of incremental funding. If he could find incremental funding, it might produce better results from his point of view if it were spent educating researchers about how to work within the limits of the available data.

It turns out that Andy Jackson's summary is wor...

2015-11-20T09:21:27.278-08:00

It turns out that Andy Jackson's summary is worth a great deal. Go read it.

For what it's worth, I tried to summarise our ...

2015-11-20T07:18:56.644-08:00

For what it's worth, I tried to summarise our position here: http://britishlibrary.typepad.co.uk/webarchive/2015/11/the-provenance-of-web-archives.html

Dear David, Many thanks for your long blog post. ...

2015-11-19T11:21:32.221-08:00

Dear David,

Many thanks for your long blog post. Since you include my email text in your comment, I thought that I’d better comment on it.

Apparently you read my comment as if I were advocating for “the utopia of clean, consistently collected Web archives complete with detailed metadata describing collection policies” that, in your words, “is never going to happen”.

This is not the case. Let’s take the words of your sentence one by one:

‘clean’: researchers who have used web archives very well know that web archives are messy, for a number of reasons; but that they are messy or ‘unclean’ does not mean that it’s impossible to get information out of them of relevance for researchers — what we would like to know is just a little bit more about HOW they are messy.

‘consistently collected’: again, many researchers are well aware that web archives are not consistently collected (I have written a number of texts about this).

‘complete with detailed metadata’: well, some web archives actually have metadata (e.g. the Australian Pandora), but web archives based on bulk archiving do not; and, again, researchers know that when you scale up, metadata is not an option. But there are other ways of providing relevant documentation.

‘describing collection policies’: some (in fact most) web archives have written collection policies (or strategies, as they are also called), but the IA does not, at least not to my knowledge. Thus, collection-level documentation exists in many web archives, but in many cases it is not made available to researchers.

And then the three reasons:
1) The internet is a best-efforts network. Nothing new here. No matter how well the collection strategies are described, one never knows what gets into the web archive, because so many things can go wrong (I have described all this in a book from 2005, and in publications before that).

2) The Web is a moving target. It sure is, and again this is not new, not to researchers either (old publications of mine are also about this). The web is evolving and web archiving technology is lagging behind, yes, but this does not exonerate web archives from the task of documenting all this. For a scholar it can be very valuable to know that in the period between x and y it was impossible to archive phenomenon z. If web archives know that in 2000 it was not possible to archive Flash, for instance, they should tell researchers.

3) Metadata costs money. Yes it does, and the IA is doing an outstanding job in collecting the web. But not all web archives have the same business model: some are national web archives with national funding, and one of the advantages of this business model is that it’s not impossible to get funding for improving documentation. Initiatives to do this are under way. As time goes by, the money not spent now on making documentation available will make the problem worse and, ultimately, diminish the usefulness of the web archive.

Most scholars using web archives know that they get what they can get and, in fact, they don't get upset. We’re quite happy with most web archives, for the simple reason that they are the only ones we have. But if researchers could find the same collection in a web archive with no documentation and in a well-documented web archive, guess which they would choose?

We don’t ask for a utopia, and your three reasons do not exclude that web archives should strive to provide as much documentation about their collections as possible. As mentioned in my comment, the information exists, ranging from crawl logs, seed lists, etc. to crawler settings, curator information, and collecting policies and strategies. The task is to make this information accessible to researchers so that it is useful. And just a bit is better than nothing.

We do not ask for clean and consistent collections. What we ask for is information about why and how web archive collections are not clean and consistent. And why do we need this? We need this to make as informed choices as possible, so that we can substantiate the studies we make.

Niels

Words of wisdom from someone who truly understands...

2015-11-19T10:01:13.053-08:00

Words of wisdom from someone who truly understands and accepts how messy the web is and how costly it is to try to organise it.