Tuesday, October 22, 2019

MementoMap

I've been writing about how important Memento is for Web archiving, and how its success depends upon the effectiveness of Memento Aggregators since at least 2011:
In a recent post I described how Memento allows readers to access preserved web content, and how, just as accessing current Web content frequently requires the Web-wide indexes from keywords to URLs maintained by search engines such as Google, access to preserved content will require Web-wide indexes from original URL plus time of collection to preserved URL. These will be maintained by search-engine-like services that Memento calls Aggregators
Memento Aggregators turned out to be both useful, and a hard engineering problem. Below the fold, a discussion of MementoMap Framework for Flexible and Adaptive Web Archive Profiling by Sawood Alam et al from Old Dominion University and Arquivo.pt, which both reviews the history of finding out how hard it is, and reports on fairly encouraging progress in attacking it.

In 2013's Re-thinking Memento Aggregation I discussed a number of problems Memento faced, including the impracticality of maintaining the necessary indexes on a per-URI basis:
The two alternatives for Aggregators in the current model are either to poll each archive on each request, which won't scale well, or to search through metadata about every preserved version of every URI in every archive. For each version of each URI an Aggregator needs to store a DateTime (say 8 bytes) and a URI (say on average 77 bytes). For the Internet Archive's quarter-trillion URI instances alone this would be nearly 25TB of data. Clearly, building, operating and updating an Aggregator is a major undertaking.
The problem is how to route queries to archives likely to have a responsive URI. It is a hard problem, but Alam et al show that significant progress has now been made. Their work used log data from MemGator, ODU's Memento aggregator, which shows the need for efficient routing:
[MemGator] has served about 5.2M requests so far. These lookups were broadcasted to 14 different upstream archives for a total of 61.8M requests. Only 5.44% of these requests had a hit, while the remaining 93.56% were either a miss or an error
Any system in which almost 94% of the effort is wasted has a problem.The IIPC supported a project to address this issue. Sanderson et al investigated using complete indexes of the URIs preserved in a set of Web archives, which:
yielded no false positives or false negatives (i.e., 100% Accuracy) while the CDX files were fresh, but they would go stale very quickly. It is a resource and time intensive task to generate such profiles and some archives may be unwilling or unable to provide their CDX files. Such profiles are so large in size (typically, a few billion URI-R keys) that they require special infrastructure to support fast lookup. Acquiring fresh CDX files from various archives and updating these profiles regularly is not easy.
Bornand et al took a usage-based approach, monitoring what users requested instead of what archives contained, but:
While usage-based profiling can be useful for Memento lookup routing, it may not give the real picture of archives’ holdings, producing both false negatives and false positives.
For example:
We found that traffic from MemGator requested less than 0.003% of the archived resources in Arquivo.pt. There is a need for content- based archive profiling which can express what is present in archives, irrespective of whether or not it is being looked for.
Clearly, the indexes need to be compressed by a large factor while retaining low miss rates. The IIPC project investigated various ways to summarize them, including using text search where available:
In our experiments, we correctly identified about 78% of the URIs that were or were not present in the archive with less than 1% relative cost as compared to the complete knowledge profile and identified 94% URIs with less than 10% relative cost without any false negatives. Based on the archive profiling framework we established, we further investigated the possibility of content-based profiling by issuing fulltext search queries (when available) and observing returned results if access to the CDX data is not possible. We were able to make routing decisions of 80% of the requests correctly while maintaining about 90% Recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile.
In 2016 Sawood Alam reported on progress with Memento: Help Us Route URI Lookups to the Right Archives.  Now, Alam et al report on MementoMap, which:
is a continuation of this effort to make it more flexible and portable by eliminating the need for rigid profiling policies we defined earlier (which are still good for baseline evaluation purposes) and replacing them with an adaptive approach in which the level of detail is dynamically controlled with a number of parameters. 
The approach consists of passing:
a detailed MementoMap through a compaction procedure which yields a summarized output that contains fewer lookup keys by rolling sub-trees with many children nodes up and replacing them with corresponding wildcard keys.
The results look encouraging:
We evaluated MementoMaps by measuring their Accuracy using 3.3M unique URIs from MemGator logs. We found that a MementoMap of less than 1.5% Relative Cost (as compared to the comprehensive listing of all the unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive while maintaining 100% Recall (i.e., zero false negatives).
One of the most interesting details in the paper is their Figure 4, a matrix of the relationship between the number of Mementos of a URI in the Arquivo.pt archive, and the number of requests for that URI in the MemGator logs. The authors discuss it thus (my bold):
Large numbers in the “Zero” column show there are a lot of mementos that are never requested from MemGator. Similarly, the “Zero” row shows there are a lot of requests that have zero mementos in Arquivo.pt. Another way to look at it is that a content-based archive profile will not know about the “Zero” row and a usage-based profile will miss out the content in the “Zero” column. Active archives may want to profile their access logs periodically to identify potential seed URIs of frequently requested missing resources that are within the scope of the archive. Ideally, we would like more activity along the diagonal that passes from the (Zero, Zero) corner, except the corner itself, which suggests there are un-determined number of URI-Rs that were never archived or accessed.
MemGator has apparently served around 5.2M requests in about 3 years, an average of about 200 requests an hour. For comparison, the Internet Archive serves an average of around 167K unique IPs per hour, many of which make multiple requests. This isn't to disparage MemGator, but simply to note that it isn't a very well-known site, and thus it is possible that its spectrum of requests is not representative of the (presumably much larger) spectrum of requests directly to Arquivo.pt. A matrix similar to Figure 4 based on the Arquivo.pt logs might look different.

In a world with Aggregators, archives see two different streams of requests, one directly from users, the other from Aggregators:
  • The direct stream presumably represents the interests of their local community. So, for example, Arquivo.pt's direct stream would be biased toward Portuguese interests.
  • The stream from Aggregators represents the interests of the global community of archive users. It might be expected to resemble the stream seen by the Internet Archive. But, importantly, it is subject to an amplification factor depending on the Aggregator's routing technology. If it had complete knowledge, one request to the Aggregator would result in requests to each of the archives holding Mementos of that URI. If it used no knowledge, one request to the Aggregator would result in one request to each participating archive.
Thus, while profiling their logs, archives need to apply a suitable deflation factor to requests coming from Aggregators. An interesting research topic would be to compare the spectrum of requests for different archives with the spectrum of requests to Aggregators.

There is a lot more fascinating detail in the paper, which is well worth reading.

No comments: