Monday, April 29, 2013

Talk on LOCKSS Metadata Extraction at IIPC 2013

I gave a brief introduction to the way the LOCKSS daemon extracts metadata from the content it collects at the 2013 IIPC General Assembly. Below the fold is an edited text with links to the sources.



As I mentioned in an earlier talk at this meeting, the original design of the way the LOCKSS system disseminated its content (PDF) was totally transparent to the end user. This turned out not to be a good idea for a number of reasons, the chief being that it was very difficult to get people to pay for a system that didn't deliver any visible value. So we had to augment the original design, which acted as a Web proxy, with a Wayback Machine-like ability to serve the content at a URI that pointed to the LOCKSS box rather than to the original publisher.

Now the problem was that no-one knew the content was in the LOCKSS box. So we had to work with the libraries' normal ways of finding e-journal articles and e-books (PDF), which were DOIs and OpenURL resolvers.  This sounds easy, but it isn't.  LOCKSS boxes crawl the e-journal and e-book web sites, so they end up with a pile of URIs. This process is controlled by a parameterized, site-specific plugin encapsulating everything the daemon needs to know to crawl it. We use plugin parameters to organize the pile of URIs into what we call AUs, Archival Units, which typically represent a volume of a journal. But the DOIs and OpenURLs need to resolve to articles, not to the volume or the issue.
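For example, a plugin for a typical e-journal platform identifies an AU by a handful of parameters; the names below are purely illustrative, since each plugin defines its own:
  // Illustrative only: the plugin parameters that pick out one AU,
  // typically one volume of one journal, on a publisher's site.
  Properties auConfig = new Properties();
  auConfig.setProperty("base_url", "http://www.example-publisher.com/");
  auConfig.setProperty("journal_id", "examplej");
  auConfig.setProperty("volume_name", "17");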

So we have to re-construct the articles from the pile of pieces, and then we have to find the DOI and the bibliographic metadata OpenURL resolvers need in the re-constructed articles. The plugin encapsulates the knowledge needed to do both:
  • It implements an ArticleIterator that knows how to find the various parts of each article in an AU.
  • It implements a MetadataExtractor that, given the appropriate parts of an article, knows how to find various kinds of metadata in them.
I prototyped these mechanisms, I think about four years ago, and haven't worked on the code since, so I apologize if this rather hastily-constructed talk is inaccurate.

With that warning, these are the generic classes involved in article iteration:
src/org/lockss/plugin/ArticleIteratorFactory.java
src/org/lockss/plugin/SubTreeArticleIteratorBuilder.java
src/org/lockss/plugin/SubTreeArticleIterator.java
src/org/lockss/plugin/wrapper/ArticleIteratorFactoryWrapper.java
Pretty simple, although there are currently 56 site-specific ArticleIterator implementations of varying complexity in our collection of plugins.

Here's the interface to the ArticleIteratorFactory. The generic SubTreeArticleIterator walks a sub-tree of the AU's file name space looking for the appropriate component of an article, for example the top-level page representing the article abstract.
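From memory of the sources, the factory interface is roughly this (the exact signature may differ):
public interface ArticleIteratorFactory {
  /**
   * Create an iterator over the articles in an AU.
   * @param au the Archival Unit to iterate over
   * @param target the purpose for which articles are being iterated
   */
  public Iterator<ArticleFiles> createArticleIterator(ArchivalUnit au,
                                                      MetadataTarget target)
      throws PluginException;
}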

Here are the generic classes involved in metadata extraction. Less simple, and there are currently 62 site-specific MetadataExtractor implementations of varying complexity in our collection of plugins:
src/org/lockss/extractor/SimpleFileMetadataExtractor.java
src/org/lockss/extractor/SaxMetadataExtractor.java
src/org/lockss/extractor/BaseArticleMetadataExtractor.java
src/org/lockss/extractor/RisMetadataExtractor.java
src/org/lockss/extractor/FileMetadataExtractorFactory.java
src/org/lockss/extractor/MetadataField.java
src/org/lockss/extractor/ArticleMetadataExtractor.java
src/org/lockss/extractor/ArticleMetadataExtractorFactory.java
src/org/lockss/extractor/SingleArticleMetadataExtractor.java
src/org/lockss/extractor/FileMetadataExtractor.java
src/org/lockss/extractor/SimpleXmlMetadataExtractor.java
src/org/lockss/extractor/SimpleHtmlMetaTagMetadataExtractorFactory.java
src/org/lockss/extractor/MetadataTarget.java
src/org/lockss/extractor/XmlDomMetadataExtractor.java
src/org/lockss/extractor/ArticleMetadata.java
src/org/lockss/extractor/SimpleHtmlMetaTagMetadataExtractor.java
src/org/lockss/extractor/MetadataException.java
There are many different metadata fields we can extract. The canonical representation of each is defined by an instance of MetadataField. Here's the definition of DOI, complete with its validator:
  /*
   * The canonical representation of a DOI has key "dc.identifier" and starts
   * with doi:
   */
  public static final String PROTOCOL_DOI = "doi:";
  public static final String KEY_DOI = "doi";
  public static final MetadataField FIELD_DOI = new MetadataField(KEY_DOI,
      Cardinality.Single) {
    public String validate(ArticleMetadata am, String val)
        throws MetadataException.ValidationException {
      // normalize away leading "doi:" before checking validity
      String doi = StringUtils.removeStartIgnoreCase(val, PROTOCOL_DOI);
      if (!MetadataUtil.isDoi(doi)) {
        throw new MetadataException.ValidationException("Illegal DOI: " + val);
      }
      return doi;
    }
  };

Here is the interface to the ArticleMetadataExtractor. Given a MetadataTarget, such as DOI, and an article made up of a collection of pieces, the extract method uses an emitter functor to emit ArticleMetadata objects describing the metadata it found as key-value pairs:
/** Content parser that extracts metadata from CachedUrl objects */
public interface ArticleMetadataExtractor {
  /**
   * Emit zero or more ArticleMetadata containing metadata extracted from
   * files comprising article (feature)
   * @param target the purpose for which metadata is being extracted
   * @param af describes the files making up the article
   * @param emitter
   */
  public void extract(MetadataTarget target,
      ArticleFiles af,
      Emitter emitter)
      throws IOException, PluginException;
  /** Functor to emit ArticleMetadata object(s) created by extractor */
  public interface Emitter {
    public void emitMetadata(ArticleFiles af, ArticleMetadata metadata);
  }
}
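To show how these pieces fit together, here is a hypothetical sketch, not the actual daemon code, of the loop that drives extraction for an AU; only the interfaces above, MetadataField.FIELD_DOI and ArticleMetadata.get() are real names:
// Hypothetical sketch of the driving loop: walk the plugin's article
// iterator and hand each article's files to the plugin's metadata
// extractor; the Emitter receives whatever metadata was found.
void indexArticles(Iterator<ArticleFiles> articles,
                   ArticleMetadataExtractor ame,
                   MetadataTarget target)
    throws IOException, PluginException {
  ArticleMetadataExtractor.Emitter emitter =
      new ArticleMetadataExtractor.Emitter() {
        public void emitMetadata(ArticleFiles af, ArticleMetadata md) {
          // e.g. record the cooked DOI and bibliographic fields so that
          // DOI and OpenURL queries can resolve to this article's URIs
          String doi = md.get(MetadataField.FIELD_DOI);
          // ... store doi and the rest of md in the metadata database ...
        }
      };
  while (articles.hasNext()) {
    ame.extract(target, articles.next(), emitter);
  }
}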


Unfortunately, publishers use the keys, the metadata field names, inconsistently. So we have to "cook" the raw metadata extracted from the articles into a canonical form. Again, the plugin supplies the site-specific knowledge needed to do this, in the form of a map saying, in effect, if this publisher says the metadata field is called "foo" what they really meant was "bar". We store both the raw and the cooked form of the metadata, validating the values for each key as we cook it using the validator in the MetadataField object that defines the canonical, cooked key.

This is the method that does the cooking. It iterates through the set of raw keys and, for each it finds, stores the validated value into a set of cooked keys:
  /**
   * Copies values from the raw metadata map to the cooked map according to the
   * supplied map. Any MetadataExceptions thrown while storing into the cooked
   * map are returned in a List.
   * @param rawToCooked
   *          maps raw key -> cooked MetadataField.
   */ 
  public List<MetadataException> cook(MultiMap rawToCooked) {
    List<MetadataException> errors = new ArrayList<MetadataException>();
    for (Map.Entry ent :
      (Collection<Map.Entry<String, Collection<MetadataField>>>)
        (rawToCooked.entrySet())) {
      String rawKey = (String) ent.getKey();
      Collection<MetadataField> fields =
          (Collection<MetadataField>) ent.getValue();
      for (MetadataField field : fields) {
        cookField(rawKey, field, errors);
      }
    }
    return errors;
  }
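A tiny worked example may help; putRaw(), get() and getRaw() are from memory of the ArticleMetadata API, and the raw key is made up:
  // Hypothetical illustration of cooking a single field; "foo_doi" is a
  // made-up raw key, standing in for whatever the publisher actually uses.
  MultiMap rawToCooked = new MultiValueMap();
  rawToCooked.put("foo_doi", MetadataField.FIELD_DOI);

  ArticleMetadata am = new ArticleMetadata();
  am.putRaw("foo_doi", "doi:10.1234/example");   // as extracted from the page
  List<MetadataException> errors = am.cook(rawToCooked);
  // am.get(MetadataField.FIELD_DOI) is now "10.1234/example"; the raw
  // value remains available via am.getRaw("foo_doi").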


Here's a very simple recipe for cooking metadata, the one for Emerald Press journals. It says things like "Emerald Press journals have the DOI in a metadata entry called citation_doi":
  public static class EmeraldHtmlMetadataExtractor
    implements FileMetadataExtractor {
    // Map Emerald's Google Scholar HTML meta tag names for journals to cooked metadata fields
    private static MultiMap journalTagMap = new MultiValueMap();
    static {
      journalTagMap.put("citation_doi", MetadataField.FIELD_DOI);
      ...
    }
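The rest of that extractor follows the standard pattern for HTML meta-tag extraction. Roughly, and again from memory so the details may differ, it pulls the raw meta tags out of the page and then cooks them with the map above:
    public void extract(MetadataTarget target, CachedUrl cu, Emitter emitter)
        throws IOException {
      // Pull the raw <meta name="..." content="..."> tags out of the HTML,
      // then cook them into canonical MetadataFields using journalTagMap.
      ArticleMetadata am =
        new SimpleHtmlMetaTagMetadataExtractor().extract(target, cu);
      am.cook(journalTagMap);
      emitter.emitMetadata(cu, am);
    }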

In conclusion, we see that the LOCKSS daemon can automatically extract article-level bibliographic metadata from crawled web sites that is good enough to support DOI and OpenURL resolution. Only a few sites, typically small publishers, require even infrequent hand clean-up. Doing this involves a lot of knowledge about the publishing platforms the various publishers use, including how they structure articles, where in that structure they put the bibliographic metadata, and how to canonicalize it. This is all encoded in the publisher plugin. The current collection of plugins represents a lot of work; it includes 317 classes implemented in Java, 238 implemented in XML, and 694 individual unit tests.

This is an area in which Google has exercised considerable influence, in that the variability among publishers in how they represent metadata has decreased significantly since I wrote the prototype.

1 comment:

David. said...

Tim Zaitsev, Minh Pham, Mike Vrooman and Morgan Brown, all students of Prof. Ed Katz at Carnegie Mellon's Silicon Valley campus, present in this video their project aimed at extracting semantics from, and reasoning about, content stored in LOCKSS boxes. They use MongoDB and Apache Jena.