Monday, December 6, 2010

Machine-Readable Licenses vs. Machine-Readable Rights?

In the article Archiving Supplemental Materials (PDF) that Vicky Reich and I published recently in Information Standards Quarterly (a download is here), we point out that intellectual property considerations are a major barrier to preserving these increasingly common adjuncts to scholarly articles:
  • Some of them are data. Some data is just facts, and so is not subject to copyright. In some jurisdictions, collections of facts are subject to copyright. In Europe, databases are covered by database right, which is different from copyright.
  • The copyright releases signed by authors differ, and the extent to which they cover supplemental materials may not be clear.
Groups such as Science Commons (a Creative Commons project) and the Open Data Commons are working to create suitable analogs of the set of simple, widely accepted licenses that Creative Commons has created for copyright material.

For material that is subject to copyright, we strongly encourage use of Creative Commons licenses. They permit all activities required for preservation without consultation with the publisher. The legal risks of interpreting other license terms as permitting these activities without explicit permission are considerable, so even if the material was released under some other license terms we would generally prefer not to depend on them but seek explicit permission from the publisher instead. Obtaining explicit permission from the publisher is time-consuming and expensive. So is having a lawyer analyze the terms of a new license to determine whether it covers the required activities.

Efforts, such as those we cite in the article, are under way to develop suitable licenses for data, but they have yet to achieve even the limited penetration of Creative Commons for copyright works. Until there is a simple, clear, widely-accepted license in place, difficulties will lie in the path of any broad approach to preserving supplemental materials, especially data. Creating such a license will be a more difficult task than Creative Commons faced, since it will not be able to draw on the firm legal foundation of copyright. Note that the analogs of Creative Commons licenses for software, the various Open Source licenses, are also based on copyright.

When and if suitable licenses become common, one or more machine-readable ways to identify content published under the licenses will be useful. We're agnostic as to how this is done; the details will have little effect on the archiving process once we have implemented a parser for the machine-readable rights expressions that we encounter. We have already done this using the various versions of the Creative Commons license for the Global LOCKSS Network.
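
To make "a parser for the machine-readable rights expressions" concrete, here is a minimal sketch in Python. It is not the LOCKSS implementation. It assumes the publisher marks up the page with the common rel="license" convention on a or link elements, which is how Creative Commons license links are usually tagged. The names LicenseLinkParser and find_license_urls are invented for this illustration.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values from <a> and <link> tags that carry rel="license".

    Hypothetical helper for illustration; not taken from LOCKSS.
    """

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel is a space-separated token list, e.g. rel="license nofollow"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_tokens and href:
            self.license_urls.append(href)


def find_license_urls(html_text):
    """Return every license URL declared in the page, in document order."""
    parser = LicenseLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.license_urls
```

The parser only reports which license URLs a page declares. Deciding whether a declared license is one the archive is willing to rely on is a separate step.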

The idea of a general "rights language" that would express the terms of a wide variety of licenses in machine-readable form is popular. But it is not a panacea. If there were a wide variety of license terms, even if they were encoded in machine-readable form, we would be reluctant to depend on them. There are few enough Creative Commons licenses and they are simple enough that they can be reviewed and approved by human lawyers. It would be too risky to depend on software interpreting the terms of licenses that had never had this review. So, a small set of simple clear licenses is essential for preservation. Encoding these licenses in machine-readable form is a good thing. That is what the Creative Commons license in machine-readable form does; it does not express the specific rights but simply points to the text of the license in question.
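
Because the machine-readable form simply points to the license text, the software's job can be reduced to checking that pointer against a short list of licenses that lawyers have already reviewed. The Python sketch below illustrates the idea under stated assumptions: the set REVIEWED_CC_CODES, the URL pattern, and the function preservation_decision are all hypothetical, and which licenses belong on the list is a decision for an archive's lawyers, not for this code.

```python
import re

# Assumption: the archive's lawyers have reviewed and approved these
# Creative Commons license codes. The set is illustrative only.
REVIEWED_CC_CODES = frozenset(
    {"by", "by-sa", "by-nd", "by-nc", "by-nc-sa", "by-nc-nd"}
)

# Matches URLs such as http://creativecommons.org/licenses/by/3.0/ with an
# optional jurisdiction segment (e.g. /au/). Deliberately strict: anything
# unusual falls through to "seek-permission".
CC_URL = re.compile(
    r"^https?://(?:www\.)?creativecommons\.org/licenses/"
    r"([a-z-]+)/(\d+\.\d+)(?:/[a-z]+)?/?$"
)


def preservation_decision(license_url):
    """Return "preserve" only for a reviewed license URL.

    Every other input, including unknown or malformed URLs, returns
    "seek-permission", so the software never interprets unreviewed terms.
    """
    match = CC_URL.match((license_url or "").strip().lower())
    if match and match.group(1) in REVIEWED_CC_CODES:
        return "preserve"
    return "seek-permission"
```

The default is the conservative one: the code never tries to work out what an unfamiliar license permits, it just hands the question back to a human.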

Encoding the specific terms of a wide variety of complex licenses in a rights language is much less useful. The software that interprets these encodings will not end up in court, nor will the encodings. The archives that use the software will end up in court facing the text of the license in a human language.


euanc said...

Hi David,

A language that was both human and machine readable would be great for other areas than just licenses. This presentation at Metadata Australia 2010 talks about translating legislation into formal logic so legal decisions can be automated:

Interested in your thoughts.

Euan Cochrane

David. said...

In an ideal world, laws would be written using the kinds of formalism you point to. In the real world, they aren't. There are many reasons for this, but one major one is that ambiguity and imprecision in the laws are essential to the business model of the legal profession. And since the laws are written by lawyers, the prospect that they will be able to be interpreted by software rather than lawyers is a far distant one.

The point I am making is that preserving digital content is an activity that is against the law unless certain conditions apply (such as the safe harbor defense, a specific license such as CC, or specific permission from the publisher). The penalties include imprisonment. Thus one needs to be very sure of the legal ground one is on before undertaking preservation.

Delegating this decision to software requires a level of trust in the software that few, if any, responsible archives will have for many years to come. The others will consult human lawyers, who will read the law as it is written, in human language.