DSHR's Blog: 2019

Tuesday, December 31, 2019

Web Packaging for Web Archiving

Supporting Web Archiving via Web Packaging by Sawood Alam, Michele C Weigle, Michael L Nelson, Martin Klein, and Herbert Van de Sompel is their position paper for the Internet Architecture Board's ESCAPE workshop (Exploring Synergy between Content Aggregation and the Publisher Ecosystem). It describes the considerable potential importance of Web Packaging, the topic of the workshop, for Web archiving, but also the problems it poses because, like the Web before Memento, it ignores the time dimension.

Source: Frederic Filloux

Despite living in the heart of Silicon Valley, our home Internet connection is 3M/1Mbit DSL from Sonic; we love our ISP and I refuse to do business with AT&T or Comcast. As you can imagine, the speed with which Web pages load has been a topic of particular interest for this blog, for example here and here (which starts from a laugh-out-loud, must-read post from Maciej Cegłowski). Then, three years ago, Frederic Filloux's Bloated HTML, the best and the worse triggered my rant Fighting the Web Flab:

Filloux continues:

In due fairness, this cataract of code loads very fast on a normal connection.
His "normal" connection must be much faster than my home's 3Mbit/s DSL. But then the hope kicks in:

The Guardian technical team was also the first one to devise a solid implementation of Google's new Accelerated Mobile Page (AMP) format. In doing so, it eliminated more than 80% of the original code, making it blazingly fast on a mobile device.
Great, but AMP is still 20 bytes of crud for each byte of content. What's the word for 20 times faster than "blazingly"?

Web Packaging is a response to:

In recent years, a number of proprietary formats have been defined to enable aggregators of news and other articles to republish Web resources; for example, Google’s AMP, Facebook’s Instant Articles, Baidu’s MIP, and Apple’s News Format.

Below the fold I look into the history that got us to this point, and where we may be going.

Meta: Blog On Hiatus

I'm not going to be able to blog for a short while, probably a couple of weeks.

Tuesday, November 26, 2019

737 MAX: The Case Against Boeing

The title of Alec MacGillis' The Case Against Boeing is misleading. Samya Stumo, one of the victims of the second 737 MAX crash, was the daughter of a niece of Ralph Nader:

They were the first American family to sue Boeing, accusing the company of gross negligence and recklessness.

MacGillis certainly does discuss some of the ways the culture of Douglas led to Boeing's malfeasance, including blaming the pilots:

Boeing seemed to believe that pilot error had caused the crash. In its response to an initial Indonesian government report, it highlighted the contrasting reactions of the crew on the doomed flight and the crew the day before, saying that the pilots on the second day had not followed the standard “runaway trim” procedures.

But that's not really what the article is about. Follow me below the fold as I try to tease out the real story MacGillis tells, and then add more news on the topic.

Seeds Or Code?

Svalbard Summer '69

I'd like to congratulate Microsoft on a truly excellent PR stunt, drawing attention to two important topics about which I've been writing for a long time, the cultural significance of open source software, and the need for digital preservation. Ashlee Vance provides the channel to publicize the stunt in Open Source Code Will Survive the Apocalypse in an Arctic Cave. In summary, near Longyearbyen on Spitzbergen is:

the Svalbard Global Seed Vault, where seeds for a wide range of plants, including the crops most valuable to humans, are preserved in case of some famine-inducing pandemic or nuclear apocalypse.

Nearby, in a different worked-out coal mine, is the Arctic World Archive:

The AWA is a joint initiative between Norwegian state-owned mining company Store Norske Spitsbergen Kulkompani (SNSK) and very-long-term digital preservation provider Piql AS. AWA is devoted to archival storage in perpetuity. The film reels will be stored in a steel-walled container inside a sealed chamber within a decommissioned coal mine on the remote archipelago of Svalbard. The AWA already preserves historical and cultural data from Italy, Brazil, Norway, the Vatican, and many others.

GitHub, the newly acquired Microsoft subsidiary, will deposit there:

The 02/02/2020 snapshot archived in the GitHub Arctic Code Vault will sweep up every active public GitHub repository, in addition to significant dormant repos as determined by stars, dependencies, and an advisory panel. The snapshot will consist of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size. Each repository will be packaged as a single TAR file. For greater data density and integrity, most of the data will be stored QR-encoded. A human-readable index and guide will itemize the location of each repository and explain how to recover the data.

Follow me below the fold for an explanation of why I call this admirable effort a PR stunt, albeit a well-justified one.

Auditing The Integrity Of Multiple Replicas

The fundamental problem in the design of the LOCKSS system was to audit the integrity of multiple replicas of content stored in unreliable, mutually untrusting systems without downloading the entire content:

Multiple replicas, in our case lots of them, resulted from our way of dealing with the fact that the academic journals the system was designed to preserve were copyright, and the copyright was owned by rich, litigious members of the academic publishing oligopoly. We defused this issue by insisting that each library keep its own copy of the content to which it subscribed.
Unreliable, mutually untrusting systems was a consequence. Each library's system had to be as cheap to own, administer and operate as possible, to keep the aggregate cost of the system manageable, and to keep the individual cost to a library below the level that would attract management attention. So neither the hardware nor the system administration would be especially reliable.
Without downloading was another consequence, for two reasons. Downloading the content from lots of nodes on every audit would be both slow and expensive. But worse, it would likely have been a copyright violation and subjected us to criminal liability under the DMCA.

Our approach, published now more than 16 years ago, was to have each node in the network compare its content with that of the consensus among a randomized subset of the other nodes holding the same content. They did so using a peer-to-peer protocol using proof-of-work, in some respects one of the many precursors of Satoshi Nakamoto's Bitcoin protocol.
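The core idea, comparing a local replica against the consensus of a random sample of peers without shipping the content itself, can be sketched roughly as follows. This is a minimal in-process simulation, not the LOCKSS protocol: the function names `vote` and `audit`, the use of SHA-256, the simple majority rule, and the representation of each peer as the bytes it holds are all assumptions made for illustration, and the proof-of-work and rate-limiting defenses are omitted entirely.

```python
import hashlib
import os
import random


def vote(nonce: bytes, content: bytes) -> str:
    """Hash a fresh nonce together with the content.

    Because the nonce is new for each poll, a peer can only produce
    the right answer if it actually holds the content right now.
    """
    return hashlib.sha256(nonce + content).hexdigest()


def audit(local: bytes, peers: list[bytes], sample_size: int,
          rng: random.Random) -> bool:
    """Return True if the local replica agrees with most of a random sample.

    `peers` stands in for remote nodes (hypothetical simplification):
    each entry is the content that node holds. Only hashes, never the
    content itself, would cross the network in a real system.
    """
    nonce = os.urandom(16)                    # fresh per poll, prevents replay
    sample = rng.sample(peers, sample_size)   # randomized subset of peers
    mine = vote(nonce, local)
    agreeing = sum(1 for peer in sample if vote(nonce, peer) == mine)
    return agreeing * 2 > sample_size         # simple majority (assumption)


if __name__ == "__main__":
    good = b"article text"
    peers = [good] * 9 + [b"article tex7"]    # one damaged peer out of ten
    rng = random.Random(0)
    print(audit(good, peers, 5, rng))         # intact local copy
    print(audit(b"damaged", peers, 5, rng))   # damaged local copy
```

With at most one damaged peer in any sample of five, an intact local copy always wins the poll and a damaged one always loses it, which is why the scheme depends on having plenty of replicas.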

Lots of replicas are essential to the working of the LOCKSS protocol, but more normal systems don't have that many for obvious economic reasons. Back then there were integrity audit systems developed that didn't need an excess of replicas, including work by Mehul Shah et al, and Jaja and Song. But, primarily because the implicit threat models of most archival systems in production assumed trustworthy infrastructure, these systems were not widely used. Outside the archival space, there wasn't a requirement for them.

A decade and a half later the rise of, and risks of, cloud storage have sparked renewed interest in this problem. Yangfei Lin et al's Multiple‐replica integrity auditing schemes for cloud data storage provides a useful review of the current state-of-the-art. Below the fold, a discussion of their work and of some related work.

Academic Publishers As Parasites

This is just a quick post to draw attention to From symbiont to parasite: the evolution of for-profit science publishing by UCSF's Peter Walter and Dyche Mullins in Molecular Biology of the Cell. It is a comprehensive overview of the way the oligopoly publishers obtained and maintain their rent-extraction from the academic community:

"Scientific journals still disseminate our work, but in the Internet-connected world of the 21st century, this is no longer their critical function. Journals remain relevant almost entirely because they provide a playing field for scientific and professional competition: to claim credit for a discovery, we publish it in a peer-reviewed journal; to get a job in academia or money to run a lab, we present these published papers to universities and funding agencies. Publishing is so embedded in the practice of science that whoever controls the journals controls access to the entire profession."

My only criticisms are a lack of cynicism about the perks publishers distribute:

They pay no attention to the role of librarians, who after all actually "negotiate" with the publishers and sign the checks.
They write:

we work for them for free in producing the work, reviewing it, and serving on their editorial boards
We have spoken with someone who used to manage top journals for a major publisher. His internal margins were north of 90%, and the single biggest expense was the care and feeding of the editorial board.

And they are insufficiently skeptical of claims as to the value that journals add. See my Journals Considered Harmful from 2013.

Despite these quibbles, you should definitely go read the whole paper.

Thursday, October 31, 2019

Aviation's Groundhog Day

Searching for 40-year old lessons for Boeing in the grounding of the DC-10 by Jon Ostrower is subtitled An eerily similar crash in Chicago 40-years ago holds lessons for Boeing and the 737 Max that reverberate through history. Ostrower writes that it is:

The first in a series on the historical parallels and lessons that unite the groundings of the DC-10 and 737 Max.

I hope he's right about the series, because this first part is a must-read account of the truly disturbing parallels between the dysfunction at McDonnell-Douglas and the FAA that led to the May 25th 1979 Chicago crash of a DC-10, and the dysfunction at Boeing (whose management is mostly the result of the merger with McDonnell-Douglas) and the FAA that led to the two 737 MAX crashes. Ostrower writes:

The grounding of the DC-10 ignited a debate over system redundancy, crew alerting, requirements for certification, and insufficient oversight and expertise of an under-resourced regulator — all familiar topics that are today at the center of the 737 Max grounding. To revisit the events of 40 years ago is to revisit a safety crisis that, swapping a few specific details, presents striking similarities four decades later, all the way down to the verbiage.

Below the fold, some commentary with links to other reporting.

Future of Open Access

The Future of OA: A large-scale analysis projecting Open Access publication and readership by Heather Piwowar, Jason Priem and Richard Orr is an important study of the availability and use of Open Access papers:

This study analyses the number of papers available as OA over time. The models includes both OA embargo data and the relative growth rates of different OA types over time, based on the OA status of 70 million journal articles published between 1950 and 2019.

The study also looks at article usage data, analyzing the proportion of views to OA articles vs views to articles which are closed access. Signal processing techniques are used to model how these viewership patterns change over time. Viewership data is based on 2.8 million uses of the Unpaywall browser extension in July 2019.

They conclude:

One interesting realization from the modeling we’ve done is that when the proportion of papers that are OA increases, or when the OA lag decreases, the total number of views increase -- the scholarly literature becomes more heavily viewed and thus more valuable to society.

Thus clearly demonstrating one part of the value that open access adds. Below the fold, some details and commentary.

MementoMap

I've been writing about how important Memento is for Web archiving, and how its success depends upon the effectiveness of Memento Aggregators since at least 2011:

In a recent post I described how Memento allows readers to access preserved web content, and how, just as accessing current Web content frequently requires the Web-wide indexes from keywords to URLs maintained by search engines such as Google, access to preserved content will require Web-wide indexes from original URL plus time of collection to preserved URL. These will be maintained by search-engine-like services that Memento calls Aggregators

Memento Aggregators turned out to be both useful, and a hard engineering problem. Below the fold, a discussion of MementoMap Framework for Flexible and Adaptive Web Archive Profiling by Sawood Alam et al from Old Dominion University and Arquivo.pt, which both reviews the history of finding out how hard it is, and reports on fairly encouraging progress in attacking it.
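The kind of index the quote describes, from original URL plus time of collection to preserved URL, can be sketched as a small in-memory structure. This is only an illustration of the lookup, not how any real Aggregator works: the class name `MementoIndex`, its methods, and the `archive.example` URLs are hypothetical, and a real service has to answer the much harder question of which of many archives to ask.

```python
import bisect
from datetime import datetime
from typing import Optional


class MementoIndex:
    """Maps (original URL, collection time) to the URL of a preserved copy."""

    def __init__(self) -> None:
        # original URL -> list of (collection time, preserved URL), sorted by time
        self._captures: dict[str, list[tuple[datetime, str]]] = {}

    def add(self, original: str, collected: datetime, preserved: str) -> None:
        """Record that `preserved` is a copy of `original` made at `collected`."""
        bisect.insort(self._captures.setdefault(original, []),
                      (collected, preserved))

    def closest(self, original: str, wanted: datetime) -> Optional[str]:
        """Return the preserved URL collected nearest to `wanted`, if any."""
        captures = self._captures.get(original)
        if not captures:
            return None
        i = bisect.bisect_left(captures, (wanted, ""))
        # Only the capture just before and the one just after can be nearest.
        candidates = captures[max(i - 1, 0): i + 1]
        return min(candidates, key=lambda c: abs(c[0] - wanted))[1]


if __name__ == "__main__":
    index = MementoIndex()
    index.add("http://example.com/", datetime(2011, 1, 1),
              "https://archive.example/2011/http://example.com/")
    index.add("http://example.com/", datetime(2019, 1, 1),
              "https://archive.example/2019/http://example.com/")
    print(index.closest("http://example.com/", datetime(2012, 6, 1)))
```

Keeping each URL's captures sorted by time makes the nearest-in-time lookup a binary search. The hard part is doing this at Web scale across many independent archives.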

Be Careful What You Measure

"Be careful what you measure, because that's what you'll get" is a management platitude dating back at least to V. F. Ridgway's 1956 Dysfunctional Consequences of Performance Measurements:

Quantitative measures of performance are tools, and are undoubtedly useful. But research indicates that indiscriminate use and undue confidence and reliance in them result from insufficient knowledge of the full effects and consequences. ... It seems worth while to review the current scattered knowledge of the dysfunctional consequences resulting from the imposition of a system of performance measurements.

Back in 2013 I wrote Journals Considered Harmful, based on Deep Impact: Unintended consequences of journal rank by Björn Brembs and Marcus Munafò, which documented that the use of Impact Factor to rank journals had caused publishers to game the system, with negative impacts on the integrity of scientific research. Below the fold I look at a recent study showing similar negative impacts on research integrity.

Nanopore Technology For DNA Storage

DNA assembly for nanopore data storage readout by Randolph Lopez et al from the UW/Microsoft team continues their steady progress in developing technologies for data storage in DNA.

Below the fold, some details and a little discussion.

Real-Time Gross Settlement

Cryptocurrency advocates appear to believe that the magic of cryptography makes the value of trust zero, but they’re wrong. Follow me below the fold for an example that shows this.

The Data Isn't Yours (updated)

Most discussions of Internet privacy, for example Jaron Lanier Fixes the Internet, systematically elide the distinction between "my data" and "data about me". In doing so they systematically exaggerate the value of "my data".

The typical interaction that generates data about an Internet user involves two parties, a client and a server. Both parties know what happened (a link was clicked, a purchase was made, ...). This isn't "my data", it is data shared between the client ("me") and the server. The difference is that the server can aggregate the data from many interactions and, by doing so, create something sufficiently valuable that others will pay for it. The client ("my data") cannot.

Below the fold, an update.

Guest post: Ilya Kreymer's Client-Side Replay Technology

Ilya Kreymer gave a brief description of his recent development of client-side replay for WARC-based Web archives in this comment on my post Michael Nelson's CNI Keynote: Part 3. It uses Service Workers, which Matt Gaunt describes in Google's Web Fundamentals thus:

A service worker is a script that your browser runs in the background, separate from a web page, opening the door to features that don't need a web page or user interaction. Today, they already include features like push notifications and background sync. In the future, service workers might support other things like periodic sync or geofencing. The core feature discussed in this tutorial is the ability to intercept and handle network requests, including programmatically managing a cache of responses.

Client-side replay was clearly an important advance, so I asked him for a guest post with the details. Below the fold, here it is.

Boeing 737 MAX: Two Competing Views

Two long and very detailed articles on the background to the 737 MAX disasters present very different views. William Langewiesche's What Really Brought Down the Boeing 737 Max? is subtitled:

Malfunctions caused two deadly crashes. But an industry that puts unprepared pilots in the cockpit is just as guilty.

Maureen Tkacik's Crash Course: How Boeing's Managerial Revolution Created The 737 MAX Disaster puts the theme in the headline. Below the fold I discuss them, and relate them to my post First We Change How People Behave.

Promising New Hard Disk Technology

It has been too long, two-and-a-half years, since the last of Tom Coughlin's Storage Valley Supper Club events. But he just organized one to coincide with the Flash Memory Summit. It featured an extremely interesting talk by Karim Kaddeche, CEO of L2 Drive, a company whose technology seems likely to have a big impact on the hard disk market. Follow me below the fold for the explanation. I didn't take notes, so what follows is from memory. I apologize for any errors.

Google's Fenced Garden

In the wake of Lina Khan's masterful January 2017 Yale Law Journal article Amazon's Antitrust Paradox, both anti-trust investigations of the FAANGs and anti-trust remedies have been consuming extraordinary numbers of pixels. Although the investigations cover all the major platforms, the discussion of remedies has tended to focus on Facebook and Amazon. Below the fold, I ask whether, assuming any of the multifarious investigations lead to anything other than cost-of-doing-business fines, any of the proposed remedies would be effective against Google. I apologize for the inordinate length of this post; it seemed that the more I wrote the more there was to write.

Interesting Articles From Usenix

Unless you're a member of Usenix (why aren't you?) you'll have to wait a year to read two of three interesting preservation-related articles in the Fall 2019 issue of ;login:. Below the fold is a little taste of each of them, with links to the full papers if you don't want to wait a year:

The Optimist's Telescope: Review

The fundamental problem of digital preservation is that, although it is important and we know how to do it, we don't want to pay enough to have it done. It is an example of the various societal problems caused by rampant short-termism, about which I have written frequently.

Bina Venkataraman has a new book on the topic entitled The Optimist's Telescope: Thinking Ahead in a Reckless Age. Robert H. Frank reviews it in the New York Times:

How might we mitigate losses caused by shortsightedness? Bina Venkataraman, a former climate adviser to the Obama administration, brings a storyteller’s eye to this question in her new book, “The Optimist’s Telescope.” She is also deeply informed about the relevant science.

The telescope in her title comes from the economist A.C. Pigou’s observation in 1920 that shortsightedness is rooted in our “faulty telescopic faculty.” As Venkataraman writes, “The future is an idea we have to conjure in our minds, not something that we perceive with our senses. What we want today, by contrast, we can often feel in our guts as a craving.”

She herself is the optimist in her title, confidently insisting that impatience is not an immutable human trait. Her engaging narratives illustrate how people battle and often overcome shortsightedness across a range of problems and settings.

Below the fold, some thoughts upon reading the book.

SSD vs. HDD (Updated)

IDC & TrendForce data
via Aaron Rakers

Chris Mellor's How long before SSDs replace nearline disk drives? starts with a quote I think the good Dr. Pangloss would love:

Aaron Rakers, the Wells Fargo analyst, thinks enterprise storage buyers will start to prefer SSDs when prices fall to five times or less that of hard disk drives. They are cheaper to operate than disk drives, needing less power and cooling, and are much faster to access.

Below the fold, some skepticism.

Optical Media Durability: Update

A year ago I posted Optical Media Durability and discovered:

Surprisingly, I'm getting good data from CD-Rs more than 14 years old, and from DVD-Rs nearly 12 years old. Your mileage may vary.

It is time to repeat the mind-numbing process of feeding 45 disks through the reader and verifying their checksums. Below the fold, this year's results.
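For readers who want to run a similar check on their own media, verification against stored checksums might look something like the sketch below. The post does not say which hash or tooling was used, so SHA-256, the manifest format (file name mapped to expected hex digest), and the function names `sha256_of` and `verify` are all assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the names of files that are missing or whose checksum changed.

    `manifest` maps a file name relative to `root` (for example, the mount
    point of the disk being checked) to the hex digest recorded when the
    disk was written. An empty result means everything verified.
    """
    bad = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(name)
    return bad
```

Calling `verify(manifest, Path("/mnt/disk"))` once per disk, with a hypothetical mount point, would give the list of files to worry about.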

A Tribute To Don Waters

Michael Keller has written, in Exploiting the opportunities of the maturing digital age: the first twenty years of the Scholarly Communications Program of the Andrew W. Mellon Foundation, what is effectively a richly deserved tribute to Don Waters as his retirement looms. Below the fold, some commentary and my two cents worth.

Wine on Windows 10

David Gerard posts Wine on Windows 10. It works.

Windows 10 introduced Windows Subsystem for Linux — and the convenience of Ubuntu downloadable from the Microsoft Store. This makes this dumb idea pretty much Just Work out of the box, apart from having to set your DISPLAY environment variable by hand.

So far, it's mindbogglingly useless. It can only run 64-bit Windows apps, which doesn't even include all the apps that come with Windows 10 itself.

But I want to stress again: this now works trivially. I'm not some sort of mad genius to do this thing — I only appear to be the first person to admit to having done it publicly.

Gerard recounts the history of this "interesting" idea. Although he treats this as a "geek gotta do what a geek gotta do" thing, the interest for Emulation & Virtualization as Preservation Strategies is in the tail of the post:

TO DO: 32-bit support. This will have to wait for Microsoft to release WSL 2. I wonder if ancient Win16 programs will work then — they should do in Wine, even if they don't in Windows any more.

Of course, if they run in Wine on Ubuntu on Windows 10 on an x86, they should run on Wine on Ubuntu on an x86. But being able to run Wine in an official Microsoft environment might make deployment of preserved Win16 programs easier to get past an institution's risk-averse lawyers.

Thursday, August 1, 2019

Emulation as a Service

I've written before about the valuable work of the Software Preservation Network (SPN). Now they have released their EaaSI Sandbox, in which you can explore the capabilities of "Emulation as a Service" (EaaS), a topic I discussed in my report Emulation and Virtualization as Preservation Strategies. Below the fold I try EaaSI for the first time.

Blockchain briefing for DoD

I was asked to deliver Blockchain: What's Not To Like? version 3.0 to a Department of Defense conference-call. I took the opportunity to update the talk, and expand it to include some of the "Additional Material" from the original, and from the podcast. Below the fold, the text of the talk with links to the sources. The yellow boxes contain material that was on the slides but was not spoken.

Boeing's Corporate Suicide

Boeing believed that development of the 787 Dreamliner was a "bet the company decision". As things turned out, after a rocky start, it was a bet that will probably pay off. But the company took another "bet the company" decision that looks like it may not pay off, and it may well take the company with it. Below the fold, the details.

Carl Malamud's Text Mining Project

For many years now it has been obvious that humans can no longer effectively process the enormous volume of academic publishing. The entire system is overloaded, and its signal-to-noise ratio is degrading. Journals are no longer effective gatekeepers, indeed many are simply fraudulent. Peer review is incapable of preventing fraud, gross errors, false authorship, and duplicative papers; reviewers cannot be expected to have read all the relevant literature.

On the other hand, there is now much research showing that computers can be effective at processing this flood of information. Below the fold I look at a couple of recent developments.

Not To Pick On Toyota

Just under five years ago Prof. Phil Koopman gave a talk entitled A Case Study of Toyota Unintended Acceleration and Software Safety (slides, video). I only just discovered it, and it's an extraordinarily valuable resource for understanding the risks of embedded software, especially in life-critical products, and the processes needed to avoid failures such as those that caused deaths from sudden unintended acceleration (SUA) of Toyota cars, and from unintended pitch-down of Boeing 737 MAX aircraft. I doubt Toyota is an outlier in this respect, and I would expect that the multi-billion dollar costs of the problems Koopman describes have motivated much improvement in their processes. Follow me below the fold for the details.

The EFF vs. DMCA Section 1201

As the EFF's Parker Higgins wrote:

Simply put, Section 1201 means that you can be sued or even jailed if you bypass digital locks on copyrighted works—from DVDs to software in your car—even if you are doing so for an otherwise lawful reason, like security testing.

Section 1201 is obviously a big problem for software preservation, especially when it comes to games.

Last December in Software Preservation Network I discussed both the SPN's important documents relating to the DMCA:

Below the fold, some important news about Section 1201.

Finn Brunton's "Digital Cash"

I attended the book launch event for Finn Brunton's Digital Cash at the Internet Archive, and purchased a copy. It is a historian's review of the backstory leading up to Satoshi Nakamoto's Bitcoin. To motivate you to read it, below the fold I summarize its impressive breadth.

The Web Is A Low-Trust Society

Back in 1992 Robert Putnam et al published Making democracy work: civic traditions in modern Italy, contrasting the social structures of Northern and Southern Italy. For historical reasons, the North has a high-trust structure whereas the South has a low-trust structure. The low-trust environment in the South had led to the rise of the Mafia and persistent poor economic performance. Subsequent effects include the rise of Silvio Berlusconi.

Now, in The Internet Has Made Dupes-And Cynics-Of Us All, Zeynep Tufekci applies the same analysis to the Web:

ONLINE FAKERY RUNS wide and deep, but you don’t need me to tell you that. New species of digital fraud and deception come to light almost every week, if not every day: Russian bots that pretend to be American humans. American bots that pretend to be human trolls. Even humans that pretend to be bots. Yep, some “intelligent assistants,” promoted as advanced conversational AIs, have turned out to be little more than digital puppets operated by poorly paid people.

The internet was supposed to not only democratize information but also rationalize it—to create markets where impartial metrics would automatically surface the truest ideas and best products, at a vast and incorruptible scale. But deception and corruption, as we’ve all seen by now, scale pretty fantastically too.

Below the fold, some commentary.

The Risks Of Outsourcing

My Cloud for Preservation post was in some sense all about the risks of outsourcing IT infrastructure to the cloud. Below the fold I comment on two recent articles illustrating different aspects of these risks.

Lina M. Khan On Structural Separation

In It's The Enforcement, Stupid! I argued that anti-trust enforcement was viable only if there were "bright lines". I even went further and, following Kim Stanley Robinson's Pacific Edge, suggested a hard cap on corporate revenue, as a way of making anti-trust self-executing.

Much of the recent wave of attention to anti-trust was sparked by Lina Khan's masterful January 2017 Yale Law Journal article Amazon's Antitrust Paradox (a must-read, even at 24,000 words). Now Cory Doctorow writes:

Khan (who is now a Columbia Law fellow) is back with The Separation of Platforms and Commerce -- clocking in at 61,000 words with footnotes! -- that describes the one-two punch of contemporary monopolism, in which Reagan-era deregulation enthusiasts took the brakes off of corporate conduct but said it would be OK because antitrust law would keep things from getting out of control, while Reagan-era antitrust "reformers" (led by Robert Bork and the Chicago School) dismantled antitrust).

You should definitely read Khan's latest magnum opus. OK, maybe you can skip the footnotes, I admit I did. Below the fold I examine two threads among many in the article.

Michael Nelson's CNI Keynote: Part 3

Here is the conclusion of my three-part "lengthy disquisition" on Michael Nelson's Spring CNI keynote Web Archives at the Nexus of Good Fakes and Flawed Originals (Nelson starts at 05:53 in the video, slides).

Part 1 and Part 2 addressed Nelson's description of the problems of the current state of the art. Below the fold I address the way forward.

HAMR-ing Home My Point

In Double-headed Seagate disk drives? Yes, on their way, Chris Mellor mentions that Seagate:

expects to intro 20TB+ HAMR-based nearline HDDs in calendar 2020.

Volume production of HAMR drives is still 1 year away. In 2009 Dave Anderson of Seagate presented this roadmap. It shows HAMR drives a year away in 2010. They have been a year away ever since. A decade of real-time slip.

Only the good Dr. Pangloss believes industry roadmaps.

Tuesday, June 18, 2019

Michael Nelson's CNI Keynote: Part 2

My "lengthy disquisition" on Michael Nelson's Spring CNI keynote Web Archives at the Nexus of Good Fakes and Flawed Originals (Nelson starts at 05:53 in the video, slides) continues here. Part 1 had an introduction and discussion of two of my issues with Nelson's big picture.

Below the fold I address my remaining issues with Nelson's big picture of the state of the art. Part 3 will compare his and my views of the path ahead.

Michael Nelson's CNI Keynote: Part 1

Michael Nelson and his group at Old Dominion University have made major contributions to Web archiving. Among them are a series of fascinating papers on the problems of replaying archived Web content. I've blogged about several of them, most recently in All Your Tweets Are Belong To Kannada and The 47 Links Mystery. Nelson's Spring CNI keynote Web Archives at the Nexus of Good Fakes and Flawed Originals (Nelson starts at 05:53 in the video, slides) understandably focuses on recounting much of this important research. I'm a big fan of this work, and there is much to agree with in the rest of the talk.

But I have a number of issues with the big picture Nelson paints. Part of the reason for the gap in posting recently was that I started on a draft that discussed both the big picture issues and a whole lot of minor nits, and I ran into the sand. So I finally put that draft aside and started this one. I tried to restrict myself to the big picture, but despite that it is still too long for a single post. Follow me below the fold for the first part of a lengthy disquisition.

Regulating Cryptocurrencies

Satoshi Nakamoto's Bitcoin emerged not just from three decades of computer science research, but also from two interrelated cult-like ideologies of the right, libertarianism and Austrian economics. Governments are generally happy with computer science research until it gets in the way of law enforcement, but non-kleptocratic governments tend to be unhappy with both libertarianism and Austrian economics, particularly when they get in the way of law enforcement.

Below the fold, a look at the varying approaches governments are taking to the problems they perceive cryptocurrencies pose.

Ten Hot Topics

The topic of scholarly communication has received short shrift here for the last few years. There has been too much to say about other topics, and developments such as Plan S have been exhaustively discussed elsewhere. But I do want to call attention to an extremely valuable review by Jon Tennant and a host of co-authors entitled Ten Hot Topics around Scholarly Publishing.

The authors pose the ten topics as questions, which allows for a scientific experiment. My hypothesis is that all these questions, while not strictly headlines, will nevertheless obey Betteridge's Law of Headlines, in that the answer will be "No". Below the fold, I try to falsify my hypothesis.

Review Of Data Storage In DNA

Luis Ceze, Jeff Nivala and Karin Strauss of the University of Washington and Microsoft Research have published a fascinating review of the history and state-of-the-art in Molecular digital data storage using DNA. The abstract reads:

Molecular data storage is an attractive alternative for dense and durable information storage, which is sorely needed to deal with the growing gap between information production and the ability to store data. DNA is a clear example of effective archival data storage in molecular form. In this Review, we provide an overview of the process, the state of the art in this area and challenges for mainstream adoption. We also survey the field of in vivo molecular memory systems that record and store information within the DNA of living cells, which, together with in vitro DNA data storage, lie at the growing intersection of computer systems and biotechnology.

They include a comprehensive bibliography. Below the fold, some commentary and a few quibbles.

Storing Data In Oligopeptides

Bryan Cafferty et al have published a paper entitled Storage of Information Using Small Organic Molecules. There's a press release from Harvard's Wyss Institute at Storage Beyond the Cloud. Below the fold, some commentary on the differences and similarities between this technique and using DNA to store data.

Immutability FTW!

There's an apparently apocryphal story that when Willie Sutton, the notorious bank robber of the 1930s to 1950s, was asked why he robbed banks, he answered:

Because that's where the money is!

Today's Willie Suttons don't need a disguise or an (unloaded) Thompson submachine gun, because they rob cryptocurrency exchanges. As David Gerard writes:

Crypto exchange hacks are incredibly rare, and only happen every month or so.

Yesterday Bloomberg reported:

Binance, one of the world’s largest cryptocurrency exchanges, said hackers withdrew 7,000 Bitcoins worth about $40 million via a single transaction in a “large scale security breach,” the latest in a long line of thefts in the digital currency space.

Below the fold, a few thoughts:

Demand Is Even Less Insatiable Than It Used To Be

In Demand Is Far From Insatiable I looked at Chris Mellor's overview of the miserable Q2 numbers from Seagate, Nearline disk drive demand dip dropkicks Seagate: How deep is the trough, how deep is the trough?, and Western Digital, Weak flash demand and disk sales leave Western Digital scrabbling to claw back $800m a year. This quarter was equally dismal. Below the fold, the gory details.

Lets Put Our Money Where Our Ethics Are

I found a video of Jefferson Bailey's talk at the Ethics of Archiving the Web conference from a year ago. It was entitled Lets Put Our Money Where Our Ethics Are. The talk is the first 18.5 minutes of this video. It focused on the paucity of resources devoted to archiving the huge proportion of our culture that now lives on the evanescent Web. I've also written on this topic, for example in Pt. 2 of The Amnesiac Civilization. Below the fold, some detailed numbers (that may by now be somewhat out-of-date) and their implications.

Short talk at Asilomar Microcomputer Workshop

I gave a revised version of Blockchain: What's Not To Like? in the 2019 Asilomar Microcomputer Workshop's Athematic session. Below the fold, the text of the talk with links to the sources. Readers should also consult the "Additional Material" in the original talk, the video of my original presentation, and the podcast interview.

Personal Pods and Fatcat

Sir Tim Berners-Lee's Solid project envisages a decentralized Web in which people control their own data stored in personal "pods":

The basic idea of Solid is that each person would own a Web domain, the "host" part of a set of URLs that they control. These URLs would be served by a "pod", a Web server controlled by the user that implemented a whole set of Web API standards, including authentication and authorization. Browser-side apps would interact with these pods, allowing the user to:

Export a machine-readable profile describing the pod and its capabilities.

Write content for the pod.

Control others' access to the content of the pod.

Pods would have inboxes to receive notifications from other pods. So that, for example, if Alice writes a document and Bob writes a comment in his pod that links to it in Alice's pod, a notification appears in the inbox of Alice's pod announcing that event. Alice can then link from the document in her pod to Bob's comment in his pod. In this way, users are in control of their content which, if access is allowed, can be used by Web apps elsewhere.
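
The Alice and Bob exchange described above can be sketched as a toy model. This is only an illustration under loud assumptions: the `Pod` class, its method names, and the `example` URL scheme are all hypothetical, and none of them come from the Solid specifications, which define real Web APIs with authentication and authorization.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical stand-in for a personal pod: documents plus an inbox."""
    owner: str
    documents: dict = field(default_factory=dict)  # path -> {"text", "links"}
    inbox: list = field(default_factory=list)      # notifications from other pods

    def url(self, path):
        # Assumed URL scheme for illustration: one domain per owner.
        return f"https://{self.owner}.example/{path}"

    def write(self, path, text, links=()):
        """Store a document in this pod and return its URL."""
        self.documents[path] = {"text": text, "links": list(links)}
        return self.url(path)

    def notify(self, source_url, target_url):
        """Record a notification that source_url links to target_url."""
        self.inbox.append({"source": source_url, "target": target_url})

    def link(self, path, target_url):
        """Add an outgoing link from one of this pod's documents."""
        self.documents[path]["links"].append(target_url)


alice = Pod("alice")
bob = Pod("bob")

doc_url = alice.write("paper", "Alice's document")
comment_url = bob.write("comment", "Bob's comment", links=[doc_url])
alice.notify(comment_url, doc_url)  # Bob's pod announces the comment to Alice's pod

# Alice reads her inbox and chooses to link back to Bob's comment.
for note in alice.inbox:
    alice.link("paper", note["source"])
```

The point of the sketch is that each user's content stays in their own pod; only links and notifications cross between pods.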

In his Paul Evan Peters Award Lecture, my friend Herbert Van de Sompel applied this concept to scholarly communication, envisaging a world in which access, for both humans and programs, to all the artifacts of research would be greatly enhanced.

In Herbert's vision, institutions would host their researchers' "research pods", which would be part of their personal domain but would have extensions specific to scholarly communication, such as automatic archiving upon publication.

Follow me below the fold for an update to my take on the practical possibilities of Herbert's vision.

The Demise Of The Digital Preservation Network

Now that I've had a chance to read the Digital Preservation Network (DPN): Final Report, I feel the need to add to my initial reactions in Digital Preservation Network Is No More, which were based on Roger Schonfeld's Why Is the Digital Preservation Network Disbanding?. Below the fold, my second thoughts.

What is Amazon?

In Why It's Hard To Escape Amazon's Long Reach, Paris Martineau and Louise Matsakis have compiled an amazingly long list of businesses that exist inside Amazon's big tent. After it went up, they had to keep updating it as people pointed out businesses they'd missed. In most of those businesses, Amazon's competitors are at a huge disadvantage:

While its retail business is the most visible to consumers, the cloud computing arm, Amazon Web Services, is the cash cow. AWS has significantly higher profit margins than other parts of the company. In the third quarter, Amazon generated $3.7 billion in operating income (before taxes). More than half of the total, $2.1 billion, came from AWS, on just 12 percent of Amazon’s total revenue. Amazon can use its cloud cash to subsidize the goods it ships to customers, helping to undercut retail competitors who don’t have similar adjunct revenue streams.

In the mid-50s my father wrote a textbook, Organisation of retail distribution, with a second edition in the mid-60s. He would have been fascinated by Amazon. I've written about Amazon from many different viewpoints, including storage as a service, and anti-trust, so I'm fascinated with Amazon, too. Now, when you put recent posts by two different writers together, an extraordinarily interesting picture emerges, not just of Amazon but of the risks inherent to the "friction-free" nature of the Web:

Zack Kanter's What is Amazon? is easily the most insightful thing I've ever read about Amazon. It starts by examining how Walmart's "slow AI" transformed retail, continues by describing how Amazon transformed Walmart's "slow AI" into one better suited to the Internet, and ends up with a discussion of how Amazon's "slow AI" seems recently to have made a fundamental mistake.
Izabella Kaminska's series Amazon (sub)Prime? and Amazon (sub)Prime - Part II provides the deep dive to go with Kanter's big picture, looking in detail into one of the many symptoms of the "slow AI's" apparent mistake.

Below the fold, a long meditation on these posts.

Digitized Historical Documents

Josh Marshall of Talking Points Memo trained as a historian. From that perspective, he has a great post entitled Navigating the Deep Riches of the Web about the way digitization and the Web have transformed our access to historical documents. Below the fold, I bestow both praise and criticism.

First We Change How People Behave

Then the system will work the way we want. My skepticism about Level 5 self-driving cars keeps getting reinforced. Below the fold, two recent examples.

The 47 Links Mystery

Nearly a year ago, in All Your Tweets Are Belong To Kannada, I blogged about Cookies Are Why Your Archived Twitter Page Is Not in English. It describes some fascinating research by Sawood Alam and Plinio Vargas into the effect of cookies on the archiving of multi-lingual web-sites.

Sawood Alam just followed up with Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay In Multiple Languages, another fascinating exploration of these effects. Follow me below the fold for some commentary.

FAST 2019

I wasn't able to attend this year's FAST conference in Boston and, reading through the papers, I didn't miss much relevant to long-term storage. Below the fold, a couple of quick notes and a look at the one really relevant paper.

Cost-Reducing Writing DNA Data

In DNA's Niche in the Storage Market, I addressed a hypothetical DNA storage company's engineers and posed this challenge:

increase the speed of synthesis by a factor of a quarter of a trillion, while reducing the cost by a factor of fifty trillion, in less than 10 years while spending no more than $24M/yr.

Now, a company called Catalog plans to demo a significant step in the right direction:

The goal of the demonstration, says Park, is to store 125 gigabytes, ... in 24 hours, on less than 1 cubic centimeter of DNA. And to do it for $7,000.

That would be 1E12 bits (125 gigabytes at 8 bits per byte) for $7E3. At the theoretical maximum of 2 bits/base, that is 5E11 bases, so it would be $1.4E-8 per base, versus last year's estimate of $1E-4, or around 7,000 times better.
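
As a check on this arithmetic, here is a minimal sketch that works directly from the quoted 125 gigabytes and $7,000. It assumes decimal gigabytes (1E9 bytes) and 8 bits per byte, and takes $1E-4 per base as the earlier estimate; the variable names are mine.

```python
GIGABYTE = 1e9                        # assumption: decimal gigabytes
data_bits = 125 * GIGABYTE * 8        # size of the planned demo, in bits
bits_per_base = 2                     # theoretical maximum for DNA
bases = data_bits / bits_per_base     # bases that must be synthesized

cost_dollars = 7_000                  # quoted cost of the demo
cost_per_base = cost_dollars / bases  # dollars per base

previous_estimate = 1e-4              # earlier estimate, dollars per base
improvement = previous_estimate / cost_per_base

print(f"{data_bits:.1e} bits, {bases:.1e} bases")
print(f"${cost_per_base:.1e} per base, about {improvement:,.0f} times cheaper")
```

Under these assumptions the demo writes 1E12 bits as 5E11 bases at $1.4E-8 per base, roughly 7,000 times cheaper than the earlier estimate.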

If the demo succeeds, it marks a major achievement. But below the fold I continue to throw cold water on the medium-term prospects for DNA storage.

Compression vs. Preservation

An archive is in a hardware refresh cycle and they have asked me to comment on concerns arising because their favored storage hardware uses data compression, which it may not be possible to disable even if doing so were a good idea. This is an issue I wrote about two years ago in Threats to stored data.

Because similar concerns keep re-appearing in discussions of digital preservation, I decided this time to discuss it in the same way as Cloud for Preservation, writing a post with a general discussion of the issues without referring to a specific institution. Below the fold, the details.

It's The Enforcement, Stupid!

Kim Stanley Robinson is a remarkable author. In 1990 he concluded his Wild Shore triptych of novels describing alternate futures for California with Pacific Edge:

Pacific Edge (1990) can be compared to Ernest Callenbach's Ecotopia, and also to Ursula K. Le Guin's The Dispossessed. This book's Californian future is set in the El Modena neighborhood of Orange in 2065. It depicts a realistic utopia as it describes a possible transformation process from our present status, to a more ecologically-focused future.

Why am I writing about this now, nearly three decades later? Follow me below the fold for an explanation.

It Isn't Just Cryptocurrency Mining

Izabella Kaminska's Just because it's digital doesn't mean it's green reports on:

A new report by the carbon emission think-tank The Shift Project out this week highlights that not much has changed since [2014]. ICT still contributes to about 4 per cent of global greenhouse gas emissions, which is still twice that of civil aviation. What is worse, its contribution is growing more quickly than that of civil aviation.

Cryptocurrency mining is definitely a problem, but how big a part of the problem isn't clear. It could be quite big. Follow me below the fold for some surprising details.

Demand Is Far From Insatiable

Based on numbers that IDC conjures from thin air, pundits believe that demand for storage is insatiable because everyone says Lets Just Keep Everything Forever In The Cloud. That idea assumes storage is free, but Storage Will Be Much Less Free Than It Used To Be. (Both links are from 2012). Below the fold I look at some real-world numbers showing how much storage actual customers are buying.

Economic Models Of Long-Term Storage

My work on the economics of long-term storage with students at the UC Santa Cruz Center for Research in Storage Systems stopped about six years ago, some time after the funding from the Library of Congress ran out. Last year, to help with some work at the Internet Archive, I developed a much simplified economic model, which runs on a Raspberry Pi.

Two recent developments provide alternative models:

Last year, James Byron, Darrell Long, and Ethan Miller's Using Simulation to Design Scalable and Cost-Efficient Archival Storage Systems (also here) reported on a vastly more sophisticated model developed at the Center. It both includes much more detailed historical data about, for example, electricity cost, and covers various media types including tape, optical, and SSDs.
At the recent PASIG Julian Morley reported on the model being used at the Stanford Digital Repository, a hybrid local and cloud system, and he has made the spreadsheet available for use.

Below the fold some commentary on all three models.

IT Improves Productivity!

In The Productivity Paradox David Rotman writes:

Productivity growth in most of the world’s rich countries has been dismal since around 2004. Especially vexing is the sluggish pace of what economists call total factor productivity—the part that accounts for the contributions of innovation and technology. In a time of Facebook, smartphones, self-driving cars, and computers that can beat a person at just about any board game, how can the key economic measure of technological progress be so pathetic? Economists have tagged this the “productivity paradox.”

Some argue that it’s because today’s technologies are not nearly as impressive as we think. The leading proponent of that view, Northwestern University economist Robert Gordon, contends that compared with breakthroughs like indoor plumbing and the electric motor, today’s advances are small and of limited economic benefit. Others think productivity is in fact increasing but we simply don’t know how to measure things like the value delivered by Google and Facebook, particularly when many of the benefits are “free.”

My view is that IT is only one of the factors driving the decrease of productivity in the general economy, but that there are some areas of the economy in which IT is greatly increasing productivity. An explanation is below the fold.

Cloud For Preservation

Imagine you're responsible for preserving the long-established digital collection at a large research or national library. It is currently preserved in home-grown software, or off-the-shelf software that's been extensively customized, that you are responsible for running on hardware run by your institution's IT department. You are probably not a large customer of theirs. They are probably laying down the law, saying "cloud first", especially as you are looking at a looming hardware refresh. Below the fold, I examine a set of issues that need to be clarified in the decision-making process.

The Economics Of Bitcoin Transactions

Izabella Kaminska's BIS trolls bitcoin reports on analysis of the economics of Bitcoin transactions from Raphael Auer at the Bank for International Settlements. She starts:

Bitcoin aspires to take over the world. But as we all know (according to poorly sourced conspiracy forums), the world is currently run by the Bank of International Settlements (BIS), the central bank to central banks. That means Bitcoin needs to displace the BIS in the near future if it is to get anywhere.

But it takes one to know one.

So here's the dominant global payments system calling out the aspiring global payments system in an excellent piece of professional trolling this week

Auer's is indeed an excellent piece of work. Follow me below the fold for some details.

Facebook's Catch-22

John Herrman's How Secrecy Fuels Facebook Paranoia takes the long way round to come to a very simple conclusion. My shorter version of Herrman's conclusion is this. In order to make money Facebook needs to:

Convince advertisers that it is an effective means of manipulating the behavior of the mass of the population.
Avoid regulation by convincing governments that it is not an effective means of manipulating the behavior of the mass of the population.

The dilemma is even worse because among the advertisers Facebook needs to believe in its effectiveness are individual politicians and political parties, both big advertisers! This Catch-22 is the source of Facebook's continuing PR problems, listed by Ryan Mac. Follow me below the fold for details.

Blockchain Video and Podcast

CNI has now posted the video of my 20-minute talk Blockchain: What's Not To Like? to YouTube and Vimeo.

Gerry Bayne interviewed me at the Fall CNI meeting for CNI's podcast series. The 20-minute conversation is a companion piece to the talk. The podcast is on the Educause SoundCloud channel.

I made one, easily spotted, mistake in the interview when I said $3,000 instead of $300,000. But other than that I'm happy with both the video and the podcast.

Tuesday, January 22, 2019

Trump's Shutdown Impacts Information Access

Government shutdown causing information access problems by James A. Jacobs and James R. Jacobs is important. It documents the effect of the Trump government shutdown on access to globally important information:

Twitter and newspapers are buzzing with complaints about widespread problems with access to government information and data (see for example, Wall Street Journal (paywall 😐 ), ZDNet News, Pew Center, Washington Post, Scientific American, TheVerge, and FedScoop to name but a few).

Maybe when/if the government opens again, we should scrape the NIST and CSRC websites, put all those publications somewhere public. It’s worrying that *every single US cryptography standard* is now unavailable to practitioners.
— Matthew Green (@matthew_d_green) January 12, 2019
Matthew Green, a professor at Johns Hopkins, said “It’s worrying that every single US cryptography standard is now unavailable to practitioners.” He was responding to the fact that he could not get the documents he needed from the National Institute of Standards and Technology (NIST) or its branch, the Computer Security Resource Center (CSRC). The government shutdown is the direct cause of these problems.

They point out how this illustrates the importance of libraries collecting and preserving web-published information:

Regardless of who you (or your user communities) blame for the shutdown itself, this loss of access was entirely foreseeable and avoidable. It was foreseeable because it has happened before. It was avoidable because libraries can select, acquire, organize, and preserve these documents and provide access to them and services for them whether the government is open or shut-down.

Go read the whole thing, and weep for the way libraries have abandoned their centuries-long mission of safeguarding information for future readers.

Thursday, January 10, 2019

Digital Preservation Network Is No More

In Why Is the Digital Preservation Network Disbanding? Roger Schonfeld examines the demise of the Digital Preservation Network which was announced last month:

An initial announcement said directly that "After careful analysis of the Digital Preservation Network's membership, operating model, and finances, the Board of Trustees of DPN passed a resolution to affect an orderly wind-down of DPN," including committing to consultations with each member to ensure that content would not be lost in the wind-down. Shortly thereafter, messages came out from DPN's hubs, both individually including HathiTrust, and collectively, characterizing their operating and financial strength and ability to provide for an orderly transition. Because DPN was not itself directly preserving anything but rather a broker for preservation services by underlying repositories, it does not appear that any content will be put at risk.

Below the fold, I look at various views of the lessons to be learned.

Trust In Digital Content

This is the fourth and I hope final part of a series about trust in digital content that might be called:

Is this the real life?
Is this just fantasy

The series so far has moved down the stack:

The first part was Certificate Transparency, about how we know we are getting content from the Web site we intended to.
The second part was Securing The Software Supply Chain, about how we know we're running the software we intended to, such as the browser that got the content whose certificate was transparent.
The third part was Securing The Hardware Supply Chain, about how we can know that the hardware the software we secured is running on is doing what we expect it to.

Below the fold this part asks whether, even if the certificate, software and hardware were all perfectly secure, we could trust what we were seeing.
