Thursday, October 25, 2018

Syndicating Journal Publisher Content

There's a lot of good information in Roger Schonfeld's Will Publishers Syndicate Their Content? It starts:
The scholarly publishing sector has struggled to address the problems that users face in their discovery-to-access workflow and thereby stave off skyrocketing piracy. The top-line impact of these struggles is becoming clearer, starting with Elsevier’s absence from Germany. This makes the efforts to establish seamless single-platform access to all scholarly publications — equal in extent as Sci-Hub but legitimate, and which I term a Supercontinent of Scholarly Publishing — all the more urgent. The technical solutions are challenging, and at the STM meeting in Frankfurt last week it became clear that, although progress is being made, policy, governance, and competition issues may complicate the drive to consensus.
Schonfeld asserts that providing a seamless, uniform view of the publisher's content, whether paywalled or open access, requires two services:
First, it requires an ability to authorize appropriate access in a decentralized distribution environment. A Shared Entitlements System, as it is sometimes called, would be a kind of common authorization service for all publishers. As I will discuss below, there are at least two options for how Entitlements can be addressed. Second, it requires Distributed Usage Logging, which is to say the ability for all usage, wherever it takes place, to be “counted” in measuring the value of articles on behalf of authors and licenses on behalf of publishers.
Below the fold, a rather long explanation of why I think Schonfeld's analysis doesn't go far enough.

As regards Distributed Usage Logging (DUL), which is in effect an enhanced version of the existing COUNTER system, Schonfeld writes:
DUL as it is being implemented seems to be in every publisher’s interests and should become the norm rapidly. From discussions with leaders at Springer Nature and Wiley, I am confident that the underlying DUL approach has broad support and will be widely adopted by other large publishers without delay.
On this, I think he's right, because publishers already supply content and receive usage reports through aggregators. The DUL system that CrossRef is implementing doesn't have a big impact on the publisher's business.

The Shared Entitlements System (SES) is more interesting, especially since the LOCKSS team worked with EDINA to build such a system (from the subscribers' side, not the publishers') for the UK as part of our National Hosting efforts.

Schonfeld identifies two fundamentally different models of SES:
  • Upon being “approved” as entitled, a user could be routed to the publisher site for seamless article access. In this scenario, SES is little more than an improved linking experience. From my perspective, improved linking is the less ambitious option, and likely to meet a far smaller share of user needs.
  • Or, instead, the platform could provide access directly on its site and through DUL ensure that appropriate “credit” is provided. The latter option would require some kind of mechanism for content distribution to platform partners — content syndication — which in turn requires publishers to place a much higher degree of trust in the platform provider. Even moreso, content syndication requires that publishers abandon the idea that they can capture all the value on their own sites and instead radically improve the distribution system for their publications. For users, content syndication is the more valuable approach, because it can ultimately provide for them most of the benefits of having a single user account for all scholarship.
He's right that the linking model isn't what users want. It would be like Google Scholar but without the need to log in to the publishers' sites, which was the on-campus user experience in the good old days of IP address authentication. From the user experience point of view, this SES would be marginal at best. From the publisher point of view, it would pose no threat to their business model.

Schonfeld is also right that the experience users want requires publishers to provide their content to platform partners so that it can be delivered via the platform's user experience not the publisher's. He identifies a number of reasons why this is difficult for publishers, but I think Schonfeld misses perhaps the most important reason why publishers will be extremely reluctant to allow access to their content on partners' platforms.

In March 2015 I wrote Journals Considered Even More Harmful, riffing on Eric Hellman's important post entitled 16 of the top 20 Research Journals Let Ad Networks Spy on Their Readers. Hellman checked Web pages of 20 top journals. Four had one tracker (Google Analytics) and the remaining 16 had at least one advertising network and at least three trackers. NEJM had multiple advertising networks and fourteen trackers.

I totally missed Hellman's update of this post last year, Reader Privacy for Research Journals is Getting Worse:
The two Annual Reviews journals I looked at, which were among the few that did not expose users to advertising network tracking, now have trackers for AddThis and Doubleclick. The New England Journal of Medicine, which deployed the most intense reader tracking of the 20, is now even more intense, with 19 trackers on a web page that had "only" 14 trackers two years ago.
In September 2015 Maciej Cegłowski's must-read What Happens Next Will Amaze You provided a highly readable explanation of what is going on with the advertising networks:
Ad networks appeared. Publishers no longer had to sell all those empty rectangles themselves. They could just ask Google to fill them, and AdSense would dynamically match the space to available ads at the moment the page was loaded.
And trackers:
Soon the web was infested with all manner of trackers, beacons, pixels, tracking cookies and bugs. Companies learned to pool their data so they could follow customers across many sites. They created user profiles of everyone using the web. They could predict when a potential customer was going to do something expensive, like have a baby or get married, and tailor ads specifically to them.

They learned to notice when people put things in a shopping cart and then failed to buy them, so they could entice them back with special offers. They got better at charging different prices to people based on what they could afford—the dream of every dead-eyed economist since the dawn of the profession.
NEJM home page 10/19/18
The readership of academic journals, especially medical journals, is extremely valuable to advertisers. Here, for example, is the top of the NEJM's home page with two pharmaceutical ads:
  • Vascepa from Amarin Pharma Inc:
    Icosapent ethyl is used together with lifestyle changes (diet, weight loss, exercise) to reduce the amount of triglycerides (a fat-like substance) in your blood. Icosapent ethyl is in a class of medications called antilipemic or lipid-regulating agents. Icosapent ethyl may work by decreasing the amount of triglycerides and other fats made in the liver.
  • Nucala from GSK:
    Mepolizumab injection is used along with other medications to prevent wheezing, difficulty breathing, chest tightness, and coughing caused by asthma in children 12 years and older and adults whose asthma is not controlled with their current asthma medication. Mepolizumab injection is in a class of medications called monoclonal antibodies. It works by blocking the action of a certain natural substance in the body that causes the symptoms of asthma.
When I clicked on the link to the NEJM home page, advertisers bid in real-time auctions to get their content into these two slots. When someone else clicked, there would have been a different auction, with different winners. Why did I see Vascepa and Nucala?

Neither Vascepa nor Nucala is a medication I take. Both are related in some way to conditions I have researched on the Web, but not from this browser or this computer. Until I wrote this post I had never searched for either their trade or their generic names. I hadn't visited NEJM in years and never from this browser or this computer. Despite all this, how did Amarin and GSK know that it was worth bidding enough to win the auction to have their ads inserted into the only two advertising slots on the NEJM front page when I viewed it?

The answer has to do with the trackers infesting journal Web sites, and indeed all Web sites. In Web Tracking – A Literature Review on the State of Research Tatiana Ermakova et al. define Web tracking as:
a widespread Internet technique that collects user data for purposes of online advertisement, user authentication, content personalization, advanced website analytics, social network integration, and website development ... For these goals, web tracking allows third-party or first-party websites to keep track of users’ browsing behavior, including browsing configuration and history.
And provide a high-level description of how it works:
A user accesses websites from a local device through an Internet Service Provider (ISP). Websites and ISPs may include tracking technology, either in-house or provided by third parties that provide tracking services for multiple sites, which enables cross-site tracking and data aggregation of individual browsing habits and interests. If the user switches to a different device or moves to another location, cross-device tracking and mobile tracking can be applied. Tracking data is often used for targeted advertising. This has created background markets for programmatic advertising, including real-time bidding for available advertising slots on the websites that are displayed to the user. Large-scale data aggregators and other data consumers are also interested to gather tracking and browsing data to enrich data profiles on individual web users.
Amarin and GSK's Web ad placement software has purchased access to the comprehensive profiles of me (and every other Web user) compiled by tracking companies, whose trackers have been placed on as many Web sites as possible.
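
To make the mechanism concrete, here is a minimal sketch in Python of how a profile can decide which ads win the slots on a page. Everything in it is an assumption made for illustration: the Bidder class, the run_auction function, the interest keywords and the bid amounts are all hypothetical, and real ad exchanges use far more elaborate protocols and pricing rules. The only point is that the same page produces different winners for different visitors, because the bids depend on the visitor's profile rather than on the page.

```python
from dataclasses import dataclass, field


@dataclass
class Bidder:
    """A hypothetical advertiser's bidding agent."""
    name: str
    # Interest keyword -> dollars the advertiser will bid per impression
    # when that keyword appears in the visitor's profile (made-up values).
    interests: dict = field(default_factory=dict)
    base_bid: float = 0.001  # bid for a visitor about whom nothing matches

    def bid(self, profile_interests):
        matched = [amount for keyword, amount in self.interests.items()
                   if keyword in profile_interests]
        return max(matched, default=self.base_bid)


def run_auction(bidders, profile_interests, slots=1):
    """Return the (name, bid) pairs of the highest bidders, one per ad slot.

    Simplification: the highest bids win and pay what they bid.
    """
    bids = [(b.name, b.bid(profile_interests)) for b in bidders]
    bids.sort(key=lambda pair: pair[1], reverse=True)
    return bids[:slots]


bidders = [
    Bidder("lipid_drug", {"triglycerides": 0.05}),
    Bidder("asthma_drug", {"asthma": 0.04}),
    Bidder("generic_brand", {}, base_bid=0.002),
]

# Same page, two visitors with different tracked interests.
print(run_auction(bidders, {"triglycerides", "asthma"}, slots=2))
print(run_auction(bidders, {"gardening"}, slots=2))
```

In this toy version the first visitor's profile draws the two pharmaceutical bidders into the two slots, while the second visitor, whose profile matches nothing they care about, mostly gets the generic advertiser.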

I've written before about how much of the money spent on Web ads goes to fraud. Once again Maciej Cegłowski explains:
Today we live in a Blade Runner world, with ad robots posing as people, and Deckard-like figures trying to expose them by digging ever deeper into our browsers, implementing Voight-Kampff machines in Javascript to decide who is human. We're the ones caught in the middle. The ad networks' name for this robotic deception is 'ad fraud' or 'click fraud'. (Advertisers like to use moralizing language when their money starts to flow in the wrong direction. Tricking people into watching ads is good; being tricked into showing ads to automated traffic is evil.)

Ad fraud works because the market for ads is so highly automated. Like algorithmic trading, decisions happen in fractions of a second, and matchmaking between publishers and advertisers is outside human control. It's a confusing world of demand side platforms, supply-side platforms, retargeting, pre-targeting, behavioral modeling, real-time bidding, ad exchanges, ad agency trading desks and a thousand other bits of jargon. Because the payment systems are also automated, it's easy to cash out of the game. And that's how the robots thrive.
In some cases, advertisers require trackers on sites carrying their ads to check on fraud. But in other cases tracking companies pay Web sites to include their trackers, because their audience generates especially valuable information. Academic journals, and especially biomedical journals, have such an audience. It is affluent but, more important, in many cases it is looking for information about medical conditions from which the pharma companies can reap obscene profits.

The income from ads and trackers accrues to the Web site owner. Were academic journal content syndicated, it would accrue to the site showing the syndicated content, not to the journal publisher. I expect that the income from trackers and ads is much more significant to the major academic publishers than generally realized, and that this is a huge disincentive to syndication.

Update: I really need to track Eric Hellman's blog better. A year ago last August he wrote PubMed Lets Google Track User Searches, which starts:
If you search on Google for "Best Mesothelioma Lawyer" and then click on one of the ads, Google can earn as much as a thousand dollars for your click. In general, Google can make a lot of money if it knows you're the type of user who's interested in rare types of cancer. So you might be surprised that Google gets to know everything you search for when you use PubMed, the search engine offered by the National Center for Biotechnology Information (NCBI), a service of the National Library of Medicine (NLM) at the National Institutes of Health (NIH).
Hellman writes:
As I've previously written, Google Analytics only tracks users across websites if the site-per-site tracker IDs can be connected to a global tracker ID like the ones used by DoubleClick. What NLM is allowing Google to do is to connect the Google Analytics user data to the DoubleClick user data. So Google's advertising business gets to use all the Google Analytics data, and the Analytics data provided to NLM can include all the DoubleClick "demographic and interest" data.
And while it's true that "the demographic and interest data" of PubMed visitors cannot be used to identify them as individuals, the data collected by the Google trackers can trivially be used to identify as individuals any PubMed users who have Google accounts.
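
The ID linking Hellman describes can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not a description of how Google's systems actually work: the merge_profiles function, the event tuples and the link table are all hypothetical. It shows only why the link matters. Per-site IDs on their own yield isolated per-site histories, but once a per-site ID can be mapped to a global ID, the histories collapse into one cross-site profile.

```python
from collections import defaultdict


def merge_profiles(site_events, id_links):
    """Group per-site events under a global ID wherever a link exists.

    site_events: iterable of (site, site_local_id, topic) tuples.
    id_links: dict mapping (site, site_local_id) -> global_id.
    Events whose per-site ID has no link stay out of every global profile.
    """
    profiles = defaultdict(set)
    for site, local_id, topic in site_events:
        global_id = id_links.get((site, local_id))
        if global_id is not None:
            profiles[global_id].add((site, topic))
    return dict(profiles)


# Hypothetical events: the same person seen under different per-site IDs.
events = [
    ("medical-search.example", "a1", "rare cancer"),
    ("news.example", "b7", "retirement planning"),
    ("shop.example", "c3", "running shoes"),
]

# Without links, no cross-site profile can be assembled.
print(merge_profiles(events, {}))

# With two of the per-site IDs linked to one global ID, it can.
links = {
    ("medical-search.example", "a1"): "G-42",
    ("news.example", "b7"): "G-42",
}
print(merge_profiles(events, links))
```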
According to Goldman Sachs's estimate:
Google could pay Apple $9 billion in 2018, and $12 billion in 2019
I wonder what they pay NCBI?


David. said...

See how out-of-touch I've become. I'm a big fan of Maciej Cegłowski's talks, but I had no idea he had done this:

"Maciej Ceglowski, who runs a grassroots organization called Tech Solidarity that aims to connect tech workers with their communities, cobbled together the Great Slate after meeting Jess King, a candidate running for office in Lancaster County, Pennsylvania. Ceglowski was struck by King’s approach: a fieldwork-focused, populist campaign that goes door-to-door and aims at voter expansion.

After successfully raising money for her campaign from the tech community, he began to look through FEC filings for other candidates like her in districts with similar political leanings. Ceglowski, who is perhaps best known for creating the bookmarking site Pinboard, originally pitched the candidates individually to tech workers. Many in the community either know Ceglowski personally or are familiar with him through his quirky internet presence, and it became apparent that donors preferred to give money across the board to all his picks. They were willing to trust candidates who he had met with and personally chosen, and so the Great Slate was born.

The Great Slate siphons money to candidates through ActBlue, a Democrat-affiliated campaign payment processing service. Campaign finance laws mean that when you donate to the Great Slate’s ActBlue page, you’re asked to then enter separate amounts for each individual campaign. Ceglowski has no control over how the funds are divvied up. But the donors we spoke to tended to donate evenly to the candidates, seeing the Great Slate as a larger cause than just individual candidates.

“There are a lot of people who work in tech who are very progressive-minded, who believe in a strong social safety net, who have very different political beliefs [from the libertarians], and they don’t have a face when we talk about the tech industry,” says Ceglowski."

The Great Slate is a great idea.

David. said...

Cegłowski's explanation of the Great Slate is here.