Tuesday, June 27, 2017

Wall Street Journal vs. Google

Chuck McManis and I worked together at Sun Microsystems; he later worked at Google and then built another search engine, Blekko. His contribution to the discussion on Dave Farber's IP list about the dispute between the Wall Street Journal and Google is very informative. Chuck gave me permission to quote liberally from it in the discussion below the fold.


The background to the discussion is that since 2005 Google has provided paywalled news sites with three options:
  1. First click free:
    We've worked with subscription-based news services to arrange that the very first article seen by a Google News user (identifiable by referrer) doesn't require a subscription. Although this first article can be seen without subscribing, any further clicks on the article page will prompt the user to log in or subscribe to the news site. ... A user coming from a host matching [*www.google.*] or [*news.google.*] must be able to see a minimum of 3 articles per day. ... Otherwise, your site will be treated as a subscription site.
  2. Snippets:
    we will display the "subscription" label next to the publication name of all sources that greet our users with a subscription or registration form. ... If you prefer this option, please display a snippet of your article that is at least 80 words long and includes either an excerpt or a summary of the specific article. ... we will only crawl and display your content based on the article snippets you provide.
  3. Robots.txt:
    you could put the subscription content under one section of your site, and apply robots.txt to that section. In this case, Googlebot would not crawl or index anything on that section, not even for snippets.
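The First Click Free rule above boils down to two checks a publisher's server makes: did the reader arrive from a host matching Google's patterns, and have they already used their free articles today? A minimal sketch, assuming the publisher matches the Referer host against the wildcard patterns quoted above and keeps a per-reader daily counter (the function names and counter mechanism are illustrative, not the WSJ's actual implementation):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Patterns quoted from Google's policy above; fnmatch does
# shell-style wildcard matching against the referring host.
GOOGLE_PATTERNS = ["*www.google.*", "*news.google.*"]
FREE_ARTICLES_PER_DAY = 3  # the minimum the policy requires

def came_from_google(referrer: str) -> bool:
    """True if the referring host matches one of Google's patterns."""
    host = urlparse(referrer).netloc
    return any(fnmatch(host, pat) for pat in GOOGLE_PATTERNS)

def allow_free_view(referrer: str, articles_read_today: int) -> bool:
    """Serve the article without a subscription prompt?"""
    return (came_from_google(referrer)
            and articles_read_today < FREE_ARTICLES_PER_DAY)
```

Note that the per-reader counter is exactly the "meter" the WSJ tracked with cookies, which is where the abuse described below comes in.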
Until recently, the WSJ took the first option. Google News readers could see three articles a day free, and the WSJ ranked high in Google's search results. Then:
The Journal decided to stop letting people read articles free from Google after discovering nearly 1 million people each month were abusing the three-article limit. They would copy and paste Journal headlines into Google and read the articles for free, then clear their cookies to reset the meter and read more,
The result was:
the Wall Street Journal’s subscription business soared, with a fourfold increase in the rate of visitors converting into paying customers. But there was a trade-off: Traffic from Google plummeted 44 percent.
In the great Wall Street tradition of "Greed is Good", the WSJ wasn't satisfied with a "fourfold increase in ... paying customers". They wanted to have their cake and eat it too:
Executives at the Journal, owned by Rupert Murdoch's News Corp., argue that Google's policy is unfairly punishing them for trying to attract more digital subscribers. They want Google to treat their articles equally in search rankings, despite being behind a paywall.
We're the Wall Street Journal, the rules don't apply to us!

There were comments on the IP list to the effect that it would be easy for the WSJ to exempt the googlebot from the paywall. Chuck's post pointed out that things were a lot more complex than they appeared. He explained the context:
First (and perhaps foremost) there is a big question of business models, costs, and value. ... web advertising generates significantly (as in orders of magnitude) less revenue [for newspapers] than print advertising. Subscription models have always had 'leakage' where content was shared when a print copy was handed around (or lended in the case of libraries), content production costs (those costs that don't include printing and distribution of printed copies) have gone up, and information value (as a function of availability) has gone down. ... publications like the Wall Street journal are working hard to maximize the value extracted within the constraints of the web infrastructure.

Second, there is a persistent tension between people who apply classical economics to the system and those who would like to produce a financially viable work product.

And finally, there is a "Fraud Surface Area" component that is enabled by the new infrastructure that is relatively easily exploited without a concomitant level of risk to the perpetrators.
Chuck explains the importance to Google of fraud prevention, and one way they approach the problem:
Google is a target for fraudsters because subverting its algorithm can enable advertising click fraud, remote system compromise, and identity theft. One way that arose early on in Google's history were sites that present something interesting when the Google Crawler came through reading the page, something malicious when an individual came through. The choice of what to show in response to an HTTP protocol request was determined largely from meta-data associated with the connection such as "User Agent", "Source Address", "Protocol options", and "Optional headers." To combat this Google has developed a crawling infrastructure that will crawl a web page and then at a future date audit that page by fetching it from an address with metadata that would suggest a human viewer. When the contents of a page change based on whether or not it looks like a human connection, Google typically would immediately dump the page and penalize the domain in terms of its Page Rank
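The audit Chuck describes amounts to fetching the same URL twice, once with crawler metadata and once with browser-like metadata, and comparing the results. A hedged sketch of the comparison step only (the threshold, helper name, and use of a simple similarity ratio are my assumptions, not Google's actual implementation):

```python
import difflib

def looks_cloaked(crawler_html: str, browser_html: str,
                  threshold: float = 0.8) -> bool:
    """Flag a page whose crawler-visible and human-visible
    versions differ substantially, suggesting cloaking."""
    ratio = difflib.SequenceMatcher(None, crawler_html, browser_html).ratio()
    return ratio < threshold
```

In practice the two fetches would also come from different IP ranges with different User-Agent, Accept, and other headers, since those are exactly the metadata Chuck says cloakers key on; only the content comparison is shown here.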
But surely the WSJ isn't a bad actor, and isn't it important enough for Google to treat differently from run-of-the-mill sites?
Google is also a company that doesn't generally like to put "exemptions" in for a particular domain. They have had issues in the past where an exemption was added and then the company went out of business and the domain acquired by a bad actor who subsequently exploited the exemption to expose users to malware laced web pages. As a result, (at least as of 2010 when I left) the policy was not to provide exceptions and not to create future problems when the circumstances around a specific exemption might no longer apply. As a result significant co-ordination between the web site and Google is required to support anything out of the ordinary, and that costs resources which Google is not willing to donate to solve the web site's problems.
So this is a business, not a technical issue. There is no free lunch; Google isn't going to do work for the WSJ without getting paid:
both Google and the [WSJ] are cognizant of the sales conversion opportunity associated with a reader *knowing* because of the snippet that some piece of information is present in the document, and then being denied access to that document for free. It connects the dots between "there is something here I want to know" and "you can pay me now and I'll give it to you." As a result, if Google were to continue to rank the WSJ article into the first page of results it would be providing a financial boost to the WSJ and yet not benefiting itself financially at all.

The bottom line is, as it usually is, that there is a value here and the market maker is unwilling to cede all of it to the seller. Google has solved this problem with web shopping sites by telling them they have to pay Google a fee to appear in the first page of results, no doubt if the WSJ was willing to pay Google an ongoing maintenance fee Google would be willing to put the WSJ pages back into the first page of results (even without them being available if you clicked on them).
Chuck explains the three ways you can pay Google:
As has been demonstrated in the many interactions between Google and the newspapers of the world, absent any externally applied regulation, there are three 'values' Google is willing to accept. You can give Google's customers free access to a page found on Google (the one click free rule) which Google values because it keeps Google at the top of everyone's first choice for searching for information. Alternatively you can allow only Google advertising on your pages which Google values because it can extract some revenue from the traffic they send your way. Or you can just pay Google for the opportunity to be in the set of results that the user sees first.
Interestingly, what I didn't see in the discussion on the IP list was the implication of:
discovering nearly 1 million people each month were abusing the three-article limit. ... clear their cookies to reset the meter and read more,
which is that the WSJ's software is capable of detecting that a reader has cleared cookies after reading their three articles. One way to do so is via browser fingerprinting, but there are many others. If the WSJ can identify readers who cheat in this way, they could be refused content. Google would still see that "First Click Free" readers saw their three articles per day, so it would continue to index and drive traffic to the WSJ. But, clearly, the WSJ would rather whine about Google's unfairness than implement a simple way to prevent cheating.
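The countermeasure suggested above can be sketched simply: when the meter cookie is missing, fall back to a coarse fingerprint built from request metadata, so clearing cookies no longer resets the count. The fields chosen and the class names are illustrative assumptions; real browser fingerprinting draws on many more signals.

```python
import hashlib

def fingerprint(ip: str, user_agent: str, accept_language: str) -> str:
    """A pseudonymous reader ID that survives cookie clearing,
    built by hashing metadata the browser sends anyway."""
    raw = "|".join([ip, user_agent, accept_language])
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

class ArticleMeter:
    """Count free article views per reader ID per day."""
    def __init__(self, limit: int = 3):
        self.limit = limit
        self.counts: dict[str, int] = {}

    def allow(self, reader_id: str) -> bool:
        """Record a view; False once the free limit is exhausted."""
        n = self.counts.get(reader_id, 0)
        if n >= self.limit:
            return False
        self.counts[reader_id] = n + 1
        return True
```

A reader who clears cookies still presents the same IP address and headers, so `fingerprint()` returns the same ID and the meter keeps counting, while Google continues to see First Click Free honored for honest readers.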

1 comment:

David. said...

Chuck points me to the EU's anti-trust penalty against Google for skewing search results, suggesting that News Corp. could use this judgment to force Google to favor its articles. But I disagree with Chuck. The penalty was imposed because Google favored its own price comparison service over competitors; Google is being forced to treat large, small and its own price comparison services equally. News Corp. wants Google to treat the WSJ, which is not a Google property, more favorably than its competitors. That would run counter to the rationale of the judgment, which is in any case still subject to appeal.