Tuesday, June 19, 2018

The Four Most Expensive Words in the English Language

There are currently a number of attempts to deploy a cryptocurrency-based decentralized storage network, including MaidSafe, FileCoin, Sia and others. Distributed storage networks have a long history, and decentralized, peer-to-peer storage networks a somewhat shorter one. None have succeeded; Amazon's S3 and all other successful network storage systems are centralized.

Despite this history, initial coin offerings for these nascent systems have raised incredible amounts of "money", if you believe the heavily manipulated "markets". According to Sir John Templeton, the four words are "this time is different". Below the fold I summarize the history, then ask what is different this time, and how expensive it is likely to be.

The idea that the edge of the Internet has vast numbers of permanently connected, mostly-empty hard disks that could be corralled into a peer-to-peer storage system that was free, or at least cheap, while offering high reliability and availability has a long history. The story starts:

The realization that networked personal computers need a shared, remote file system in addition to their local disk, like many things, starts with the Xerox Alto and its Interim File Server, designed and implemented by David R. Boggs and Ed Taft in the late 70s. As IP networking started to spread in the early 80s, CMU's Andrew project started work in 1983 on the Andrew file system, followed in 1984 by Sun's work on NFS (RFC1094). Both aimed to provide a Unix-like file system API to processes on client computers, implemented by a set of servers. This API was later standardized by POSIX.

Both the Andrew File System and NFS started from the idea that workstation disks were small and expensive, so the servers would be larger computers with big disks, but NFS rapidly became a way that even workstations could share their file systems with each other over a local area network. In the early 90s people noticed that workstation CPUs were idle a lot of the time, and together with the shared file space this spawned the idea of distributing computation across the local network:
The workstations were available more than 75% of the time observed. Large capacities were steadily available on an hour to hour, day to day, and month to month basis. These capacities were available not only during the evening hours and on weekends, but during the busiest times of normal working hours.
By the late 90s the size of workstation and PC disks had increased and research, for example at Microsoft, showed these disks were also under-utilized:
We found that only half of all disk space is in use, and by eliminating duplicate files, this usage can be significantly reduced, depending on the population size. Half of all machines are up and accessible over 95% of the time, and machine uptimes are randomly correlated. Machines that are down for less than 72 hours have a high probability of coming back up soon. Machine lifetimes are deterministic, with an expected lifetime of around 300 days. Most machines are idle most of the time, and CPU loads are not correlated with the fraction of time a machine is up and are weakly correlated with disk loads.
This gave rise to the idea that the free space in workstation disks could be aggregated, first into a local network file system, and then into a network file system that spanned the Internet. Intermemory (also here), from NEC's Princeton lab in 1998, was one of the first, but there have been many others, such as Berkeley's Oceanstore (project papers) from 2000.

A true peer-to-peer architecture would eliminate the central organization and was thought to have many other advantages. In the early 2000s this led to a number of prototypes, including FARSITE, PAST/Pastiche and CFS, based on the idea of symmetry; peers contributed as much storage to the network as they consumed at other peers:
In a symmetric storage system, node A stores data on node B if and only if B also stores data on A. In such a system, B can periodically check to see if its data is still held by A, and vice versa. Collectively, these pairwise checks ensure that each node contributes as it consumes, and some systems require symmetry for exactly this reason [6, 18].
(NB - replication meant that the amount of storage consumed was greater than the amount of data stored. Peers wanting reliability had to build their own replication strategy by symmetrically storing data at multiple peers.)

These systems were vulnerable to the problem that afflicted Gnutella, Napster and other file-sharing networks, that peers were reluctant to contribute, and lied about their resources. The Samsara authors wrote:
Several mechanisms to compel storage fairness have been proposed, but all of them rely on one or more features that run counter to the goals of peer-to-peer storage systems. Trusted third parties can enforce quotas and certify the rights to consume storage [23] but require centralized administration and a common domain of control. One can use currency to track the provision and consumption of storage space [16], but this requires a trusted clearance infrastructure. Finally, certified identities and public keys can be used to provide evidence of storage consumption [16, 21, 23], but require a trusted means of certification. All of these mechanisms require some notion of centralized, administrative overhead—precisely the costs that peer-to-peer systems are meant to avoid.
Samsara from 2003 was a true peer-to-peer system which:
enforces fairness in peer-to-peer storage systems without requiring trusted third parties, symmetric storage relationships, monetary payment, or certified identities. Each peer that requests storage of another must agree to hold a claim in return---a placeholder that accounts for available space. After an exchange, each partner checks the other to ensure faithfulness. Samsara punishes unresponsive nodes probabilistically. Because objects are replicated, nodes with transient failures are unlikely to suffer data loss, unlike those that are dishonest or chronically unavailable.
As far as I know Samsara never got into production use.

At the same time Brian Cooper and Hector Garcia-Molina proposed an asymmetric system of "bid trading":
a mechanism where sites conduct auctions to determine who to trade with. A local site wishing to make a copy of a collection announces how much remote space is needed, and accepts bids for how much of its own space the local site must "pay" to acquire that remote space. We examine the best policies for determining when to call auctions and how much to bid, as well as the effects of "maverick" sites that attempt to subvert the bidding system. Simulations of auction and trading sessions indicate that bid trading can allow sites to achieve higher reliability than the alternative: a system where sites trade equal amounts of space without bidding.
The mechanisms these systems developed to enforce symmetry or trading were complex, and it was never really clear that they were proof against attack, because they were never deployed at enough scale to get attacked.

The API exported by services like these falls into one of two classes:
  • The "file system and object store" model, in which the client sees a single service provider. The service decides which peer stores what; the client has no visibility into where the data lives.
  • The "storage marketplace" model, in which the client sees offers from peers to store data at various prices, whether in space or cash. The client chooses where to store what.
The "file system and object store" model has a significant advantage. Because the client transfers data to and from the service, the service can divide the data into shards and use erasure coding to deliver reliability at a low replication factor. In the "storage marketplace" model the client transfers data to and from the peer it decides to buy from; a client needing reliability has to buy service from multiple peers and shard the data across them itself, greatly increasing the complexity of using the service. In principle, in the "file system and object store" model the service can run an internal market, purchasing storage from the most competitive peers.
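To make "shard the data across them itself" concrete, here is a minimal sketch of the client-side work, in Python. It uses a single XOR parity shard rather than a real erasure code such as Reed-Solomon, and the data, shard count and peer behavior are all hypothetical; the point is only to show the machinery a marketplace client must carry itself, machinery that a "file system and object store" service hides behind its API.

```python
# Toy client-side sharding: k data shards plus one XOR parity shard.
# Any single missing shard can be rebuilt; real systems use Reed-Solomon
# codes so that any m of n shards suffice, but the client-side burden is
# the same idea.

def make_shards(data: bytes, k: int):
    """Split data into k equal-size data shards plus one XOR parity shard."""
    shard_len = -(-len(data) // k)            # ceiling division
    padded = data.ljust(k * shard_len, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = bytearray(shard_len)
    for s in shards:
        for i, b in enumerate(s):
            parity[i] ^= b
    return shards + [bytes(parity)]           # n = k + 1 shards, one per peer

def recover(shards, missing: int):
    """Rebuild the one missing shard by XORing all surviving shards."""
    shard_len = len(next(s for s in shards if s is not None))
    rebuilt = bytearray(shard_len)
    for idx, s in enumerate(shards):
        if idx == missing:
            continue
        for i, b in enumerate(s):
            rebuilt[i] ^= b
    return bytes(rebuilt)

original = b"some archival data the client wants stored reliably"
shards = make_shards(original, k=4)           # 5 shards; survives loss of any one peer
shards[2] = None                              # the peer holding shard 2 disappears
shards[2] = recover(shards, missing=2)
assert b"".join(shards[:4]).rstrip(b"\0") == original
```

A real client would also have to remember which peer holds which shard, detect unresponsive peers, and re-shard onto new ones.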

If at first you don't succeed ...

Why didn't Intermemory, Oceanstore, FARSITE, Pastiche, CFS, Samsara and all the others succeed? Four years ago I identified a number of reasons:
  • Their model of the edge of the Internet was that there were a lot of desktop computers, continuously connected and powered-up, with low latency and no bandwidth charges, and with 3.5" hard disks that were mostly empty. Since then, the proportion of the edge with these characteristics has become vanishingly small.
  • In many cases, for example Samsara, the idea was that participants would contribute disk space and, in return, be entitled to store data in the network. Mechanisms were needed to enforce this trade, to ensure that peers actually did store the data they claimed to, and these mechanisms turned out to be hard to make attack-resistant.
  • Even if erasure coding were used to reduce the overall replication factor, it would still be necessary for participants to contribute significantly more space than they could receive in return. And the replication factor would need to be higher than in a centrally managed storage network.
  • I don't remember any of the peer-to-peer systems in which participants could expect a monetary reward. In the days when storage was thought to be effectively free, why would participants need to be paid? Alas, storage is a lot less free than it used to be.
Now I can add two more:
  • The centralized systems such as Intermemory and Oceanstore never managed to set up the administrative and business mechanisms to channel funds from users to storage service suppliers, let alone the marketing and sales needed to get users to pay.
  • The idea that peer-to-peer technology could construct a reliable long-term storage infrastructure from a self-organizing set of unreliable, marginally motivated desktops wasn't persuasive. And in practice it is really hard to pull off.

Bandwidth and hard disk space may be cheap, but they aren't free.

Both Intermemory and Oceanstore were proposed as subscription services; users paid a monthly fee to a central organization that paid for the network of servers. In practice the business of handling these payments never emerged. The symmetric systems used a "payment in kind" model to avoid the need for a business of this kind.

The idea that the Internet would enable automated "micro-payments" has a history as long as that of distributed storage, but I won't recount it. Had there been a functional micro-payment system it is possible that a distributed or even decentralized storage network could have used it and succeeded. Of course, Clay Shirky had pointed out the reason there wasn't a functional Internet micro-payment system back in 2000:
The Short Answer for Why Micropayments Fail

Users hate them.

The Long Answer for Why Micropayments Fail

Why does it matter that users hate micropayments? Because users are the ones with the money, and micropayments do not take user preferences into account.
One of Satoshi Nakamoto's critiques of existing payment systems when he proposed Bitcoin was that they were incapable of micro-payments. Alas, Bitcoin has turned out to be incapable of micro-payments as well. But as Bitcoin became popular, in 2014 a team at Microsoft and U. Maryland proposed:
a modification to Bitcoin that repurposes its mining resources to achieve a more broadly useful goal: distributed storage of archival data. We call our new scheme Permacoin. Unlike Bitcoin and its proposed alternatives, Permacoin requires clients to invest not just computational resources, but also storage. Our scheme involves an alternative scratch-off puzzle for Bitcoin based on Proofs-of-Retrievability (PORs). Successfully minting money with this SOP requires local, random access to a copy of a file. Given the competition among mining clients in Bitcoin, this modified SOP gives rise to highly decentralized file storage, thus reducing the overall waste of Bitcoin.
This wasn't clients directly paying for storage; the funds for storage came from the mining rewards and transaction fees. And, of course, the team were behind the times. Already by 2014 the Bitcoin mining network wasn't really decentralized.
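To convey the flavor of a scratch-off puzzle that "requires local, random access to a copy of a file", here is a toy challenge-response sketch. It is not the Permacoin construction, which ties Merkle-committed Proofs-of-Retrievability to the miner's key so the verifier does not need the data; every name and parameter below is illustrative.

```python
# Toy "prove you hold the file blocks" exchange: the challenge picks block
# indices the prover cannot predict, so answering requires random access to
# a locally stored copy of the file.
import hashlib, os, random

BLOCK_SIZE = 4096

def split_blocks(data: bytes):
    """Split the stored file into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def respond(blocks, challenge: bytes, samples: int = 4) -> bytes:
    """Prover: derive block indices from the challenge, hash the selected blocks."""
    rng = random.Random(challenge)          # indices unpredictable before the challenge
    indices = [rng.randrange(len(blocks)) for _ in range(samples)]
    h = hashlib.sha256(challenge)
    for i in indices:
        h.update(blocks[i])                 # forces random access to the local copy
    return h.digest()

data = os.urandom(64 * BLOCK_SIZE)          # the archival file a miner claims to store
blocks = split_blocks(data)
challenge = os.urandom(32)                  # e.g. derived from a recent block header
assert respond(blocks, challenge) == respond(blocks, challenge)
# A real verifier checks Merkle proofs against a small commitment instead of
# recomputing from the full data, which is what a POR scheme provides.
```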

After that long preamble, we can get to the question: What is different about the current rash of cryptocurrency-based storage services from the long line of failed predecessors? There are two big differences.

The first is the technology, not so much the underlying storage technology as the business and accounting technology that is intended to implement a flourishing market of storage providers. The fact that these services are addressing the problem of a viable business model for providers in a decentralized storage market is a good thing. The lack of a viable business model is a big reason why none of the predecessors succeeded.

Since the key property of a cryptocurrency-based storage service is a lack of trust in the storage providers, Proofs of Space and Time are required. As Bram Cohen has pointed out, this is an extraordinarily difficult problem at the very frontier of research. No viable system has been deployed at scale for long enough for reasonable assurance of its security. Thus the technology difference between these systems and their predecessors is at best currently a maybe.

However:

Remember It Isn't About The Technology? It started with a quote from Why Is The Web "Centralized"?:
What is the centralization that decentralized Web advocates are reacting against? Clearly, it is the domination of the Web by the FANG (Facebook, Amazon, Netflix, Google) and a few other large companies such as the cable oligopoly.

These companies came to dominate the Web for economic not technological reasons.
The second thing that is different now is that the predecessors never faced an entrenched incumbent in their market. Suppose we have a cryptocurrency-based peer-to-peer storage service. Let's call it P2, to emphasize that the following is generic to all cryptocurrency-based storage services.

To succeed P2 has to take market share from the centralized storage services that dominate the market for Internet-based storage. In practice this means that it has to take market share from Amazon's S3, which has dominated the market since it was launched in 2006. How do they stack up against each other?
  • P2 will be slower than S3, because the network between the clients and the peers will be slower than S3's infrastructure, and because S3 doesn't need the overhead of enforcement.
  • P2 will lack access controls, so clients will need to encrypt everything they store.
  • P2 will be less reliable, since a peer stores a single copy where S3 stores 3 with geographic diversity. P2 clients will be a lot more complex than S3 clients, since they need to implement their own erasure coding to compensate for the lack of redundancy at the service.
  • P2's pricing will be volatile, where S3's is relatively stable.
  • P2's user interface and API will be a lot more complex than S3's, because clients need to bid for services in a marketplace using coins, and bid for coins in an exchange using "fiat currency". 
Clearly, P2 cannot charge as much per gigabyte per month as S3, since it is an inferior product. P2's pricing is capped at somewhat less than S3's. But the cost base for a P2 peer will be much higher than S3's cost base, because of Amazon's massive economies of scale, and its extraordinarily low cost of capital. So the business of running a P2 peer will have margins much lower than Amazon's notoriously low margins.

Despite this, these services have been raising extraordinary amounts of capital. For example, on September 7th last year Filecoin, one of the more credible efforts at a cryptocurrency-based storage service, closed a record-setting Initial Coin Offering:
Blockchain data storage network Filecoin has officially completed its initial coin offering (ICO), raising more than $257 million over a month of activity.

Filecoin's ICO, which began on August 10, quickly garnered millions in investment via CoinList, a joint project between Filecoin developer Protocol Labs and startup investment platform AngelList. That launch day was notable both for the large influx of purchases of Simple Agreements for Future Tokens, or SAFTs (effectively claims on tokens once the Filecoin network goes live), as well as the technology issues that quickly sprouted as accredited investors swamped the CoinList website.

Today, the ICO ended with approximately $205.8 million raised, a figure that adds to the $52 million collected in a presale that included Sequoia Capital, Andreessen Horowitz and Union Square Ventures, among others.
Let's believe these USD amounts for now (much of the ICO involved cryptocurrencies), and that the $257M is the capital for the business. Actually, only "a significant portion" is:
a significant portion of the amount raised under the SAFTs will be used to fund the Company’s development of a decentralized storage network that enables entities to earn Filecoin (the “Filecoin Network”).
Investors want a return on their investment, let's say 10%/yr. Ignoring the fact that:
The tokens being bought in this sale won’t be delivered until the Filecoin Network launches. Currently, there is no launch date set.
Filecoin needs to generate $25.7M/yr over and above what it pays the providers. But it can't charge the customers more than S3, or $0.276/GB/yr. If it didn't pay the providers anything it would need to be storing over 93PB right away to generate a 10% return. That's a lot of storage to expect providers to donate to the system.

Using 2016 data from Robert Fontana and Gary Decad of IBM, and ignoring the costs of space, power, bandwidth, servers, system administration, etc., the media alone represent $3.6M in capital. Let's assume a 5-year straight-line depreciation ($720K/yr) and a 10% return on capital ($360K/yr); that is $1.08M/yr to the providers just for the disks. If we assume the media are 1/3 of the total cost of storage provision, the system needs to be storing 107PB.
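A quick check of this arithmetic (the raw-disk cost per gigabyte is my reading of the Fontana & Decad numbers, so treat it, and the other constants, as assumptions):

```python
# Back-of-the-envelope break-even volume for P2, 1 PB = 1e6 GB.
S3_PRICE = 0.023 * 12                  # $/GB/yr: Amazon's most expensive S3 tier, mid-2018
RETURN_NEEDED = 0.10 * 257e6           # $/yr: a 10% return on the $257M raised
DISK_COST = 0.039                      # $/GB of raw disk media (assumed, from the 2016 data)
MEDIA_ANNUAL = DISK_COST * (1 / 5 + 0.10)   # 5-yr straight-line depreciation + 10% return
PROVIDER_ANNUAL = MEDIA_ANNUAL * 3          # media assumed to be 1/3 of provider cost

# If providers donated their storage, revenue need only cover the investors' return:
print(RETURN_NEEDED / S3_PRICE / 1e6)                      # ~93 PB

# If providers must also recover their costs out of the same capped price:
print(RETURN_NEEDED / (S3_PRICE - PROVIDER_ANNUAL) / 1e6)  # ~107 PB
```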

Another way of looking at these numbers is that Amazon's margins on S3 are awesome, something I first wrote about 5 years ago.

Running a P2 peer doesn't look like a good business, even ignoring the fact that only 70% of the Filecoin are available to be mined by storage suppliers. But wait! The cryptocurrency part adds the prospect of speculative gain! Oh no, it doesn't:
When Amazon launched S3 in March 2006 they charged $0.15 per GB per month. Nearly 5 years later, S3 charges $0.14 per GB per month for the first TB.
As I write, their most expensive tier charges $0.023 per GB per month. In twelve years the price has declined by a factor of 6.5, or about 15%/yr. In the last seven years it has dropped about 23%/yr. Since its price is capped by S3's, one can expect that P2's cryptocurrency will decline by at least 15%/yr and probably 23%/yr. Not a great speculation once it gets up and running!
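The annualized rates above are simple compound-decline rates; a quick check:

```python
# Annualized decline in S3's per-GB price, from the figures quoted in the post.
p_2006, p_2011, p_2018 = 0.15, 0.14, 0.023   # $/GB/month

twelve_yr = 1 - (p_2018 / p_2006) ** (1 / 12)   # 2006 -> 2018
seven_yr = 1 - (p_2018 / p_2011) ** (1 / 7)     # 2011 -> 2018
print(f"{twelve_yr:.1%}/yr, {seven_yr:.1%}/yr")  # roughly 14.5%/yr and 22.7%/yr
```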

Like almost all cryptocurrencies, Filecoin is a way to transfer wealth from later to earlier participants. This was reflected in the pricing of the ICO; the price went up linearly with the total amount purchased, creating a frenzy that crashed the ICO website. The venture funds who put up the initial $50M, including Union Square Ventures, Andreessen Horowitz and Sequoia, paid less than even the early buyers in the ICO. The VCs paid about $0.80; the earliest buyer paid $1.30.

[Chart: Filecoin futures, 6/16/18]
Filecoin is currently trading in the futures market at $7.26, down from a peak of $29.59. The VCs are happy, having found many "greater fools" to whom their investment can, even now, be unloaded at nine times their cost. So are the early buyers in the ICO. The greater fools who bought at the peak have lost more than 70% of their money.

Ignoring for now the fact that running P2 peers won't be a profitable business in competition with S3, let's look at the effects of competition between P2 peers. As I wrote more than three years ago in Economies of Scale in Peer-to-Peer Networks:
The simplistic version of the problem is this:
  • The income to a participant in a P2P network of this kind should be linear in their contribution of resources to the network.
  • The costs a participant incurs by contributing resources to the network will be less than linear in their resource contribution, because of the economies of scale.
  • Thus the proportional profit margin a participant obtains will increase with increasing resource contribution.
  • Thus the effects described in Brian Arthur's Increasing Returns and Path Dependence in the Economy will apply, and the network will be dominated by a few, perhaps just one, large participant.
The advantages of P2P networks arise from a diverse network of small, roughly equal resource contributors. Thus it seems that P2P networks which have the characteristics needed to succeed (by being widely adopted) also inevitably carry the seeds of their own failure (by becoming effectively centralized).
Thus, as we see with Bitcoin, if the business of running P2 peers becomes profitable, the network will become centralized.
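A toy model shows how fast the gap opens. Income is linear in contributed resources; cost grows with an assumed sub-linear exponent standing in for economies of scale. The exponent is arbitrary, but any value below one gives the same qualitative result:

```python
# Margin as a function of scale when income is linear but cost is sub-linear.
PRICE_PER_UNIT = 1.0      # income per unit of contributed storage (arbitrary units)
COST_EXPONENT = 0.8       # assumed; any exponent < 1 yields rising margins with size

def margin(units: float) -> float:
    income = PRICE_PER_UNIT * units
    cost = units ** COST_EXPONENT
    return (income - cost) / income

for units in (10, 100, 1_000, 10_000):
    print(f"{units:>6} units -> margin {margin(units):5.1%}")
# Margins climb from ~37% to ~84%: the bigger the peer, the deeper it can cut
# prices and still profit, so the network drifts toward a few large operators.
```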

But that's not the worst of it. Suppose P2 storage became profitable and started to take business from S3. Amazon's slow AI has an obvious response: it can run P2 peers for itself on the same infrastructure as it runs S3. With its vast economies of scale and extremely low cost of capital, P2-on-S3 would easily capture the bulk of the P2 market. It isn't just that, if successful, the P2 network would become centralized; it is that it would become centralized at Amazon!

15 comments:

Ben said...

This is a good economic analysis, but I think it ignores the most interesting part of decentralized storage: privacy.

Of course AWS will always be more efficient than any decentralized network, but AWS can never credibly commit to not spying on you. Even if you encrypt your data, they will always have your metadata. You could try for a legislative solution, but it would be hard to enforce and open to abuse / corruption.

Whereas, with a decentralized network, each node not only can't see your data, but only gets a tiny slice of your metadata. Even if most of the nodes are in fact just regular datacenters, there would probably be enough of them to spread your metadata out; and, unlike with AWS, collecting your metadata is not really relevant to their business.

This might not be good enough. People might not value not being manipulated enough to use the service at a higher cost compared to AWS. It might be too hard to do technically, and even if it is successful, it might end up being centralized enough that you lose your privacy anyway.

I didn't buy into any of these networks myself, but I do hope they are at least somewhat successful, if only for privacy reasons.

Mike L said...

David,
I enjoyed this article. I think it was well thought out, and you raised several interesting points.

I'm a longtime follower of and contributor to Sia. I have a blog called Space Duck where I write about decentralized storage, so I've thought about these issues quite a bit.

P2 will be slower than S3, because the network between the clients and the peers will be slower than S3's infrastructure, and because S3 doesn't need the overhead of enforcement.

I don't think this is necessarily true.

S3 will always be better and faster at the server and software level, but at the data center level, they're slow and lumbering. S3, for example, has no data centers in Africa and only a single data center in South America.

P2 can outplay S3 in terms of latency because peers can host data without the requirement of building an entire data center.

P2 will lack access controls, so clients will need to encrypt everything they store.

For the most part, yes, but not in all cases. There are several use-cases (e.g. CDNs for public websites) where the data doesn't really need to be encrypted, although it will probably always need integrity-checks.

P2 will be less reliable, since a peer stores a single copy where S3 stores 3 with geographic diversity. P2 clients will be a lot more complex than S3 clients, since they need to implement their own erasure coding to compensate for the lack of redundancy at the service.

Agree, but this might not end up mattering. BitTorrent is more complex than an HTTP download, but once there's software to abstract away the complexity, it's pretty easy for clients or end-users to ignore that complexity.

P2's pricing will be volatile, where S3's is relatively stable.

It's certainly more stable at this stage, but it's unclear what will happen when the market matures. My prediction is that it will end up kind of like Uber, where it's unstable in the very short term (surge pricing and whatnot), but averaged over the course of weeks or months, fairly stable.

P2's user interface and API will be a lot more complex than S3's, because clients need to bid for services in a marketplace using coins, and bid for coins in an exchange using "fiat currency".

True, but I think it's pretty plausible that someone would build a layer on top of this and offer an S3-like interface with fiat payments.

I actually spent a lot of time investigating the feasibility of doing this with Sia. I ultimately concluded that I couldn't make it work under the current state of Sia, but I think it's very possible in the future as Sia or competitors mature.

Mike L said...

But the cost base for a P2 peer will be much higher than S3's cost base, because of Amazon's massive economies of scale, and its extraordinarily low cost of capital. So the business of running a P2 peer will have margins much lower than Amazon's notoriously low margins.

I think that this overlooks a big strength of P2 in that it permits players to enter the storage marketplace that could previously never enter.

Right now, if you have 100 TB of extra space bought in anticipation of storage needs you don't yet have, there's not much you can do with it except sell the physical disks and buy new ones later when you need them, which is probably not worth the effort and won't save money.

If a storage marketplace exists, now you can earn money from that 100 TB that would otherwise sit unused.

I think of it as similar to Airbnb and hospitality. If you had a spare room in your home 10 years ago, it would be very difficult for you to rent it out for short-term stays. With Airbnb, now it is pretty easy to do so and profitable for both Airbnb and the hosts themselves.

If you were to analyze it from a perspective of margins, assuming that hosts had the same up-front costs as hotels, then hotels would seem to clearly dominate Airbnb rooms. The cost to add an extra guest room to an existing house is very high, even though it is much less than Hilton spends per room when it constructs a hotel. But the host already has the room, so the comparison of margins isn't capturing the whole picture.

Suppose P2 storage became profitable and started to take business from S3. Amazon's slow AI has an obvious response, it can run P2 peers for itself on the same infrastructure as it runs S3. With its vast economies of scale and extremely low cost of capital, P2-on-S3 would easily capture the bulk of the P2 market. It isn't just that, if successful, the P2 network would become centralized, it is that it would become centralized at Amazon!

I'm skeptical of the idea that S3 could simply turn around and dominate P2.

For one thing, they'd no longer have most of the advantages they currently have as S3. They too would need clients to encrypt everything and deal with the more complicated software.

Importantly, they'd lose the advantage of bundling with AWS. One of S3's big advantages is that you don't pay for bandwidth within AWS, but that couldn't hold if they tried to dominate P2 (or maybe they could if they really tried to get tricky with P2 host detection, but they probably wouldn't).

The dominant providers would also be forced to compete more directly with each other. If S3 tried to be the dominant P2 provider, it would lose its lock-in. Storage just becomes a commodity. One reason for developers to stay with S3 now is because their data is already there and the software is already written against S3 APIs, but if it's written against generic P2 APIs, the consumer would just move their data whenever another provider offers better pricing.

Mike L said...

One existential risk to decentralized storage that you didn't touch on, but I think is pretty important: legal compliance.

Many people have this idyllic view that decentralized storage will bring with it a censorship-free piracy playground, but I think that's fairly unlikely. If you can make $0.023/GB as a host on P2 and you've got 5 TB free, that's ~$100/month in free money, so the average person might be happy to host. But what happens when the average user then begins receiving letters from federal law enforcement saying that they're hosting illegal material, must delete it within a specified time limit or face jail time, and must cooperate with investigators to hand over relevant information? That extra $100/mo becomes much less attractive.

Legal compliance is an area where entrenched providers have REALLY good economies of scale. They have departments that are set up to receive requests from law enforcement and entire software pipelines to comply with them. I can imagine companies trying to offer this to hosts in a decentralized storage system, but it will be pretty tough.

David. said...

Thank you, Benjamin and Mike, for these thoughtful comments.

My response would be that you're both assuming that a successful cryptocurrency-based storage network would be decentralized. As we see with Bitcoin, Ethereum, and others, this simply isn't the case. All successful cryptocurrencies are highly centralized, for reasons I set out here, here and here. Centralized cryptocurrency-based networks are not secure, they do not provide privacy, and they will compete away any viable return for small players.

If these networks never achieve a scale that would worry S3, they might remain decentralized. And if no-one was actually making a business out of providing storage to them, they might be cheap. But they wouldn't really have changed the world, would they? And, as we see with the flood of alt-coins, they would be easy to attack.

PS - given Kryder's Law, anyone who has 100TB of storage empty in anticipation of future needs should fire their purchasing manager.

Ben said...

David,

I think you're ignoring two major differences between storage and mining: latency and bandwidth.

In mining, blocks are small, and can be easily transferred over long distances between pools very quickly. This means mining in remote areas of the globe has no drawback compared to mining close to cities. So mining concentrates in areas where it is most efficient, causing centralization of hardware. (Mining pool centralization is a separate but related issue).

Your post makes a similar argument for storage centralization, where storage will centralize where it is more efficient. However, location matters for storage.

Imagine that there was a data center in China that could offer unlimited storage 100x more efficiently than anywhere else on the globe. You might naively assume that they would have ~100% of the global storage market. However, if I wanted to store a movie in that center and stream it to my laptop in New York, I would run into unsolvable latency and bandwidth problems. Having everyone on the planet use this center only makes the problem worse, as the bottleneck between the US and China is the capacity of the international undersea cables, which would be permanently clogged trying to service the whole planet. Ultimately, such a data center would end up as a giant tape drive storage center. Not useless, but not for universal use.

A local entity is always going to be able to offer better latency and bandwidth costs than a distant entity, just by basic physics. Since all data centers use commodity hardware from many competing low-margin manufacturers, close access to equipment manufacturing is not a large advantage. In addition, electricity is a much lower percentage of the cost of providing storage than of mining, so I do not think access to cheap electricity is going to be as large a barrier to entry as with mining.

I do think there are economies of scale that can be accessed. Using dedicated server racks, a dedicated internet connection, and ~99% uptime are probably going to be required. I do not think "random desktop drives" is a reasonable economic model. But I also think small data centers, operating through a network that takes care of overhead, so data becomes a pure commodity like gasoline, have a strong economic case.

AWS doesn't already do this because the marginal overhead cost of setting up a data center in every city is high, and the cost of making a mistake and damaging their network is large (as has already happened). But if the protocol could handle nodes entering and leaving the network with reasonably strong stability and redundancy guarantees (ideally, it would be plug and play), then there is no overhead for starting up your own center to compete in your local network, where you can offer competitive bandwidth and latency rates.

Sorry for the long rant. I'm not an expert in the field, so any of my assumptions here might be wrong, but this is the case I have been thinking of for some time.

David. said...

Benjamin, Amazon already has massive data centers close to every significant market with extreme bandwidth. It buys bandwidth, power, storage media and equipment in volumes, and thus at prices, that "your own data center" cannot possibly match. But much more important, Amazon's cost of capital is incredibly low. So, no, "your own data center" with marginally better latency and vastly worse everything else is never going to be able to compete with Amazon.

But this only matters if cryptocurrency-based storage networks are successful and profitable enough to make it worth Amazon taking them over, which isn't going to happen for the reasons I set out in the sections "This Time Is Different" and "It Isn't About The Technology" above.

TL;DR: (A) cryptocurrency-based storage networks aren't going to become a threat to Amazon's dominance of the storage market. (B) If they ever did, they would be centralized and the place they'd be centralized would be Amazon.

David. said...

David Gerard links to this post, saying I'm arguing that "all distributed file storage cryptocurrency schemes will eventually become less efficient front-ends to Amazon S3."

I can't have been as clear as I thought I was.

I don't think it will ever be possible for outsiders to make money running file storage nodes on S3. Amazon's margins on S3 are awesome, because of their economies of scale and their extremely low cost of capital.

My points were (a) that S3 pricing caps what storage networks can charge (with a cap that shrinks over time, causing a headache), and (b) that if despite that they succeed in worrying Amazon, Amazon can use the infrastructure underlying S3 to run nodes on their own account, and vastly undercut outsiders' pricing because they can accept lower margins. This, in effect, puts a floor under what the outsiders can charge, because the only way they worry Amazon is by being so cheap that lots of people put up with the disadvantages.

Now I come to think of it, (b) is what Bitmain is doing to BTC.

Unknown said...

David,
You didn't mention Jim McCoy's Mojo Nation; I was fortunate to work there with Bram Cohen (BitTorrent) and "Zooko" Wilcox-O'Hearn (MNET, Tahoe-LAFS and ZCash). I think we were one of the first p2p attempts (circa 2001) to include our own currency, based on resources, to reward those offering resources (e.g., storage and bandwidth) and to require those consuming resources to pay. Jim and I attributed Mojo's failure to a number of circumstances, including:
- Our inability to raise capital due to Napster's poisoning the VCs
- The immaturity of the software tools available at the time
- Our too early focus on persistence
- Connection limitations at the time left the client's storage-bandwidth ratio unfavorably high for building an infrastructure with reasonable access times
- Creeping feature elegance
- and "we were far too concerned with security and anonymity" for rollout
Today most or all of these limitations are no longer significant.

I generally agree with your assessment of cost competition between centralized and p2p, and think today, like Satoshi's vision for Bitcoin, p2p file sharing should (like Freenet) focus on censorship resistance.

Steve Schear
Twitter @P01ndexter
Keybase: @P01ndexter

David. said...

Wow! In just nine days this has become the second most-viewed post in this blog's history, with, as I write, 35,755 views. The third is from May 2013, and in 5 years has accumulated only 26,043 views.

David. said...

Steve, you're probably right that Mojo Nation was one of the first non-symmetric P2P storage networks, and I should have mentioned it. But there are lots of other P2P storage networks I didn't mention - the post takes quite long enough to get to talking about today's efforts!

There are two problems with "focus on censorship resistance" as a marketing strategy for P2P storage networks.

First, the proportion of the overall storage user base that cares strongly enough about resisting censorship to pay over the odds for it is very small, and their demographics tend toward the impoverished end of the spectrum. So they're not a good target market for, in FileCoin's case, a $257M investment. Even if a network of this kind gained substantial penetration in the censorship resistance market, it wouldn't make any difference to the overall storage market.

Second, there is a small but powerful group that is very interested in censorship resistance. Governments, big media and financial institutions, among others, *really* hate censorship resistance, and marketing a product primarily as censorship resistant attracts their attention. Which can make life rather uncomfortable for both developers and users of such a system. This would not be such a problem for a system that was marketed as having lots of user-friendly features, and incidentally in the small print was hard to censor. See Mike L's comment about legal compliance above. And note that anyone hosting a full Bitcoin mining node is already hosting illegal content, arguably including child pornography.

This is why all content in P2P networks must be encrypted and sharded across multiple nodes. That way node operators can claim (a) not to know what they are storing and (b) not to be storing any entire files.

David. said...

I was apparently too cynical about FileCoin's VCs. In Initial coin offerings: Financing growth with cryptocurrency token sales, Sabrina Howell et al write:

"Pre-sale investors, including prestigious VC firms such as Sequoia Capital and Andreessen Horowitz, paid an average of $0.57 per token and agreed to long vesting periods."

DidgetMaster said...

I am convinced that in order for distributed storage to gain a foothold against the entrenched players, a new system must provide some significant benefit over conventional systems besides simply distributing the data. It is not enough to be just some kind of a generic object store that is distributed. It must be something where people would want to move their data to it even if it were just another data manager with its storage container on the local hard drive. The fact that it uses a distributed peer-to-peer model for its underlying storage layer would just be a big bonus.

I have been building such a system for several years now. It started off as a hobby but has evolved way beyond that. It is a kind of object store that blends a lot of file system and database features (but does not depend on third-party tools to support them). It supports multiple data models.

So I can create a single logical container and store hundreds of millions of objects within it. Some of those objects can form one or more hierarchies of unstructured data and thus mimic a set of file system volumes. Other objects can form databases, each with thousands of relational tables. Still others can form data models found only in NoSQL solutions. Every object can have a set of meta-data tags attached to it to make searching quick and easy.

It can find things thousands of times faster than file systems. It is about twice as fast as conventional RDBMS (MySQL, PostgreSQL, or SQL Server). It has a bunch of features that are not found in other systems at all. It is still a work in progress, but I think it is just the ticket to get widespread adoption of a distributed data system.

David. said...

David Gerard reports that:

"The SEC has settled with Nebulous, the company behind the Sia file-storage coin, over its 2014 offering — which was an actual altcoin with its own blockchain, not a token. Nebulous paid in fines about double what they took in in that offering."

sinekonata said...

For those of us who don't care about business models, and who also tolerate that P2P networks may indeed be taken advantage of by a few leeches who refuse to share anything and go out of their way to cheat on their contributions, is there a cloud space like that? Some network where I can say "Mi disco est tu disco". I have 4TB on a self-hosted server (100% uptime) that I do nothing with; I would gladly let such a network of comrades use them, while I could use 50GB to be stored redundantly (maybe 10x, that's probably overkill) on that same network. If such a network exists, I will happily buy another 8TB drive btw; I'm really happy to share with the community.

You seem to show that the idea is old, so there must have been successful attempts at this that are still functioning. Even if, as you imply, such a net should have some authority or authentication, I'm fine with such an authority as long as it's not a company but a community with the peers' interests at heart. If I have to verify my identity or chat with a few people first, I also find that fine :D

I ask because your vision, although probably correct when it comes to blockchain, seems overly bleak when it comes to communities sharing and a P2P-driven internet. Because so far, reading your post, the de facto solution, seeing as none is feasible according to you, is to keep using Amazon & co, which many of us just refuse to do.