Programster's Blog

Tutorials focusing on Linux, programming, and open source

Storj.io and Decentralized Storage

Storj.io is a decentralized "cloud" storage option that aims to let users share spare hard drive space, creating a system similar to Dropbox or Seafile, as shown in the video below:

The Issues

I had exactly the same idea a while ago after reading about Bitcloud, which is pretty much the same thing, except not specific to storage alone. I started working on the idea, but it never got past the initial planning phase because I saw some major issues.

The Management Service

There is always going to need to be a management service that:

  • peers can use to find each other;
  • keeps track of which files are where;
  • keeps track of nodes going up/down;
  • allows people to register;
  • handles payments;
  • is up 24/7/365.

This means the management service needs high availability across multiple geographical regions. The system therefore has to cope with high latency between nodes, and cannot rely on a simple ACID transactional database such as MySQL. The whole design has to adopt the "eventually consistent", "consensus-based" methodologies that have popped up with the NoSQL database trend. These are incredibly difficult to develop for, and mistakes can lead to unforeseen consequences. An exploit of such consensus mechanisms in Bitcoin's protocol was one of the ways bitcoins managed to be "double spent", which contributed to the eventual demise of Mt. Gox, once one of Bitcoin's most popular exchanges. For more information, please refer to this wiki.
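To make the "eventually consistent, consensus based" point concrete, here is a minimal sketch of the quorum rule used by Dynamo-style replicated databases (an illustration of the design style, not Storj.io's actual protocol). With N replicas, a write waits for W acknowledgements and a read queries R replicas; only when R + W > N is every read quorum guaranteed to overlap a replica holding the latest write — otherwise stale reads are possible.

```python
# Sketch of the Dynamo-style quorum condition (illustrative only,
# not Storj.io's actual design). With N replicas of a record, a write
# succeeds after W acknowledgements and a read consults R replicas.
# Choosing R + W > N forces every read quorum to intersect every
# write quorum, so reads always see the latest committed write.

def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if any read quorum must intersect any write quorum."""
    return r + w > n

# N=3 replicas: W=2, R=2 is safe; W=1, R=1 risks stale reads.
print(quorums_overlap(3, 2, 2))  # True  -> reads see latest write
print(quorums_overlap(3, 1, 1))  # False -> stale reads possible
```

Getting this right across high-latency geographical regions, in the presence of node failures and partitions, is exactly the hard part.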

The management service is an extra overhead that needs paying for, either by community members running the servers themselves, or through whatever payment system is set up. Bitcoin will face the same issue once all the bitcoins have been mined, and I believe the current proposal is for a small transaction fee on all bitcoin payments at that point in order to pay for these servers.

Redundancy

The video clearly showed a user gaining the equivalent amount of space that they were providing to others. This would terrify me because it means there is no redundancy in place. Each file is divided into hundreds or thousands of small chunks, which are then spread across hundreds of other people's computers that you don't know. Do you really expect none of those computers to suffer a hard drive failure, a power failure, or simply to be switched off by their owners? One needs a system where each chunk of data is not sent to just one node, but copied to preferably at least two additional nodes. That way, the system should have enough time to spot when a node becomes unavailable and duplicate the necessary data before all copies of a chunk are lost. It only takes every node holding a single chunk to become unavailable for you to lose your data, and a file may consist of hundreds of these chunks, making loss statistically very probable (almost guaranteed) with a redundancy level of just two.

With a redundancy level of three, one would gain only a third of the capacity one provides to others. However, this does mean that when fetching files, one can download from the fastest peers while the others act as "backups", giving a role to users with slow connections. Gaining only a third of the capacity that you provide to others also means users are unlikely to bother with RAID volumes with redundancy, increasing the likelihood of node failures.
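A rough model shows why chunking makes low redundancy so dangerous. Assume (illustratively) that each node holding a chunk is lost independently with probability p; a chunk with k copies is lost with probability p^k, and a file of c chunks survives only if no chunk is lost, i.e. with probability (1 − p^k)^c. The figures below are made-up but plausible: 5% node loss and 500 chunks per file.

```python
# Rough model of file survival under replication (illustrative numbers,
# not measurements from Storj.io). Each node holding a chunk is assumed
# to be lost independently with probability p; a chunk with k replicas
# is lost only if all k copies vanish; a file of c chunks survives
# only if every one of its chunks survives.

def file_survival(p: float, replicas: int, chunks: int) -> float:
    chunk_loss = p ** replicas
    return (1 - chunk_loss) ** chunks

p, chunks = 0.05, 500  # 5% node loss, 500 chunks per file
for k in (1, 2, 3):
    print(f"{k} replica(s): {file_survival(p, k, chunks):.4f}")
```

With one copy per chunk the file is essentially guaranteed to be lost; even two copies give well under a 50% chance of survival under these assumptions, while three copies push survival above 90% — hence the "gain a third of what you provide" trade-off.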

Connection Speeds

Home users tend to have terrible upload speeds. (This is the main reason why I have two pipes going into my pfsense router that I load balance.) A single connection sharing a file between two users would therefore be incredibly slow. The workaround is to download from hundreds of peers instead, just like with BitTorrent. However, BitTorrent is different: most of its users are sharing the same limited set of files, whereas with this system users are downloading their own unique files. Imagine if 25% of the system's users tried to use it at once. Even in a best-case scenario where the files are perfectly balanced, everyone's upload speed would be maxed out, because a user's download speed is usually much more than four times their upload speed.
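The 25% figure falls out of a simple supply-and-demand calculation (a sketch, with assumed connection speeds): total download demand is capped by the aggregate of everyone's uplinks, so with a 4:1 download-to-upload ratio the system saturates once a quarter of users download simultaneously.

```python
# Back-of-envelope throughput model (assumed, illustrative figures:
# 5 Mbit/s up, 20 Mbit/s down per user). Every user uploads; a
# fraction `active` of users download at once. Aggregate supply is
# limited by everyone's uplink, so each downloader gets at most
# supply / demand of their own downlink speed.

def achievable_download(up_mbit: float, down_mbit: float,
                        active: float) -> float:
    """Per-downloader speed (Mbit/s) once uplinks are the bottleneck."""
    supply_per_downloader = up_mbit / active
    return min(down_mbit, supply_per_downloader)

# At a 4:1 down/up ratio the system is exactly saturated at 25%
# activity; beyond that, everyone's downloads degrade.
print(achievable_download(5, 20, 0.25))  # 20.0 Mbit/s (just saturated)
print(achievable_download(5, 20, 0.50))  # 10.0 Mbit/s (degraded)
```

With asymmetry ratios worse than 4:1 (common on home connections), the saturation point arrives at an even smaller fraction of active users.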

The other side of this issue is that it takes a very long time for files to be uploaded onto the system in the first place. For a user to upload a single TiB of data on a 5 megabit upload speed (and I think that's generous; my parents have only 0.5), it would take 20.36 days — just Google "1 TiB / 5 megabit / 60 / 60 / 24".
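That figure is easy to sanity-check yourself:

```python
# Sanity check: how long does 1 TiB take over a 5 Mbit/s uplink?

TIB_BITS = 1024 ** 4 * 8   # 1 TiB expressed in bits
UPLINK = 5_000_000         # 5 megabits per second

seconds = TIB_BITS / UPLINK
days = seconds / 60 / 60 / 24
print(f"{days:.2f} days")  # 20.36 days
```

At my parents' 0.5 megabit uplink, multiply by ten: over 200 days for the initial upload.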

Payment System

Everything gets complicated as soon as real money is involved, especially with anything open source. Hence it sounds like a crypto-currency will be used instead, which will, for all intents and purposes, be worthless. It will help facilitate the "trading" of storage space fairly, but don't start building a massive storage array in the hope of getting rich.

Legal Implications

I saved this for last because other people have already covered it, and I can only really see it being an issue if data chunks weren't encrypted, or if users were able to share/transfer their content with other user accounts. I have probably missed something; if you mention it in the comments, I will either add it to the article or explain why it is not an issue.

Conclusion

Decentralization is going to happen; in fact, it's already happening with systems such as Seafile and Owncloud, which allow users to host their own alternatives to the central brands. The key difference is that these systems do not yet have a "sharing" core where users make use of each other's hardware, which is what Storj.io and Bitcloud are trying to achieve. Personally, I hope Storj.io becomes a successful open-source project, but I'm not going to wait. I recommend setting up your own Seafile server, either at home or on a Vultr SATA storage instance, which will have better bandwidth and a static IP, starting at 160 GB for 5 USD.