In the first part of the article, we identified the problem of storing application data on the blockchain. To understand the context, we recommend reading it first. In this part, we set out the properties we want from an ideal data store and review the existing approaches to the problem.
So, if we have a decentralized application, what requirements should its data store meet? We propose the following:
- Distribution - since the entire blockchain infrastructure and applications on it are distributed, the data storage must also be distributed and decentralized.
- Publicity - blockchain allows anyone to add their own hardware to the network. It is logical to expect the same from the data store.
- Resistance to the Byzantine generals problem and other types of attacks on public networks - a public distributed network cannot do without it.
- Sharding support - if we expect the application to be popular and to store huge amounts of data, then it would be good to use the power of the network to increase the maximum storage volumes. Full replication of data on each node, of course, reduces the chances of data loss in case of problems with individual nodes. However, in the case of high-capacity networks, for example, hundreds or thousands of servers, duplicating all data on all servers is extremely redundant, and you can reduce the level of replication in favor of increasing the maximum total amount of data. That is, if we have N servers, then each record should be replicated only to m of them, m < N . This will allow a linear increase in the total amount of stored data by adding servers.
- Speed - popular applications may require hundreds of thousands, if not millions, of read or write transactions per second.
- Structure - the storage must be able to preserve the internal structure of the data so that applications can link individual records to each other.
- Deletion of data - the storage must support deleting data the application no longer needs, in order to free up space.
- Secondary keys, full-text search, query language - applications should be able to search the stored data quickly, taking its internal structure into account.
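The sharding requirement above (N servers, each record replicated to only m of them) can be illustrated with a small sketch. The placement rule used here, rendezvous (highest-random-weight) hashing, is our own assumption for illustration; the requirement itself does not prescribe any particular scheme, and the names `total_capacity` and `replicas_for` are hypothetical.

```python
import hashlib


def total_capacity(n_servers: int, per_server_bytes: int, m_replicas: int) -> int:
    """Unique data the cluster can hold when every record is stored m times."""
    if not 1 <= m_replicas <= n_servers:
        raise ValueError("need 1 <= m <= N")
    return n_servers * per_server_bytes // m_replicas


def replicas_for(key: str, servers: list[str], m_replicas: int) -> list[str]:
    """Pick m servers for a record using rendezvous hashing (an assumed scheme)."""
    def weight(server: str) -> bytes:
        # Every (server, key) pair gets a pseudo-random weight; the top m win.
        return hashlib.sha256(f"{server}:{key}".encode()).digest()

    return sorted(servers, key=weight, reverse=True)[:m_replicas]


servers = [f"node-{i}" for i in range(10)]
print(replicas_for("user:42", servers, 3))   # three distinct nodes, stable across runs
print(total_capacity(10, 1_000, 3))          # 3333
print(total_capacity(20, 1_000, 3))          # 6666: doubling N doubles capacity
```

With m fixed, capacity grows linearly in N, which is exactly the property the requirement asks for; full replication corresponds to m = N, where adding servers adds no capacity at all.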
Let's see how existing technologies meet these requirements.
IPFS
IPFS (InterPlanetary File System) is a distributed file system technology based on a DHT (Distributed Hash Table) and the BitTorrent protocol. It combines the file systems of different devices into one using content addressing.
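To make content addressing concrete, here is a deliberately simplified sketch: the address of a blob is the SHA-256 digest of its bytes, so a client can verify whatever an untrusted peer returns. This is an illustration only. Real IPFS splits files into chunks and uses multihash-based content identifiers (CIDs), and the `Peer` class and function names below are hypothetical.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Address of a blob: the hex SHA-256 digest of its bytes (simplified)."""
    return hashlib.sha256(data).hexdigest()


class Peer:
    """A toy peer that stores blobs under their content address."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        cid = content_id(data)
        self.blobs[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        return self.blobs[cid]


def fetch_verified(peer: Peer, cid: str) -> bytes:
    """Fetch from an untrusted peer and reject data that does not match the address."""
    data = peer.get(cid)
    if content_id(data) != cid:
        raise ValueError("data does not match the requested address")
    return data


peer = Peer()
cid = peer.put(b"hello, dApp")
assert fetch_verified(peer, cid) == b"hello, dApp"

peer.blobs[cid] = b"tampered"  # a dishonest peer swaps the content
try:
    fetch_verified(peer, cid)
except ValueError as err:
    print("rejected:", err)
```

The same property explains why such files are immutable: changing a single byte changes the address, so an "updated" file is simply a different file.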
Advantages :
- Each device stores only the files it needs, plus meta-information about where files are located on other devices. Therefore, no additional incentive for storing files is required.
- There is no need to trust peers, because files are addressed by content.
- Resistance to flooding (uploading unwanted files to the network), because files are placed only on the publisher's own device.
- High bandwidth (thanks to BitTorrent)
Disadvantages :
- Storing only files (unstructured information).
- After publishing a file, you cannot leave the network until the file has spread to other peers.
- Storage by other devices is not guaranteed: for your file to remain available, the peers holding it need to be online.
- Files are static (immutable)
- File deletion is not provided for at all.
The AKASHA social network (Ethereum + IPFS) and the OpenBazaar marketplace are built on IPFS. They fully inherit the disadvantages listed above, the main one being that you cannot place information on the network and go offline until it has spread across the peers.
Distributed File Storage
Such storage combines individual devices into a shared cloud storage. As a result, users can store their files there just as they would in a classic centralized service such as Dropbox, but more cheaply. The owners of the devices ("farmers") provide space for other people's files and are paid by users in proportion to their contribution. To measure that contribution, ensure reliable storage and prevent abuse, various cryptographic checks are used, for example proof of storage and proof of retrievability (proof that the file is available and can be extracted). The user pays for each successful check, and the farmer receives a certain amount in cryptocurrency.
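The idea behind such checks can be sketched as a challenge-response audit. In this toy version, which is our own assumption and not the protocol of any specific project, the owner precomputes a few random challenges before uploading, and later asks the farmer to hash the file together with a fresh nonce. Production systems use more elaborate schemes; all function names here are hypothetical.

```python
import hashlib
import os


def make_challenges(data: bytes, count: int) -> list[tuple[bytes, str]]:
    """Owner side, before upload: precompute (nonce, expected answer) pairs."""
    challenges = []
    for _ in range(count):
        nonce = os.urandom(16)
        expected = hashlib.sha256(nonce + data).hexdigest()
        challenges.append((nonce, expected))
    return challenges


def farmer_response(stored: bytes, nonce: bytes) -> str:
    """Farmer side: answering requires hashing the nonce with the full file."""
    return hashlib.sha256(nonce + stored).hexdigest()


def audit(expected: str, response: str) -> bool:
    """Owner side: the farmer is paid only if the answer matches."""
    return expected == response


file_data = b"some user file" * 1000
challenges = make_challenges(file_data, count=3)   # the owner keeps only these

nonce, expected = challenges.pop()
print(audit(expected, farmer_response(file_data, nonce)))        # True: honest farmer
nonce, expected = challenges.pop()
print(audit(expected, farmer_response(file_data[:100], nonce)))  # False: data was dropped
```

Each challenge can be used only once, because a farmer who has seen a nonce could cache the answer and discard the file; this is one reason real schemes are more involved.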
Such projects are built mainly on DHT technology and content addressing (where the hash of a file serves as its identifier). Some additionally use smart contracts.
Currently there are quite a few such projects on the market, for example Sia, Storj, Ethereum Swarm and MaidSafe. They are all built on similar principles. Ethereum Swarm, in particular, was conceived among other things to provide convenient file storage for dApps.
Advantages :
- Files are stored in the cloud and are available regardless of the availability of their owner.
- High throughput.
- Due to financial motivation, the reliability of storage and retrieval of files is ensured.
- Deleting unneeded files is possible
Disadvantages :
- Storing only files (and not structured information)
- Files are static
- Storage is not free
Distributed file storages look attractive for storing files. However, for structured dynamic information, such as the user data of a social network, keeping data in static files is a serious problem. File storages know nothing about the contents of a file, while the application may need to search not only by the identifier (or name) of a file but also by its contents. For example: find all users matching the keyword "blockchain", or all users located in a particular city, or even run a full-text search across publications. Therefore, we continue to look for a better option.
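The following sketch shows what this means for an application in practice. The storage layer sees only opaque blobs, so a query by city either scans every file or relies on an index that the application has to build and maintain itself. The record layout and function names are hypothetical and serve only as an illustration.

```python
import json
from collections import defaultdict

# To the storage layer each "file" is an opaque blob; only the app knows it is JSON.
files = {
    "f1": json.dumps({"name": "alice", "city": "Berlin", "tags": ["blockchain", "go"]}),
    "f2": json.dumps({"name": "bob", "city": "Paris", "tags": ["design"]}),
    "f3": json.dumps({"name": "carol", "city": "Berlin", "tags": ["blockchain"]}),
}


def find_by_city_scan(city: str) -> list[str]:
    """Without an index: download and parse every single file."""
    return sorted(fid for fid, blob in files.items() if json.loads(blob)["city"] == city)


def build_index(field: str) -> dict[str, set[str]]:
    """A hand-rolled secondary index: field value -> ids of the files containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for fid, blob in files.items():
        value = json.loads(blob)[field]
        for item in value if isinstance(value, list) else [value]:
            index[item].add(fid)
    return index


city_index = build_index("city")
tag_index = build_index("tags")
print(find_by_city_scan("Berlin"))       # ['f1', 'f3']
print(sorted(city_index["Berlin"]))      # ['f1', 'f3']
print(sorted(tag_index["blockchain"]))   # ['f1', 'f3']
```

The index itself then has to be stored, distributed and kept in sync with the files, which is precisely the work a database would otherwise do for the application.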
Distributed Databases
Unfortunately, by virtue of the CAP theorem, it is impossible to build a distributed database that simultaneously guarantees consistency, availability, and partition tolerance (the last means that the database keeps functioning even if some of the nodes are disconnected from the network or their messages are lost).
For our needs, the distributed database must of course be partition tolerant and available: we need to get an answer from it quickly, even if perhaps not the most recent one. This limits our choice to NoSQL databases, because ACID SQL DBMSs give priority to consistency.
There are a great many implementations of distributed NoSQL databases, for example MongoDB, Cassandra and RethinkDB. All of them can work with a large number of replicas united in a cluster. The client works with one of the replicas, and the data is automatically synchronized with the others. For load balancing, sharding can be used, where each part of the data is stored on only some of the replicas. Adding a new replica scales the cluster almost linearly, and some implementations (for example, Cassandra) let the new replica automatically take over part of the cluster's work.
Such NoSQL databases typically provide eventual consistency: the data becomes consistent after some time, once the individual replicas have synchronized. In this they resemble the blockchain, where the more time has passed, the more likely a transaction is to be confirmed.
NoSQL databases can store plain key-value pairs, or they can maintain the internal structure of the value together with additional indexes. The most advanced also have basic transaction support and an SQL-like query language (for example, Cassandra).
Given all of the above, this class of databases may seem ideal for use with the blockchain. But there is a problem. Imagine that someone adds a malicious replica to a well-coordinated cluster of such databases, and it starts telling the other replicas that all data must be deleted. All the other replicas will obediently delete their data, and the database will be hopelessly corrupted. In other words, such well-coordinated work of replicas is currently possible only in a trusted environment: a cluster of such databases is not resistant to the Byzantine generals problem, and a single malicious replica can destroy the data of the entire cluster.
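The scenario can be reproduced with a toy model. The sketch below assumes a last-write-wins store in which replicas accept any entry with a newer timestamp from any peer; this is a simplification for illustration and not the actual replication protocol of MongoDB, Cassandra or RethinkDB. The `Replica` class is hypothetical.

```python
from dataclasses import dataclass, field

TOMBSTONE = None  # marker stored in place of a value when a key is deleted


@dataclass
class Replica:
    """Toy last-write-wins store: key -> (timestamp, value)."""
    name: str
    data: dict = field(default_factory=dict)

    def write(self, key, value, ts):
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def delete(self, key, ts):
        self.write(key, TOMBSTONE, ts)

    def sync_from(self, other: "Replica"):
        """Accept every newer entry from a peer without questioning it."""
        for key, (ts, value) in other.data.items():
            self.write(key, value, ts)

    def live(self):
        return {k: v for k, (_, v) in self.data.items() if v is not TOMBSTONE}


a, b = Replica("a"), Replica("b")
a.write("user:1", "alice", ts=1)
b.write("user:2", "bob", ts=2)
a.sync_from(b)
b.sync_from(a)
assert a.live() == b.live() == {"user:1": "alice", "user:2": "bob"}  # eventually consistent

evil = Replica("evil")
for key in a.data:
    evil.delete(key, ts=10**12)  # forged deletions that look "newer" than any real write
a.sync_from(evil)
b.sync_from(a)
print(a.live(), b.live())  # {} {}
```

The honest replicas behave exactly as designed; the data is lost because the protocol trusts every participant, which is acceptable inside one organization but not in a public network.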
Advantages :
- High speed
- Linear scaling of speed and storage size
- Resistance to unavailability of individual replicas
- Mature implementations
Disadvantages :
- Not resistant to the Byzantine generals problem, so usable only in a trusted environment
BigChainDB
There is a blockchain implementation called BigChainDB or, as it is also known, IPDB (InterPlanetary DataBase), which is often cited as a solution to all data storage problems. The authors claim a very high transaction rate (1 million per second) and a huge storage capacity (thanks to distributed storage with partial replication). BigChainDB achieves this through a simplified consensus on block construction, and by storing all blocks and transactions in an existing NoSQL database, RethinkDB or MongoDB.
Unfortunately, this architecture has a significant drawback: every node has full write access to the shared data store, which means the system as a whole is not resistant to the Byzantine generals problem. The authors of the project are aware of this and promise to think about it later. However, fixing fundamental problems in the basic architecture after a product has been released is very laborious and often impossible, because it can result in a substantially different product with a different architecture. Such a casual attitude to a fundamental problem draws criticism of the project from the community, because without BFT (Byzantine Fault Tolerance) the high speed and capacity demonstrated by BigChainDB differ little from those originally demonstrated by the NoSQL databases it uses for data storage, RethinkDB and MongoDB. And since full trust between the nodes is still required, why not use these databases directly?
Thus, the practical use of BigChainDB is limited to private networks. To avoid misleading people, BigChainDB should be called BigPrivateBlockChain; then there would be no questions. It is not suitable for public networks.
Advantages :
- Speed and storage capacity comparable to distributed NoSQL databases
Disadvantages :
- In essence, this is an ordinary NoSQL database with the blockchain's shortcomings added.
- Immutability (data cannot be deleted legitimately, but it can be deleted maliciously)
- Not resistant to the Byzantine generals problem, and therefore unusable in a public network
Thus, BigChainDB is completely unsuitable for storing data of decentralized applications in public networks.
Conclusions
We have considered several approaches to organizing data storage for public networks that can be used by distributed applications. There are few of them, but no others exist at the moment. Unfortunately, none of the approaches satisfies all the requirements that we formulated at the beginning of the article.
This situation resembles the early days of computing, when programs could save data only in files, which was inconvenient. So DBMSs were created for computers, and some people made a fortune on them.
As a result, existing decentralized applications make do with storing data directly in the blockchain or in distributed file systems, as in the Stone Age. They are forced to implement their own indexes over files, invent their own data formats and generally spend a lot of time reinventing the wheel, albeit a decentralized one.
But the world of decentralized applications cannot remain without convenient data storage. Therefore, in the next part of the article, we will present the concept of a store that aims to satisfy all the requirements set out above.
→ The third part of the article
→ The first part of the article