How many TPS are in your blockchain?

A favorite question that non-technical people ask about any distributed system is "How many TPS does your blockchain have?". However, the number given in reply usually has little to do with what the person really wants to know. What they actually mean is "Will your blockchain meet my business requirements?", and those requirements are not a single number but a whole set of conditions: network resiliency, finality requirements, the size and nature of transactions, and many other parameters. So the answer to "how many TPS" is unlikely to be simple and will almost never be complete. A distributed system with tens or hundreds of nodes performing fairly complex computations can be in a huge number of different states, depending on the state of the network, the contents of the blockchain, technical failures, economic problems, attacks on the network, and many other factors. The stages at which performance problems can arise differ from those of traditional services. A blockchain node is a network service that combines the functionality of a database, a web server and a torrent client, which makes its load profile extremely complex across all subsystems: processor, memory, network and storage.

Decentralized networks and blockchains happen to be rather specific and unusual software for developers who are used to centralized systems. That is why I would like to highlight the important aspects of the performance and resilience of decentralized networks, and approaches to measuring them and finding bottlenecks. We will look at the various performance problems that limit the speed at which blockchain users are served, and note the features typical of this kind of software.

Stages of servicing a blockchain client's request

To talk honestly about the quality of any reasonably complex service, you need to consider not only average values but also maximums, minimums, medians and percentiles. In theory one can claim 1000 TPS for some blockchain, but if 900 transactions were executed at tremendous speed while 100 got stuck for a few seconds, then the average time over all transactions is not an honest metric for the client who could not complete a transaction for several seconds. Temporary dips caused by missed consensus rounds or network partitions can badly degrade a service that showed excellent performance on the test bench.
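The 900/100 example above can be made concrete with a minimal Python sketch. The latency values (50 ms for the fast transactions, 4 s for the stuck ones) are made-up numbers chosen for illustration, not measurements, and `percentile` is a simple nearest-rank helper written for this example.

```python
import math
import statistics

# Hypothetical latencies in seconds: 900 fast transactions and 100 "stuck" ones.
latencies = [0.05] * 900 + [4.0] * 100


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


mean = statistics.mean(latencies)
median = statistics.median(latencies)
p95 = percentile(latencies, 95)

print(f"mean={mean:.3f}s median={median:.3f}s "
      f"p95={p95:.3f}s max={max(latencies):.3f}s")
```

With these assumed numbers the mean lands somewhere between the two groups and describes neither of them, while the median hides the stuck transactions completely. Only the high percentiles and the maximum show what the unlucky clients experienced.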

To identify such bottlenecks, you need a good understanding of the stages at which a real blockchain may struggle to serve its users. Let's describe the cycle of delivering and processing a transaction and of obtaining the new blockchain state, from which the client can verify that the transaction has been processed and taken into account.

The transaction is formed on the client.
The transaction is signed on the client.
The client selects one of the nodes and sends the transaction to it.
The client subscribes to updates of the node's state database and waits for the result of executing its transaction.
The node propagates the transaction over the p2p network.
One or several BPs (block producers) process the accumulated transactions, updating the state database.
A BP forms a new block after processing the required number of transactions.
The BP distributes the new block over the p2p network.
The new block is delivered to the node the client is connected to.
The node updates its state database.
The node sees the update that concerns the client and sends it a notification about the transaction.

Now let's take a closer look at these stages and describe the potential performance problems at each of them. Unlike with centralized systems, we will also consider the code executed on the network's clients. Quite often, when TPS is measured, transaction processing time is collected from the nodes rather than from the client, which is not entirely fair. The client does not care how quickly a node processed the transaction; what matters most is the moment when reliable information that the transaction has been included in the blockchain becomes available to the client. This is the metric that is, in essence, the transaction execution time. It means that different clients, even when sending the same transaction, can observe completely different times depending on the channel, the load on the node, its proximity, and so on. So this time must be measured on the clients, because it is the parameter that needs to be optimized.

Client-side transaction preparation

Let's start with the first two steps: the transaction is formed and signed by the client. Oddly enough, this too can be a bottleneck in blockchain performance from the client's point of view. This is unusual for centralized services, which take all computation and data operations upon themselves, while the client merely prepares a short request that may ask for a large amount of data or computation and receives a finished result. In blockchains, client code is becoming more and more powerful while the blockchain core is becoming more and more lightweight, and heavy computing tasks are usually handed over to client software. Some blockchain clients can spend quite a long time preparing a single transaction (I am talking about various Merkle proofs, succinct proofs, threshold signatures and other complex client-side operations). A good example of cheap on-chain verification combined with heavy transaction preparation on the client is a proof of membership in a list based on a Merkle tree.
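To illustrate the asymmetry in the Merkle-tree example, here is a minimal Python sketch. It rests on assumptions of mine: SHA-256 as the hash, the last node duplicated on odd-sized levels, and hypothetical function names. It does not reproduce the format of any particular blockchain and leaves out hardening such as domain separation between leaves and inner nodes.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Client side, heavy: hash every leaf and build all tree levels bottom-up.
    On an odd-sized level the last node is paired with a copy of itself."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def make_proof(levels, index):
    """Client side: collect the sibling hash at every level below the root."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        # Store the sibling hash and whether it sits to the left of our node.
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Verifier side, cheap: one hash per tree level."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


leaves = [f"addr-{i}".encode() for i in range(1000)]
levels = build_levels(leaves)        # touches all 1000 leaves
root = levels[-1][0]
proof = make_proof(levels, 123)
print(len(proof), verify(leaves[123], proof, root))
```

In this sketch the client has to process the whole list to build the tree, while the verifier only hashes once per level of the tree.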

Also, do not forget that client code does not just send transactions to the blockchain: it first queries the state of the blockchain, and this activity can affect the load on the network and on the blockchain nodes. So when taking measurements it makes sense to emulate the behavior of the client code as fully as possible. Even if your blockchain has only ordinary light clients that put a regular digital signature on the simplest asset-transfer transaction, the amount of heavy computation on the client grows every year, cryptographic algorithms become stronger, and this part of the processing can turn into a significant bottleneck in the future. So be careful not to miss the situation where, in a transaction that takes 3.5 s, 2.5 s is spent preparing and signing the transaction and 1.0 s on sending it to the network and waiting for a response. To assess the risk of this bottleneck, you need to collect metrics from client machines, not just from the blockchain nodes.
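One possible way to collect such a breakdown on the client is a small per-stage timer. The sketch below is an assumption about how this could be done, not a description of any existing tool: `StageTimer` and the stage names are hypothetical, and a fake clock is injected so that the example reproduces the 2.5 s / 1.0 s split from the text instead of real timings.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage on the client."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def shares(self):
        """Fraction of the total measured time spent in each stage."""
        total = sum(self.durations.values())
        if not total:
            return {}
        return {name: d / total for name, d in self.durations.items()}


# Fake clock for the example only; a real client would use the default clock.
ticks = iter([0.0, 2.5, 2.5, 3.5])
timer = StageTimer(clock=lambda: next(ticks))

with timer.stage("prepare_and_sign"):
    pass  # build proofs and sign the transaction here
with timer.stage("send_and_wait"):
    pass  # send to the node and wait for the inclusion notification here

print(timer.durations)
print(timer.shares())
```

Reporting shares rather than only the total makes it visible when most of the user's waiting time is spent before the transaction even leaves the client.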

Sending a transaction and monitoring its status

The next step is to send the transaction to the chosen blockchain node and obtain the status of its acceptance into the transaction pool. This stage resembles an ordinary database access: the node must write the transaction to the pool and start propagating information about it over the p2p network. The approach to evaluating performance here resembles evaluating the performance of traditional Web API microservices, except that transactions in blockchains can be updated and can actively change their status. In some blockchains, information about a transaction may be updated several times, for example when the node switches between forks of the chain or when BPs announce their intention to include the transaction in a block. Limits on the size of this pool and on the number of transactions in it can affect the performance of the blockchain. If the transaction pool is filled to its maximum size or does not fit into RAM, network performance can drop sharply. Blockchains have no centralized protection against a flood of junk messages, and if a blockchain supports a high volume of transactions and low fees, this can lead to transaction pool overflow, which is another potential performance bottleneck.
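The effect of pool limits can be sketched with a toy bounded transaction pool in Python. The eviction policy used here (drop the transaction with the lowest fee per byte when a limit is exceeded) is an assumption made for illustration, since real nodes use their own policies, and `TxPool` and `PooledTx` are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PooledTx:
    fee_per_byte: float                   # the only field used for ordering
    tx_hash: str = field(compare=False)
    size: int = field(compare=False)


class TxPool:
    """Toy pool limited both by transaction count and by total bytes."""

    def __init__(self, max_count, max_bytes):
        self.max_count = max_count
        self.max_bytes = max_bytes
        self.heap = []          # min-heap: cheapest transaction on top
        self.total_bytes = 0
        self.evicted = 0

    def add(self, tx_hash, size, fee):
        """Return True if the transaction is still in the pool after the call."""
        if size > self.max_bytes:
            return False        # could never fit, reject without evicting others
        tx = PooledTx(fee / size, tx_hash, size)
        heapq.heappush(self.heap, tx)
        self.total_bytes += size
        accepted = True
        while len(self.heap) > self.max_count or self.total_bytes > self.max_bytes:
            dropped = heapq.heappop(self.heap)
            self.total_bytes -= dropped.size
            self.evicted += 1
            if dropped is tx:
                accepted = False
        return accepted


pool = TxPool(max_count=3, max_bytes=1000)
results = [
    pool.add("a", 200, 20),
    pool.add("b", 200, 40),
    pool.add("c", 200, 60),
    pool.add("spam", 200, 2),    # cheapest, evicted immediately
    pool.add("d", 200, 100),     # pushes out "a"
]
print(results, sorted(tx.tx_hash for tx in pool.heap), pool.evicted)
```

A counter such as `evicted` is the kind of node-side metric worth exporting: a benchmark that floods the pool with cheap transactions will show it growing long before clients start complaining about lost transactions.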

In blockchains, the client sends the transaction to any blockchain node it likes. The transaction hash is usually known to the client before sending, so all it needs to do is establish a connection and, after sending, wait for the blockchain to change its state by including the transaction. Note that when measuring "TPS" you can get completely different results for different ways of connecting to a blockchain node. This can be plain HTTP RPC or a WebSocket, which makes it possible to implement the "subscribe" pattern. In the second case the client receives the notification earlier, and the node spends fewer resources (mainly memory and traffic) on replies about the transaction status. So when measuring "TPS" you must take into account how clients connect to nodes. To assess the risk of this bottleneck, a blockchain benchmark should therefore be able to emulate clients using both WebSocket and HTTP RPC requests, in proportions matching real networks, and to vary the nature and size of transactions.
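The difference between the two connection styles can be estimated with a back-of-the-envelope model in Python. It is a deliberately simplified sketch: network delays are ignored, the polling client is assumed to send a status request at a fixed interval, and all function names and numbers are hypothetical.

```python
import math


def polling_cost(inclusion_time, poll_interval):
    """HTTP RPC client that polls at poll_interval, 2 * poll_interval, ...
    Returns (time at which the client learns the result, requests sent)."""
    requests = max(1, math.ceil(inclusion_time / poll_interval))
    return requests * poll_interval, requests


def subscription_cost(inclusion_time):
    """WebSocket client: the node pushes one notification on inclusion."""
    return inclusion_time, 1


def status_messages(n_clients, ws_fraction, inclusion_time, poll_interval):
    """Status messages the node handles for a mix of both client types."""
    ws_clients = round(n_clients * ws_fraction)
    http_clients = n_clients - ws_clients
    _, per_http_client = polling_cost(inclusion_time, poll_interval)
    return ws_clients + http_clients * per_http_client


print(polling_cost(3.2, 1.0))        # the polling client notices late
print(subscription_cost(3.2))        # the subscriber notices at inclusion time
print(status_messages(1000, 0.3, 3.2, 1.0))
```

Even this crude model shows why the share of WebSocket and HTTP RPC clients in a benchmark changes both the latency seen by clients and the load seen by the node.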

To assess the risks of this bottleneck, you also need to collect metrics from client machines, and not just from the blockchain node.

Transfer of transactions and blocks on a p2p network

Blockchains use peer-to-peer (p2p) networking to transfer transactions and blocks between participants. Transactions spread across the network, starting from one of the nodes, until they reach the block producers, which pack transactions into blocks and use the same p2p network to distribute new blocks to all nodes. Most modern p2p networks are based on various modifications of the Kademlia protocol. Good overviews of this protocol are available, and published measurements of the BitTorrent network show that this kind of network is more complex and less predictable than the rigidly configured network of a centralized service. Similar measurements of various interesting metrics exist for Ethereum nodes.

In short, each peer in such a network maintains its own dynamic list of other peers, from which it requests blocks of information that are addressed by content. When a peer receives a request, it either returns the requested information or forwards the request to the next pseudo-random peer from its list; once it receives the answer, it passes it on to the requester and caches it for a while, so that next time it can return this block of information sooner. As a result, popular information ends up in many caches on many peers, while unpopular information is gradually evicted. Peers keep track of who transferred information to whom, and the network tries to reward active distributors by raising their rating and giving them a higher level of service, automatically pushing inactive participants out of peer lists.
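In Kademlia-style networks, the notion of a "close" peer is defined by the XOR of two identifiers. The Python sketch below shows only that distance metric and the selection of the closest known peers. The helper names are hypothetical, deriving an ID by hashing a name with SHA-1 is an assumption for illustration, and the sketch covers neither the caching nor the rating mechanisms described above.

```python
import hashlib


def node_id(name: str) -> int:
    """Illustrative 160-bit ID derived from a name with SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance: bitwise XOR of two IDs, compared as an integer."""
    return a ^ b


def bucket_index(own_id: int, other_id: int) -> int:
    """Position of the highest differing bit, which selects the k-bucket
    in which a peer would be stored (-1 if the IDs are equal)."""
    return xor_distance(own_id, other_id).bit_length() - 1


def closest_peers(target: int, peers, k=3):
    """The k known peers whose IDs are closest to the target ID."""
    return sorted(peers, key=lambda peer: xor_distance(peer, target))[:k]


# Tiny 4-bit IDs keep the arithmetic easy to follow by hand.
known = [0b1000, 0b1011, 0b0010, 0b1110]
print(closest_peers(0b1010, known, k=2))
print(bucket_index(0b1010, 0b0010))
```

Because each node knows only a subset of peers chosen by this metric, the route a request takes depends on which peers happen to be online, which is one reason such networks are harder to predict than a fixed topology.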

So, the transaction now has to be spread across the network so that block producers can see it and include it in a block. The node actively distributes the new transaction to everyone and listens to the network, waiting for a block whose index contains the required transaction, in order to notify the waiting client. The time it takes for nodes in a p2p network to pass information about new transactions and blocks to each other depends on a very large number of factors: the number of honest nodes working nearby (from the network's point of view), how warm the caches of those nodes are, the size of blocks and transactions, the nature of the changes, the geography of the network, the number of nodes, and much more. Comprehensive measurement of performance metrics in such networks is a complicated matter: request processing time has to be evaluated simultaneously on clients and on peers (blockchain nodes). Problems in any of the p2p mechanisms, incorrect eviction and caching of data, inefficient management of the lists of active peers, and many other factors can cause delays that affect the efficiency of the whole network. This bottleneck is the hardest one to analyze, to test, and to interpret results for.
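How strongly propagation time depends on topology can be shown with an idealized model in Python. It assumes that every node forwards a message to all of its peers immediately and that each link has a fixed latency, so the earliest arrival times are shortest paths. The node names and latencies (in milliseconds) are invented, and validation time, bandwidth and caching are ignored.

```python
import heapq


def propagation_times(links, source):
    """Earliest arrival time of a message at every reachable node when each
    node forwards it to all peers at once (Dijkstra's shortest paths)."""
    arrival = {source: 0}
    queue = [(0, source)]
    while queue:
        t, node = heapq.heappop(queue)
        if t > arrival[node]:
            continue  # stale queue entry, a faster route was already found
        for peer, latency in links.get(node, []):
            candidate = t + latency
            if candidate < arrival.get(peer, float("inf")):
                arrival[peer] = candidate
                heapq.heappush(queue, (candidate, peer))
    return arrival


# Hypothetical network, latencies in milliseconds, links are symmetric.
links = {
    "client_node": [("eu1", 20), ("us1", 90)],
    "eu1": [("client_node", 20), ("bp", 120), ("us1", 80)],
    "us1": [("client_node", 90), ("eu1", 80), ("bp", 15)],
    "bp": [("eu1", 120), ("us1", 15)],
}

# The same network after the us1 <-> bp link disappears.
degraded = {
    node: [(p, l) for p, l in peers if {node, p} != {"us1", "bp"}]
    for node, peers in links.items()
}

print(propagation_times(links, "client_node")["bp"])
print(propagation_times(degraded, "client_node")["bp"])
```

In this toy network, losing a single link changes the time the transaction needs to reach the block producer, and the new block then has to travel back the same way before the client can be notified.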

Blockchain processing and state database update

The most important part of a blockchain's work is the consensus algorithm, its application to new blocks received from the network, and the processing of transactions with the results written to the state database. Adding a new block to the chain and then selecting the main chain should work as fast as possible. In real life, however, "should" does not mean "does", and one can imagine, for example, a situation in which two long competing chains keep switching with each other, changing the metadata of thousands of transactions in the pool at each switch and constantly rolling back the state database. In terms of locating the bottleneck, this stage is simpler than the p2p network layer, because transaction execution and the consensus algorithm are strictly deterministic, which makes measuring things here easier.
The main thing is not to confuse random performance degradation at this stage with network problems: nodes become slower at serving blocks and information about the main chain, and to an external client this may look like a slow network, although the problem lies somewhere else entirely.
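The cost of switching between competing chains can be made visible by computing which blocks must be rolled back and which must be applied. The Python sketch below is a simplified illustration under my own assumptions: blocks are identified by short strings, `parents` is a hypothetical map from a block to its parent, and the fork-choice rule itself (how the new tip is selected) is out of scope.

```python
def reorg_path(parents, old_tip, new_tip):
    """Return (blocks_to_roll_back, blocks_to_apply) for a switch of the main
    chain from old_tip to new_tip. `parents` maps block hash -> parent hash,
    with the genesis block mapped to None."""

    def ancestry(tip):
        chain = []
        while tip is not None:
            chain.append(tip)
            tip = parents[tip]
        return chain  # tip first, genesis last

    new_ancestors = set(ancestry(new_tip))
    rollback = []
    for block in ancestry(old_tip):
        if block in new_ancestors:
            common = block  # first shared block: the fork point
            break
        rollback.append(block)
    else:
        raise ValueError("chains share no common ancestor")

    to_apply = []
    tip = new_tip
    while tip != common:
        to_apply.append(tip)
        tip = parents[tip]
    to_apply.reverse()  # apply in chain order, oldest first
    return rollback, to_apply


# Two forks that diverge after block "a1".
parents = {"g": None, "a1": "g", "a2": "a1", "a3": "a2",
           "b2": "a1", "b3": "b2", "b4": "b3"}
print(reorg_path(parents, "a3", "b4"))
```

The lengths of the two lists are natural candidates for node metrics: every rolled-back block means reverted state database changes and transactions whose status has to be updated again.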

To optimize performance at this stage, it is useful to collect and monitor metrics from the nodes themselves, including those related to updating the state database: the number of blocks processed by the node, their size, the number of transactions, the number of switches between chain forks, the number of invalid blocks, virtual machine running time, data commit time, and so on. This helps to avoid confusing network problems with errors in the chain-processing algorithms.
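The metrics listed above could be gathered in a small per-node structure like the following Python sketch. The class and field names are hypothetical and the sample values are invented. A real node would more likely export such counters to a monitoring system than keep them in memory like this.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class NodeMetrics:
    """In-memory counters for block processing on a single node."""
    blocks_processed: int = 0
    invalid_blocks: int = 0
    fork_switches: int = 0
    transactions: int = 0
    block_bytes: int = 0
    vm_seconds: list = field(default_factory=list)
    commit_seconds: list = field(default_factory=list)

    def record_block(self, size_bytes, tx_count, vm_time, commit_time):
        self.blocks_processed += 1
        self.block_bytes += size_bytes
        self.transactions += tx_count
        self.vm_seconds.append(vm_time)
        self.commit_seconds.append(commit_time)

    def record_invalid_block(self):
        self.invalid_blocks += 1

    def record_fork_switch(self):
        self.fork_switches += 1

    def summary(self):
        blocks = self.blocks_processed
        return {
            "blocks": blocks,
            "invalid_blocks": self.invalid_blocks,
            "fork_switches": self.fork_switches,
            "avg_tx_per_block": self.transactions / blocks if blocks else 0.0,
            "mean_vm_s": statistics.mean(self.vm_seconds) if blocks else 0.0,
            "max_commit_s": max(self.commit_seconds, default=0.0),
        }


metrics = NodeMetrics()
metrics.record_block(size_bytes=1000, tx_count=10, vm_time=0.25, commit_time=0.5)
metrics.record_block(size_bytes=3000, tx_count=30, vm_time=0.75, commit_time=2.0)
metrics.record_fork_switch()
metrics.record_invalid_block()
print(metrics.summary())
```

Keeping virtual machine time and commit time separate makes it easier to tell a slow contract from a slow state database when block processing degrades.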

Transaction processing by the virtual machine can be a useful source of information for optimizing blockchain performance. The number of memory allocations, the number of read/write instructions and other metrics related to how efficiently contract code is executed can tell developers a lot. At the same time, smart contracts are programs, which means that in theory they can consume any resource (CPU, memory, network, storage), so transaction processing is a rather unpredictable stage, which on top of that changes considerably between versions and whenever contract code changes. Metrics related to transaction processing are therefore also needed to optimize blockchain performance effectively.
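As a toy illustration of counting read and write operations per transaction, the Python sketch below wraps contract storage in a metering layer. Both `MeteredStorage` and the `transfer` "contract" are hypothetical. Real virtual machines meter at the instruction level, which this sketch does not attempt.

```python
class MeteredStorage:
    """Key-value contract storage that counts every read and write."""

    def __init__(self):
        self._data = {}
        self.reads = 0
        self.writes = 0

    def get(self, key, default=0):
        self.reads += 1
        return self._data.get(key, default)

    def set(self, key, value):
        self.writes += 1
        self._data[key] = value


def transfer(storage, sender, recipient, amount):
    """Toy contract: move `amount` from sender to recipient if funds allow."""
    balance = storage.get(sender)
    if balance < amount:
        return False
    storage.set(sender, balance - amount)
    storage.set(recipient, storage.get(recipient) + amount)
    return True


storage = MeteredStorage()
storage.set("alice", 100)

# Measure one transaction by taking the counters before and after it.
before = (storage.reads, storage.writes)
ok = transfer(storage, "alice", "bob", 30)
cost = (storage.reads - before[0], storage.writes - before[1])
print(ok, cost)
```

Comparing such per-transaction counters between contract versions is one simple way to notice that a code change has made the same operation more expensive.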

Receiving the notification that the transaction is included in the blockchain

This is the final stage of the service the blockchain client receives. Compared with the other stages there is no great overhead here, but you should still allow for the possibility that the client receives a large response from the node (for example, a smart contract returning an array of data). In any case, this moment is the most important one for whoever asked "how many TPS does your blockchain have?", because it is at this moment that the time of receiving the service is recorded.

At this point the full time the client had to spend waiting for a response from the blockchain is always added up. This is the time the user will wait for confirmation in their application, and optimizing it is the developers' main task.

Conclusion

As a result, it is possible to describe the types of operations that are performed in blockchains and divide them into several categories:

cryptographic transformations, proof construction
peer-to-peer networking, transaction and block replication
transaction processing, smart contract execution
applying changes in the blockchain to the state database, updating data on transactions and blocks
read-only requests to the state database, blockchain node API, subscription services

In general, the technical requirements for nodes of modern blockchains are very demanding: fast CPUs for cryptography, a large amount of RAM to store the state database and access it quickly, networking with a large number of simultaneously open connections, and large-capacity storage. Such high requirements and the abundance of different types of operations inevitably mean that a node's resources may run short, and then any of the stages discussed above can become yet another bottleneck for overall network performance.

When developing blockchains and evaluating their performance, you will have to take all of these points into account. To do so, you need to collect and analyze metrics from clients and network nodes simultaneously, look for correlations between them, estimate the time it takes to serve clients, and account for all the main resources (CPU, memory, network, storage), understanding how they are used and how they affect one another. All this makes comparing the speed of different blockchains in terms of "how many TPS" a thankless task, because there is a huge number of different configurations and states. In large centralized systems, such as clusters of hundreds of servers, these problems are also complex and also require collecting a large number of different metrics. In blockchains, however, p2p networks, virtual machines, contracts and the internal economy make the number of degrees of freedom much larger, so that a test run on just a few servers is not representative and yields only very approximate values that have almost nothing to do with reality.

Therefore, when developing a blockchain core, we evaluate performance using fairly complex software that orchestrates the launch of a blockchain with dozens of nodes, automatically runs the benchmark and collects metrics. Without such tooling it is very difficult to debug protocols that work with many participants.

So, when you are asked "how many TPS are in your blockchain?", offer the person some tea and check whether they are ready to look through a dozen charts and listen to a long list of blockchain performance problems along with your suggestions for solving them...

Source: https://habr.com/ru/post/459763/
