Managing Multi-Party Contracts on the Blockchain: The Curious Case of Batching — Part III

Christian Vecchiola
18 min read · Oct 7, 2021

Welcome back! This is the last part of a three-part article. In Part I we discussed the business context of the solution, provided an overview of the platform capabilities, and introduced the business rationale for batching multi-party contracts. In Part II we looked under the hood and discussed the challenges of implementing batching from both a logical and a technical perspective. We discovered the additional complexity of dealing with a transactional system such as Hyperledger Fabric. In this part we will describe some solution options and wrap up the discussion.

As usual, buckle up and let’s get started!

Exploring the Solution Space

The implications of the transactional behaviour of Hyperledger Fabric require us to rethink our initial approach and identify a solution that is both correct and usable. By correct we mean committing to the ledger only the state changes associated with successful operations in the batch; by usable we mean avoiding, if possible, an implementation with an “all or nothing” behaviour when it comes to batches.

Here are some of the options that we can consider:

  • Modelling the batch execution as an atomic transaction.
  • Taking away the batch capability from the smart contract and orchestrating it in the backend (API layer).
  • Altering the execution logic of the batch to cater for a different handling of the operations that fail.
  • Implementing the batch capability without reusing the logic already implemented for single operations, designing it “batch first”.

There is no silver bullet here. The answer is very much dependent on the specific functionality and system being built. Let’s explore the pros and cons of each option.

Option 1 — Batches as Atomic Transactions

This option entails designing the submission of the batch as an atomic transaction, thus falling back to the “all or nothing” semantics. With this approach the first failing operation in the batch causes the entire batch submission to fail.

This is perhaps the quickest and most effective solution… when considered in isolation. By making the batch operation fail as soon as one of the operations contained in it fails, we rely once again on the services that Hyperledger Fabric provides to smart contract developers. The approach guarantees no corruption of the ledger data at the cost of completely re-executing the batch submission. The strong points of this approach are safety, simplicity, and out-of-the-box reuse of the transactional behaviour of Hyperledger Fabric.
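To make the mechanics concrete, here is a minimal sketch of what the batch method could look like with this option, assuming the Go chaincode shim (fabric-chaincode-go). The `Operation` type and the `importContract` function are hypothetical placeholders for the batch payload and the existing single-contract logic.

```go
package main

import (
	"fmt"

	"github.com/hyperledger/fabric-chaincode-go/shim"
)

// Hypothetical placeholders: the batch payload entry and the existing
// single-contract import logic already present in the smart contract.
type Operation struct {
	ContractID string
	Payload    []byte
}

func importContract(stub shim.ChaincodeStubInterface, op Operation) error {
	// existing single-operation logic, reused as-is
	return stub.PutState(op.ContractID, op.Payload)
}

// ImportBatch executes the batch with "all or nothing" semantics: the first
// failing operation aborts the whole invocation, and Fabric discards the
// accumulated read/write set so that nothing is committed to the ledger.
func ImportBatch(stub shim.ChaincodeStubInterface, ops []Operation) error {
	for i, op := range ops {
		if err := importContract(stub, op); err != nil {
			return fmt.Errorf("operation %d failed, aborting batch: %w", i, err)
		}
	}
	return nil
}
```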

This solution may be acceptable in scenarios where failures due to sporadic errors are rare and the cost of repeating the batch is not excessive. Costs in this case are associated with a variety of factors: user experience, technical complexity, etc. One main driver for cost consideration is the fact that the batch submission to the smart contract is just one step of the overall submission process, which has already accrued state in previous steps. Should the batch fail, such state would need to be reverted.

In cases where the submission of the batch to the smart contract is the only step of the process, the “all or nothing” logic does not cause significant disadvantages, as there is no (or minimal) accrued state to revert in case of failure. In our case, the batch has already gone through an internal approval process which would require rollback, thus increasing the complexity of the system. The key disadvantages here are: i) building complex rollback logic, invoked only sporadically, to keep the on-chain and off-chain state coherent; and ii) the potential execution of unnecessary rollbacks of successful operations caused by a single failure.

To compensate for these disadvantages we can build automatic retry logic into the process, which may help overcome sporadic failures without the need to roll back, but this adds to the complexity of the design and implementation and does not eliminate the fact that a failure in one sub-system has consequences on already successfully completed tasks.

Finally, from a user experience standpoint this approach requires redoing a more significant amount of work. If we adopt the “all or nothing” approach a user may be exposed to repeating operations that were — if performed in isolation — successful, rather than only repeating those that failed. The design may expose the details of the failure, thus highlighting that we are re-doing work that was successful, or be more opaque, leaving the user in the dark about the specific operation that failed. In both cases, we deliver a negative user experience.

Option 2 — Implementing Batching Outside the Ledger

This approach removes batching capabilities from the smart contract layer and delegates the management of batches to components outside the ledger. The smart contract layer only needs to be enriched with the workflow to import a single contract.

This solution removes the headache completely… at first. It also limits the number of changes needed across the solution stack to support batching. The main benefits again appear to be simplicity and reduced intrusion into the solution stack.

Fantastic. Or is it? Well, complexity comes to the table again. Managing batches only up to the backend (API) layer implies that the backend services are required to implement the orchestration logic, as well as the collection of results and the tracking of their execution. One advantage of running batched execution in the ledger is the “all-or-nothing” execution model of the transaction and the fact that we don’t need to deal with restartable processes and partial failures. As we orchestrate the batching logic outside the ledger, we can no longer rely upon the transactional behaviour of Fabric for the entire batch. The batch is no longer executed as a single transaction but as a sequence of transactions (i.e. one for each operation in it) that need to be individually accounted for and tracked. While this seems similar to the already existing behaviour, we now have to deal with grouped operations submitted by the user as a single unit. If a failure occurs in the middle of the batch execution, we need to implement logic to make the process resumable from where it left off. Again, this introduces an additional layer of complexity that was previously absorbed by the implementation of Hyperledger Fabric.
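A minimal sketch of what this orchestration could look like in the backend is shown below. Everything here is hypothetical — the `BatchStore` progress tracker, the `Submitter` function that submits one Fabric transaction per operation, and the `RunBatch` loop — the point is simply to show the bookkeeping needed to make the process resumable after a partial failure.

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical store tracking per-operation progress, so that an
// interrupted batch can be resumed without repeating committed work.
type BatchStore interface {
	IsDone(batchID, opID string) (bool, error)
	MarkDone(batchID, opID, txID string) error
	MarkFailed(batchID, opID, reason string) error
}

type Operation struct {
	ID   string
	Args []string
}

// Submitter invokes the single-contract smart contract method: one Fabric
// transaction per operation in the batch.
type Submitter func(ctx context.Context, op Operation) (txID string, err error)

// RunBatch orchestrates the batch outside the ledger: each operation is an
// independent transaction, and already-completed operations are skipped so
// the process can be restarted after a partial failure.
func RunBatch(ctx context.Context, batchID string, ops []Operation, store BatchStore, submit Submitter) error {
	for _, op := range ops {
		done, err := store.IsDone(batchID, op.ID)
		if err != nil {
			return err
		}
		if done {
			continue // already committed in a previous run: resume from here
		}
		txID, err := submit(ctx, op)
		if err != nil {
			if mErr := store.MarkFailed(batchID, op.ID, err.Error()); mErr != nil {
				return mErr
			}
			return fmt.Errorf("batch %s stopped at operation %s: %w", batchID, op.ID, err)
		}
		if err := store.MarkDone(batchID, op.ID, txID); err != nil {
			return err
		}
	}
	return nil
}
```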

Besides complexity, another aspect that this solution brings to light is decreased throughput. As we process one transaction for each operation in the batch, rather than a single transaction, we introduce additional network communication and more opportunities for failures, and we reduce the amount of work executed per transaction.

To conclude, the disadvantages of this approach are: i) increased complexity in the off-chain logic handling batches, which needs to be more sophisticated and resilient; and ii) reduced throughput.

Option 3 — Implementing a More Refined Batch Execution Logic

This solution entails modifying the execution logic of the single operations when they are executed within the context of the batch to cater for the generation of the correct read/write sets. The smart contract is altered to execute operations in a different context when they are submitted within a batch.

This is a lot of fun! And a lot of complexity. It deserves a more in-depth explanation than the other approaches, as it is also quite interesting — and there is a good chance you are asking for trouble if you don’t get it right. In essence we want to achieve the following:

  • track the read set, but more importantly the write set, of each individual operation;
  • discard those associated to a failed operation;
  • retain only those associated to a successful operation; and
  • commit them to the ledger.

It seems easy at first, except for the fact that from within the smart contract we don’t have direct control over the generation of the read/write set created for the transaction, nor do we want to alter Fabric’s transaction flow behaviour.

One technique that we can use is proxying the ChaincodeStubInterface during the invocation of operations within a batch. The implementation of this technique entails a combination of the decorator and proxy patterns, since it controls access to the real ChaincodeStubInterface implementation and adds a couple of useful functionalities. The proxy performs the following for each operation:

  • intercept GetState(key string) calls and record their result internally;
  • intercept PutState(key string, value []byte) calls and cache the value;
  • intercept DelState(key string) calls and mark the key for deletion.

Upon termination of the operation, the proxy executes the following:

  • if the status code identifies a successful execution, forward the changes to the proxied ChaincodeStubInterface by calling the corresponding PutState(…) and DelState(…) methods for the keys that have been recorded;
  • if the status code identifies a failure, discard the cached keys.

We don’t necessarily need to intercept the calls to the iterator methods GetStateByRange(…), GetStateByPartialCompositeKey(…), and the associated paginated versions, because these only contribute to the read set, which will simply tend to be larger than the one that would result from composing the read sets of the successful operations alone.

This approach works because Hyperledger Fabric does not have “read-after-write” semantics within a transaction, and therefore we can safely buffer the changes in the context of a transaction.

The implementation of this logic is rather sophisticated, but it acts transparently: the existing operations are unaware of running in a batched context and do not require modification.
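Below is a minimal sketch of such a proxy, assuming the Go chaincode shim (fabric-chaincode-go). The `batchStub` type and its `commit`/`discard` methods are hypothetical names; by embedding the real ChaincodeStubInterface, every call that is not overridden (including GetState and the iterator methods) is forwarded to the real stub, so the read set is still built by Fabric as usual.

```go
package main

import (
	"github.com/hyperledger/fabric-chaincode-go/shim"
)

// batchStub wraps the real ChaincodeStubInterface and buffers writes.
// Embedding the interface forwards every call we do not override,
// so GetState still reaches the real stub and contributes to the read set.
type batchStub struct {
	shim.ChaincodeStubInterface
	writes  map[string][]byte // keys written by the current operation
	deletes map[string]bool   // keys marked for deletion
}

func newBatchStub(stub shim.ChaincodeStubInterface) *batchStub {
	return &batchStub{
		ChaincodeStubInterface: stub,
		writes:                 make(map[string][]byte),
		deletes:                make(map[string]bool),
	}
}

// PutState caches the value instead of adding it to the transaction write set.
func (b *batchStub) PutState(key string, value []byte) error {
	delete(b.deletes, key)
	b.writes[key] = value
	return nil
}

// DelState marks the key for deletion without touching the real write set.
func (b *batchStub) DelState(key string) error {
	delete(b.writes, key)
	b.deletes[key] = true
	return nil
}

// commit propagates the buffered changes to the real stub once the
// operation has completed successfully.
func (b *batchStub) commit() error {
	for key, value := range b.writes {
		if err := b.ChaincodeStubInterface.PutState(key, value); err != nil {
			return err
		}
	}
	for key := range b.deletes {
		if err := b.ChaincodeStubInterface.DelState(key); err != nil {
			return err
		}
	}
	return nil
}

// discard drops the buffered changes of a failed operation.
func (b *batchStub) discard() {
	b.writes = make(map[string][]byte)
	b.deletes = make(map[string]bool)
}
```

The single operations keep calling PutState and DelState exactly as before; they simply receive the proxy instead of the real stub when invoked within a batch.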

What is required for this system to work is enforcing the correct status code semantics for each operation, whereby:

  • 2xx and 3xx status codes identify successful execution;
  • 4xx and 5xx status codes identify failure.

This is in line with what Hyperledger Fabric advises in relation to chaincode development. One additional measure that I would consider (if possible) is tying 4xx error codes to authorisation/authentication checks executed before performing any update to the ledger, so that errors are properly identified and can be safely rolled back. If we impose that 4xx errors can only be generated before any update of the ledger, then we can restrict the discarding behaviour to sporadic 5xx errors only. The differences from an implementation perspective are minimal, since they only affect the check of the operation status code.

The diagram below explains the process for executing batches at the smart contract level with this approach. The smart contract method that receives the batch submission sequentially executes the operations in the batch by invoking the corresponding smart contract method with a modified ChaincodeStubInterface, which implements the behaviour described above. After each successful invocation it queries the modified interface for the keys it has buffered and calls the real stub interface to propagate the changes. Any error in this phase causes the method to abort, thus preserving the integrity of the ledger by relying on the all-or-nothing semantics.

Figure 1 — Invocation of the (original) smart contract methods with a Chaincode Stub Proxy. The proxy buffers all the updates to the ledger in a local key cache, and changes are applied if and only if the method terminates successfully. If there is any error while updating the keys, the batch execution is aborted to prevent the creation of inconsistencies. By buffering the changes locally we prevent the peer from recording updates that would otherwise be added to the read/write set regardless of the result of the method. For PutState(K,V) and DelState(K) we may issue a GetState(K) if we want to distinguish between creation and update, or to check the existence of the key during deletion.
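Putting the pieces together, a sketch of the batch execution loop could look like the following. It reuses the hypothetical `batchStub` proxy from the earlier sketch; `Operation`, `OperationResult`, and the `invoke` dispatcher that routes to the existing single-operation methods are also hypothetical.

```go
// Hypothetical types describing an operation in the batch and its outcome.
type Operation struct {
	ID   string
	Type string
	Args [][]byte
}

type OperationResult struct {
	ID      string
	Status  int32
	Message string
}

// executeBatch runs every operation through the batchStub proxy sketched
// earlier. Successful operations (2xx/3xx) have their buffered writes
// propagated to the real stub; failed ones (4xx/5xx) are discarded and
// reported back, without aborting the transaction.
func executeBatch(
	stub shim.ChaincodeStubInterface,
	ops []Operation,
	invoke func(shim.ChaincodeStubInterface, Operation) (int32, string),
) ([]OperationResult, error) {
	results := make([]OperationResult, 0, len(ops))
	for _, op := range ops {
		proxy := newBatchStub(stub)
		status, msg := invoke(proxy, op) // dispatch to the existing single-operation method
		if status >= 400 {
			proxy.discard()
			results = append(results, OperationResult{ID: op.ID, Status: status, Message: msg})
			continue
		}
		// Propagating the buffered keys uses the real stub: any error here
		// aborts the whole batch, falling back on Fabric's all-or-nothing
		// semantics to keep the ledger consistent.
		if err := proxy.commit(); err != nil {
			return nil, err
		}
		results = append(results, OperationResult{ID: op.ID, Status: status})
	}
	return results, nil
}
```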

The main advantages of this technique are: i) increased throughput, as we can execute all the batch operations within a single transaction; ii) flexibility, as it enables us to re-execute only failed operations; and iii) minimal changes to the existing methods (i.e. ensuring that 500 is returned only for errors that are not under our control).

The approach also comes with disadvantages, one of them being the complexity of the ChaincodeStubInterface proxy, which essentially replicates within the context of a transaction what Hyperledger Fabric already does for the entire transaction. This replication is necessary to ensure that only the changes that do not lead to an error are propagated. The other disadvantage is the natural coupling of this algorithm with the rest of the implementation of the smart contract: the implementation strongly relies on the semantics of the status code returned by each single operation and enforces a contract with the invoked methods. This is not particularly bad, as it relies on the same assumption that Hyperledger Fabric uses to interpret the outcome of a single transaction.

Complexity is definitely the bigger disadvantage of the two. While the concept of the proxy and the logic driving the batch execution may seem simple, verifying their correctness requires an extensive set of tests, which in turn require a low-level understanding of the ledger key updates for each operation that can be subject to batching. This demands a deep understanding of the existing smart contract code base that goes beyond the execution of the business flow. The burden of testing in this case is quite significant as the implementation of the batching support is generic in nature, and therefore needs to be integration-tested with all the operations of the smart contract. This is primarily done to ensure that, should batching be extended to functions other than IMPORT, we don’t have bad surprises.

Option 4 — Implementing a Batch First Approach

This approach entails providing completely different logic to execute batches in the smart contract layer. We will not be relying on the existing implementation for the operations being batched, but rather pursuing a new implementation designed to operate on batched entities from the ground up. In contrast, all the previous solutions built upon the existing logic, adding a layer on top of it to manage batches.

This option is cited more for the sake of completeness than anything else, but it is worth exploring its potential benefits and drawbacks.

As in the previous option we are able to execute the batch in a single transaction, possibly with management of partial failures. The more interesting aspect is that we achieve complete separation of the batching logic from the existing implementation of the single operations, and this is perhaps the only advantage that this approach brings. As a result, we also isolate the impact of this new feature and prevent the solution from becoming unstable, as we are not modifying existing smart contract code. Another benefit derived from the separation is the ability to modify this capability independently of the others, whereas in the other approaches the implementation of batching relies on the existing codebase to perform the single operations.

The benefits of separation are clear; less clear is how this implementation could actually diverge from the existing codebase. After all, at a conceptual level batching is still doing “more of the same”, and pursuing this path will most likely end up re-inventing the wheel (i.e. redoing most of the work we have already implemented for the single operations). Therefore, thought must be given to whether it brings any value.

Downstream Thoughts: Implications on Dependent Processes

So far we have discussed the execution of batching up to the interaction with the smart contract layer, where the information relevant to multiple parties is persisted. This is not the end of the story, as there are other processes that depend on this information to complete the life-cycle of a batch execution. The majority of these processes are fuelled by the eventing system associated with the smart contract, which triggers an event for every transaction successfully committed to the ledger.

In the use case under study, there exist three main sub-systems that leverage this stream of events to perform their duty:

  • the internal logic that keeps in sync the off-chain data with the ledger data;
  • the notification subsystem, which alerts users mainly across organisations that something of interest to a contract has happened;
  • additional services, which provide a materialised view of the contracts in the ledger for each organisation.

These systems, which pre-date the batching capabilities, are designed around the assumption that ledger events are associated with a single contract and not with a batch of contracts. As a result, they require some degree of change to enable them to listen for events associated with batches. These considerations apply to all the solution options discussed previously.

Before diving into the challenges posed to downstream processes, it is worth quickly recapping the role of transaction events in the transaction life-cycle. The figure below provides a brief overview, sufficient for the present discussion.

Figure 2 — Simplified lifecycle of a transaction in Hyperledger Fabric. The picture shows the case of a successfully executed proposal which is endorsed by the peer to be committed in a block. In most cases, endorsing peers and committing peers are the same peer nodes. The figure also shows that while the initial steps of the transaction execution are orchestrated by the same application, events may be of interest to distinct applications. This is the case for the platform under study.

Three important aspects emerge from the figure: i) events are triggered at the end of the transaction life-cycle, once the ledger updates are committed by the peers; ii) the applications listening to the events aren’t necessarily — or only — the application that submitted the transaction; and iii) only one event per transaction is raised by the committing peers.

The key point here is that a transaction event is the result of a committed change to the ledger. In our use case it has the dual role of being a confirmation of the success of the operation for the application that submitted the transaction and a notification for other applications that the ledger has been updated by another party. The first function supports the synchronisation tasks that the internal logic is responsible for, while the second function supports the notification subsystem and is also the primary means of communication across the distributed components of the platform.

The applications listening to the events implement a loop that processes them and makes them available to other system components by streaming them. At the very minimum, changes to the implementation of the event loop are required to properly process batches downstream.

Figure 3 — Eventing Behaviour of the different solution options.

The figure above shows how batching affects the propagation of events, according to the different solution options. It is worth noting that all the solutions that introduce batching support at the smart contract level also break the 1:1 correspondence between a business operation and a transaction event assumed in the pre-existing implementation. The problem does not exist in Option 2 because the batch is broken down into independent transactions (one for each operation in the batch).

At the very least, we want to ensure that Option 1, Option 3, and Option 4 replicate the behaviour produced by Option 2: erasing the existence of the batch and streaming the corresponding collection of events associated with the operations in the batch. This constitutes the most naive implementation, but it is the minimum required for downstream systems and services to function correctly. The alternative to such an approach is to make every downstream system aware of batching and leave to each of these systems the responsibility of pulling the detailed information corresponding to the summary delivered with the batch event. The latter also introduces opportunities to optimise for large numbers, but has the disadvantage of replicating this logic across distinct systems and services, with the additional consequence of further coupling the system.

In this article we are going to explore an approach that is essentially close to the behaviour produced by Option 2, with the insertion of additional metadata that enables downstream systems (if they so desire) to identify the beginning and the end of a batch, thus giving them the opportunity to introduce optimisations. What we want to achieve is the behaviour shown in the figure below: the presence of a batch can be “ignored” if needed, and downstream systems can behave in the same manner as before without losing essential information.

Figure 4 — Event stream generation for sequences that include transactions executing batches. The index of the transactions identifies the sequence and order in which they are committed to the ledger; it corresponds to the index of the event that is triggered for that transaction. The index on the operations identifies the type of the operation executed within the context of the transaction.

As shown in the figure, the sequence of published events is generated by unpacking all the events compressed in the batched transaction and replaying them in the order in which they occurred. The sequence of published events also includes two “synthetic” events that delimit the events contained within a batch. This allows systems that are aware of batching capabilities to take advantage of this information and optimise their processing, for instance by generating summaries or behaving differently. Systems that are unaware of batching capabilities can simply discard such events.

How can we generate such sequence if what we get from the ledger is the event associated to the batch transaction? We have two options:

  • packing all the information of the events generated by the operations in the batch event; or
  • persisting the events in the ledger and pulling them once we see a batch event coming in the event loop.

It should be obvious by now that the best place to implement this logic is the code that processes the events triggered by the ledger (i.e. the event loop). This allows us to implement the logic only once and decouple the downstream systems from the details of the batching implementation, allowing them to ignore batch events altogether.

Packing the information of the individual events into the single batch event is the easiest way to solve the problem, but it may not be sustainable for batches with a large number of operations, as the size of the byte payload associated with the transaction event is limited. Storing events in the ledger and pulling them on demand causes additional round-trips and network communication, but does not suffer from the payload size limitation. It also comes with the additional burden of persisting the events in the ledger, but in the system under study this was already a necessity in order to have a reliable and robust stream of events that could be processed without loss of information. Hyperledger Fabric does not store events, and therefore when applications disconnect from the peers they lose all the events triggered in the meantime. By persisting the events in the ledger we are able to make point-in-time queries and restart the stream from the last event processed without loss of information. This capability also comes in handy in the case of batching, as we already have all the infrastructure necessary to make the second option work seamlessly.

The figure also shows the introduction into the stream of two “synthetic” events, which mark the boundaries of the batch. These events aren’t strictly necessary: applications consuming the stream of events can detect the change of transaction identifier to determine the collection of events belonging to a batch, but adding synthetic events makes the identification simpler for downstream processes, at the additional cost of writing minimal discard logic for those consumers unaware of batching.
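As a reference, here is a minimal sketch of the unpacking logic in the event loop. The event names (BATCH_EXECUTED, BATCH_START, BATCH_END), the `LedgerEvent`/`StreamEvent` types, and the assumption that the batch event payload packs the per-operation events as JSON are all illustrative; the persisted-events variant would query the ledger instead of unmarshalling the payload.

```go
package main

import "encoding/json"

// Hypothetical event shapes: what we receive from the ledger and what we
// publish downstream.
type LedgerEvent struct {
	TxID    string
	Name    string // e.g. "CONTRACT_IMPORTED" or "BATCH_EXECUTED"
	Payload []byte
}

type StreamEvent struct {
	TxID      string
	Name      string
	Payload   []byte
	Synthetic bool // true for the BATCH_START / BATCH_END markers
}

// loadBatchOperationEvents assumes the "packing" variant: the batch event
// payload carries the per-operation events as a JSON array. The persisted
// variant would pull them from the ledger instead.
func loadBatchOperationEvents(ev LedgerEvent) ([]LedgerEvent, error) {
	var ops []LedgerEvent
	if err := json.Unmarshal(ev.Payload, &ops); err != nil {
		return nil, err
	}
	return ops, nil
}

// unpackEvent turns a single ledger event into the sequence of events
// published downstream: non-batch events pass through unchanged, batch
// events are expanded into BATCH_START, the per-operation events replayed
// in order, and BATCH_END.
func unpackEvent(ev LedgerEvent) ([]StreamEvent, error) {
	if ev.Name != "BATCH_EXECUTED" {
		return []StreamEvent{{TxID: ev.TxID, Name: ev.Name, Payload: ev.Payload}}, nil
	}
	ops, err := loadBatchOperationEvents(ev)
	if err != nil {
		return nil, err
	}
	out := []StreamEvent{{TxID: ev.TxID, Name: "BATCH_START", Synthetic: true}}
	for _, op := range ops {
		out = append(out, StreamEvent{TxID: ev.TxID, Name: op.Name, Payload: op.Payload})
	}
	out = append(out, StreamEvent{TxID: ev.TxID, Name: "BATCH_END", Synthetic: true})
	return out, nil
}
```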

All the Cards on the Table

The analysis of the downstream impact of batching concludes the overview of the challenges and technical solutions to address them. With a good understanding of the end-to-end process, we are equipped to compare the different approaches and draw some conclusions.

Figure 5 — Comparative table of the solution options presented.

The figure above puts together the different characteristics of the solution options. The alternatives investigated are categorised into two main classes: those that introduce batching support at smart contract level (i.e. Option 1, Option 3 and Option 4) and those that don’t (i.e. Option 2). This is the main divide between the approaches.

What emerged from the analysis is that the different approaches do not vary much in overall complexity. Rather, they simplify one aspect at the cost of re-introducing complexity in other parts of the system, or possibly limiting the opportunities to optimise. For a realistic implementation I see both Option 2 and Option 3 as contenders. Option 2 has the additional advantage of not touching the smart contract layer and the downstream systems, but it does so by keeping them unaware of batching, thus preventing further optimisations.

Calling it a Day

What a ride! We finally reached the end. What originally started as a simple feature to implement — in the form of “more of the same” — turned out to be an interesting challenge when contextualised to the specific business case and the technical stack adopted.

We explored challenges from both a logical and a technical implementation perspective. On one side, the multi-party nature of the contracts managed by the platform made the nature of batches rather ephemeral. On the other side, the interaction with a blockchain technology driven by a transactional model revealed unforeseen problems that called for addressing additional complexity.

While the feature of batching and the problem discussed may seem tied very much to the case under study, they have characteristics that can be applied to more general use cases. In particular, those traits that are applicable to a wider set of use cases are:

  • batching for workflows that are composed of multiple steps performed by different parties (very common in scenarios involving contracts or similar instruments);
  • batching within the context of transactional systems that adopt an “all-or-nothing” semantics (e.g. Hyperledger Fabric).

The problem has been contextualised within an already existing platform that has been enriched with such functionality, and from a solution design perspective we have discussed different design alternatives and their trade-offs.

If there is something to take away, it is perhaps the approach to the problem and the process of developing solution alternatives. What has emerged — and I find this rather common — is that complexity is often something you cannot fully eliminate. Surely, applying good engineering and architecture principles helps build better systems, but for any system being built there seems to be a residual and intrinsic complexity that is going to remain. In the scenario used as a reference, we have seen that we can simplify one aspect of our solution design at the cost of making other parts of the system more sophisticated. How you decide to distribute such complexity in your solution design, and more importantly what type of analysis informs your decisions, is what really matters… and a very valuable skill to learn.

“There is an essential complexity in software systems that acts very much like energy, you cannot increase or reduce it, rather you can only mutate it into different forms.” — Myself


Christian Vecchiola

I am a passionate and hands-on thought leader who loves the challenge of translating complex requirements into innovative and game-changing solutions.