Lessons Learned from the Babylon Protocol Upgrade: A Retrospective

slarquie1 · October 30, 2019, 3:07pm

Babylon (aka protocol 005), the second Tezos protocol amendment jointly developed by Nomadic Labs and Cryptium Labs, was successfully activated on block 655361. Since then, we’ve continued analysing and monitoring the new features, but have also engaged in a deeper reflection on the upgrade process, from its development period, pre-injection, to the period following the activation.

This article summarises the lessons learned in five parts: the development of Babylon, the proposal period, continued testing during the on-chain test period, activation, and post-activation. For each of these parts we identify the missteps made over the past months and draw lessons for current and future core developers to improve on the process for future protocol upgrades.

The process of developing Babylon had many firsts. It was the first time that two independent core development teams worked together on a proposal, which was a non-trivial step towards decentralisation of core development. Furthermore, given its large set of features, it was a steep step from Athens. While Athens proved that live upgrades worked, Babylon showed that large parts of the codebase can be amended in order to deliver meaningful improvements.

TL;DR: Less rush, more testing, more documentation, more community involvement.

Developing Babylon

Issue: controlling the set of changes and timeline

An issue during the development of Babylon was that, despite agreeing beforehand on a set of features, we ended up adding a few more along the way. This was due partly to some backlog from Athens and partly to requests for features that we were afraid to leave unanswered for 3 more months. Furthermore, we were trying to respect a specific timeline for injection because we were afraid of a never-ending development cycle where changes kept piling up.

Working towards a moving target on a fixed timeline considerably complicated internal development and rushed the final stages of the release, impacting testing and communication with the community.

How to improve

In future upgrades we will compile a list of features early on and gather input from the wider ecosystem. We will limit development to those specific features, which will give us a clearer vision of a possible timeline. If needed, that timeline will be delayed until proper review, testing and communication are done.

Proposal Period for Babylon

The first proposal (Babylon PsBABY5n) was injected on the 26th of July 2019. Shortly after this, a new updated version (Babylon PsBABY5H) was proposed during the same proposal period.

Issue: Binary format of endorsements

The second injection was in response to feedback we received from wallet developers, which led us to revert a change in the binary format of endorsements. The change to the binary format of manager transactions was kept in order to ensure that user transactions from 004 could not be executed on 005.

The proposal phase was indeed conceived with the idea that there could be a series of proposals, counter proposals and iterations.

However, this extra injection implied more off-chain coordination, and it could have been avoided if we had gathered feedback from developers impacted by the binary format of operations before the first proposal.

How to Improve

We see three areas of improvement:

Ensure that the binary formats are well documented before the injection. All the binary formats of the protocol can be obtained with the new binary tezos-codec for any of the encodings used in the Tezos codebase. We are also working on a new tool that will parse and display the binary format of a specific operation. Any breaking change introduced by a proposal is highlighted in the Changelog that we publish together with every release, along with suggested guidelines for handling the migration. We will continue improving the quality and accessibility of this documentation and we will make sure that it is available to third party developers a few weeks before a proposal.
Allow developers to test properly before the injection. Using the simple bash sandbox is usually enough to interact with a Tezos node and client and rapidly test by hand. Additionally, a number of more complex end-to-end tests can be automated using one of the two frameworks that are present in the code base, Flextesa and the Python test framework. These frameworks are used by core developers, which means that they will be maintained in the future and that there are a number of existing tests that can serve as inspiration to third party developers. We believe that with the tools presented any future incompatibility in the format of operations can be detected early in the development process and independently, without the need for a testchain. However, these tools might not have been advertised enough to third party developers, and we will make an effort to better document them and encourage their adoption and feedback.
Publish the code and changelog before the injection. For Babylon we published a description of the features as well as links to their implementation as we built them over time, for example Quorum Caps and Making implicit accounts delegatable. However, for both Athens and Babylon the changelog and final code were only published at the same time as the proposal was injected. In the future we will make sure to leave a few weeks between publication and injection to allow ample time for last minute corrections.

Exploration & Testing Periods for Babylon

Issue: the issue with Big Maps and the Hotfix

During the test period of Babylon PsBABY5H we discovered a bug affecting big maps. More details are in the documentation. The bug was particularly hard to catch because it was a single line in one large patch and because it didn’t break any functionality but caused a performance degradation.

Despite the above, we believe that with a more rigid review process and more automated tests we should be able to avoid such situations in the future.

The natural course of action would have been to vote nay in the promotion vote and to propose a patched version of Babylon, PsBabyM1, incurring a delay of 3 months for the voting procedure to complete again.

However, given that multiple big maps (together with entry-points) were strongly requested by smart contract developers, we felt a sense of urgency from some parts of the community to deliver this feature as soon as possible.

For this reason we decided to add the possibility to download a new version of the Tezos node that, in case of a positive vote for Babylon PsBABY5H, would activate the patched version PsBabyM1, as explained in the related blog post.

Adding this possibility was not meant as a suggestion, but merely as an additional option for the community. The majority of nodes preferred this option and PsBabyM1 was activated successfully.

We have always considered user-activated updates a legitimate part of the governance process when used for urgent and critical hotfixes, like we did in the past and as explained in the “Amendments at work” blog post.

On the positive side, this hotfix unlocked the many new possibilities offered by big maps. We are preparing a blog post on the topic for smart contract authors.

How to Improve

There were several issues at play here.

To allow for faster iteration over multiple improved versions of a proposal, we’ve discussed several ways to reduce the length of a voting procedure in case of a negative vote.

The hotfix also led us to reflect on the user-activated update mechanism. Because of a technical limitation in the current implementation, user-activated updates can only be introduced by releasing new binaries or recompiling the code, a process which can leave non-developers out. We are implementing a node where user-activated updates can be set easily by users in a config file, instead of depending on a mainnet release.

As for the big map bug itself, this kind of bug can be found through more thorough testing and stricter review requirements during the development process. As the number of developers competent in protocol development grows, so will the number of reviews per merge request.

In addition, more developer activity on the testnets prior to injection would be helpful. One way to achieve that is to look into incentivizing participation in the testnet. Another way, which has already been enacted, consists in always having a testnet running the proposal currently being voted on.

Activation of Babylon

Issue: the mempool glitch at activation

Right after the activation of Babylon there was a slowdown of block production in the network due to some nodes being blocked. These nodes had correctly migrated to Babylon but were still receiving operations from Athens which they were no longer capable of deserialising, causing a failure of the mempool and of the node.

Fortunately, the situation could easily be resolved by restarting the nodes, which erased the old operations. However, those operations were still propagated by the network for an hour, requiring multiple restarts of each node. A patch to fix the mempool was ready minutes after we realised the problem, and many bakers readily updated their nodes.

The failure of the mempool was caused by a bug which was known and fixed on the master branch of the Tezos code base but was incorrectly ported to the mainnet branch.
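
The failure mode described above can be illustrated with a minimal Python sketch. It is not the actual patch: the one-byte version tag, `decode_operation` and `Mempool` are hypothetical names invented for illustration, and real Tezos encodings look different. The idea is only that an operation which cannot be deserialised under the active protocol is dropped and counted, rather than letting the error take down the mempool.

```python
from dataclasses import dataclass, field


class DecodeError(Exception):
    """Raised when raw bytes are not a valid operation for the active protocol."""


def decode_operation(raw: bytes, active_protocol: int) -> bytes:
    # Hypothetical wire format: the first byte tags the protocol version,
    # the rest is the payload. Real Tezos encodings are different.
    if not raw or raw[0] != active_protocol:
        raise DecodeError(f"cannot decode operation for protocol {active_protocol}")
    return raw[1:]


@dataclass
class Mempool:
    active_protocol: int
    pending: list = field(default_factory=list)
    rejected: int = 0

    def receive(self, raw: bytes) -> bool:
        """Accept an operation if it decodes; otherwise drop it and carry on."""
        try:
            self.pending.append(decode_operation(raw, self.active_protocol))
            return True
        except DecodeError:
            # A stale operation from the previous protocol must not crash the node.
            self.rejected += 1
            return False


# A node that has migrated to protocol 5 still receives protocol 4 operations.
mempool = Mempool(active_protocol=5)
incoming = [bytes([4]) + b"old-transfer", bytes([5]) + b"new-transfer", b""]
results = [mempool.receive(raw) for raw in incoming]
```

With this shape, stale operations that keep circulating after a migration only increment a counter, and the node keeps accepting operations in the new format.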

How to Improve

Once again, this kind of bug can be avoided using a simpler release process, which we are currently implementing and which will be in place before the next release. The goal is to reduce to a minimum the number of steps needed to release Mainnet starting from the master development branch. This will simplify testing and leave less room for errors.

In the future, we will run testnets before injection that simulate the migration from the old state machine to the new. In this instance it would have meant running a testnet with 004 and then upgrading the running network to 005, while at the same time generating load by randomly sending transactions. This kind of testnet would have caught the previous bug.

Post-Activation of Babylon

Issue: the formula for block rewards

After the activation of Babylon, users realised that the computation of rewards was slightly different from the formula published in the blog post describing Emmy+, the improved consensus algorithm.

The reason for this difference is a bug in the implementation of the formula, which results in a loss of precision when calculating the baking rewards.

This precision loss does not affect the security of Emmy+ and a fix will be offered in the upcoming protocol upgrade.
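
The retrospective does not give the formula or the faulty code, so the following is only a generic Python sketch of one common way integer arithmetic loses precision: dividing before multiplying. The function names and all constants are made up for illustration and are not the protocol's.

```python
def reward_divide_first(base: int, endorsements: int, slots: int) -> int:
    # Integer division happens first, so the fractional part is
    # discarded before it can be scaled up by the multiplication.
    return (base // slots) * endorsements


def reward_multiply_first(base: int, endorsements: int, slots: int) -> int:
    # Multiplying first keeps full precision until the single final division.
    return (base * endorsements) // slots


BASE = 1_250_000    # illustrative amount in mutez, not the protocol constant
SLOTS = 32          # illustrative number of slots
ENDORSEMENTS = 24   # illustrative number of endorsements included

lossy = reward_divide_first(BASE, ENDORSEMENTS, SLOTS)
exact = reward_multiply_first(BASE, ENDORSEMENTS, SLOTS)
shortfall = exact - lossy
```

Both functions encode the same mathematical expression, yet the first one pays out slightly less. A unit test comparing the implementation against the published formula on a handful of inputs would flag this kind of discrepancy.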

How to Improve

This kind of bug should be more easily spotted and avoided. We see two lessons to learn here.

Can we write fewer bugs in the future? We are going to improve our release process for protocols by having:

comprehensive unit tests
a clearer, stricter, policy for code reviews
a clearer freeze and review period before injection

Bugs still happen, how can we make sure we notice them? Despite having Babylonnet running since September 27th, we didn’t sufficiently encourage participation, and for this reason the traffic on the network has not been a realistic sample of mainnet. Furthermore, the lack of testnet support from block explorers has strongly limited the capacity of users and developers to inspect the data of the testnet. In the last month there was a reorganisation of the testnets infrastructure to improve the service and engage more members of the community. Moreover, several block explorers, such as TzStats, are working to support Babylonnet and will support future test chains.

Issue: restricting originated (KT1) contracts from paying transaction fees

With the implementation of the delegation process simplification, originated accounts (KT1) can no longer pay for transaction fees. The goal of this change is to ensure that all transaction fees are always paid by tz1 addresses, and remove the computational overhead produced by fees paid through KT1 accounts, as smart contracts need to be fully executed in order to verify their validity. This results in the potential for dramatic mempool optimisations and increased throughput.

However, the now legacy multi-step delegation process led to a common scenario, where all the funds of the tz1 account were transferred to the KT1 account to maximise the amount delegated. Before Babylon, this was not an issue as the KT1 account was able to pay for transaction fees.

With Babylon and KT1 accounts no longer able to pay for transaction fees, implicit accounts that had empty balances before the protocol upgrade were funded with 1µꜩ (0.000001ꜩ) to prevent the requirement for a new allocation burn.

This led to two issues. First, the creation of these accounts and their funding with 1µꜩ was not documented, which led to trouble for block explorers. Second, the 1µꜩ balance was not sufficient to pay for a transaction.

To assist affected accounts, Cryptium Labs funded all the implicit accounts in this situation with 0.01ꜩ, which is high enough for the account to pay for at least one transfer transaction (funding, for instance, the tz1 address with enough fees to pay for several transactions).
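
The arithmetic behind the two balances can be sketched in a few lines of Python. The fee of 1,500 mutez per transfer is an assumption chosen for illustration only; the real fee depends on the operation's size, the gas it uses and the settings of the baker including it.

```python
MUTEZ_PER_TEZ = 1_000_000

# Hypothetical fee for one simple transfer, in mutez (an assumption,
# not a protocol constant).
ASSUMED_TRANSFER_FEE = 1_500


def transfers_affordable(balance_mutez: int, fee_mutez: int = ASSUMED_TRANSFER_FEE) -> int:
    """Number of transfers whose fee the balance can cover."""
    return balance_mutez // fee_mutez


migration_balance = 1                      # 1µꜩ credited at migration
topped_up_balance = MUTEZ_PER_TEZ // 100   # 0.01ꜩ sent by Cryptium Labs

stuck = transfers_affordable(migration_balance)
unstuck = transfers_affordable(topped_up_balance)
```

Under that assumed fee, the migration balance covers no transfer at all, while the 0.01ꜩ top-up on its own covers a handful.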

How to Improve

Regarding the effect of token creation during protocol migration and the way it is interpreted by block explorers, we feel the cleanest, most consistent approach is to introduce a receipt attached to migration blocks showing all of the balance updates caused by the migration. This, however, requires a change in the Tezos environment, and not merely a protocol change. Protocol environments are versioned and designed to change over time (for instance to accommodate new cryptographic libraries).

Regarding the effect of the transaction fee, funding the accounts directly outside of the protocol upgrade was a simple low-tech solution which worked. However, if anything of the sort were to be repeated in the future, it should involve clearer communication with the affected users in order to ensure a smooth upgrade for everyone.

Things that Worked

This is the first time that a running decentralised, permissionless and censorship resistant blockchain protocol evolved in a meaningful way. Babylon paves the way for a chain that can evolve over time and adapt the best technologies from the entire ecosystem.

Furthermore, it is the first time, for Tezos, that two independent core development teams worked on the same protocol. Hiccups were abundant, but it worked and it showed that core development can be decentralised while also moving fast and evolving the protocol.

Lastly, let’s not forget the many features that worked:

Closing Remarks

In summary, we have identified the following as key areas to improve the proposal process:

Outline the desired features ahead of time, strictly focus development on them and wait for all reviews, testing and documentation to be done.
Make feature documentation and changelogs more accessible and visible ahead of time, to give the community more time to engage.
Release independent features to testnets, so ecosystem developers have early access and can provide feedback in advance.
Reorganise and maintain the testnets.
Require more reviews per merge request and more unit and integration testing.

Lastly, it is important to remember the scope of these changes for Tezos. Babylon is, feature-wise, a steep step from Athens, and it touched almost all the main areas of the protocol: Michelson, voting procedure, accounts and consensus. With the Babylon upgrade the Tezos community proved to the world that we are the first blockchain that can significantly amend a running protocol. Although there were more than a few drawbacks, they should not deter us from improving Tezos over time, as now is the time to create the foundation for a long lasting and relevant blockchain protocol.

murbard · October 30, 2019, 3:17pm

In response to @fredcy on Riot asking

I wrote:

So the way user activated upgrades work in Tezos is fairly neat. You specify in the shell that, as of a given block height, the protocol should have a particular hash.

This piggybacks on the over-the-air, hot-swap, whatever you call it, on-chain upgrade mechanism.

Essentially, the person controlling a node tells their node: regardless of what the chain says, activate this protocol at this height.

Currently this is done in a .ml file and requires a recompile of the node (or a rebuild of docker) to happen.

The paragraph you quoted suggests making this into a config file instead. So that these user activated upgrades, if they happen at all, aren’t tied to mainnet releases.

They would happen instead by users manually editing the config file.
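
A minimal Python sketch of that rule, under stated assumptions: the table of overrides stands in for the proposed config file, the level 1000 is illustrative, and `run_node` and its arguments are hypothetical names. The protocol names are only the short labels used in this thread, not full hashes, and nothing here reflects the node's real code.

```python
from typing import Dict, Iterable


def run_node(levels: Iterable[int], start: str,
             voted: Dict[int, str], overrides: Dict[int, str]) -> Dict[int, str]:
    """Return the protocol this node runs at each level.

    `voted` maps a level to the protocol the on-chain vote activates there.
    `overrides` maps a level to the protocol the node operator forces there.
    """
    current = start
    history = {}
    for level in levels:
        chain_says = voted.get(level, current)
        # Regardless of what the chain says, the operator's override wins.
        current = overrides.get(level, chain_says)
        history[level] = current
    return history


voted = {1000: "PsBABY5H"}       # outcome of the on-chain vote
overrides = {1000: "PsBabyM1"}   # what the operator put in the override table

with_override = run_node(range(998, 1002), "Athens", voted, overrides)
without_override = run_node(range(998, 1002), "Athens", voted, {})
```

A node with the override switches to the patched protocol at the chosen level and stays on it, while a node without the override follows the vote as usual.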

AlexL · October 31, 2019, 9:07pm

Before I go in depth about my concerns, let me make a few qualifiers so you get some context of where I am coming from:

I am by no means well versed or even fluent in computer coding or analyzing code; it’s way outside my expertise
I ultimately voted Yay on passing this amendment
I have no ill will towards Nomadic Labs or Cryptium Labs, as I very much enjoyed meeting members of each team during TQuorum and think they do good work
My comments are treating Nomadic and Cryptium as one as touted in their reflections post
I might be wrong in my opinions expressed here in terms of the coding process, and I’m always happy for feedback and debate, and with that said, my views are dynamic and not static, always willing to listen to different views and consider everything brought to the table.
Lastly, I am by no means saying that Babylon was/is a failure (as I agree with the article that Babylon had many successful parts to it), and I hope these comments spur an ongoing conversation on how to improve in the future while raising the Tezos development standard.

I was originally going to post why I voted yay, but I got caught up with work and it fell off the radar. To keep this part of the post short: when my delegates and I discussed how they wanted me to vote, we ultimately sided with the majority view that no code is perfect, as all operating systems (Windows, Mac, Linux) push updates periodically when bugs are found. That was the overall reasoning behind our decision to vote yay.

Part of being a DAO-like organization is that everyone in the ecosystem has autonomy in their views and how they vote, and this is a beautiful thing. A limitation is that a good majority of the communities that make up the DAO rely on and trust the core devs in any code development, because the core devs have a background and expertise in code development. This is also meant to be a two-way street, where core devs rely on the community stakeholders and community stakeholders rely on core devs for testing and feedback, but in this amendment phase it amounted, in reality, to a one-way street.

This trust element thrust upon the core dev teams holds the core devs to a higher standard: we will evaluate their actions at a higher standard than those of non-core devs or the public, because they are the groups that accepted this higher standard when they agreed to be core dev members. (Stated differently, I am treating core devs at a minimum as the eth and btc communities would treat their core dev teams, and making comparisons through those lenses, rather than as if this proposal had been coded by a member of the community or a non-core dev team. A very rudimentary example is the same concept by which we do not treat the actions of children like we do the actions of adults.)

Further, there is a little more to this trust element we are thrusting upon core devs, and that is an extra element of expectations: the community trusts and expects that any code proposal will be complete, in the sense that it has undergone sufficient testing and peer review (by the community and others).

This trust element is somewhat damaged in that there were more than two “known” bugs. Now, when I say “known”, I am going to apply the American legal principle of constructive knowledge (a person is presumed by law to have known, regardless of whether he or she actually does, since such knowledge is obtainable by the exercise of reasonable care). This “should have known” element really refers to the last issue, transferring from KT1 to tz1, which is discussed further down.

Core devs can make the argument that, in a DAO, it cuts both ways, in that it’s also the community’s burden to help check the code. They would be correct, but this argument doesn’t hold water in this instance based on the reflections post, specifically,

“However for both Athens and Babylon the changelog and final code was only published at the same time as the proposal was injected. In the future we will make sure to leave a few weeks of time between publication and injection to leave ample time for last minute corrections.”

In other words, there was no way the community could have attempted to help look for bugs when the code wasn’t even posted for community review. This itself is a problem, because in this past amendment the community did not have adequate time to review all the changes and give feedback. Once again, this goes back to one of the underlying principles of DAOs, in which core devs and the community work together rather than core devs working in a black box and then springing a bunch of surprises on the community.

If there are two teams working simultaneously, why was there 0 communication to the community as the code was being developed? This also raises the question of how much communication occurred between NL and CL, which is another issue altogether. (There’s no way I would have been able to help in any of this as I am not skilled in coding, but I do know there are community members who are and could have helped.) Once again, the lack of communication raises many concerns, and in the cryptocurrency sphere transparency is king, which in this amendment phase was sorely lacking.

Now, I will applaud that the post-Babylon reflection post shows that the core dev team recognizes this problem and has vowed to do better, but the issue of damaged trust still remains. The damaged trust can be viewed this way: my vote during the testing phase was ultimately based on incomplete information. I relied on the core devs to explain the potential issues, and while it seemed like there were only 2 small issues, in reality there were many more, including an out-of-the-blue change to the binary format of endorsements. Now I feel as if I were somewhat misled, and I have to answer to my delegates and try to explain why the post-Babylon events happened the way they did.

Ok, now that I’ve aired my thoughts, I am going to only compare Sept 30 post to Oct 30 post in terms of the core devs explanations of issues to the community and show why I felt somewhat misled.

Sept 30’s post conclusion: “About a week ago, Nomadic Labs discovered two issues: one that limits the efficiency in the multimap Michelson feature included in Babylon2.0; and another that prevents the code_trace RPC call from producing logs. The fixes to the latter require a change in two of the protocol source files.”

Now, compare it to the Oct. 30 post:

Issue: Binary format of endorsements (not reflected in Sept 30 post)

Why was there 0 mention of this in any of the posts by CL and NL? I understand that the core devs got a lot of feedback from wallet developers and that was the reasoning behind it, but the issue is why this was not even mentioned to anyone in the community. If you were getting feedback from wallet developers, why not mention that this could be a possibility instead of an ambush-like injection? This might not be a big deal in terms of the overall amendment, but it goes back to the principle of trust. To beat this dead horse some more, this is like when I go through the redlining process in a contract negotiation: we agree to the terms of the contract and I’m waiting for you to send me a finalized draft of the agreement to sign, and when you do send me a “final” agreement, I review it once more before I sign it and find a new clause added to the contract. Think about how you would feel in that situation. To me, not good.

Issue: the issue with Big Maps and the Hotfix (reflected in Sept 30 post)

Post-activation issues:

Issue: the mempool glitch at activation (not a big issue because it’s a bug)

Cause: The failure of the mempool was caused by a bug which was known and fixed on the master branch of the Tezos code base but was incorrectly ported to the mainnet branch.

Issue: the formula for block rewards (not a big issue because it’s a bug)

Cause: The reason for this difference is a bug in the implementation of the formula, which results in a loss of precision when calculating the baking rewards.

Issue: restricting originated (KT1) contracts from paying transaction fees

Cause: Not stated. (You might say the cause was explained in that, “the now legacy multi-step delegation process led to a common scenario, where all the funds of the tz1 account were transferred to the KT1 account to maximise the amount delegated. Before Babylon, this was not an issue as the KT1 account was able to pay for transaction fees.” However, that’s not an explanation of what caused/led to this issue, but rather an explanation of how the event played out.) This was not an issue of a bug, and the situation that was explained is/should be common knowledge to core devs; if it’s not, then that’s another issue altogether. This last issue is the one that I have the most trouble with. I cannot come up with any reasonable explanation of what caused it without saying that the duty of care was sorely missing here. This last issue caused a ton of problems that we are still addressing today. Although I applaud CL’s airdropping some xtz to implicit accounts that were affected, not everyone received those airdrops: on 10-29-19, I personally sent some xtz to someone on Telegram who had this issue so he could move funds back to his tz1. I won’t even go into the problems that exchanges and 3rd party wallets are facing right now because this encompasses them.

To me, these issues do not exactly inspire confidence to those contemplating on whether to use our blockchain or not, especially in the STO context.

Conclusion: Yes, my trust in core devs is damaged, but that doesn’t mean it cannot be rehabilitated. As a matter of fact, I have faith in NL and CL to make sure we don’t encounter these problems in the future, as they have demonstrated much retrospection on how to prevent these issues from occurring again, along with their current commitment to help remediate these problems. I also do not believe the issues we have encountered so far post-Babylon damage the overall bottom line of xtz. As a community, we certainly have had our fair share of trials and tribulations, but what doesn’t kill us only makes us stronger as we continue to persevere through hardships and improve xtz as a whole. However, what is clear post-Babylon, as explained ad nauseam, is that 1. the amendment was not ready for implementation, and 2. the communication from core devs to the community was poor.

I’d like to end with a quote from Batman Begins that is actually said twice in the movie (once by his dad and once by Alfred): “Bruce, why did you fall?” Answer: “So that we can learn to pick ourselves up again”.

Zed · November 1, 2019, 5:52pm

Alex, thank you for your great response!