Smart contract vulnerabilities due to Tezos’ message passing architecture

montyly · July 1, 2020, 4:45pm

Hi

I’m a security researcher at Trail of Bits. We recently audited the Dexter contracts and found two critical security issues that are directly related to Tezos’ message-passing architecture. We believe that both issues are likely present in several other contracts. Due to the complexity and likely prevalence of these issues, we think it would be beneficial for the Tezos community to discuss them and consider changing the message-passing architecture.

Message passing

In Tezos, a call to an external contract in a function is not made during the execution of the function, but is queued in a list of calls to be executed. The call order follows breadth-first search (BFS).

Consider Figure 1:

function a():
     call b()
     call d()
     return

function b():
     call c()
     return

function c():
     return

function d():
     return

In a traditional smart contract, the functions are executed in the following order:

start a()
- start b()
  - start c()
  - end c()
- end b()
- start d()
- end d()
end a()

In a Tezos contract, the order is:

Execute a() # Next calls: [b, d]
Execute b() # Next calls: [d, c]
Execute d() # Next calls: [c]
Execute c() # Next calls: []

In Tezos, the code of d() is executed before the code of c().

This unusual order of execution leads to two types of vulnerabilities:

Callback authorization bypass
Call injection

Callback authorization bypass

Description

The message-passing architecture of Tezos prevents contracts from reading the return value of an external call, so a callback must be used. Since there’s no built-in mitigation, using a callback will likely lead to access control issues. We call this attack a callback authorization bypass.

To illustrate this, consider a contract that needs to read its balance in a token. Figure 2 shows how a contract in a direct call system would work:

### contract Token
function balanceOf(address addr):
    return balances[addr]

### contract MyContract
function f():
     my_balance = token.balanceOf(self)

Figure 3 is a naive implementation of its equivalent in a message-passing system:

### contract Token
function view balanceOf(address addr, callback addr):
    call addr(balances[addr])

### contract MyContract
function f(uint value):
     assert sender == token
     my_balance = value

In Figure 3, the sender must be the token to prevent anyone from calling this function. The functions that allow a callback are called view.

If the token has any other view functions, they can be used to callback to MyContract.f and compromise it.

This can be resolved by implementing a complex state-based transition that ensures MyContract.f is only called after the execution of getBalance. However, such access controls are difficult to implement and significantly increase the complexity of the codebase.

Typically, the FA1.2 token standard specifies three view functions:

(view (address :owner, address :spender) nat) %getAllowance
(view (address :owner) nat) %getBalance
(view unit nat) %getTotalSupply

Any contract that wants to interact with one of these functions must implement complex access control to prevent security issues.

Although there is a hidden view standard warning, it’s not likely to be taken into account by the majority of developers:

Note that using view may potentially be insecure: users can invoke operations on an arbitrary callback contract on behalf of the contract that contains a view entrypoint. If you rely on SENDER value for authorization, please be sure you understand the security implications or avoid using views.

Exploit Scenario

Bob creates a contract that allows a callback from tokens to receive the balance from getBalance. Eve uses getAllowance to write an arbitrary value and compromise Bob’s contract.

Call injection

Description

An attacker can compromise a contract by injecting calls between a function and the external calls it generates.

As we saw, when a function is executed, all its generated calls will be queued in the list of calls to be executed. An attacker can place any call in the queue and then execute any code between the end of the function’s execution and its generated calls.

In particular, the contract’s balance or the memory of the called contracts might be in an invalid state when the attacker’s call is executed. We call this attack a call injection.

Exploit Scenario

Consider the Receiver contract in Figure 4:

type parameter is
    Receive of unit
  | Set of unit

function entry_Receive (const current_balance : tez) : (list(operation) * tez) is
        block {
            const alice : address = ("some_address": address);
            const to_contract: contract(unit) = get_contract(alice);
            const op : operation = transaction(unit, amount, to_contract);
        } with ((list [op]), current_balance)

function entry_Set (const current_balance : tez) : (list(operation) * tez) is
        block {
            if(amount > 0mutez) then {
                failwith("Do not send tezos")
            } else{
                current_balance := balance
            }
        } with ((nil : list(operation)), current_balance)

function main (const action : parameter; const current_balance : tez) : (list(operation) * tez) is
        case action of
            Receive(x) -> entry_Receive(current_balance)
            | Set(x) -> entry_Set(current_balance)
        end

and its high-level representation in Figure 5:

storage:
     current_balance   

function receive():
     transaction("some_addess", amount)

function set():
     require(amount == 0)
     current_balance = balance

Receiver has two functions:

Receive, which forwards the funds sent to a fixed address
Set, which stores the contract’s balance to current_balance

If we do not consider call injection, an invariant of Receiver would be that current_balance is always zero as:

All the tezos sent to receive are sent to a fixed address
Set cannot receive tezos

But this invariant can be broken if an attacker injects a call to set after the end of receive’s execution and before its external call is executed (transaction("some_addess", amount)). Figure 6 shows an example of such an attack:

function main (const param : unit; const current_balance : tez) : (list(operation) * tez) is
    block {
      const dst : address = ("receiver_address": address);
      const receive: contract(unit) = get_entrypoint("%receive", dst);
      const op1: operation = transaction(unit, amount, receive);

      const set_balance: contract(unit) = get_entrypoint("%set", dst);
      const op2: operation = transaction(unit, 0mutez, set_balance);

      op_list := list op1; op2; end;
    }
    with (op_list, current_balance)

Figure 7 is a high-level representation of the exploit:

function main():
      call receive()
      call set()

In the exploit, set() is executed before the tezos are transferred from the Receiver contract to the static address. As a result, current_balance will be a non-zero value.

Recommendations

The callback authorization bypass can be prevented if the developer implements specific access controls. Yet this comes with a significant cost for the developers. In other platforms, there are many contracts interactions, which makes us think that porting several existing smart contracts to Tezos is difficult and highly error-prone.

We are not aware of a general mitigation for the call injection vulnerability. Contracts that have external calls cannot rely on their balance. Moreover, developers must be extra careful with the value of the variables in called contrats.

One of the reasons Tezos has this unusual call order is to prevent reentrancy. However, the current solution creates vulnerabilities that are likely to be more difficult to prevent than reentrancy. Additionally, the call order is unusual and difficult to reason with, which is likely to confuse developers. Two critical issues found in Dexter were the results of this design.

Reentrancies can be prevented while keeping a traditional call order. For example, the VM can have a contract-level reentrancy lock that can be enabled by default. Function-level locks can also be used for more granularity, such as those in Vyper.

Overall, we recommend the Tezos developers 1) consider using direct calls instead of message passing and 2) implement reentrancy mitigation in the VM.

Related discussions:

haochenx · July 2, 2020, 5:17am

To be clear, I think the alerts and caveats in the post are in no doubt both valid and very valuable, but I am against the recommendations in the post.

haochenx · July 2, 2020, 5:17am

It looks like that the general developer community is against continuation-passing style? Well, not a surprise but on a system like blockchain IMO call stacks among contracts are more a problem maker than saver.

IMO the common mistakes associated with the “BFS-style”, or no-stack approach as I’d like to call it, should be addressed by developer education for those who are not familiar with continuation-passing style; and/or better toolchain support for those who prefer direct function-call style interaction between contracts.

Porting of contracts from a stack-style approach (the “DFS-style”) blockchain to a no-stack style blockchain is inherently technical and elaborated in nature and should be done by “well educated professionals” who well know the potential peculiars of both styles. IMO such porting should not be considered as easy and be attempted by devs who are not willing to study those peculiars before better toolings are available.

Call injection is another thing though. Our team is working on a framework to formally describe the behavior of a set of known smart contracts (we call them “relevant” contracts) beyond single invocations. One condition, the RA-Separability condition (RA=Relevant-Affected) as we call it, that we are imposing on the internal operations that could be generated by the relevant contracts is related to this topic. It requires that, when a “root invocation” of a relevant contract produces a series of internal operations, ordered in “BFS” style of course, the calls to relevant contracts must precede the calls to non-relevant contracts (contracts that are not in the control of the same contract author). This resembles the common migration for re-entrancy attacks by doing all “internal calls” first and then “external calls” and will prevent the call injection problem and we believe it is possible to transform any smart contract that doesn’t exhibit RA-separability to one that does. So again this is a programming paradigm and tooling problem IMO.

Function-style calling to untrusted code (in this setting, contract that is authored by others whose address passed in as arguments/storage contents) is always a source of security disaster and IMO allowing that while implementing mechanism to prevent abuses by the untrusted code is far more complicated and prone to error at the architecture level. A no-stack approach provides a clean architecture and clear separation of security responsibility: sanitize and protect against the inputs on each contract and you will be on the safe side.

Not implementing access control on continuation-style callback in the no-stack approach falls automatically into the category of mistakes of not guarding contract inputs properly, where it become an obvious source of mistakes and is easy to prevent by setting up a simple and natural policy to audit input sanity checks & access-control. This simplicity is in contrary to the would-be more complicated and seemingly arbitrary checks one would need in order to migrate every kind of creative exploits that could come with the stack-style approach.

This (no-stack approach vs stack-style approach, or “BFS-style” vs “DFS-style”) is clearly a design choice rather than a “should do” or “should not do” and I think Tezos is taking the right choice for security when used by educated developers.

haochenx · July 3, 2020, 12:29am

Also what might be related to this topic is the upcoming “ticket” mechanism, which could be used to help implementing the now-difficult integrity check when a CPS style re-entrance is necessary

mcapretto · July 3, 2020, 3:50pm

I agree that the call order used in Tezos is unusual and difficult to reason with, however there is a way in which the balance can be properly updated in BFS and maybe it can lead to a more general mitigation for the call injection vulnerability. The idea is the following:

In the storage we add a variable pending_to_transfer that keeps track of the amount of Tezos that is going to be transferred from this contract but has not been transferred yet.
We add an entry_point update that takes as argument transferred_tezos and decreases pending_to_transfer by it. This entry point can only be called by SELF and does not accept tezos.
Whenever we make a transfer op for t tezos we increase pending_to_transfer by t and in the list of emitted operations we add next to op a call to update with parameter t.
Then, since BFS is used, this both operations are executed back to back. Hence, the contract can know immediately when op was executed.

Therefore, the current_balance is actually balance - pending_to_transfer.

After adding these modifications, the Receiver contract in Figure 4 will be:

type parameter is
    Receive of unit
  | Set of unit
  | Update of tez

type storage is
  record
    current_balance     : tez;
    pending_to_transfer : tez;
  end  

function entry_Receive (const store : storage) : (list(operation) * storage) is
        block {
            const alice : address = ("some_address": address);
            const to_contract: contract(unit) = get_contract(alice);
            const op1 : operation = transaction(unit, amount, to_contract);
            store.pending_to_transfer := store.pending_to_transfer + amount;
            const self_update: contract(tez) = get_entrypoint("%update", self_address); 
            const op2 : operation = transaction(amount, 0tz, self_update);
            op_list := list op1; op2; end;
        } with (op_list, store)

function entry_Set (const store : storage) : (list(operation) * storage) is
        block {
            if(amount > 0mutez) then {
                failwith("Do not send tezos")
            } else{
                store.current_balance := balance - store.pending_to_transfer;
            }
        } with ((nil : list(operation)), store)

function entry_Update(const transfered_tezos : tez; const store : storage) : (list(operation) * storage) is
        block {
          if(amount > 0mutez) then {
                failwith("Do not send tezos")
            } else if (sender =/= self_address) then {
              failwith("Do not call update from outside this smart contract")
            } else {
              store.pending_to_transfer := store.pending_to_transfer - transfered_tezos
            }
        }
        with ((nil : list(operation)), store)

function main (const action : parameter; const store : storage) : (list(operation) * storage) is
        case action of
            Receive(x) -> entry_Receive(store)
            | Set(x) -> entry_Set(store)
            | Update(t) -> entry_Update(t,store)
        end

Notice that in some smart contracts it is possible for the current_balance to be negative (here, it has tez type to simplify the code) without that implying that the whole transaction will fail.

montyly · July 6, 2020, 1:02pm

Thanks @haochenx for the detailed reply.

I am not sure to see why the direct call architecture is always a source of security disaster. Its main issue is reentrancy, which can be easily prevented at the protocol or language level. The complexity to reason with message passing contracts, and the vulnerabilities showed above outweigh the risk of reentrancy vulnerabilities imho.

In terms of security, I would always favor a language that can kill entire classes of bugs, rather than relying on the developer’s experience and awareness of complex components.

For example, we know about buffer overflow vulnerabilities for decades, but even today, experimented developers can have buffer overflow in their C code. On the other hand, “modern” languages managed to mitigate this class of bugs entirely, which is a great improvement.

Put in another word, I think it is the responsibility of the language (or underlying platform) to prevent or mitigate vulnerabilities when possible rather than to rely on the developers to write bug-free programs.

Do you have documentation about the RA-separability condition? Does it mean that a transaction leading to violate the condition will revert, or does it mean that the calls are re-ordered to follow the condition? I am also curious to know more about the classification of a contract to “relevant”.

haochenx · October 7, 2020, 2:11pm

Sorry for the very late answer as I just read your reply. A direct call architecture allows untrusted code to modify the global state before resuming executing. If the security of the calling contract depends on certain property of the global state and such checks are done before the external call, it is a security vulnerability; and programmers will surely write contracts that does all the checks at the beginning and forget to perform necessary rechecks after an external call. Thus not only reentrancies, but allowing direct call architecture would likely enable a whole category of vulnerabilities.

Think about two smart contracts A and B implementing a certain feature together, when A checks properties of B’s state before calling an untrusted contract E that calls B so B’s state is changed. When the call of E finishes and A continues, the assumptions of A about B’s properties might be invalid at this point. As far as I know this is not ruled out by simply preventing reentrancies.

I don’t have any public document yet (hopefully eventually), but RA-separability states that invocations to external/untrusted contracts always comes at the end of the final effective call graph as the result of an initializing invocation to the author’s controlled contracts.

haochenx · October 7, 2020, 2:13pm

this article seems to be relevant about the type of vulnerabilities I was talking about, and RA-separability essentially implements the author’s suggestion: “Call external functions last, after any changes to state variables in your contract”.

Ryan · March 9, 2021, 4:37am

Sorry for resurrecting a dead thread, but for the folks on this post (and anyone else who sees this), it looks like depth-first execution has been proposed in the “F” protocol - Tezos calling convention migrating from Breadth-First to Depth-First Order (BFS to DFS)