An in-depth look at how we measure test coverage of the OCaml implementation of Tezos.
As we have discussed earlier on this blog, notably in the post on regression testing, testing is an important complement to formal verification. A critical task when testing is to ensure a high degree of test coverage. In this post we discuss how we measure test coverage of the OCaml implementation of Tezos using bisect_ppx, and our work towards ensuring that the Michelson interpreter of the node is fully covered by tests.
Test coverage
Testing is simpler to set up than formal verification but comes with a major caveat: it is incomplete. Each test run verifies only one execution path — as opposed to formal methods that can give guarantees on all possible execution paths of the system under test.
The incompleteness of testing provokes a natural question in the mind of the wary test engineer: to what extent do my tests cover the set of behaviors of the tested system? Or, what parts of the code are tested and what parts are not? Many measures have been proposed to answer these and related questions, collected under the term of “test coverage”.
In this blog post, we focus on one such measure, namely program point coverage. This consists of measuring the ratio of program points (informally, a program point can be thought of as a location in the source code) that have been visited during the test run. Other measures include function coverage (which functions have been visited?), edge coverage (which edges in the program’s control-flow graph have been visited?) and condition coverage (which boolean sub-expressions in guard-expressions have been evaluated to both true and false?).
However, full coverage does not imply that all behaviors of the system have been explored, for any of the measures above. Notably, the behavior of a program statement may change depending on the program state when the statement is executed, but none of the above measures quantifies to what extent such possible program states are explored. Instead, they are defined in terms of control-flow graph properties of the tested executions. The takeaway is that full code coverage does not guarantee full system correctness.
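As a toy illustration (a Python sketch, not taken from the Tezos code base), the following function reaches full program point coverage with a single test, yet still fails for inputs that the test never exercises:

def average(values):
    # both program points below are visited by the test case,
    # so program point coverage reports 100%
    total = sum(values)
    return total / len(values)

def test_average():
    # this single test visits every program point of `average`...
    assert average([2, 4, 6]) == 4
    # ...yet average([]) would still raise ZeroDivisionError:
    # full coverage does not mean that all program states were explored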
Nonetheless, there is still value in pursuing a high degree of test coverage. First, it is an important proxy measure for the amount of testing that has been performed. Second, it detects code that has not been tested at all. Such code may contain bugs that are triggered whenever it is executed, regardless of the program state. Systematically testing code is a cheap insurance against such bugs.
In this section, we will detail our usage of the bisect_ppx tool to measure the program point coverage of the Michelson interpreter achieved by our test suite.
The bisect_ppx code coverage tool
bisect_ppx is a code coverage tool for OCaml and Reason programs. A program that has been compiled with bisect_ppx writes a log file with coverage data just before terminating. A set of such log files can be converted into an HTML report detailing and summarizing the set of program points that have been visited during the execution.
For each instrumented .ml file that has been visited we obtain a coverage measure (e.g. “73% of the program points were visited”), and each file can be inspected in detail to understand which program points have been visited and which have not.
An example of using bisect_ppx, courtesy of the bisect_ppx project.
bisect_ppx instruments the tested program by inserting code that associates each program point with a counter that is incremented when the program point is visited. When compiled, the instrumented program is linked with bisect_ppx's run-time library. This library registers an __atexit hook that is called at the end of the instrumented program's execution; the hook writes the program-point counters to a log file. At this point, the developer can run their test suite to gather coverage data. Finally, the bisect-ppx-report program included in bisect_ppx reads the log files and generates an HTML report.
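To make the mechanism concrete, here is a minimal Python sketch of the same idea; it is a conceptual analogue with made-up names, not bisect_ppx itself, and the real OCaml instrumentation is shown in the next section.

import atexit
import json

# one counter per program point, incremented on every visit
counters = {0: 0}

def visit(point):
    counters[point] += 1

# the "run-time library" part: an exit hook that dumps the counters
# to a log file when the instrumented program terminates
def dump_counters():
    with open('coverage.log', 'w') as log:
        json.dump(counters, log)

atexit.register(dump_counters)

# instrumented analogue of `let add n1 n2 = n1 + n2`: program point 0
# is recorded each time add is called
def add(n1, n2):
    visit(0)
    return n1 + n2

add(1, 2)  # after the program exits, coverage.log records one visit to point 0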
An example of bisect_ppx instrumentation
Consider the following source file for which we want to measure coverage:
let add n1 n2 = n1 + n2
After running bisect_ppx, we obtain the following instrumented source code:
module Bisect_visit___src___lib_addition___addition_lib___ml =
  struct
    let ___bisect_visit___ =
      let point_definitions =
        "\132\149\166\190\000\000\000\004\000\000\000\002\000\000\000\005\000\000\000\005\144\160P@" in
      let `Staged cb =
        Bisect.Runtime.register_file "src/lib_addition/addition_lib.ml"
          ~point_count:1 ~point_definitions in
      cb
  end
open Bisect_visit___src___lib_addition___addition_lib___ml
let add n1 n2 = ___bisect_visit___ 0; n1 + n2
We see that bisect_ppx has added a module Bisect_visit___src___lib_addition___addition_lib___ml containing a serialized definition of the program points in the source file, and a function ___bisect_visit___ that increments a program point counter. The function add has been associated with program point 0. Hence, the ___bisect_visit___ function is called in the definition of add with the argument 0. Consequently, each call to add increments the counter for program point 0.
Test coverage for the Michelson interpreter
We have used bisect_ppx to measure the test coverage of the Michelson interpreter in the Tezos node, and to detect Michelson instructions that were lacking test cases. We had two goals: first, to make sure that each branch of the step function in the interpreter (corresponding roughly to each Michelson instruction) is covered by at least one test; second, to make sure that each instruction has at least one dedicated test. Our work on this topic can be followed in merge request !1261, which also contains information on how to set up bisect_ppx for Tezos.
To measure progress on the first objective, we run bisect_ppx on the full test suite of the Tezos node. For the second objective, we run it using a test suite that targets individual Michelson instructions.
Baseline results
We take as our baseline commit #0478afc7 of the Tezos node. This version precedes our work towards full coverage, and for this reason has low coverage.
The results using bisect_ppx for the Michelson instruction test suite are available here. We see that the test suite for instructions covers 60.79% of the interpreter. Several instructions are missing tests: instructions for big maps such as EMPTY_BIG_MAP, MEM and GET; the instructions SOURCE and SENDER related to the transaction in which a contract is executed; and arithmetic operations such as INT, which converts a natural number (nat) to an integer (int).
The coverage obtained when using the full Tezos node test suite is better, at 70.21%. However, it is still worrying to see that some instructions, such as SIZE for lists and MAP for maps, have no tests at all!
Adding Tests
There is a simple solution to the lacking coverage: writing more tests. Fortunately, this is often quite straightforward using the Python Execution and Testing Framework for Tezos. This framework provides a programmable API for the tezos-client and tezos-node binaries. Writing tests is especially simple for instructions such as INT that depend only on the state of the stack. We will have to do some more work for instructions such as SOURCE that interact with the blockchain.
Testing simple instructions
For simple instructions, our testing approach consists of writing a simple contract that takes a sample input to the instruction as its parameter and leaves the result of the instruction in storage. For instance, to test the INT instruction, we write the contract int.tz:
parameter nat;
storage (option int);
# this contract takes a natural number as parameter, converts it to an
# integer and stores it.
code { CAR; INT; SOME; NIL operation; PAIR };
Then, we write a pytest that executes the contract with a sample input, retrieves the resulting storage and asserts that it is equal to the expected one. As we have a large number of tests like this, we use pytest parameterization, which enables the execution of the same test with different parameters. In our case, we write one test method that is parameterized by a tuple (contract, param, storage, expected).
def test_contract_input_output(self, client, contract, param, storage, expected):
    contract = f'{CONTRACT_PATH}/{contract}'
    run_script_res = client.run_script(contract, param, storage)
    assert run_script_res.storage == expected
It runs contract, using the run script command of tezos-client. Here, client is a fixture provided by the Tezos Python Execution and Testing Framework that wraps a tezos-client binary. The run script command is called with the parameter param and the initial storage storage. Then, the test verifies that, after execution, the final storage contains expected.
For instance, to test the INT instruction, we could instantiate the test parameters with ('int.tz', 'None', '0', '(Some 0)'), meaning that when the int.tz contract is passed 0 as its parameter, the final storage should contain (Some 0). These instantiations are given by decorating the test method. The resulting code looks like this:
@pytest.mark.parametrize(..., [
    # ...
    # Test INT
    ('int.tz', 'None', '0', '(Some 0)'),
    ('int.tz', 'None', '1', '(Some 1)'),
    ('int.tz', 'None', '9999', '(Some 9999)'),
    # ...
])
def test_contract_input_output(self, client, contract, param, storage, expected):
    # ...
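This table-driven style also makes it cheap to cover the instructions that the baseline measurement flagged as untested. As a sketch, covering SIZE on lists could look as follows; list_size.tz is a hypothetical contract name (a contract analogous to int.tz that runs SIZE on its list parameter and stores the result), not necessarily a file in the repository.

@pytest.mark.parametrize("contract,param,storage,expected", [
    # hypothetical rows: SIZE on an empty and on a three-element list
    ('list_size.tz', 'None', '{}', '(Some 0)'),
    ('list_size.tz', 'None', '{ 1 ; 2 ; 3 }', '(Some 3)'),
])
def test_list_size(self, client, contract, param, storage, expected):
    # same body as test_contract_input_output above
    contract = f'{CONTRACT_PATH}/{contract}'
    run_script_res = client.run_script(contract, param, storage)
    assert run_script_res.storage == expected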
More involved testing
Testing instructions that depend on the state of the blockchain requires more effort, as exemplified by our tests for the SOURCE instruction. This instruction pushes the address of the contract that initiated the current transaction, called the source. Note that this may differ from the sender, which is the address of the contract that initiated the current internal transaction and which is obtained by the SENDER instruction.
Let us demonstrate the difference between the two instructions with a scenario involving two contracts. The first contract, receiver, puts the result of SOURCE in storage:
parameter unit;
storage address;
code { DROP; SOURCE; NIL operation; PAIR }
The second contract, proxy, transfers whatever amount it receives to another contract, as specified by the parameter of the transaction:
parameter (contract unit);
storage unit;
code {
  UNPAIR;                # target contract : storage
  AMOUNT;                # transferred amount : target contract : storage
  UNIT;                  # parameter of the transfer : amount : target contract : storage
  TRANSFER_TOKENS;       # internal transfer operation : storage
  DIP { NIL operation };
  CONS;                  # operation list : storage
  PAIR
}
Now, assume that the contract bootstrap2 creates a transaction to the proxy contract sending 99 ꜩ and giving the address of the receiver contract as parameter. Then the smart contract proxy creates an internal transaction that transfers the received 99 ꜩ to receiver. This is illustrated by the following figure:
In this scenario, the source of the internal transaction (from proxy to receiver) is bootstrap2, whereas the sender of that transaction is the proxy.
We test SOURCE to verify that it operates as expected. First, we test a direct transfer from bootstrap2 to the receiver contract, and then a transfer through the proxy contract as in the scenario described above. In both cases, we assert that the source address left in the storage of receiver is that of bootstrap2:
def test_source(self, client):
    init_store = IDENTITIES['bootstrap4']['identity']
    init_with_transfer(client,
                       f'{CONTRACT_PATH}/receiver.tz',
                       f'"{init_store}"',
                       1000, 'bootstrap1')
    # direct transfer to the contract
    client.transfer(99, 'bootstrap2', 'receiver', ['--burn-cap', '10'])
    bake(client)
    source_addr = IDENTITIES['bootstrap2']['identity']
    assert_storage_contains(client, 'receiver', f'"{source_addr}"')
    # indirect transfer to the contract through proxy
    contract_addr = client.get_contract_address('receiver')
    client.transfer(99, 'bootstrap2', 'proxy',
                    ['--burn-cap', '10', '--arg', f'"{contract_addr}"'])
    bake(client)
    assert_storage_contains(client, 'receiver', f'"{source_addr}"')
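A test for SENDER can follow the same pattern. The sketch below is hypothetical: it assumes a sender.tz contract analogous to receiver.tz that stores the result of SENDER instead of SOURCE, reuses the same helpers, and is not necessarily the test present in the repository. After the indirect transfer, the stored address should be that of the proxy contract rather than bootstrap2.

def test_sender(self, client):
    # hypothetical sketch: sender.tz stores the result of SENDER
    init_store = IDENTITIES['bootstrap4']['identity']
    init_with_transfer(client,
                       f'{CONTRACT_PATH}/sender.tz',
                       f'"{init_store}"',
                       1000, 'bootstrap1')
    # direct transfer: the sender is bootstrap2 itself
    client.transfer(99, 'bootstrap2', 'sender', ['--burn-cap', '10'])
    bake(client)
    sender_addr = IDENTITIES['bootstrap2']['identity']
    assert_storage_contains(client, 'sender', f'"{sender_addr}"')
    # indirect transfer through proxy: the sender of the internal
    # transaction is the proxy contract, not bootstrap2
    contract_addr = client.get_contract_address('sender')
    client.transfer(99, 'bootstrap2', 'proxy',
                    ['--burn-cap', '10', '--arg', f'"{contract_addr}"'])
    bake(client)
    proxy_addr = client.get_contract_address('proxy')
    assert_storage_contains(client, 'sender', f'"{proxy_addr}"')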
Final results
From the beginning, we have had a reasonable degree of testing in place, which was reflected in the core's stability. Despite this, we committed to achieving 100% (or close to it) code coverage. To that end, since April 2019, we have added 719 new Python integration tests, using 66 new test contracts. As a result of this effort, we now obtain a code coverage of 99.12% for the instruction integration tests and 100% for the full contract test suite. We cover all branches of the interpreter function in script_interpreter.ml.
In addition to quantifying the confidence in the test suite of the interpreter, bisect_ppx helped us discover dead code in the interpreter. The interpreter contained a “peephole optimization” for the UNPAIR macro. This optimization was supposed to match the expansion of the UNPAIR macro and transform directly a stack of the form (Pair a b) : S into a stack of the form a : b : S, without executing the full expansion. To our surprise, we discovered that this branch of the interpreter is never visited during testing. This was quite unexpected, since the UNPAIR macro is very commonly used in our test contracts.
It turns out that the optimization of UNPAIR was never applied. The expansion of this macro had been changed since the optimization was added to the interpreter, but the optimization had not been updated to match. Consequently, the pattern match in the interpreter no longer matched the actual definition of UNPAIR. Thanks to bisect_ppx, we were able to detect this dead code. As the gas updates in Babylon remove much of the gains from this optimization, we removed it in the current Carthage protocol.
Conclusion
Now that we have achieved a high level of code coverage in the interpreter, the next goal is to maintain this level and to extend it to the remaining parts of the Tezos code base.
For the first objective, we have added code coverage measurements to our continuous integration system. This lets developers calculate test coverage through a button push in the CI’s interface. In the future, we would also like a developer to be automatically informed if, for example, they add a new Michelson instruction but forget to add tests verifying its implementation.
For the second objective, we are working on fully covering the type checker of Michelson, currently at 75.94%. To achieve this goal, we are writing both unit tests and integration tests. We are also encouraging developers to add automatic tests for all new functionality and when modifying existing behaviors.
End Notes
- For Python test cases, this counts the difference between test items collected by pytest when running pytest tests_python/tests/test_contract* in commit #0478afc7 and in commit #395bee9dc. For test contracts, this counts the difference between the number of *.tz files present in src/bin_client/test/ in commit #0478afc7 and in tests_python/contracts in commit #395bee9dc.
- The missing 0.88% comes from a case of the deprecated STEPS_TO_QUOTA instruction, which is unreachable through the tezos-client commands used for integration testing of contracts.