The new architecture of the TezEdge node: bridging the gap between formal specification and implementation

jurajselep · November 19, 2021, 7:07pm

Whenever developing a system that controls critically-important data, security is paramount. To achieve the highest degree of security, we need to check for potential vulnerabilities in every possible case. A system like this may have many different states, and the more complex it is, the more states it can have.

To ensure the highest possible degree of security, we have to perform so-called exhaustive testing, in which we go through every possible state (every possible data combination) to leave no scope of further testing. In other words, if exhaustive testing is thoroughly executed, then all testing parameters will be covered. This is to say that we have full confidence that our system has all the intended desirable properties and no undesirable properties.

However, exhaustive testing is infeasible if the system is too complex, as is the case with the TezEdge node. In systems with this level of complexity, there are an astronomical number of states to check; complexity leads to state space explosion. In order to formally verify such a system, we need to create a mathematical representation of each part of the TezEdge node.

Writing such mathematical representations of the node was impossible due to the actor model-based architecture that we have used until now. For this reason, we needed to redesign the architecture in a way that would reduce the node’s complexity and allow us to write a representation of the node in TLA+, which we could then formally verify. There are various methods for performing formal verification, and in our case, we are using a formalism known as Temporal Logic of Actions (TLA+).

You can see how we used TLA+ for the node’s specification by clicking here.

Differences between formal specification and implementation

However, the main problem of formal verification is that even if we have a formal specification of our node’s architecture, that specification may significantly differ from the actual implementation. After all, one of the main objectives of writing such a specification is to simplify the problem. Therefore the main challenge of using methods such as formal verification is bridging the gap between the formal specification and the implementation that is actually used.

To bridge the ‘verifiability gap’, we are working on trace checking our implementation against the formal specification. Essentially, this consists of defining a (refinement) mapping from the complex implementation states and actions to the simple versions in the specification. This way we can mechanically verify that the implementation conforms to the specification.

The old architecture

The previous TezEdge architecture was based on the actor model. Actors are independent processes with internal state that perform tasks and communicate with one another only by passing messages. One of the benefits of this design is that actors provide an abstraction barrier between one another so errors in one actor do not disrupt any other actor.

The downside of the actor model, as with any non-deterministic system,

is that there were a very large number of possible states due to the large number of parallel threads. Due to this, it was practically impossible to debug via record/replay, i.e. deduce the causes of errors by recording and replaying the program from a stable state to the point where the error was encountered. Another issue was that we could not really use the actor model architecture for formal verification to validate safety or liveness conditions because the state space (the number of possible states) was simply too large to get a handle on.

Another issue with the multi-actor architecture is that the state of the application is split and partially shared between different asynchronously running components, which leads to a problem of keeping the state consistent in each such component. Failing to do so can lead to various errors that are very difficult to detect and troubleshoot.

Towards a single-threaded implementation

Having the application state in one thread eliminates that kind of error. Still, for server or peer-to-peer-like applications it is almost impossible to use an imperative style for managing that state since they should react to external network events that appear asynchronously. Additionally, while doing I/O operations, the thread that manages the state of such an application shouldn’t be blocked to be able to handle other network requests or other I/O operations.

To solve that problem, a “reactive” architecture is usually used, where the application is built as a deterministic finite state machine (DFSM) that changes its state as an event arrives (e.g. incoming HTTP request). On each event the state of the application is immediately updated according to its current state and the event kind and payload, and the application is ready to handle the next event.

In order to manage long-running operations, like I/O, network requests, or heavy computational tasks, such operations are performed in a non-blocking way. For example, a long-running computational task can be moved out to a separate thread or a process that will trigger an event on its completion. As for I/O, non-blocking mode should be used, so the application only reads or writes as many bytes as available.”

The New TezEdge Architecture

The new TezEdge architecture is based on state machines, computational models that can be in exactly one of a number of states at any given time. A state machine consists of a global state (assignment of values to variables) paired with a collection of actions which specify how the state can change. There is a predetermined state in which we start called the initial state. From each state, we can only move to a specified subset of states which are determined by the enabling conditions of the actions.

We exploit a different abstraction layer in this architecture, as the reducer is the only entity that can change the state when given an action. This is analogous to the purely functional fragment of Haskell. Pure functions are functions that, when fed with a particular input, will always return the same output. For example, a function that takes two integers returns the sum is a pure function; give it the same inputs, it will always return the same result. Contrast this with impure functions where the same input may return different outputs. For example, a function which reads the contents of a directory and returns the number of the files it contains is an impure function. We can call it when the directory is empty and it will return 0, then again after we add a file and it will return 1.

The reducer is responsible for managing state, however we still need some way to define control-flow and incur side-effects such as do file, database, socket I/O, DNS, or interactions with any other piece that isn’t part of the state machine. For that, we have the notion of effects (short for side-effects). It is a special function which is executed after the reducer on each action. That function may incur side-effects and dispatch actions.

Our program is now able to do side-effects, which are essential for it to function. The next step is to test how it works. Testing a reducer is obviously easy, since it is a pure function, but testing side-effects is another story. To be able to test them, we need to abstract those pieces that are outside of the state machine with which effects interact, so that we can mock them. We decided to call them services.

Inspiration from Redux

For those that know redux, terms: “state”, “action”, “dispatch” will sound familiar. As redux website describes it: “Redux is a predictable state container”. It is a library widely used for managing the global state as a single source of truth. We took that library, implemented in Rust and modified/extended it to better meet our needs.

Changes

In Redux, reducers aren’t allowed to mutate(change) the state. Instead they take the state and output a new state, without changing or affecting the previous state in any way. This allows cheap change detection, which is essential for the front-end, but for us it’s not useful and to achieve something similar, we would have to sacrifice performance, efficiency and complicate state a lot by using Rc everywhere in our state.

For that reason we decided to deviate in that sense from the original Redux and make reducers mutate the state.

Extensions

Effects — We have added a notion of effects, which are executed after the reducer. See more details above.
Enabling Conditions — We define a condition using which we check if the given action is “enabled” (allowed to execute/be dispatched) for a given state. This is very important especially for testing, since we have a simple method to check if action could happen in real execution when being in a certain state, so we can significantly decrease possible action sequences or possible states that we need to test for.
Safety Conditions — We define safety conditions on a state, which will check if the current state is valid. These checks will be used in testing to determine if we have reached invalid state. It could also be used during runtime to detect, log and auto quickfix issues.

Inspiration from TLC

The new architecture takes a lot of inspiration from TLC, a model checker for TLA+ (a formal specification language).

Essentially, our node starts in an initial state which satisfies all required invariants (state predicates).

The collection of all possible actions is finite and in each state we can always distinguish between actions which are enabled, i.e. they can be applied to the current state, and those which are disabled, i.e. they cannot be applied to the current state because some condition is not satisfied.

Then, from our current state, we can apply any of the enabled actions to the current state, which results in a successor state. We check whether the successor state satisfies all invariants; an invariant is a condition that we want to hold in every state. For example, if our state machine is modeling a simple counter, we may want to check that the counter value is nonnegative in every state. If so, we can again determine all enabled actions in our current state and apply one. As long as there are enabled actions, we can always move to another state. If no actions are enabled, then we are in a deadlocked state.

This design lends itself naturally to simulation testing and fuzzing since from any state, we can randomly choose an enabled action to apply and thus, generate a valid execution. Since the middleware handles all side-effects, we can have a middleware component for random simulation and fuzzing.

TLA+ trace-checking

As the actions are applied one-by-one by the reducer, this architecture is amenable to formal verification, especially formal specification and trace-checking. Trace-checking consists of logging state or action information and checking an execution against a formal model. The reducer will utilize an action-logging middleware component to log the actions it applies which we will then be able to check against a formal specification in TLA+. The TLA+ specification declares an initial state and all possible actions along with their enabling conditions to define the collection of all possible behaviors/executions of the system. We will instantiate the TLA+ model with constants provided by the program’s actual execution and determine whether the execution conforms to the formal specification.

The advantages of working with a formal specification are that they are considerably simpler than the actual program and we have a framework for verifying properties of all possible behaviors. We use a symbolic model checker, Apalache, to formally verify desired properties of the specification, i.e. we have proof that the spec behaves as intended. Trace-checking then allows us to check the program’s execution against our formally verified spec to make sure that they do not deviate. Any initial deviations can be converted into regression tests (re-running tests to ensure that the previously developed and tested version of the node still performs after this change) and used to strengthen our test suite, giving even more confidence that our system works as intended.

Visualizing how actions change the global state

We’ve developed a front end that visualizes the flow of actions and their effect on global state, with an emphasis on the replay function. Please visit this link to view the front end:

https://tezedge.com/#/state

Viewing the sequence of actions

The sequence of the actions executed by the state machine can be visually represented using a flow diagram where each block is represented by an action. Actions are connected by arrows. The state chart can be expanded for a bigger view or it can be collapsed entirely if the user wants to focus on the other information on this page. These latter two are stored inside the browser’s local storage, so the exact same view the user used will be mapped back on page refresh.

https://miro.medium.com/max/1750/1*BbGlgMx6L7eFzDNM5uBxDg.gif(image larger than 4 MB)

List of actions

In the bottom left section, we can see the chronological sequence in which the actions were triggered inside the state machine, one after another. Each action includes three types of information: content, state and diffs. Each row represents an action and shows the action’s execution start time, its name and its duration.

State

The state tab represents the global state of the state machine. This is the new state after the selected action has been executed.

Diffs

The diffs tab shows the differences between the state before and after the action has been executed.

Content

Shows the payload of the action, in other words, this is the data that has led to a new state.

Replaying recorded actions

By using the player at the top of the page, we can replay the sequence of actions executed by the reducer.

When we click on an action in the list of actions, we see a visualization of where that action is coming from, and the action is also highlighted in the diagram in the top part of the screen.

The blue arrow connects the action with the previous action. This can also be seen in the list of actions, which is ordered chronologically — above each action is the action that preceded it.

https://miro.medium.com/max/1750/1*rBCmkAmeg9Mt83tJhUS2sQ.gif(image larger than 4 MB)

Next action

An action can have multiple next actions in the flow diagram, but in reality there is only one next action. In the flow diagram, the action that succeeds the highlighted action is connected with a thick white line.

Benchmarking actions in real time

Here you can see which actions are most time-intensive, which points us to those actions that are most in need of optimization. You can view how much time is needed for every individual action.

https://miro.medium.com/max/1750/1*9a4jYuMFf2Z6W2ujXN9CXw.gif(image larger than 4 MB)

Statistics

By clicking on the statistics tab, we can see various statistics for each type of action that has been executed, including its number of calls and its total duration.

We hope you have enjoyed our update article. Feel free to contact me by email, I look forward to receiving your questions, comments and suggestions. To read more about Tezos and the TezEdge node, please visit our documentation, subscribe to our Medium, follow us on Twitter or visit our GitHub.

Ali · November 21, 2021, 7:57am

Congrats on the update.

Does this affect how bakers run their nodes?