Tezos Storage / Irmin Summer 2021 Update

samoht · October 2, 2021, 8:49pm

Irmin July/August 2021 Update

Tarides continues to develop fast, reliable storage components for Tezos, so this report contains our progress on the Tezos storage projects during July and August 2021. We’re excited to showcase our advancements to both the OCaml and Tezos communities.

As the chain’s activity keeps growing, it’s of the utmost importance to continue improving the overall performance and scalability of the Tezos storage. This summer, we mainly focused on index, a component used to index Tezos’ context elements which is very I/O intensive. Therefore, we are trying to (1) reduce the number of indexed objects and (2) explore alternative index implementations with better performance characteristics. We’ve also been working on optimizing how data is organized in the context; hence, our focus—in collaboration with DaiLambda—was on flattening the Tezos data model, now part of the Hangzhou proposal. Finally, to better understand the performance impact of these changes, we have continued to make progress on the record/replay benchmarks for the bootstrap trace, and we’re setting up Irmin/Tezos benchmarks on the Raspberry Pi. Read more details about each of these in the report below.

Before continuing, we’d like to extend special thanks to the engineers who contributed to this report and all their hard work. If you are new to Irmin, please read a short introduction in our Irmin 2021 Update and on our website Irmin.org.

Improve the Index Performance

Index is a scalable component of irmin-pack—the Irmin backend used in the Tezos storage layer. irmin-pack writes data in an append-only file called a pack file. In order to efficiently retrieve the data, it uses the Index library, which maps hashes of objects to their location in the pack file.

The index implementation in irmin-pack has a hard problem to solve: the Tezos context stores many millions of individual objects, all contained in the pack file, and each of them must be addressable by its hash. The index must be compact on disk, and it must also be fast to search, since it is consulted for every object we read.
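To make the relationship between the pack file and the index concrete, here is a minimal Python sketch of the idea. It is not irmin-pack's actual code or on-disk format: the `PackStore` class, the in-memory buffer, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import io


class PackStore:
    """Toy append-only store: a pack buffer plus a hash -> (offset, length) index."""

    def __init__(self):
        self.pack = io.BytesIO()  # stands in for the append-only pack file
        self.index = {}           # stands in for the Index library

    def add(self, value: bytes) -> bytes:
        key = hashlib.sha256(value).digest()
        if key not in self.index:  # content-addressed: each object is stored once
            offset = self.pack.seek(0, io.SEEK_END)
            self.pack.write(value)
            self.index[key] = (offset, len(value))
        return key

    def find(self, key: bytes) -> bytes:
        offset, length = self.index[key]  # one index lookup for every read
        self.pack.seek(offset)
        return self.pack.read(length)


store = PackStore()
k1 = store.add(b"contract A")
k2 = store.add(b"contract B")
```

Every `find` goes through the index first, which is why the size and speed of the index matter so much for read performance.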

The index was originally optimised for speedy read performance at the cost of potentially-slow writes on very large stores. However, this initial design choice is becoming less tenable as the Tezos blockchain grows and the index used in Octez increases in size. We’re working on improving the indexing mechanism to better meet the requirements of a modern Tezos node.

In the search for an alternative way to store data in the index, we’re exploring a method based on structured keys. This consists of adding more information to the store keys to avoid accessing the index for every read. This new method would lead to fewer stored objects in the index at the cost of potential duplication of objects on disk, and it’s still a work in progress. We prepared the code for this feature by refactoring it to use Schemas, a more compact way of instantiating the Irmin Make modules. Moving forward, we’ll continue working on the structured keys approach, which should reduce the number of objects stored in the index and therefore improve its performance.
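The trade-off can be sketched in a few lines of Python. This is only an illustration of the general idea, not the design being implemented in Irmin: the `StructuredKey` fields and the `DirectPack` class are hypothetical names, and the pack is an in-memory buffer.

```python
import hashlib
import io
from typing import NamedTuple


class StructuredKey(NamedTuple):
    digest: bytes  # still identifies the content
    offset: int    # where the object lives in the pack
    length: int    # how many bytes to read


class DirectPack:
    """Toy pack whose keys carry their own location, so reads skip the index."""

    def __init__(self):
        self.pack = io.BytesIO()

    def add(self, value: bytes) -> StructuredKey:
        offset = self.pack.seek(0, io.SEEK_END)
        self.pack.write(value)  # no deduplication: equal values may be written twice
        return StructuredKey(hashlib.sha256(value).digest(), offset, len(value))

    def find(self, key: StructuredKey) -> bytes:
        self.pack.seek(key.offset)
        data = self.pack.read(key.length)
        if hashlib.sha256(data).digest() != key.digest:
            raise ValueError("key does not match pack contents")
        return data


direct = DirectPack()
first = direct.add(b"same object")
second = direct.add(b"same object")
```

Here `first` and `second` have the same digest but different offsets: no index lookup was needed to read either one, but the object is now stored twice.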

Additionally, we experimented with using mmap instead of preads and pwrites for the I/O calls, and we benchmarked it with a small operation trace that corresponded to the first 100K commits in Tezos. These initial benchmarks showed around 10% performance improvement, but the improvement was less significant when calling msync regularly. See more details at index/350. While the change to mmap still seems promising, it will probably have a small impact on performance.
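For readers unfamiliar with the two I/O styles, the following Python sketch reads the same bytes through `pread` and through a memory mapping, then writes through the mapping and flushes it. It only illustrates the system calls involved and is unrelated to the code in the index library; it assumes a Unix-like system, where `os.pread` is available and `mmap.flush` calls `msync`.

```python
import mmap
import os
import tempfile


def read_with_pread(fd: int, offset: int, length: int) -> bytes:
    # One system call per read, with an explicit offset.
    return os.pread(fd, length, offset)


def read_with_mmap(mapping: mmap.mmap, offset: int, length: int) -> bytes:
    # The file is mapped into memory once; reads are plain slices.
    return mapping[offset:offset + length]


with tempfile.TemporaryFile() as f:
    f.write(b"0123456789" * 100)
    f.flush()
    fd = f.fileno()
    with mmap.mmap(fd, 0, access=mmap.ACCESS_WRITE) as m:
        via_pread = read_with_pread(fd, 995, 5)
        via_mmap = read_with_mmap(m, 995, 5)
        m[0:5] = b"HELLO"  # write through the mapping
        m.flush()          # msync: force the dirty pages out to the file
        after_write = os.pread(fd, 5, 0)
```

Calling `flush` after every write gives stronger durability but adds a system call each time, which is consistent with the smaller gains we measured when calling msync regularly.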

Relevant PRs and Issues: irmin/1389; irmin/1510; irmin/1470; index/mmap-wip.

Alternative Index Implementation

Another possible way to improve irmin-pack's indexing performance would be to change the design to one that’s better suited to the huge contexts of the modern Tezos chain. Along these lines, we’ve been experimenting with using B-trees as part of an index implementation.

Over the summer, we added data integrity to our implementation of B-trees, so it now allows recovery from a crash. Although we’re sad to report that Gabriel Belouze’s internship on B-tree finished this month, we’re thrilled with his work on cleaning the code, releasing Cactus.1.0.0, and writing a report.

We also implemented a simpler version of B-trees—mini-btree—to produce some baseline benchmarks. We created a simple log using mmap to compare it with the preads and pwrites used in index.

Relevant links: cactus; mini-btree.

Flatten Tezos Data Model

The proposed Hangzhou protocol brings with it a change to the structure of the context: flattening internal paths in order to improve the efficiency of the storage layer. This flattening requires that nodes undergo an automatic migration.
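As a rough illustration of what flattening a path means, the Python sketch below contrasts a nested layout with a flat one. The directory names, the two-character prefixes, and the number of levels are assumptions chosen for the example; the actual layout changes are described in the merge requests linked below.

```python
from typing import List


def nested_path(key_hex: str, levels: int = 3) -> List[str]:
    """Hypothetical nested layout: the key is split into short prefix directories."""
    prefixes = [key_hex[2 * i:2 * i + 2] for i in range(levels)]
    return ["contracts", "index", *prefixes, key_hex[2 * levels:]]


def flat_path(key_hex: str) -> List[str]:
    """Hypothetical flat layout: the whole key is a single path step."""
    return ["contracts", "index", key_hex]


example_nested = nested_path("a1b2c3d4e5")
example_flat = flat_path("a1b2c3d4e5")
```

In this toy layout, the nested path has six steps and the flat one has three, so reaching the same entry requires traversing fewer intermediate nodes.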

In collaboration with DaiLambda, we’ve been working on reducing the memory usage of this migration process, so even nodes with limited available RAM (<= 8GB) are able to upgrade seamlessly.

Relevant PRs and issues: tezos/2771; tezos/1682; irmin/1506.

Record/Replay of the Tezos Bootstrap Trace

We continued to make progress on the new record/replay benchmarks. In July, we finished implementing the record phase and the conversion from the raw trace to a replayable one, and in August we started implementing the replay phase.

These new record/replay benchmarks can be used to record traces of live nodes, and they include more metrics. We’ve also almost finished implementing the summary computation for all the stats gathered during the record or replay phase.
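The general record/replay pattern can be sketched as follows. This is a simplified Python illustration, not the benchmark code: the `RecordingStore` class, the JSON-lines trace format, and the two operations are assumptions made for the example.

```python
import json
import time


class RecordingStore:
    """Toy key-value store that logs every operation to a trace."""

    def __init__(self, trace):
        self.data = {}
        self.trace = trace  # list of JSON-encoded operations

    def set(self, key, value):
        self.trace.append(json.dumps({"op": "set", "key": key, "value": value}))
        self.data[key] = value

    def get(self, key):
        self.trace.append(json.dumps({"op": "get", "key": key}))
        return self.data.get(key)


def replay(trace, store):
    """Re-run a recorded trace against `store`; return (count, seconds) per operation."""
    stats = {}
    for line in trace:
        event = json.loads(line)
        start = time.perf_counter()
        if event["op"] == "set":
            store[event["key"]] = event["value"]
        else:
            store.get(event["key"])
        elapsed = time.perf_counter() - start
        count, total = stats.get(event["op"], (0, 0.0))
        stats[event["op"]] = (count + 1, total + elapsed)
    return stats


trace = []
live = RecordingStore(trace)
live.set("a", 1)
live.get("a")
live.set("b", 2)

fresh = {}  # the store implementation under test; a plain dict here
summary = replay(trace, fresh)
```

Because the trace is independent of the store that produced it, the same recorded workload can be replayed against different storage implementations and the summaries compared.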

Stay tuned for a blog post covering the technical details of these benchmarks and showing how to reproduce them.

Run Irmin/Tezos Benchmarks on a Raspberry Pi 4

We’re in the process of setting up several Raspberry Pis with different configurations to use for our monthly benchmarks. This is a fun project, so we look forward to reporting the outcome in one of our next public reports.

Continuous Benchmarking

We continue maintaining current-bench, fixing bugs, and refactoring the docker-compose component of the pipeline. We published a blog post explaining the benchmarks infrastructure.

Maintain Tezos’s MirageOS Dependencies

General Irmin Maintenance

Part of the maintenance work in July 2021 included refactoring the Config module, which resulted in more uniform configuration options across backends.

As the CI added support for IBM Z (s390x) machines, we found several spots in Irmin that weren’t working on big-endian systems and caused stack overflows. We also detected a discrepancy between the memory usage reported by our memtrace-filters and the memtrace-viewer, which we are investigating. We’re also investigating the bugs in the graph traversals of an Irmin object graph, using Gospel (see more details below).

Relevant PRs and issues: progress.0.2; irmin/1492; irmin/1505; irmin/1503; repr/71.

Recover and Debug Corrupted Stores

We concentrated most of our efforts this month on understanding and repairing an occurrence of a corrupted context caused by the node crashing unexpectedly. The issue, including more details on how it occurred, was tracked in irmin/1476. While investigating the issue, we developed several tools:

a diff tool for commits that highlights the differences between the same commit in a corrupted store and in a normal store
a brute-force integrity check tool to traverse the entire store and check for all types of inconsistencies
a light version of this tool that only checks for missing entries in the index, which was the source of the corruption and is now integrated in the storage subcommand in the ./tezos-node cli
a script that launches, kills, and restarts nodes in a loop to look for a reproduction of the issue
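The light integrity check in the list above can be pictured with the following Python sketch, which scans a pack sequentially and reports entries that the index does not know about. The entry layout (32-byte hash, 4-byte big-endian length, payload) and the function names are assumptions made for the example, not irmin-pack's real format or the tezos-node command.

```python
import hashlib
import io
import struct

HASH_LEN = 32  # SHA-256 digest size, assumed for this toy format


def append(pack, index, value, indexed=True):
    """Append one entry; `indexed=False` simulates an index entry lost in a crash."""
    digest = hashlib.sha256(value).digest()
    offset = pack.seek(0, io.SEEK_END)
    pack.write(digest + struct.pack(">I", len(value)) + value)
    if indexed:
        index[digest] = offset
    return digest


def missing_from_index(pack, index):
    """Scan the whole pack and list entries whose offset is not recorded in the index."""
    missing = []
    pack.seek(0)
    while True:
        offset = pack.tell()
        header = pack.read(HASH_LEN + 4)
        if len(header) < HASH_LEN + 4:
            break  # end of pack (or a truncated trailing entry)
        digest = header[:HASH_LEN]
        (length,) = struct.unpack(">I", header[HASH_LEN:])
        pack.seek(length, io.SEEK_CUR)  # skip the payload
        if index.get(digest) != offset:
            missing.append((offset, digest))
    return missing


pack, index = io.BytesIO(), {}
a = append(pack, index, b"a")
b = append(pack, index, b"bb", indexed=False)
c = append(pack, index, b"ccc")
report = missing_from_index(pack, index)
```

In this run, `report` contains only the entry for `b`, the one that was written to the pack without a matching index entry.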

The corruption was caused by the merge thread (which runs concurrently with the main thread and updates the on-disk index with in-memory data) being killed by an out-of-memory failure and then restarted when the node recovered. When the merge thread restarted, all data added to the store before the failure was lost. We fixed the issue and released index versions 1.4.1 and 1.3.2 containing the fix.

We merged the new releases of index and Irmin that contained bug fixes, and we also included the light version of the integrity checking commands mentioned above.

Relevant PRs and issues: irmin/1476; irmin/1478; index/337; index/338; index/339; irmin/1477; index/344; index/345; tezos/3282.

Respond to and Track Issues Reported by Tezos Maintainers

We released Irmin.2.7.1, which contains a small bug fix for the reconstruct-index subcommand of ./tezos-node storage, and we responded to a bug reported by Nomadic Labs regarding the configuration of irmin-pack.mem.

Sometimes, nodes have to list all contracts in response to an RPC call. With the flattened store, such a listing can contain millions of entries, so a single request can freeze the node for 20 minutes. We are working on adding pagination to the “get_all” RPCs to avoid listing all contracts at once.
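A minimal Python sketch of cursor-based pagination is shown below. It illustrates the general technique only: the function name, the `cursor` and `limit` parameters, and the default page size are hypothetical and do not describe the RPC interface being designed.

```python
from typing import List, Optional, Tuple


def list_contracts_page(
    contracts: List[str], cursor: int = 0, limit: int = 1000
) -> Tuple[List[str], Optional[int]]:
    """Return one page of contracts and the cursor for the next page, or None at the end."""
    page = contracts[cursor:cursor + limit]
    next_cursor = cursor + limit if cursor + limit < len(contracts) else None
    return page, next_cursor


# A client walks through the full listing one bounded request at a time.
all_contracts = [f"contract-{i}" for i in range(2500)]
page_sizes = []
cursor: Optional[int] = 0
while cursor is not None:
    page, cursor = list_contracts_page(all_contracts, cursor)
    page_sizes.append(len(page))
```

Each request does a bounded amount of work, so the node can serve other calls between pages instead of being blocked by one very large response.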

Relevant PRs and issues: tezos/3421.

Verify Existing Bits of the Stack using Gospel

We are developing Ortac, a framework for runtime assertion checking of OCaml programs based on Gospel behavioural specifications. It provides a flexible solution for traditional assertion checking, monitoring misbehaviours, and automated fuzzing of OCaml programs.
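To give a feel for runtime assertion checking, here is a small Python analogy. Ortac itself works on OCaml code and derives the checks from Gospel specifications; the `spec` decorator, the `SpecViolation` exception, and the `floor_div` example below are hypothetical and written by hand only to show the idea of wrapping a function with its pre- and postconditions.

```python
import functools


class SpecViolation(AssertionError):
    """Raised when a call breaks its specification at runtime."""


def spec(requires=None, ensures=None):
    """Wrap a function so that its pre- and postconditions are checked on every call."""
    def wrap(fn):
        @functools.wraps(fn)
        def checked(*args):
            if requires is not None and not requires(*args):
                raise SpecViolation(f"{fn.__name__}: precondition failed for {args}")
            result = fn(*args)
            if ensures is not None and not ensures(result, *args):
                raise SpecViolation(f"{fn.__name__}: postcondition failed for {args}")
            return result
        return checked
    return wrap


@spec(
    requires=lambda x, y: y > 0,
    ensures=lambda q, x, y: q * y <= x < (q + 1) * y,
)
def floor_div(x, y):
    return x // y
```

A call such as `floor_div(7, 2)` passes both checks, while `floor_div(1, 0)` is rejected by the precondition before the division is attempted. The same wrapped function can be driven by a fuzzer, which then reports any input that violates the specification.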

We applied the tool to two projects used by Irmin and Tezos: the optint library and Irmin’s Object_graph. For the latter, we managed to write and check an original implementation using some tricks to overcome the tool’s limitations.

During the summer, we presented our work on fuzzing through Ortac at the OCaml’21 Workshop. Ortac’s general architecture and design will be presented during the RV’21 conference later this fall.

Relevant links: Gospel; Ortac.

Improve Tezos Interoperability

C Bindings for Irmin

We wrote libirmin to generate a C library using ctypes inverted stubs. This can be used by C clients to directly interface with the Tezos storage.

Storage Appliance

We are continuing to experiment with a separate storage daemon (irmin-server) that handles the full Tezos storage. This will make the storage easier to operate and help it handle concurrency better.

We started outlining future work on the irmin-server and how we can add an irmin-graphql client.

We are also implementing the msgpack serialisation format in the irmin-server and adapting it to work with Repr. We’re maintaining the current implementations of irmin-server and irmin-rpc, and we hope to release them with Irmin 3.0.

Relevant links: https://github.com/mirage/irmin-server; mirage/irmin-rpc (RPC client/server for Irmin).

Improve Irmin Documentation

We are working on integrating the tutorial into the release process and getting it ready for the Irmin 3.0 release.

Follow the Tarides blog for future Irmin updates.

nicolasochem · October 7, 2021, 8:03pm

Thank you @samoht for the update. At Oxhead Alpha we have been struggling to efficiently test the context migration on mainnet and storage performance in general.

The existing yes-node method (or “manual migration testing”) is not ideal because it forces us to recompile.

Instead of a yes-node, we need a yes-context: a tool that takes an existing mainnet data directory and rewrites the context to change all baker keys to derivations of a key that we control, passed as a parameter to the tool. This would allow us to run tezos-client bake for, or more advanced mainnet simulation scenarios, with a recent build of tezos without recompiling.

This would be essential for us to quickly retest the Hangzhou migration time as you merge improvements to the storage layer. Is it possible to build such a tool? Is it easy?