Tezos Storage: Autumn Irmin Report: September - November

Tarides continues to develop fast, reliable storage components for Tezos. This report covers our progress on the Tezos storage projects during September, October, and November 2021. We’re excited to showcase our advancements to the Tezos community.

As the chain’s activity keeps growing, it’s of the utmost importance to continue improving the overall performance and scalability of the Tezos storage. Therefore, this autumn, we mainly focused on migrating the Tezos context to a new schema (the “path flattening” operation in the Hangzhou protocol) and on improving the chain’s I/O performance; the I/O improvements are planned to go live in Q1 2022.

Before continuing, we’d like to extend special thanks to the engineers and technical writers who contributed to this report for all their hard work. If you are new to Irmin, please read more on our website, irmin.org.

Improve the Performance

Flatten Tezos Data Model
We completed the necessary changes to Irmin for a memory-efficient flattening migration in Tezos, and these changes were released as part of the Hangzhou protocol upgrade.

Nomadic Labs and Dailambda have each set up reproducible benchmarks for the migration, which guided us in optimising the code. Two parts of the code have been optimised: Tree.export, called when serialising an in-memory tree during a commit, and Tree.fold, called during the migration to traverse the parts of the tree that are converted to the new flattened format.

After the path-flattening migration in Tezos, the store will contain very large directories (around 3 million entries, whereas before there were no more than 256), which the existing fold operation can’t traverse in a memory-efficient way. Tree.fold can guarantee a traversal of nodes in the lexicographical order of their keys, but only by loading the whole node in memory. This is problematic for nodes with a few million children: a memory-optimised version of Tree.fold can’t guarantee both low memory consumption and a fixed traversal order. After discussions with Nomadic Labs, we’ve added a flag to Tree.fold that specifies whether the entries are traversed in sorted, undefined, or random order. This flag will be available in a future release of Irmin.
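To make the trade-off behind the order flag concrete, here is a minimal Python sketch. It is not Irmin’s API: the names fold and iter_children, the string values of order, and the use of a plain dictionary to stand in for a node’s children are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, Iterator, List, Tuple


def iter_children(children: Dict[str, bytes], order: str = "sorted",
                  seed: int = 0) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, value) pairs of a directory in the requested order.

    `children` is a hypothetical stand-in for the entries of one node.
    """
    if order == "sorted":
        # Sorting needs every key at once: this is the memory-hungry case.
        for name in sorted(children):
            yield name, children[name]
    elif order == "undefined":
        # Stream entries in whatever order the storage hands them out.
        yield from children.items()
    elif order == "random":
        names = list(children)
        random.Random(seed).shuffle(names)
        for name in names:
            yield name, children[name]
    else:
        raise ValueError(f"unknown order: {order}")


def fold(children: Dict[str, bytes], f: Callable, acc, order: str = "sorted"):
    """Combine all entries into an accumulator, visiting them in `order`."""
    for name, value in iter_children(children, order):
        acc = f(name, value, acc)
    return acc


directory = {"b": b"2", "a": b"1", "c": b"3"}
names_sorted: List[str] = fold(directory, lambda n, v, acc: acc + [n], [], "sorted")
```

A caller that only sums or counts entries does not care about the order and can pick the streaming variant, while a caller that needs deterministic output pays for the sort.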

We released Index 1.4.2 and Irmin 2.8.0, which contain optimisations for Irmin’s memory usage during the migration to flattened stores. The release of Irmin 2.8.0 revealed some bugs in the lib_proxy code, namely that empty trees replaced the shallow nodes at various places in the code, which is problematic when computing hashes. In collaboration with Nomadic Labs, we proposed a fix.

During the migration of a Tezos node, two processes were performing the migration: the read/write process, which constructs a flattened tree and then commits to disk, and the read-only process, which also constructs this new tree. Since the read-only process only needs the tree to verify its hash, it discards the tree after the migration completes. The memory of the read/write process is kept under 6GB because the flattened tree’s subtrees will be regularly flushed to the disk. However, this isn’t possible for the read-only instances, so Nomadic Labs suggested disabling the read-only migration process. We experimented with a possible memory optimisation for read-only instances in order to discard subtrees and only keep their hash, but this resulted in a loss of efficiency.

The benchmarks that drive the migration’s memory optimisations have only used “small” stores (obtained after a snapshot import, where the context size is 4GB). Nomadic Labs reported a significant performance drop on bigger stores. To investigate, we benchmarked a small store (8GB), which included a snapshot of a recent state, and a “one month” store (30GB), consisting of the historical blockchain state. The OS tries to cache data from both stores, so if there isn’t enough free memory to hold the files in the cache, performance can degrade considerably due to blocking on disk I/O.

A way to detect this is to time the pread syscalls. We looked for possible ways to speed up the migration by using vmtouch or changing the store’s configuration parameters, and we wrote a document explaining the issue and possible mitigations. We also experimented with changes in Irmin to reduce the number of reads in order to rely less on the OS cache, but those attempts haven’t yet been successful.
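As an illustration of timing pread calls, the following Python sketch wraps os.pread (available on Unix) with a timer. The helper names timed_pread and total_read_time are hypothetical, and this is not how the measurements in this report were taken; it only shows the idea. Reads served from the OS cache return quickly, whereas reads that block on disk I/O show up as much larger durations.

```python
import os
import tempfile
import time
from typing import Iterable, Tuple


def timed_pread(fd: int, length: int, offset: int) -> Tuple[bytes, float]:
    """Read `length` bytes at `offset` and return the data with the elapsed seconds."""
    start = time.perf_counter()
    data = os.pread(fd, length, offset)
    return data, time.perf_counter() - start


def total_read_time(path: str, offsets: Iterable[int], length: int) -> float:
    """Sum the time spent in pread over a series of offsets in one file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return sum(timed_pread(fd, length, off)[1] for off in offsets)
    finally:
        os.close(fd)


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    demo_path = tmp.name
elapsed = total_read_time(demo_path, [0, 5], 2)
os.remove(demo_path)
```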

In Tezos, we exposed an order flag in Tree.fold to be used when folding over large directories in a memory-efficient way.

Relevant links: tezosagora.org/will-tezos-freeze-during-hanhzhou-migration, irmin/1508, irmin/1526, irmin/1531, irmin/1532, mirage/irmin#1545, mirage/index#362, mirage/irmin#1555, tezos/3679, tezos#3735, mirage/irmin#1590, mirage/irmin#1619, tezos/tezos#3910

Improve irmin-pack I/O Performance
Our goal is to index fewer objects in irmin-pack in order to substantially reduce the running node’s I/O. To do this, we’re adding a new feature in Irmin to support structured keys, which is planned to be part of Octez’s shell in Q1 2022.

A structured key, instead of being just a hash, now consists of a hash, an offset, and a length. Therefore, with a structured key, we can read a value directly from the pack file, instead of asking the index for its offset and length. To serialise the structured keys, we’re introducing new types of serialised objects in the pack file. This keeps backwards compatibility with older pack files, so only the newly added objects will use the new format. This feature considerably optimised irmin-pack by switching from an indirect access mode (that goes through the index) to a direct one. The nodes and blobs are now directly accessible from their parent.
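The difference between the indirect and the direct access modes can be sketched in a few lines of Python. Everything here is illustrative: StructuredKey, PackFile, and their methods are hypothetical names, an in-memory buffer stands in for the pack file, and SHA-256 is used only as a convenient hash.

```python
import hashlib
import io
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class StructuredKey:
    hash: bytes
    offset: int
    length: int


class PackFile:
    """Toy append-only pack file with an optional hash -> (offset, length) index."""

    def __init__(self) -> None:
        self._data = io.BytesIO()  # stand-in for the on-disk pack file
        self.index: Dict[bytes, Tuple[int, int]] = {}
        self.index_lookups = 0

    def append(self, value: bytes, indexed: bool = True) -> StructuredKey:
        digest = hashlib.sha256(value).digest()
        offset = self._data.seek(0, io.SEEK_END)
        self._data.write(value)
        if indexed:
            self.index[digest] = (offset, len(value))
        return StructuredKey(digest, offset, len(value))

    def _read_at(self, offset: int, length: int) -> bytes:
        self._data.seek(offset)
        return self._data.read(length)

    def find_by_hash(self, digest: bytes) -> bytes:
        # Indirect access: one index lookup, then one read.
        self.index_lookups += 1
        offset, length = self.index[digest]
        return self._read_at(offset, length)

    def find_by_key(self, key: StructuredKey) -> bytes:
        # Direct access: the key already says where the value lives.
        return self._read_at(key.offset, key.length)


pack = PackFile()
commit_key = pack.append(b"commit-1")
blob_key = pack.append(b"blob-42", indexed=False)  # reachable only via its key
```

In this sketch, a parent object would store the structured keys of its children, so reaching a child never touches the index.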

With the structured keys approach, we can also choose which objects to index. Note that the more objects are indexed, the slower the index library works; however, indexing fewer objects can lead to an increased size on disk. We started benchmarking different indexing strategies to find an optimal one.
We’ve fixed a few bugs and have a better understanding of some other issues we need to fix before deploying this feature (for instance in repr, we need to fix the function overriding mechanism). Our initial benchmarks showed that two indexing strategies could work:

  1. Never index nodes and blobs, only commits
  2. Always index all objects

Next, we’ll print stats on the stores obtained after bootstrapping in order to decide which of these indexing strategies to adopt.
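One way to picture the two candidate strategies is as predicates over the kind of object being written. The Python sketch below uses hypothetical names (index_always, index_commits_only, index_entries) and is only meant to show how the choice of strategy changes the number of index entries.

```python
from typing import Callable, Iterable

# A strategy decides, per object kind ("commit", "node" or "blob"),
# whether the object gets an entry in the index.
Strategy = Callable[[str], bool]


def index_always(kind: str) -> bool:
    """Strategy 2: always index all objects."""
    return True


def index_commits_only(kind: str) -> bool:
    """Strategy 1: never index nodes and blobs, only commits."""
    return kind == "commit"


def index_entries(kinds: Iterable[str], strategy: Strategy) -> int:
    """Count how many index entries a sequence of writes would create."""
    return sum(1 for kind in kinds if strategy(kind))


writes = ["commit", "node", "node", "blob", "blob", "blob"]
entries_all = index_entries(writes, index_always)
entries_minimal = index_entries(writes, index_commits_only)
```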

In addition to the work on the structured keys feature, we added tests for the LRU in index and experimented with cachecache (a more memory-efficient LRU) in irmin-pack. We are also investigating mmap as a replacement for pread and pwrite. This required adding msync calls to ensure that the in-memory data reaches the disk, but these calls slowed down the overall performance of index.

Relevant links: irmin/1510, irmin/1534, mirage/irmin#1534, mirage/repr#82, mirage/irmin#1659, mirage/irmin#1601, mirage/index#366

Record/Replay of the Tezos Bootstrap Trace
We published a blog post describing the record/replay framework we’ve been working on to build a reliable benchmark infrastructure for Tezos storage.

We made considerable progress on this feature: the summary generation from a trace works, and the benchmarking results can now be pretty-printed. The code doing the recording must be part of the Tezos codebase, as we are recording the API of lib_context, but we’ll also add the code that does the replay to Tezos to avoid copying and maintaining it as a separate library.
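The record/replay idea can be sketched generically in Python: wrap a store so that every API call is appended to a trace, then re-run the trace against a fresh store and summarise it. The classes and functions below (DictStore, Recorder, replay, summarise) are hypothetical and unrelated to the actual lib_context API.

```python
import json
from collections import Counter
from typing import Any, Dict, List


class DictStore:
    """Tiny key/value store standing in for the recorded API."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self.data[key] = value

    def get(self, key: str):
        return self.data.get(key)

    def remove(self, key: str) -> None:
        self.data.pop(key, None)


class Recorder:
    """Forward every call to the wrapped store and log it in `trace`."""

    def __init__(self, store: Any) -> None:
        self._store = store
        self.trace: List[Dict[str, Any]] = []

    def __getattr__(self, name: str):
        operation = getattr(self._store, name)

        def wrapped(*args):
            result = operation(*args)
            self.trace.append({"op": name, "args": list(args), "result": result})
            return result

        return wrapped


def replay(trace: List[Dict[str, Any]], store: Any) -> int:
    """Re-run a trace on `store`; return how many results differ from the recording."""
    mismatches = 0
    for event in trace:
        result = getattr(store, event["op"])(*event["args"])
        if result != event["result"]:
            mismatches += 1
    return mismatches


def summarise(trace: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count operations by name, as a minimal form of trace summary."""
    return dict(Counter(event["op"] for event in trace))


recorded = Recorder(DictStore())
recorded.set("balance/alice", "10")
recorded.get("balance/alice")
recorded.remove("balance/alice")
saved_trace = json.loads(json.dumps(recorded.trace))  # a trace can be written to disk
```

Because the trace is plain data, the same recording can be replayed against different versions of the storage layer to compare them under an identical workload.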

Relevant PRs: tarides/tezos/16, index/353.

Alternative Index Implementation
We continued our investigation into different alternatives to index. We implemented two OCaml libraries for benchmarking against the current index performance: a simple SQLite index (kv-lite) and a Kyoto Cabinet index (kv-hash). Our approach was to keep part of index as-is because the write-ahead log has good performance and has been intensively tested, but we changed the data file in index to replace merges with more performant bulk writes.

The kv-lite performs well until it reaches 800 merges. From that point, the overall performance is poor.

The kv-hash seems more promising, but it’s still problematic when it reaches 800 merges because we’re using mmap. The 800-merge point corresponds to memory being full, and that’s when mmap flushes everything to disk. We’re looking into using msync periodically (to prevent one very long flush) or using I/O syscalls.
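The periodic msync idea can be illustrated with Python’s mmap module, whose flush method writes the mapping’s changes back to the file. The function write_records and its parameters are hypothetical; the sketch only shows flushing every few writes instead of leaving one large flush to the end.

```python
import mmap
import os
import tempfile
from typing import List


def write_records(path: str, records: List[bytes], record_size: int,
                  flush_every: int) -> int:
    """Write fixed-size records through mmap, flushing every `flush_every` writes.

    Returns the number of flushes performed. `records` must be non-empty.
    """
    size = len(records) * record_size
    with open(path, "wb") as f:
        f.truncate(size)  # the file must have its final size before mapping
    flushes = 0
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as mapping:
        for i, record in enumerate(records):
            if len(record) > record_size:
                raise ValueError("record too large")
            start = i * record_size
            mapping[start:start + record_size] = record.ljust(record_size, b"\0")
            if (i + 1) % flush_every == 0:
                mapping.flush()  # many small flushes instead of one long one
                flushes += 1
        mapping.flush()  # final flush for any remaining writes
        flushes += 1
    return flushes


demo_path = os.path.join(tempfile.mkdtemp(), "data.bin")
flush_count = write_records(demo_path, [b"a", b"bb", b"ccc", b"dddd", b"e"], 4, 2)
```

The flush interval is a tuning knob: flushing too often costs throughput, while flushing too rarely brings back the long pause.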

We benchmarked an alternative to index using rocksdb, and the results are similar to what we already have: good initial performance that degrades on large stores. Our most promising alternative to index is an OCaml reimplementation of Kyoto Cabinet, ocaml-kc. This implementation uses Unix syscalls for reads and writes instead of mmap. To increase the speed, we added some caching. While benchmarking this library, we fixed some bugs in the code as well.

The kv-hash is almost ready for the first release. We’ve created an issue with what is still missing.

Relevant PRs: tarides/tezos/16, index/353.

Merkle Proofs

Optimistic Rollups

In 2022, Tezos plans to add optimistic rollups to the protocol. To support this, we are adding support for efficient Merkle proofs in Irmin.

With the release of Irmin 2.8.0, we revisited Irmin’s support for Merkle proofs. Trying to add nodes under a shallow node in a Merkle tree now throws an exception, where it used to be silently ignored. While doing this, we also discovered and fixed some bugs in Irmin.

We proposed an API for compact Merkle proofs that can be generated from a sequence of batched operations in the store. The API supports two ways of using Merkle proofs in Tezos. Traditional Merkle proofs are Merkle trees in which subtrees may be replaced by their hash; the proof characterises the substructure of the tree that has not been hidden. A second approach builds partial Merkle proofs while applying the batched sequence of operations. The partial Merkle proofs contain a minimal initial tree and the full history of reads, from which the root node hash can be recomputed on demand. These changes will be part of the upcoming Irmin 2.10 release.
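The first kind of proof, a tree in which subtrees are replaced by their hash, can be sketched in Python as follows. The hashing scheme, the Hidden class, and the functions tree_hash, prove, and verify are assumptions made for the example; they do not reflect Irmin’s encoding or its proposed API.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Hidden:
    """A subtree that was replaced by its hash in a proof."""
    hash: bytes


Tree = Union[bytes, Hidden, Dict[str, "Tree"]]


def tree_hash(tree: Tree) -> bytes:
    """Hash a blob, a directory, or a hidden subtree (toy scheme)."""
    if isinstance(tree, Hidden):
        return tree.hash
    if isinstance(tree, bytes):
        return hashlib.sha256(b"blob:" + tree).digest()
    h = hashlib.sha256(b"node:")
    for name in sorted(tree):
        h.update(name.encode() + b"\0" + tree_hash(tree[name]))
    return h.digest()


def prove(tree: Tree, path: List[str]) -> Tree:
    """Keep the branch along `path`; replace every sibling subtree by its hash."""
    if not path or not isinstance(tree, dict):
        return tree
    head, rest = path[0], path[1:]
    return {
        name: prove(child, rest) if name == head else Hidden(tree_hash(child))
        for name, child in tree.items()
    }


def verify(proof: Tree, root_hash: bytes) -> bool:
    """A proof is valid if it hashes to the expected root hash."""
    return tree_hash(proof) == root_hash


context = {"contracts": {"alice": b"10", "bob": b"20"}, "votes": {"p1": b"yes"}}
root = tree_hash(context)
proof = prove(context, ["contracts", "alice"])
```

A verifier that only knows the root hash can check the value at contracts/alice from the proof alone, without access to the hidden subtrees.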

Relevant links: mirage/irmin#1583, mirage/irmin#1537, mirage/irmin#1583, mirage/irmin#1621


Publish Irmin/Tezos Performance Benchmarks
We worked on a monitoring framework for a Tezos node that reports both system stats (e.g., memory usage) and metrics from Irmin (e.g., the number of node reads from disk). A recorder exposes these stats through Tezos RPC and feeds them to Prometheus and then Grafana. We set up a Tezos node on an AWS instance to continuously monitor the node and gather Irmin-specific metrics.

We noticed a performance regression in one of our monthly benchmarks that is not explained by changes to the code. Apparently, either a change in the machine used or in the dependencies is responsible for the performance drop. We plan to better track dependencies using exports of the Opam switch or Opam lock files.

We also plan to generate monthly benchmarks automatically by spawning Equinix machines when necessary.

Relevant links: mirage/irmin#1574

Continuous Benchmarking Infrastructure
We spent time debugging an issue in the CB framework. There’s considerable variation between different runs of the same benchmark. We also continued the framework maintenance and debugged other smaller issues.

For the CB framework, we fixed an issue where several data points were reported for the same commit. This was an issue with ocurrent running the same job again, even when cached.

Currently, the Docker DSL in ocurrent specifies the steps needed to run benchmarks, but for different repositories, the packages might differ. It might be necessary to pin some packages for the benchmarks to work. Hence, we plan to introduce a configurable Dockerfile to plug into the pipeline, which will unblock the Multicore benchmarks and is also a step forward towards adapting the CB to Tezos.

We also worked towards a local setup for the database and pipeline installation. This allowed easier debugging of the cpuset configurations, and it improved the front-end development because we can now perform local testing.

We addressed bugs involving the Irmin benchmarks and GitHub webhook secrets, and we introduced the ability to track a job status in the database. We improved the frontend for cancelled builds and for updating the status when the pipeline builds.

We improved the response time of ocurrent-bench by no longer rerunning open PRs older than two weeks when redeploying the CB, as it sometimes took over a day to rerun all benchmarks. We also made improvements to the production deployment, including testing and automating the DB migrations, and improved pipeline development by adding tests and faster builds.

Relevant links: ocurrent/current-bench#162, ocurrent/current-bench#11, ocurrent/current-bench#158, ocurrent/current-bench#200, ocurrent/current-bench#203, ocurrent/current-bench#211

Use OCluster to Schedule Jobs on Different Machines
Currently, the CB framework only runs benchmarks on one machine. To scale up and allow Tezos benchmarks to run on customised machines (like Raspberry Pis), we want to use OCluster to dispatch the jobs to different workers. We drafted a proposal and started working on this and on integrating OCluster into the ocurrent-bench pipeline.

Relevant links: ocurrent/current-bench#213, ocurrent/current-bench#216.

Improve Tezos Interoperability

Storage Appliance
We worked towards providing replication for irmin-server and added a watch command, similar to what exists in irmin-unix.

Rust and Python Bindings for Irmin
We finished the first version of the Irmin bindings for Rust. Python bindings are also now available.
The repo for the Rust/Python bindings can be found on GitHub.

Relevant links: mirage/irmin-server#18, zshipko/libirmin.

Irmin Documentation is Online and Up-to-Date
We improved Irmin’s README and worked on the documentation at irmin.org.

Maintain Tezos’s MirageOS Dependencies

Respond to and Track Issues Reported by Tezos Maintainers
Using our mechanism of recording a trace in a Tezos node, we helped Nomadic Labs debug an issue related to a snapshot export that caused corrupted stores when importing the snapshot and applying reconstruct. We reduced the memory usage of the snapshot export from ~10GB to ~4GB by optimising the data structure used. As a consequence, the export is also faster, going from 13.5 minutes to 8.5 minutes. The graph below shows the original memory usage of the export (in blue) vs. the optimised one (in purple).

We’ve investigated several issues related to storage posted on the Tezos GitLab.

Relevant links: tezos/1679, tezos/1729, tezos/3526, tezos/tezos#2043, tezos/tezos#2044, tezos/tezos#2066, tezos/tezos#2075

General Irmin Maintenance
We fixed a performance issue in the call to Store.commit, which is included in the next release (Irmin 2.9.0).

We experimented with refactoring Irmin’s CLI to have one that isn’t dependent on irmin-unix and that can be used with the irmin-pack backend. In particular, we wanted to use the CLI on Tezos stores and make it more uniform across backends. Each backend will register a plugin to the Irmin CLI, which will run built-in subcommands in addition to user-provided plugins. We’re investigating how different backends can be added as plugins to a more general CLI tool.

We released Repr 0.5.0, which contains a simpler API, and Tezos-base58 1.0.0, a self-contained package for Base58 encoding used by Tezos. The latter allowed us to simplify Irmin’s dependencies, which formerly required git submodules. Irmin now has a new package, irmin-tezos, which contains all the necessary modules to instantiate a Tezos store from Irmin. We’re still working on adding irmin-tezos as a backend. The CLI now exposes a watch command that calls another program (or writes diffs to a long-running program) when updates in the store occur. We’ve also completed several more assorted CI fixes for repositories in the Irmin project.

We can now build Irmin with dune-universe! This required a few PRs to ocaml-inotify, and we added it to the overlays repo dune-universe/opam-overlays.

Finally, we released Index 1.5.0 and Irmin 2.9.0. Our current releases live on a separate branch from the main development. This implies that for every release, we must forward-port changes to main.

Relevant links: mirage/irmin#1553, irmin/1543, mirage/repr#77, mirage/repr#83, mirage/irmin#1579, mirage/irmin#1608, dune-universe/opam-overlays#97, mirage/irmin#1612, ocaml/opam-repository#19972, ocaml/opam-repository#20033

Maintenance of MirageOS Libraries Used by Tezos
We added Js_of_ocaml support to Alcotest and released Digestif 1.1.0, ezjsonm 1.3.0, and ezjsonm-lwt 1.3.0.

Relevant links: mirage/alcotest#326, ocaml/opam-repository#20002

Verify Existing Bits of the Stack Using Gospel
We prepared for the Gospel and Ortac release announcement by improving much of the documentation, polishing the code, and adding tests. We presented Ortac at RV’21 and started refactoring to introduce an intermediate representation. This allows the implementation of an ortac report command to help developers write executable specs.

We also worked on an ANR proposal for a Gospel ecosystem and on gospel-lru, a fully verified LRU that allocates significantly less than Irmin’s current LRUs.

We spent time getting Gospel to compute the mutability of an OCaml type. Mutability now has four statuses: Mutable, Immutable, Unknown, and Dependant (waiting for an instantiation of an alpha). We also worked on a PPX that derives Monolith generators to replace some code in the Monolith frontend (and decrease technical debt, as the PPX will depend on the OCaml AST, not on the less stable Gospel AST).

Relevant links: ocaml-gospel/ortac/, ocaml-gospel/gospel/, gospel-lru, ocaml-gospel/ortac#36, ocaml-gospel/ortac#37