Tezos Storage / Irmin 2021 Update

samoht · July 6, 2021, 12:25am

Irmin is an OCaml library for building mergeable, branchable distributed data stores. You can choose the way you store information with this open-source database, originally written for use by MirageOS unikernels. Whether it’s in-memory or on your hard disk, you can choose where to store your data. With its built-in snapshotting and git compatibility, you can ensure it’s safe and recoverable in the event of a crash. Plus, you can customize data types and run it on Linux or web browsers, among other places, because it’s so portable!

Irmin was released in 2014 as a component for MirageOS. The first Tezos implementation (now dubbed “Octez”) has used Irmin as a solution to run things more quickly and securely since its conception. Initial prototypes used irmin-git, then they progressed through irmin-leveldb, irmin-lmdb, and most recently irmin-pack, deployed in 2019 specifically for Tezos’s use-case. We’re continuing our collaboration with Tezos developers to create a solution to reduce disk size and increase speed for their blockchain model.

Tezos nodes currently have two storage areas: store and context. The store keeps metadata about protocols, chains, and validated blocks. The context is a Unix-like directory and file structure that uses version control to store blockchain data such as contract and roll values, which is a perfect fit for Irmin.

This summary contains our progress from January - May 2021. Moving forward, we will release regular reports that will strive to communicate our goals and accomplishments throughout the development of Irmin—improving functionality and dynamic, reliable tools—to benefit the overall OCaml and Tezos community. Our objective is to ensure our community has access to excellent tools that enable them to be more efficient and empower their programmers to write complex, secure code.

Main focus: the need for speed

In partnership with DaiLambda, our initial focus for Q1 2021 was to increase the speed of the Baking Account migration process, and we managed to make it 10x faster. We also collaborated with the Tezos team to merge requested changes upstream. Although it’s not yet available in Florence, the changes necessary to make it work faster has been incorporated into Irmin 2.3.0 and 2.4.0. Please read more about the Irmin release used by the Florence protocol on our blog.

Octez uses Irmin for storing blockchain state. To get a better idea of the pattern of storage actions, we added a bootstrapping trace record feature to Octez, which records all the Irmin operations during the bootstrap of a Tezos node, and then we wrote an independent benchmark to replay that trace using only Irmin’s API (without involving Octez code at all). This trace consists of the context operations done during the bootstrap and is currently available in our private Octez fork; however, we plan to upstream these changes over the coming months to make it easy for everyone to generate storage benchmarks that we could replay and optimize. This is similar to the record-replay feature that is available with Tezedge.

We used this new record/replay feature to evaluate performance by measuring metrics during the replay and then generate a stat trace to correspond with the bootstrap trace, which contains various statistics for every commit in the trace. At the end of the replay, a summary of all stats is computed. Using these summaries along with the stat traces, we can compare different runs of the trace (e.g., for every release of Irmin) and print them for distribution and review. Thanks to this, we detected a performance issue with the remove operation, so we addressed it by making that operation 6x faster and created additional functionalities.

We implemented this record/replay for most of the trace, and we are planning to have it fully functional by the end of Q3. We are also working with TezEdge to share a common trace format, and we are looking at how to start replays from a non-empty store.

Context hash documentation

Every node participating in the Tezos protocol consensus should agree on the blockchain’s content. In order to do so, the Tezos storage computes a short, unique identifier: a BLAKE2b hash of the Merkle tree derived from the current state of the contracts, the rolls, etc. Until now, the way to compute this Merkle hash was hidden in the Irmin codebase, so it was difficult for other teams to reproduce independently.

In order to solve this, we created a new repository with the documentation on how context hashes are computed in Tezos. This repository also contains a reference implementation and a randomly generated context dataset for testing the documentation.

As we focused on ways to optimise how large directories were serialised, we had to modify and document how the context hash was computed as well. In order to do so, we worked closely with TezEdge and DaiLambda to improve the support for large directories on both Octez and Tezedge, which is a prerequisite for a process called flattening that strives to make large directories more naturally efficient.

Optimizing large directories

Serializing immutable nodes is not as simple as serializing immutable contents. In fact, nodes might contain an arbitrarily large number of children, so serializing them as a long list of references might harm performance. This includes loading and writing a large amount of data for each modification, no matter how small this modification might be. Similarly, browsing the tree means reading large blocks of data, even though only one child is needed.

For this reason, last year we implemented a Patricia Tree representation of internal nodes that allows us to split the child list into smaller parts that can be accessed and modified independently, while still being quickly available when needed. This reduced duplication of tree data in the Irmin store and improved disk access times. This implementation is inspired by Unix-style inodes and optimised the way data is stored and accessed in Tezos’s context stores.

In 2021, Tarides has continued to improve the performance of Irmin inodes by working closely with DaiLambda, which created a similar inode implementation called Plebeia last year. We are now working closely together to integrate Plebeia’s features into Irmin.

The node proxy and light clients use an in-memory context database. We thus started working on an irmin-pack in-memory backend which supports large directories and is able to compute the same context hash as the on-disk context database.

Path flattening

Thanks to the improved support for large directories in the storage backend, it is now possible to optimise the data schema of Tezos’s context. This optimisation, created in conjunction with DaiLambda, is called path flattening. Using the bootstrap trace, DaiLambda and Tarides conducted benchmarks with path flattening which resulted in making the bootstrapping 30% faster; however, flattening the store introduced very large directories in the database and thus required the optimisations for large directories, described in the previous section.

We considered taking advantage of the flattening migration to introduce other hash-sensitive changes to irmin-pack, for which there might be a performance improvement (initial benchmarks showed promising results). For this, we need to add support for multiple hash versions in Tezos and Irmin.

Compact Merkle proofs

We changed the representation of lazy trees in Irmin to support TweagIO’s implementation of Merkle proofs, which are the original Authenticated Data Structure (ADS). These ADSs send proofs that the data came from the correct and expected database. This strategy diverts the need to have a full copy of the database through hashes, so the “proof” is confirmed with these hashes. This paves the way for Tezos light clients in the future.

Improved disk I/O usage

Following our ongoing partnership with Nomadic Labs, we have set up some benchmarking scenarios for the layered store, a new storage system that merges two existing stores, and added some features to improve its usability (e.g., removing certain temporary files). Thanks to these benchmarks, we better understand how to obtain and keep a self-contained upper in the layered store. The benchmarks run successfully, and we’re now exploring ways to solve a performance issue that occurred when committing during a freeze.

In collaboration with the Multicore team, we are also experimenting with an asynchronous I/O framework called io_uring. The initial tests with io_uring resulted in faster I/O performance! We’re experimenting with utilizing io_uring to ensure that Irmin and Tezos can take advantage of asynchronous I/O as soon as it’s available.

Lastly, we’ve been working on integrating a new B-tree index prototype with irmin-pack and Tezos, so that we can use the bootstrap trace to benchmark them accurately. Initial benchmarking suggests some platform variance (worse performance on MacOS vs. Linux), so we are investigating this for future iterations.