Continuously profiling and monitoring the Tezos platform

sadiqj · July 20, 2021, 3:21pm

Hello everyone. Sadiq (@sadiqj) and Richard (@richardwarburton) from Opsian here. We thought we’d introduce the project we’re working on, which is supported by a grant from the Tezos Foundation.

Why continuously profile and monitor Tezos?

There’s a load of work going on in the Tezos and OCaml ecosystems at the moment to improve performance and allow the Tezos network to scale to dramatically higher transactions-per-second.

The problem this project sets out to address is: how do we decide what to optimise, and how do we measure the improvements from that work? The nodes across the network exercise different parts of the codebase and run on varying hardware, so relying on benchmarking alone gives an unrepresentative picture.

What this project intends to do

The goal of this project is to build the underlying technologies to make continuously collecting performance information from all Tezos nodes practical. This means developers can see the actual behaviour of the Tezos codebase across the network and make better decisions on what to optimise.

Progress so far

We’ve separated the project into two parallel tracks. The first is to gather line-level profiling data continuously and the second is to modify OCaml itself so we’re able to extract fine-grained runtime data, such as the Garbage Collector events.

Profiling

We’ve finished the research prototyping stage where we evaluated many different ways of doing low-overhead continuous profiling in OCaml. Our chosen approach enables gathering line-level profiling data with <1% overhead - making it suitable for running continuously on a mainnet node. There’s still a lot of work to be done on testing and integration, as well as profiling other resources (like memory and IO).

Runtime metrics

For runtime metrics we have a prototype fork of OCaml that maintains a lightweight ring buffer with runtime events from the Garbage Collector (such as event timings and heap usage). The design is intended to be extendable to dynamic user events in the future and there will be a lot more written up about this pretty soon.
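
To give a feel for the ring buffer idea, here is a minimal sketch, written in Python for brevity rather than in the runtime itself. It is not the prototype's actual layout or API: the `Event` fields, the `EventRing` class and the `minor_gc_end` event name are all hypothetical, chosen only to illustrate the general technique of a fixed-size buffer where new events overwrite the oldest ones, so the writer never blocks or allocates more space.

```python
import time
from collections import namedtuple

# Hypothetical event record; the field names are illustrative only.
Event = namedtuple("Event", ["timestamp_ns", "kind", "value"])


class EventRing:
    """Fixed-capacity ring buffer: the newest events overwrite the oldest."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._capacity = capacity
        self._written = 0  # total number of events ever written

    def write(self, kind, value, timestamp_ns=None):
        # Writing is a single slot store plus a counter bump.
        if timestamp_ns is None:
            timestamp_ns = time.monotonic_ns()
        slot = self._written % self._capacity
        self._slots[slot] = Event(timestamp_ns, kind, value)
        self._written += 1

    def snapshot(self):
        """Return (retained events, oldest first; number of events overwritten)."""
        retained = min(self._written, self._capacity)
        start = self._written - retained
        events = [self._slots[i % self._capacity]
                  for i in range(start, self._written)]
        return events, start


# Usage: six events into a four-slot ring, so the two oldest are lost.
ring = EventRing(capacity=4)
for i in range(6):
    ring.write("minor_gc_end", value=i, timestamp_ns=i)

events, dropped = ring.snapshot()
print([e.value for e in events], "dropped:", dropped)
```

The trade-off in a design like this is that a slow reader loses old events instead of slowing down the writer, which is usually what you want for something running continuously in production.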

Demo

Who doesn’t love a demo?

Getting started

The open source OCaml profiling agent is coming along but there’s still lots more work to do on testing and integration (e.g. getting opam packages up, having CI up through GitHub Actions). We should be at the stage soon where we’re able to help people integrate the agent into their projects and go through reports with them. If you think you’d like us to give you a poke when it’s ready then drop Sadiq an email at sadiq@opsian.com.

Questions?

Happy to answer any that people have.