Tezos is a self-amending blockchain technology that lets anybody prepare and publish a proposal for a future network protocol. Once it passes a five-stage governance process, the new protocol will automatically go live across the network.

As more services, applications and users rely on the data correctness and high availability of our APIs and products, we have been investing time in better infrastructure testing, in particular testing the impact of a real mainnet upgrade. We would like to share our experiences and a guide for performing such tests yourself.

This guide is meant for Tezos bakers and people who are generally interested in the Tezos protocol. In a nutshell, we will cover the exact steps that let you:

  • download, patch, compile and run a special test version of the Tezos node, called a yes-node, from source
  • import a rolling snapshot to create a so-called rolling node
  • run the protocol migration by manually baking new blocks

We walk you through this process step by step and provide command line examples so you can follow along. To get started, you need a laptop or server with about 20 GB of free disk space, as well as a good internet connection, since you need to download a 4 GB snapshot file plus all sources and dependencies.

Why Protocol Migration Testing

When migrating to a new Tezos protocol, a new context is produced by applying migration operations to the previous context (context = file system of Tezos blockchain state). This process is sometimes simple and fast, as in Florence and Edo, but it can also be data- and compute-intensive, as in the upcoming Hangzhou migration (see Marigold's blog post for details). Hangzhou needs to rewrite large data structures, which can take anywhere from 10 minutes to 3 hours.

Testnets are one piece of our testing regime. They let us check in an isolated environment whether our software works as expected with new RPC endpoints and protocol features. For this reason we run our own testnet nodes, including testnet bakers, from day one. Tezos core developers make sure that Tezos testnets perform a protocol migration early on, which is very helpful because we can run our tests on small datasets first.

Recent experiences have taught us, however, that this is not enough. We also need to test the migration on a "real" mainnet context, i.e. the entire state imported from mainnet, because its larger size can lead to unexpected performance and timing issues that won't be visible on a testnet. We want to see how the size and complexity of a mainnet context affect the migration and our ability to pull data from the RPC during the migration and afterwards.

So how do you simulate this kind of migration on real mainnet data?

Preparing a Yes Node for testing

As part of this process we will download a snapshot from a trusted source like https://mainnet.xtz-shots.io. Visit the page and copy the link to the Rolling Snapshot. We will need the link in step 6 and the snapshot's block number in step 3.

When writing this guide, the most recent available snapshot was from block 1,895,269. For you this will most certainly be different, so you need to adjust steps 3 & 6 accordingly.

https://mainnet.xtz-shots.io/tezos-mainnet-1895269.rolling

We run our examples inside Docker using Alpine Linux and an external volume to keep our state across container restarts. Apart from the installation of utilities and compilers, the instructions should work in a similar way on other Linux distros and even on macOS with Homebrew.

0. Create a Docker volume, start a vanilla Alpine container, and install packages

# create a new volume to hold node and wallet data
docker volume create tezos_migration

# start an Alpine container and attach our volume to /data
docker run --name yes-node -it --rm -v tezos_migration:/data alpine sh

# inside the container, update package descriptions
apk update && apk upgrade

# and install a few packages we need to build Tezos
apk add patch unzip make gcc m4 git g++ aspcud bubblewrap \
  curl bzip2 rsync libev-dev gmp-dev pkgconf perl bash \
  hidapi-dev binutils-dev ocaml opam cargo libffi-dev \
  zlib-dev ncurses-dev openssl-dev jq

1. Check out the Tezos v11 branch and prepare all dependencies

git clone --depth 1 -b v11.0 https://gitlab.com/tezos/tezos
cd tezos
opam init --bare --disable-sandboxing --no-setup --yes
make build-deps
eval $(opam env)

2. Next, apply the yes-node patch and create a yes-wallet. The yes-node is a special version of Tezos that accepts any private key for signing baked blocks. Further down we still need to bake as the baker who owns the rights for a given block.

patch -p1 < scripts/yes-node.patch
dune exec scripts/yes-wallet/yes_wallet.exe -- create minimal \
  in /data/yes-wallet

3. Run the user_activated_upgrade.sh shell script to trigger the upgrade of the protocol 2 blocks after the snapshot level. In our case that is 1895269 + 2 = 1895271.

./scripts/user_activated_upgrade.sh src/proto_011_PtHangz2 1895271
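
If you repeat the test with different snapshots, the level arithmetic can be scripted so the snapshot level is typed only once. This is a minimal sketch that assumes the snapshot file name follows the tezos-mainnet-<level>.rolling pattern of our download link. The variable names are our own and not part of any Tezos tooling.

```shell
# Snapshot URL copied from the snapshot page (adjust to your snapshot)
SNAPSHOT_URL="https://mainnet.xtz-shots.io/tezos-mainnet-1895269.rolling"

# Keep only the file name, then strip the prefix up to the last "-"
# and the ".rolling" suffix (assumes the naming pattern shown above)
SNAPSHOT_FILE="${SNAPSHOT_URL##*/}"
SNAPSHOT_LEVEL="${SNAPSHOT_FILE##*-}"
SNAPSHOT_LEVEL="${SNAPSHOT_LEVEL%.rolling}"

# The upgrade is triggered 2 blocks after the snapshot level
ACTIVATION_LEVEL=$((SNAPSHOT_LEVEL + 2))

echo "snapshot level:   $SNAPSHOT_LEVEL"
echo "activation level: $ACTIVATION_LEVEL"
```

You could then pass "$ACTIVATION_LEVEL" to user_activated_upgrade.sh instead of the literal number.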

4. Now compile the patched Tezos node and CLI tools

make

5. Once that's done, create a temporary data dir for your yes-node

mkdir /data/yes-node

6. Now is the time to import the snapshot from above (this will take 25+ minutes, so feel free to fetch a coffee now)

# download
wget -P /data https://mainnet.xtz-shots.io/tezos-mainnet-1895269.rolling 

# and import the snapshot (we skip validation checking to save time)
./tezos-node snapshot import \
  --data-dir /data/yes-node/ \
  --no-check \
  /data/tezos-mainnet-1895269.rolling 

6.1 (optional) For your convenience and if you have disk space, make a backup copy of the folder. This way, you can easily perform the process multiple times.

cp -r /data/yes-node /data/yes-node-backup
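
To start another run from the backup, the used data dir has to be replaced with a fresh copy. A small helper for this could look as follows. restore_node is a hypothetical function name of our own, and the sketch assumes the node is stopped while it runs.

```shell
# restore_node BACKUP_DIR NODE_DIR
# Replaces NODE_DIR with a fresh copy of BACKUP_DIR (hypothetical helper).
# The body runs in a subshell, so its variables do not leak into your shell.
restore_node() (
  backup=$1
  node=$2
  # refuse to delete anything if the backup is missing
  [ -d "$backup" ] || { echo "no backup at $backup" >&2; exit 1; }
  rm -rf "$node"
  cp -r "$backup" "$node"
)

# usage, with the node stopped:
# restore_node /data/yes-node-backup /data/yes-node
```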

7. Start your patched Tezos node without any peer connections.

./tezos-node run --connections 0 \
  --data-dir /data/yes-node \
  --rpc-addr localhost 
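
Once the node is up, you can check that it serves the imported state by asking the RPC for the current head. The sketch below assumes the node listens on the default RPC port 8732, i.e. that you did not pass a different port to --rpc-addr. NODE_URL and head_header are names we made up for this example. Run it from a second terminal inside the container.

```shell
# RPC base URL; 8732 is assumed to be the RPC port of your node
NODE_URL="${NODE_URL:-http://localhost:8732}"

# head_header: print the header of the current head block as JSON
head_header() {
  curl -s "$NODE_URL/chains/main/blocks/head/header"
}

# usage (jq was installed in step 0):
# head_header | jq .level
```

Right after the import, the reported level should match the level of your snapshot.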

Now let's bake some blocks!

Running the Manual Migration Test

This part of the guide requires a second terminal. In Docker we can simply exec into the running container:

docker exec -it yes-node sh

1. We need to bake 2 blocks for the migration to start. To do that, we can use the pre-defined baker aliases foundation1 to foundation8. If one of them does not work, i.e. you get an error telling you there are no baking rights, pick another baker. Let's bake a test block:

# first change to the tezos build directory
cd tezos

# then bake a test block to check everything works
./tezos-client -d /data/yes-wallet bake for foundation1 --minimal-timestamp
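
If you do not want to pick the baker by hand, a small loop can try the aliases one after another. This is only a sketch: bake_with_any and the TEZOS_CLIENT variable are our own hypothetical names, and the loop moves on after any failure, not just after a missing baking right.

```shell
# Path to the client binary (hypothetical variable, override if needed)
TEZOS_CLIENT="${TEZOS_CLIENT:-./tezos-client}"

# bake_with_any WALLET_DIR: try foundation1 to foundation8 in turn and
# stop at the first alias that manages to bake a block
bake_with_any() (
  wallet=$1
  for i in 1 2 3 4 5 6 7 8; do
    baker="foundation$i"
    if "$TEZOS_CLIENT" -d "$wallet" bake for "$baker" --minimal-timestamp; then
      echo "baked with $baker"
      exit 0
    fi
    echo "$baker could not bake, trying the next alias" >&2
  done
  # none of the aliases worked
  exit 1
)

# usage, from the tezos build directory:
# bake_with_any /data/yes-wallet
```

Each successful call bakes exactly one block, so it behaves like the single command above.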

2. The magic happens when you bake the second block. The client will only return after the block has been processed, which includes the entire protocol migration. Don't abort the command. Optionally, put time in front of the command to measure its runtime.

Note: Timing may be a bit tricky here. Sometimes you have to wait a bit until the second block is accepted. If you see a "timestamp in the future" error, delete the file /data/yes-wallet/blocks before you try again!

time ./tezos-client -d /data/yes-wallet bake for foundation1 --minimal-timestamp
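
The retry from the note above can also be automated. The following sketch assumes the client prints the text "timestamp in the future" when it hits this error, as described in the note. bake_retry, TEZOS_CLIENT and RETRY_DELAY are hypothetical names of our own.

```shell
TEZOS_CLIENT="${TEZOS_CLIENT:-./tezos-client}"   # path to the client binary
RETRY_DELAY="${RETRY_DELAY:-30}"                 # seconds between attempts

# bake_retry WALLET_DIR BAKER [MAX_TRIES]
bake_retry() (
  wallet=$1
  baker=$2
  tries=${3:-5}
  n=1
  while [ "$n" -le "$tries" ]; do
    # capture stdout and stderr so the error text can be inspected
    if out=$("$TEZOS_CLIENT" -d "$wallet" bake for "$baker" --minimal-timestamp 2>&1); then
      printf '%s\n' "$out"
      exit 0
    fi
    printf '%s\n' "$out" >&2
    # only clean up and retry for the error described in the note
    case $out in
      *"timestamp in the future"*) rm -f "$wallet/blocks" ;;
      *) exit 1 ;;
    esac
    sleep "$RETRY_DELAY"
    n=$((n + 1))
  done
  exit 1
)

# usage:
# time bake_retry /data/yes-wallet foundation1
```

Because the output is captured, the client's messages only appear once it returns. The node log in terminal 1 still shows the migration progress in the meantime.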

Now take a walk or take another sip of coffee.

Meanwhile, your node logs in terminal 1 should show the following output:

node.protocol: 011-PtHangz2: flattening the context storage: this operation may take several minutes

and after the process is complete:

node.protocol: 011-PtHangz2: context storage flattening completed

(...)

node.store: the protocol table was updated: protocol PtHangz2aRng (level 11) was
node.store:   activated on block BL...
node.store:   (level 1895271)
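
Besides reading the logs, you can ask the node which protocols its head block reports. The sketch below again assumes the default RPC port 8732. head_mentions_protocol is a hypothetical helper that only checks whether the given hash prefix shows up as "protocol" or "next_protocol" in the RPC response.

```shell
# RPC base URL; 8732 is assumed to be the RPC port of your node
NODE_URL="${NODE_URL:-http://localhost:8732}"

# head_mentions_protocol PREFIX: succeed if the head block lists a protocol
# hash starting with PREFIX, either as "protocol" or as "next_protocol"
head_mentions_protocol() {
  curl -s "$NODE_URL/chains/main/blocks/head/protocols" | grep -q "\"$1"
}

# usage:
# head_mentions_protocol PtHangz2 && echo "Hangzhou is active or up next"
```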

Congratulations! You have successfully performed a manual Tezos migration test.

Our test results

We performed the above tests on mainnet contexts at blocks 1,884,950 and 1,895,271 with both a full archive node and a rolling node from a fresh snapshot, as described here.

We used a 12-core bare metal server with 64G of memory and 2x NVMe drives in RAID-1. Our Tezos node ran inside a Docker container with Alpine Linux 3.14 without CPU or memory limits and with data volumes mounted via LVM. The process was mostly I/O bound for the archive node (80% iotop usage) and CPU bound for the rolling node (single-core only).

Our results are:

Metric        Archive Node  Rolling Node
Runtime       1h 40m 27s    11m 25s
Peak storage  310G          12G
Peak memory   26G           20G

11 minutes is not bad. Like Marigold in their post, we suggest that bakers run a rolling node, preferably built from a snapshot taken on the day of the Hangzhou protocol upgrade. That way we can keep the network downtime to a minimum.

Happy baking!