Progress on Erigon-CL and thoughts on Merkle Trie Root computation for Beacon State.

Progress on Erigon-CL and thoughts on Merkle Trie Root computation for Beacon State.

More good dev things by Giulio :)

Share

Hello Everybody, This is the second post of my technical blog, and today I am going to discuss the progress I made in the last 4 days on Erigon-CL, Interesting things I found out and also some brainstorming on implementing effective Merkle Root computation of the Beacon state within Erigon :). Hope you enjoy, Let us dive into it 👇👇👇

Progress on Erigon-CL.

There has been two main points of progress in the Erigon Consensus Layer implementation.

Handshake procedure

The first one as I cited in the previous post was the handshake procedure to accept new peers. Basically, I implemented the filter that filters out bad peers from good peers. For example, filtering out peers connected to the wrong network, peers that are too distant from chain tip… In other words, I needed a way to filter out the trash and eventually I got it to work! The Beacon block downloader has been greatly improved and can stay connected for an extended period of time, and also it does not fall behind anymore. Additionally, I suspect that this change also improved lightclient networking as well but I do not have confirmation of that yet. There are still some problems that I need to fix in this regard though:

  • Unusually high CPU usage when making request (cl-lightclient not affected).
  • Need to port our status correctly to other peers as it is a bit hacky for now.

But I will make my best to make this work well.

Slot vs Block Number: A grave misconception.

Now this is veryyyy interesting. I want to do a quick digression. So, before starting to developing the Beacon Blocks downloading system, I was under the wrong assumption that slot are the counterparts of block number, just to then realize that IT IS WRONG! Block numbers are not like slots. For each block number there HAS to be assigned a block but the same thing cannot be said for slots. as a matter of fact, Slots are just opportunity windows and are NOT numerical identifier or unit of length. Instead they represent time. And I found this out in the most painful way possible: By trying to import the chain slot by slot. GRAVE MISTAKE. because with this approach you will end up getting stuck each time you find a missed block. and that is because there is no block for that slot and that slot need to be skipped. I lost on this, like, 4 hours of my life. My downloader got stuck all over again because I was expecting slot+1 of what I currently had, and of course when the closest slot was slot+2, it ended up getting stuck. AMAZING. but onto the solution…

Beacon block downloader now relies on Block Roots as sanity checks instead of slots.

So because of the aforementioned misconception, I had to get creative, so instead of just asking a range of blocks by slot to the node peers and then linearly check it against their slot optimistically, I had to reverse check them by comparing block root and parent root. But how did I make it work, well let us dive into it:

  1. Retrieve blocks for a 10 slots range from P2P, starting from slot S: {S, …, S+10}
  2. Retrieve the segment and reverse it: {S+10, …, S}
  3. Save Parent root of block S+10
  4. Compare Parent Root of block S+10 with block root S+9 (supposedly the ancenstor).
  5. Rinse and repeat until you are done with all slots.
  6. Any checks failed? Abort, Otherwise? Import the chain.

And guess what? It worked! Like a charm.

Then go on to the beacon state stage and execute all blocks one by one in forward order (Not implemented yet).

Why is this better? Slot is inconsistent and is more equivalent to timestamp rather than block number in its behavior. Block roots instead are the direct linkage in the chain so there is no weird stuff going on with them. This basically allowed me to be sure not to skip any single block in the chain. Yay!!!! And with this one working, we sealed another nail in the coffin and positioned another piece of the puzzle.

Merkle Trie Root computation for Beacon State

And now onto my next objective. Which is proper root computation for the state. In other words, checking and comparing our root against the block state root to see if we have the correct state or not. and this…. this is going to be a pain. It is true that the beacon state is only 100 MB and can be kept in memory… but hashing it every single time from scratch takes time and on top of that, the hashing library I am using breaks for the current version of the Beacon State. So its gotta be manual. The first thing to do is trying to generate a State Root from a fresh state correctly, of course. This will take a lot of hashing and a lot of researching but I am sure I can make it. The second step is to make it not shit. As a matter of fact, on Erigon, hashing the whole 100 MB worth of beacon state takes half a second. That is a lot! But how can we fix it?

Dirty Leaves system.

So a way to fix it is to just re-hash only the parts of the state that has been changed. Let me show you an example of what I mean:

Lets assume that fields 1 to 8 are our beacon state fields. Now let us assume that we want to hash it. The first time we hash all of it and we do it slowly in 500ms that is bad, but instead of saving only the root we also save H(12), H(34), H(12, 34), H(56), H(78), H(56, 78) and save them somewhere (a slice, list, map, i dont care). Now we modify the state, more specifically, field 1 and 3, we transition and we need to find the root again. However, since we only modify field 1 and 3, only H(12) and H(34) will be affected, so we do not re-hash H(56), H(78) and H(56, 78) but only the one we touched. so we compute H(12), H(34) and then H(12, 34), and the hash H(12, 34) with H(56,78) that we saved before without doing further computation. Thus, saving almost 50% of the hashing. this means that now we will take almost half as much as we would be taking if hashed everything from scratch.

GoHashTree

I also switched to GoHashTree, a library from Prysm to do Merkle root hashing, which made Erigon CL-Lightclient take 30ms to process a new update instead of 40ms. I would assume that the rest is BLS verification. Big shoutout to the Prysm team for their kindness :).

Fin.

original post: https://giulioswamp.substack.com/p/8122022-progress-on-erigon-cl-and