Beacon epoch transition and optimizations made along the way.

Faster, Better, Stronger.


Hello Frens, this is Giulio. It has been a while since my last blog post, and I would like to discuss what I have been up to in the past 2 weeks or so. Firstly, I finally finished the first semi-working prototype of the epoch and slot transition functions within Erigon-CL. Secondly, I have been working on optimizations in block processing, which heavily improved slot processing time. And lastly, I was able to bring down the state root computation time for each slot. To add some context: slot processing happens at each block, whenever we switch slot, while the epoch transition happens every 32 slots, whenever we switch epoch. Now, without further ado, let us jump into what I did.

Slot transition optimizations: naive implementation to “I just cached your mom lol”

Well, when I finally finished the slot transition processing and ran it on a checkpoint-synced Mainnet state, the first thing I noticed was that it was slow, like REALLY slow. As a matter of fact, for the first attempt I implemented the specs somewhat faithfully, which is not a good idea because the Ethereum Consensus Specs prioritize readability over performance (rightfully so). However, I do not consider the translation approach much of a sin, since it gave me a starting point. To give you an idea, it was taking 2 seconds per slot, which was really bad. So I decided to do some profiling, and the results were as follows:

As you can see, most of the time is spent in GetActiveValidatorIndicies, which again was due to the fact that I copycatted the Consensus Specs at the beginning. Additionally, a lot of time is spent in the GetTotalBalance function too, which was no good of course. However, the results of these two functions usually do not change when you run the transition within the same epoch. Thus, I did the most logical move that I could see.

Yes, I abused caching like there is no tomorrow. I cached quite literally ~60% of the computation. I cached: beacon committees, active indices, the total balance of active indices, the square root of the total balance of active indices, the proposer index per slot, and even 50% of the hashing done to retrieve the actual proposer index, because it was actually quite trivial to do. After all of this, I got a time of ~700 ms per slot (note that this is with BLS verification, so this time is only meaningful at chain tip; it is supposed to be much faster when reconstructing historical states). Below is the profile of the end result.
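
To make the caching idea concrete, here is a minimal sketch of a per-epoch cache for the active indices, their total balance and its square root. This is not the actual Erigon-CL code: the Validator, State and epochCache types and all method names are hypothetical and trimmed down for illustration, and a real cache would also have to be invalidated whenever the validator set or the balances change.

```go
package main

import "fmt"

// Validator is a trimmed-down, hypothetical stand-in for a beacon state validator.
type Validator struct {
	ActivationEpoch  uint64
	ExitEpoch        uint64
	EffectiveBalance uint64
}

// epochCache stores values that stay the same for every slot of one epoch.
type epochCache struct {
	valid         bool
	epoch         uint64
	activeIndices []uint64
	totalBalance  uint64
	sqrtBalance   uint64
}

// State is a hypothetical, minimal beacon state carrying its own cache.
type State struct {
	Validators []Validator
	cache      epochCache
	scans      int // number of full validator scans, for demonstration only
}

// integerSqrt returns the largest x such that x*x <= n (Newton's method).
func integerSqrt(n uint64) uint64 {
	x := n
	y := x/2 + x%2
	for y < x {
		x = y
		y = (x + n/x) / 2
	}
	return x
}

// refresh scans the validator set only when the requested epoch is not cached.
func (s *State) refresh(epoch uint64) *epochCache {
	if s.cache.valid && s.cache.epoch == epoch {
		return &s.cache
	}
	s.scans++
	indices := make([]uint64, 0, len(s.Validators))
	var total uint64
	for i, v := range s.Validators {
		if v.ActivationEpoch <= epoch && epoch < v.ExitEpoch {
			indices = append(indices, uint64(i))
			total += v.EffectiveBalance
		}
	}
	s.cache = epochCache{
		valid: true, epoch: epoch, activeIndices: indices,
		totalBalance: total, sqrtBalance: integerSqrt(total),
	}
	return &s.cache
}

func (s *State) ActiveIndices(epoch uint64) []uint64 { return s.refresh(epoch).activeIndices }

func (s *State) TotalActiveBalance(epoch uint64) uint64 { return s.refresh(epoch).totalBalance }

func (s *State) SqrtTotalActiveBalance(epoch uint64) uint64 { return s.refresh(epoch).sqrtBalance }

func main() {
	s := &State{Validators: []Validator{
		{ActivationEpoch: 0, ExitEpoch: 100, EffectiveBalance: 32},
		{ActivationEpoch: 0, ExitEpoch: 5, EffectiveBalance: 32},
		{ActivationEpoch: 0, ExitEpoch: 100, EffectiveBalance: 17},
	}}
	// 32 slots of the same epoch trigger a single scan of the validator set.
	for slot := 0; slot < 32; slot++ {
		_ = s.TotalActiveBalance(10)
	}
	fmt.Println(s.ActiveIndices(10), s.TotalActiveBalance(10), s.SqrtTotalActiveBalance(10), s.scans)
}
```

The point of the sketch is only that every slot after the first one in an epoch pays a lookup instead of a full pass over the validators.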

Yes, basically only BLS verification, but worry not, because it got better after I verified the signatures in parallel, which brought the time from 700 ms/slot to 400 ms/slot. However, this is chain tip performance, and a good amount of it is now beacon state root computation, so there is still room for improvement. At the end of the day, these are the final results for slot processing performance:

On another note, it is theoretically possible to parallelize 100% of the bulky parts that constitute partial verification, given we have an older beacon state (as when we reconstruct historical states), so there is definitely still room for improvement there. On the other hand, I was able to snatch another 100 ms improvement on the parallelized BLS version with improved state root computation.
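
For the curious, verifying signatures in parallel can be sketched roughly as follows. To keep the example runnable with only the Go standard library, it uses Ed25519 as a stand-in for BLS (that swap is an assumption made purely for illustration), and the signedItem type and the helper names are hypothetical, not Erigon-CL's.

```go
package main

import (
	"crypto/ed25519"
	"fmt"
	"sync"
	"sync/atomic"
)

// signedItem is a hypothetical (public key, message, signature) triple.
type signedItem struct {
	pub ed25519.PublicKey
	msg []byte
	sig []byte
}

// verifyAllParallel checks every signature in its own goroutine and
// reports true only if all of them are valid.
func verifyAllParallel(items []signedItem) bool {
	var failed atomic.Bool
	var wg sync.WaitGroup
	for _, it := range items {
		wg.Add(1)
		go func(it signedItem) {
			defer wg.Done()
			// Ed25519 stands in for BLS verification here.
			if !ed25519.Verify(it.pub, it.msg, it.sig) {
				failed.Store(true)
			}
		}(it)
	}
	wg.Wait()
	return !failed.Load()
}

// makeItems builds n validly signed demo items with fresh keys.
func makeItems(n int) []signedItem {
	items := make([]signedItem, n)
	for i := range items {
		pub, priv, err := ed25519.GenerateKey(nil) // nil means crypto/rand
		if err != nil {
			panic(err)
		}
		msg := []byte(fmt.Sprintf("attestation %d", i))
		items[i] = signedItem{pub: pub, msg: msg, sig: ed25519.Sign(priv, msg)}
	}
	return items
}

func main() {
	items := makeItems(8)
	fmt.Println("all valid:", verifyAllParallel(items))

	items[3].msg = []byte("tampered")
	fmt.Println("after tampering:", verifyAllParallel(items))
}
```

One goroutine per signature is the simplest shape; with many signatures per block, a bounded worker pool would probably be the more sensible choice.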

State Root computation optimizations.

Let me give an introduction to why this part is somewhat painful in CL implementations: the beacon state is 100 MB big and we need to hash all of it. Without any caching, the initial computation in Erigon-CL used to take 500 ms, and re-computation used to take 100-120 ms after slot processing and 150 ms after epoch processing. This was due to 2 things in particular: data re-allocation and hashing of big vectors. My solution: reusable buffers and parallelized hashing of big lists. Pretty obvious optimizations here, and below is the improvement.
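
As a rough illustration of both tricks, the sketch below computes a plain SHA-256 Merkle root over a list of 32-byte leaves, reusing one scratch buffer across calls and hashing independent subtrees in parallel. It is a simplified sketch under stated assumptions and not the Erigon-CL implementation: the treeHasher type is hypothetical, the leaf count and the worker count must both be powers of two (with at least one leaf per worker), and SSZ details such as zero-padding and mixing in the list length are left out.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// merkleizeInPlace reduces layer (non-empty, power-of-two length) to its root.
// It overwrites layer as scratch space, so no per-level slices are allocated.
// The root is returned and also left in layer[0].
func merkleizeInPlace(layer [][32]byte) [32]byte {
	var pair [64]byte
	for n := len(layer); n > 1; n /= 2 {
		for i := 0; i < n/2; i++ {
			copy(pair[:32], layer[2*i][:])
			copy(pair[32:], layer[2*i+1][:])
			layer[i] = sha256.Sum256(pair[:])
		}
	}
	return layer[0]
}

// treeHasher is a hypothetical helper that keeps its scratch buffer between calls.
type treeHasher struct {
	scratch [][32]byte
}

// Root hashes `workers` equally sized subtrees concurrently, then combines
// the subtree roots. The input leaves are left untouched.
func (h *treeHasher) Root(leaves [][32]byte, workers int) [32]byte {
	if cap(h.scratch) < len(leaves) {
		h.scratch = make([][32]byte, len(leaves)) // grows only when needed
	}
	buf := h.scratch[:len(leaves)]
	copy(buf, leaves)

	chunk := len(buf) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(sub [][32]byte) {
			defer wg.Done()
			merkleizeInPlace(sub) // subtree root ends up in sub[0]
		}(buf[w*chunk : (w+1)*chunk])
	}
	wg.Wait()

	// Gather the subtree roots at the front of the buffer and finish the top.
	for w := 0; w < workers; w++ {
		buf[w] = buf[w*chunk]
	}
	return merkleizeInPlace(buf[:workers])
}

func main() {
	leaves := make([][32]byte, 1024)
	for i := range leaves {
		leaves[i][0] = byte(i)
		leaves[i][1] = byte(i >> 8)
	}
	h := &treeHasher{}
	fmt.Printf("%x\n", h.Root(leaves, 8))
	fmt.Printf("%x\n", h.Root(leaves, 1)) // same root, buffer reused
}
```

Each goroutine works on its own disjoint slice of the buffer, so no locking is needed while the subtrees are hashed.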

Epoch processing

So epoch processing is still incomplete: it works for some epochs and does not work for others (mismatched state roots), so I will refrain from doing optimizations before actually passing all consensus tests and having some degree of confidence that it functions correctly. However, there were a couple of obvious “optimizations”: not computing the participating indices up front and just checking them along the way, and not using maps to collect rewards/penalties but applying them straight off the bat (I used maps because I am dumb, dunno why I did that). Initial epoch processing: 700-800 ms. Current epoch processing: 180-300 ms.
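
To show the difference between the two approaches, here is a small sketch that contrasts collecting rewards/penalties in maps with applying them directly while checking participation along the way. Everything in it is hypothetical and heavily simplified (flat reward and penalty amounts, a plain bool per validator for participation); the only thing borrowed from the specs is that a balance decrease saturates at zero.

```go
package main

import "fmt"

func increaseBalance(balances []uint64, index int, delta uint64) {
	balances[index] += delta
}

// decreaseBalance saturates at zero instead of underflowing.
func decreaseBalance(balances []uint64, index int, delta uint64) {
	if delta > balances[index] {
		balances[index] = 0
		return
	}
	balances[index] -= delta
}

// processWithMaps is the "before" shape: collect everything in maps first,
// then apply it in a second pass.
func processWithMaps(balances []uint64, participated []bool, reward, penalty uint64) {
	rewards := map[int]uint64{}
	penalties := map[int]uint64{}
	for i := range balances {
		if participated[i] {
			rewards[i] += reward
		} else {
			penalties[i] += penalty
		}
	}
	for i, r := range rewards {
		increaseBalance(balances, i, r)
	}
	for i, p := range penalties {
		decreaseBalance(balances, i, p)
	}
}

// processDirect is the "after" shape: one pass, participation is checked
// along the way and balances are updated immediately.
func processDirect(balances []uint64, participated []bool, reward, penalty uint64) {
	for i := range balances {
		if participated[i] {
			increaseBalance(balances, i, reward)
		} else {
			decreaseBalance(balances, i, penalty)
		}
	}
}

func main() {
	participated := []bool{true, false, false}

	a := []uint64{100, 5, 50}
	processWithMaps(a, participated, 10, 8)

	b := []uint64{100, 5, 50}
	processDirect(b, participated, 10, 8)

	fmt.Println(a, b)
}
```

Both versions end up with the same balances; the direct one simply skips the map allocations and the second pass.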

In conclusion, it is going well. The next step is passing all consensus tests for what I have so far, so I think I will build a test runner accordingly.
