With the release of Erigon v3.3, we are announcing efficient indexing for historical MPT proofs. Users can now resync with `--prune.include-commitment-history` to store all historical proofs. This article is divided into two parts:
- Part 1: Why historical proofs matter and why solving this problem is worthwhile (use cases, comparisons with other clients, and practical implications).
- Part 2: A technical explanation of how the system works and the design decisions that made it possible.
First, we look at why historical proofs are important for both the Ethereum roadmap and the broader application layer.
1. Why Historical Proofs Matter
Historical proofs allow any party to verify past Ethereum state without trusting a full node or replaying the entire chain. A historical proof provides the minimal data needed to reconstruct and validate a state value at a specific block, either account state or contract storage, supported by the corresponding MPT branch.
This capability matters for several reasons:
- Ethereum’s long-term direction prioritizes zkVMs. In a ZK world, block verifiers rely on compact state witnesses instead of carrying the full state. To reach that future, clients must be able to store, index, and serve historical proofs efficiently; without efficient proof handling, generating proofs and syncing up nodes becomes slow and tedious.
- Trustless access to historical state for applications. Without efficiently indexed proofs, users either rely on centralized RPCs or run heavy infrastructure. Providing direct, verifiable access without replaying blocks reduces operational cost and strengthens trust guarantees. The applications that benefit most are Layer 2 systems reconstructing L1 provenance and creating challenges (see Arbitrum Nitro), and zero-knowledge systems needing inclusion proofs for older data.
Performance Compared to Other Clients
Before v3.3, obtaining historical proofs required one of the following:
- Running a full archive Geth or Nethermind node (Reth only supports the last 10k blocks), which is >20 TB in size.
- Querying a centralized RPC node that performs expensive on-demand proof reconstruction.
These approaches are either heavy in resources or lead to centralization. With v3.3, Erigon can serve proofs directly from indexed structures, delivering:
- Consistent low-latency retrieval
- Predictable resource use
Comparison Against Other Clients
| Client | Stores Historical Proofs | Trie nodes Size (Approx.) | p50 Latency (s) | p75 Latency (s) | p90 Latency (s) | p99 Latency (s) |
|---|---|---|---|---|---|---|
| Reth | No | N/A | N/A | N/A | N/A | N/A |
| Geth | Yes (full archive) | >20 TB | 0.015 | 0.026 | 0.035 | 0.060 |
| Nethermind | Yes (full archive) | >20 TB | 0.015 | 0.022 | 0.030 | 0.058 |
| Erigon v3.3 | Yes (indexed proofs) | 4.1 TB | 0.003 | 0.006 | 0.007 | 0.012 |
NOTE: Geth is working on a solution to this problem, so the >20 TB figure is not their long-term target; they are developing an approach expected to produce results comparable to Erigon’s.
Erigon is consistently about 5x faster at serving proofs than both Geth and Nethermind.
2. A Gentle Technical Explainer on How the Proof Indexing Works (tech people only)
Before we talk about the actual indexing, we need to talk about Haystack.
Haystack is Meta’s storage system for photos. The core idea is to separate large, immutable binary objects (the photos) from the small metadata required to locate them. This enables extremely cost-effective sequential disk access for bulk data, while keeping the indexing layer compact and fast. The trick is to separate these two types of data into two different kinds of file.
Segment Files (Data Files)
These contain the raw objects (photos), written almost entirely sequentially according to some ordering heuristic.
Key properties:
- Large, append-only, immutable.
- Optimized for throughput, not random seeks.
- No metadata inside besides minimal framing.
- You do not search inside them; you jump to an offset and read.
- The ordering heuristic is up to the implementer (the design is flexible).
Segment files are the “haystack.”
Index Files
Index files store the data structure needed to locate an object inside the seg files:
- Object key → (seg file ID, offset, length)
- Small, random-access friendly, designed for memory mapping or SSD.
Index files tell you where to read, seg files contain what to read.
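To make the separation concrete, here is a minimal sketch of the Haystack idea in Python. The `Haystack` class and its methods are names I am introducing for illustration, not Meta's actual API: one append-only segment file holds the blobs, and a small index maps each key to `(offset, length)`.

```python
import os

class Haystack:
    """Toy Haystack: one append-only segment file plus a small in-memory index."""

    def __init__(self, path):
        self.seg = open(path, "wb+")  # segment file: large, sequential, immutable
        self.index = {}               # object key -> (offset, length)

    def put(self, key, blob):
        self.seg.seek(0, os.SEEK_END)
        offset = self.seg.tell()
        self.seg.write(blob)          # append-only: no in-place updates
        self.index[key] = (offset, len(blob))

    def get(self, key):
        offset, length = self.index[key]  # the index tells us *where* to read
        self.seg.seek(offset)             # one jump; no searching inside the file
        return self.seg.read(length)
```

Note that `get` never scans the segment file: locating an object is a dictionary lookup followed by a single seek, which is exactly the property the proof index exploits below.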
This separation is the key architectural trick, and it is the one we are going to build upon to map Haystack onto the historical Merkle proof problem.
Mapping This to Erigon’s Proof Indexing
Erigon adopts the same principle: large, immutable data files for trie nodes plus small, query-optimized index files. This matters because files with very similar-looking data can be kept in separate contiguous regions without mixing them. That in turn allows clever storage optimizations, such as applying any compression we want and controlling the heuristic that decides the order in which each individual trie node is stored in the file (a sorting heuristic).
Index Files
The index in and of itself is not the interesting part: it maps a block number and trie key to a location in a data file, much like Haystack’s object key → (seg file ID, offset, length). (Note that this is a very high-level description; in reality there is a lot of compression at this layer through Elias-Fano encoding and roaring bitmaps, so take a leap of faith and consider the storage overhead of indexing minimal.)
In practical terms:
- The index answers which segment contains the specific proof branch given a block number.
- Actual node retrieval is a single quick offset jump into the corresponding side-node data file.
- No reconstruction, no re-hashing, no walking the historical trie; just a single lookup in a perfect hash table, which is O(1) ignoring hardware overhead.
This design is what matters for latency, but not for size. The storage savings come from the data files, described next.
The Data File (The Storage Optimization)
The data files store all historical trie nodes. Each file contains many nodes, but they are not written in the order the chain encounters them. Instead, Erigon reorganizes them to maximize compression and locality.
The chain is split into fixed-size chunks of transactions (1.5M transactions per chunk):

C_1, C_2, …, C_m

For each chunk C_i, Erigon collects the set N_i of all trie nodes touched during execution of that chunk. These nodes are the input set for the layout and compression pipeline.
Similarity Metric: Hamming Distance on Compressed Nodes
Before sorting, each node n is serialized and transformed into a compact encoding c(n).

Similarity is measured on this compressed representation, not on the raw node bytes. For two nodes n_a and n_b, the distance is the Hamming distance between their encodings:

d(n_a, n_b) = d_H(c(n_a), c(n_b))

Erigon then orders the nodes in N_i so that consecutive nodes are as similar as possible, i.e. it seeks an ordering n_1, n_2, …, n_k in which each d_H(c(n_j), c(n_{j+1})) is small.

A neat property here is that, in practice, similarity behaves almost transitively on these compressed encodings. If d_H(c(n_j), c(n_{j+1})) and d_H(c(n_{j+1}), c(n_{j+2})) are both small, then typically

d_H(c(n_j), c(n_{j+2}))

is also small. Once similar nodes start clustering, that similarity naturally propagates along the sorted sequence, which is exactly what we want for compression.
Importantly, this means:
- Each data file contains many trie nodes.
- Nodes are not stored in chain order or “first-seen” order.
- They are stored in an order that minimizes Hamming distance between neighbors in their compressed form.
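One simple way to produce such an ordering is a greedy nearest-neighbour pass. The sketch below is an illustration of the idea, not Erigon’s actual code: `encode` stands in for the compact encoding c(n) (using zlib here for self-containment), and `similarity_sort` repeatedly appends the remaining node closest to the last one placed.

```python
import zlib

def encode(node: bytes) -> bytes:
    # Stand-in for the compact encoding c(n); zlib is used only for the sketch.
    return zlib.compress(node)

def hamming(a: bytes, b: bytes) -> int:
    # Bitwise Hamming distance over the common prefix, plus the length gap in bits.
    dist = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return dist + 8 * abs(len(a) - len(b))

def similarity_sort(nodes):
    """Greedy ordering: start anywhere, then always append the remaining
    node whose encoding is closest (in Hamming distance) to the last one."""
    remaining = list(nodes)
    encoded = {n: encode(n) for n in remaining}
    order = [remaining.pop(0)]
    while remaining:
        last = encoded[order[-1]]
        nxt = min(remaining, key=lambda n: hamming(last, encoded[n]))
        remaining.remove(nxt)
        order.append(nxt)
    return order
```

The greedy pass is quadratic in the number of nodes, so a production system would use a cheaper clustering heuristic, but the output has the property we need: neighbours in the sequence differ in few bits.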
Batch Compression
After sorting, the sequence is split into fixed-size batches of 64 nodes:

B_1, B_2, …, each B_j holding 64 consecutive nodes

Compression is applied after batching: each stored unit is Z(B_j), where Z is the chosen compression function. Because each batch contains nodes that are all close to each other under d_H on their compressed encodings, the entropy within a batch is low, and the compressor achieves very high ratios. Indexing points into these compressed batches. Our compression function of choice is ZSTD.
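The batching step can be sketched in a few lines. This is an illustration under stated assumptions: `compress_batches` is a name I am introducing, and zlib stands in for ZSTD so the example runs with the standard library alone.

```python
import zlib

BATCH = 64  # nodes per batch, matching the layout described above

def compress_batches(sorted_nodes):
    """Cut the similarity-sorted node list into fixed-size batches and
    compress each batch as a single unit (Z(B_j) in the text)."""
    batches = []
    for i in range(0, len(sorted_nodes), BATCH):
        batch = sorted_nodes[i:i + BATCH]
        payload = b"".join(batch)  # neighbours are similar, so entropy is low
        batches.append(zlib.compress(payload, 9))
    return batches
```

Compressing per batch rather than per node lets the compressor exploit the cross-node redundancy that the sorting step created, while still keeping the decompression unit small enough for fast point lookups.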
Resulting Layout
For each 1.5M-tx chunk:
- Every trie node seen in that chunk is stored.
- Nodes are reordered by similarity on their compressed representation.
- The reordered list is cut into groups of 64 and compressed batch-by-batch.
The index layer points into these data files by offset, but the files themselves are optimized purely for density and locality. This layout is what allows Erigon to store full historical proofs in about 4.1 TB for trie nodes, instead of the 20+ TB required by archive-style layouts that keep nodes in chronological or naive trie order.
Analysis of the Entropy of Historical Trie Nodes
This section is for the most nerdy people who want to understand how this approach came to be. The way I figured it out was by looking at the optimal entropy of the data.
What is Entropy?
Entropy measures the average uncertainty or randomness in a dataset. Coined by Claude Shannon in 1948, it quantifies how much information is contained in each bit of data. The higher the entropy, the more “surprised” we are by each new piece of data—and the harder it is to compress.
The Mathematical Definition
Shannon entropy H(X) for a discrete random variable X with possible values {x₁, x₂, …, xₙ} is:

H(X) = -Σᵢ p(xᵢ) × log₂ p(xᵢ)
Where:
- p(xᵢ) is the probability of outcome xᵢ
- log₂ gives us the result in bits
- The negative sign ensures entropy is positive
Worked examples:
- Fair coin (50% heads, 50% tails):
  H = -[0.5 × log₂(0.5) + 0.5 × log₂(0.5)] = 1 bit
  Maximum entropy for a binary outcome.
- Biased coin (90% heads, 10% tails):
  H = -[0.9 × log₂(0.9) + 0.1 × log₂(0.1)] ≈ 0.47 bits
  Lower entropy because outcomes are more predictable.
- Two-headed coin (100% heads):
  H = -[1.0 × log₂(1.0)] = 0 bits
  Zero entropy: no uncertainty!
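The coin numbers are easy to verify directly from the definition. Here is a minimal sketch (`shannon_entropy` is a name introduced for this example):

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum over i of p(x_i) * log2 p(x_i); zero-probability terms vanish."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair = shannon_entropy([0.5, 0.5])    # 1.0 bit: maximum for a binary outcome
biased = shannon_entropy([0.9, 0.1])  # ~0.47 bits: more predictable
certain = shannon_entropy([1.0])      # 0.0 bits: no uncertainty
```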
Examples:
- Random data: ~8 bits per byte (maximum)
- English text: ~4.5 bits per byte
- Repetitive logs: ~2-3 bits per byte
- Cryptographic hashes: ~7.9 bits per byte
Compression Ratio as a way to measure entropy
The most practical way to measure entropy is to compress the data and see how small it gets:
Compression Ratio = Compressed Size / Original Size

Estimated Entropy ≈ 8 × Compression Ratio bits/byte
This works because optimal compression approaches the theoretical entropy limit. Real compressors like ZSTD, gzip, or bzip2 give us a practical approximation of the true entropy.
Examples:
| Data Type | Compression Ratio | Estimated Entropy |
|---|---|---|
| Random bytes | ~100% (no compression) | ~8 bits/byte |
| English text | ~30% | ~2.4 bits/byte |
| JSON logs | ~15% | ~1.2 bits/byte |
| Repeated pattern | ~5% | ~0.4 bits/byte |
Why This Works
Compression algorithms exploit redundancy, so the compression ratio directly reflects how much redundancy (the inverse of entropy) exists in your data. If ZSTD can compress your file to 10% of its original size, then 90% of the original was redundant; the true information content was only 10%. Additionally, entropy isn’t just about the data, it’s about how the data is represented: the same information can have different entropy depending on how it’s organized.
The trivial way to measure the entropy in our case is to take a completely unoptimized historical trie layout and see how well ZSTD compresses it under different orderings.
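A tiny synthetic experiment makes the point that ordering alone changes the achievable ratio. The data below is made up for illustration (it only mimics trie nodes: each "family" shares a 64-byte body and gets a unique 4-byte id), and zlib stands in for ZSTD; the effect is that grouped similar nodes keep matches inside the compressor's window, while a naive order scatters them.

```python
import random
import zlib

random.seed(0)
# 512 "families" of similar nodes: a shared 64-byte body plus a unique 4-byte id.
families = [bytes(random.getrandbits(8) for _ in range(64)) for _ in range(512)]
nodes = [body + i.to_bytes(4, "big") for body in families for i in range(16)]

grouped = b"".join(nodes)       # similarity-sorted layout: neighbours share bytes
shuffled = nodes[:]
random.shuffle(shuffled)
naive = b"".join(shuffled)      # "first-seen" style layout: similarity scattered

grouped_size = len(zlib.compress(grouped, 9))
naive_size = len(zlib.compress(naive, 9))
# Same bytes, same multiset of nodes: the grouped layout compresses far better.
```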

Erigon v3.3: Historical Trie Node Compression Analysis