Geth crashes on Holesky when trying to add invalid block after syncing good branch #31320

etan-status opened this issue Mar 5, 2025 · 7 comments · May be fixed by #31603

etan-status commented Mar 5, 2025

System information

Geth version:

% geth version
Geth
Version: 1.15.4-stable
Architecture: arm64
Go Version: go1.24.0
Operating System: darwin
GOPATH=
GOROOT=
geth --holesky --log.vmodule='rpc=5'

CL client & version: status-im/nimbus-eth2 v25.3.0 ethpandaops/nimbus-eth2:splitview

nimbus_beacon_node \
    --data-dir="$HOME/Downloads/nimbus/data/holesky" \
    --network=holesky \
    --rest \
    --tcp-port=9010 \
    --udp-port=9010 \
    --history=prune \
    --el=http://127.0.0.1:8551 \
    --jwt-secret=/Users/etan/Library/Ethereum/holesky/geth/jwtsecret \
    --max-peers=80

OS & Version:

% sw_vers
ProductName:		macOS
ProductVersion:		15.3.1
BuildVersion:		24D70

Commit hash: installed via Homebrew

Expected behaviour

Geth should report that the block is invalid without crashing.

Actual behaviour

Geth sometimes crashes when trying to add an invalid block on Holesky.

Steps to reproduce the behaviour

Geth is started with geth --holesky --log.vmodule='rpc=5'

I don't know how to reproduce this reliably, but logs are attached. Geth was on the canonical branch, then a peer provided the invalid branch to Nimbus, and Nimbus tried to validate it against Geth. Note that Nimbus does not randomly blacklist hardcoded blocks.

Terminal Saved Output.txt.zip

Backtrace

DEBUG[03-05|12:48:46.998] Served engine_newPayloadV4               conn=127.0.0.1:57953 reqid=8656 duration=1.19475ms
DEBUG[03-05|12:48:47.165] Served engine_forkchoiceUpdatedV3        conn=127.0.0.1:57954 reqid=8657 duration="114.167µs"
WARN [03-05|12:48:48.096] Forkchoice requested unknown head        hash=c13070..20e358
DEBUG[03-05|12:48:48.096] Served engine_forkchoiceUpdatedV3        conn=127.0.0.1:57958 reqid=8661 duration="96.167µs"
WARN [03-05|12:48:48.520] Ignoring payload with missing parent     number=3,419,737 hash=d8aaab..9b2e03 parent=c13070..20e358 reason="forced head needed for startup"
DEBUG[03-05|12:48:48.520] Served engine_newPayloadV4               conn=127.0.0.1:57961 reqid=8662 duration=1.540709ms
INFO [03-05|12:48:48.649] Forkchoice requested sync to new head    number=3,419,737 hash=d8aaab..9b2e03 finalized=unknown
DEBUG[03-05|12:48:48.649] Served engine_forkchoiceUpdatedV3        conn=127.0.0.1:57962 reqid=8663 duration="153.917µs"
DEBUG[03-05|12:48:49.200] Served engine_newPayloadV4               conn=127.0.0.1:57972 reqid=8664 duration=2.140625ms
INFO [03-05|12:48:49.284] Forkchoice requested sync to new head    number=3,419,738 hash=fdd3df..823188 finalized=unknown
DEBUG[03-05|12:48:49.284] Served engine_forkchoiceUpdatedV3        conn=127.0.0.1:57973 reqid=8665 duration="139.459µs"
INFO [03-05|12:48:49.556] Syncing beacon headers                   downloaded=13 left=0 eta=0s
ERROR[03-05|12:48:49.556] Latest filled block is not available
INFO [03-05|12:48:49.556] Block synchronisation started
WARN [03-05|12:48:49.558] Retrieved pivot header from local        number=3,419,674 hash=b06f15..d10a0c latest=3,419,738 oldest=3,419,724
WARN [03-05|12:48:49.558] Beacon head lower than local chain       beacon=3,419,738 local=3,421,281
INFO [03-05|12:48:49.975] Reenabled snap sync as chain is stateless
ERROR[03-05|12:48:49.975] Downloader sync mode changed mid-run     old=full new=snap
DEBUG[03-05|12:48:58.056] Served eth_getBlockByNumber              conn=127.0.0.1:58013 reqid=8668 duration="49.583µs"
INFO [03-05|12:48:59.083] Forkchoice requested sync to new head    number=3,419,739 hash=a517ab..dbe59e finalized=unknown
INFO [03-05|12:48:59.083] Reenabled snap sync as chain is stateless
DEBUG[03-05|12:48:59.083] Served engine_forkchoiceUpdatedV3        conn=127.0.0.1:58017 reqid=8669 duration=1.281708ms
WARN [03-05|12:48:59.603] Forkchoice requested unknown head        hash=909a3d..5c716f
DEBUG[03-05|12:48:59.603] Served engine_forkchoiceUpdatedV3        conn=127.0.0.1:58020 reqid=8671 duration="84.916µs"
WARN [03-05|12:49:03.547] Truncating freezer table                 database=/Users/etan/Library/Ethereum/holesky/geth/chaindata/ancient/state table=history.meta items=1448 limit=18
WARN [03-05|12:49:03.564] Truncating freezer table                 database=/Users/etan/Library/Ethereum/holesky/geth/chaindata/ancient/state table=account.index items=1448 limit=18
WARN [03-05|12:49:03.581] Truncating freezer table                 database=/Users/etan/Library/Ethereum/holesky/geth/chaindata/ancient/state table=storage.index items=1448 limit=18
WARN [03-05|12:49:03.597] Truncating freezer table                 database=/Users/etan/Library/Ethereum/holesky/geth/chaindata/ancient/state table=account.data items=1448 limit=18
WARN [03-05|12:49:03.613] Truncating freezer table                 database=/Users/etan/Library/Ethereum/holesky/geth/chaindata/ancient/state table=storage.data items=1448 limit=18
INFO [03-05|12:49:03.628] Importing sidechain segment              start=3,419,724 end=3,419,724
INFO [03-05|12:49:03.680] Skip duplicated bad block                number=3,419,724 hash=40a656..a9f642
ERROR[03-05|12:49:03.680] 
########## BAD BLOCK #########
Block: 3419724 (0x40a656c88b9ceb7d6251adc8819228a98ae26511faa246cb88004ca402a9f642)
Error: invalid requests hash (remote: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 local: 12e7307cb8a29c779310bea59482500fb917e433f6849de7394f9e2f5c34bf31)
Platform: geth (devel) go1.24.0 arm64 darwin
Chain config: &params.ChainConfig{ChainID:17000, HomesteadBlock:0, DAOForkBlock:<nil>, DAOForkSupport:true, EIP150Block:0, EIP155Block:0, EIP158Block:0, ByzantiumBlock:0, ConstantinopleBlock:0, PetersburgBlock:0, IstanbulBlock:0, MuirGlacierBlock:<nil>, BerlinBlock:0, LondonBlock:0, ArrowGlacierBlock:<nil>, GrayGlacierBlock:<nil>, MergeNetsplitBlock:<nil>, ShanghaiTime:(*uint64)(0x140000111c0), CancunTime:(*uint64)(0x140000111c8), PragueTime:(*uint64)(0x140000111d0), OsakaTime:(*uint64)(nil), VerkleTime:(*uint64)(nil), TerminalTotalDifficulty:0, DepositContractAddress:0x4242424242424242424242424242424242424242, EnableVerkleAtGenesis:false, Ethash:(*params.EthashConfig)(0x1037116a0), Clique:(*params.CliqueConfig)(nil), BlobScheduleConfig:(*params.BlobScheduleConfig)(0x1036de4a0)}
.....
ERROR[03-05|12:49:03.680] Beacon backfilling failed                err="retrieved hash chain is invalid: invalid requests hash (remote: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 local: 12e7307cb8a29c779310bea59482500fb917e433f6849de7394f9e2f5c34bf31)"
INFO [03-05|12:49:03.680] Disabled trie database due to state sync
DEBUG[03-05|12:49:03.680] Served engine_newPayloadV4               conn=127.0.0.1:57974 reqid=8666 duration=13.705625208s
WARN [03-05|12:49:03.681] Ignoring already known beacon payload    number=3,420,731 hash=dc7737..5e04e8 age=6d20h6m
DEBUG[03-05|12:49:03.681] Served engine_newPayloadV4               conn=127.0.0.1:58012 reqid=8667 duration=5.624187958s
INFO [03-05|12:49:03.682] Reenabled snap sync as chain is stateless
INFO [03-05|12:49:03.682] Reenabled snap sync as chain is stateless
WARN [03-05|12:49:03.682] Ignoring payload while snap syncing      number=3,419,741 hash=a12bfa..9c0477 reason="chain gapped, head: 3419739, newHead: 3419741"
DEBUG[03-05|12:49:03.682] Served engine_newPayloadV4               conn=127.0.0.1:58025 reqid=8672 duration=3.618417042s
WARN [03-05|12:49:03.682] Retrieved pivot header from local        number=3,419,675 hash=7593f7..7807ed latest=3,419,739 oldest=3,419,724
WARN [03-05|12:49:03.683] Ignoring already known beacon payload    number=3,420,732 hash=615e7f..21f59b age=6d20h6m
DEBUG[03-05|12:49:03.683] Served engine_newPayloadV4               conn=127.0.0.1:58049 reqid=8673 duration="91.083µs"
WARN [03-05|12:49:03.683] Beacon head lower than local chain       beacon=3,419,739 local=3,421,281
WARN [03-05|12:49:03.683] Rewinding blockchain to block            target=3,419,674
CRIT [03-05|12:49:03.699] Failed to rollback state                 err="waiting for sync"
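
For context on the BAD BLOCK error above: the "invalid requests hash" check compares the EIP-7685 requests hash carried in the block header against the one Geth computes locally from the block's execution requests. The remote value e3b0c442…52b855 is the SHA-256 of the empty string, i.e. the hash of an empty requests list. Below is a minimal sketch of that computation, not Geth's actual code, assuming the Prague scheme of hashing each non-empty typed request and then hashing the concatenation:

package main

import (
	"crypto/sha256"
	"fmt"
)

// requestsHash sketches EIP-7685: h = sha256(sha256(r_0) || sha256(r_1) || ...),
// where each r_i is the request type byte followed by its payload; requests
// whose payload is empty are skipped.
func requestsHash(requests [][]byte) [32]byte {
	outer := sha256.New()
	for _, r := range requests {
		if len(r) <= 1 { // type byte only, no payload
			continue
		}
		inner := sha256.Sum256(r)
		outer.Write(inner[:])
	}
	var h [32]byte
	copy(h[:], outer.Sum(nil))
	return h
}

func main() {
	// With no requests at all this prints e3b0c442...52b855, matching the
	// "remote" value in the error above.
	fmt.Printf("%x\n", requestsHash(nil))
}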
@lightclient
Member

Can you define what you mean by "geth crashes"? Or is the issue that geth is stuck, unable to reorg?

@etan-status
Author

@lightclient
Member

Oh I see now, the "Failed to rollback state" error is a CRIT. Let me look into it.

@MariusVanDerWijden
Member

I think this is a rollback to before the snapshot point, and that's why we crash. We can't roll back in that case, so crashing might be the correct behavior? But during longer non-finality, the CL MUST NOT reorg back to a chain before our snapshot, so I'm wondering how best to handle this.
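
To make the failure mode concrete, here is a minimal sketch (not Geth's code; rollbackTo and the pivot bookkeeping are hypothetical) of the difference between aborting the process and surfacing a recoverable error when the rollback target predates the snap-sync pivot:

package main

import (
	"errors"
	"fmt"
)

var errBeforePivot = errors.New("rollback target predates snap-synced pivot state")

// rollbackTo pretends to rewind the chain head to target. If target is older
// than the pivot whose state was snap-synced, there is no state to roll back
// to; returning an error lets the Engine API layer answer the CL instead of
// the node terminating with CRIT.
func rollbackTo(target, pivot uint64) error {
	if target < pivot {
		return errBeforePivot
	}
	// ... perform the actual rewind here ...
	return nil
}

func main() {
	// Illustrative numbers only: a rewind target that falls before the synced
	// pivot is refused rather than crashing the process.
	if err := rollbackTo(3_419_674, 3_420_000); err != nil {
		fmt.Println("rollback refused:", err)
	}
}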

@etan-status
Author

Well, with crashes / exits, the CL has no way to identify what's going on. And an error message that doesn't indicate what the user should do isn't helping. Is the fix to delete the datadir and restart Geth with --syncmode=full? If that's the solution, could Geth do it more efficiently in an automated way?

In any case, the current snap-sync behaviour means that Geth treats the initially synced state as finalized. That is wrong in any situation where the initial state later gets orphaned, because none of its parent states, including non-finalized ones, can be restored.

Would it make more sense to change the initial sync target to the finalizedBlockHash (if non-zero) instead of the headBlockHash? Then a CL attempting to reorg back to before the snapshot would only be possible in a slashable environment (where user interaction is required anyway). And if the finalizedBlockHash is outside the latest ~20 minutes that can be snap-synced, Geth could maybe switch to full sync automatically, as that's the common situation hit during incident response (resyncing clients from scratch). It may also be necessary to extend the ~90k limit for state diffs to max(90k, head block number - finalized block number), so that reorgs within the non-finality period are always supported.
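
A minimal sketch of the decision logic proposed above, purely illustrative (the function, its parameters and the snapSyncWindow value are assumptions, not Geth's API):

package main

import "fmt"

const zeroHash = "0x0000000000000000000000000000000000000000000000000000000000000000"

type syncPlan struct {
	target   string // block hash to sync towards
	fullSync bool   // true: full sync, false: snap sync
}

// chooseSyncTarget prefers the finalized hash over the head hash as the initial
// sync target, and falls back to full sync when the finalized block is too old
// for its state to still be snap-syncable from peers.
func chooseSyncTarget(headHash, finalizedHash string, headNum, finalizedNum, snapSyncWindow uint64) syncPlan {
	if finalizedHash == "" || finalizedHash == zeroHash {
		// No finalized block reported yet: keep the current behaviour.
		return syncPlan{target: headHash, fullSync: false}
	}
	if headNum-finalizedNum > snapSyncWindow {
		// Finalized state is outside the snap-syncable window: resync fully.
		return syncPlan{target: finalizedHash, fullSync: true}
	}
	return syncPlan{target: finalizedHash, fullSync: false}
}

func main() {
	fmt.Printf("%+v\n", chooseSyncTarget("0xhead...", "0xfinal...", 3_421_281, 3_419_674, 128))
}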

It is tricky, for sure.

@MariusVanDerWijden
Member

The problem is that we need nodes to have the state at the finalizedBlockHash available for syncing, otherwise the snap sync will never finish. I agree that the message should be different, and maybe we should just return SYNCING instead of crashing. We can increase the 90k limit soon, once the full PBSS work is merged.

@etan-status
Author

Raising the 90k limit doesn't help in this situation / isn't what's being hit here. That limit only applies if the initially synced state was ≤ finalized. Here, an essentially random non-canonical branch got synced initially (depth doesn't matter; it could be just a single orphaned block), and the canonical branch (which does not descend from the non-canonical one) was only discovered and switched to after the initial sync to the orphaned block/state had completed.

Perpetual SYNCING is also tricky to resolve. The CL eventually needs an INVALID signal to switch away from the bad branch (ideally on newPayload if the block is already known to be invalid at that point, as Nethermind does). With SYNCING, the CL will just keep requesting the same head as dictated by fork choice.
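
A minimal sketch of the behaviour argued for here, not Geth's actual newPayload handler (the badBlocks cache and PayloadStatus type are illustrative): answer INVALID for a payload that is, or builds on, a block already known to be bad, and only fall back to SYNCING otherwise.

package main

import "fmt"

type PayloadStatus string

const (
	StatusValid   PayloadStatus = "VALID"
	StatusInvalid PayloadStatus = "INVALID"
	StatusSyncing PayloadStatus = "SYNCING"
)

// badBlocks caches the hashes of blocks this node has already rejected, such as
// block 3,419,724 from the backtrace above.
var badBlocks = map[string]bool{
	"0x40a656c88b9ceb7d6251adc8819228a98ae26511faa246cb88004ca402a9f642": true,
}

// newPayload answers INVALID when the payload or its parent is a known bad
// block, letting the CL drop the branch; otherwise it reports SYNCING while the
// chain is still being backfilled.
func newPayload(blockHash, parentHash string) PayloadStatus {
	if badBlocks[blockHash] || badBlocks[parentHash] {
		return StatusInvalid
	}
	return StatusSyncing
}

func main() {
	fmt.Println(newPayload("0xchild...", "0x40a656c88b9ceb7d6251adc8819228a98ae26511faa246cb88004ca402a9f642"))
}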
