[bug]: very slow to locate in flight HTLC on startup and poor logging messages informing what is going on #9729

ZZiigguurraatt · 2025-04-17T16:50:51Z

When launching LND, it seems as though when ChannelRouter starts up it runs resumePayments. This seems to run through all historical payments to see if they have any in flight HTLC. With millions of payments this can take hours (with postgres at least) even if we have no in flight HTLC. With postgress we have 2 postgres processes loading CPU the entire time, in addition to LND loading a CPU.

There are a couple of issues here.

First, there is poor logging to know what is going on. When this starts, we get a Channel Router starting message, which doesn't really inform us that it is going to go through all historical invoices. There should be another message after that that says Starting to look through XXXXX payments for in flight HTLC that need to be settled or something like that (maybe we change settled to monitored because we may not be able to settle yet and it needs to stay in flight). Then, we should have a progress every 15 to 30 seconds indicating how far it is percent wise getting through all XXXXX payments. When it is complete, we currently should get Authenticated Gossiper starting, but that also tells me nothing meaningful about this payment scan. I think there should be a message before that that says Finished locating all payments with in flight HTLC.

The second issue is that it takes so much time to do this in the first place. I think the reason is that with the KV DB schema, we need to search through all payments to see what is still in flight, rather than using a smart DB query to find them. With postgres, it seems that the payments using a KV schema is much slower than SQLite with KV schema or the old bbolt DB which was only able to use KV schema (I will have to report back on this after doing some more tests confirming if this is actually right).

A third issue to me as a user is during this long search for in flight payments, can I use the rest of the node? What functions are limited while this is going on?

#9147 seems to move payments from KV to SQL schema. I think a later PR might be required to make the DB query actually smarter so all payments don't need to be checked to see if they are in flight are not, but instead let the DB figure out what payments are still in flight.

The text was updated successfully, but these errors were encountered:

ZZiigguurraatt · 2025-04-18T02:38:42Z

With millions of payments this can take hours (with postgres at least) even if we have no in flight HTLC.

So for my database with 5,835,721 payments, it actually took 114 minutes minutes with postgres.

I then re-tested with 5,835,721 payments with SQLite and it took 17 minutes.

yyforyongyu · 2025-04-18T03:17:39Z

Nice report! Re first issue, we need to make the payment lifecycle its own package then add more logs. And yeah we do need to scan all payments, which can be very slow. I think we should fix it with a SQL query. Lastly, only the send payment is blocked during the startup. Tho technically you can use the node for other things, the startup process is linear, so the node operator has to wait until the node is fully started.

ZZiigguurraatt · 2025-04-18T03:19:55Z

Tho technically you can use the node for other things, the startup process is linear, so the node operator has to wait until the node is fully started.

You mean you can use the node for everything but send and receive payments? Like you could create an invoice, but not actually receive payment to it. Or, query invoice status. Or, make on chain transactions?

ZZiigguurraatt · 2025-04-18T03:21:14Z

And yeah we do need to scan all payments, which can be very slow. I think we should fix it with a SQL query.

I'm also wondering why we can't look at the latest commitment transaction and if there are no in flight HTLC, can't we just skip this payments database entirely?

yyforyongyu · 2025-04-18T04:16:50Z

You mean you can use the node for everything but send and receive payments? Like you could create an invoice, but not actually receive payment to it. Or, query invoice status. Or, make on chain transactions?

The short answer is you cannot do anything because the node is still starting up, and it will be blocked here. This happens because we reload inflight payments when the router starts. If we could make the startup of subsystems async, then yeah you can create invoices, receive HTLCs, open/close channels, basically everything else except send payments.

I'm also wondering why we can't look at the latest commitment transaction and if there are no in flight HTLC, can't we just skip this payments database entirely?

You can def do that, treating the commitment state as the single source of truth, but I don't think it helps much. On the commitment side, when seeing an inflight HTLC, it could be the case that we are forwarding an HTLC, or we are sending an HTLC. We can determine this by checking whether we have a corresponding incoming HTLC or not. If we don't have an incoming HTLC, then we are the sender, otherwise, we could be just forwarding this HTLC, unless we are doing a circular payment, which cannot be decided by looking at the HTLC alone.

Then we need to decide which payment the HTLCs belong to. For an MMP, which is made up of multiple HTLCs, we can use the rHash field to decide. For an AMP, it's very hard to decide as it involves XORing the hashes. But even just for MPPs, we still need the meta info like the payment creation time to decide whether to resume the payment or not during startup, since they should be timed out. This means we still need a query from HTLC -> PaymentInfo.

So I think the complexity comes from the mapping of HTLCs to payments. The lookup size should be much smaller if we have lots of payments. Tho I would prefer either indexing the status field so we can SELECT * FROM payments WHERE status = 'inflight'; or simply put inflight payments in a different table.

ZZiigguurraatt · 2025-04-22T20:02:58Z

This might be a more useful view of the size of my DB?

yyforyongyu · 2025-04-23T05:33:20Z

hmmm i don't see payment db there.

ZZiigguurraatt added the enhancement Improvements to existing features / behaviour label Apr 17, 2025

ZZiigguurraatt changed the title ~~[feature]: very slow to locate in flight HTLC on startup and poor logging messages informing what is going on~~ [bug]: very slow to locate in flight HTLC on startup and poor logging messages informing what is going on Apr 17, 2025

This was referenced Apr 18, 2025

[bug]: listinvoices hangs with a large sqlite DB #9716

Open

[bug]: listpayments hangs node and eats RAM when lots of payments in the DB #9709

Open

ZZiigguurraatt mentioned this issue Apr 18, 2025

[bug]: unknown postgres query hangs on startup #9730

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug]: very slow to locate in flight HTLC on startup and poor logging messages informing what is going on #9729

[bug]: very slow to locate in flight HTLC on startup and poor logging messages informing what is going on #9729

ZZiigguurraatt commented Apr 17, 2025 •

edited

Loading

ZZiigguurraatt commented Apr 18, 2025

yyforyongyu commented Apr 18, 2025

ZZiigguurraatt commented Apr 18, 2025

ZZiigguurraatt commented Apr 18, 2025

yyforyongyu commented Apr 18, 2025

ZZiigguurraatt commented Apr 22, 2025

yyforyongyu commented Apr 23, 2025

[bug]: very slow to locate in flight HTLC on startup and poor logging messages informing what is going on #9729

[bug]: very slow to locate in flight HTLC on startup and poor logging messages informing what is going on #9729

Comments

ZZiigguurraatt commented Apr 17, 2025 • edited Loading

ZZiigguurraatt commented Apr 18, 2025

yyforyongyu commented Apr 18, 2025

ZZiigguurraatt commented Apr 18, 2025

ZZiigguurraatt commented Apr 18, 2025

yyforyongyu commented Apr 18, 2025

ZZiigguurraatt commented Apr 22, 2025

yyforyongyu commented Apr 23, 2025

ZZiigguurraatt commented Apr 17, 2025 •

edited

Loading