Skip to content

[bug]: very slow to locate in flight HTLC on startup and poor logging messages informing what is going on #9729

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ZZiigguurraatt opened this issue Apr 17, 2025 · 7 comments
Labels
enhancement Improvements to existing features / behaviour

Comments

@ZZiigguurraatt
Copy link

ZZiigguurraatt commented Apr 17, 2025

When launching LND, it seems as though when ChannelRouter starts up it runs resumePayments. This seems to run through all historical payments to see if they have any in flight HTLC. With millions of payments this can take hours (with postgres at least) even if we have no in flight HTLC. With postgress we have 2 postgres processes loading CPU the entire time, in addition to LND loading a CPU.

There are a couple of issues here.

First, there is poor logging to know what is going on. When this starts, we get a Channel Router starting message, which doesn't really inform us that it is going to go through all historical invoices. There should be another message after that that says Starting to look through XXXXX payments for in flight HTLC that need to be settled or something like that (maybe we change settled to monitored because we may not be able to settle yet and it needs to stay in flight). Then, we should have a progress every 15 to 30 seconds indicating how far it is percent wise getting through all XXXXX payments. When it is complete, we currently should get Authenticated Gossiper starting, but that also tells me nothing meaningful about this payment scan. I think there should be a message before that that says Finished locating all payments with in flight HTLC.

The second issue is that it takes so much time to do this in the first place. I think the reason is that with the KV DB schema, we need to search through all payments to see what is still in flight, rather than using a smart DB query to find them. With postgres, it seems that the payments using a KV schema is much slower than SQLite with KV schema or the old bbolt DB which was only able to use KV schema (I will have to report back on this after doing some more tests confirming if this is actually right).

A third issue to me as a user is during this long search for in flight payments, can I use the rest of the node? What functions are limited while this is going on?

#9147 seems to move payments from KV to SQL schema. I think a later PR might be required to make the DB query actually smarter so all payments don't need to be checked to see if they are in flight are not, but instead let the DB figure out what payments are still in flight.

@ZZiigguurraatt ZZiigguurraatt added the enhancement Improvements to existing features / behaviour label Apr 17, 2025
@ZZiigguurraatt ZZiigguurraatt changed the title [feature]: very slow to locate in flight HTLC on startup and poor logging messages informing what is going on [bug]: very slow to locate in flight HTLC on startup and poor logging messages informing what is going on Apr 17, 2025
@ZZiigguurraatt
Copy link
Author

With millions of payments this can take hours (with postgres at least) even if we have no in flight HTLC.

So for my database with 5,835,721 payments, it actually took 114 minutes minutes with postgres.

I then re-tested with 5,835,721 payments with SQLite and it took 17 minutes.

@yyforyongyu
Copy link
Member

Nice report! Re first issue, we need to make the payment lifecycle its own package then add more logs. And yeah we do need to scan all payments, which can be very slow. I think we should fix it with a SQL query. Lastly, only the send payment is blocked during the startup. Tho technically you can use the node for other things, the startup process is linear, so the node operator has to wait until the node is fully started.

@ZZiigguurraatt
Copy link
Author

Tho technically you can use the node for other things, the startup process is linear, so the node operator has to wait until the node is fully started.

You mean you can use the node for everything but send and receive payments? Like you could create an invoice, but not actually receive payment to it. Or, query invoice status. Or, make on chain transactions?

@ZZiigguurraatt
Copy link
Author

And yeah we do need to scan all payments, which can be very slow. I think we should fix it with a SQL query.

I'm also wondering why we can't look at the latest commitment transaction and if there are no in flight HTLC, can't we just skip this payments database entirely?

@yyforyongyu
Copy link
Member

You mean you can use the node for everything but send and receive payments? Like you could create an invoice, but not actually receive payment to it. Or, query invoice status. Or, make on chain transactions?

The short answer is you cannot do anything because the node is still starting up, and it will be blocked here. This happens because we reload inflight payments when the router starts. If we could make the startup of subsystems async, then yeah you can create invoices, receive HTLCs, open/close channels, basically everything else except send payments.

I'm also wondering why we can't look at the latest commitment transaction and if there are no in flight HTLC, can't we just skip this payments database entirely?

You can def do that, treating the commitment state as the single source of truth, but I don't think it helps much. On the commitment side, when seeing an inflight HTLC, it could be the case that we are forwarding an HTLC, or we are sending an HTLC. We can determine this by checking whether we have a corresponding incoming HTLC or not. If we don't have an incoming HTLC, then we are the sender, otherwise, we could be just forwarding this HTLC, unless we are doing a circular payment, which cannot be decided by looking at the HTLC alone.

Then we need to decide which payment the HTLCs belong to. For an MMP, which is made up of multiple HTLCs, we can use the rHash field to decide. For an AMP, it's very hard to decide as it involves XORing the hashes. But even just for MPPs, we still need the meta info like the payment creation time to decide whether to resume the payment or not during startup, since they should be timed out. This means we still need a query from HTLC -> PaymentInfo.

So I think the complexity comes from the mapping of HTLCs to payments. The lookup size should be much smaller if we have lots of payments. Tho I would prefer either indexing the status field so we can SELECT * FROM payments WHERE status = 'inflight'; or simply put inflight payments in a different table.

@ZZiigguurraatt
Copy link
Author

This might be a more useful view of the size of my DB?

Image

@yyforyongyu
Copy link
Member

hmmm i don't see payment db there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvements to existing features / behaviour
Projects
None yet
Development

No branches or pull requests

2 participants