[bug]: very slow to locate in flight HTLC on startup and poor logging messages informing what is going on #9729
Comments
So for my database with 5,835,721 payments, it actually took 114 minutes with Postgres. I then re-tested the same 5,835,721 payments with SQLite and it took 17 minutes.
Nice report! Re the first issue, we need to make the payment lifecycle its own package and then add more logs. And yeah, we do need to scan all payments, which can be very slow; I think we should fix it with a SQL query. Lastly, only sending payments is blocked during startup. Though technically you can use the node for other things, the startup process is linear, so the node operator has to wait until the node is fully started.
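For illustration, the kind of query that could replace the full scan might look like this. This is only a sketch: the `payments` table, `status` column, and connection string are assumptions, not LND's actual SQL schema (see #9147 for the real migration work).

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver; a SQLite driver would work the same way.
)

// findInFlightPayments lets the database filter for in-flight payments
// directly, instead of iterating over every historical payment in Go.
// NOTE: the table and column names here are hypothetical.
func findInFlightPayments(db *sql.DB) ([]int64, error) {
	rows, err := db.Query(`SELECT id FROM payments WHERE status = 'in_flight'`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}

func main() {
	// Hypothetical connection string for a local Postgres instance.
	db, err := sql.Open("postgres", "postgres://lnd:lnd@localhost/lnd?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ids, err := findInFlightPayments(db)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("found %d in-flight payments\n", len(ids))
}
```

The point is that the database, not the application, does the filtering, so the work no longer scales with the total number of historical payments.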
You mean you can use the node for everything but send and receive payments? Like you could create an invoice, but not actually receive payment to it. Or query invoice status. Or make on-chain transactions?
I'm also wondering why we can't look at the latest commitment transaction and, if there are no in-flight HTLCs, just skip this payments database scan entirely?
The short answer is you cannot do anything, because the node is still starting up and it will be blocked here. This happens because we reload in-flight payments when the router starts. If we could make the startup of subsystems async, then yeah, you could create invoices, receive HTLCs, open/close channels, basically everything else except send payments.
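To make the "async subsystem startup" idea concrete, here is a minimal sketch using `errgroup`; the subsystem map and `Start` functions are hypothetical and this is not how lnd's startup is actually structured.

```go
// Package startup is a purely illustrative sketch of concurrent subsystem
// startup; it is not lnd's actual code.
package startup

import (
	"context"
	"log"

	"golang.org/x/sync/errgroup"
)

// Start launches every subsystem concurrently so that a slow one (e.g. a
// long payment scan in the router) does not block the others from coming up.
func Start(ctx context.Context, subsystems map[string]func(context.Context) error) error {
	g, ctx := errgroup.WithContext(ctx)
	for name, start := range subsystems {
		name, start := name, start // capture loop variables for the closures
		g.Go(func() error {
			log.Printf("starting %s", name)
			return start(ctx)
		})
	}
	// Wait returns the first error and cancels the shared context for the rest.
	return g.Wait()
}
```

Whether the subsystems can actually start in parallel depends on their dependencies, which is exactly the hard part of breaking the linear startup.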
You can definitely do that, treating the commitment state as the single source of truth, but I don't think it helps much. On the commitment side, when we see an in-flight HTLC, it could be the case that we are forwarding it, or we are sending it. We can determine this by checking whether we have a corresponding incoming HTLC: if we don't, then we are the sender; otherwise we could just be forwarding it, unless we are doing a circular payment, which cannot be decided by looking at the HTLC alone. Then we need to decide which payment the HTLCs belong to. For an MPP, which is made up of multiple HTLCs, we can use the [...]. So I think the complexity comes from the mapping of HTLCs to payments. The lookup size should be much smaller if we have lots of payments. Though I would prefer either indexing the status field so we can [...]
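On the indexing idea, a partial index (supported by both Postgres and SQLite) would keep the in-flight lookup cheap regardless of how many settled or failed payments exist. Again, the table and column names below are assumptions, not the real schema:

```go
// Package paymentsdb shows an illustrative partial index on a hypothetical
// payments table; it is not LND's actual schema or migration code.
package paymentsdb

import "database/sql"

// ensureInFlightIndex creates a partial index covering only in-flight
// payments, so looking them up does not scale with the total payment count.
func ensureInFlightIndex(db *sql.DB) error {
	_, err := db.Exec(`
		CREATE INDEX IF NOT EXISTS payments_in_flight_idx
		ON payments (id)
		WHERE status = 'in_flight'`)
	return err
}
```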
Hmmm, I don't see the payment db there.
When launching LND, it seems as though when `ChannelRouter` starts up it runs `resumePayments`. This seems to run through all historical payments to see if they have any in-flight HTLCs. With millions of payments this can take hours (with Postgres at least), even if we have no in-flight HTLCs. With Postgres we have two postgres processes loading CPU the entire time, in addition to LND loading a CPU.

There are a couple of issues here.
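Conceptually, the startup scan amounts to something like the loop below, which is why the time grows with the total number of payments even when nothing is in flight. This is just an illustration, not the actual `resumePayments` code:

```go
// Package sketch illustrates why the startup cost scales with the total
// number of historical payments; it is not lnd's real implementation.
package sketch

// Payment is a stand-in for a stored payment record.
type Payment struct {
	ID       int64
	InFlight bool
}

// paymentsToResume walks every historical payment looking for in-flight
// HTLCs -- O(total payments), even if the result ends up empty.
func paymentsToResume(all []Payment) []Payment {
	var inFlight []Payment
	for _, p := range all {
		if p.InFlight {
			inFlight = append(inFlight, p)
		}
	}
	return inFlight
}
```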
First, there is poor logging to know what is going on. When this starts, we get a `Channel Router starting` message, which doesn't really inform us that it is going to go through all historical payments. There should be another message after that which says `Starting to look through XXXXX payments for in flight HTLC that need to be settled` or something like that (maybe we change `settled` to `monitored`, because we may not be able to settle yet and it needs to stay in flight). Then, we should have a progress update every 15 to 30 seconds indicating how far along it is, percent-wise, through all XXXXX payments. When it is complete, the next message we currently get is `Authenticated Gossiper starting`, but that also tells me nothing meaningful about this payment scan. I think there should be a message before that which says `Finished locating all payments with in flight HTLC`.
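A rough sketch of the requested progress logging, emitting an update on a fixed interval while the scan runs; the function names, wiring, and exact messages below are made up for illustration:

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// logScanProgress prints a progress line every 15 seconds until stop is
// closed, then prints a completion message. done is read atomically so the
// scanning goroutine can update it cheaply.
func logScanProgress(total int64, done *int64, stop <-chan struct{}) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()

	log.Printf("Starting to look through %d payments for in flight HTLC", total)
	for {
		select {
		case <-ticker.C:
			d := atomic.LoadInt64(done)
			log.Printf("Payment scan progress: %d/%d (%.1f%%)",
				d, total, float64(d)/float64(total)*100)
		case <-stop:
			log.Printf("Finished locating all payments with in flight HTLC")
			return
		}
	}
}

func main() {
	const total = 5_835_721 // e.g. the payment count from the report above
	var done int64
	stop := make(chan struct{})
	finished := make(chan struct{})

	go func() {
		logScanProgress(total, &done, stop)
		close(finished)
	}()

	// Simulated scan; the real work would be the per-payment HTLC check.
	for i := int64(0); i < total; i++ {
		atomic.AddInt64(&done, 1)
	}
	close(stop)
	<-finished
}
```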
The second issue is that it takes so much time to do this in the first place. I think the reason is that with the KV DB schema, we need to search through all payments to see what is still in flight, rather than using a smart DB query to find them. With Postgres, it seems that payments stored in the KV schema are much slower than SQLite with the KV schema, or the old bbolt DB, which could only use the KV schema (I will have to report back on this after doing some more tests to confirm whether this is actually right).
A third issue, to me as a user: during this long search for in-flight payments, can I use the rest of the node? What functions are limited while this is going on?
#9147 seems to move payments from the KV schema to a SQL schema. I think a later PR might be required to make the DB query actually smarter, so that all payments don't need to be checked to see whether they are in flight or not, but instead the DB figures out which payments are still in flight.