-
-
Notifications
You must be signed in to change notification settings - Fork 946
Hung Dispose() caused by infinite socket timeout #1275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
The issue reported by @aniknaemmm in #688 looks like it might have the same root cause. |
Beautiful description. The "secondary thread" in your description is the message listener which runs for the duration of the session, so while infinite timeout could be a smell, there is at least some reason to it. I don't know whether SSH.NET/src/Renci.SshNet/Session.cs Lines 1783 to 1798 in 326ce14
I wonder if it could only attempt to enter the lock: if it can't, then the message listener is still running and we are still connected. And if it can, then the message listener might be stopped and we can perform the poll to get a clearer answer. Something like this: diff --git a/src/Renci.SshNet/Session.cs b/src/Renci.SshNet/Session.cs
index 5bf6d8ee..d68257f5 100644
--- a/src/Renci.SshNet/Session.cs
+++ b/src/Renci.SshNet/Session.cs
@@ -1789,11 +1789,20 @@ namespace Renci.SshNet
return false;
}
- lock (_socketReadLock)
+ if (!Monitor.TryEnter(_socketReadLock))
+ {
+ return true;
+ }
+
+ try
{
var connectionClosedOrDataAvailable = _socket.Poll(0, SelectMode.SelectRead);
return !(connectionClosedOrDataAvailable && _socket.Available == 0);
}
+ finally
+ {
+ Monitor.Exit(_socketReadLock);
+ }
}
}
|
Hi @Rob-Hague. Thanks for the response. I like your suggestion. I tested this solution with my bug demo program and it seems to work. I believe this will solve the problem I am having in my production environment as well. I have created a pull request. See #1280. |
Fixed by #1280 |
The 2023.0.1 version has been released to Nuget: https://www.nuget.org/packages/SSH.NET/2023.0.1 |
Problem
From time to time, my application hangs indefinitely. My application is a headless system that runs on a schedule. When the system hangs, it requires a human operator to notice that the system is hung and then kill the process manually. I have asked the operator to capture process dumps before killing the process. I have analyzed several of these dumps with WinDbg. They each indicate that the system hangs during an SFTP operation.
Based on the information I have gathered, it looks like the problem is a combination of two separate threads that are hanging indefinitely. One thread is waiting for a lock. The other thread is waiting for a socket read operation to complete.
Evidence from WinDbg that a thread is waiting for a lock:
Evidence from WinDbg that another thread is waiting for a socket read operation to complete:
My application code looks something like this:
Note
There is an implicit call to
client.Dispose()
at the end of theusing
block in the code above.Reproduction
My theory is that the following sequence of events occurs:
SshOperationTimeoutException
exception.Dispose()
method of theSftpClient
class is called implicitly at the end of theusing
block.Dispose()
method attempts to acquire the lock in the main thread.The evidence I have gathered so far supports this theory. I have not been able to reproduce this problem the way it happens in the wild. However, I have been able to simulate this scenario with some small code changes. I have created a
bugdemo-01
branch in my fork that demonstrates this scenario. You can find the code here:jkillingsworth@d8e46c8
If you want a local copy, you can clone it like this:
There is a
RunBugDemo
project I added to the solution. You can build and run the code to reproduce this scenario.When you run it, the process hangs indefinitely. You can use task manager to capture a dump of the hung process.
You can open the dump file with WinDbg for analysis. If you enter the
!threads
command, you can see all the threads in the process.Keep in mind the thread identifiers might be different for each unique dump. Also notice the lock counts in the screenshot above.
Enter
~0s
and then!clrstack
to see the stack trace for the main thread.Observe
System.Threading.Monitor.Enter
where it is waiting for the lock.Enter
~4s
and then!clrstack
to see the stack trace for the secondary thread.Observe
System.Net.Sockets.Socket.Receive
where it is waiting to read data from the socket.Debugging
I have included three hardcoded breakpoints in my demo code that simulates this bug. These breakpoints are placed at the key points necessary to understand this issue. You can run the
RunBugDemo
project with a debugger attached to hit the breakpoints. If using Visual Studio, you might find the thread debugging tools helpful.You can open these tool windows when the debugger is running.
Breakpoint #1
Line 1204. While performing the list directory operation, the secondary thread acquires a lock on the
_socketReadLock
object.Line 241. The main thread waits while the secondary thread continues.
Breakpoint #2
Line 256. The main thread throws an exception after the timeout interval has elapsed.
Line 309. The secondary thread is trying to read from a socket when the timeout exception is thrown in the main thread.
Breakpoint #3
Line 1798. The main thread tries to acquire a lock on the
_socketReadLock
object. It blocks here because the secondary thread still has the lock. Notice in the Parallel Stacks pane that this is deep inside theDispose()
call.Line 309. The secondary thread is still trying to read from a socket. The socket read operation never completes, and so the secondary thread never releases the lock.
Solution
I don't fully understand the threading model for this library, so I am not in a position to propose the best possible solution. However, I think it would be wise to use finite timeouts when performing socket operations. Right now, the socket timeouts are hardcoded to use an infinite timeout value with no way for the user to override. A hardcoded timeout of, say, 30 seconds would be a more reasonable value in my estimation.
The following screenshot shows the expected behavior of my demo program.
Instead of hanging indefinitely, I would expect it to throw an exception if there is a problem reading from the socket. My application can handle exceptions such as these without manual intervention. It is the infinite hangs that are causing problems.
The text was updated successfully, but these errors were encountered: