Add option to set worker healthcheck timeout #2500

Open · wants to merge 1 commit into master

Conversation


@mmac-m3a commented Nov 1, 2024

Summary

Adds a new command-line and config option, worker_healthcheck_timeout, which sets the timeout the supervisor uses when checking worker liveness while multiple workers are in use. The default timeout is unchanged, as is the frequency of health checks.
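
As a rough illustration of the intended usage: the option name above comes from this PR, but the exact CLI flag spelling and the uvicorn.run() keyword below are assumptions, not a confirmed API.

```python
# Hypothetical usage sketch; exact spelling depends on how this PR lands.
# The CLI form would presumably be:
#   uvicorn app.main:app --workers 4 --worker-healthcheck-timeout 30
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app.main:app",                   # made-up app path
        workers=4,                        # health checks only matter with multiple workers
        worker_healthcheck_timeout=30.0,  # proposed option: seconds the supervisor waits for a liveness response
    )
```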

Rationale

Applications with CPU-intensive synchronous startup may starve the worker process of CPU cycles and cause the pong thread to respond too late, which in turn makes the supervisor kill and relaunch the worker.
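
To make the failure mode concrete, here is a rough reproduction sketch (not part of this PR): an app that does several seconds of pure-CPU work at import time, while the worker is still starting, which can delay the liveness response past the supervisor's timeout. The app and workload are made up.

```python
# Made-up reproduction sketch: CPU-heavy synchronous work at module import,
# i.e. while the worker process is still starting up. Under the GIL this can
# starve the health-check (pong) thread long enough to miss the deadline.
def build_lookup_table() -> list[int]:
    return [sum(i * j for j in range(2_000)) for i in range(50_000)]

LOOKUP = build_lookup_table()  # blocks for several seconds before the worker is ready

async def app(scope, receive, send):
    # Minimal ASGI app so the sketch runs under uvicorn without extra dependencies.
    assert scope["type"] == "http"
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": str(len(LOOKUP)).encode()})
```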

Checklist

  • I understand that this PR may be closed in case there was no previous discussion. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.

Rationale for no explicit test

I was not able to create a simple unit test that reliably triggers the health check timeout. I can trigger it 100% reliably in my application, and I've also verified that a longer health check timeout resolves the problem.

@omid-jf commented Dec 5, 2024

I was receiving many "Child Process Died" messages in my FastAPI application. Manually increasing the timeout mentioned in this PR fixed my issue.

@racinmat

Hi, is anything blocking this PR from being merged? It would solve #2450 for many cases, because you could pass a higher timeout.

@vjeranc commented May 20, 2025

This was discussed a while ago, and the fix of increasing the timeout was known then, so I'm not sure what the blocker is.

IMO, there should be no pinging or ponging. The OS is not stupid: either a process is suspended or it is not. If a process is stuck, whether in a deadlock or in a CPU-heavy operation, and would therefore be considered "inactive", that's a separate issue.

Anyone hosting web apps should have their own health checks and restart the process/container.

Similarly, worker processes in uvicorn are created with a very expensive fork. It would be better to just invoke uvicorn N times at the web-app level and remove the "workers" concept entirely. The savings at the socket level (all processes reading from the same socket) are abysmal, whereas a cheaper fork would allow a lower memory footprint, a much bigger saving, which is not achieved now.

If a cheaper fork were used instead, it would make some sense to manage process liveness at the uvicorn level. With the current expensive fork, nothing is gained by the workers=N argument; as we see in this case, it just creates more bugs.
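
For reference, a rough sketch of the "invoke uvicorn N times" approach described above; the launcher, app path, and port range are hypothetical. In practice a reverse proxy or load balancer would spread traffic across the ports, and an external supervisor would handle restarts.

```python
# Hypothetical launcher: run N independent uvicorn processes instead of workers=N.
# App path and ports are made up for illustration.
import subprocess
import sys

N = 4
procs = [
    subprocess.Popen(
        [sys.executable, "-m", "uvicorn", "app.main:app", "--port", str(8000 + i)]
    )
    for i in range(N)
]

for proc in procs:
    proc.wait()  # liveness and restarts are left to an external supervisor (systemd, k8s, ...)
```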
