Move `update-downloads` to a background job #1798

Conversation
r? @jtgeibel (rust_highfive has picked a reviewer for you, use r? to override)
This replaces the `update-downloads` binary with a background job, and a binary that is used to queue up a given job, which we will run from Heroku scheduler. This accomplishes two things:

- It makes it easier to write tasks that need to run periodically (e.g. cleaning up stale rate limit buckets), since we don't need to create a new standalone binary.
- `update_downloads` and any future recurring tasks will automatically get monitoring if they fail, since we are already monitoring for background jobs not being run successfully.

Right now the intent is to have `enqueue-job update_downloads` get run periodically by Heroku scheduler (and a similar scheduled task for any future tasks that are added). Once swirl gains the ability to schedule jobs to be run at arbitrary points in the future, we could instead have these jobs re-queue themselves once they complete, and have the cron task just look to see if any job is queued for each given type, queuing it if not. That would have a bit less boilerplate, but a lot more complexity.

Fixes rust-lang#1797.
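For illustration only, here is a minimal sketch of what an `enqueue-job`-style binary invoked by Heroku Scheduler could look like. The `enqueue_update_downloads` helper is a hypothetical placeholder, not swirl's real enqueue API; the panic-on-unknown-job style mirrors the minimalistic error handling mentioned in the review below.

```rust
use std::env;
use std::process;

// Placeholder for whatever actually inserts the job into the queue; the real
// implementation would talk to the database via the background job library.
fn enqueue_update_downloads() -> Result<(), Box<dyn std::error::Error>> {
    println!("queued update_downloads");
    Ok(())
}

fn main() {
    // `enqueue-job <job-name>` is what the scheduler would invoke, e.g.
    // `enqueue-job update_downloads`.
    let job = env::args().nth(1).unwrap_or_else(|| {
        eprintln!("Usage: enqueue-job <job-name>");
        process::exit(1);
    });

    // Minimal command-line parsing: unknown job names are a hard error.
    let result = match job.as_str() {
        "update_downloads" => enqueue_update_downloads(),
        other => panic!("Unknown job: {}", other),
    };

    result.expect("failed to enqueue job");
}
```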
3d2ef0d to 6b9dbb5
For what it's worth, this looks good to me. I tested the code locally by manually enqueueing a job and seeing the background worker successfully pick it up and update the download counts. The command-line parsing and error reporting via panic is rather minimalistic, but it's definitely good enough for this use case. Incidentally, I worked on a prototype for #630 in the last few days, and did something very similar to allow running the database dumps periodically. I will switch my prototype over to this implementation.
LGTM! @bors r+
📌 Commit 6b9dbb5 has been approved by `jtgeibel`
…jtgeibel Move `update-downloads` to a background job
☀️ Test successful - checks-travis
An incident was caused by rust-lang#1798. There is a description below if you're interested, but this PR does not fix the problem. However, the band-aid fix to get things running again is to increase the timeout for the job runner. When responding to an incident, waiting for a full rebuild to change this is not acceptable. This replaces the hard-coded value with an environment variable so we can quickly change it on the fly in the future.

Description of the actual problem that this does not fix:

The problem was that the `update_downloads` job takes longer than the timeout we had set for jobs to begin running. So swirl would start the `update_downloads` job, try to spawn another worker, and then time out waiting to hear from that worker whether it got a job or not. So we would crash the process, the job would be incomplete, and we'd just start the whole thing over again.

There are several real fixes for this, and I will open a PR that is some combination of all of them. Ultimately each of these fixes just increases the number of slow concurrent jobs that can be run before we hit the timeout and the problem re-appears, but that's fundamentally always going to be the case... If we are getting more jobs than we can process, we do need to get paged so we can remedy the situation. Still, any or all of these will be the "real" fix:

- Increasing the number of concurrent jobs
- Increasing the timeout
- Re-building the runner before crashing. The reason this would fix the issue is that by not crashing the process, we give the spawned threads a chance to finish. We do still want to *eventually* crash the process, as there might be something inherent to this process or machine preventing the jobs from running, but starting with a new thread/connection pool a few times gives things a better chance to recover on their own.
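A rough sketch of the band-aid described above: read the runner timeout from an environment variable with a fallback, so it can be changed without a rebuild. The variable name `BACKGROUND_JOB_TIMEOUT` and the 30-second default are assumptions for illustration, not necessarily the values crates.io uses.

```rust
use std::env;
use std::time::Duration;

/// Read the job runner timeout from the environment so it can be changed on
/// Heroku without rebuilding. Falls back to a default if unset or unparsable.
fn job_runner_timeout() -> Duration {
    let secs = env::var("BACKGROUND_JOB_TIMEOUT") // assumed variable name
        .ok()
        .and_then(|s| s.parse::<u64>().ok())
        .unwrap_or(30); // assumed default, in seconds
    Duration::from_secs(secs)
}

fn main() {
    println!("job runner timeout: {:?}", job_runner_timeout());
}
```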
A brief incident was caused by rust-lang#1798. A band-aid fix is in place, and rust-lang#1803 (included in this branch) makes it possible to apply similar band-aids in the future without requiring a rebuild of the code. This commit attempts to better address the root problem though.

The short version (which is expanded on below, but not required to understand this commit or why it's needed) is that `update_downloads` takes longer than our job timeout to run. When we moved that task to a background job, we did not increase the number of concurrent jobs, nor did we increase the timeout. This meant that swirl timed out trying to start new jobs, and our behavior in that case was to crash the process. This would mean that `update_downloads` never completes, and remains at the front of the queue. This PR addresses all 3 of the problematic cases.

- Increasing concurrency

  When this system was added, the only jobs we had were index updates. These want to be serial, so we set the thread pool size to 1. We added readme renderings, which probably should have been parallel, but only happen with crate publishes anyway, so it was fine. `update_downloads` *always* takes longer than the timeout to run though. We can't have it block everything else while it's running. The main downside is that index updates are no longer guaranteed to run in serial, which means that if two crates are uploaded simultaneously, one job will fail and will have to wait for a retry to update the index. In theory, if a crate happened to be uploaded at the exact instant of the retry 7 or 8 times in a row, this could even result in getting paged. This is exceptionally unlikely, and I'm not concerned about it for now. As more features land in swirl we may want to move index updates to their own queue or tweak the retry behavior on that job though. Swirl will eventually handle this for us by default, and we should use its defaults once that lands.

- Increasing the default timeout

  10s was a bit too aggressive. Fundamentally there is always a condition where we hit this timeout; when we do, the reason is that we are receiving more jobs than we can process (either because of the volume of jobs, or because our jobs are too slow). The most common reason we would hit this is that all threads are occupied by a job which takes longer than the timeout to execute. Increasing the concurrency makes this less likely to occur since our jobs are low volume, but we were actually seeing this crash before the addition of `update_downloads`, meaning that our other jobs are sometimes taking >10s to run. Increasing the concurrency beyond 2 would make it extremely unlikely we will ever hit this, but since we theoretically can with a burst of crate uploads at any concurrency, I've also upped the timeout.

- Rebuild the runner a few times before crashing the process

  This is the most important change, though it's the only one that wouldn't fix the problem by itself. The first two changes address why the problem occurred; this last change addresses why it placed us in an unrecoverable state. What would happen is we would time out trying to start another job after `update_downloads`, and then the process would crash. This would mean that `update_downloads` would never complete, so as soon as we restarted, we'd just try to run it again (I may also change swirl to increment the retry counter before even beginning to run the job, but there are issues with that which are out of scope for this commit to discuss).

This commit changes the behavior to instead build a new runner (which means a new thread pool and DB pool) up to 5 times before crashing the process. This means that any spawned threads will get a bit more time to run before the process itself crashes, so any jobs clogging the runner still get a chance to complete. I've opted to have a hard limit on the number of failures in the runner to avoid potentially unbounded growth in DB connections. We do still want to eventually fail, since being unable to start jobs can indicate issues that are only solved by starting a new process or moving to another physical machine.

More specific technical details on the issue, which are not required to review this PR but may be interesting (I've written this up at sgrif/swirl#16 as well):

The main entry point for a Swirl runner today is `run_all_pending_jobs`. This method is fairly low level. The intent is to eventually add a "reasonable defaults" binary shipped with swirl, probably somewhat based on what crates.io needs here. This method will run in a loop, attempting to fully saturate its thread pool on each iteration. It will check the number of available threads, spawning that many tasks. Each task that is spawned will quickly communicate back to the coordinator via an mpsc channel. The coordinator keeps track of how many messages it's expecting (we get exactly 1 message per spawned task). If we aren't currently expecting any messages, and there are also 0 available threads, we will attempt to spawn 1 task no matter what. This is to ensure we don't loop forever waiting for a free thread, and respect the given timeout. We do this in a loop until we hear from a thread that there was no job available, or receive an error (caused by a thread being unable to get a DB connection, an error loading the job from the DB [which should only happen if the DB has gone away], or if we time out waiting to hear back at all).

That's exactly what happened in this case. We would see 1 available thread, spawn 1 task, and have 1 pending message. The worker would communicate back that it got a job. We'd loop. There are 0 available threads. We are expecting 0 messages, so we spawn 1 task anyway. We are now expecting 1 pending message. We block waiting for it. The only way we will receive a message is for the job we started in the first iteration to complete before the timeout. It doesn't, so `run_all_pending_jobs` returns an error. Our runner was calling `.expect` on that, so the process crashes.

This shows several issues, both in the configuration that was being used by crates.io and in Swirl itself. I discussed the configuration issues above, but there are also questions WRT Swirl's design. The first issue is whether this case should be separated from not getting a response from the worker at all. The latter should *never* happen under reasonable circumstances, so my gut is that we can assume if it does happen it was due to this case... The second issue is that this put us in an unrecoverable state rather than causing one class of jobs to fail to run. This could be prevented by incrementing the retry counter outside of a transaction before running the job. This has issues though, which are out of scope for this commit, but basically boil down to introducing non-atomic pieces to an otherwise atomic operation.
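A rough sketch of the "rebuild the runner up to 5 times before crashing" behavior described above, using a placeholder `Runner` type so the example is self-contained; the real worker wraps swirl's runner and its `run_all_pending_jobs` entry point, whose exact signature may differ.

```rust
use std::thread;
use std::time::Duration;

// Placeholder runner so the sketch compiles on its own; in the real worker
// this wraps swirl's runner, configured with the concurrency and timeout.
struct Runner;

impl Runner {
    fn run_all_pending_jobs(&self) -> Result<(), String> {
        // The real method saturates the thread pool and returns an error if
        // it times out waiting to hear back from a spawned worker.
        Ok(())
    }
}

fn build_runner() -> Runner {
    // Building a runner implies a fresh thread pool and DB connection pool.
    Runner
}

fn main() {
    const MAX_FAILURES: usize = 5;
    let mut failures = 0;
    let mut runner = build_runner();

    loop {
        match runner.run_all_pending_jobs() {
            Ok(()) => failures = 0,
            Err(e) => {
                failures += 1;
                eprintln!("job runner error ({} of {}): {}", failures, MAX_FAILURES, e);
                if failures >= MAX_FAILURES {
                    // Give up eventually: being unable to start jobs can mean
                    // the process or machine itself needs replacing.
                    panic!("job runner failed {} times in a row", MAX_FAILURES);
                }
                // Otherwise rebuild the runner instead of crashing, so threads
                // still running a slow job (e.g. update_downloads) can finish.
                runner = build_runner();
            }
        }
        thread::sleep(Duration::from_secs(1));
    }
}
```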
…meout, r=jtgeibel Configure the background job timeout via an environment variable
Make the job runner a bit more resilient to slow jobs or other errors