
Investigate shutdown delay option #20995


Closed
philwebb opened this issue Apr 16, 2020 · 21 comments
Labels
status: superseded An issue that has been superseded by another

Comments

@philwebb
Member

The webinar presented by @ryanjbaxter has an interesting bit of configuration to delay shutdown using a preStop command. This article has some interesting background.

We might be able to offer a similar feature out-of-the-box in Boot and configure it automatically.
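For context, the shell-based preStop delay referenced above is typically configured like this in a pod spec (the 10-second value is illustrative, and this form requires a shell in the container image):

```yaml
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]
```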

@philwebb philwebb added type: enhancement A general enhancement status: pending-design-work Needs design work before any code can be developed labels Apr 16, 2020
@philwebb philwebb added this to the 2.x milestone Apr 16, 2020
@ryanjbaxter

@philwebb actually pulled that from the boot docs 🤫

@philwebb
Copy link
Member Author

I guess I missed those doc updates. They must have happened when I was away so I'm glad I watched the webinar!

@philwebb philwebb modified the milestones: 2.x, 2.4.x Jun 5, 2020
@ckoutsouridis

I think this would be a great feature, given that endpoint propagation and pod deletion happen asynchronously. Also, not all base images have a shell available to run sleep in the preStop hook.

The 2.3 graceful shutdown support is great, but I believe that in k8s we will still have to delay the shutdown, to make sure that new requests do not end up at an endpoint of a pod that is already gone.

@ttddyy
Contributor

ttddyy commented Sep 5, 2020

We have an operational scenario where we directly open a shell on the pod and restart the application. This allows a faster restart of the app since it doesn't involve the k8s pod lifecycle.

In such a case, k8s does not know the application is restarting and still sends traffic to the pod during shutdown. So, in our graceful shutdown logic (in addition to the one from Boot 2.3), we sleep for the duration of the readiness check interval, to make sure the k8s readiness check notices that the app is about to restart and stops routing traffic before the actual shutdown happens.

If Spring Boot provides a sleep(delay) in its graceful shutdown logic, I can remove the sleep logic from our library, which would be great.

@bclozel bclozel modified the milestones: 2.4.x, 2.x Sep 28, 2020
@siwyd

siwyd commented Dec 9, 2020

This would be a handy addition, as right now I always lose some metrics when the application shuts down, because Prometheus hasn't had a chance to scrape the last 'meaningful' metrics before it's killed, causing, for instance, a big (fake) dip in request rate. If I could keep the application around for 30s while traffic is already being diverted to other instances, I could scrape those last metrics.

@dimovelev

Waiting between tripping the readiness probe to REFUSING_TRAFFIC and actually starting to refuse traffic would also be good for non-Kubernetes environments with standard load balancers that use health checks (monitors) to determine whether an instance should be kept in the pool. Right now, even with graceful shutdown activated, the load balancer is going to send a few requests (or many, depending on the request rate) to endpoints that are already shutting down, making it impossible to deploy without downtime.

@wilkinsona
Member

@dimovelev Outside of a K8S environment, the expectation is that the load balancer is instructed to stop routing requests to the app first and that graceful shutdown of the application instance is then initiated. This allows existing in-flight requests to complete while any new requests are routed to a different instance.

@jgslima

jgslima commented Jun 19, 2021

I understand that ideally a real load balancer should not rely solely on the readiness endpoint check. However, we may have architectures and environments where some components have nothing else to base that decision on.

Also, it is reasonable to assume that components that keep checking the readiness endpoint will do so with some frequency (say, every few seconds). From a feature-correctness point of view, then, /actuator/health/readiness is not actually correct with regard to shutdown, since it still reports UP immediately before the server stops accepting requests.

Therefore, an optional delay configuration would make sense and would improve the correctness of Spring Boot as a whole.

I am assuming that currently there is no safe way to implement a custom delay in a Spring Boot application. As far as I understand, one option might be to implement a SmartLifecycle bean that performs a Thread.sleep; however, since the graceful shutdown already uses the maximum possible phase, I assume it would not be guaranteed that this custom bean is called first.
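The phase-ordering concern can be sketched with a small plain-Java simulation (this is not Spring code; the phase value and names are illustrative). Spring stops SmartLifecycle beans in descending phase order, so a delay bean only reliably stops before graceful shutdown if its phase is strictly higher:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PhaseOrderDemo {

    // Stand-in for Spring's SmartLifecycle contract: on shutdown, beans
    // are stopped in *descending* phase order (highest phase stops first).
    interface Lifecycle {
        int getPhase();
        void stop();
    }

    static List<String> stopOrder() {
        int gracefulPhase = Integer.MAX_VALUE - 1024; // illustrative, not Boot's actual constant
        List<String> order = new ArrayList<>();

        Lifecycle gracefulShutdown = new Lifecycle() {
            public int getPhase() { return gracefulPhase; }
            public void stop() { order.add("graceful-shutdown"); }
        };
        // A delay bean with a strictly higher phase is guaranteed to stop first.
        Lifecycle delayBean = new Lifecycle() {
            public int getPhase() { return gracefulPhase + 1; }
            public void stop() { order.add("delay"); }
        };

        List<Lifecycle> beans = new ArrayList<>(List.of(gracefulShutdown, delayBean));
        beans.sort(Comparator.comparingInt(Lifecycle::getPhase).reversed());
        beans.forEach(Lifecycle::stop);
        return order;
    }

    public static void main(String[] args) {
        System.out.println(stopOrder()); // prints [delay, graceful-shutdown]
    }
}
```

If the graceful shutdown lifecycle really used Integer.MAX_VALUE as its phase, no higher phase would exist, which is exactly the concern raised above.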

@nhmarujo

nhmarujo commented Jul 16, 2021

Hi. I would like to propose a possible solution for this.

The problem here is that the app is entering the graceful shutdown, thus rejecting new requests, but the pod may still be in the load balancer for a bit. This means for a brief time new requests may still be forwarded to the pod, but the application server within will reject them.

The preStop sleep does solve the issue, but just like @ckoutsouridis stated, many of us are using distroless images, so that is not really an option for us. Plus, it makes sense to me that Spring Boot manages this out-of-the-box, since it has all the context necessary to do so. Would be cleaner and would solve other issues too (just like the metrics issue reported here).

I tested a simple solution for this and did some load tests with rolling restarts in the middle and stopped having issues (before I was always able to reproduce it).

My solution was to wrap the LivenessStateHealthIndicator and listen for the dying context (ContextClosedEvent). When that happens, I flag that the app is shutting down and force the liveness probe to start reporting OUT_OF_SERVICE. I then sleep for 10s to give the liveness probe enough time to see that the pod is out of service and take it out of the load balancer, before letting the graceful shutdown start.

Here is an example of the code I tried:

import static java.util.concurrent.TimeUnit.SECONDS;

import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.actuate.availability.LivenessStateHealthIndicator;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.availability.ApplicationAvailability;
import org.springframework.context.annotation.Profile;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

@Slf4j
@Component("livenessStateHealthIndicator")
@Profile("!dev")
public class GracefulLivenessStateHealthIndicator extends LivenessStateHealthIndicator {

    // volatile: written by the shutdown thread, read by the health-check thread
    private volatile boolean shuttingDown;

    public GracefulLivenessStateHealthIndicator(ApplicationAvailability availability) {
        super(availability);
    }

    @EventListener(ContextClosedEvent.class)
    public void onContextClosedEvent() {
        if (shuttingDown) {
            return;
        }

        shuttingDown = true;

        try {
            log.info("Waiting 10s before starting shutdown");
            SECONDS.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            log.error("Wait before shutdown was interrupted", e);
        }
        log.info("Wait before shutdown finished");
    }

    @Override
    public Health getHealth(boolean includeDetails) {
        return shuttingDown ? Health.outOfService().build() : super.getHealth(includeDetails);
    }
}

I should make a few notes about it:

  1. I had to name the bean livenessStateHealthIndicator to ensure it replaces the one created by Spring Boot.
  2. ContextClosedEvent seems to be fired twice, so I added a guard to ensure the sleep only runs once.
  3. My condition for creating the bean (@Profile("!dev")) was just for my own test purposes and is not the right one; I was not yet able to figure out how Spring Boot detects that it is running inside k8s (that is probably the right condition).
  4. The sleep value should, of course, be configurable through a property (perhaps with a default of 10s). The important rule for whoever sets it is that it must be bigger than the periodSeconds of the liveness probe, so that the probe is guaranteed to run at least once before the graceful shutdown starts (as a rule of thumb I would suggest 2 x periodSeconds).
  5. Obviously this is not the implementation I propose; rather, I propose that LivenessStateHealthIndicator is changed to ensure similar behaviour.
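To make the periodSeconds rule concrete, with a probe configured like the following (endpoint, port, and periodSeconds are example values), a 10s delay equals 2 x periodSeconds and gives the probe at least one full cycle to observe the OUT_OF_SERVICE state:

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 5
```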

@philwebb philwebb added the for: team-meeting An issue we'd like to discuss as a team to make progress label Aug 10, 2021
@philwebb philwebb removed the for: team-meeting An issue we'd like to discuss as a team to make progress label Aug 27, 2021
@nhmarujo

nhmarujo commented Oct 7, 2021

Hi @philwebb. Have you had a chance to analyse my proposal above?

@philwebb
Member Author

philwebb commented Oct 7, 2021

@nhmarujo I'm afraid we've not had a chance to revisit this one yet.

@nhmarujo

nhmarujo commented Oct 7, 2021

Thanks @philwebb for the prompt answer.

@nhmarujo

nhmarujo commented Jun 7, 2022

Hi. Any moves on this one? 😄

@philwebb
Member Author

philwebb commented Jun 7, 2022

I'm afraid not. Currently most of our focus is on ahead-of-time code generation and GraalVM native image support. I'm sorry it's been so long :(

@nhmarujo

nhmarujo commented Jun 7, 2022

It's ok, I understand. Worth asking anyway. Thanks for the feedback!

Additional note: in my "proposal" in #20995 (comment) I used the wrong probe. I should in fact have extended ReadinessStateHealthIndicator, so it would be:

import static java.util.concurrent.TimeUnit.SECONDS;
import static org.springframework.boot.cloud.CloudPlatform.KUBERNETES;

import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.actuate.availability.ReadinessStateHealthIndicator;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.availability.ApplicationAvailability;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

@Slf4j
@Component("readinessStateHealthIndicator")
public class GracefulReadinessStateHealthIndicator extends ReadinessStateHealthIndicator {

    // volatile: written by the shutdown thread, read by the health-check thread
    private volatile boolean shuttingDown;

    public GracefulReadinessStateHealthIndicator(ApplicationAvailability availability) {
        super(availability);
    }

    @EventListener
    public void onContextClosedEvent(ContextClosedEvent event) {
        if (!KUBERNETES.isActive(event.getApplicationContext().getEnvironment()) || shuttingDown) {
            // Avoid sleeping if not inside k8s or if ContextClosedEvent was already received
            return;
        }

        shuttingDown = true;

        try {
            log.info("Readiness probe set to OUT_OF_SERVICE. Delay before commencing graceful shutdown initiated");
            SECONDS.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            log.error("Delay before commencing graceful shutdown interrupted", e);
        }
        log.info("Delay before commencing graceful shutdown finished");
    }

    @Override
    public Health getHealth(boolean includeDetails) {
        return shuttingDown ? Health.outOfService().build() : super.getHealth(includeDetails);
    }
}

☝️ Although the other version worked in my POC, this one is more accurate, as it is the readiness probe's responsibility to control traffic routing to the pods.

All the remaining comments on the original post still apply.

@philwebb philwebb added this to the 3.x milestone Aug 19, 2022
@sigand

sigand commented Mar 21, 2024

This is still an issue. We implemented a workaround as suggested in #31714, which does work for us.

@nhmarujo

@sigand can you please share your solution? Thanks

@sigand

sigand commented Apr 2, 2024

import java.time.Duration

import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty
import org.springframework.context.SmartLifecycle
import org.springframework.context.annotation.Configuration
// plus the import for WebServerGracefulShutdownLifecycle from Spring Boot's web server support

@Configuration
@ConditionalOnProperty(name = Array("server.shutdown"), havingValue = "graceful")
class PreStopShutdownLifecycle extends SmartLifecycle {

  // sleepDuration was not shown in the original snippet; hard-coded here for completeness
  private val sleepDuration: Duration = Duration.ofSeconds(10)

  private var running: Boolean = false

  override def start(): Unit = {
    running = true
  }

  override def stop(): Unit = {
    Thread.sleep(sleepDuration.toMillis)
    this.running = false
  }

  override def isRunning: Boolean = this.running

  // A higher phase stops earlier, so this runs just before the WebServerGracefulShutdownLifecycle
  override def getPhase: Int = WebServerGracefulShutdownLifecycle.SMART_LIFECYCLE_PHASE + 1
}

@andrej-urvantsev

I've just found this: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

If I understand it correctly, the latest k8s (1.32) already has a sleep preStop option.

We're still on 1.30, so can't verify it at the moment.

@mhalbritter
Contributor

mhalbritter commented Jan 30, 2025

#43830 documented the sleep feature of Kubernetes. With Kubernetes 1.32 you can use this:

lifecycle:
  preStop:
    sleep:
      seconds: 10

No need to have a shell / sleep command in the container.

@philwebb philwebb closed this as not planned Won't fix, can't repro, duplicate, stale Jan 30, 2025
@philwebb philwebb added status: superseded An issue that has been superseded by another and removed type: enhancement A general enhancement status: pending-design-work Needs design work before any code can be developed theme: kubernetes labels Jan 30, 2025
@wilkinsona wilkinsona removed this from the 3.x milestone Jan 31, 2025
@billkoch

For folks that aren't able to upgrade to Kubernetes 1.32, I humbly offer upmc-enterprises-graceful-shutdown-spring-boot-starter (an open source library I maintain) which adds an Actuator endpoint (/actuator/preStopHook/{some-delay-in-ms}) that can be used in the preStop configuration.
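Assuming such an endpoint, the preStop hook could invoke it over HTTP instead of needing a shell (the path, delay value, and port here are assumptions based on the description above):

```yaml
lifecycle:
  preStop:
    httpGet:
      path: /actuator/preStopHook/10000
      port: 8080
```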
