Graceful shutdown of a stream for a single subscription #1201

Open
wants to merge 133 commits into
base: master


svroonland
Collaborator

@svroonland svroonland commented Mar 24, 2024

Implements functionality for gracefully stopping a stream for a single subscription: stop fetching records for the assigned topic-partitions but remain subscribed so that offsets can still be committed. Intended to replace stopConsumption, which did not support multiple-subscription use cases.

A new command EndStreamsBySubscription is introduced, which calls the end method on the PartitionStreamControl of streams matching a subscription. In the method Consumer#runWithGracefulShutdown we then wait for the user's stream to complete, before removing the subscription.

This is experimental functionality, intended to replace stopConsumption at some point. Methods with this new functionality are offered alongside the existing methods to maintain compatibility.

All the fiber and scope trickery proved to be very hard to get right (the lifetime of this PR is a testimony to that), and there may still be subtle issues here. This has now been traced back to issue zio/zio#9288.

Implements some of #941.

@svroonland svroonland changed the title from "Subscription stream control" to "Graceful shutdown of a single subscription" Mar 30, 2024
@svroonland svroonland marked this pull request as ready for review March 30, 2024 11:07
Collaborator

@erikvanoosten erikvanoosten left a comment

I didn't look at the implementation yet, only docs and tests.

Collaborator

@erikvanoosten erikvanoosten left a comment

Still need more time to digest this.

@svroonland
Collaborator Author

svroonland commented Apr 3, 2024

Hmm, should we instead of this:

Consumer.runWithGracefulShutdown(Consumer.partitionedStreamWithControl(Subscription.topics("topic150"), Serde.string, Serde.string)) { 
  stream => ... 
}

offer this:

Consumer.partitionedStreamWithGracefulShutdown(Subscription.topics("topic150"), Serde.string, Serde.string) {
  (stream, _) => stream.flatMapPar(...) 
}

The second parameter would be the SubscriptionStreamControl, which you could always manually call stop on. Or would that prevent certain use cases.. 🤔

@erikvanoosten
Collaborator

> Hmm, should we instead of this:

If I understand it correctly, the proposal allows for more use cases; with it you can also call stop for any condition you want. Is it true that after stopping, you can start consuming again?

@svroonland
Collaborator Author

Well, I mean compared to just the partitionedStreamWithControl method. In both cases you would need to do something with the stream that ultimately reduces to a ZIO of Any, so I don't think the partitionedStreamWithGracefulShutdown is limiting in that regard.

stop currently doesn't support that, since the stream would then be finished. We could probably build pause and resume like in #941.

@erikvanoosten
Collaborator

If resume after stop is not supported (and never will be), then I like the first proposal better where you don't need to call stop. What would you do after calling stop?

@svroonland
Collaborator Author

Well, in both proposals you can call stop.

I don't think you want to do anything after stop, but it would give you more explicit control over when to stop, instead of stopping only when the scope ends.

We probably need to decide if we want to add pause/resume in the future. If we do, we should add the control parameter like in the partitionedStreamWithGracefulShutdown example for future compatibility. If we don't, we can drop it altogether and make SubscriptionStreamControl a purely internal concept (if at all).

@guizmaii
Member

guizmaii commented Apr 5, 2024

Hey :)

Thanks for the great work!

Here's some initial feedback:

I'm not a big fan of the SubscriptionStreamControl implementation.

To me, functions/methods returning it should return a Tuple (stream, control).
It avoids adding one more concept for our users to understand and learn (Kafka already has a lot of concepts)
It also simplifies the interface of the control type, the current one with the [S <: ZStream[_, _, _]] being complex
It also simplifies the return type of our functions/methods, avoiding this kind of type:

SubscriptionStreamControl[Stream[Throwable, Chunk[(TopicPartition, ZStream[R, Throwable, CommittableRecord[K, V]])]]]

in favor of:

(Stream[Throwable, Chunk[(TopicPartition, ZStream[R, Throwable, CommittableRecord[K, V]])]], SubscriptionStreamControl)

Made the change in a PR to show/study how, to me, it simplifies things: https://github.com/zio/zio-kafka/pull/1207/files

@guizmaii
Member

guizmaii commented Apr 5, 2024

Didn't finish my review yet. I still have some parts of the code to explore/understand, but I have to go. I'll finish it later 🙂

@svroonland
Collaborator Author

Thanks for the feedback Jules. Agreed about the extra concept that would be unwanted. Check out my latest interface proposal where there is only a plainStreamWithGracefulShutdown method and SubscriptionStreamControl remains hidden.

Collaborator

@erikvanoosten erikvanoosten left a comment

Still reading the code...

@erikvanoosten
Collaborator

erikvanoosten commented Apr 7, 2024

I understand now that when graceful shutdown starts we're ending the subscribed streams. That should work nicely. Let's work out what happens next in the runloop. The runloop would still be happily fetching records for that stream. When those are offered to the stream, PartitionStreamControl.offerRecords will probably append those records to the queue (even though it now also contains an 'end' token). Because of the 'end' token that is already in that queue, these new records will never be taken out. Back pressure will kick in (depending on the fetch strategy) and the partitions will be paused. Once we're unsubscribed, 15 seconds later, the queue will be garbage collected. So far so good.

We can do slightly better though. We're fetching and storing all these records in the queue for nothing, even potentially causing an OOM for systems that are tuned for the case where processing happens almost immediately.

My proposal is to:

  1. stop accepting more records in PartitionStreamControl.offerRecords when the queue was ended
  2. in Runloop.handlePoll only pass running streams to fetchStrategy.selectPartitionsToFetch so that partitions for ended streams are immediately paused

If you want, I can extend this PR with that proposal (or create a separate PR).
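
To make the two proposed changes concrete, here is a minimal, library-free sketch. Everything in it is hypothetical: `Tp`, `StreamControl` and `partitionsToFetch` are simplified stand-ins invented for illustration, not the actual zio-kafka types or the code from this PR. It only shows the intended behaviour: an ended stream drops offered records, and ended streams are excluded before partitions are selected for fetching.

```scala
import scala.collection.mutable

object EndedStreamSketch {
  // Hypothetical stand-in for a topic-partition; not the Kafka class.
  final case class Tp(topic: String, partition: Int)

  // Hypothetical, heavily simplified stand-in for PartitionStreamControl.
  final class StreamControl(val tp: Tp) {
    private val queue = mutable.Queue.empty[String]
    private var ended = false

    def isRunning: Boolean = !ended
    def queueSize: Int     = queue.size

    // Proposal 1: once the stream has ended, drop offered records
    // instead of queueing them behind the end marker.
    def offerRecords(records: Seq[String]): Unit =
      if (!ended) records.foreach(queue.enqueue(_))

    // Marks the stream as ended; calling it again changes nothing.
    def end(): Unit = { ended = true }
  }

  // Proposal 2: only running streams are candidates for fetching.
  // Partitions of ended streams are left out, so the caller can pause them.
  def partitionsToFetch(streams: Seq[StreamControl]): Set[Tp] =
    streams.filter(_.isRunning).map(_.tp).toSet
}
```

In this sketch the memory concern disappears because nothing is buffered for a stream that will never read it, and back pressure is no longer needed to get the partition paused.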

@svroonland
Copy link
Collaborator Author

@erikvanoosten If you have some time to implement those two things, by all means.

@erikvanoosten
Collaborator

erikvanoosten commented Apr 13, 2024

> @erikvanoosten If you have some time to implement those two things, by all means.

@svroonland Done in commit 1218204.

Now I am wondering, how can we test this?

@svroonland
Collaborator Author

svroonland commented Apr 14, 2024

Change looks good. Totally forgot to implement this part.

@svroonland
Collaborator Author

I was preparing some work for removing stopConsumption but I'm now beginning to wonder if stopping the stream via interruption is the best API. With stopConsumption, we could stop the streams and access the result of the runCollect or something like that. With this interruption-based stopping, we can no longer return a value from with*Stream. Not sure how useful that is in practice, but it is used in our unit tests.

We've discussed it earlier (about a year ago), but perhaps this alternative API is more powerful for our users, for usage scenarios we haven't thought of: #1501

Collaborator

@erikvanoosten erikvanoosten left a comment

> The Scope provided to the returned effect controls the lifetime of the subscription. The subscription is unsubscribed when the scope is ended. Calling [[StreamControl.end]] stops fetching data for the subscription partitions but will not unsubscribe until the Scope is ended.

How do you coordinate calling end and closing the scope? Does end return immediately, or does it block until it's safe to close the scope? Or is there another way to know the scope may be closed?

@svroonland
Collaborator Author

> How do you coordinate calling end and closing the scope? Does end return immediately, or does it block until it's safe to close the scope? Or is there another way to know the scope may be closed?

end returns when the command has been processed by the Runloop, so after all relevant PartitionStreamControls have put a Take.end in their dataQueue. It's okay to call end more than once (see unit test).

When the scope is closed without end having been called, there is no graceful shutdown.

Does that make sense?

@erikvanoosten
Collaborator

Yes, that makes sense. However, could we also wait until the stream provided by the user ends?

@svroonland
Collaborator Author

The intended usage pattern of this construct is something like:

streamControl <- Consumer.plainStreamWithControl
fib <- streamControl.stream.tap(..).runDrain.forkScoped
...
// At some point
_ <- streamControl.end
_ <- fib.join

So there is no additional need to wait for the stream to end when closing the scope. In fact, if we did that we might actually limit usability in some use cases we haven't thought of yet. If you use Consumer.runWithGracefulShutdown you can just interrupt the fiber and it will end and wait for the stream to complete.
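
The "end, then join" pattern can be illustrated without ZIO or Kafka. The sketch below is only an analogy using the standard library: `Control`, `Item`, `Record`, `End` and `run` are hypothetical names, a `Future` plays the role of the forked fiber, and `Await.result` plays the role of `fib.join`. It also shows `end` being safe to call more than once.

```scala
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable.ListBuffer
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object EndThenJoinSketch {
  // Hypothetical queue items: either a record or an end-of-stream marker.
  sealed trait Item
  final case class Record(value: String) extends Item
  case object End extends Item

  // Hypothetical stand-in for the stream control; not the zio-kafka type.
  final class Control {
    private val queue = new LinkedBlockingQueue[Item]()
    private val ended = new AtomicBoolean(false)

    // Records offered after `end` are ignored.
    def offer(value: String): Unit =
      if (!ended.get()) queue.put(Record(value))

    // Idempotent: only the first call enqueues the end marker.
    def end(): Unit =
      if (ended.compareAndSet(false, true)) queue.put(End)

    // Blocks until the end marker is seen and returns everything before it.
    def drain(): List[String] = {
      val seen = ListBuffer.empty[String]
      var done = false
      while (!done) {
        queue.take() match {
          case Record(v) => seen += v
          case End       => done = true
        }
      }
      seen.toList
    }
  }

  def run(): List[String] = {
    val control = new Control
    control.offer("a")
    control.offer("b")
    val consumer = Future(control.drain()) // plays the role of the forked fiber
    control.end()                          // request graceful end
    control.end()                          // second call is a no-op
    control.offer("c")                     // ignored: the stream has ended
    Await.result(consumer, 5.seconds)      // plays the role of fib.join
  }
}
```

The consumer finishes on its own once it reaches the end marker, which is why the caller only has to wait for it rather than interrupt it.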

Collaborator

@erikvanoosten erikvanoosten left a comment

Nice! I'll go through it once more from top to bottom. Perhaps this evening.

@erikvanoosten
Collaborator

BTW, is the PR description up to date?

Collaborator

@erikvanoosten erikvanoosten left a comment

Unfortunately, I thought of a new problem. When there is a rebalance, new partitions might be assigned to this consumer, even though the consumer is trying to do a graceful shutdown! We need to make sure NOT to start new streams for subscriptions for which we received an EndStreamsBySubscription.


Use the `*StreamWithControl` variants of `plainStream`, `partitionedStream` and `partitionedAssignmentStream` for this purpose. These methods return a `StreamControl` object allowing you to stop fetching records and terminate the execution of the stream gracefully.

There is also the `Consumer.runWithGracefulShutdown` method which can gracefully terminate the stream upon fiber interruption, useful for a controlled shutdown when your application is terminated.
Collaborator

Consumer.runWithGracefulShutdown could use its own section (with example) in this document. IMHO it should go before this section so that programmers who are in a hurry find it quickly.

Collaborator Author

Agreed, added to docs

Comment on lines +448 to +450
_ <- ZIO.foreachDiscard(
       state.assignedStreams.filter(stream => Subscription.subscriptionMatches(subscription, stream.tp))
     )(_.end)
Collaborator

Suggested change
- _ <- ZIO.foreachDiscard(
-        state.assignedStreams.filter(stream => Subscription.subscriptionMatches(subscription, stream.tp))
-      )(_.end)
+ _ <- ZIO.foreachDiscard {
+        state.assignedStreams.filter(stream => Subscription.subscriptionMatches(subscription, stream.tp))
+      }(_.end)

@erikvanoosten
Collaborator

Okay, my review is complete. Finally 🙂
Only a few things left to address and then I think it's good to go!

@svroonland
Collaborator Author

> Unfortunately, I thought of a new problem. When there is a rebalance, new partitions might be assigned to this consumer, even though the consumer is trying to do a graceful shutdown! We need to make sure NOT to start new streams for subscriptions for which we received an EndStreamsBySubscription.

Ouch, we'll need to think about how to implement and test this..

@erikvanoosten
Collaborator

> we'll need to think about how to implement and test this..

I can't think of anything other than adding the terminating subscriptions to the runloop state. They can be removed as soon as the subscription is actually terminated.
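
A rough sketch of that idea, assuming a pure state value that is updated on each runloop event. All names here (`Tp`, `Subscription`, `State`, `beginEnd`, `onAssigned`, `removed`) are hypothetical and much simpler than the real runloop state; in particular, this `Subscription` only matches on topic name.

```scala
object EndingSubscriptionsSketch {
  // Hypothetical stand-in for a topic-partition.
  final case class Tp(topic: String, partition: Int)

  // Hypothetical subscription that matches on topic name only.
  final case class Subscription(topics: Set[String]) {
    def matches(tp: Tp): Boolean = topics.contains(tp.topic)
  }

  // Hypothetical slice of the runloop state.
  final case class State(streams: Set[Tp], ending: Set[Subscription]) {

    // Called when an end request for a subscription is received.
    def beginEnd(s: Subscription): State =
      copy(ending = ending + s)

    // Called on rebalance: returns the new state and the partitions for
    // which a stream should be started. Partitions that belong to a
    // subscription that is ending get no new stream.
    def onAssigned(assigned: Set[Tp]): (State, Set[Tp]) = {
      val toStart = assigned.filterNot(tp => ending.exists(_.matches(tp)))
      (copy(streams = streams ++ toStart), toStart)
    }

    // Called once the subscription is actually terminated.
    def removed(s: Subscription): State =
      copy(streams = streams.filterNot(s.matches), ending = ending - s)
  }
}
```

Because `onAssigned` is a pure function in this sketch, the rebalance case could be unit tested by feeding it an assignment that contains partitions of an ending subscription and checking that they are not in the returned set.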


3 participants