fix: Only apply requestHandlerTimeout to request handler #1474
Conversation
vdusek
left a comment
Okay, so the requestHandlerTimeout is now applied only to the request handler (router(final_context)) and not the whole context pipeline.
It makes sense when I'm reading this.
Do we have any examples of where the previous behavior caused any troubles?
This is correct.
In the JS version, the browser crawler (the Playwright/Puppeteer ancestor) has two kinds of timeout. This is super awkward and not really what the interface promises. However, the Python version doesn't have a navigation timeout. I believe I should add one as part of this PR.
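A minimal sketch of the behavior described above, with hypothetical names (`run_pipeline`, `middlewares`, and the `handler_timeout` parameter are illustrative, not the actual crawlee API): the timeout wraps only the final request handler call, not the whole context pipeline.

```python
import asyncio


async def run_pipeline(context, middlewares, request_handler, handler_timeout: float):
    # The context pipeline (navigation, hooks, ...) runs without
    # the request-handler timeout applied to it...
    for middleware in middlewares:
        context = await middleware(context)

    # ...and only the user-supplied request handler is bounded by it,
    # mirroring what `requestHandlerTimeout` promises.
    await asyncio.wait_for(request_handler(context), timeout=handler_timeout)
```

With this shape, a slow navigation step no longer eats into the handler's budget; only a slow handler itself trips the timeout.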
Pijukatel
left a comment
Could you please add some tests for the correct application of the timeout?
vdusek
left a comment
Docstring suggestions. Otherwise LGTM.
Co-authored-by: Vlada Dusek <[email protected]>
    async def _execute_pre_navigation_hooks(
        self, context: BasicCrawlingContext
    ) -> AsyncGenerator[BasicCrawlingContext, None]:
        self._shared_navigation_timeouts[context] = SharedTimeout(self._navigation_timeout)
I am not sure about this. The context is expanded by many pipeline steps, so this mapping is not very flexible: it will break if you try to apply SharedTimeout over pipeline steps that work on the expanded context.
Maybe it would be better to keep the timeout on the context itself instead of in this mapping on the crawler? I think it also belongs there conceptually: the crawler sets the timeout, but it belongs to the context. Also, only the consumers of this specific context should have access to the timeout; there is no reason to keep it global for the whole crawler.
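A sketch of that alternative (all names hypothetical; the real `BasicCrawlingContext` has a different shape): the timeout is stored as a field on the context, so any step that receives the context, including one operating on an expanded context derived from it, can reach it without a crawler-level mapping.

```python
from __future__ import annotations

import time
from dataclasses import dataclass


class SharedTimeout:
    """A time budget that shrinks as wall-clock time passes (sketch)."""

    def __init__(self, seconds: float) -> None:
        self._deadline = time.monotonic() + seconds

    @property
    def remaining(self) -> float:
        return max(0.0, self._deadline - time.monotonic())


@dataclass
class NavigationContext:
    url: str
    # The timeout travels with the context instead of living in a
    # crawler-level dict keyed by context objects.
    navigation_timeout: SharedTimeout | None = None
```

An expanded context that wraps or extends this object would still carry the same `navigation_timeout`, which is the flexibility the comment above asks for.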
Same here 🙂 I want to avoid exposing the timeout to the request handler, mostly because it just doesn't make sense to me to do so.
Since the tests are now failing because of this, I guess we can agree that this approach is not optimal - I'll iterate on it for a while.
    result.cpu = after_cpu - before_cpu

class SharedTimeout:
Nice idea. I like this a lot.
Just thinking out loud about where this could lead:
In case we need more granular timeouts for specific steps, I think this class could easily be expanded to support that. I am also wondering whether even the final context consumer (the request handler) could just use the timeout from here, if the timeout is set on the context (my other comment).
I'm not sure I follow. Do you think that the request handler should be limited by a shared timeout, or that it should have access to the remaining timeout "budget"?
Please don't focus on the specific interface in my example; it is just about the capability.
Imagine that the context were created something like this:
pipeline_timeout = SharedTimeout(...)
BasicCrawlingContext(
    ...,
    timeouts={
        "WholePipeline": pipeline_timeout,  # Maybe the other timeouts could somehow be limited by this one?
        "Navigation": pipeline_timeout.limited_to(NAVIGATION_LIMIT),
        "RequestHandler": pipeline_timeout.limited_to(HANDLER_LIMIT),
    },
)
And each timeout-protected context consumer would pick the relevant timeout from the context and apply it. Context consumers could even modify the timeouts of the steps that follow them.
For example, users could mutate the "RequestHandler" timeout in pre-navigation hooks.
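One way `limited_to` could behave in this sketch (hypothetical; crawlee does not necessarily provide such a method): the derived budget is capped both by its own limit and by whatever remains of the parent budget, so "Navigation" and "RequestHandler" can never outlive "WholePipeline".

```python
import time


class SharedTimeout:
    """A shrinking time budget that can spawn capped child budgets (sketch)."""

    def __init__(self, seconds: float) -> None:
        self._deadline = time.monotonic() + seconds

    @property
    def remaining(self) -> float:
        return max(0.0, self._deadline - time.monotonic())

    def limited_to(self, seconds: float) -> "SharedTimeout":
        # The child budget can never exceed the parent's remaining time.
        return SharedTimeout(min(seconds, self.remaining))
```

Because a child is created from the parent's *remaining* time, a hook that burns part of the pipeline budget automatically shrinks the budgets derived afterwards.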
No need to block this change for this. If needed, we can discuss here: #1596
Mantisus
left a comment
An excellent solution, using the context for the general navigation timeout!
I stumbled upon this when working on ContextPipeline for the JS version. I'm eager to hear your thoughts 🙂