@janbuchar
Collaborator

I stumbled upon this when working on ContextPipeline for the JS version. I'm eager to hear your thoughts 🙂

@janbuchar janbuchar added t-tooling Issues with this label are in the ownership of the tooling team. adhoc Ad-hoc unplanned task added during the sprint. labels Oct 10, 2025
@janbuchar janbuchar requested review from Pijukatel and vdusek October 10, 2025 15:59
@github-actions github-actions bot added this to the 125th sprint - Tooling team milestone Oct 10, 2025
Collaborator

@vdusek vdusek left a comment

Okay, so the requestHandlerTimeout is now applied only to the request handler (router(final_context)) and not the whole context pipeline.

It makes sense when I'm reading this.

Do we have any examples of where the previous behavior caused any troubles?

@janbuchar
Collaborator Author

> Okay, so the requestHandlerTimeout is now applied only to the request handler (router(final_context)) and not the whole context pipeline.

This is correct.

> Do we have any examples of where the previous behavior caused any troubles?

In the JS version, the browser crawler (playwright/puppeteer ancestor) has two kinds of timeout - navigationTimeout and requestHandlerTimeout. Then it does this: https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L395

This is super awkward and not really what the interface promises. However, the Python version doesn't have a navigation timeout. I believe I should add that as part of this PR.
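
To illustrate the behavior discussed above, here is a minimal sketch using plain asyncio. The names `run_pipeline`, `slow_navigation`, and `fast_handler` are hypothetical stand-ins and not part of the crawlee API. The sketch only assumes that navigation runs as an earlier pipeline step and that the handler timeout wraps the final handler call alone.

```python
import asyncio
from datetime import timedelta


async def run_pipeline(navigate, handler, handler_timeout: timedelta):
    # Hypothetical stand-in for the context pipeline: navigation runs
    # outside of the request handler timeout.
    context = await navigate()
    # Only the final request handler call is limited by the handler timeout.
    return await asyncio.wait_for(handler(context), timeout=handler_timeout.total_seconds())


async def slow_navigation():
    await asyncio.sleep(0.2)  # deliberately longer than the handler timeout below
    return {"url": "https://example.com"}


async def fast_handler(context):
    return f"handled {context['url']}"


# Navigation takes 0.2 s, yet the 0.1 s handler timeout is not exceeded,
# because it only measures the handler itself.
result = asyncio.run(run_pipeline(slow_navigation, fast_handler, timedelta(seconds=0.1)))
print(result)  # prints "handled https://example.com"
```

If the timeout wrapped the whole pipeline instead, the slow navigation in this sketch would consume the handler's budget before the handler even started.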

Collaborator

@Pijukatel Pijukatel left a comment

Could you please add a test for the correct application of the timeout?

@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Nov 28, 2025
Collaborator

@vdusek vdusek left a comment

Docstring suggestions. Otherwise LGTM.

async def _execute_pre_navigation_hooks(
    self, context: BasicCrawlingContext
) -> AsyncGenerator[BasicCrawlingContext, None]:
    self._shared_navigation_timeouts[context] = SharedTimeout(self._navigation_timeout)

Collaborator

I am not sure about this. The context is expanded by many pipeline steps, so this mapping is not very flexible: it will break if you try to apply SharedTimeout over pipeline steps that work on an expanded context.

Maybe it would be better to keep the timeout on the context itself instead of in this mapping on the crawler? I think it also belongs there conceptually: the crawler sets the timeout, but it belongs to the context. Also, only the consumers of this specific context should have access to the timeout, so there is no reason to keep it global for the whole crawler.

Collaborator Author

Same here 🙂 I want to avoid exposing the timeout to the request handler, mostly because it just doesn't make sense to me to do so.

Since the tests are now failing because of this, I guess we can agree that this approach is not optimal - I'll iterate on it for a while.

result.cpu = after_cpu - before_cpu


class SharedTimeout:

Collaborator

Nice idea. I like this a lot.

Just thinking out loud about where this could lead:
If we need more granular timeouts for specific steps, I think this class could easily be extended to support that. I wonder whether even the final context consumer (the request handler) could take its timeout from here, if the timeout were set on the context (see my other comment).
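
For readers unfamiliar with the idea of a shared timeout, the following is a rough sketch of a time budget that several sequential steps draw from. It is not the SharedTimeout implementation from this PR. The class `SharedBudget` and its `run` method are hypothetical names chosen for illustration, and the step durations are made up.

```python
import asyncio
import time
from datetime import timedelta


class SharedBudget:
    """Hypothetical time budget shared by several sequential async steps."""

    def __init__(self, timeout: timedelta) -> None:
        self.remaining = timeout.total_seconds()

    async def run(self, coro):
        # Each step may use at most what is left of the shared budget.
        started = time.monotonic()
        try:
            return await asyncio.wait_for(coro, timeout=self.remaining)
        finally:
            # Subtract the time this step consumed, never going below zero.
            self.remaining = max(0.0, self.remaining - (time.monotonic() - started))


async def main():
    budget = SharedBudget(timedelta(seconds=0.3))
    await budget.run(asyncio.sleep(0.1))  # e.g. a pre-navigation hook
    after_first = budget.remaining
    try:
        await budget.run(asyncio.sleep(1))  # e.g. a navigation step that is too slow
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return after_first, timed_out, budget.remaining


after_first, timed_out, remaining = asyncio.run(main())
```

In this sketch the second step times out after roughly 0.2 seconds rather than 0.3, because the first step already consumed part of the budget.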

Collaborator Author

I'm not sure I follow, do you think that the request handler should be limited by a shared timeout? Or that it should have access to the remaining timeout "budget"?

Collaborator

Please do not focus on the specific interface in my example. It is just about the capability.

Imagine that the context would be created something like this:

pipeline_timeout = SharedTimeout(...)

BasicCrawlingContext(.....,
    timeouts={
        "WholePipeline": pipeline_timeout,  # Maybe the other timeouts could be somehow limited by this one?
        "Navigation": pipeline_timeout.limited_to(NAVIGATION_LIMIT),
        "RequestHandler": pipeline_timeout.limited_to(HANDLER_LIMIT),
    })

And each timeout-protected context consumer would pick the relevant timeout from the context and apply it. Context consumers could even modify the timeouts of the steps that follow them.

For example, users could mutate the "RequestHandler" timeout in pre-navigation hooks.
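
One way to read the `limited_to` idea above is sketched below. This is purely illustrative: `limited_to` is not an existing crawlee API, the `Budget` class and its `consume` method are hypothetical, and the 60/30/45 second values are arbitrary. The sketch assumes that a child timeout has its own cap while also drawing from its parent's budget.

```python
from __future__ import annotations

from datetime import timedelta


class Budget:
    """Hypothetical budget; `limited_to` is not an existing crawlee API."""

    def __init__(self, remaining: timedelta, parent: Budget | None = None) -> None:
        self._remaining = remaining
        self._parent = parent

    def limited_to(self, limit: timedelta) -> Budget:
        # The child has its own cap but also draws from the parent budget.
        return Budget(limit, parent=self)

    @property
    def remaining(self) -> timedelta:
        # A child can never have more time left than its parent.
        if self._parent is None:
            return self._remaining
        return min(self._remaining, self._parent.remaining)

    def consume(self, spent: timedelta) -> None:
        # Time spent in a step reduces its own budget and every ancestor's budget.
        self._remaining = max(timedelta(0), self._remaining - spent)
        if self._parent is not None:
            self._parent.consume(spent)


pipeline = Budget(timedelta(seconds=60))
timeouts = {
    "WholePipeline": pipeline,
    "Navigation": pipeline.limited_to(timedelta(seconds=30)),
    "RequestHandler": pipeline.limited_to(timedelta(seconds=45)),
}

# Pretend that navigation took 20 seconds.
timeouts["Navigation"].consume(timedelta(seconds=20))
```

After the simulated navigation, the whole pipeline has 40 seconds left, so the request handler is effectively capped at 40 seconds even though its own limit is 45.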

Collaborator

No need to block this change for this. If needed, we can discuss here: #1596

@janbuchar janbuchar requested a review from Pijukatel December 3, 2025 14:37
@janbuchar janbuchar requested review from Mantisus and vdusek December 3, 2025 14:37
Collaborator

@Mantisus Mantisus left a comment

An excellent solution with context for a general navigation timeout!

@janbuchar janbuchar merged commit 0dfb6c2 into master Dec 4, 2025
23 checks passed
@janbuchar janbuchar deleted the only-apply-timeout-to-request-handler branch December 4, 2025 12:24

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

6 participants