fix: Only apply requestHandlerTimeout to request handler #1474
Conversation
vdusek
left a comment
Okay, so the requestHandlerTimeout is now applied only to the request handler (router(final_context)) and not the whole context pipeline.
It makes sense when I'm reading this.
Do we have any examples of where the previous behavior caused any troubles?
This is correct.
In the JS version, the browser crawler (the Playwright/Puppeteer ancestor) has two kinds of timeout. This is super awkward and not really what the interface promises. However, the Python version doesn't have a navigation timeout. I believe I should add one as part of this PR.
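A minimal sketch of the behavior described above, with hypothetical names (`run_pipeline`, `middlewares`, and the `handler_timeout` parameter are illustrative, not the actual crawlee API): the timeout wraps only the final request handler call, not the whole context pipeline.

```python
import asyncio


async def run_pipeline(context, middlewares, request_handler, handler_timeout: float):
    # The context pipeline (navigation, hooks, ...) runs without
    # the request-handler timeout applied to it...
    for middleware in middlewares:
        context = await middleware(context)

    # ...and only the user-supplied request handler is bounded by it,
    # mirroring what `requestHandlerTimeout` promises.
    await asyncio.wait_for(request_handler(context), timeout=handler_timeout)
```

With this shape, a slow navigation step no longer eats into the handler's budget; only a slow handler itself trips the timeout.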
Pijukatel
left a comment
Could you please add some tests for the correct application of the timeout?
vdusek
left a comment
Docstring suggestions. Otherwise LGTM.
Co-authored-by: Vlada Dusek <[email protected]>
    async def _execute_pre_navigation_hooks(
        self, context: BasicCrawlingContext
    ) -> AsyncGenerator[BasicCrawlingContext, None]:
        self._shared_navigation_timeouts[context] = SharedTimeout(self._navigation_timeout)
I am not sure about this. The context is expanded by many pipeline steps, so this mapping is not very flexible: it will break if you try to apply SharedTimeout over pipeline steps that work on the expanded context.
Maybe it would be better to keep the timeout on the context itself instead of in this mapping on the crawler? I think it also belongs there conceptually: the crawler sets the timeout, but it belongs to the context. Also, only the consumers of this specific context should have access to the timeout; there is no reason to keep it global for the whole crawler.
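A sketch of that alternative (all names hypothetical; the real `BasicCrawlingContext` has a different shape): the timeout is stored as a field on the context, so any step that receives the context, including one operating on an expanded context derived from it, can reach it without a crawler-level mapping.

```python
from __future__ import annotations

import time
from dataclasses import dataclass


class SharedTimeout:
    """A time budget that shrinks as wall-clock time passes (sketch)."""

    def __init__(self, seconds: float) -> None:
        self._deadline = time.monotonic() + seconds

    @property
    def remaining(self) -> float:
        return max(0.0, self._deadline - time.monotonic())


@dataclass
class NavigationContext:
    url: str
    # The timeout travels with the context instead of living in a
    # crawler-level dict keyed by context objects.
    navigation_timeout: SharedTimeout | None = None
```

An expanded context that wraps or extends this object would still carry the same `navigation_timeout`, which is the flexibility the comment above asks for.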
Same here 🙂 I want to avoid exposing the timeout to the request handler, mostly because it just doesn't make sense to me to do so.
Since the tests are now failing because of this, I guess we can agree that this approach is not optimal - I'll iterate on it for a while.
    result.cpu = after_cpu - before_cpu

class SharedTimeout:
Nice idea. I like this a lot.
Just thinking out loud about where this could lead:
In case we need more granular timeouts for specific steps, I think this class could easily be expanded to support that. I am also wondering whether even the final context consumer (the request handler) could just use the timeout from here, if the timeout is set on the context (my other comment).
I'm not sure I follow. Do you think that the request handler should be limited by a shared timeout, or that it should have access to the remaining timeout "budget"?
Please don't focus on the specific interface in my example; it is just about the capability.
Imagine that the context were created something like this:
pipeline_timeout = SharedTimeout(...)
BasicCrawlingContext(
    ...,
    timeouts={
        "WholePipeline": pipeline_timeout,  # Maybe the other timeouts could somehow be limited by this one?
        "Navigation": pipeline_timeout.limited_to(NAVIGATION_LIMIT),
        "RequestHandler": pipeline_timeout.limited_to(HANDLER_LIMIT),
    },
)
And each timeout-protected context consumer would pick the relevant timeout from the context and apply it. Context consumers could even modify the timeouts of the steps that follow them.
For example, users could mutate the "RequestHandler" timeout in pre-navigation hooks.
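One way `limited_to` could behave in this sketch (hypothetical; crawlee does not necessarily provide such a method): the derived budget is capped both by its own limit and by whatever remains of the parent budget, so "Navigation" and "RequestHandler" can never outlive "WholePipeline".

```python
import time


class SharedTimeout:
    """A shrinking time budget that can spawn capped child budgets (sketch)."""

    def __init__(self, seconds: float) -> None:
        self._deadline = time.monotonic() + seconds

    @property
    def remaining(self) -> float:
        return max(0.0, self._deadline - time.monotonic())

    def limited_to(self, seconds: float) -> "SharedTimeout":
        # The child budget can never exceed the parent's remaining time.
        return SharedTimeout(min(seconds, self.remaining))
```

Because a child is created from the parent's *remaining* time, a hook that burns part of the pipeline budget automatically shrinks the budgets derived afterwards.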
No need to block this change for this. If needed, we can discuss here: #1596
Mantisus
left a comment
An excellent solution, using the context for the general navigation timeout!
I stumbled upon this when working on ContextPipeline for the JS version. I'm eager to hear your thoughts 🙂