
Add commands to collect and retrieve response bodies #877

Open
wants to merge 5 commits into base: main

Conversation

juliandescottes
Contributor

@juliandescottes juliandescottes commented Feb 13, 2025

Overview of what this PR aims to add:

  • Concept of a network body collector. A network body collector is similar to intercepts and event subscriptions: clients can add and remove collectors. In theory this should be usable for both requests and responses, but it is only applied to responses in this PR. A network body collector is a struct with contexts or userContexts, and urlPatterns. All fields are optional, so you can potentially define a collector which matches everything (to be discussed).

New BiDi session items:

  • BiDi session has a network body collector map, similar to the intercept map. It simply stores the active body collectors.
  • BiDi session has a network maximum body size, a js-uint defining the maximum size of collected bodies.
  • BiDi session has a network response map, which contains all the collected bodies, keyed by request id. This map is stored at the session level because different sessions might have different configurations for what kind of network bodies can be collected (e.g. max size).
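As a rough illustration of the session items above (the class and attribute names are hypothetical, not spec text), the state could look like:

```python
# Hypothetical sketch of the proposed session-level state; names are
# illustrative, not spec text.
class BiDiSession:
    def __init__(self):
        # Active network body collectors, keyed by collector id,
        # analogous to the existing intercept map.
        self.network_body_collector_map = {}
        # Maximum size (js-uint) of each collected body.
        self.network_maximum_body_size = 0
        # Collected bodies keyed by request id; stored per session
        # because sessions may configure collection differently.
        self.network_response_map = {}
```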

New commands:

  • new command addBodyCollector to add a new network body collector
  • new command removeBodyCollector to remove an existing network body collector
  • new command setNetworkBodyCollectorConfiguration, which can be used to set the session's network maximum body size. In the future we might add more configuration options here, which is why this sets a generic configuration.
  • also getResponseBody, which is mostly identical to the one in Add a command to get response body #856. It defaults to base64 at the moment; we probably want to make it easier to receive the body as a string if possible? (but I preferred to keep this command as close to the existing PR as possible)

New error:

  • new error no such body collector, for removeBodyCollector

Updates to existing commands

  • When a response is caught in network.responseCompleted, we attempt to collect the body if it is related to a navigable
  • On navigation committed we remove the bodies of all responses linked to this navigable
  • On context destroyed we also remove the bodies of all responses linked to this navigable

Note that I haven't added extra limitations on which responses are collected in responseCompleted, but we can definitely add them (e.g. no worker requests, etc.)



@juliandescottes
Contributor Author

@OrKoN @jgraham I was not sure how (or if?) I could update PR #856, so I just created a new one here.
Please take a look at the summary before looking at the patch, you might already have comments on the overview before diving into the details :)

@OrKoN
Contributor

OrKoN commented Feb 13, 2025

Thanks for the PR. I think we do not have clear requirements that any clients need the functionality provided by addBodyCollector, so we could exclude it for now (unless someone needs it?). At the least, I would not add browsing context params the same way we have them in event subscriptions (where a context id resolves to the top-level traversable). I think we need the ability to define an overall size limit instead of (or in addition to?) a per-request limit in setBodyCollectorConfiguration; instead of just not saving the freshest request, we should probably evict earlier requests.

@OrKoN
Contributor

OrKoN commented Feb 13, 2025

Note that I haven't added extra limitations to which responses are collected in responseCompleted

I am wondering whether in my initial draft I should have started collection in responseStarted (I think that would actually be required for interception use cases?)

@@ -5264,6 +5275,9 @@ given |navigable| and |navigation status|:

1. [=Resume=] with "<code>navigation committed</code>", |navigation id|, and |navigation status|.

1. For each |session| in [=active BiDi sessions=], [=delete collected response bodies=]
Contributor

by this point I believe the navigation request that loaded the document has already happened and we want to retain it. If we really want to follow the CDP model we should key the network data by document.

Contributor Author

Is the response already completed by that time? In any case, adding a reference to the document sounds fine to me; I almost wanted to include it in the initial design.

Contributor

I think the headers are read and the body starts being read in parallel. Not having our network hooks in the fetch spec makes it a bit more difficult to cross-check, but I think using the document's navigation ID would be more resilient (especially if we might be moving the collection to various hooks).

Contributor Author

I'm having trouble with this part so far.

AFAICT the Document's navigation id gets set back to null once the load is done (step 9.7 of https://html.spec.whatwg.org/#the-end:concept-document-navigation-id). This means that we won't be able to store the navigation id for all requests.

Storing the document itself is not great either, because the document is normally only created after the response for the document's URL started arriving.

My current thinking is to store the navigable's ongoing navigation (which is reused as the Document's during-load navigation id), and on navigation committed clear all responses which either:

  • don't have any navigation id
  • have a navigation id other than the navigable's ongoing navigation

It might work, but it feels a bit flaky.

@juliandescottes
Contributor Author

Thanks for taking a look!

Thanks for the PR. I think we do not have clear requirements that any clients need the functionality provided by addBodyCollector so we could exclude it for now (unless someone needs it?). At least I would not add browsing contexts params in the same way as we have it in event subscriptions (when context id resolves to the top-level traversable).

I'll wait for feedback from James here, in case that doesn't align with his feedback from PR #856 , but I thought that was one of the main required changes? Having a way to clearly declare whether you want to record responses or not. And if we do I think it makes sense to make it consistent with all our other similar APIs (events and intercepts) (note: intercepts don't have user context support yet, but they really should).

I think we need an ability to define the overall size limit instead of (in addition?) a per-request limit in setBodyCollectorConfiguration (instead of just not saving the freshest request we should probably evict earlier requests).

Yeah I'm happy to update the configuration bit with a total size + FIFO approach to evict requests, let's see if there are any other requested flags/limits.
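A minimal sketch of that total-size + FIFO eviction idea (the function and variable names are assumptions for illustration, not spec text):

```python
# Hypothetical sketch: store response bodies under a total-size budget,
# evicting the oldest entries first (FIFO) when room is needed.
from collections import OrderedDict

def collect_body(store, request_id, body, max_total_size):
    """store: OrderedDict mapping request id -> body bytes, oldest first."""
    if len(body) > max_total_size:
        # A body larger than the whole budget is never stored.
        return
    used = sum(len(b) for b in store.values())
    while used + len(body) > max_total_size:
        _, evicted = store.popitem(last=False)  # evict the oldest entry
        used -= len(evicted)
    store[request_id] = body
```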

I am thinking if in my initial draft I should have started collection in responseStarted (I think that would actually be required for interception use cases?)

Maybe we should create the entry as early as beforeRequestSent, and have a "state" in the collected response (pending, available, evicted ...)

@juliandescottes
Contributor Author

One thing I wanted to mention re: contexts/userContexts in addBodyCollector.

On our side, considering our current implementation, it is important to have an API where clients can be selective upfront about which requests they are interested in. To record responses, Firefox duplicates them in another (parent) process. This means it's easier for us to control the availability of responses, but we probably use more memory than Chrome does.

On the client side, if you are only interested in one class of requests coming from a specific tab, and you can't define the contexts/userContexts to watch, then you have to fiddle with the "total size" configuration, hoping that the requests you are interested in are not going to be evicted first?

Puppeteer and other clients can still just call it without any arguments in the beginning. But considering this API is consistent with our subscription and intercept APIs, and seems beneficial for clients, I would still like us to consider it.

@jgraham
Member

jgraham commented Feb 17, 2025

I agree with Julian. Given that this feature has potentially high overhead in terms of resource usage, it seems important to be able to turn it on and off with more granularity than just "did anyone subscribe to any network events and so could have a response id to allow reading the body". (Even that optimization is a little hard, because as a pure optimisation one would need to keep the data around until the responseCompleted event was emitted, just in case someone started an event subscription after the response started and before it completed. One could specify that that doesn't work, but it would certainly be surprising to users if the general model is "I have a response id, therefore I can get the response body".)

I also agree that if we're adding a maximum total cache size it's even more important that you can be specific about which response bodies you're interested in.

My assumption is that test automation clients currently don't offer this kind of control because they have been based around CDP's semantics. It seems reasonable to me to assume tests generally know when they will want to access the network bodies and so an API based on opt-in is reasonable. We also know that devtools, for example, do collect network response bodies one tab at a time, and that that kind of use case would be severely compromised if there was only global control over retention (e.g. if I'm trying to record an interactive browsing session in one tab for later replay it's extremely important to me that everything in that tab ends up being accessible, and I actively don't want anything that happens in other tabs to affect it).

@OrKoN
Contributor

OrKoN commented Feb 17, 2025

CDP does offer max size control per target (a local subtree of frames). My point was mostly that limiting by URL patterns or by max response body size does not seem too useful: most clients want everything in a tab, even if they know which URL patterns or individual response sizes they are dealing with. I think the current API very easily allows scenarios like "oh, I set the max size per response body to 10KB, so my 10.1KB response was not recorded and I need to re-run everything", or "I had not realized that I needed bodies for these URLs". Most users would just set something like 99999MB per response body and match all URLs.

As for URL pattern matching, matching by media type sounds even more useful as a first filter. Do we have any specific clients interested in fine-grained controls beyond per-navigable total limits? If not, I would propose to simplify the proposal by adding context ids to network.SetBodyCollectorConfiguration, making it per-context or global, and changing maximumBodySize to maximumTotalBodySize (I believe most clients would just be using that, and we could reduce the amount of specification and implementation needed without blocking an extension with fine-grained filtering in the future). That would require partitioning the cache per navigable (or even per document), but it looks like we would need that for cleanup as well (if we agree on the current cleanup points).

@jgraham
Member

jgraham commented Feb 17, 2025

I agree that URL patterns could be dropped in the first pass, as long as we keep contexts and user contexts.

My concern with just having a maximum total size, and no other filtering, is the case where you have a page with a few large assets that you're not interested in, but which might cause cache eviction of the small assets you are interested in. For example on a media viewing page where you might have some pictures or videos that are hundreds of megabytes, when your test is entirely concerned with checking some data in a few kB of HTML or JSON.

Without a URL (or MIME type) filter we can't easily avoid the overhead of copying that data to the parent process (at least up to the size limit), but we can avoid requiring people to set the maximum cache size to the size of all the resources on the page rather than a per-resource limit of (say) 100kB.

@juliandescottes
Contributor Author

Thanks for the feedback!

Trying to summarize where we are:

  • all ok with adding a maximumTotalBodySize to the configuration
  • all ok with dropping URL patterns in the first iteration
  • needs agreement about keeping maximumBodySize
  • needs agreement about the API (drop add/removeBodyCollector in favor of just having setBodyCollectorConfiguration)

My comments on this:

1/ For the URL patterns, I agree we can drop them, but from our discussion it sounds like we want some way to exclude requests instead. Would excludedURLPatterns be more useful? Or do we want to design something to exclude requests based on specific fields of the network event, e.g. mimeType, bodySize, etc.? In any case it sounds like we can keep this for a next iteration.

2/ For maximumBodySize:

I imagined this would be used to set a reasonably high (a few MBs) limit on individual requests, to avoid having the whole response body storage taken up by just a few random requests (as mentioned by @jgraham). In Firefox DevTools we have a cap on individual responses to avoid storing unreasonably large ones (1MB by default, changeable with a pref). I think it's worth having an explicit limit, but maybe it should have a default value, and maybe it should rather be a capability. I would like to keep a clearly defined limit and allow clients to override it if needed. On the Firefox side I don't think we can handle duplicating huge responses in the parent process for BiDi; we will have to implement a cap anyway.

3/ API: only add setBodyCollectorConfiguration (or another name :) )

I imagine the behaviour would be close to setCacheBehavior. When setting for global, it overrides all previously defined context/user context configurations. When setting for a context/user context it will potentially preserve the previous body collector configurations set for other contexts/user contexts. This brings some questions:

  • If you can set a maximum total size / maximum size at the same time, does this setting only apply to the contexts/user contexts provided in the command? Imagine you first set a configuration for context "12" and then set another configuration for userContext "foo" which contains context "12". Should the user context configuration override the configuration for context "12"?
  • How can a client stop collecting network bodies? If I set a configuration for context 12, which command can we use to stop it?

While it does simplify the API, it feels like a step back closer to what we previously had for subscriptions. A model where we create unique collectors that can each be removed on their own feels less surprising?

@OrKoN Let me know what you think, maybe you have suggestions on how a single setBodyCollectorConfiguration could fit those scenarios?

@OrKoN
Contributor

OrKoN commented Feb 18, 2025

Thanks for summarizing. I am still not sure we have a client with a current use case for limiting response storage based on specific attributes of the request/response. I see that Playwright's model for Firefox is also based on total size with eviction (I could not tell if it is per navigable or global?). Therefore, I think it would be a reasonable model to say that, as a client, you get the last X bytes of response data stored per navigable that you enabled collection for. Eventually, if there are users requesting fine-grained per-request control, it could be added on top of that model.

As for how the configuration command should work, I would say that, unlike event subscriptions, we could make it so that the last command always wins.

setBodyCollectorConfiguration(maxCacheSize) # sets maxCacheSizePerNavigable for all navigables in all user contexts, new navigables inherit from the session
setBodyCollectorConfiguration(maxCacheSize, userContexts) # sets maxCacheSizePerNavigable for all navigables in specified userContexts, new navigables inherit from the specified user context if they are created in it and from the session otherwise.
setBodyCollectorConfiguration(maxCacheSize, browsingContexts) # sets maxCacheSizePerNavigable for specified browsingContexts only

basically, at any time the session, each user context, and each browsing context have a maxCacheSizePerNavigable value that is either the result of a configuration call or inherited from the "parent" object when a navigable/user context is newly created. So to stop collecting any responses the client could send setBodyCollectorConfiguration(maxCacheSize=0). I do not currently see that we would need the same call-by-call mechanism for undoing configuration calls that we have for event subscriptions, so indeed it would be similar to setCacheBehavior.

So it sounds to me that maxCacheSize for the entire session would not be that useful, but maxCacheSizePerNavigable as described would be fine without fine-grained per-request controls?
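The inheritance model described here could be sketched roughly as follows, assuming a most-specific-wins resolution (context over user context over session default) and a global call that resets the more specific configurations; all names are hypothetical:

```python
# Hypothetical model of per-session maxCacheSizePerNavigable values,
# following the inheritance rules discussed above. Illustrative only.
class BodyCollectorConfiguration:
    def __init__(self):
        self.session_default = 0       # 0: collect nothing
        self.per_user_context = {}     # user context id -> size
        self.per_context = {}          # navigable id -> size

    def set_global(self, size):
        # A global call applies to all navigables in all user contexts,
        # so it resets the more specific configurations.
        self.session_default = size
        self.per_user_context.clear()
        self.per_context.clear()

    def set_user_contexts(self, user_contexts, size):
        for uc in user_contexts:
            self.per_user_context[uc] = size

    def set_contexts(self, contexts, size):
        for ctx in contexts:
            self.per_context[ctx] = size

    def resolve(self, context, user_context):
        # Most specific configuration wins; a new navigable effectively
        # inherits from its user context, then from the session.
        if context in self.per_context:
            return self.per_context[context]
        if user_context in self.per_user_context:
            return self.per_user_context[user_context]
        return self.session_default
```

With this model, sending a global configuration with size 0 stops collection everywhere, matching the "last command always wins" behavior.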

On Firefox side I don't think we can handle duplicating huge responses in the parent process for BiDi, we will have to implement a cap anyway.

I wonder if you would still need to duplicate the responses if the removal happens at the points proposed in this PR (responses do not outlive the navigable)?

@juliandescottes
Contributor Author

juliandescottes commented Feb 18, 2025

Right, we can set the total size to 0. It feels a bit like a workaround? Maybe having a clean command to really stop collecting bodies wouldn't hurt?

About making the max size a per-navigable configuration: it makes it easier to work with multiple calls to setConfiguration.

setConfiguration(maxCacheSize=1000)
setConfiguration(userContexts=["foo"], maxCacheSize=5000)
setConfiguration(contexts=["12"], maxCacheSize=10000)

In that case, by default navigables have 1000 allowed cache size, the ones in "foo" have 5000 and navigable 12 has 10000.

It does mean that there's no effective max cache size anymore, though. Users may create new navigables and fill the browser's memory. I imagine that's not a concern in practice, but it's important to note that we can't keep this under control with this approach. At least in this model we don't have to wonder whether a request takes up space in the cache configured for its navigable or globally; the cache size is always allocated per navigable, and that seems nice.

I wonder if you would still need to duplicate the responses if the removal happens at the points proposed in this PR (responses do not outlive the navigable)?

Not really; it would require too many changes to our network event monitoring, which is almost entirely handled in the parent process for devtools/bidi. Also, we should keep the door open to relaxing those limitations in the future: it would be great if responses could only be evicted when the top-level traversable navigates or is destroyed.

Which means I would still like to keep this configurable. Worst case, this could be driven by a preference plus a NOTE that implementations might truncate long response bodies, but I would really prefer having something consistent across browsers here.

@juliandescottes
Contributor Author

Sidenote: I notice that CDP supports maxTotalBufferSize/maxResourceBufferSize, so unless I'm mistaken you should already have support for a per-resource limit on the CDP side?

@OrKoN
Contributor

OrKoN commented Feb 20, 2025

Sidenote: I notice that CDP supports maxTotalBufferSize/maxResourceBufferSize, so unless I'm mistaken you already should have support for a per resource limit on CDP side?

indeed, though we have not used it in Puppeteer so far. In issues where people want increased limits, they usually set it as high as the total available size, so I am not sure how useful it is to guess how large individual responses could be.

@juliandescottes
Contributor Author

Before reviewing the PR in detail - I'm sure there are still syntax mistakes not worth fixing for now - let's summarize the current state and get feedback on the overall approach.

Session changes:

  • session has navigable network collector configurations (a map), user context network collector configurations (a map), and a global network configuration, which store the various configurations clients can set for collecting bodies.
  • said configurations contain two numbers: max total size and max resource size.
  • session has a list of collected responses, each containing (navigable id, navigation, request id, response). It's a list because ordering matters for eviction.

New command:

  • setBodyCollectorConfiguration(userContexts, contexts, maxTotalBodySize, maxResourceBodySize). Similar to setCacheBehavior in the sense that you need to be careful about the order in which you call the API. Calling it globally erases individual configurations set for userContexts/contexts, calling it for userContexts erases the configuration for contexts, etc. There's no explicit way to completely stop collecting bodies; you need to set the sizes to 0.

Updates to existing events:

  • When a response is caught in network.responseCompleted, we attempt to collect the body if it is related to a navigable:
    • If a collector configuration is set, then a collected response struct will be added to the collected responses list.
    • But it will only preserve the actual response body if it fits within the maxTotalBodySize/maxResourceBodySize limits.
    • Then we calculate the remaining size available for the navigable based on the responses already collected for this navigable, and evict the oldest ones until there is enough room. (The algorithm is really not efficient, but I was trying not to go into too many details at the spec level; implementations can and should handle this differently.)
    • The collected response will contain the navigable id as well as the navigable's ongoing navigation if available
  • On navigation committed we remove the bodies of all responses linked to this navigable, unless it has the same navigation id as the one provided to navigation committed.
  • On context destroyed we also remove the bodies of all responses linked to this navigable
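The two cleanup points above could be sketched like this (CollectedResponse and the function names are assumptions for illustration, not spec text):

```python
# Hypothetical sketch of the cleanup rules: on "navigation committed",
# drop a navigable's collected responses unless they carry the
# committed navigation's id; on "context destroyed", drop them all.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CollectedResponse:
    navigable_id: str
    navigation_id: Optional[str]  # navigable's ongoing navigation, if any
    request_id: str
    body: Optional[bytes]

def on_navigation_committed(collected: List[CollectedResponse],
                            navigable_id: str,
                            navigation_id: str) -> List[CollectedResponse]:
    # Keep responses from other navigables, and responses from this
    # navigable collected as part of the committed navigation itself
    # (e.g. the document request).
    return [r for r in collected
            if r.navigable_id != navigable_id
            or r.navigation_id == navigation_id]

def on_context_destroyed(collected: List[CollectedResponse],
                         navigable_id: str) -> List[CollectedResponse]:
    return [r for r in collected if r.navigable_id != navigable_id]
```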

@juliandescottes juliandescottes force-pushed the pr-856 branch 2 times, most recently from 403f6cb to a677a1c on February 21, 2025 09:03
…r navigable, check navigation id to evict responses
@juliandescottes
Contributor Author

@OrKoN In the last update I tried to simplify the API to only keep one method as suggested. While this works, I'm not sure this is really a good decision at the spec level.

It's functionally very close to what we had with the previous proposal, but is less flexible and more sensitive to the order in which commands are called. With the previous approach we have something that can naturally evolve to support url patterns and more fine grained configurations.

If libraries such as Puppeteer prefer to expose it only as a simplified API, that should still be possible. Could we reconsider?

@OrKoN
Contributor

OrKoN commented Feb 21, 2025

@OrKoN In the last update I tried to simplify the API to only keep one method as suggested. While this works, I'm not sure this is really a good decision at the spec level.

It's functionally very close to what we had with the previous proposal, but is less flexible and more sensitive to the order in which commands are called. With the previous approach we have something that can naturally evolve to support url patterns and more fine grained configurations.

If libraries such as puppeteer only prefer to expose it as a simplified API, it should still be possible. Could we reconsider ?

I think in the previous proposal there was also a configuration method for limits and an additional per URL configuration methods. Could you please clarify how the current proposal would limit the addition of the per URL configuration methods?

@jgraham
Member

jgraham commented Feb 21, 2025

I agree with @juliandescottes here; I feel like in this proposal the obvious things that a user might want to do (enable/disable collecting response bodies for some tab or user context) are exposed as side effects of configuring low-level details (cache sizes).

I do think we need that level of configuration, but I'd prefer an API where the methods correspond to user intent, and where we can have reasonable defaults for the various tuning parameters.

@juliandescottes
Contributor Author

@OrKoN In the last update I tried to simplify the API to only keep one method as suggested. While this works, I'm not sure this is really a good decision at the spec level.
It's functionally very close to what we had with the previous proposal, but is less flexible and more sensitive to the order in which commands are called. With the previous approach we have something that can naturally evolve to support url patterns and more fine grained configurations.
If libraries such as puppeteer only prefer to expose it as a simplified API, it should still be possible. Could we reconsider ?

I think in the previous proposal there was also a configuration method for limits and an additional per URL configuration methods.

In the previous approach you had one method to set a global configuration (only resource max size, but we could easily add total max size as well). Then add/removeBodyCollector was used to select in which contexts/userContexts the user wanted to collect bodies, with an optional urlPattern (which can still be dropped in a first iteration).

Could you please clarify how the current proposal would limit the addition of the per URL configuration methods?

I find the current proposal harder to understand as is. You need to be aware that the order in which you call the command is important, and you might erase configurations unexpectedly. But it still remains relatively easy to predict how it's going to work without reading the spec.

Now if we add url patterns, there are a few things to answer. If context 12 is listening to www.a.com and I want to also listen to www.b.com, how can I do it? When we set the configuration again for this context, does it add to the existing pattern? Does it override it?

Then imagine we catch all requests globally with a cache size of 1000, and for context 12 we only cache JS requests, but with a cache size of 2000. If there's a non-JS request in context 12, does it still get captured because we capture all requests globally? If so, which cache size should be used?

We can answer all those questions in the spec, but I'm still concerned it will make the behavior unexpected, whereas an API where you add and remove collectors is very simple to understand.

@OrKoN
Contributor

OrKoN commented Feb 21, 2025

I can see a concern but I am not sure it's worse than the behavior of network.setCacheBehavior. I'd say the latest version aligns more with network.setCacheBehavior.

In the previous approach you had one method to set a global configuration (only resource max size, but can easily add total max size as well). Then add/removeBodyCollector was used to select in which contexts/userContexts user wanted to collect bodies, with an optional urlPattern (which can still be dropped in a first iteration).

should resource max size and max total size per navigable be part of the add/removeBodyCollector methods?

@OrKoN
Contributor

OrKoN commented Feb 21, 2025

should resource max size and max total size per navigable be part of the add/removeBodyCollector collector methods?

if these settings are not part of the add/removeBodyCollector methods, then changing these limits via the global configuration is similar to this proposal, in the sense that it would remove/add things from the cache otherwise handled by the body collector.

@OrKoN
Contributor

OrKoN commented Feb 21, 2025

I agree with @juliandescottes here; I feel like in this proposal the obvious things that a user might want to do (enable/disable collecting response bodies for some tab or user context) are exposed as side effects of configuring low-level details (cache sizes).

would changing the current version's configuration to accept a cacheBehavior: "store" / "do-not-store" and limits being made optional address this concern?

@jgraham
Member

jgraham commented Feb 21, 2025

There is synchronization, but I think you could synchronize the computed per-top-level-traversable state rather than lists?

e.g.

collector1 = network.addBodyCollector(context=["example"], maxBodySize=1024, cacheSize=102400)
// For context "example" need to synchronize the data {maxBodySize: 1024, cacheSize: 102400}
collector2 = network.addBodyCollector(context=["example", "example2"], maxBodySize=10240)
// For context "example" need to synchronize the data {maxBodySize: 10240, cacheSize: 102400}
network.removeBodyCollector(collector=collector2)
// For context "example" need to synchronize the data {maxBodySize: 1024, cacheSize: 102400}
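One way to read the example above (an assumption on my part, not something the thread settled on): the parent process recomputes the resolved per-context state from all matching collectors whenever the list changes, e.g. by taking the max of each limit a matching collector specifies. A sketch, with hypothetical names:

```python
# Illustrative sketch: compute the synchronized per-context limits from
# the collector list owned by the parent process. Resolution rule
# (max of matching collectors) is an assumption, not spec text.
def resolve_limits(collectors, context):
    """Each collector is a dict with a 'contexts' list plus optional
    'maxBodySize' / 'cacheSize' limits."""
    resolved = {}
    for c in collectors:
        if context not in c["contexts"]:
            continue
        for key in ("maxBodySize", "cacheSize"):
            if key in c:
                resolved[key] = max(resolved.get(key, 0), c[key])
    return resolved
```

With this rule, adding collector2 raises the resolved maxBodySize for "example" to 10240 while cacheSize stays at 102400, and removing collector2 restores the original values, matching the example.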

@jgraham
Member

jgraham commented Feb 21, 2025

IOW the parent process owns the lists, the content/network processes only see the resolved values.

@OrKoN
Contributor

OrKoN commented Feb 21, 2025

I think addBodyCollector with limits computed at update time would work (and it does not change the deletion and collection logic much). A global configuration method would not be needed then? I am not sure whether it makes it more difficult to add URL patterns later, since different patterns could have (partially) conflicting limits, but perhaps the URL pattern configuration should not be allowed to change any limits.

In practice I'm not sure why the number of collectors would grow large. I can see N=2 that apply to the same request being common (a fixture configures something for a specific user context, and then a test configures something for a specific browsing context), but the chance of something much higher seems small.

I agree that N > 2 would be uncommon in practice, that's why I think the current proposal would also work for the examples like testharness vs test code.

# test harness calls this before test
network.CollectResponseBodies(userContext=[A], maxBodySize=1024, cacheSize=102400);
# test calls this
network.CollectResponseBodies(context=["example", "example2"], maxBodySize=10240)
# test harness calls this after test (test does not need to do clean up)
network.CollectResponseBodies(userContext=[A], maxBodySize=1024, cacheSize=102400);

@juliandescottes
Contributor Author

juliandescottes commented Feb 24, 2025

@OrKoN not sure I understand your last example; did you mean to write different code for "# test harness calls this after test (test does not need to do clean up)"? Otherwise it's the same as the first command being sent.

Also are you fine with a add/remove pattern?

In the absence of a global configuration command another thing I would like to clarify before updating the PR is the behavior of the maxBodySize / totalBodySize parameters provided in the addBodyCollector / collectResponseBodies commands.

1/ Is totalBodySize scoped by navigable (as in the total size of responses which can be stored for a given navigable) or is it global? I guess for Chrome / CDP at the moment, it's technically easier if it's scoped by navigable? On the Firefox side we could handle both. Scoped by navigable means that you can't set an effective max size, but I guess users can work around this and avoid creating too many navigables which would store responses?

2/ Do we resolve the current value of maxBodySize / totalCacheSize by taking the max of all collectors which match the request (resolved at body collection time, then)?

Slightly worried about weird edge cases once we support URL patterns with this approach. If totalBodySize is understood as being per navigable and we have:

  • one collector for all globals, totalBodySize: 1000kB
  • one collector for *.html in context A, totalBodySize: 100000kB

Then if a request is done in context A for a non-html resource, it should be collected because it matches the first collector, but which size should be considered as the totalBodySize then?

That's probably what you meant by

I am not sure if it makes it more difficult to add URL patterns later since different patterns could have (partially) conflicting limits but perhaps the URL pattern configuration should not be allowed to change any limits.

And yes I agree it might be problematic. I feel like a global configuration command would be much more straightforward in this case?
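To make the edge case concrete, here is a rough sketch (purely illustrative; the collector shape, field names, and the suffix-based pattern matching are assumptions, not spec text) of the "take the max over all matching collectors" resolution being discussed:

```python
# Hypothetical "max of matching collectors" resolution for totalBodySize.
def matches(collector, context_id, url):
    if collector.get("contexts") and context_id not in collector["contexts"]:
        return False
    if collector.get("urlPatterns") and not any(
        url.endswith(p) for p in collector["urlPatterns"]
    ):
        return False
    return True

def effective_total_body_size(collectors, context_id, url):
    sizes = [c["totalBodySize"] for c in collectors if matches(c, context_id, url)]
    return max(sizes) if sizes else None  # None: no collector matched, nothing stored

collectors = [
    {"totalBodySize": 1000},  # all globals, 1000kB
    {"contexts": ["A"], "urlPatterns": [".html"], "totalBodySize": 100000},
]

# A non-html request in context A matches only the first collector,
# so its effective totalBodySize is 1000, not 100000.
print(effective_total_body_size(collectors, "A", "https://example.com/data.json"))
```

With this rule the non-html request from the example is still collected, but only under the smaller, global budget, which is exactly the ambiguity a global configuration command would avoid.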

@jgraham
Member

jgraham commented Feb 24, 2025

Is totalBodySize scoped by navigable

I think the overall cache size should be scoped by top-level traversable. That seems slightly easier to reason about than actually per navigable, but without the problems associated with just having a global limit. But if that's hard to implement then per actual navigable is probably fine.

Slightly worried about weird edge cases once we support URL patterns with this approach. If totalBodySize is understood as being per navigable and we have

I think you end up with weird edge cases for any model other than "one cache per collector, maximum possible size is the sum of all sizes, but the actual size may be smaller even if the caches are full due to deduplication". Implementing that model feels like overkill, and does lead to the higher synchronisation overhead that @OrKoN was concerned about.

@OrKoN
Contributor

OrKoN commented Feb 24, 2025

not sure I understand your last example, did you mean to write some different code for # test harness calls this after test (test does not need to do clean up)? Otherwise it's the same as the first command being sent.

this is an example for the use case brought up by James before, about how the add/remove pattern makes it easier for the test harness to clean up state after a test. The example I provided shows how it can be done with the current proposal (the last command undoes the test-specific configuration).

Also are you fine with an add/remove pattern?

I am fine with the add/remove pattern (although I do not think it is necessarily needed in this use case) as long as the effective configuration can be resolved at add/remove time.

And yes I agree it might be problematic. I feel like a global configuration command would be much more straightforward in this case?

I think it would be good if these were two different commands: one for the static configuration limits like the current proposal, and another command for filtering requests/responses (that is more like request interception). maxBodySize is in a grey area, as it can be seen as a limit but can also be seen as a pre-request filter.

In the short- to mid-term, Chromium could only implement totalBodySize per actual navigable but I agree that per-top-level-traversable would make more sense.

@juliandescottes
Contributor Author

I think you end up with weird edge cases for any model other than "one cache per collector, maximum possible size is the sum of all sizes, but the actual size may be smaller even if the caches are full due to deduplication".

If we only allow a global configuration for the sizes, do you think that's still the case? If all navigables (or traversables) share the same limits in terms of resource size and global size, there shouldn't be any surprise about which value will be used.

Again I think it's only an issue if we have an API where you can set both URL patterns AND size limits at the same time.

this is an example for the use case brought up by James before, about how the add/remove pattern makes it easier for the test harness to clean up state after a test. The example I provided shows how it can be done with the current proposal (the last command undoes the test-specific configuration).

It's the same command as the first one, how does it remove any collection/configuration? Might be missing something 🤔

Also are you fine with an add/remove pattern?

I am fine with the add/remove pattern (although I do not think it is necessarily needed in this use case) as long as the effective configuration can be resolved at add/remove time.

As long as the configuration is not mixed with URL patterns, implementations should be able to resolve everything at add/remove time. Not sure we need to call it out in the spec, but we probably can?

And yes I agree it might be problematic. I feel like a global configuration command would be much more straightforward in this case?

I think it would be good if these were two different commands: one for the static configuration limits like the current proposal, and another command for filtering requests/responses (that is more like request interception). maxBodySize is in a grey area, as it can be seen as a limit but can also be seen as a pre-request filter.

I agree, it would be best to have different commands for setting sizes and for setting URL patterns, otherwise the behavior will be confusing for clients.

We have 3 "features" we want to expose: enabling/disabling collecting bodies, setting size limits for bodies and (later) filtering accepted bodies by URL pattern (and maybe more). My initial idea was to bundle the enabling and the filtering, so that users can express "capture responses which match X in context Y". Size configuration felt more like a global setting that could be delegated to a separated command.

@OrKoN if I understand correctly, you would rather bundle the enabling and the configuration "capture responses in context Y with configuration Z". If we add the filtering later on via another command, should this command also accept contexts/userContexts or be global? Even if we make the "filtering" command support contexts/userContexts, I think it should be clear that this command alone will not enable collecting bodies.

IMO it makes more sense to specify the requests you want to include when you create the collector, and I see less added value in setting size configurations per context/user context. But I guess that's personal preference, and if I'm the only one with this opinion I'm fine with moving forward with something else.

So would an API like:

  • addBodyCollector(userContexts?, contexts?, resourceSize, totalSize)
  • removeBodyCollector(collectorId)

work for you both @OrKoN @jgraham ?

In the short- to mid-term, Chromium could only implement totalBodySize per actual navigable but I agree that per-top-level-traversable would make more sense.

Sounds OK to me for a first iteration of the feature.
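For illustration, the commands proposed above could look like this on the wire. The method names, parameter names, and the shape of the success result below mirror the sketch in this thread and are hypothetical, not final spec:

```python
import json

# Hypothetical wire-level payloads for the sketched commands.
add_body_collector = {
    "id": 1,
    "method": "network.addBodyCollector",
    "params": {
        "userContexts": ["A"],
        "resourceSize": 1024,  # max size of a single collected body
        "totalSize": 102400,   # overall budget (per navigable or traversable)
    },
}
# A success response would carry the new collector id, e.g.
# {"id": 1, "type": "success", "result": {"collector": "<collector id>"}}

remove_body_collector = {
    "id": 2,
    "method": "network.removeBodyCollector",
    "params": {"collector": "<collector id>"},
}

print(json.dumps(add_body_collector))
```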

@OrKoN
Contributor

OrKoN commented Feb 24, 2025

It's the same command as the first one, how does it remove any collection/configuration? Might be missing something 🤔

it resets the limits for the entire scope it manages (e.g., user context or global) and discards anything done by the test code.

@juliandescottes
Contributor Author

It's the same command as the first one, how does it remove any collection/configuration? Might be missing something 🤔

it resets the limits for the entire scope it manages (e.g., user context or global) and discards anything done by the test code.

Ah I see, the contexts from the 2nd command are assumed to be in userContext A.

@OrKoN
Contributor

OrKoN commented Feb 24, 2025

if I understand correctly, you would rather bundle the enabling and the configuration "capture responses in context Y with configuration Z". If we add the filtering later on via another command, should this command also accept contexts/userContexts or be global? Even if we make the "filtering" command support contexts/userContexts, I think it should be clear that this command alone will not enable collecting bodies.

yeah, I think another command should accept contexts/userContexts as well.

work for you both @OrKoN @jgraham ?

addBodyCollector(userContexts?, contexts?, resourceSize, totalSizePerNavigable*) => collectorId
removeBodyCollector(collectorId)
// * either per navigable or per top-level traversable

this would SGTM with extending it later to

addBodyFilter(userContexts?, contexts?, filterCriteria) => filterId
removeBodyFilter(filterId)

I think we could also discuss naming and scope of the commands; eventually, it would apply at least to request bodies too, I assume? Would we want to make the addBodyCollector more general to mean limits for any future WebDriver-specific caches?

@jgraham
Member

jgraham commented Feb 24, 2025

If we only allow a global configuration for the sizes, do you think that's still the case? If all navigables (or traversables) share the same limits in terms of resource size and global size, there shouldn't be any surprise about which value will be used.

It can be surprising if e.g. one test fixture sets the limit to A and another reduces it to B. With the proposal, it's arguably less surprising, since in that case we always pick the higher number.

@juliandescottes
Contributor Author

I think we could also discuss naming and scope of the commands; eventually, it would apply at least to request bodies too, I assume? Would we want to make the addBodyCollector more general to mean limits for any future WebDriver-specific caches?

Yes, I had in mind to support both request and response bodies, even if at the moment we only store and allow retrieving responses. That being said, in the current version of the PR I only had a collected responses list stored in the session. If we have a single shared size for request and response bodies, I should rather have something generic.

Regarding additional caches, I don't know. What else do we plan to cache? And does it make sense to drive it from the network module?

@OrKoN
Contributor

OrKoN commented Feb 24, 2025

Regarding additional caches, I don't know. What else do we plan to cache? And does it make sense to drive it from the network module?

I think there could be other network data not necessarily related to bodies, perhaps WebSocket messages or event streams. I think it makes sense to scope this to network data, but maybe it could apply to all of the network data that WebDriver might need.

@juliandescottes
Contributor Author

Seems like we are close to consensus here, so I'll try to update the PR again today.

@juliandescottes
Contributor Author

(sorry about the delay, couldn't finish it before going on PTO, back at it today)

@juliandescottes
Contributor Author

Updated the PR. I'll write an overview of the changes later, but feel free to have a look in the meantime. cc @jgraham @OrKoN

@juliandescottes juliandescottes force-pushed the pr-856 branch 5 times, most recently from d087fdd to 04d4fc4 on March 4, 2025 at 19:55
@juliandescottes
Contributor Author

Quick overview of the changes:
New types:

  • collector is a collector id
  • collectorSizes describes the sizes used for a given collector
  • collectorInfo describes a collector (collectorSizes + contexts/userContexts)
  • collectorData describes a piece of collected network data

New session objects:

  • map of network collectors, from collector id to collectorData
  • map of navigable network collector sizes, which is the effective collector size configuration corresponding to a top-level navigable
  • list of collected data. It could be convenient to have one list per navigable, but I ended up with a single list because when retrieving network data, the command does not provide the navigable owning it. And I didn't want to have too many intermediary data structures just to make this look more optimized. But of course implementations can/should deviate from this model to be efficient.

New commands:

  • add/removeDataCollector: adds a new collector to the session's network collectors and updates the effective navigable network collector sizes for impacted top-level traversables.
  • getResponseBody: retrieves the corresponding data from the session's collected network data

I notice I forgot to update the browsingContext.contextCreated steps, which should also compute the effective navigable collector sizes for the new navigable if it's a top-level one.
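As a reading aid, the session-level bookkeeping described above can be sketched roughly like this (the names and data shapes are illustrative, not the spec's exact definitions):

```python
# Illustrative model of the session objects introduced by the PR.
session = {
    "network_collectors": {},          # collector id -> collector info
    "navigable_collector_sizes": {},   # top-level navigable id -> effective size
    "collected_data": [],              # flat list, looked up by request id
}

def add_data_collector(session, collector_id, info, impacted_navigables):
    session["network_collectors"][collector_id] = info
    for navigable in impacted_navigables:
        # Recompute the effective size for each impacted top-level traversable.
        current = session["navigable_collector_sizes"].get(navigable, 0)
        session["navigable_collector_sizes"][navigable] = max(current, info["totalSize"])

def get_response_body(session, request_id):
    for entry in session["collected_data"]:
        if entry["request"] == request_id:
            return entry["body"]
    return None  # the spec would return an error for unknown/uncollected requests

add_data_collector(session, "collector-1", {"totalSize": 102400}, ["nav-1"])
session["collected_data"].append({"request": "req-1", "body": "<html></html>"})
print(get_response_body(session, "req-1"))
```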

@OrKoN
Contributor

OrKoN commented Mar 6, 2025

I have not reviewed it in detail yet, but we discussed it internally and we thought that we should be evicting old resources whenever collector configuration changes (IIUC it happens only when new requests arrive in the current PR).

@juliandescottes
Contributor Author

(IIUC it happens only when new requests arrive in the current PR).

Yes that's correct. On my side, sounds reasonable to evict on updates as well.

we should be evicting old resources whenever collector configuration changes

To be clear, you mean we should make sure the stored data matches the new sizes defined, not just evict every old resource, right?

@OrKoN
Contributor

OrKoN commented Mar 6, 2025

To be clear, you mean we should make sure the stored data matches the new sizes defined, not just evict every old resource, right?

yes, that's right
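A minimal sketch of that eviction behavior, assuming an oldest-first policy and illustrative field names: when a reconfiguration shrinks the effective budget, drop the oldest collected bodies until the stored total fits again, rather than discarding everything.

```python
from collections import deque

def evict_to_fit(collected, new_total_size):
    """Drop the oldest collected bodies until the total fits the new budget."""
    total = sum(entry["size"] for entry in collected)
    while collected and total > new_total_size:
        total -= collected.popleft()["size"]  # evict oldest entry first
    return collected

store = deque([
    {"request": "r1", "size": 600},
    {"request": "r2", "size": 500},
])
evict_to_fit(store, 800)  # budget reduced, e.g. from 2000 down to 800
print([entry["request"] for entry in store])  # only r2 still fits
```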

Member

@jgraham jgraham left a comment

So I think the most fundamental thing here is that because the body is modeled as a stream, we need to grab a copy of the body before anything else has had a chance to read it. Otherwise I think we don't actually store anything (or we prevent the body from being available to other processes). Also the current way of computing the body size appears to be broken.

I somewhat wonder how much of a simplification (if at all) it would be to make the size computations lazy in the spec, rather than storing them all upfront, but I know this is closer to what one actually needs to implement, so it's also fine to keep it like this.


1. If the <code>contexts</code> field of |collector info| is present:

1. If |collector info|["contexts"] [=map/contains=] |navigable|'s [=navigable id=], return true.
Member

I think we should transform collector info into a struct on creation, rather than directly storing the map we get from the network. In that case we should check for null / not null rather than present / not present.

Contributor Author

Sorry I'm not sure I follow 😅 Can you give me a bit more details about the suggestion?


1. Let |navigable| be null.

1. If |request|'s [=request/window=] is an [=environment settings object=]:
Member

A general thought: this design isn't going to naturally extend to requests that don't have an associated window e.g. from shared or service workers. I don't think that's a blocker, but it's something we should at least think about.

Contributor Author

This was intentional here since we didn't want to include worker requests in the first iteration, but I agree it would be nice to have some idea about those.

In the absence of a related window, we could link them to their realm ID, and allow configuring collectors for non-window realms? The data from realm collectors would probably have to be evicted whenever the corresponding realm is destroyed (which might be annoying in some cases?)


1. Let |response to collect| be null.

1. Let |response size| be |response|'s [=response/body=]'s [=body/length=].
Member

Contributor Author

I missed this Note, thanks


1. Let |sizes| be |session|'s [=navigable network collector sizes=][|top-level navigable|].

1. If |response|’s [=response/body=] is not null and
Member

Ideally these things are linked to the struct item.


1. Let |processBodyError| be this step: Do nothing.

1. [=Fully read=] |response|’s [=response/body=] given |processBody| and |processBodyError|.
Member

I wonder if we need to clone the body first (in which case it probably has to happen early in the lifecycle, before anything else has read the body)? Otherwise I'm concerned that if some other process has already read the response body (which seems likely to be the common case) then at this point we end up not reading any data (or, worse, we stop the data going to platform APIs).

Contributor Author

Oh, do you mean that using fully read can only be done once for a given body? Then yes, we probably have to clone it; I'll need to re-read the spec.
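In miniature, the read-once problem and the tee/clone fix look like this (plain Python generators stand in for the fetch spec's body streams; this is only an analogy, not the spec's algorithm):

```python
import itertools

# A stream can be fully read only once. Teeing it before anyone consumes it
# gives the collector its own copy while the original consumer still sees
# every chunk.
def body_chunks():
    yield b"<html>"
    yield b"</html>"

consumer_stream, collector_stream = itertools.tee(body_chunks())

collected = b"".join(collector_stream)  # "fully read" for collection
delivered = b"".join(consumer_stream)   # the page still gets the full body

print(collected == delivered)  # True: both copies saw every chunk
```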

params: network.AddDataCollectorParameters
)

network.AddDataCollectorParameters = {
Member

How are we going to generalise this to requests? Should we have a "phase" field that's like phase: [+("request" / "response")], but for now only allow "response"?

Contributor Author

Request/response sounds like a "filter". Another filter being "url patterns".

In previous conversations we mentioned the idea of setting up filters separately from adding collectors. So if we follow that plan, restricting to a specific "phase" (or data type, whatever name we pick) could be handled in another command.

But I'm no longer convinced that we should really separate in different commands. We need to clearly describe how the system will behave in scenarios such as:

  • collect requests and responses globally with size configuration C1
  • collect only requests in navigable N with size configuration C2
  • what to do with a response from navigable N? Should it be captured or not? If yes, should it use size configuration C1 or C2?

We can try to specify it now with a phase or type parameter. But we could also wait until we start collecting requests?
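One possible resolution for the scenario above, sketched with hypothetical field names: match collectors on both phase and scope, and only apply configurations from collectors that actually match the data being collected.

```python
# Illustrative only: phase-aware collector matching for the C1/C2 scenario.
def collectors_for(collectors, phase, navigable):
    return [
        c for c in collectors
        if phase in c["phases"]
        and (c["contexts"] is None or navigable in c["contexts"])
    ]

C1 = {"phases": ["request", "response"], "contexts": None, "maxBodySize": 2048}
C2 = {"phases": ["request"], "contexts": ["N"], "maxBodySize": 4096}

# A response in navigable N matches only C1 (C2 is request-only), so it is
# captured, and under this model it uses C1's size configuration.
matched = collectors_for([C1, C2], "response", "N")
print([c["maxBodySize"] for c in matched])
```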


1. If |input context ids| is not empty:

1. Let |navigables| be [=get navigables by ids=] with |input context ids|.
Member

@jgraham jgraham Mar 12, 2025

Suggested change
1. Let |navigables| be [=get navigables by ids=] with |input context ids|.
1. Let |navigables| be the result of [=trying=] to [=get valid navigables by ids=] with |input context ids|.


1. [=set/Append=] |navigable| to |top-level traversables to update|.

1. Otherwise, if |input user context ids| is not empty:
Member

Suggested change
1. Otherwise, if |input user context ids| is not empty:
1. Otherwise, if |input user context ids| is not [=set/empty=]:

<dt>Command Type</dt>
<dd>
<pre class="cddl remote-cddl">
network.AddDataCollector = (
Member

Another question I think we should answer now is how we expect this to work with streams. Do we need a type field that is set to e.g. blob so that we could later have type: stream? Or is streaming the response specific to network request intercepts? I think possibly it isn't, because one could get a handle to the stream when the response starts, and read from that handle even without intercepting later phases of the request. In that case maybe we do want the type field here. But maybe others disagree?

Contributor Author

I admit I'm really not sure how streaming should work.

I can see two use cases for streaming:

  • stream the response as it arrives so that you can potentially rewrite it dynamically
  • stream an already received response because it's too big to retrieve as one chunk

The first use case sounds very tied to network interception and could be restricted to it (e.g. you have to retrieve your stream handle after intercepting in responseStarted).

For the second use case, the request might already be completed, so I imagine we would have to set up the stream on the BiDi side. And in practice for the second use case I was rather imagining this as a parameter for getResponseBody (or an alternate command).

But that's probably not a very cohesive API, not sure how to handle that.

Contributor Author

@juliandescottes juliandescottes left a comment

Thanks for the feedback, I'll adjust the PR!
