chore: bundle v4 catch-up — config redesign + EventManager + StorageClient + ProxyConfiguration#597

Open
B4nan wants to merge 37 commits into v4 from chore/v4-catchup-bundle

@B4nan B4nan commented Apr 30, 2026

What this is

A non-mergeable demo branch that bundles all four v4 catch-up PRs so reviewers can see green CI on the combined state.

| Bundled PR | Concern |
| --- | --- |
| #583 | Configuration redesign integration |
| #594 | PlatformEventManager constructor adapt |
| #595 | StorageClient adapter + KeyValueStore.getPublicUrl async URL signing |
| #596 | ProxyConfiguration v4 API |

Pinned at crawlee@^4.0.0-beta.51.

Recommended merge order: #583 → #594 → #595 → #596

The four focused PRs are rebased into a linear stack:

v4
 └─ #583 (config-redesign)
     └─ #594 (event-manager)
         └─ #595 (storage)
             └─ #596 (proxy)

Each downstream branch already contains its predecessors as ancestors, so the merge order matters mostly for review/CI clarity:

  • Stack order (recommended): each PR's GitHub diff shows only its own commits, CI runs are scoped, reviews stay narrow.
  • Any other order also merges cleanly (the rebased stack guarantees zero conflicts), but a downstream PR merged first pulls its predecessors along. For example, merging #596 (fix: adapt SDK ProxyConfiguration to crawlee v4 API) first would land all four sets of commits in v4 at once, and the still-open upstream PRs would become trivial no-op merges with inflated-looking diffs.

Verified: a sequential merge of #583 → #594 → #595 → #596 into origin/v4 produces zero conflicts at each step. Locally the resulting state passes 75/75 active tests on Node 22 and Node 24. The tree is functionally equivalent to this bundle: the sole diff is a stale ~69-line node_modules/@crawlee/linkedom/node_modules/cheerio block left in this bundle's lockfile from the cheerio-workaround era, which is a pure regeneration artifact and not real divergence.

Do not merge this PR

Merge the four focused PRs above instead. This branch will be deleted once they land.

B4nan and others added 12 commits March 13, 2026 18:36
Refactor the SDK Configuration class to match the new crawlee core
Configuration redesign:

- Subclass core Configuration using `protected static override fields`
- Direct property access (`config.token`) instead of `config.get('token')`
- Immutable: values set via constructor, no `set()` method
- Priority: constructor options > env vars > schema defaults
- isAtHome conditional defaults moved into field definitions
- Use serviceLocator instead of config.useStorageClient/getEventManager
- Import z, coerceNumber, coerceBoolean from @crawlee/core (no direct zod dep)
- Update all .get()/.set() call sites in actor.ts, charging.ts, etc.
- Update tests to use property access

Depends on crawlee PR: apify/crawlee#3474

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
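
The resolution priority named in the commit above (constructor options > env vars > schema defaults) can be illustrated with a minimal sketch. `SketchConfiguration`, the `SKETCH_PERSIST_INTERVAL_MILLIS` env var name, and the `60_000` default are assumptions made up for illustration; only `APIFY_TOKEN` and `persistStateIntervalMillis` appear in this PR, and the real SDK class is not reproduced here.

```typescript
// Environment variables are passed in explicitly so the sketch stays pure.
type SketchEnv = Record<string, string | undefined>;

interface SketchOptions {
    token?: string;
    persistStateIntervalMillis?: number;
}

// Hypothetical stand-in for the SDK Configuration; not the real class.
class SketchConfiguration {
    // `readonly` mirrors the "immutable, no set()" design: values are fixed at construction.
    readonly token: string | undefined;
    readonly persistStateIntervalMillis: number;

    constructor(options: SketchOptions = {}, env: SketchEnv = {}) {
        // Priority: constructor options > env vars > schema defaults.
        this.token = options.token ?? env.APIFY_TOKEN;

        const rawInterval = env.SKETCH_PERSIST_INTERVAL_MILLIS; // hypothetical env var name
        this.persistStateIntervalMillis =
            options.persistStateIntervalMillis ??
            (rawInterval !== undefined ? Number(rawInterval) : 60_000); // arbitrary default
    }
}

const sketchEnv: SketchEnv = { APIFY_TOKEN: 'from-env', SKETCH_PERSIST_INTERVAL_MILLIS: '5000' };

const fromEnvOnly = new SketchConfiguration({}, sketchEnv);
const withOverride = new SketchConfiguration({ token: 'from-constructor' }, sketchEnv);
const defaultsOnly = new SketchConfiguration();
```

Here `withOverride.token` comes from the constructor even though the env var is set, while `defaultsOnly` falls through to the schema default.
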
- Import `z` from `zod` directly (no longer re-exported from crawlee core)
- Define `coerceNumber` locally (no longer exported from crawlee core)
- Add constructor override to accept `ApifyConfigurationInput`
- Import `ConfigurationOptions` from SDK configuration instead of core
- Fix test that mutated env vars after init (immutable config)

Depends on: apify/crawlee#3080

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore the destructuring of `storageDir` and spread of remaining
`storageClientOptions` into the `ApifyClient` constructor so that
arbitrary client options configured via `storageClientOptions` continue
to reach the client.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gration

- Reuse `coerceNumber` from `@crawlee/core` instead of defining a local
  copy; otherwise `FieldsOutput<typeof apifyConfigFields>` produces a
  structurally distinct (but equivalent) `availableMemoryRatio` type
  that breaks declaration-merging with crawlee's `Configuration`.
- Drop the dead `storageClientOptions`/`storageDir` destructuring in
  `Actor.newClient()` — neither key exists in the redesigned
  Configuration; `options` already covers the override path.

The remaining build errors (proxy/storage/event drift) are unrelated
to the config redesign and tracked in separate follow-up PRs against
the v4 branch.
Crawlee v4's `EventManager` constructor now requires
`EventManagerOptions` (just `persistStateIntervalMillis`), and the
base class no longer carries a `config` field — the previous
`override readonly config` pattern is no longer valid.

- Drop the `override` and store `config` as own readonly property.
- Forward `persistStateIntervalMillis` to `super()`.
- Add a `fromConfig()` factory mirroring `LocalEventManager.fromConfig()`
  so the SDK plays nicely with the new ServiceLocator-driven init path.

Stacked on #583 (config redesign); rebases onto v4 once that lands.
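
The shape described in this commit (own `config` property, `persistStateIntervalMillis` forwarded to `super()`, and a `fromConfig()` factory) might look roughly like the sketch below. All `Sketch*` names are hypothetical stand-ins, and the constructor signature of the real `PlatformEventManager` is an assumption.

```typescript
interface SketchEventManagerOptions {
    persistStateIntervalMillis: number;
}

// Stand-in for the v4 base class: it requires options and carries no `config` field.
class SketchEventManager {
    readonly persistStateIntervalMillis: number;

    constructor(options: SketchEventManagerOptions) {
        this.persistStateIntervalMillis = options.persistStateIntervalMillis;
    }
}

// Only the config fields this sketch needs; the real Configuration has many more.
interface SketchEventConfig {
    persistStateIntervalMillis: number;
    actorEventsWsUrl?: string;
}

class SketchPlatformEventManager extends SketchEventManager {
    // An own readonly property, not an `override` of a base-class field.
    readonly config: SketchEventConfig;

    constructor(config: SketchEventConfig) {
        // Forward the one option the base class requires.
        super({ persistStateIntervalMillis: config.persistStateIntervalMillis });
        this.config = config;
    }

    // Factory that lets a locator-style init path build the manager from a config object.
    static fromConfig(config: SketchEventConfig): SketchPlatformEventManager {
        return new SketchPlatformEventManager(config);
    }
}

const sketchEventConfig: SketchEventConfig = { persistStateIntervalMillis: 1000 };
const sketchManager = SketchPlatformEventManager.fromConfig(sketchEventConfig);
```
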
Crawlee v4 reshaped its `StorageClient` interface (async factory
methods that accept `id` *or* `name`), removed the cached
`storageObject` from `KeyValueStore`, and made `getPublicUrl` async.
The existing SDK code targeted the v3 shape and no longer compiles.

Changes:
- New `ApifyStorageClient` adapter wraps `apify-client`'s legacy
  `dataset()/keyValueStore()/requestQueue()` accessors and exposes
  the `createDatasetClient/createKeyValueStoreClient/createRequestQueueClient`
  factories crawlee now expects. Names are resolved to IDs via the
  collection `getOrCreate(name)` calls. apify-client's resource
  clients don't yet implement v4-only members like `getMetadata` /
  `getRecordPublicUrl`; the adapter casts through with a TODO
  comment so the structural alignment can land separately upstream.
- `Actor.init` and `_openStorage` now wrap `this.apifyClient` in
  `ApifyStorageClient` before handing it to crawlee.
- `KeyValueStore.getPublicUrl` is now async; the per-store
  `urlSigningSecretKey` is fetched on demand via the (private)
  `client.getMetadata()` instead of the removed `storageObject`
  cache. URL-signing behaviour for platform-mode reads is preserved.
- `Actor.openRequestQueue` reads `totalRequestCount` via the new
  `client.getMetadata()` (the old `client.get()` was dropped).
- `StorageManager.openStorage` is now `(class, id?, client?)` —
  removed the trailing `this.config` argument.

Stacked on #583 (config redesign); rebases onto v4 once that lands.
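
To make the adapter idea concrete, here is a minimal sketch of resolving a name to an ID through a collection `getOrCreate(name)` call before handing back a resource client. The `Sketch*` interfaces and the in-memory fake are assumptions for illustration; they only mimic the accessor shape described above and are not the real `apify-client` or crawlee types.

```typescript
interface SketchStoreClient {
    readonly id: string;
}

// Mimics the legacy accessor shape: a resource accessor by ID plus a collection accessor.
interface SketchLegacyClient {
    keyValueStore(id: string): SketchStoreClient;
    keyValueStores(): { getOrCreate(name?: string): Promise<{ id: string }> };
}

interface SketchIdOrName {
    id?: string;
    name?: string;
}

class SketchStorageClient {
    private readonly client: SketchLegacyClient;

    constructor(client: SketchLegacyClient) {
        this.client = client;
    }

    // Async factory accepting `id` or `name`; names are resolved to IDs first.
    async createKeyValueStoreClient(options: SketchIdOrName = {}): Promise<SketchStoreClient> {
        const id = options.id ?? (await this.client.keyValueStores().getOrCreate(options.name)).id;
        return this.client.keyValueStore(id);
    }
}

// In-memory fake so the sketch runs without any network access.
const createdStores = new Map<string, string>();
const fakeLegacyClient: SketchLegacyClient = {
    keyValueStore: (id: string) => ({ id }),
    keyValueStores: () => ({
        async getOrCreate(name: string = 'default') {
            if (!createdStores.has(name)) createdStores.set(name, `id-${createdStores.size + 1}`);
            return { id: createdStores.get(name) as string };
        },
    }),
};

async function demoAdapter(): Promise<string[]> {
    const adapter = new SketchStorageClient(fakeLegacyClient);
    const byName = await adapter.createKeyValueStoreClient({ name: 'my-store' });
    const sameName = await adapter.createKeyValueStoreClient({ name: 'my-store' });
    const byId = await adapter.createKeyValueStoreClient({ id: 'explicit-id' });
    return [byName.id, sameName.id, byId.id];
}
```
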
Crawlee v4 reshaped `ProxyConfiguration`:
- `newProxyInfo` and `newUrl` now take a single `TieredProxyOptions`
  argument; the previous `(sessionId, options)` pair is gone.
- The protected `_handleCustomUrl(sessionId)` helper was removed; the
  `_callNewUrlFunction` and `_handleTieredUrl` helpers now take options
  only.
- `ProxyInfo` (in `@crawlee/types`) no longer carries `sessionId`.

Changes:
- `newProxyInfo` and `newUrl` accept `string | number |
  TieredProxyOptions | undefined` so existing SDK callers that pass a
  raw `sessionId` keep working, while the override remains compatible
  with crawlee's v4 signature. A small `parseSessionIdOrOptions`
  helper discriminates and pulls `sessionId` from `options.request`
  when no explicit one is given.
- Inlined custom-URL session stickiness via a new private
  `getSessionIndex(sessionId)` (replacing the removed
  `_handleCustomUrl`), keyed on `usedProxyUrls` like the base class.
- Re-declared `sessionId?: string` on the SDK's `ProxyInfo` interface
  so users can still read `proxyInfo.sessionId` (v3 carried it on the
  base type).
- Re-imported `ProxyInfo` from `@crawlee/types` (no longer re-exported
  from `@crawlee/core`).
- Tightened a `proxyUrls.some(url => url.includes(...))` access for
  the new `(string | null)[]` array shape.

Stacked on #583 (config redesign); rebases onto v4 once that lands.
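
A helper like the `parseSessionIdOrOptions` described above could be sketched as follows. This is not the SDK's implementation: the `Sketch*` types are hypothetical, and reading the fallback from `options.request.sessionId` is an assumption about where the request carries its session ID. The plain-object check reflects the later commit in this PR that rejects values such as `Date` or `Array`.

```typescript
interface SketchRequest {
    sessionId?: string; // assumption: where the request carries its session ID
}

interface SketchProxyOptions {
    request?: SketchRequest;
}

// True only for object literals and null-prototype objects; rejects Date, Array, etc.
function isPlainObject(value: unknown): boolean {
    if (typeof value !== 'object' || value === null) return false;
    const proto = Object.getPrototypeOf(value);
    return proto === Object.prototype || proto === null;
}

function parseSessionIdOrOptionsSketch(
    arg?: string | number | SketchProxyOptions,
): { sessionId?: string; options: SketchProxyOptions } {
    if (arg === undefined) return { options: {} };

    // Legacy call shape: a raw session ID.
    if (typeof arg === 'string' || typeof arg === 'number') {
        return { sessionId: String(arg), options: {} };
    }

    if (!isPlainObject(arg)) {
        throw new TypeError('Expected a session ID or a plain options object');
    }

    // v4 call shape: options only; fall back to the session ID on the request, if any.
    return { sessionId: arg.request?.sessionId, options: arg };
}

const fromRawId = parseSessionIdOrOptionsSketch(42);
const fromRequest = parseSessionIdOrOptionsSketch({ request: { sessionId: 'abc' } });
const fromEmpty = parseSessionIdOrOptionsSketch({});
```
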
…ent cases

Crawlee v4's Configuration resolves env vars eagerly at construction,
so the existing 'Actor.newClient() reads environment variables
correctly' test reads stale values once a prior test or import-time
side effect has already created the singleton. Reset both before
each case.
`Configuration.useEventManager()` was removed in crawlee v4. Install
the platform event manager via the global service locator instead, and
reset between tests so each case can register a fresh manager without
hitting `ServiceConflictError`.
Replace the removed `StorageManager.clearCache()` and
`Configuration.useStorageClient()` with `serviceLocator.reset()`
plus `serviceLocator.setStorageClient()`.
Crawlee v4's `Configuration` is eager — `actorEventsWsUrl` is read
once at construction, so a global config that pre-existed the
`beforeEach` would never see the websocket URL we set, and
`events.init()` would silently never connect. Move the env-var setup
above `Configuration.getGlobalConfig()` and reset the SDK's static
singleton so each test rebuilds a fresh config.
The SDK's `Configuration` keeps its own static singleton separate from
crawlee's serviceLocator. Resetting only the locator wasn't enough —
`Configuration.getGlobalConfig()` still handed back the stale cached
SDK config (which was built before the test set `APIFY_TOKEN`).
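
The stale-singleton problem behind these test fixes can be reproduced in a few lines. `SketchEagerConfig` and its `resetGlobalConfig()` helper are hypothetical; the sketch only assumes what the commits state, namely that env vars are read once at construction and that the config is cached in a static singleton.

```typescript
type SketchProcessEnv = Record<string, string | undefined>;

class SketchEagerConfig {
    private static globalConfig: SketchEagerConfig | undefined;

    readonly token: string | undefined;

    constructor(env: SketchProcessEnv) {
        // Eager: the env var is read once, here, and never again.
        this.token = env.APIFY_TOKEN;
    }

    static getGlobalConfig(env: SketchProcessEnv): SketchEagerConfig {
        if (SketchEagerConfig.globalConfig === undefined) {
            SketchEagerConfig.globalConfig = new SketchEagerConfig(env);
        }
        return SketchEagerConfig.globalConfig;
    }

    // Hypothetical test-only helper: drop the cached singleton.
    static resetGlobalConfig(): void {
        SketchEagerConfig.globalConfig = undefined;
    }
}

const processEnvSketch: SketchProcessEnv = {};

// Something (a prior test, an import-time side effect) creates the singleton early.
const early = SketchEagerConfig.getGlobalConfig(processEnvSketch);

// The test sets the env var afterwards, but the cached config never sees it.
processEnvSketch.APIFY_TOKEN = 'secret';
const stillStale = SketchEagerConfig.getGlobalConfig(processEnvSketch);

// Resetting the singleton forces a rebuild that picks up the new value.
SketchEagerConfig.resetGlobalConfig();
const fresh = SketchEagerConfig.getGlobalConfig(processEnvSketch);
```
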
@B4nan B4nan force-pushed the chore/v4-catchup-bundle branch from b493549 to b9ed56b Compare April 30, 2026 16:30
@B4nan B4nan force-pushed the chore/v4-catchup-bundle branch from b9ed56b to f27efdd Compare April 30, 2026 16:35
@B4nan B4nan force-pushed the chore/v4-catchup-bundle branch from f27efdd to 53477ed Compare April 30, 2026 17:12
B4nan added 8 commits April 30, 2026 19:14
…apter

- `openRequestQueue should open storage`: mock client uses
  `getMetadata()` (the v3 `get()` was dropped on
  RequestQueueClient).
- Both Storage API tests assert that StorageManager.openStorage is
  called with an ApifyStorageClient (matched structurally) instead of
  the raw ApifyClient — the SDK now wraps it for crawlee v4.
- Reword "empty string maxTotalChargeUsd" assertion: under Option A
  the empty env var is now treated as unset, so `config.maxTotalChargeUsd`
  is `undefined` (charging manager still defaults to Infinity).
- Actor.getInput tests now build a fresh Actor *after* setting the
  env vars they exercise — eager config resolution means a single
  module-scoped TestingActor would carry stale values.
- Custom URL rotation: post-increment the round-robin index so the
  first sessionless call returns proxyUrls[0] (was off-by-one).
- Surface `username` on the returned ProxyInfo by parsing it out of
  the resolved URL — v3 carried it via `super.newProxyInfo`.
- parseSessionIdOrOptions now rejects non-plain objects (e.g. Date,
  Array) so `newUrl(new Date())` throws as users expect.

test: `newUrl({})` is no longer 'invalid' — empty TieredProxyOptions
is a legal v4 call shape; documented the carve-out.
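
The off-by-one fix for custom URL rotation comes down to reading the index before advancing it. A minimal sketch, with a hypothetical `SketchRotator` class that is not the SDK's actual code:

```typescript
class SketchRotator {
    private nextIndex = 0;
    private readonly proxyUrls: string[];

    constructor(proxyUrls: string[]) {
        if (proxyUrls.length === 0) throw new Error('At least one proxy URL is required');
        this.proxyUrls = proxyUrls;
    }

    next(): string {
        // Post-increment: use the current index, then advance it.
        // A pre-increment here would skip proxyUrls[0] on the first call.
        return this.proxyUrls[this.nextIndex++ % this.proxyUrls.length];
    }
}

const rotator = new SketchRotator(['http://proxy-a.example.com', 'http://proxy-b.example.com']);
const firstUrl = rotator.next();
const secondUrl = rotator.next();
const thirdUrl = rotator.next();
```
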
…oxyInfo shape

- newUrl/newProxyInfo accept an optional second `legacyOptions`
  argument so existing callers that pass `(sessionId, {request})`
  keep working under the v4 shape too.
- Returned ProxyInfo omits Apify-only fields (groups, countryCode)
  when not using Apify Proxy and only includes `proxyTier` when
  defined — matches v3's strict-deep-equal expectations.
…nfiguration tests

- ProxyInfo.username is now the decoded form (`user@name` rather
  than `user%40name`), matching v3 behaviour and the test
  expectations.
- Added a beforeEach to the `Actor.createProxyConfiguration()`
  describe that resets serviceLocator + Configuration.globalConfig +
  Actor._instance so each test sees the env vars it sets.
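
The decoded-username behaviour can be sketched with the standard `URL` class, whose `username` getter returns the percent-encoded form. The helper name `usernameFromProxyUrl` and the example URL are made up for illustration; how the SDK actually extracts the value is not shown in this PR.

```typescript
// Parse the username out of a proxy URL and return its decoded form.
function usernameFromProxyUrl(proxyUrl: string): string {
    // `URL.username` keeps percent-encoding (e.g. `user%40name`), so decode it explicitly.
    return decodeURIComponent(new URL(proxyUrl).username);
}

const decodedUsername = usernameFromProxyUrl('http://user%40name:secret@proxy.example.com:8000');
// decodedUsername === 'user@name'
```
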
@B4nan B4nan force-pushed the chore/v4-catchup-bundle branch from 53477ed to fe64186 Compare April 30, 2026 17:31
Crawlee's Configuration uses crawleeConfigFields and only knows about
`CRAWLEE_INPUT_KEY`. The SDK extension adds `ACTOR_INPUT_KEY` /
`APIFY_INPUT_KEY` env-var aliases, which the test relies on.
Importing Configuration from 'apify' makes `new Configuration()`
inside buildActor() resolve those env vars correctly.
@B4nan B4nan force-pushed the chore/v4-catchup-bundle branch from fe64186 to 09fdf11 Compare April 30, 2026 17:33
`@crawlee/linkedom@4.0.0-beta.49`'s `linkedom-crawler.js` imports
`cheerio` without declaring it as a dependency. Locally this works
when a parent directory has cheerio installed; CI's fresh install
fails. Adding it directly here keeps tests green until the upstream
package fixes the missing dep declaration.
@B4nan B4nan force-pushed the chore/v4-catchup-bundle branch from 9bb7058 to 5b6598a Compare April 30, 2026 17:45
Actor.init() calls Configuration.storage.enterWith(this.config), which
sticks the resolved config onto the current async context and persists
across tests on Node 22 (but not Node 24+). The cached value short-
circuits Configuration.getGlobalConfig() so subsequent tests never see
the env vars they just set.

Reset the AsyncLocalStorage value alongside the other singletons in
the test emulator so addWebhook (and friends) see ACTOR_RUN_ID etc.
@B4nan B4nan force-pushed the chore/v4-catchup-bundle branch from 856d554 to 5b6598a Compare April 30, 2026 17:51
B4nan added a commit to apify/crawlee that referenced this pull request Apr 30, 2026
## Summary

`packages/linkedom-crawler/src/internals/linkedom-crawler.ts` imports
`cheerio` (`import * as cheerio from 'cheerio'`) but
`@crawlee/linkedom`'s `package.json` doesn't list it as a dependency.

It works inside the monorepo because cheerio is hoisted to the workspace
root via other packages (`@crawlee/cheerio`, `@crawlee/utils`,
`@crawlee/http`, …), so Node always finds it. **Downstream installs that
depend only on `crawlee`** (which re-exports `@crawlee/linkedom`) **and
don't pull any cheerio-using sibling** fail at runtime:

```
Error: Cannot find package 'cheerio' imported from .../node_modules/@crawlee/linkedom/internals/linkedom-crawler.js
```

This bit the apify-sdk-js v4 catch-up PRs (apify/apify-sdk-js#597) on a
clean CI install — without this fix, every consumer has to ship a
`cheerio` dev-dep workaround.

The fix is one-line: declare `cheerio: "^1.0.0"` (matching what
`@crawlee/cheerio` already pins).
`@crawlee/linkedom@4.0.0-beta.51` now declares cheerio as a direct
dependency (apify/crawlee#3620), so the SDK no longer has to ship its
own cheerio devDep to mask the missing declaration.
@B4nan B4nan force-pushed the chore/v4-catchup-bundle branch from e10479c to fda6873 Compare April 30, 2026 18:15
crawlee v4 (apify/crawlee#3599, beta.51) removed `tieredProxyUrls`,
`tieredProxyConfig`, `_handleTieredUrl`, and `proxyTier` from
`ProxyConfiguration` / `ProxyInfo`. The SDK's wrapper used to thread
those through to the base class; with the upstream API gone, that
plumbing has to go too.

- Remove the `tieredProxyConfig` field from the SDK's
  `ProxyConfigurationOptions`.
- Drop the constructor branch that forwarded `tieredProxyUrls` /
  `tieredProxyConfig` to the base class and the now-unreachable
  `_generateTieredProxyUrls` helper.
- Drop the `tieredProxyUrls` short-circuit and `proxyTier` field
  from `newUrl` / `newProxyInfo`.
- Drop the corresponding test groups in `proxy_configuration.test.ts`.
@B4nan B4nan force-pushed the chore/v4-catchup-bundle branch from fda6873 to d649a33 Compare April 30, 2026 18:23
B4nan added 4 commits April 30, 2026 20:55
…eset

`Actor.init()` calls `Configuration.storage.enterWith(this.config)`,
which sets the AsyncLocalStorage value on whichever async context the
test runner happened to be on. `enterWith(undefined)` from a child
async branch (vitest's beforeEach) doesn't unwind that — on Node 22
the test body re-enters a sibling context where the original
`enterWith` is still in effect, so `getStore()` still returns the
stale Configuration even after our reset.

Swapping the entire `AsyncLocalStorage` instance for a fresh one
guarantees `getStore()` returns `undefined` for every async branch
that follows, fixing the addWebhook test failures on Node 22.
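
The instance swap can be demonstrated with Node's `AsyncLocalStorage` directly. `SketchConfigHolder` and `resetStorage()` are hypothetical names; the sketch only relies on the documented behaviour that a store set via `enterWith()` belongs to one `AsyncLocalStorage` instance, so a fresh instance starts with no store.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

interface SketchAlsConfig {
    token?: string;
}

class SketchConfigHolder {
    // Mirrors a static `storage` field that init code calls `enterWith()` on.
    static storage = new AsyncLocalStorage<SketchAlsConfig>();

    static current(): SketchAlsConfig | undefined {
        return SketchConfigHolder.storage.getStore();
    }

    // Test-only reset: replace the whole instance instead of calling enterWith(undefined),
    // so no async branch can still observe the previously entered store.
    static resetStorage(): void {
        SketchConfigHolder.storage = new AsyncLocalStorage<SketchAlsConfig>();
    }
}

// Simulates init code sticking a config onto the current async context.
SketchConfigHolder.storage.enterWith({ token: 'stale' });
const beforeReset = SketchConfigHolder.current();

SketchConfigHolder.resetStorage();
const afterReset = SketchConfigHolder.current();
```

After the swap, `afterReset` is `undefined` because the new instance has never had a store entered on it.
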
@B4nan B4nan force-pushed the chore/v4-catchup-bundle branch from d649a33 to f0ced4f Compare April 30, 2026 18:56