feat(crawler): Add federation crawler CLI tool with config and state tracking#87
feat(crawler): Add federation crawler CLI tool with config and state tracking#87ontehfritz merged 8 commits intomainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a new CLI binary Changes
Sequence DiagramsequenceDiagram
participant CLI as demarkus-agent CLI
participant Config as Config Loader
participant State as State Manager
participant Crawler as Crawler
participant Fetch as Fetch Client
participant Hub as Hub Publisher
CLI->>Config: Load(path, seeds)
Config->>Config: Validate()
Config-->>CLI: Config
CLI->>State: LoadState(path)
State-->>CLI: State
CLI->>Crawler: NewCrawler(cfg, client, state, tokens)
Crawler-->>CLI: Crawler
CLI->>Crawler: Run(ctx)
activate Crawler
Crawler->>Crawler: Seed worker queue
loop until ctx cancel or limits reached
Crawler->>Fetch: List(host, path, token)
Fetch-->>Crawler: Directory entries
Crawler->>Fetch: Fetch(host, path, token)
Fetch-->>Crawler: File + metadata (etag, content-hash)
Crawler->>State: RecordVisit(url, etag, status, hash)
Crawler->>Crawler: Discover mark:// links -> enqueue new hosts
end
Crawler->>State: Save()
Crawler-->>CLI: CrawlResult
deactivate Crawler
alt publish enabled and hubs configured
CLI->>Crawler: PublishToHubs(ctx, perServer)
activate Crawler
Crawler->>Crawler: Generate indexes (per-server or global)
loop for each hub
Crawler->>Hub: Publish(host, path, body, expectedVersion=0, meta)
Hub-->>Crawler: Result
end
Crawler-->>CLI: publish count / errors
deactivate Crawler
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 15
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/cmd/demarkus-agent/main.go`:
- Line 143: The flag help incorrectly labels config as required: either remove
"(required)" from the fs.String call or add explicit validation after flag
parsing to enforce it; to enforce, check the value pointed to by configPath (the
variable returned from fs.String) in main after fs.Parse(), and if empty,
log/fatal with a clear message and exit before calling fedcrawl.Load, otherwise
proceed to call fedcrawl.Load(*configPath) as before.
- Line 57: The call to fs.Parse(args) ignores its returned error; change it to
capture and handle the error (e.g., err := fs.Parse(args); if err != nil {
processLogger.Errorf("flag parse failed: %v", err); os.Exit(2) } or return the
error) so errcheck is satisfied—update the main.go usage around fs.Parse and
keep flag.ExitOnError behavior if desired.
- Line 154: The call to fs.Parse(args) in main (file
client/cmd/demarkus-agent/main.go) doesn't check the returned error; update this
call to mirror the crawlMain fix by capturing the error from fs.Parse(args),
logging the error with context (including the error details) and exiting
non‑zero when parse fails; locate the fs.Parse(args) invocation and modify it to
handle the error similarly to how crawlMain checks and handles fs.Parse.
- Around line 15-16: The imports block in package main
(client/cmd/demarkus-agent/main.go) contains an extra blank line separating
import groups; edit the imports section in main.go to remove the blank line or
consolidate groups so all imports are in one contiguous import block (i.e.,
adjust the import statement in the file to eliminate the empty line between
groups).
In `@client/internal/fedcrawl/config_test.go`:
- Around line 158-207: Add validation in the Config.Validate() method to treat a
negative Politeness.RequestDelay as invalid (return an error) and add a unit
test in client/internal/fedcrawl/config_test.go (inside TestValidate) that sets
cfg.Politeness.RequestDelay = -1 and asserts Validate() returns an error;
reference the Validate() method and the Politeness.RequestDelay field when
making the change so the negative duration is rejected at validation time.
In `@client/internal/fedcrawl/config.go`:
- Around line 86-107: The duplicated default-setting block that resets fields
like cfg.Crawl.MaxDepth, cfg.Crawl.MaxServers, cfg.Crawl.MaxDocuments,
cfg.Crawl.Workers, cfg.Schedule.Interval, cfg.Politeness.RequestDelay and
cfg.Politeness.PerServerConcurrency should be removed and rely on
DefaultConfig() to provide defaults (DefaultConfig() already pre-populates cfg
before TOML decode); either delete this block in config.go or replace it with
references to shared constants if you want a single source of truth, and add a
brief comment near DefaultConfig() documenting the chosen semantics for TOML
zero values (i.e., whether explicit 0 in TOML should override defaults).
- Around line 117-126: Replace the manual length-and-slice URL scheme checks in
the validation loop over c.Seeds and c.Hubs with strings.HasPrefix to make the
intent clearer and safer; specifically, update the loops that currently test
`len(seed) < 7 || seed[:7] != "mark://"` and `len(hub) < 7 || hub[:7] !=
"mark://"` to use `strings.HasPrefix(seed, "mark://")` and
`strings.HasPrefix(hub, "mark://")` respectively, keep the existing fmt.Errorf
messages, and add the "strings" import to the file.
- Around line 64-84: In Load, CLI seeds set via cfg.Seeds before toml.Decode get
overwritten by the TOML file; change the logic in Load so toml.Decode runs first
into cfg and then apply CLI seeds (or merge them) so CLI flags take precedence:
run toml.Decode(string(data), &cfg) before modifying cfg.Seeds, and if seeds
slice is non-empty copy/append those values to cfg.Seeds (or replace cfg.Seeds)
to ensure CLI-provided seeds override any seeds parsed from the file (refer to
Load, cfg.Seeds and toml.Decode).
In `@client/internal/fedcrawl/crawl_test.go`:
- Around line 303-314: The custom contains and containsMiddle functions
duplicate standard library behavior; replace all uses of contains(s, substr)
with strings.Contains(s, substr) and remove the contains and containsMiddle
helpers; ensure you import "strings" in client/internal/fedcrawl/crawl_test.go
and run tests to confirm no remaining references to contains or containsMiddle
remain.
In `@client/internal/fedcrawl/crawl.go`:
- Around line 364-404: PublishToHubs currently swallows parse and publish
errors; update it to collect and return errors instead of silently continuing:
when fetch.ParseMarkURL(hub+"/") fails, append a descriptive error (including
hub and parse error) to an errors slice; when c.publishIndex(...) returns an
error (both in the perServer loop and the aggregated branch), append a
descriptive error (including hub, idxPath and serverHost if applicable) to the
same slice; after the loops, if the slice is non-empty return successCount and a
combined error (use errors.Join or fmt.Errorf to aggregate messages) so callers
of PublishToHubs (and callers of c.resolveToken / c.publishIndex) can observe
failures.
- Around line 41-51: NewCrawler currently accepts nil for State and tokens.Store
but doesn't document it; update the NewCrawler function comment to state that
the State and tokenStore parameters are optional (can be nil) and that the
Crawler methods guard access via c.state != nil and c.tokens != nil, or
alternatively add defensive checks inside NewCrawler to populate safe no-op
defaults for state and tokenStore; reference NewCrawler, Config, FetchClient,
State, tokens.Store, Crawler, and the c.state / c.tokens guards so reviewers can
locate and verify the change.
- Line 67: The channel buffer is hardcoded as make(chan string, 100) which can
underprovision when MaxServers is larger; change the buffer size to the maximum
of MaxServers and 100 (e.g., compute bufSize := max(cfg.MaxServers, 100) and use
make(chan string, bufSize)) so discovery won’t backpressure workers; update the
queue creation in crawl.go (the queue variable) to use that computed buffer size
and ensure cfg.MaxServers (or the appropriate MaxServers variable) is used.
- Around line 292-302: The code holds c.mu while calling wg.Add(1) and sending
queue <- host which can block and starve other goroutines; change the flow in
the discovery section (the block around c.mu.Lock()/c.mu.Unlock() that checks
c.servers and c.cfg.Crawl.MaxServers) so you only hold the mutex to test/update
shared state (c.servers and the MaxServers check), record whether the host was
newly added in a local flag, then release c.mu before calling wg.Add(1) and
queue <- host; ensure the conditional that increments the waitgroup and enqueues
the host (wg.Add and sending on queue) happens only if that local flag indicates
a new host.
- Around line 92-97: The early check using activeServers against
c.cfg.Crawl.MaxServers is racy and measures in-flight crawls not total
discovered servers; remove the block that reads activeServers.Load() and the
subsequent activeServers.Add(1)/defer activeServers.Add(-1) here, or if you want
an early exit use the real total-server metric by checking len(c.servers) under
the crawler's mutex (e.g. protect access to c.servers with the same mutex used
elsewhere) and return when that guarded length >= c.cfg.Crawl.MaxServers so the
true limit (enforced later at len(c.servers) >= MaxServers) is respected without
TOCTOU races.
In `@client/internal/fedcrawl/state.go`:
- Around line 86-90: The assignment at the end of load/initialization sets
s.servers = doc.Servers and s.urls = doc.URLs which can be nil if JSON had
null/omitted keys and will cause map write panics in RecordVisit or
RecordServer; guard against nil by initializing empty maps when doc.Servers or
doc.URLs are nil before assigning (e.g., if doc.Servers == nil then set
s.servers = make(map[...]) else assign doc.Servers; same for doc.URLs) so later
calls to RecordVisit and RecordServer can safely write to the maps.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 86375c33-87ec-4395-8683-286c5ba96c8f
📒 Files selected for processing (9)
.gitignoreclient/cmd/demarkus-agent/main.goclient/demarkus-agentclient/internal/fedcrawl/config.goclient/internal/fedcrawl/config_test.goclient/internal/fedcrawl/crawl.goclient/internal/fedcrawl/crawl_test.goclient/internal/fedcrawl/state.goclient/internal/fedcrawl/state_test.go
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/cmd/demarkus-agent/main.go`:
- Around line 204-209: The daemon currently allows cfg.Schedule.Interval == 0
(cfg.Validate only checks < 0) which causes
time.NewTicker(cfg.Schedule.Interval) to panic; add an explicit non-positive
check before creating the ticker: after calling cfg.Validate() and before
time.NewTicker, check if cfg.Schedule.Interval <= 0 and return/log a fatal error
(or surface the error to caller) so the process exits with a clear message;
alternatively update cfg.Validate() to reject <= 0 for Schedule.Interval to
ensure invalid configs are caught earlier (reference symbols: cfg.Validate,
cfg.Schedule.Interval, time.NewTicker).
In `@client/internal/fedcrawl/crawl_test.go`:
- Around line 145-186: The current limit tests miss overshoot paths: update the
tests so they exercise cross-server document enqueueing and the seed-enqueue
path that can bypass caps. For TestCrawlerMaxDocuments add a scenario where two
different servers each serve documents (e.g., seed points to serverA which links
to serverB) and set cfg.Crawl.MaxDocuments = 1 to ensure the crawler caps across
workers/servers; for TestCrawlerMaxServers add a case where len(cfg.Seeds) >
cfg.Crawl.MaxServers (e.g., provide two seed URLs but MaxServers=1) to verify
seeds are not enqueued past the cap; keep using newMockClient(), addList/addDoc
and Run() to assert result.DocumentsCrawled and result.ServersDiscovered are
properly capped. Ensure the new subcases reference TestCrawlerMaxDocuments and
TestCrawlerMaxServers so CI covers both failing paths.
In `@client/internal/fedcrawl/crawl.go`:
- Around line 175-177: The current code checks int(docCount.Load()) >=
c.cfg.Crawl.MaxDocuments only after the network call, allowing races; change to
reserve a slot atomically before calling Fetch by performing an atomic increment
on docCount (or using CompareAndSwap semantics) and validate the new value does
not exceed c.cfg.Crawl.MaxDocuments—if it does, decrement/rollback and return
immediately. After reserving, call Fetch; on any fetch error or non-OK response,
decrement/rollback the reserved slot so DocumentsCrawled only reflects
successful documents. Apply the same pattern to all similar checks/blocks
referenced (the one using docCount.Load() and any places updating
DocumentsCrawled around Fetch) so reservation happens before the network request
and rollback occurs on failures.
- Around line 120-137: The initial seed loop in Run() queues all seeds without
honoring the configured MaxServers, so update that loop (the section iterating
c.cfg.Seeds) to check the cap before marking and enqueueing a host: inside the
c.mu-guarded section check the current number of entries in c.servers (or a
separate counter) against c.cfg.MaxServers and if the cap is reached skip
remaining seeds (or break), otherwise mark c.servers[host]=true and then
enqueue; ensure the check is done while holding c.mu to avoid races and keep the
existing use of wg.Add(1) and queue <- host semantics unchanged; this enforces
the same MaxServers limit as discoverServers.
- Around line 312-316: Remove the nil guard in Crawler.resolveToken so
tokens.Resolve is always called; currently the function returns "" when c.tokens
== nil and skips env-based fallback (DEMARKUS_AUTH). Update Crawler.resolveToken
to unconditionally call tokens.Resolve("", host, c.tokens) so the documented
cascade (explicit > env > store) is honored even when c.tokens is nil.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d5de9808-5a11-4f70-b769-155900d14016
📒 Files selected for processing (6)
client/cmd/demarkus-agent/main.goclient/internal/fedcrawl/config.goclient/internal/fedcrawl/config_test.goclient/internal/fedcrawl/crawl.goclient/internal/fedcrawl/crawl_test.goclient/internal/fedcrawl/state.go
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (3)
client/internal/fedcrawl/crawl.go (3)
173-175:⚠️ Potential issue | 🟠 MajorReserve the document slot before
Fetch().The load-then-fetch pattern is racy. Multiple workers can all observe spare capacity, issue extra requests, and increment past
MaxDocuments. Those over-budget documents also still reachdiscoverServers()on Line 257.Reserve and roll back
- doc, err := c.client.Fetch(host, fullPath, token) + if n := docCount.Add(1); int(n) > c.cfg.Crawl.MaxDocuments { + docCount.Add(-1) + return nil + } + + doc, err := c.client.Fetch(host, fullPath, token) if err != nil { + docCount.Add(-1) log.Printf("[WARN] fetch %s%s: %v", host, fullPath, err) continue } @@ if doc.Response.Status != protocol.StatusOK { + docCount.Add(-1) continue } @@ - docCount.Add(1) *count++Also applies to: 198-200, 225-260
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/fedcrawl/crawl.go` around lines 173 - 175, The check-then-fetch is racy: instead of calling Fetch() after reading docCount.Load(), reserve a document slot atomically before issuing Fetch by incrementing the shared counter (docCount) using an atomic increment/CAS; if the increment would exceed c.cfg.Crawl.MaxDocuments, decrement/rollback and skip Fetch. After Fetch returns, decrement the counter on any error or when the document is not accepted, and ensure successful documents remain counted. Apply the same reserve-and-rollback pattern around the other occurrences noted (the blocks referencing docCount at lines ~198-200 and ~225-260) and keep discoverServers() unchanged except that it should only be called for requests that successfully reserved a slot.
309-314:⚠️ Potential issue | 🟠 MajorDo not special-case
niltoken stores here.This makes the crawler take a different auth-resolution path than the rest of
client/internal. Lettokens.Resolveown the fallback logic and call it unconditionally. If you take this change, Line 43 should also stop promising ac.tokens != nilguard.As per coding guidelines, `Token resolution: consistent across CLI, TUI, MCP`.Minimal fix
func (c *Crawler) resolveToken(host string) string { - if c.tokens == nil { - return "" - } return tokens.Resolve("", host, c.tokens) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/fedcrawl/crawl.go` around lines 309 - 314, The resolveToken method in Crawler special-cases a nil token store which makes auth resolution inconsistent; remove the nil check in Crawler.resolveToken and always call tokens.Resolve("", host, c.tokens) so tokens.Resolve owns fallback behavior, and update any comment/contract that promises c.tokens != nil (e.g., the earlier method/class comment asserting a non-nil guard) to no longer claim that guarantee; keep function name Crawler.resolveToken and the tokens.Resolve call unchanged.
118-136:⚠️ Potential issue | 🟠 MajorHonor
MaxServerswhile queuing seeds.If the config contains more seeds than the cap, this loop still queues every parsed host and
ServersDiscoveredcan exceedMaxServersbefore discovery starts.Minimal fix
c.mu.Lock() + if len(c.servers) >= c.cfg.Crawl.MaxServers { + c.mu.Unlock() + break + } if c.servers[host] { c.mu.Unlock() continue }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/fedcrawl/crawl.go` around lines 118 - 136, The seeding loop adds every parsed host without checking the configured cap, so ensure you stop queuing once the cap is reached by checking MaxServers under c.mu before marking/adding a host: after parsing the seed URL but before setting c.servers[host]=true, lock c.mu and if len(c.servers) >= c.cfg.MaxServers then unlock and break/stop queuing; otherwise mark the host, unlock, wg.Add(1) and queue <- host. Reference c.cfg.Seeds, c.cfg.MaxServers, the c.servers map and the mutex c.mu in crawl.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/fedcrawl/crawl.go`:
- Around line 35-39: Crawler currently retains previous crawl state because the
maps on the Crawler struct (servers, hashes) are never cleared; update the Run()
method to reinitialize/reset these maps (and any related counters) at the start
of Run() so each invocation begins with an empty crawl state, e.g., set
c.servers = make(map[string]bool) and c.hashes = make(map[string]index.Entry)
(while respecting c.mu for concurrency) and ensure any code paths that rely on
aggregated counts use the fresh maps; alternatively, document Crawler as
single-use if you intend to keep cumulative state.
- Around line 77-82: The code currently logs file-fetch failures only to the
process log, so callers can’t see partial-crawl errors; pass the existing
thread-safe recorder function (recordError) into crawlServer() and walkDir()
(and any helper that performs file fetches around the fetch/error handling at
the other spot noted ~225-228) and call it to append failures (including host
and fullPath context) so those messages end up in CrawlResult.Errors instead of
only the process log; ensure you use the same recordError closure (or an
equivalent function that appends to crawlErrors under errorsMu) wherever file
downloads or host/fullPath failures occur.
- Around line 37-38: The current map field 'hashes map[string]index.Entry'
collapses multiple locations with the same content-hash into one entry causing
data loss in Hashes(), IndexForServer(), and GlobalIndex(); change the storage
to preserve multiplicity (e.g., make 'hashes map[string][]index.Entry' or use a
composite key that includes server/path) and update all uses—particularly the
building logic in Hashes(), IndexForServer(), and GlobalIndex() and any
tie-break logic around lines 331-357—to append entries instead of overwriting so
you aggregate/flatten entries when producing the final indexes.
- Around line 113-116: The crawl routine currently spawns workers in a loop
using c.cfg.Crawl.Workers and then enqueues seeds, which deadlocks if Workers <=
0 because no goroutine calls wg.Done() (worker(), wg.Done). Add an explicit
guard early in the crawl function: check c.cfg.Crawl.Workers and if it's <= 0
return an error (or validation failure) before starting workers or enqueuing
seeds; reference the c.cfg.Crawl.Workers field and the worker() goroutine to
locate where to insert this validation. Ensure the function returns/aborts
rather than proceeding so the seed-enqueue and wg.Wait paths cannot block
forever.
---
Duplicate comments:
In `@client/internal/fedcrawl/crawl.go`:
- Around line 173-175: The check-then-fetch is racy: instead of calling Fetch()
after reading docCount.Load(), reserve a document slot atomically before issuing
Fetch by incrementing the shared counter (docCount) using an atomic
increment/CAS; if the increment would exceed c.cfg.Crawl.MaxDocuments,
decrement/rollback and skip Fetch. After Fetch returns, decrement the counter on
any error or when the document is not accepted, and ensure successful documents
remain counted. Apply the same reserve-and-rollback pattern around the other
occurrences noted (the blocks referencing docCount at lines ~198-200 and
~225-260) and keep discoverServers() unchanged except that it should only be
called for requests that successfully reserved a slot.
- Around line 309-314: The resolveToken method in Crawler special-cases a nil
token store which makes auth resolution inconsistent; remove the nil check in
Crawler.resolveToken and always call tokens.Resolve("", host, c.tokens) so
tokens.Resolve owns fallback behavior, and update any comment/contract that
promises c.tokens != nil (e.g., the earlier method/class comment asserting a
non-nil guard) to no longer claim that guarantee; keep function name
Crawler.resolveToken and the tokens.Resolve call unchanged.
- Around line 118-136: The seeding loop adds every parsed host without checking
the configured cap, so ensure you stop queuing once the cap is reached by
checking MaxServers under c.mu before marking/adding a host: after parsing the
seed URL but before setting c.servers[host]=true, lock c.mu and if
len(c.servers) >= c.cfg.MaxServers then unlock and break/stop queuing; otherwise
mark the host, unlock, wg.Add(1) and queue <- host. Reference c.cfg.Seeds,
c.cfg.MaxServers, the c.servers map and the mutex c.mu in crawl.go.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1b7cf47f-c9e2-4078-968c-719c2333e565
📒 Files selected for processing (1)
client/internal/fedcrawl/crawl.go
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
client/internal/fedcrawl/crawl.go (1)
331-335:⚠️ Potential issue | 🟠 MajorDo not bypass centralized token resolution.
The nil guard changes auth behavior based on constructor wiring: when
c.tokens == nil, env-based fallback is never consulted. Calltokens.Resolve("", host, c.tokens)unconditionally.As per coding guidelines, "Token resolution: consistent across CLI, TUI, MCP".Minimal change
func (c *Crawler) resolveToken(host string) string { - if c.tokens == nil { - return "" - } return tokens.Resolve("", host, c.tokens) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/fedcrawl/crawl.go` around lines 331 - 335, The resolveToken method's nil guard causes env-based fallback to be skipped when c.tokens is nil; remove the conditional and unconditionally call tokens.Resolve("", host, c.tokens) inside func (c *Crawler) resolveToken(host string) so centralized token resolution is always used (leave the function signature and return value unchanged and only delete the nil-check branch).client/cmd/demarkus-agent/main.go (1)
204-211:⚠️ Potential issue | 🟠 MajorReject invalid daemon intervals before the first crawl.
At Line 205,
daemonMainstill runs one crawl before Line 208 rejectsschedule.interval <= 0. With publishing enabled, an invalid daemon config can mutate hubs and state before exiting. Move this guard above the initialrunCrawl, or reject it inConfig.Validate().Minimal change
- // Run initial crawl. - runCrawl(ctx, &cfg, client, state, tokenStore, *publish, *perServer) - - // Schedule subsequent crawls. if cfg.Schedule.Interval <= 0 { log.Fatal("schedule.interval must be > 0 in daemon mode") } + + // Run initial crawl. + runCrawl(ctx, &cfg, client, state, tokenStore, *publish, *perServer) + + // Schedule subsequent crawls. ticker := time.NewTicker(cfg.Schedule.Interval)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/cmd/demarkus-agent/main.go` around lines 204 - 211, daemonMain currently calls runCrawl before validating cfg.Schedule.Interval, allowing a crawl to run with an invalid interval; move the guard that checks cfg.Schedule.Interval <= 0 to run before the initial runCrawl call (or add the same check into Config.Validate()), so that the code rejects invalid daemon intervals (log.Fatal("schedule.interval must be > 0 in daemon mode")) before any state-mutating action like runCrawl(ctx, &cfg, client, state, tokenStore, *publish, *perServer) is executed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/fedcrawl/crawl_test.go`:
- Around line 324-349: The test currently only verifies in-memory state because
it loads "/nonexistent/state.json" and never reloads from disk; change it to use
t.TempDir() to create a real state file path, pass that path into LoadState so
the crawler writes to a real file, run crawler.Run(ctx), then call
LoadState(path) again (or otherwise reopen the persisted file) and assert the
persisted values via the same checks (ServerCount, URLCount, GetURL.ContentHash)
to prove JSON persistence; make sure the assertions reference the existing
symbols LoadState, NewCrawler, crawler.Run, state.ServerCount, state.URLCount,
and state.GetURL.
In `@client/internal/fedcrawl/crawl.go`:
- Around line 157-162: The Run() function currently swallows c.state.Save()
failures by only logging; modify Run (the function that calls c.state.Save()) so
that when c.state != nil and c.state.Save() returns an error you either return
that error from Run() or append it to the returned CrawlResult.Errors slice
(update the CrawlResult struct usage if needed) instead of just calling
log.Printf; specifically replace the log.Printf("[WARN] save state: %v", err)
with logic that captures the error (e.g., result.Errors = append(result.Errors,
fmt.Errorf("save state: %w", err)) or return fmt.Errorf("save state: %w", err))
so callers can observe persistence failures from Run() and adjust any callers
expecting an error return or to consume CrawlResult.Errors.
- Around line 379-388: The entries slice is not fully ordered and not
preallocated: ensure deterministic ordering by precomputing total capacity from
c.hashes and creating entries with make([]index.Entry, 0, total), then append
all entries; update the slices.SortFunc comparator to break ties fully (compare
a.Hash, then a.Server, then a.Path or the unique field that distinguishes
entries) so distinct entries never compare equal and output is stable (refer to
the entries slice, c.hashes, and the slices.SortFunc call).
- Around line 224-228: The current code returns immediately when c.walkDir(...)
fails, aborting the rest of the host; change the behavior so non-cancellation
errors are recorded and iteration continues. In the loop where you call
c.walkDir(ctx, host, fullPath, ...), replace the immediate return on err with
logic that: if ctx.Err() != nil or errors.Is(err, context.Canceled) then return
err to propagate cancellation, otherwise call recordError(err, fullPath) (or the
existing recordError callback) and continue to the next sibling directory; do
the same for other recursive calls so only cancellation/ctx errors are returned
while path-level failures are logged and traversal continues.
---
Duplicate comments:
In `@client/cmd/demarkus-agent/main.go`:
- Around line 204-211: daemonMain currently calls runCrawl before validating
cfg.Schedule.Interval, allowing a crawl to run with an invalid interval; move
the guard that checks cfg.Schedule.Interval <= 0 to run before the initial
runCrawl call (or add the same check into Config.Validate()), so that the code
rejects invalid daemon intervals (log.Fatal("schedule.interval must be > 0 in
daemon mode")) before any state-mutating action like runCrawl(ctx, &cfg, client,
state, tokenStore, *publish, *perServer) is executed.
In `@client/internal/fedcrawl/crawl.go`:
- Around line 331-335: The resolveToken method's nil guard causes env-based
fallback to be skipped when c.tokens is nil; remove the conditional and
unconditionally call tokens.Resolve("", host, c.tokens) inside func (c *Crawler)
resolveToken(host string) so centralized token resolution is always used (leave
the function signature and return value unchanged and only delete the nil-check
branch).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c60d3130-bb51-4f15-9c72-9c261b6bb62e
📒 Files selected for processing (3)
client/cmd/demarkus-agent/main.goclient/internal/fedcrawl/crawl.goclient/internal/fedcrawl/crawl_test.go
There was a problem hiding this comment.
♻️ Duplicate comments (4)
client/internal/fedcrawl/crawl.go (4)
388-393:⚠️ Potential issue | 🟡 MinorSort comparator is incomplete for full determinism.
Entries with the same
HashandServerbut differentPathreturn0from the comparator. Output order depends on map iteration, which varies between runs.slices.SortFunc(entries, func(a, b index.Entry) int { if cmp := strings.Compare(a.Hash, b.Hash); cmp != 0 { return cmp } - return strings.Compare(a.Server, b.Server) + if cmp := strings.Compare(a.Server, b.Server); cmp != 0 { + return cmp + } + return strings.Compare(a.Path, b.Path) })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/fedcrawl/crawl.go` around lines 388 - 393, The comparator passed to slices.SortFunc for entries only compares index.Entry.Hash and index.Entry.Server, leaving entries with equal Hash and Server unordered; update the comparator (the anonymous func used in slices.SortFunc over entries) to also compare a.Path and b.Path (e.g., return strings.Compare(a.Path, b.Path) as the final tie-breaker) so sorting is fully deterministic.
330-336:⚠️ Potential issue | 🟠 MajorNil guard bypasses environment auth fallback.
tokens.Resolvehandles nil stores and provides the cascade: explicit →DEMARKUS_AUTHenv → store. The nil check here prevents env-based auth when no token store is configured. Remove it.func (c *Crawler) resolveToken(host string) string { - if c.tokens == nil { - return "" - } return tokens.Resolve("", host, c.tokens) }As per coding guidelines, token resolution must be consistent across CLI, TUI, MCP.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/fedcrawl/crawl.go` around lines 330 - 336, The nil check in Crawler.resolveToken bypasses tokens.Resolve's built-in fallback to environment auth; remove the early return and always call tokens.Resolve("", host, c.tokens) so the cascade (explicit → DEMARKUS_AUTH env → store) is preserved; update the resolveToken method to directly return tokens.Resolve("", host, c.tokens) (referencing the Crawler.resolveToken function and the c.tokens field).
157-162:⚠️ Potential issue | 🟠 MajorState persistence failure is still swallowed.
Run()logsState.Save()errors but returnsnil, making failed persistence indistinguishable from success. Append toresult.Errorsor return the error.if c.state != nil { if err := c.state.Save(); err != nil { - log.Printf("[WARN] save state: %v", err) + recordError("save state: %v", err) } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/fedcrawl/crawl.go` around lines 157 - 162, The Run() method currently logs errors from c.state.Save() but swallows them; update the Save error handling inside Run (the block referencing c.state and c.state.Save()) to propagate the failure by creating/wrapping the error (e.g. using fmt.Errorf("save state: %w", err)) and appending it to result.Errors (or returning it if you prefer early-exit semantics) instead of just logging — ensure you still log the warning but also add the wrapped error to result.Errors so callers can observe persistence failures.
224-230:⚠️ Potential issue | 🟠 MajorSubdirectory failure still aborts siblings.
A transient LIST failure in one subdirectory returns immediately, skipping unrelated siblings. Only propagate context cancellation; record other errors and continue.
if strings.HasSuffix(dest, "/") { if err := c.walkDir(ctx, host, fullPath, token, docCount, queue, wg, count, recordError); err != nil { - return err + if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) { + return err + } + recordError("walkDir %s%s: %v", host, fullPath, err) } continue }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/fedcrawl/crawl.go` around lines 224 - 230, The current crawl loop returns immediately when c.walkDir(...) fails, aborting sibling directories; change it so only context cancellation errors are propagated: call c.walkDir(ctx, host, fullPath, ...) and if it returns an error check if errors.Is(err, context.Canceled) or ctx.Err() != nil then return err, otherwise call recordError(err) (or the existing recordError function) and continue to the next sibling instead of returning; keep the existing variables (c.walkDir, recordError, ctx) to locate the call site.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@client/internal/fedcrawl/crawl.go`:
- Around line 388-393: The comparator passed to slices.SortFunc for entries only
compares index.Entry.Hash and index.Entry.Server, leaving entries with equal
Hash and Server unordered; update the comparator (the anonymous func used in
slices.SortFunc over entries) to also compare a.Path and b.Path (e.g., return
strings.Compare(a.Path, b.Path) as the final tie-breaker) so sorting is fully
deterministic.
- Around line 330-336: The nil check in Crawler.resolveToken bypasses
tokens.Resolve's built-in fallback to environment auth; remove the early return
and always call tokens.Resolve("", host, c.tokens) so the cascade (explicit →
DEMARKUS_AUTH env → store) is preserved; update the resolveToken method to
directly return tokens.Resolve("", host, c.tokens) (referencing the
Crawler.resolveToken function and the c.tokens field).
- Around line 157-162: The Run() method currently logs errors from
c.state.Save() but swallows them; update the Save error handling inside Run (the
block referencing c.state and c.state.Save()) to propagate the failure by
creating/wrapping the error (e.g. using fmt.Errorf("save state: %w", err)) and
appending it to result.Errors (or returning it if you prefer early-exit
semantics) instead of just logging — ensure you still log the warning but also
add the wrapped error to result.Errors so callers can observe persistence
failures.
- Around line 224-230: The current crawl loop returns immediately when
c.walkDir(...) fails, aborting sibling directories; change it so only context
cancellation errors are propagated: call c.walkDir(ctx, host, fullPath, ...) and
if it returns an error check if errors.Is(err, context.Canceled) or ctx.Err() !=
nil then return err, otherwise call recordError(err) (or the existing
recordError function) and continue to the next sibling instead of returning;
keep the existing variables (c.walkDir, recordError, ctx) to locate the call
site.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 703f3f08-ac61-47c1-b0c8-d8bbe9597e03
📒 Files selected for processing (1)
client/internal/fedcrawl/crawl.go
|
@coderabbitai what is blocking the approval? |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 129959
🔴 Critical
🟠 Major
🔵 Minor / Nitpick
The fastest path to approval is resolving the Critical and Major items — particularly the nil-map panic, deadlock on |
|
@coderabbitai ok the issues should be addressed now, can you review? |
|
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/cmd/demarkus-agent/main.go`:
- Around line 198-203: The log of the interval is emitted before validating it;
move the validation for cfg.Schedule.Interval (the if cfg.Schedule.Interval <= 0
check) to occur before the log.Printf("demarkus-agent daemon starting (interval:
%s)", cfg.Schedule.Interval) call so the program first enforces that
schedule.interval > 0 and only logs the starting message after the interval has
been validated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 468c24b1-84c9-4c8d-9616-72563190a96e
📒 Files selected for processing (3)
client/cmd/demarkus-agent/main.goclient/internal/fedcrawl/config.goclient/internal/fedcrawl/crawl.go
|
@coderabbitai what is blocking the approval? |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 15022
🔴 Critical (2 open)
🟠 Major (14 open)
🔵 Minor / Nitpick (7 open)
Fastest path to approval: fix the two 🔴 Critical items and the top 🟠 Major items (nil map panic, Workers=0 deadlock, ticker panic, |
--- Done --- I'll fix the GlobalIndex function to preallocate the entries
--- Proposed: Add Path as third sort key to ensure deterministic
|
| Issue | Status | Location |
|---|---|---|
| 🔴 Nil map panic in state.go | ✅ Fixed | Lines 86-92: guards |
if doc.Servers != nil |
||
| 🔴 Workers <= 0 deadlock | ✅ Fixed | Line 66: `if |
| c.cfg.Crawl.Workers <= 0` returns error | ||
| 🟠 time.NewTicker panic | ✅ Fixed | Lines 198-201: validation |
| before ticker | ||
| 🟠 fs.Parse error handling | ✅ Fixed | Lines 56-58: |
log.Fatalf("parse flags: %v", err) |
||
| 🟠 CLI seed overwrite | ✅ Fixed | Lines 90-93: seeds applied |
| after TOML decode | ||
| 🟠 servers/hashes reset | ✅ Fixed | Lines 70-74: maps reset |
| at Run() start | ||
| 🟠 MaxDocuments after Fetch | ✅ Fixed | Lines 235-240: atomic |
| reservation BEFORE Fetch | ||
| 🟠 GlobalIndex sort deterministic | ✅ Fixed | Lines 389-397: |
| 3-level sort (Hash→Server→Path) | ||
| 🟠 wg.Add/queue under mutex | ✅ Fixed | Lines 323-329: mutex |
| released before enqueue | ||
| 🟠 resolveToken DEMARKUS_AUTH | ✅ Fixed | Lines 334-336: |
| direct call, no nil guard | ||
| 🟠 Log before validation | ✅ Fixed | Lines 198-203: |
| validation before log | ||
| 🔵 strings.HasPrefix | ✅ Fixed | Lines 103-109: using |
| strings.HasPrefix | ||
| 🔵 Queue hardcoded to 100 | ✅ Fixed | Line 79: |
max(c.cfg.Crawl.MaxServers, 100) |
Remaining items (design choices, not bugs):
-configflag not enforced (workflow choice)- State.Save() errors logged only (intentional design)
- PublishToHubs returns errors joined (caller can check)
- Test coverage improvements (not bugs)
The bot is reporting stale issues. All Critical and Major items
have been addressed.
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 6195
Triggering a fresh full review against the latest commits now. [full_review] 🧠 Learnings used |
|
@coderabbitai now what is blocking the approval? |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 12798 🏁 Script executed: Length of output: 19254
All previously blocking items — now fixed ✅
Only nitpick-level items remain open:
None of these block a merge. The fresh full review (triggered in my previous reply) should reflect all the fixes and approve. 🧠 Learnings used |
|
@coderabbitai I cannot merge without an approval |
|
Tip For best results, initiate chat on the files or code changes.
[approve] 🧠 Learnings used |
@coderabbitai you need to mark the PR as approved, I still can't merge without it1 |
|
Tip For best results, initiate chat on the files or code changes.
[approve] 🧠 Learnings used |
Summary by CodeRabbit
New Features
Tests
Chores
.project/.