feat(backend): scheduled reconciliation, manifest persistence, resumable backfill, and SLO metrics #867
Merged
Summary
This PR introduces four backend observability and durability improvements: proactive ledger drift detection, durable export manifests, resumable transaction backfill jobs, and Prometheus-based SLO breach metrics.
Task 1 — Scheduled Reconciliation Drift Detection
Refactored reconciliation logic for reuse and integrated automated drift detection via scheduled jobs.
Changes
Refactor
reconciliationReport.ts now exports:
runReconciliationReport()
reconcile()
fetchers and types
Scheduler Integration
Added:
runLedgerReconciliationJob()
startLedgerReconciliationScheduler()
Registered under the existing report generation policy with retry/backoff via jobGovernance.ts
Drift Handling (status === 'DRIFT_DETECTED')
Emits structured logs
Increments:
reconciliation_drift_total{issue}
Updates gauges:
reconciliation_status
reconciliation_last_run_timestamp
Sends webhook alerts when drift exceeds threshold:
Controlled by:
RECONCILIATION_DRIFT_ALERT_THRESHOLD
RECONCILIATION_ALERT_COOLDOWN_MS
Persistence
Stores snapshots in ReconciliationSnapshot (Prisma)
Keeps the latest automated summary in memory for diagnostics
New Endpoint
GET /admin/reconciliation/latest
Returns the last automated summary without re-querying Horizon
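The threshold and cooldown behaviour described under Drift Handling can be sketched roughly as below. This is an illustration only, not the code in this PR: `ReconciliationSummary`, `DriftAlertConfig`, `DriftAlertGate`, the `driftCount` field, and the `"OK"` status are hypothetical names, and only `DRIFT_DETECTED` and the two environment variables come from the description above.

```typescript
// Hypothetical shapes: the PR text does not show the real types exported by
// reconciliationReport.ts, so these are assumptions made for the sketch.
interface ReconciliationSummary {
  status: "OK" | "DRIFT_DETECTED";
  driftCount: number; // assumed measure of drift compared against the threshold
  ranAt: number; // epoch milliseconds of the reconciliation run
}

interface DriftAlertConfig {
  driftAlertThreshold: number; // e.g. read from RECONCILIATION_DRIFT_ALERT_THRESHOLD
  alertCooldownMs: number; // e.g. read from RECONCILIATION_ALERT_COOLDOWN_MS
}

class DriftAlertGate {
  private readonly config: DriftAlertConfig;
  private lastAlertAt: number | null = null;

  constructor(config: DriftAlertConfig) {
    this.config = config;
  }

  /** Returns true when a webhook alert should be dispatched for this run. */
  shouldAlert(summary: ReconciliationSummary): boolean {
    // Only drift runs can alert.
    if (summary.status !== "DRIFT_DETECTED") return false;
    // Drift must exceed (not merely reach) the configured threshold.
    if (summary.driftCount <= this.config.driftAlertThreshold) return false;
    // Suppress repeat alerts until the cooldown has elapsed.
    const inCooldown =
      this.lastAlertAt !== null &&
      summary.ranAt - this.lastAlertAt < this.config.alertCooldownMs;
    if (inCooldown) return false;
    this.lastAlertAt = summary.ranAt;
    return true;
  }
}
```

Keeping the gate separate from the webhook call makes the cooldown logic testable without network access; whether the real job is structured this way is not stated in the PR.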
Diagnostics
Bundle now includes lastReconciliation
Environment Variables
LEDGER_RECONCILIATION_ENABLED
LEDGER_RECONCILIATION_INTERVAL_MS
RECONCILIATION_WINDOW_HOURS
RECONCILIATION_DRIFT_ALERT_THRESHOLD
RECONCILIATION_ALERT_COOLDOWN_MS
RECONCILIATION_ALERT_WEBHOOK_URL
Task 2 — Export Manifest Persistence & Verification
Introduced durable export manifests with verification and retention controls.
Changes
Added
ExportManifest Prisma model + migration
exportManifest.ts:
Persists manifests to Prisma
Supports memory fallback: EXPORT_MANIFEST_STORAGE=memory
Adds:
Paginated listing
Checksum verification
Retention pruning (EXPORT_MANIFEST_RETENTION, default: 500)
Integrated manifest creation into bulkExportJobs.ts (on job completion)
New Endpoint
POST /admin/reports/exports/manifests/:id/verify
Returns match/mismatch without exposing raw data
Updated list endpoint:
Offset pagination
Total count support
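As a rough illustration of checksum verification and retention pruning for manifests, here is a minimal sketch. It makes several assumptions not confirmed by this PR: SHA-256 as the checksum algorithm, `EXPORT_MANIFEST_RETENTION` read as a maximum number of manifests kept, and the hypothetical names `ExportManifestRecord`, `createManifest`, `verifyManifest`, and `pruneManifests`. The real exportManifest.ts may differ on all of these.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest shape; the actual ExportManifest Prisma model is not shown in the PR.
interface ExportManifestRecord {
  id: string;
  checksum: string; // hex digest of the export payload (SHA-256 is an assumption)
  createdAt: number; // epoch milliseconds
}

function sha256Hex(payload: string): string {
  return createHash("sha256").update(payload, "utf8").digest("hex");
}

function createManifest(id: string, payload: string, createdAt: number): ExportManifestRecord {
  return { id, checksum: sha256Hex(payload), createdAt };
}

// Reports only match/mismatch, so the caller never sees the raw export data.
function verifyManifest(
  manifest: ExportManifestRecord,
  payload: string
): { id: string; result: "match" | "mismatch" } {
  const matches = sha256Hex(payload) === manifest.checksum;
  return { id: manifest.id, result: matches ? "match" : "mismatch" };
}

// Keeps the newest `retention` manifests (assumed meaning of the retention setting).
function pruneManifests(manifests: ExportManifestRecord[], retention: number): ExportManifestRecord[] {
  return [...manifests].sort((a, b) => b.createdAt - a.createdAt).slice(0, retention);
}
```

Recomputing the digest server-side and returning only the comparison result is one way to satisfy "match/mismatch without exposing raw data".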
Task 3 — Resumable Transaction Backfill Jobs
Enabled durable, restart-safe backfill processing.
Changes
Added
TransactionBackfillJob Prisma model + migration
transactionBackfill.ts:
Persists:
Job metadata
Checkpoints (lastProcessedLedger)
Jobs hydrate from DB on startup
Running jobs resume after restart
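The checkpoint-and-resume idea above could look something like the following sketch. An in-memory map stands in for the Prisma-backed store, and `BackfillJobState`, `InMemoryJobStore`, `runBackfill`, and the `maxLedgers` parameter (used here to simulate an interruption) are hypothetical names; only `lastProcessedLedger` is taken from the description.

```typescript
// Hypothetical job shape; the real TransactionBackfillJob model is not shown in the PR.
interface BackfillJobState {
  id: string;
  fromLedger: number;
  toLedger: number; // inclusive upper bound
  lastProcessedLedger: number | null; // checkpoint, null until the first ledger is done
  status: "RUNNING" | "COMPLETED" | "FAILED";
  dryRun: boolean;
}

// Stand-in for the durable store (assumption: the PR persists via Prisma instead).
class InMemoryJobStore {
  private jobs = new Map<string, BackfillJobState>();

  save(job: BackfillJobState): void {
    this.jobs.set(job.id, { ...job });
  }

  load(id: string): BackfillJobState | undefined {
    const job = this.jobs.get(id);
    return job ? { ...job } : undefined;
  }
}

// Processes ledgers in order, checkpointing after each one so that a restart
// continues at lastProcessedLedger + 1 instead of starting over.
function runBackfill(
  job: BackfillJobState,
  store: InMemoryJobStore,
  processLedger: (ledger: number, dryRun: boolean) => void,
  maxLedgers: number = Number.POSITIVE_INFINITY
): BackfillJobState {
  const current: BackfillJobState = { ...job };
  let next = current.lastProcessedLedger === null ? current.fromLedger : current.lastProcessedLedger + 1;
  let processed = 0;
  while (next <= current.toLedger && processed < maxLedgers) {
    processLedger(next, current.dryRun);
    current.lastProcessedLedger = next;
    store.save(current); // persist the checkpoint
    next += 1;
    processed += 1;
  }
  if (next > current.toLedger) {
    current.status = "COMPLETED";
    store.save(current);
  }
  return current;
}
```

Checkpointing after every ledger is the simplest choice for a sketch; a real job would likely batch checkpoints to reduce write load, at the cost of reprocessing a few ledgers after a crash.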
Behavior
Dry-run jobs:
Persisted
Do NOT mutate ProcessedEvent rows
Completed/failed jobs:
Prunable via:
pruneOldBackfillJobs()
BACKFILL_JOB_RETENTION_DAYS (default: 30)
Existing endpoints:
POST /admin/transactions/backfill
GET /admin/transactions/backfill
Now return durable job state
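For the retention rule on completed and failed jobs, a possible selection step is sketched below. `FinishedJobRef`, `selectPrunableJobs`, and the `finishedAt` field are hypothetical, and this is not the body of `pruneOldBackfillJobs()`; it only shows one way an age cutoff derived from `BACKFILL_JOB_RETENTION_DAYS` could be applied while leaving running jobs alone.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical minimal view of a job; field names are assumptions.
interface FinishedJobRef {
  id: string;
  status: "RUNNING" | "COMPLETED" | "FAILED";
  finishedAt: number | null; // epoch milliseconds, null while still running
}

// Returns the ids of completed/failed jobs that finished before the retention cutoff.
function selectPrunableJobs(
  jobs: FinishedJobRef[],
  retentionDays: number = 30,
  now: number = Date.now()
): string[] {
  const cutoff = now - retentionDays * MS_PER_DAY;
  return jobs
    .filter((job) => job.status !== "RUNNING" && job.finishedAt !== null && job.finishedAt < cutoff)
    .map((job) => job.id);
}
```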
Task 4 — Endpoint SLA Breach Prometheus Metrics
Added SLO monitoring metrics aligned with ENDPOINT_SLA_REGISTRY.
Metrics
backend_slo_breach_total{path,tier,type}: counter incremented on alert dispatch (cooldown-aware)
backend_slo_p95_latency_ms{path,tier,type}: rolling P95 latency
backend_slo_budget_ms{path,tier,type}: configured latency budget
backend_slo_breach{path,tier,type}: breach status gauge (0/1)
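To make the relationship between the four metrics concrete, here is a minimal sketch of how the values for one endpoint could be derived. It assumes a fixed-size sample window and a nearest-rank P95, neither of which is specified in this PR, and `SloTracker`, `SloSnapshot`, and `p95` are hypothetical names rather than part of latencyMonitoringService. No Prometheus client is used; the snapshot fields simply mirror the metric names.

```typescript
interface SloSnapshot {
  p95LatencyMs: number; // value for backend_slo_p95_latency_ms
  budgetMs: number; // value for backend_slo_budget_ms
  breach: 0 | 1; // value for backend_slo_breach
  breachTotal: number; // value for backend_slo_breach_total
}

// Nearest-rank 95th percentile; returns 0 for an empty window.
function p95(samples: number[]): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1];
}

class SloTracker {
  private readonly budgetMs: number;
  private readonly windowSize: number;
  private readonly cooldownMs: number;
  private samples: number[] = [];
  private breachTotal = 0;
  private lastCountedAt: number | null = null;

  constructor(budgetMs: number, windowSize: number, cooldownMs: number) {
    this.budgetMs = budgetMs;
    this.windowSize = windowSize;
    this.cooldownMs = cooldownMs;
  }

  record(latencyMs: number, now: number): void {
    this.samples.push(latencyMs);
    if (this.samples.length > this.windowSize) this.samples.shift(); // keep a rolling window
    const cooledDown = this.lastCountedAt === null || now - this.lastCountedAt >= this.cooldownMs;
    // The counter moves at most once per cooldown window, even if the breach persists.
    if (this.isBreaching() && cooledDown) {
      this.breachTotal += 1;
      this.lastCountedAt = now;
    }
  }

  snapshot(): SloSnapshot {
    return {
      p95LatencyMs: p95(this.samples),
      budgetMs: this.budgetMs,
      breach: this.isBreaching() ? 1 : 0,
      breachTotal: this.breachTotal,
    };
  }

  private isBreaching(): boolean {
    return p95(this.samples) > this.budgetMs;
  }
}
```

The gauge reflects the current state on every read, while the counter only records distinct alert-worthy events, which is why the two can disagree during a sustained breach.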
Integration
latencyMonitoringService.syncSloMetrics() runs on each /metrics scrape
Critical endpoints:
/health/ready
Tagged with tier="critical"
Documentation updated:
docs/MONITORING_OBSERVABILITY.md (§1.6, §1.7)
Schema Migration
20260627120000_add_manifest_reconciliation_backfill
Adds:
ExportManifest
ReconciliationSnapshot
TransactionBackfillJob
Tests Added
File | Coverage
-- | --
ledgerReconciliationJob.test.ts | Clean vs drift scenarios, metrics validation
exportManifest.test.ts | Creation, verification, pagination, retention
transactionBackfill.persistence.test.ts | Start, dry-run, restart, failure handling
sloMetrics.test.ts | Breach gauge + cooldown-aware counter
Test Plan
Run migrations:
Verify metrics:
GET /metrics exposes:
reconciliation_*
backend_slo_*
Reconciliation:
Set LEDGER_RECONCILIATION_ENABLED=true
Confirm:
GET /admin/reconciliation/latest returns summary after scheduler run
Export manifests:
Create export:
POST /admin/reports/exports
Restart service
Verify:
Manifest persists
Checksum via:
POST /admin/reports/exports/manifests/:id/verify
Backfill jobs:
Start backfill
Restart service
Confirm resume from lastProcessedLedger
SLO metrics:
Trigger latency on /health
Confirm:
backend_slo_breach == 1
Counter increments once per cooldown window
Overall Impact
Adds proactive ledger drift detection and alerting
Introduces durable export verification layer
Enables fault-tolerant, resumable backfill processing
Provides production-grade SLO observability via Prometheus
Strengthens reliability, monitoring, and operational visibility
closes #861
closes #863
closes #864
closes #865