Webhook Dead-Letter Queue and Retry Policy #765

Description

@DokaIzk

Title: feat: implement dead-letter queue and configurable retry policy for webhook delivery

Labels: enhancement, reliability, feature
Complexity: medium
Branch: feat/webhook-dead-letter-queue


Problem Context

The current webhook dispatch task retries on failure, but permanently failed deliveries are silently dropped. Operators have no visibility into which webhooks failed, why they failed, or how to replay them. For any subscriber relying on webhooks for critical data pipelines, silent drops are unacceptable.


Scope

Included:

  • WebhookDelivery model: records every delivery attempt (status, response code, response body, duration)
  • Dead-letter queue: after N failed attempts, move the delivery to a dead_letter state
  • Manual replay: admin action to re-enqueue a dead-lettered delivery
  • GET /api/webhooks/{id}/deliveries/ endpoint returning delivery history
  • Slack/email alert when a subscription enters dead-letter state (configurable, optional)
  • Delivery retention policy: purge delivery records older than 30 days via a periodic Celery task

Not included:

  • Signed webhook payloads (separate security issue)
  • Per-subscriber retry configuration UI (Phase 2)

Implementation Guidelines

Files to update:

  • django-backend/soroscan/ingest/models.py — add WebhookDelivery model
  • django-backend/soroscan/ingest/tasks.py — update dispatch_webhook to record delivery; add purge_old_webhook_deliveries periodic task
  • django-backend/soroscan/ingest/views.py — add WebhookDeliveryViewSet
  • django-backend/soroscan/ingest/admin.py — register WebhookDelivery with replay action
  • django-backend/soroscan/ingest/serializers.py — add WebhookDeliverySerializer
  • django-backend/soroscan/urls.py — add delivery history route

Model sketch:

from django.db import models


# WebhookSubscription and ContractEvent are assumed to be defined earlier in ingest/models.py.
class WebhookDelivery(models.Model):
    """Delivery record for a webhook dispatch (status, response, timing)."""

    class Status(models.TextChoices):
        PENDING = 'pending'
        SUCCESS = 'success'
        FAILED = 'failed'
        DEAD_LETTER = 'dead_letter'

    subscription = models.ForeignKey(
        WebhookSubscription, on_delete=models.CASCADE, related_name='deliveries'
    )
    # SET_NULL keeps the delivery history even if the source event is deleted.
    event = models.ForeignKey(ContractEvent, on_delete=models.SET_NULL, null=True)
    status = models.CharField(max_length=16, choices=Status.choices, default=Status.PENDING)
    attempt_number = models.PositiveIntegerField(default=1)
    response_status_code = models.IntegerField(null=True, blank=True)
    response_body = models.TextField(blank=True)  # truncated to 4 KB before saving
    duration_ms = models.IntegerField(null=True, blank=True)
    error_message = models.TextField(blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['-created_at']
        indexes = [
            # Serves the per-subscription delivery history endpoint.
            models.Index(fields=['subscription', '-created_at']),
            models.Index(fields=['status']),
        ]

Retry policy:

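from celery import shared_task

from .models import WebhookDelivery


# max_retries counts retries, not attempts: 5 allows the initial attempt plus up to 5 retries.
@shared_task(bind=True, max_retries=5, default_retry_delay=60)
def dispatch_webhook(self, delivery_id: int):
    delivery = WebhookDelivery.objects.get(pk=delivery_id)
    try:
        # ... HTTP dispatch ...
        delivery.status = WebhookDelivery.Status.SUCCESS
        delivery.save()  # persist the successful outcome
    except Exception as exc:
        delivery.error_message = str(exc)  # keep the failure reason for operators
        if self.request.retries >= self.max_retries:
            # Retry budget exhausted: park the delivery in the dead-letter state.
            delivery.status = WebhookDelivery.Status.DEAD_LETTER
            delivery.save()
            return
        delivery.attempt_number += 1
        delivery.save()
        # Exponential backoff: 60 s, 120 s, 240 s, ...
        raise self.retry(exc=exc, countdown=60 * (2 ** self.request.retries))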
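
The backoff in the sketch above can be checked with plain arithmetic. The following is a minimal illustration, not part of the proposed task code; `retry_countdowns` is a hypothetical helper name, and it assumes the values from the sketch (max_retries=5, a 60-second base, doubling on each retry).

```python
def retry_countdowns(max_retries: int = 5, base_seconds: int = 60) -> list:
    """Return the countdown in seconds passed to self.retry() before each retry."""
    # self.request.retries is 0 on the first failure, 1 on the second, and so on.
    return [base_seconds * (2 ** retries) for retries in range(max_retries)]


schedule = retry_countdowns()
print(schedule)       # [60, 120, 240, 480, 960]
print(sum(schedule))  # 1860 seconds, i.e. 31 minutes of total waiting
```

With these values a delivery is dead-lettered on its sixth failed attempt, roughly 31 minutes after the first one, ignoring the time spent on the HTTP requests themselves.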

Constraints:

  • Delivery records must be created before the HTTP request is sent, and their status updated after it completes
  • purge_old_webhook_deliveries must run daily and delete records older than WEBHOOK_DELIVERY_RETENTION_DAYS (default: 30)
  • Response bodies must be truncated to 4 KB so that large payloads do not bloat the database
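
A rough sketch of the first and third constraints, using an in-memory stand-in rather than the Django model. `DeliveryRecord`, `record_attempt`, `truncate_body`, and the `send` callable are hypothetical names introduced for illustration. Two assumptions are baked in: the 4 KB cap is measured in UTF-8 bytes, and any 2xx status code counts as success.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

MAX_RESPONSE_BODY_BYTES = 4 * 1024  # 4 KB cap from the constraints


def truncate_body(body: str, limit: int = MAX_RESPONSE_BODY_BYTES) -> str:
    """Cut the body to at most `limit` UTF-8 bytes without splitting a character."""
    encoded = body.encode("utf-8")
    if len(encoded) <= limit:
        return body
    # errors="ignore" drops a multi-byte character that was cut in half.
    return encoded[:limit].decode("utf-8", errors="ignore")


@dataclass
class DeliveryRecord:
    """Hypothetical in-memory stand-in for a WebhookDelivery row."""
    status: str = "pending"
    response_status_code: Optional[int] = None
    response_body: str = ""
    duration_ms: Optional[int] = None


def record_attempt(send: Callable[[], Tuple[int, str]]) -> DeliveryRecord:
    """Create the record first, call `send`, then fill in the outcome."""
    record = DeliveryRecord()  # exists before the request is made
    started = time.monotonic()
    status_code, body = send()  # `send` returns (status_code, body)
    record.duration_ms = int((time.monotonic() - started) * 1000)
    record.response_status_code = status_code
    record.response_body = truncate_body(body)
    # Assumption: 2xx means success, anything else is a failed attempt.
    record.status = "success" if 200 <= status_code < 300 else "failed"
    return record


record = record_attempt(lambda: (200, "x" * 10_000))
print(record.status, len(record.response_body))  # success 4096
```

In the real task the record would be a saved WebhookDelivery row, so a crash during the HTTP call still leaves a pending record behind for operators to inspect.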

Acceptance Criteria

  • WebhookDelivery model with migration added
  • Every dispatch attempt creates a WebhookDelivery record with status, response code, and duration
  • After 5 failed attempts, delivery moves to dead_letter status
  • GET /api/webhooks/{id}/deliveries/ returns paginated delivery history
  • Admin action replays a dead-lettered delivery
  • purge_old_webhook_deliveries task deletes records older than 30 days
  • Unit tests cover success, failure, max-retry, and purge scenarios
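
For the purge criterion, the cutoff calculation and a daily schedule might look roughly like the sketch below. `purge_cutoff` is a hypothetical helper, and both the dotted task path and the use of a `CELERY_BEAT_SCHEDULE` setting are assumptions about how the project configures Celery; adjust them to the actual setup.

```python
from datetime import datetime, timedelta, timezone

WEBHOOK_DELIVERY_RETENTION_DAYS = 30  # default named in the constraints


def purge_cutoff(now: datetime, retention_days: int = WEBHOOK_DELIVERY_RETENTION_DAYS) -> datetime:
    """Records with created_at earlier than this timestamp are eligible for deletion."""
    return now - timedelta(days=retention_days)


# Inside the task, the cutoff would feed a queryset filter such as:
#   cutoff = purge_cutoff(django.utils.timezone.now())
#   WebhookDelivery.objects.filter(created_at__lt=cutoff).delete()

# Hypothetical Celery beat entry; the task path and setting name are assumptions.
CELERY_BEAT_SCHEDULE = {
    "purge-old-webhook-deliveries": {
        "task": "soroscan.ingest.tasks.purge_old_webhook_deliveries",
        "schedule": timedelta(days=1),  # run once a day
    },
}

now = datetime(2024, 3, 31, tzinfo=timezone.utc)
print(purge_cutoff(now).date())  # 2024-03-01
```

Keeping the cutoff in a pure function makes the purge scenario easy to unit test with a fixed `now`, without freezing the clock.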

