Skip to content

[professional-crm] Proposal: tag_dictionary + validation trigger to prevent tag drift #332

@UpLeafed

Description

@UpLeafed

Problem

extensions/professional-crm stores contact tags as a free-form text[]. There's no canonical vocabulary, alias mapping, or write-time validation. This guarantees tag drift over time. In a real CRM I built on top of this extension, after ~700 contacts the data already showed:

medtronic (33), post-medtronic-network (4), ex-mdt (1), pan-mdt (1),
medtronic-employment-counsel (1), post-medtronic (1), ex-medtronic (1)

seven distinct ways to spell "Medtronic", all accidental. Other examples: kids-school + school-staff + school-admin (three variants of "involved with a kid's school"); sponsor-active + sponsor-grade + advice-grade-not-sponsor-grade (overlapping grade tags).

This is not a one-time data accident; it's an inevitable consequence of tools that accept arbitrary strings. The Gemma-/Claude-/agent-driven workflows OB1 enables will accelerate it.

Proposed pattern

A small, opt-in addition that any user can adopt without breaking the standard schema:

  1. New tag_dictionary table — canonical names, aliases, definitions, parent (for hierarchy), dimension (affiliation / role / cohort / etc.), is_classifier_safe flag (so legacy or strategic tags can be preserved but not classifier-applied).
  2. validate_contact_tags() triggerBEFORE INSERT OR UPDATE OF tags on professional_contacts. Auto-resolves known aliases to canonical names. Raises a clear error on unknown tags.
  3. resolveTags() helper in index.ts — pre-validates before the DB write, returns Levenshtein-based "did you mean" suggestions in the error message.
  4. Optional consolidation migration — one-time pass to canonicalize existing contact tags using the dictionary.

The standard professional_contacts.tags column type stays text[]. The trigger gracefully no-ops if tag_dictionary is empty (or absent, with a guard). So existing OB1 deployments aren't broken.

Behavior changes (when adopted)

Successful writecrm_add_contact({ tags: ['medtronic', 'ent', 'sponsor-active'] }) resolves to stored ['mdt', 'mdt-ent', 'sponsor'] (via alias map).

Rejected writecrm_add_contact({ tags: ['mdt', 'bogus'] }) returns:

Tag validation failed. Add new tags to tag_dictionary first, or use a canonical name:
  'bogus' is not a known tag — did you mean: bus, busy?

Tested end-to-end against a deployed Supabase Edge Function (post-#330 patch). Defense in depth: trigger catches all writers (MCP tool, direct SQL, dashboard, future tools) while the tool-layer helper provides clearer errors.

Open questions for the maintainer

  1. Is this within scope for OB1, or better as a separate "ob-extension-tag-hygiene" companion?
  2. If in scope: should the dictionary itself be part of the canonical schema.sql (with no seed rows) or shipped as an optional tag-dictionary.sql add-on?
  3. The trigger as drafted is strict (rejects unknown tags). Alternative is soft (warns + writes anyway). Strong opinions either way?
  4. The Gemma/agent workflow benefits significantly from the dictionary because it can be read at classification time and constrains the model's output vocabulary. Would docs/examples for that flow be valuable upstream?

Happy to draft a PR if there's interest. Local reference implementation:

  • tag_dictionary schema: composite of name, aliases[], definition, parent, dimension, is_classifier_safe, notes
  • Trigger: ~30 lines plpgsql
  • Helper: ~50 lines TS

Related: PR #330 (deploy-blockers) is the prerequisite — without the StreamableHTTPTransport fix the extension doesn't run as published.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions