Thanks for sharing the full blog post from Wilfred Hughes about building Difftastic. It's a phenomenally rich source of both insight and hard-won lessons that are directly applicable to git-ast and your broader mission of structure-first version control.
Let me synthesize what this post reveals through the lens of your project — treating code as structured data, stored in Git, and later enabling querying, transformation, and compilation. I'll highlight what Difftastic teaches us technically and strategically, and how we can use that knowledge to de-risk and sharpen your roadmap.
⸻
✅ Why Difftastic Matters to Your Project
1. Proof of Possibility: Difftastic shows that structural diffs can be done on real-world code at scale across multiple languages. That alone is validating. It was built with Tree-sitter, supports 40+ grammars, and works well enough for daily use.
2. Parse Tree → S-expression Conversion: Difftastic's trick of converting Tree-sitter CSTs into uniform s-expressions is a practical approach to normalize parse trees across languages. This reinforces that git-ast can do the same to support unified storage and diffs.
3. Diffing as Matching: The idea that diffing is "figuring out what hasn't changed" (vs what did) is profound. This mindset shift drives the use of graph search algorithms to optimize for readable, minimal diffs — a key concern for git-ast's semantic diffs and merges.
4. AST Diffs ≠ Minimal Textual Diffs: Just because something is minimal structurally doesn't mean it's intuitively helpful to humans. Difftastic tweaked cost heuristics and even did post-processing for aesthetic reasons. git-ast needs similar UX-level tuning.
⸻
- Performance Will Be a Bottleneck
  "The biggest performance bottleneck is vertex construction… the graph is O(L × R), where L and R are number of nodes in the left/right trees."
  - 🧠 For git-ast: Your system will do this all the time as part of `git diff` or `merge`. You must optimize this early. Consider:
    - Lazy graph construction (as Wilfred does)
    - Aggressive horizon trimming (ignore unchanged sections at the edges)
    - Possibly switching to approximate match algorithms for big trees.
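To make the horizon-trimming idea concrete, here is a minimal Python sketch. It assumes each side has already been reduced to a flat list of comparable top-level items (plain strings here), and the name `trim_unchanged_edges` is hypothetical. It illustrates the general idea only and is not Difftastic's implementation.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def trim_unchanged_edges(left: Sequence[T], right: Sequence[T]) -> Tuple[int, int]:
    """Return (prefix_len, suffix_len): how many items match at each end."""
    limit = min(len(left), len(right))

    # Walk forward while both sides agree.
    prefix = 0
    while prefix < limit and left[prefix] == right[prefix]:
        prefix += 1

    # Walk backward, never overlapping the prefix already claimed.
    suffix = 0
    while suffix < limit - prefix and left[-1 - suffix] == right[-1 - suffix]:
        suffix += 1

    return prefix, suffix


# Hypothetical top-level items from two versions of a file.
left = ["import os", "def a(): ...", "def b(): return 1", "def c(): ..."]
right = ["import os", "def a(): ...", "def b(): return 2", "def c(): ..."]

prefix, suffix = trim_unchanged_edges(left, right)
changed_left = left[prefix:len(left) - suffix]
changed_right = right[prefix:len(right) - suffix]
```

Only `changed_left` and `changed_right` would then be handed to the expensive graph search, which shrinks both L and R before any vertices are built.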
- Nesting & Delimiters Are Hard
  "Graph vertices are really (L_pos, R_pos, parents_to_exit_together)… exponentially increases the graph size."
  - 🧠 For git-ast: Tree nesting is a worst-case scenario in diffs or merges. You will need to restrict depth, memoize subtree hashes, or use simplifications for large diffs. Consider:
    - Hashing top-level nodes and only diffing the nodes whose hashes changed.
    - Splitting files into semantic units (functions, classes) and diffing those separately.
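The hashing suggestion can be sketched as follows. This is a minimal illustration under stated assumptions: `Node` is a hypothetical stand-in for whatever tree type git-ast ends up using, and `subtree_hash` and `changed_top_level` are names invented for this example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical tree node: a kind, optional leaf text, and children."""
    kind: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def subtree_hash(node: Node, cache: Dict[int, str]) -> str:
    """Hash a node together with its whole subtree, memoized by object id."""
    key = id(node)
    if key not in cache:
        h = hashlib.sha256()
        h.update(node.kind.encode("utf-8") + b"\x00" + node.text.encode("utf-8"))
        for child in node.children:
            h.update(b"\x01" + subtree_hash(child, cache).encode("ascii"))
        cache[key] = h.hexdigest()
    return cache[key]


def changed_top_level(left: Node, right: Node) -> List[int]:
    """Indices of top-level children of `right` with no identical subtree in `left`."""
    cache: Dict[int, str] = {}
    left_hashes = {subtree_hash(c, cache) for c in left.children}
    return [
        i for i, c in enumerate(right.children)
        if subtree_hash(c, cache) not in left_hashes
    ]


old = Node("module", children=[
    Node("function", "foo", [Node("body", "return 1")]),
    Node("function", "bar", [Node("body", "return 2")]),
])
new = Node("module", children=[
    Node("function", "foo", [Node("body", "return 1")]),
    Node("function", "bar", [Node("body", "return 3")]),
])

print(changed_top_level(old, new))
```

Only the units reported here would need a full structural diff; everything else can be treated as unchanged without building a graph for it.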
- Minimal Edit Scripts ≠ Human-Legible Diffs
  "Users would rather see the semantic intention (e.g., a new argument added) rather than minimal syntactic change."
  - 🧠 For git-ast: This is critical. Your diffs are not for machines; they're for reviewers. Consider heuristics like:
    - Matching functions or expressions by structure and location, not just identity.
    - "Prefer whole-node changes over partial" when unclear.
    - Providing "diff styles": minimal vs readable, like `--diff-style readable`.
- Fallbacks Are Required
  "If the graph is too big, Difftastic falls back to a line-oriented diff."
  - 🧠 For git-ast: You'll need similar fallbacks:
    - For language grammar gaps (unsupported Tree-sitter grammars).
    - For malformed syntax (use Tree-sitter's error-tolerant nodes).
    - For performance cliffs (auto-revert to text mode for massive files).
    - Maybe even a `git config` setting to disable AST mode for select files or repos.
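A fallback policy along these lines could look like the sketch below. The threshold `MAX_GRAPH_VERTICES` and the function names are hypothetical, and a node count of `None` is used here to mean "no grammar available or parsing failed". The text-mode path uses Python's standard `difflib`.

```python
import difflib
from typing import List, Optional

# Hypothetical budget for the L x R search graph; tune against real repos.
MAX_GRAPH_VERTICES = 1_000_000


def pick_strategy(
    left_nodes: Optional[int],
    right_nodes: Optional[int],
    limit: int = MAX_GRAPH_VERTICES,
) -> str:
    """Decide between a structural diff and a plain line diff."""
    # No parse tree on either side: unsupported grammar or failed parse.
    if left_nodes is None or right_nodes is None:
        return "text"
    # The graph grows with the product of both tree sizes.
    if left_nodes * right_nodes > limit:
        return "text"
    return "structural"


def line_diff(left_src: str, right_src: str) -> List[str]:
    """Line-oriented fallback using the standard library."""
    return list(
        difflib.unified_diff(
            left_src.splitlines(),
            right_src.splitlines(),
            fromfile="a",
            tofile="b",
            lineterm="",
        )
    )


if pick_strategy(5000, 5000) == "text":
    for line in line_diff("x = 1\ny = 2", "x = 1\ny = 3"):
        print(line)
```

Checking the budget before building any graph keeps the worst case cheap: the decision costs two node counts, not a partial search.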
⸻
- Adopt the S-expression Model
  Normalize all Tree-sitter parse trees into a simplified, structural, S-expression-style representation.
  - Flattened, but parent-aware.
  - Normalizes across grammars.
  - Enables generic diff/merge algorithms.
  This can become your internal interchange format.
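As a rough illustration of that normalization, the sketch below turns a parse tree into nested atoms and lists. `CstNode` is a hypothetical stand-in for a Tree-sitter node (it is not the Tree-sitter API), and `to_sexpr` and `render` are names made up for this example. A real serializer would also need to keep positions and delimiter text.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class CstNode:
    """Hypothetical parse-tree node: grammar kind, leaf text, children."""
    kind: str
    text: str = ""
    children: List["CstNode"] = field(default_factory=list)


# An s-expression is either an atom (str) or a list of s-expressions.
SExpr = Union[str, List["SExpr"]]


def to_sexpr(node: CstNode) -> SExpr:
    """Drop grammar-specific kinds: leaves become atoms, the rest become lists."""
    if not node.children:
        return node.text
    return [to_sexpr(child) for child in node.children]


def render(expr: SExpr) -> str:
    """Pretty-print the normalized tree in Lisp-style notation."""
    if isinstance(expr, str):
        return expr
    return "(" + " ".join(render(e) for e in expr) + ")"


# Hypothetical tree for a call such as foo(bar, 1).
call = CstNode("call", children=[
    CstNode("identifier", "foo"),
    CstNode("arguments", children=[
        CstNode("identifier", "bar"),
        CstNode("number", "1"),
    ]),
])

print(render(to_sexpr(call)))
```

Because the output no longer mentions grammar-specific node kinds, the same diff code can run on trees from any language that has a parser.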
- Build a Heuristic Diff Strategy, Not a Perfect One
  Difftastic runs Dijkstra's algorithm over a graph with hand-tuned edge costs, so it finds the cheapest path under that cost model rather than a theoretically "optimal" diff.
  You should design your diff/merge engine around heuristics and cost models:
  - Penalize unmatched delimiters, reordered parameters, etc.
  - Tweak cost models to match real-world readability (optimize UX, not theory).
  - Train on real diffs from your codebase to tune parameters.
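Here is a deliberately small sketch of a cost-model-driven diff using Dijkstra's algorithm. It works on flat token lists, so it leaves out the nesting and parent tracking that make Difftastic's real graph hard. The costs in `novel_cost` are invented for illustration, and all names are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

DELIMITERS = {"(", ")", "[", "]", "{", "}"}

Vertex = Tuple[int, int]   # (position in left, position in right)
Op = Tuple[str, str]       # ("keep" | "remove" | "add", token)


def novel_cost(token: str) -> int:
    """Illustrative cost model: an unmatched delimiter costs more than other tokens."""
    return 3 if token in DELIMITERS else 1


def diff_tokens(left: Sequence[str], right: Sequence[str]) -> Tuple[int, List[Op]]:
    """Cheapest path from (0, 0) to (len(left), len(right)) under novel_cost."""
    start: Vertex = (0, 0)
    goal: Vertex = (len(left), len(right))
    dist: Dict[Vertex, int] = {start: 0}
    prev: Dict[Vertex, Tuple[Vertex, Op]] = {}
    heap: List[Tuple[int, Vertex]] = [(0, start)]

    while heap:
        cost, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            break
        if cost > dist[(i, j)]:
            continue  # stale queue entry
        # Neighbours are generated on demand rather than built up front.
        edges: List[Tuple[Vertex, int, Op]] = []
        if i < len(left) and j < len(right) and left[i] == right[j]:
            edges.append(((i + 1, j + 1), 0, ("keep", left[i])))
        if i < len(left):
            edges.append(((i + 1, j), novel_cost(left[i]), ("remove", left[i])))
        if j < len(right):
            edges.append(((i, j + 1), novel_cost(right[j]), ("add", right[j])))
        for nxt, weight, op in edges:
            new_cost = cost + weight
            if nxt not in dist or new_cost < dist[nxt]:
                dist[nxt] = new_cost
                prev[nxt] = ((i, j), op)
                heapq.heappush(heap, (new_cost, nxt))

    # Walk back from the goal to recover the edit script.
    ops: List[Op] = []
    vertex = goal
    while vertex != start:
        vertex, op = prev[vertex]
        ops.append(op)
    ops.reverse()
    return dist[goal], ops


total, ops = diff_tokens(
    ["foo", "(", "bar", ")"],
    ["foo", "(", "novel", ",", "bar", ")"],
)
```

Changing `novel_cost` changes which script wins, which is the tuning knob the bullets above are about.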
- Reframe Merges as Incremental Reconciliation
  Instead of "merge trees A and B", think:
  - Split both trees into regions (e.g., functions).
  - Match nodes by name, position, or structure.
  - Compare matched nodes individually (smaller graphs).
  - Fall back to a 3-way merge for unmatched/ambiguous nodes.
  This keeps the merge task scalable and human-decodable.
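The matching step above might be sketched like this, assuming the trees have already been split into named regions. `Region` and `match_regions` are hypothetical names. Matching is by name only, and the 3-way fallback for unmatched regions is left out.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    """Hypothetical semantic unit, e.g. one function and its source text."""
    name: str
    body: str


def match_regions(
    left: List[Region], right: List[Region]
) -> Tuple[List[Tuple[Region, Region]], List[Region], List[Region]]:
    """Pair regions by name; return (matched pairs, removed, added)."""
    right_by_name = {r.name: r for r in right}
    left_names = {r.name for r in left}

    matched: List[Tuple[Region, Region]] = []
    removed: List[Region] = []
    for region in left:
        other = right_by_name.get(region.name)
        if other is None:
            removed.append(region)
        else:
            matched.append((region, other))

    added = [r for r in right if r.name not in left_names]
    return matched, removed, added


ours = [Region("parse", "return tokens"), Region("old_helper", "pass")]
theirs = [Region("parse", "return list(tokens)"), Region("new_helper", "pass")]

matched, removed, added = match_regions(ours, theirs)

# Only matched pairs whose bodies differ need a (small) structural diff.
needs_diff = [(a.name, b.name) for a, b in matched if a.body != b.body]
```

Each pair in `needs_diff` is an independent, much smaller problem, while `removed` and `added` are the candidates for the more conservative fallback path.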
- Dogfood Diff Quality Early
  Make a `git-ast diff` command the first user-facing milestone. It will:
  - Prove the AST pipeline works.
  - Validate Tree-sitter grammars.
  - Build buy-in from developers (e.g., "wow, clean diffs!").
  Dogfooding diffs ≫ merging early. This is Wilfred's own learning: diff quality is what people feel first.
- Expect Structural Edge Cases
  Wilfred mentions tricky patterns like:
    (foo (bar)) → (foo (novel) (bar))
  You'll see this in JS, Python, etc., where order matters subtly. Have a test corpus of structural patterns:
  - Param reordering
  - Function inlining
  - Nested conditionals
  - Decorator changes
  - Large string mutations
  Each of these will need different cost tweaks or even custom handling.
⸻
You're building something ambitious and long overdue. Wilfred Hughes' Difftastic is the clearest example of someone wrestling the same dragons — and succeeding, with real, daily usability.
The biggest lesson from Difftastic is this:
"Structural correctness is not the same as developer usefulness." → You must constantly tune for real-world readability, not theoretical minimality.
Take this wisdom into git-ast. Build aggressively, but always in the loop with users. Prioritize diffs over merges. Tune over optimize. Fall back gracefully. And lean hard on Tree-sitter — it's your superpower.
If you'd like, I can help you:
- Extract a sample structural corpus from your codebase to benchmark against.
- Design a prototype of your own s-expression tree serializer.
- Or even propose a `git-ast diff` CLI MVP based on Difftastic's model.
Let me know how you'd like to dive in next.
This document synthesizes insights from Wilfred Hughes' blog post "Difftastic, the Fantastic Diff" (Sept 6, 2022) as applied to the git-ast project.