Generalize computable multivariate polynomials to arbitrary ordered index types #175

@quangvdao


Posted by Cursor assistant (model: GPT-5.4) on behalf of the user (Quang Dao) with approval.

This design direction arose out of discussion with Julian Sutherland (@Julek), Katy Hristova (@katyhr), and Derek Sorensen (@dhsorens).

Summary

We should generalize CMvPolynomial from the current finite-arity surface

CMvPolynomial n R

to a variable-indexed surface

CMvPolynomial σ R

for an arbitrary ordered index type σ, possibly infinite.

The key design constraint is that this remains a computable multivariate polynomial representation. In particular, we do not want to simply redefine the core in terms of Finsupp, because that defeats the purpose of this library’s computable sparse runtime representation.

The design direction described below is the one currently implemented in the in-progress branch:

  • sparse monomials as canonical arrays of (σ × Nat) entries
  • sparse outer polynomial storage as Std.ExtTreeMap (CMvMonomial σ) R
  • comparator-style assumptions on variables (Ord / TransOrd / LawfulEqOrd), not LinearOrder
  • Finsupp used only at the boundary for MvPolynomial interoperability

This issue documents:

  • the design choice
  • why we think it is the right one
  • what alternatives we considered and rejected
  • how this compares to other computer algebra systems
  • the current blocker: normalization and sorting proofs, especially qsort vs List.mergeSort

Goal

We want the computable multivariate polynomial layer to support arbitrary variable types σ, including possibly infinite ones, while preserving:

  • sparse computable storage
  • support-driven algorithms
  • canonical monomial keys
  • a public surface closer to Mathlib’s MvPolynomial σ R

Today, much of the existing core is built around Fin n and dense exponent vectors. That is workable for fixed finite arity, but it bakes ambient-dimension assumptions into many operations and creates friction with the MvPolynomial surface.

The generalized core should instead support:

  • X : σ → CMvPolynomial σ R
  • support-driven degreeOf, vars, degrees, restrictDegree, eval, rename, bind₁
  • variable types σ that may be infinite
  • a bridge to MvPolynomial σ R at the boundary

without sacrificing computability of the core representation.


Chosen design

Monomials

A monomial is represented as a sparse finite Nat-valued map on variables σ, where:

  • missing variable means exponent 0
  • stored zero exponents are forbidden
  • the representation is canonical

Concretely, the runtime representation is:

  • a canonical sparse array of (σ × Nat) entries
  • sorted by variable
  • with duplicate variables merged
  • with zero exponents removed
  • with cached totalDegree

The current implementation branch models this through a raw/lawful split:

  • RawMonomial σ stores:
    • totalDegree : Nat
    • entries : Array (σ × Nat)
  • CMvMonomial σ is the canonical lawful wrapper

This preserves the semantic model of a finite Nat-valued map while keeping the concrete runtime representation sparse and efficient.
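A minimal Lean sketch of the raw/lawful split described above (the `Canonical` predicate and its exact fields are illustrative, not the branch's definitions):

```lean
/-- Raw sparse monomial: entries plus a cached total degree.
    No invariants are baked into this structure. -/
structure RawMonomial (σ : Type u) where
  totalDegree : Nat
  entries : Array (σ × Nat)

/-- Illustrative canonicality predicate: entries strictly sorted by
    variable, no stored zero exponents, cached degree consistent. -/
def RawMonomial.Canonical {σ : Type u} [Ord σ] (m : RawMonomial σ) : Prop :=
  m.entries.toList.Pairwise (fun a b => compare a.1 b.1 = .lt) ∧
  (∀ e ∈ m.entries.toList, e.2 ≠ 0) ∧
  m.totalDegree = m.entries.foldl (fun d e => d + e.2) 0

/-- Lawful monomials: raw monomials satisfying the invariant. -/
def CMvMonomial (σ : Type u) [Ord σ] :=
  { m : RawMonomial σ // m.Canonical }
```

Strict sortedness (`.lt` between adjacent variables) gives both "sorted" and "duplicates merged" in one condition, which is what makes the representation canonical.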

Polynomials

A polynomial remains a sparse outer map from monomials to coefficients:

  • Unlawful σ R := Std.ExtTreeMap (CMvMonomial σ) R
  • Lawful σ R := { p : Unlawful σ R // no zero coefficients }
  • CMvPolynomial σ R := Lawful σ R

So the overall design is sparse at both levels:

  1. sparse monomials
  2. sparse outer polynomial map
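Transcribed into Lean, the two levels look roughly like this. This is a sketch: `Monomial` stands in for the canonical `CMvMonomial σ` type together with the comparator instances the outer map needs, and the no-zero-coefficient invariant is phrased illustratively:

```lean
import Mathlib

-- `Monomial` abstracts the canonical monomial key type; a lawful
-- comparator is assumed so it can key a `Std.ExtTreeMap`.
variable (Monomial : Type) [Ord Monomial] [Std.TransOrd Monomial]
variable (R : Type) [Zero R]

abbrev Unlawful := Std.ExtTreeMap Monomial R

/-- Lawful polynomials: the stored map never maps a key to `0`. -/
def Lawful :=
  { p : Unlawful Monomial R // ∀ m v, p[m]? = some v → v ≠ 0 }
```

The subtype keeps the zero-free invariant in the type while the runtime data stays a plain ordered map.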

Typeclass assumptions on variables

We want to stay at comparator granularity rather than requiring LinearOrder.

The intended contract is:

  • raw normalization code uses [Ord σ]
  • lawful monomials / polynomial APIs use [TransOrd σ] [LawfulEqOrd σ]

This is the right fit for the stdlib ordered-map world we are already using.
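The split can be illustrated with two hypothetical declarations: executable raw code only mentions `[Ord σ]`, while statements about canonical form pick up the lawfulness classes (names `sortByVar` and `sortByVar_sorted` are for exposition only):

```lean
import Std

-- Raw code: a bare comparator suffices to compile.
def sortByVar {σ : Type} [Ord σ] (es : Array (σ × Nat)) : Array (σ × Nat) :=
  es.qsort (fun a b => (compare a.1 b.1).isLT)

-- Proofs about canonical form need a lawful comparator, but still only
-- comparator granularity, never `LinearOrder σ`:
theorem sortByVar_sorted {σ : Type} [Ord σ] [Std.TransOrd σ] [Std.LawfulEqOrd σ]
    (es : Array (σ × Nat)) :
    (sortByVar es).toList.Pairwise (fun a b => (compare a.1 b.1).isLE = true) := by
  sorry
```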

Finsupp only at the boundary

Finsupp is still useful and appropriate for the Mathlib bridge:

  • CMvMonomial.toFinsupp
  • CMvMonomial.ofFinsupp
  • CMvPolynomial.support : Finset (σ →₀ ℕ)
  • conversions to and from MvPolynomial σ R

But Finsupp should not become the core runtime representation.


Why this design

1. It preserves computability

This is the core reason.

A direct redefinition in terms of Finsupp would make the representation mathematically convenient, but it would give up the explicit computable sparse runtime representation that CompPoly is trying to provide.

The chosen design keeps the runtime model explicit:

  • sparse support
  • canonical normalization
  • ordered storage
  • computable key comparison

2. It supports arbitrary and infinite variable types

Dense exponent vectors fundamentally assume a finite ambient dimension. That is exactly what we want to move away from.

With sparse support arrays:

  • only variables that actually occur are stored
  • only stored support is traversed
  • infinite σ becomes perfectly natural

This is a major improvement for:

  • degreeOf
  • totalDegree
  • vars
  • degrees
  • restrictDegree
  • evalMonomial
  • rename

3. It is symmetric in variables

A recursive sparse representation would privilege one “main variable” at each layer. That is attractive for some algorithms, but it is the wrong default for the surface we want.

The array-of-support design keeps variables symmetric and fits better with:

  • arbitrary index types
  • rename : (σ → τ) → ...
  • bind₁
  • a surface aligned with MvPolynomial σ R

4. It is a good performance compromise

We considered tree-backed sparse monomials as well, but monomials are also keys of the outer ExtTreeMap. That makes the concrete monomial representation very important.

A sparse monomial stored as a canonical sorted array is preferable to a tree-backed monomial because:

  • key comparison is zero-allocation lexicographic array comparison
  • memory layout is tighter
  • monomial add/divide/divides become linear merges over sorted supports
  • we avoid the “tree as key” problem where extensional comparison wants flattening/caching anyway

So the current choice is:

  • outer tree map for polynomials
  • inner sorted sparse arrays for monomials

That looks like the right split.
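The linear-merge claim can be made concrete. A sketch of the support merge behind monomial multiplication (function name hypothetical): exponents of shared variables are added in a single simultaneous walk over both sorted supports, and since both inputs store only positive exponents, no zero-filtering is needed afterwards:

```lean
/-- Merge two variable-sorted supports, adding exponents of shared
    variables. Linear in the combined support size. -/
def mergeAdd {σ : Type} [Ord σ] (xs ys : List (σ × Nat)) : List (σ × Nat) :=
  match xs, ys with
  | [], ys => ys
  | xs, [] => xs
  | x :: xs', y :: ys' =>
    match compare x.1 y.1 with
    | .lt => x :: mergeAdd xs' ys
    | .gt => y :: mergeAdd xs ys'
    | .eq => (x.1, x.2 + y.2) :: mergeAdd xs' ys'
termination_by xs.length + ys.length
```

Divisibility testing and division follow the same walk, with `.eq` comparing or subtracting exponents instead of adding them.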


Alternative designs we considered

A. Keep dense exponent tuples (Vector ℕ n-style monomials)

This is closest to the old design.

Benefits

  • simpler proofs
  • simpler MvPolynomial bridge
  • very good constants for tiny dense Fin n workloads

Drawbacks

  • fundamentally tied to finite ambient arity
  • ambient-dimension scans everywhere
  • poor fit for arbitrary or infinite variable types
  • worse for sparse monomials in large ambient spaces

This does not meet the goal of a genuinely generic CMvPolynomial σ R.


B. Represent monomials as ExtTreeMap σ Nat

This is the most obvious “sparse map” design if one reasons only semantically.

Benefits

  • sparse lookup by variable
  • conceptually close to “finite map from variables to exponents”

Drawbacks

  • monomials are also keys of the outer polynomial map
  • extensional comparison of tree-backed monomials is awkward
  • comparing monomials by flattening/sorting their entries would allocate and be expensive
  • memory layout is much worse than a compact sorted array

We concluded that a tree-backed monomial is the wrong concrete runtime representation, even though the semantics are indeed “finite Nat-valued map plus invariant”.


C. Make the whole polynomial recursively sparse

This is closer to FriCAS / Maxima style.

Benefits

  • better fit for algorithms that privilege a main variable
  • more natural for elimination-style division and GCD-style recursion
  • finSucc / split-variable equivalences become more structural

Drawbacks

  • loses symmetry between variables
  • less natural surface for MvPolynomial σ R
  • renaming and arbitrary-index substitution become more awkward
  • poorer fit for the “generic arbitrary index type” goal

We do not think recursive sparse should be the core representation for this migration.


D. Use an outer sorted term array rather than ExtTreeMap

This is closer to a Singular-style sorted term list.

Benefits

  • better iteration locality
  • attractive for term-heavy kernels
  • potentially a better fit for future Gröbner-oriented algorithms

Drawbacks

  • coefficient lookup is worse
  • maintaining canonicality after insertion is more work
  • incremental arithmetic is more awkward than with a tree map

This remains a plausible future alternative for performance experiments, but it is not the right move for the current cutover.


E. Rebuild the core around Finsupp

This would maximize parity with MvPolynomial.

Benefits

  • simplest bridge to Mathlib
  • easiest statement of theorems
  • closest surface alignment

Drawbacks

  • gives up the explicit computable sparse runtime representation
  • defeats the point of CompPoly’s computable multivariate layer

This is the main alternative we explicitly do not want.


Comparison with other computer algebra systems

The chosen design is best understood as a sparse distributed representation.

Sparse distributed systems

Examples:

  • Sage generic fallback
  • Maple POLY
  • Mathematica’s polynomial-facing CoefficientRules view
  • likely the practical storage view in Singular

CompPoly is closest to this family, with one extra twist:

  • monomials themselves are also stored sparsely

So instead of full exponent tuples, we store only the nonzero support of each monomial.

Why this is good

  • supports arbitrary ordered σ
  • works naturally for infinite variable types
  • avoids ambient scans
  • keeps support-driven operations efficient

Cost

  • bridge proofs to MvPolynomial are harder than if we stored full tuples

Recursive sparse systems

Examples:

  • FriCAS
  • Maxima CRE polynomial form

These privilege a main variable and recurse on the coefficient side.

This can be a good design for elimination-heavy algorithms, but it is not the best fit for:

  • arbitrary index types
  • symmetric variable operations
  • an MvPolynomial-aligned public surface

Recursive dense systems

Example:

  • SymPy dmp

This is the wrong fit for CompPoly’s goals:

  • poor for sparse support
  • poor for infinite variable types
  • poor fit for an explicitly sparse computable design

Current blockers

The main blocker is not the runtime representation itself. The runtime direction is reasonably clear.

The blocker is the proof story around normalization.

The current normalization pipeline

Right now the sparse monomial normalization path is effectively:

  1. sort entries by variable
  2. merge adjacent duplicates
  3. drop zero exponents

In the current implementation branch this is expressed through:

  • Array.qsortOrd
  • Array.mergeAdjacentDups from Batteries
  • filter
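Approximately, in Lean, the pipeline looks like this sketch, using core's `Array.qsort` with an explicit comparator in place of `Array.qsortOrd`, and a hand-rolled `mergeDups` in place of the Batteries combinator (exact signatures in the branch may differ):

```lean
/-- Stand-in for the Batteries merge helper: sums exponents of adjacent
    entries with equal variables in an already-sorted array. -/
def mergeDups {σ : Type} [BEq σ] (es : Array (σ × Nat)) : Array (σ × Nat) :=
  es.foldl (init := #[]) fun acc e =>
    match acc.back? with
    | some b => if b.1 == e.1 then acc.pop.push (b.1, b.2 + e.2) else acc.push e
    | none   => acc.push e

def normalizeEntries {σ : Type} [Ord σ] [BEq σ]
    (es : Array (σ × Nat)) : Array (σ × Nat) :=
  let sorted := es.qsort (fun a b => (compare a.1 b.1).isLT)  -- 1. sort by variable
  (mergeDups sorted).filter (fun e => e.2 != 0)               -- 2. merge dups, 3. drop zeros
```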

This is fine computationally, but the theorem surface is weak.

Why this blocks the bridge

To finish the MvPolynomial bridge cleanly, we need monomial roundtrip lemmas such as:

  • toFinsupp_single
  • toFinsupp_add
  • toFinsupp_ofFinsupp
  • ofFinsupp_toFinsupp

Those in turn need semantic correctness lemmas for normalization:

  • normalization preserves the exponent-map semantics
  • normalization is canonical
  • normalization is identity on already canonical input

That is exactly where the current proof burden spikes.
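Assuming some normalization function `normalize` over entry arrays and a canonicality predicate `Canonical` (both names hypothetical here), the three lemmas have roughly this shape, where the semantic view sums exponents per variable:

```lean
variable {σ : Type} [BEq σ]
variable (normalize : Array (σ × Nat) → Array (σ × Nat))
variable (Canonical : Array (σ × Nat) → Prop)

/-- Semantic view: total exponent of a variable, duplicates summed. -/
def exponent (es : Array (σ × Nat)) (v : σ) : Nat :=
  es.foldl (fun d e => if e.1 == v then d + e.2 else d) 0

-- 1. normalization preserves the exponent-map semantics:
example : Prop := ∀ es v, exponent (normalize es) v = exponent es v
-- 2. normalization produces canonical output:
example : Prop := ∀ es, Canonical (normalize es)
-- 3. normalization is the identity on already-canonical input:
example : Prop := ∀ es, Canonical es → normalize es = es
```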


Sorting algorithm choice: qsort vs List.mergeSort

This is the immediate engineering decision.

Option 1: stay with Array.qsortOrd

Pros

  • runtime-friendly
  • natural for the current array-backed representation
  • already in stdlib
  • likely the best constant factors among the obvious pure Lean options

Cons

  • theorem API is currently very thin
  • proving qsort correctness from current public API is nontrivial
  • Batteries merge helpers also have very little semantic theorem surface
  • this makes monomial normalization proofs expensive

We spun off a branch specifically to probe this.

That branch contains a small exploratory module:

  • CompPoly/Data/Array/QSortApiProofs.lean

The conclusion from that probe is:

  • a small theorem slice is easy
  • the real permutation/sortedness API for qsort is feasible in principle
  • but a clean upstream story likely requires exposing/supporting induction over the recursive qpartition / qsort helpers and also upstreaming some small Perm.extract-style lemmas

Lean already has a historical draft in this direction:

  • qsort_grind branch
  • proof file: tests/lean/run/grind_qsort.lean

So the qsort route is attractive, but it is a larger proof project.


Option 2: switch normalization proofs to List.mergeSort

Pros

  • much better existing theorem surface
  • standard lemmas already available:
    • permutation
    • membership preservation
    • pairwise sortedness
    • identity on sorted input
  • much easier to prove canonicalization and roundtrip lemmas
  • likely the shortest path to finishing the MvPolynomial bridge

Cons

  • normalization becomes list-based in the proof-critical path
  • may lose some constant factors relative to the direct array path
  • requires converting Array to List and back in normalization

This is currently the most pragmatic path if the priority is:

  • finishing the generalized bridge
  • reducing proof burden
  • keeping the runtime representation as Array

In that design, we would still store monomials as arrays, but define normalization through:

  • entries.toList
  • List.mergeSort
  • a repo-local collapseSorted
  • toArray
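Under that plan, normalization could look like this sketch, using core's `List.mergeSort` (which takes a Boolean `le`); `collapseSorted` here is an illustrative definition of the repo-local helper named above:

```lean
/-- Illustrative `collapseSorted`: in a variable-sorted list, sum the
    exponents of adjacent entries with equal variables. -/
def collapseSorted {σ : Type} [BEq σ] (l : List (σ × Nat)) : List (σ × Nat) :=
  match l with
  | x :: y :: rest =>
    if x.1 == y.1 then collapseSorted ((x.1, x.2 + y.2) :: rest)
    else x :: collapseSorted (y :: rest)
  | l => l
termination_by l.length

def normalizeViaMergeSort {σ : Type} [Ord σ] [BEq σ]
    (es : Array (σ × Nat)) : Array (σ × Nat) :=
  let sorted := es.toList.mergeSort (fun a b => (compare a.1 b.1).isLE)
  ((collapseSorted sorted).filter (fun e => e.2 != 0)).toArray
```

The storage stays `Array (σ × Nat)`; only the proof-critical normalization path routes through lists, where the mergeSort lemma library lives.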

Current recommendation

The representation choice itself is sound:

  • monomials: canonical sparse arrays
  • polynomials: ExtTreeMap on canonical monomial keys
  • variables: arbitrary ordered type σ
  • Finsupp: boundary only

The open design question is no longer "what should a monomial be?".

The open question is:

  • how should we implement and prove normalization?

At this point the two realistic next steps are:

  1. Pragmatic route

    • switch normalization proofs to List.mergeSort
    • finish the MvPolynomial bridge
    • keep array runtime storage
  2. Upstream / ecosystem route

    • push on qsort / Batteries merge theorem infrastructure
    • then come back and finish the CompPoly bridge on top of stronger upstream APIs

Given current project priorities, I think (1) is the better immediate route for CompPoly, while (2) is a worthwhile ecosystem contribution in parallel.


Questions for follow-up

  • Do we want to prioritize shipping the generalized CMvPolynomial σ R bridge, or investing first in upstream qsort / merge theorem APIs?
  • Do we want to keep the current existential notion of monomial lawfulness, or move to a more explicit canonical-form predicate for proof friendliness?
  • Should we separate storage order from algebraic monomial order now, or defer that until Gröbner-style algorithms become a concrete priority?
