Join and PartialJoin traits#20
Conversation
43b2c19 to
e295fb3
Compare
bc1cindy
left a comment
There was a problem hiding this comment.
looks good, utACK
accumulating conflicts into a set instead of picking a value arbitrarily is the right call
arminsabouri
left a comment
There was a problem hiding this comment.
PartialJoin and Join definitions looked sane. Conflicts are still tripping me up.
What is the flow? if there is a conflict between two PSBTs and a third needs to be partially joined is it:
- Join all three in one operation and deal with the conflicts?
- Join the first two, resolve conflicts / or rejects. If Ok() then join the third?
The latter makes more sense to me. Curious to know your thoughts
dc777b0 to
d8edb43
Compare
|
Re-ack required, IntoIterator &Conflict implementation was a brain fart, not sure why my brain expected it to magically implement .iter() on the target type (but now that that exists, its implementation is in terms of .iter()) I also reordered some of the impl blocks, conslidated two of them by changing my mind for the 40th time about whether Conflict should require |
d8edb43 to
76f775e
Compare
arminsabouri
left a comment
There was a problem hiding this comment.
re-ACK
Had a question about the use of wrap() in a unit test. Shouldn't be a blocker.
89ed490 to
2a8d621
Compare
b050ba6 to
1345567
Compare
25357ea to
9477920
Compare
aac45b5 to
c792f9a
Compare
Use cargo-llvm-cov nextest --no-tests=warn instead of the exit-status fallback that silently swallowed build failures.
c792f9a to
f1227fb
Compare
This is conditional on coverage_nightly cfg_attr (set by cargo llvm-cov on nightly toolchains), this cfg_attr was allowed in nrrq
Join: infallible binary merge (semilattice operation). Implementors must satisfy idempotent, commutative, and associative laws. JoinMut: in-place variant; blanket impl provides Join automatically. assert_join_laws! is a crate-internal macro (gated behind prop-tests feature, scoped via #[macro_use]). Given an arbitrary strategy, it generates three proptests verifying the semilattice laws for any Join + Clone + PartialEq + Debug type. Used by downstream commits to validate collection and domain-type Join implementations.
f1227fb to
3c9f5a8
Compare
PartialJoin<V>::try_join returns JoinResult<V>, either Ok(v), where v
the least upper bound, or a Err(Conflict<V>) where the conflict contains
values that could not be joined.
Conflict<V> is a multiset with set-equality semantics (order-independent).
It implements JoinMut to merge conflict sets (union of distinct values).
JoinResult<V> itself implements Join:
(Ok(a), Ok(b)) => a.try_join(b) (delegate to PartialJoin on V)
(Ok(v), Err(c)) => Err(c ∪ {v}) (absorb into conflict)
(Err(c), Ok(v)) => Err(c ∪ {v}) (absorb into conflict)
(Err(a), Err(b)) => Err(a ∪ b) (join conflict sets)
Containers or product types that wrap their fields in JoinResult<V> can
implement Join recursively to compute the field-by-field merge without
early exit.
assert_partial_join_laws! is a crate-internal macro (gated behind
prop-tests feature, scoped via #[macro_use]). Given clean-value and
result strategies, it generates try_join law tests, JoinResult law tests
(via assert_join_laws!), and a wrap roundtrip test.
## Rationale for Conflict as flat, order preserving & order insensitive
Conflicts represent a formal completion of the partial semilattice `V :
PartialJoin`. If `a` and `b` are conflicts, commutativity requires that
`a.join(b) == b.join(a)`, and idempotence requires that `a.join(a) ==
a`. This is somewhat at odds with keeping track of where each
conflicting value originated from.
The main purpose of preserving the order is to allow provenance to be
tracked.
If `a` and `b` are both conflict free, and `let c = a.join(b)`, then
`c.conflicted_field.len() == 2` and the first value is from `a` whereas
the second is from `b`, which makes reporting this as an error with
clear diagnostics easier, without requiring the provenance be tracked by
some kind of surrogate ID.
This does not generalize to n > 2, because if
`a.join(b).join(c).some_conflicted_field.len() == 2`, the values could
originate from `(a, b)`, `(b, c)`, or `(a, c)`.
This compromise keeps the interface and implementation simple, allows
provenance to be tracked as long as it's done one pair of values at a
time in a straightforward way, but imposes no additional burdens on
users that do not care about provenance (for instance if computing
something like `vs.reduce(|a, b| a.join(b))`)
### Alternatives considered
Several alternative approaches were tried, of which the compromise of
making `Conflict` just a thin wrapper around Vec seemed the best.
#### HashSet or BTreeSet based
This alternative is very close to what is implemented. The differences
are that with a Vec, the order is preserved, the implementation of
equality and `join` has quadratic complexity. We expect `n` to be very
small so this shouldn't make a difference in practice.
Using a lookup based set requires `V : Hash` or `V : Ord` which the
current `Conflict` does not require (unfortunately adding it later would
be semver breaking, as would be changing the return value from `iter()`
or the associated `IntoIter` type of the `IntoIterator` impl).
#### Recursive data type
The following definition could in principle shadow the `join` structure:
```rs
enum Conflict<V> {
Value(V),
Pair([Box<Conflict>; 2])
}
```
In this case, if `a.join(b).join(c.join(d))` has 4 conflicting values,
they would take the form `Pair([ Pair([ Value(x), Value(y) ]), Pair([
Value(z), Value(w) ]) ])`, which is arguably more informative.
Unfortunately this is still imprecise because if `a.join(b).join(c)` has
a binary conflict `Pair([ Value(x), Value(y) ])`, it's still ambiguous
in the same way.
In order for this approach to be workable it has to shadow the syntax
tree of the join operation for the Ok branch too, in which case this
entire abstraction kind of only computing the transpose, going from e.g.
a list of structs with values, to a struct of lists of values, but not
reducing any of the complexity unless there are no conflicts anywhere.
The purpose of these abstractions is to take the problem of merging two
or more compound values into the a series of simpler problems, merging
two or more elements of a simpler type. Tracking provenance with perfect
fidelity means that if there is any conflict the structure is not
simplified at all.
#### n-ary join
The final option considered was defining join not as a binary operation
but n-ary. This is no different than the other options in terms of
expressive power, it could be just a simple transpose step followed by
some kind of collapse-compatible-values-to-LUB step. Presumably this
could be impl on `SomeNewType(Vec<Psbt>)`, with a transpose operation to
a `VecPsbt` (analog of `ResultPsbt`) which internally contains `Vec<>`
wrappers (instead of `JoinResult` wrappers) for each field. Then this
transposed structure would be joined field-wise, attempting to collapse
the n items to their LUB.
The current implementation can kinda represent this operation by mapping
*everything* to a conflict, joining and then reducing the conflicts, at
the cost of no longer having the special meaning of a unary conflict
representing a global invariant violation.
### Associated error type or generics
The above alternatives imply a "one size fits all" approach. However,
PartialJoin could have an Error type, where `JoinResult<V> = Result<V,
<V as PartialJoin>::Error>`.
Ostensibly this would allow some choice, but with associated types the
choice is fixed per implementation of the trait and so would not afford
users the choice of whether to opt out of provenance tracking for
simpler errors or opt in and deal with the added complexity.
Making the error type fully generic would make that possible with even
more complexity and syntactic overhead. However, no generality would be
gained for this additional complexity.
Thinking of Conflict<V> as just "deferred arguments for a join" (i.e. a
formal product), any arbitrary merge operation can be expressed by just
taking those arguments.
More formally, Conflict<V> is the free semilattice (sets under union)
over V. Since every semilattice is a quotient of the free semilattice,
so there is no operation that can be expressed by setting Error to some
type that merges V's according to some rules (e.g. taking the max of
integers) that can't be expressed by simply processing the conflict
after the fact.
### Conclusion
For these reasons, making Conflict a thin wrapper around `Vec` seems
like the best compromise: has the same expressive power but results in a
simpler interface than all the alternatives, and makes provenance
tracking possible and even relatively straightforward without forcing it
on all users.
3c9f5a8 to
7ca5400
Compare
|
the more i think about it, the n-ary join approach might still be on the table. it has some nice properties. api sketcha at the bottom there will be a
next there would be a the leaf level wrapping for this variant is to wrap in a unary Join struct, instead of wrapping in an this would retain full provenance in an n-ary situation, independently of the conflicted status. missing values could be represented as then this is where JoinResult and Conflict would enter the picture. every should it be the case that downsidesthis adds extra steps but it seems like those steps could be ellided with a convenience API. this sort of forces all of the implementations to be a bit more complex, but since they've already mostly been reduced to a single macro for product types and simple recursive ones for the collection types, this doesn't seem like a major downside. not as direct an expression of order theoretical basis for this library, but a different structure whose correctness validity hinges on that basis (in the reduce() step PartialJoin could provide a pairwise implementation that still obeys the lattice laws) |
arminsabouri
left a comment
There was a problem hiding this comment.
re-ACK
re: n-ary vs binary join. I think n-ary makes more sense and removes the need for Conflict to impl join. It just becomes a means of reporting conflicts. How do you want to proceed?
what n-ary can simplify is the recursive descent into the values, i.e. go from container types (collections & product types) to just primitive values that can conflict or not. but when faced with a list of vluees and they are in conflict, it's very hard to convey what exactlyu the conflict is in a simple way. for n values there may be several subsets that are incompatible with each other, and actually making use of the fact that it's n-ary not binary requires sorting through them, or requires the eerror types to convey this. we could reuse the Conflict value, and just put an Rc in every slot that refers to all of the conflicting slots, etc... but IMO this is a nightmare for what i expect is the most common usecase which is dealing with things pairwise (in the various interactive protocols, you deal with one message at a time and update the state if it works out or reject the message if it doesn't, leaving the state unmodified). n-ary makes a bit more sense in the case where, in a cli, you deal with multiple files and there's no clear precedence order between them. a pairwise reduction might not give the complete information about everything one particular transaction conflicts with without trying to apply join but it's not clear that that is going to be more useful than just reporting the first error discovered. i'm not entirely convinced this can't be improved or that binary is the way to go but it seems like the more conservative choice all things considered and we can always go back and replace it, mathematically the two models are entirely equivalent if we ignore the provenance question so it's not like one is more correct than the other. |

Introduces
JoinandPartialJoin, traits for combining values that respect commutativity, associativity, and idempotence.This allows collections of values to be merged, tracking conflicts between them.
Join::joinis total (infallible),PartialJoin::try_joinreturns aJoinResult.JoinResultimplementsJoin.PartialJoin::wrapwrapsselfinOkinjecting or "lifting" it into theJoinResultspace, which is essentially a completion of join by way of a formal product: theConflictstruct just accumulates arguments tojoinwhen one does not exist in the partial definition.Lifting to the result type is more convenient since a total join can be used to reduce items in any order and obtain the least upper bound of all of the inputs, which in this case may be a conflict.