More reproducibility fixes #1347
base: main
Code Review
This pull request aims to improve reproducibility in JSON serialization by introducing a custom to_canonical_json_string function. The intent is good, and the changes replacing HashMap with BTreeMap in ostree-ext/src/container/store.rs are a positive step towards this goal as they ensure sorted map keys.
However, the core of this PR, the use of olpc_cjson::CanonicalFormatter in to_canonical_json_string, raises a critical concern. As you've noted in the PR description, and as confirmed by the olpc-cjson documentation, this formatter can produce output that is not strictly valid JSON (ASCII control characters 0x00-0x1f are printed literally inside strings). This could lead to significant interoperability problems if the output is consumed by standard JSON parsers.
Additionally, there are a few places where BufWriter has been removed, which might have minor performance implications for very large JSON outputs. These are secondary to the main concern about JSON validity.
Overall, the goal of reproducible JSON is important, but the current approach for canonicalization needs careful reconsideration to ensure standard JSON validity.
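To make the BufWriter trade-off concrete, here is a minimal std-only sketch. The helper names write_streaming and write_via_string and the one-line-per-entry payload are hypothetical and are not taken from the PR; the sketch only illustrates that both paths emit the same bytes, while the first never holds the whole output in memory.

```rust
use std::io::{self, BufWriter, Write};

/// Stream each entry through a `BufWriter`, so only the buffer is held in memory.
/// (Hypothetical helper, for illustration only.)
fn write_streaming<W: Write>(out: W, entries: &[&str]) -> io::Result<()> {
    let mut w = BufWriter::new(out);
    for e in entries {
        writeln!(w, "{}", e)?;
    }
    // Flush explicitly so write errors are reported instead of being lost on drop.
    w.flush()
}

/// Build the whole payload as a `String` first, then write it in one call.
/// (Hypothetical helper, for illustration only.)
fn write_via_string<W: Write>(mut out: W, entries: &[&str]) -> io::Result<()> {
    let mut s = String::new();
    for e in entries {
        s.push_str(e);
        s.push('\n');
    }
    out.write_all(s.as_bytes())
}
```

For the small JSON documents discussed in this PR the difference is probably negligible; it would matter mainly for very large outputs.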
Summary of Findings
- Potential for Non-Standard JSON Output: The core change relies on olpc_cjson::CanonicalFormatter, which is documented to produce output that may not be strictly valid JSON (ASCII control characters 0x00-0x1f are printed literally inside strings). This is a critical issue if the output needs to be consumed by standard JSON parsers, as it could lead to interoperability problems.
- Removal of Buffered Writers: BufWriter has been removed in several places where JSON is written to files or stdout. This is likely due to the new to_canonical_json_string function returning a String. For very large JSON objects, serializing to a full string in memory first and then writing might be less efficient in terms of memory and I/O patterns compared to streaming to a buffered writer. This should be re-evaluated if the canonicalization strategy changes.
- Improved Map Serialization with BTreeMap: The changes from HashMap to BTreeMap in various parts of ostree-ext/src/container/store.rs are a positive step towards reproducible JSON, as BTreeMap ensures keys are serialized in a consistent, sorted order.
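The BTreeMap point can be illustrated without serde at all. The sketch below uses a hand-rolled, hypothetical render_object helper (not serde_json, and not code from this PR) and assumes keys and values need no escaping; it only shows that a BTreeMap yields the same output regardless of insertion order.

```rust
use std::collections::BTreeMap;

/// Render a flat string map as a JSON object.
/// Simplification: keys and values are assumed to need no escaping.
fn render_object(map: &BTreeMap<String, String>) -> String {
    let fields: Vec<String> = map
        .iter() // BTreeMap iterates in ascending key order
        .map(|(k, v)| format!("\"{}\":\"{}\"", k, v))
        .collect();
    format!("{{{}}}", fields.join(","))
}

/// Build a map from pairs, inserting them in the order given.
fn build(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

Calling render_object on maps built from the same pairs in different orders gives byte-identical strings, which is the property a HashMap does not provide.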
Merge Readiness
This pull request addresses an important aspect of reproducibility. The move to BTreeMap for map serialization is a good improvement. However, the use of olpc_cjson::CanonicalFormatter introduces a critical risk of producing non-standard JSON, which could break consumers expecting valid JSON.
I strongly recommend that this critical issue regarding JSON validity be resolved before merging. The PR author has already noted this concern in the description, and further discussion on the best path forward (potentially using standard serde_json with BTreeMap for canonical and valid output) is needed.
As a reviewer, I am not authorized to approve pull requests. Please ensure further review and approval from authorized maintainers after addressing the highlighted concerns.
ostree-ext/src/utils.rs
Outdated
#[context("Serializing to canonical JSON")]
pub(crate) fn to_canonical_json_string<T>(value: T) -> anyhow::Result<String>
where
    T: serde::Serialize,
{
    let mut ser = serde_json::Serializer::with_formatter(Vec::new(), olpc_cjson::CanonicalFormatter::new());
    value.serialize(&mut ser)?;
    let str = String::from_utf8(ser.into_inner())?;
    Ok(str)
}
The introduction of to_canonical_json_string using olpc_cjson::CanonicalFormatter is a key change for reproducibility. However, as you pointed out in the PR description ('not sure it's the best choice has it can produce invalid json'), and as confirmed by the olpc-cjson crate's documentation, this formatter's output is not a strict subset of JSON (ASCII control characters 0x00-0x1f are printed literally inside strings).
This is a critical concern:
- Interoperability: If the JSON generated by this function is intended for systems expecting standard, valid JSON, this will likely cause parsing errors or unexpected behavior.
- Maintainability: Relying on a non-standard JSON format can create a hidden dependency on consumers understanding this specific format.
Could you clarify the intended consumers of this JSON output?
If standard JSON is a requirement, olpc_cjson::CanonicalFormatter seems unsuitable. For achieving reproducible JSON with serde_json while maintaining validity:
- The primary step is ensuring map-like structures (previously HashMap) are BTreeMap. This PR already makes this change in several places (e.g., MetaFilteredData), which is excellent for ensuring keys are serialized in a defined order.
- Standard serde_json will then serialize BTreeMap keys in sorted order, achieving canonical output for maps.
What are your thoughts on using standard serde_json combined with BTreeMap for all map types to achieve reproducibility without sacrificing JSON validity? This would align with the goal of replacing direct serde_json::to_* calls with helpers that ensure canonical (and valid) output.
olpc_cjson::CanonicalFormatter breaks things:
error: Installing to filesystem: Creating ostree deployment: Pulling: Importing: control character (\u0000-\u001F) found while parsing a string at line 2 column 0
This is used in ocidir since ostreedev/ostree-rs-ext@d655a16
@cgwalters what do you think about https://docs.rs/cjson/0.1.2/cjson/ ?
Never mind, this is what you removed.
Linking containers/ocidir-rs#10
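For context on the "control character (\u0000-\u001F) found while parsing a string" error quoted above: RFC 8259 requires the quotation mark, the backslash, and U+0000 through U+001F to be escaped inside JSON strings. The following std-only sketch contrasts an escaping quoter with one that leaves control characters raw. All three function names are hypothetical, and quote_raw_control is only a simplified model of the failure mode, not olpc-cjson's actual implementation.

```rust
/// Quote `s` as a JSON string, escaping `"`, `\` and U+0000..=U+001F
/// as required by RFC 8259.
fn quote_json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Any other control character becomes a \u00XX escape.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Quote `s` but leave control characters raw (simplified model of the
/// problematic output; an assumption for illustration only).
fn quote_raw_control(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// True if a quoted string literal contains a raw control character,
/// which a strict JSON parser rejects.
fn literal_has_raw_control(literal: &str) -> bool {
    literal.chars().any(|c| (c as u32) < 0x20)
}
```

A value containing a newline passes through quote_json_string as the two characters \n, while quote_raw_control emits the newline itself, which is the kind of input a strict parser refuses.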
837b3ff to fd68fdf
Should I move |
This comment was marked as outdated.
Code Review
This pull request introduces a robust solution for ensuring reproducible JSON serialization by leveraging serde_json::Value, which uses BTreeMap for objects as long as the preserve_order feature is disabled, and therefore serializes keys in sorted order. The new JsonOrderedSerialize trait provides a clean abstraction for this, and its application throughout the codebase, along with the switch from HashMap to BTreeMap for relevant data structures, effectively addresses the non-deterministic output issues. The changes are well-targeted and improve the predictability of serialized outputs. Great work!
Summary of Findings
- Documentation for JsonOrderedSerialize trait: In ostree-ext/src/json.rs, the new JsonOrderedSerialize trait could benefit from a doc comment explaining its purpose: to ensure reproducible JSON output by ordering map keys, typically by first converting to serde_json::Value. This would enhance maintainability. (Severity: low, not commented due to review settings)
- Clarity of inline comment: In ostree-ext/src/container/store.rs (line 1007), the comment // Use serde_json::Value to make output reproducible is accurate. For slightly improved clarity, it could be rephrased to focus on the outcome, e.g., // Serialize to JSON with ordered keys for reproducibility via JsonOrderedSerialize trait. (Severity: low, not commented due to review settings)
Merge Readiness
The changes in this pull request are well-implemented and effectively address the issue of non-reproducible JSON output. The introduction of the JsonOrderedSerialize trait is a clean solution. There are a couple of minor suggestions for documentation and comment clarity (mentioned in the findings summary) that the author might consider, but they are not blockers. The code appears to be in good shape for merging. As an AI, I am not authorized to approve pull requests; please ensure further human review and approval before merging.
serde_json doesn't enforce any ordering when serializing HashMap
Signed-off-by: Etienne Champetier <[email protected]>
Use BTreeMap and JsonOrderedSerialize to ensure ordering
Signed-off-by: Etienne Champetier <[email protected]>
Signed-off-by: Etienne Champetier <[email protected]>
    S: Serialize,
{
    fn to_json_ordered_string(&self) -> Result<String> {
        let val = serde_json::to_value(self)?;
Not that the JSON we serialize here is large, but this is inefficient. It's also not really clear to me why this solves the problem...actually it looks like it only does if a not-on-by-default feature is passed https://github.com/serde-rs/json/blob/c1826ebcccb1a520389c6b78ad3da15db279220d/Cargo.toml#L61
I think we really just need to fix the olpc json formatter crate. In order to not block on upstream I'd tentatively be OK forking it into ocidir or so if that helps.
Indeed, it relies on preserve_order being off (so that HashMap is converted to BTreeMap), and it is inefficient, but that is nothing compared to all the single-threaded gzip compression / decompression we are doing in between the milliseconds of JSON processing ;).
Even if a lesser issue, we also need a solution for pretty printing BTW
Let's continue the discussion in containers/ocidir-rs#10
I think we really just need to fix the olpc json formatter crate. In order to not block on upstream I'd tentatively be OK forking it into ocidir or so if that helps.
Passing HashMap to serde_json gives you random JSON :(
olpc_cjson::CanonicalFormatter (already used by ocidir) breaks things (error: Installing to filesystem: Creating ostree deployment: Pulling: Importing: control character (\u0000-\u001F) found while parsing a string at line 2 column 0), so for now I'm just converting to serde_json::Value so that HashMaps are converted to BTreeMaps, and serializing that.
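The idea of converting to an intermediate tree whose objects are BTreeMaps can be sketched with the standard library alone. The Unordered and Ordered enums and the normalize and render functions below are hypothetical stand-ins, not serde_json::Value or the JsonOrderedSerialize trait from this PR, and strings are assumed to need no escaping. The sketch also makes the cost visible: the whole tree is rebuilt before anything is written.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical input tree whose objects use `HashMap` (unspecified iteration order).
enum Unordered {
    Str(String),
    Object(HashMap<String, Unordered>),
}

/// Hypothetical normalized tree whose objects use `BTreeMap` (sorted iteration order).
enum Ordered {
    Str(String),
    Object(BTreeMap<String, Ordered>),
}

/// Recursively move every object into a `BTreeMap`, which sorts the keys.
fn normalize(v: Unordered) -> Ordered {
    match v {
        Unordered::Str(s) => Ordered::Str(s),
        Unordered::Object(m) => {
            Ordered::Object(m.into_iter().map(|(k, v)| (k, normalize(v))).collect())
        }
    }
}

/// Render the normalized tree as compact JSON.
/// Simplification: strings are assumed to need no escaping.
fn render(v: &Ordered) -> String {
    match v {
        Ordered::Str(s) => format!("\"{}\"", s),
        Ordered::Object(m) => {
            let fields: Vec<String> = m
                .iter()
                .map(|(k, v)| format!("\"{}\":{}", k, render(v)))
                .collect();
            format!("{{{}}}", fields.join(","))
        }
    }
}
```

Rendering the normalized tree gives the same string on every run, whatever order the HashMap happened to iterate in.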