More reproducibility fixes #1347

Open: champtar wants to merge 3 commits into main from repro2

@champtar (Contributor) commented Jun 2, 2025

Passing a HashMap to serde_json gives you randomly ordered JSON :(

olpc_cjson::CanonicalFormatter (already used by ocidir) breaks stuff (error: Installing to filesystem: Creating ostree deployment: Pulling: Importing: control character (\u0000-\u001F) found while parsing a string at line 2 column 0), so for now I'm just converting to serde_json::Value, so that HashMaps are converted to BTreeMaps, and serializing that.
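
A minimal, std-only sketch of the idea in this description, not the PR's code: `HashMap` iteration order is unspecified, whereas collecting the entries into a `BTreeMap` yields keys in sorted order. The helper name `to_sorted_json_object` is hypothetical, and escaping is deliberately left out (keys and values are assumed not to need it).

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical helper: render a string map as a JSON object with sorted keys.
/// Assumes keys and values contain no characters that need JSON escaping.
fn to_sorted_json_object(map: &HashMap<String, String>) -> String {
    // BTreeMap iterates in ascending key order, independent of the
    // HashMap's internal layout or insertion order.
    let sorted: BTreeMap<&str, &str> = map
        .iter()
        .map(|(k, v)| (k.as_str(), v.as_str()))
        .collect();
    let fields: Vec<String> = sorted
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", k, v))
        .collect();
    format!("{{{}}}", fields.join(","))
}
```

Two maps holding the same entries produce the same string here regardless of the order in which the entries were inserted.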

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to improve reproducibility in JSON serialization by introducing a custom to_canonical_json_string function. The intent is good, and the changes replacing HashMap with BTreeMap in ostree-ext/src/container/store.rs are a positive step towards this goal as they ensure sorted map keys.

However, the core of this PR, the use of olpc_cjson::CanonicalFormatter in to_canonical_json_string, raises a critical concern. As you've noted in the PR description, and as confirmed by the olpc-cjson documentation, this formatter can produce JSON that is not strictly valid according to standard JSON specifications (e.g., unquoted keys). This could lead to significant interoperability problems if the output is consumed by standard JSON parsers.

Additionally, there are a few places where BufWriter has been removed, which might have minor performance implications for very large JSON outputs. These are secondary to the main concern about JSON validity.

Overall, the goal of reproducible JSON is important, but the current approach for canonicalization needs careful reconsideration to ensure standard JSON validity.

Summary of Findings

  • Potential for Non-Standard JSON Output: The core change relies on olpc_cjson::CanonicalFormatter, which is documented to produce JSON that may not be strictly valid (e.g., unquoted keys). This is a critical issue if the output needs to be consumed by standard JSON parsers, as it could lead to interoperability problems.
  • Removal of Buffered Writers: BufWriter has been removed in several places where JSON is written to files or stdout. This is likely due to the new to_canonical_json_string function returning a String. For very large JSON objects, serializing to a full string in memory first and then writing might be less efficient in terms of memory and I/O patterns compared to streaming to a buffered writer. This should be re-evaluated if the canonicalization strategy changes.
  • Improved Map Serialization with BTreeMap: The changes from HashMap to BTreeMap in various parts of ostree-ext/src/container/store.rs are a positive step towards reproducible JSON, as BTreeMap ensures keys are serialized in a consistent, sorted order.
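
To make the `BufWriter` point concrete, here is a minimal std-only sketch of streaming a sorted map to a buffered writer without first building the whole document as a `String`. The function name `write_object_streaming` is hypothetical and not part of this PR, and escaping is omitted on the assumption that keys and values do not need it.

```rust
use std::collections::BTreeMap;
use std::io::{self, BufWriter, Write};

/// Hypothetical helper: stream a string map as a JSON object without building
/// the whole document in memory first. Assumes no characters need escaping.
fn write_object_streaming<W: Write>(out: W, map: &BTreeMap<String, String>) -> io::Result<()> {
    // BufWriter coalesces the many small writes below into fewer large ones.
    let mut w = BufWriter::new(out);
    w.write_all(b"{")?;
    for (i, (k, v)) in map.iter().enumerate() {
        if i > 0 {
            w.write_all(b",")?;
        }
        write!(w, "\"{}\":\"{}\"", k, v)?;
    }
    w.write_all(b"}")?;
    // Flush explicitly so write errors surface here rather than being lost on drop.
    w.flush()
}
```

The trade-off is the one named in the finding: a helper that returns a `String` is simpler to call, while a streaming writer avoids holding the full output in memory.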

Merge Readiness

This pull request addresses an important aspect of reproducibility. The move to BTreeMap for map serialization is a good improvement. However, the use of olpc_cjson::CanonicalFormatter introduces a critical risk of producing non-standard JSON, which could break consumers expecting valid JSON.

I strongly recommend that this critical issue regarding JSON validity be resolved before merging. The PR author has already noted this concern in the description, and further discussion on the best path forward (potentially using standard serde_json with BTreeMap for canonical and valid output) is needed.

As a reviewer, I am not authorized to approve pull requests. Please ensure further review and approval from authorized maintainers after addressing the highlighted concerns.

Comment on lines 38 to 47
#[context("Serializing to canonical JSON")]
pub(crate) fn to_canonical_json_string<T>(value: T) -> anyhow::Result<String>
where
    T: serde::Serialize,
{
    let mut ser = serde_json::Serializer::with_formatter(Vec::new(), olpc_cjson::CanonicalFormatter::new());
    value.serialize(&mut ser)?;
    let str = String::from_utf8(ser.into_inner())?;
    Ok(str)
}

critical

The introduction of to_canonical_json_string using olpc_cjson::CanonicalFormatter is a key change for reproducibility. However, as you pointed out in the PR description ('not sure it's the best choice has it can produce invalid json'), and as confirmed by the olpc-cjson crate's documentation, this formatter is known to produce a format that is 'not necessarily valid JSON (object keys are not quoted)'.

This is a critical concern:

  1. Interoperability: If the JSON generated by this function is intended for systems expecting standard, valid JSON, this will likely cause parsing errors or unexpected behavior.
  2. Maintainability: Relying on a non-standard JSON format can create a hidden dependency on consumers understanding this specific format.

Could you clarify the intended consumers of this JSON output?

If standard JSON is a requirement, olpc_cjson::CanonicalFormatter seems unsuitable. For achieving reproducible JSON with serde_json while maintaining validity:

  • The primary step is ensuring map-like structures (previously HashMap) are BTreeMap. This PR already makes this change in several places (e.g., MetaFilteredData), which is excellent for ensuring keys are serialized in a defined order.
  • Standard serde_json will then serialize BTreeMap keys in sorted order, achieving canonical output for maps.

What are your thoughts on using standard serde_json combined with BTreeMap for all map types to achieve reproducibility without sacrificing JSON validity? This would align with the goal of replacing direct serde_json::to_* calls with helpers that ensure canonical (and valid) output.

Contributor Author

olpc_cjson::CanonicalFormatter breaks things:
error: Installing to filesystem: Creating ostree deployment: Pulling: Importing: control character (\u0000-\u001F) found while parsing a string at line 2 column 0

This is used in ocidir since ostreedev/ostree-rs-ext@d655a16

@cgwalters what do you think about https://docs.rs/cjson/0.1.2/cjson/ ?
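
The parse error quoted above is consistent with the OLPC canonical JSON string rules as they are usually described (an assumption here, not checked against the crate's source): only `"` and `\` are escaped, so a newline inside a string value is written as a raw byte, which standard JSON parsers reject. The std-only sketch below contrasts the two escaping policies. Both function names are hypothetical, and neither is the `olpc_cjson` or `serde_json` implementation.

```rust
/// Escape only `"` and `\`, leaving every other character as-is
/// (assumed here to mirror the OLPC canonical JSON string rules).
fn quote_minimal(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            _ => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Additionally escape U+0000 through U+001F, as standard JSON requires.
fn quote_standard(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the generic \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            _ => out.push(c),
        }
    }
    out.push('"');
    out
}
```

With an input containing a newline, `quote_minimal` keeps the raw newline inside the quotes, while `quote_standard` emits the two characters `\n`.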

Contributor Author

Never mind, this is what you removed.

@champtar force-pushed the repro2 branch 3 times, most recently from 837b3ff to fd68fdf, on June 3, 2025 09:14
@champtar (Contributor, Author) commented Jun 3, 2025

Should I move JsonOrderedSerialize to bootc-utils and clean up all the serde_json::to_* calls in the codebase?

@champtar

This comment was marked as outdated.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a robust solution for ensuring reproducible JSON serialization by leveraging serde_json::Value which internally uses BTreeMap for objects, thus guaranteeing key order. The new JsonOrderedSerialize trait provides a clean abstraction for this, and its application throughout the codebase, along with the switch from HashMap to BTreeMap for relevant data structures, effectively addresses the non-deterministic output issues. The changes are well-targeted and improve the predictability of serialized outputs. Great work!

Summary of Findings

  • Documentation for JsonOrderedSerialize trait: In ostree-ext/src/json.rs, the new JsonOrderedSerialize trait could benefit from a doc comment explaining its purpose: to ensure reproducible JSON output by ordering map keys, typically by first converting to serde_json::Value. This would enhance maintainability. (Severity: low, not commented due to review settings)
  • Clarity of inline comment: In ostree-ext/src/container/store.rs (line 1007), the comment // Use serde_json::Value to make output reproducible is accurate. For slightly improved clarity, it could be rephrased to focus on the outcome, e.g., // Serialize to JSON with ordered keys for reproducibility via JsonOrderedSerialize trait. (Severity: low, not commented due to review settings)

Merge Readiness

The changes in this pull request are well-implemented and effectively address the issue of non-reproducible JSON output. The introduction of the JsonOrderedSerialize trait is a clean solution. There are a couple of minor suggestions for documentation and comment clarity (mentioned in the findings summary) that the author might consider, but they are not blockers. The code appears to be in good shape for merging. As an AI, I am not authorized to approve pull requests; please ensure further human review and approval before merging.

champtar added 3 commits June 3, 2025 07:50
serde_json doesn't enforce any ordering when serializing HashMap

Signed-off-by: Etienne Champetier <[email protected]>
Use BtreeMap and JsonOrderedSerialize to ensure ordering

Signed-off-by: Etienne Champetier <[email protected]>
    S: Serialize,
{
    fn to_json_ordered_string(&self) -> Result<String> {
        let val = serde_json::to_value(self)?;
Collaborator

Not that the JSON we serialize here is large, but this is inefficient. It's also not really clear to me why this solves the problem...actually it looks like it only does if a not-on-by-default feature is passed https://github.com/serde-rs/json/blob/c1826ebcccb1a520389c6b78ad3da15db279220d/Cargo.toml#L61

I think we really just need to fix the olpc json formatter crate. In order to not block on upstream I'd tentatively be OK forking it into ocidir or so if that helps.
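
The feature dependence raised in this comment can be illustrated with a std-only analogy (this is not `serde_json`'s code). By default `serde_json::Map` is backed by a `BTreeMap`, so a round trip through `serde_json::Value` sorts keys. With the `preserve_order` feature enabled it keeps insertion order instead, and because Cargo unifies features across the dependency graph, any crate in the build can switch it on. The two function names below are hypothetical.

```rust
use std::collections::BTreeMap;

/// Keys as emitted when the intermediate object is backed by a sorted map
/// (analogous to `serde_json::Map` without `preserve_order`).
fn keys_via_sorted_map(pairs: &[(&str, i32)]) -> Vec<String> {
    let map: BTreeMap<&str, i32> = pairs.iter().cloned().collect();
    map.keys().map(|k| k.to_string()).collect()
}

/// Keys as emitted when the intermediate object keeps insertion order
/// (analogous to enabling `preserve_order`): the input order leaks through.
fn keys_via_insertion_order(pairs: &[(&str, i32)]) -> Vec<String> {
    let ordered: Vec<(&str, i32)> = pairs.to_vec();
    ordered.iter().map(|(k, _)| k.to_string()).collect()
}
```

Only the sorted backing makes the output independent of the order in which entries arrived, which is why the `Value` round trip is reproducible only while the feature stays off.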

Contributor Author

Indeed, it relies on preserve_order being off / HashMap being converted to BTreeMap, and it is inefficient, but that is nothing compared to all the single-threaded gzip compression / decompression we are doing around the milliseconds of JSON processing ;).

Even if it's a lesser issue, we also need a solution for pretty-printing, BTW.

Let's continue the discussion in containers/ocidir-rs#10

Contributor Author

> I think we really just need to fix the olpc json formatter crate. In order to not block on upstream I'd tentatively be OK forking it into ocidir or so if that helps.

containers/ocidir-rs#39
