fix(ci): prefer fresh repo reddit baseline #23

Merged: Sam-24-dev merged 2 commits into main from codex/ci-reddit-baseline-repo on Mar 28, 2026

Conversation

@Sam-24-dev (Owner)

This pull request adds a fallback that restores the freshest valid Reddit data baseline in the ETL workflow. It adds a new script, wires it into the workflow to snapshot baselines and restore them on failure, and adds tests for the selection and restore logic. The key changes are grouped below:

Workflow Enhancements:

  • Added a step to snapshot the current Reddit data outputs into a repo_baseline directory during the ETL workflow, ensuring a local backup is always available for fallback.
  • Replaced manual shell logic for restoring previous Reddit bridge files with calls to a new Python script, improving reliability and maintainability.
  • Added a step to restore the Reddit source baseline using the new script when the Reddit job fails, ensuring the workspace is reverted to the latest valid state.
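The snapshot step described above can be sketched roughly as follows. This is only an illustration, not the workflow's actual step: the function name `snapshot_outputs` and the example paths passed to it are assumptions, and only the `repo_baseline` directory name comes from this PR.

```python
import shutil
from pathlib import Path


def snapshot_outputs(workspace: Path, baseline_root: Path, relative_paths: list[str]) -> list[Path]:
    """Copy each existing output into baseline_root, preserving relative paths.

    Hypothetical helper: outputs that do not exist are skipped, not treated as errors.
    """
    copied = []
    for rel in relative_paths:
        src = workspace / rel
        if not src.exists():
            continue  # nothing to snapshot for this output
        dest = baseline_root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        if src.is_dir():
            # Merge into an existing snapshot directory instead of failing.
            shutil.copytree(src, dest, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

A call such as `snapshot_outputs(Path("."), Path("repo_baseline"), ["datos/history"])` would mirror the listed outputs under `repo_baseline/`, so a later restore step has a local copy to fall back on.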

New Script for Baseline Restoration:

  • Introduced scripts/restore_reddit_baseline.py, a script that selects the freshest valid baseline from multiple candidates and restores either the full source data or just the bridge files, based on the mode. This script handles validation, selection, and copying of relevant files and directories.
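To make "freshest valid baseline" concrete, here is a minimal sketch of the selection idea. The `Candidate` dataclass and `pick_freshest` function are hypothetical names, not the script's real API; the sketch assumes each candidate root has already been validated and reduced to a snapshot date, with `None` standing in for an invalid candidate.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one baseline root and its snapshot date."""
    root: Path
    snapshot_date: date


def pick_freshest(candidates: Sequence[Optional[Candidate]]) -> Optional[Candidate]:
    """Return the valid candidate with the newest snapshot date, or None."""
    valid = [c for c in candidates if c is not None]
    if not valid:
        return None
    # max() keeps the first of several equally fresh candidates,
    # so list order acts as the tie-breaker in this sketch.
    return max(valid, key=lambda c: c.snapshot_date)
```

In this sketch the tie-breaking rule is just a side effect of list order; the real script may resolve ties differently.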

Testing and Contract Updates:

  • Added tests/test_restore_reddit_baseline.py to verify that the restoration logic correctly prefers fresher candidates and accurately restores both source and bridge data.
  • Updated workflow contract tests to check for the presence of the new script and related workflow steps, ensuring contract coverage for the new logic.

Copilot AI review requested due to automatic review settings March 28, 2026 21:34
Copilot AI left a comment

Pull request overview

This PR adds an automated fallback mechanism in the weekly ETL GitHub Actions workflow to restore the freshest valid Reddit baseline (from either the checked-in repo snapshot or previously downloaded aggregate artifacts) when the Reddit source job fails.

Changes:

  • Added scripts/restore_reddit_baseline.py to select the freshest valid Reddit baseline candidate and restore either source CSV/history or bridge JSONs.
  • Updated .github/workflows/etl_semanal.yml to snapshot a repo baseline and invoke the restore script on Reddit-job failure (for both source data and bridges).
  • Added unit tests for baseline selection/restoration and updated workflow contract tests to assert the new steps/script.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/test_workflow_etl_contract.py | Extends workflow contract assertions to require the new script and workflow steps. |
| tests/test_restore_reddit_baseline.py | Adds tests for selecting the freshest candidate and restoring source/bridge outputs. |
| scripts/restore_reddit_baseline.py | Implements candidate validation/selection and restore logic for Reddit source and bridge baselines. |
| .github/workflows/etl_semanal.yml | Adds repo baseline snapshotting and fallback restoration steps for Reddit outputs. |

Comment thread scripts/restore_reddit_baseline.py Outdated
Comment on lines +66 to +71
    topics_date = _parse_snapshot_date(
        _load_json(topics_bridge).get("latest_snapshot_date")
    )
    intersection_date = _parse_snapshot_date(
        _load_json(intersection_bridge).get("latest_snapshot_date")
    )
Copilot AI Mar 28, 2026

_discover_candidate() calls _load_json() without guarding against JSONDecodeError / OSError. If either bridge JSON is corrupted or partially written, this will raise and abort candidate selection instead of skipping that candidate and trying the next root, which undermines the “freshest valid baseline” goal. Consider wrapping the JSON read/parse in try/except inside _discover_candidate (or _load_json) and returning None on failure so selection can continue.
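One way to apply this suggestion is sketched below. It is an assumption-laden illustration, not the script's code: `load_json_or_none` and `snapshot_date_or_none` are hypothetical helpers, and the sketch assumes `latest_snapshot_date` is an ISO `YYYY-MM-DD` string, which the PR does not confirm.

```python
import json
from datetime import date
from pathlib import Path
from typing import Optional


def load_json_or_none(path: Path) -> Optional[dict]:
    """Return the parsed JSON object, or None if the file is missing or corrupt."""
    try:
        with path.open(encoding="utf-8") as handle:
            payload = json.load(handle)
    except (OSError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return None
    return payload if isinstance(payload, dict) else None


def snapshot_date_or_none(path: Path) -> Optional[date]:
    """Return the bridge's snapshot date, or None so the caller can skip the candidate."""
    payload = load_json_or_none(path)
    if payload is None:
        return None
    try:
        # Assumes an ISO date string; a missing or non-string value raises TypeError.
        return date.fromisoformat(payload.get("latest_snapshot_date"))
    except (TypeError, ValueError):
        return None
```

With helpers like these, a discovery function can return `None` for a root whose bridge file is unreadable and let selection continue with the next root.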

Comment thread scripts/restore_reddit_baseline.py Outdated
Comment on lines +129 to +130
    if target_history_root.exists():
        shutil.rmtree(target_history_root)
Copilot AI Mar 28, 2026

When the selected candidate has no datos/history/ directory, the code deletes any existing target_history_root before writing a single snapshot file. In the workflow this runs after restoring prev_artifacts history, so selecting repo_baseline (which won’t contain history) will wipe previously recovered Reddit history and the uploaded aggregate artifact will permanently collapse Reddit history. Instead of removing target_history_root, keep existing history when present and just add/overwrite the snapshot file for latest_snapshot_date (or merge history from another candidate).

Suggested change
    if target_history_root.exists():
        shutil.rmtree(target_history_root)
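A history-preserving restore along the lines of this comment might look like the sketch below. `restore_history` and its parameters are hypothetical, and merging the candidate's history into the target (rather than replacing it) is a design choice of this sketch, not something the PR specifies.

```python
import shutil
from pathlib import Path
from typing import Optional


def restore_history(
    candidate_history: Optional[Path],
    snapshot_file: Path,
    target_history_root: Path,
) -> None:
    """Restore history without deleting what is already in the target directory."""
    target_history_root.mkdir(parents=True, exist_ok=True)
    if candidate_history is not None and candidate_history.is_dir():
        # Overlay the candidate's history; target files with other names are kept.
        shutil.copytree(candidate_history, target_history_root, dirs_exist_ok=True)
    # Always add or overwrite the snapshot file for the selected date.
    shutil.copy2(snapshot_file, target_history_root / snapshot_file.name)
```

When the selected candidate has no history directory, this only writes the single snapshot file, so history recovered earlier from `prev_artifacts` would remain in place.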

@Sam-24-dev merged commit aa16396 into main on Mar 28, 2026
2 checks passed
@Sam-24-dev deleted the codex/ci-reddit-baseline-repo branch on March 28, 2026 at 23:25
2 participants