fix(ci): prefer fresh repo reddit baseline #23
Pull request overview
This PR adds an automated fallback mechanism in the weekly ETL GitHub Actions workflow to restore the freshest valid Reddit baseline (from either the checked-in repo snapshot or previously downloaded aggregate artifacts) when the Reddit source job fails.
Changes:
- Added `scripts/restore_reddit_baseline.py` to select the freshest valid Reddit baseline candidate and restore either source CSV/history or bridge JSONs.
- Updated `.github/workflows/etl_semanal.yml` to snapshot a repo baseline and invoke the restore script on Reddit-job failure (for both source data and bridges).
- Added unit tests for baseline selection/restoration and updated workflow contract tests to assert the new steps/script.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `tests/test_workflow_etl_contract.py` | Extends workflow contract assertions to require the new script and workflow steps. |
| `tests/test_restore_reddit_baseline.py` | Adds tests for selecting the freshest candidate and restoring source/bridge outputs. |
| `scripts/restore_reddit_baseline.py` | Implements candidate validation/selection and restore logic for Reddit source and bridge baselines. |
| `.github/workflows/etl_semanal.yml` | Adds repo baseline snapshotting and fallback restoration steps for Reddit outputs. |
    topics_date = _parse_snapshot_date(
        _load_json(topics_bridge).get("latest_snapshot_date")
    )
    intersection_date = _parse_snapshot_date(
        _load_json(intersection_bridge).get("latest_snapshot_date")
    )
_discover_candidate() calls _load_json() without guarding against JSONDecodeError / OSError. If either bridge JSON is corrupted or partially written, this will raise and abort candidate selection instead of skipping that candidate and trying the next root, which undermines the “freshest valid baseline” goal. Consider wrapping the JSON read/parse in try/except inside _discover_candidate (or _load_json) and returning None on failure so selection can continue.
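A minimal sketch of the guarded read suggested above, assuming the bridge files are UTF-8 JSON objects. The helper name `load_json_or_none` is hypothetical and not part of the PR; the real `_load_json` may have a different signature.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def load_json_or_none(path: Path) -> dict[str, Any] | None:
    """Return the parsed JSON object at `path`, or None if it is unusable.

    Hypothetical helper: the caller is expected to skip the candidate
    when None comes back, so selection can continue with the next root.
    """
    try:
        with path.open("r", encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, json.JSONDecodeError, UnicodeDecodeError):
        # Missing, unreadable, truncated, or corrupted file.
        return None
    # A bridge file is assumed to be a JSON object; anything else is invalid.
    return data if isinstance(data, dict) else None
```

With a helper like this, candidate discovery could return `None` as soon as either bridge fails to load, instead of letting the exception abort the whole selection.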
    if target_history_root.exists():
        shutil.rmtree(target_history_root)
When the selected candidate has no datos/history/ directory, the code deletes any existing target_history_root before writing a single snapshot file. In the workflow this runs after restoring prev_artifacts history, so selecting repo_baseline (which won’t contain history) will wipe previously recovered Reddit history and the uploaded aggregate artifact will permanently collapse Reddit history. Instead of removing target_history_root, keep existing history when present and just add/overwrite the snapshot file for latest_snapshot_date (or merge history from another candidate).
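One way the non-destructive restore described above might look. This is a sketch only: the function name `restore_history`, its parameters, and the `<latest_snapshot_date>.csv` file naming are assumptions for illustration, not the script's actual layout.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def restore_history(
    candidate_history: Path | None,
    target_history_root: Path,
    snapshot_csv: Path,
    latest_snapshot_date: str,
) -> Path:
    """Restore history without deleting what is already in the target.

    Hypothetical helper: existing files under `target_history_root` are
    kept; only files with the same name are overwritten.
    """
    target_history_root.mkdir(parents=True, exist_ok=True)
    if candidate_history is not None and candidate_history.is_dir():
        # Merge the candidate's history on top of the existing directory.
        shutil.copytree(candidate_history, target_history_root, dirs_exist_ok=True)
    # Assumed naming scheme: one CSV per snapshot date.
    snapshot_target = target_history_root / f"{latest_snapshot_date}.csv"
    shutil.copy2(snapshot_csv, snapshot_target)
    return snapshot_target
```

Because nothing is removed, history recovered earlier from `prev_artifacts` survives even when the selected candidate (for example `repo_baseline`) carries no history directory of its own.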
This pull request introduces a mechanism for restoring the freshest valid Reddit data baseline in the ETL workflow. It adds a new script, integrates it into the workflow to snapshot and restore baselines on failure, and adds tests for the new behavior. The key changes are grouped below:

Workflow Enhancements:
- Snapshots the repo baseline into a `repo_baseline` directory during the ETL workflow, so a local backup is available for fallback.

New Script for Baseline Restoration:
- Added `scripts/restore_reddit_baseline.py`, a script that selects the freshest valid baseline from multiple candidates and restores either the full source data or just the bridge files, depending on the mode. The script handles validation, selection, and copying of the relevant files and directories.

Testing and Contract Updates:
- Added `tests/test_restore_reddit_baseline.py` to verify that the restoration logic prefers fresher candidates and correctly restores both source and bridge data.
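The "freshest valid candidate" selection can be sketched roughly as follows. The `Candidate` dataclass and the `parse_snapshot_date` and `pick_freshest` functions are illustrative names, and ISO `YYYY-MM-DD` dates are an assumption; the script's own `_parse_snapshot_date` and `_discover_candidate` may behave differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from pathlib import Path
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one baseline root that passed validation."""
    root: Path
    snapshot_date: date


def parse_snapshot_date(raw: object) -> date | None:
    """Parse an ISO date string; return None for anything unusable."""
    if not isinstance(raw, str):
        return None
    try:
        return date.fromisoformat(raw)
    except ValueError:
        return None


def pick_freshest(candidates: Iterable[Candidate | None]) -> Candidate | None:
    """Return the candidate with the latest snapshot date, or None.

    Entries that are None (roots that failed validation) are skipped.
    On a tie, the earliest candidate in the input order wins.
    """
    valid = [c for c in candidates if c is not None]
    return max(valid, key=lambda c: c.snapshot_date, default=None)
```

Passing candidates in priority order (for example the repo snapshot first) would make the tie-breaking explicit, though whether the script breaks ties this way is not stated in the PR.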