feat: Add deletion warning notifications for expiring files #59
ttak-apphelix merged 9 commits into master
Conversation
Pull request overview
This PR adds a “deletion warning” notification path to the partner retirement reporting workflow, intended to alert external POCs when previously-uploaded GDPR report CSVs are approaching their retention-based deletion date.
Changes:
- Adds `deletion_warning_days` to test config scaffolding.
- Extends `retirement_partner_report.py` to optionally scan Drive for near-expiry report files and post warning comments.
- Adds a new Drive API helper, `list_comments_for_file`, to support idempotency checks before posting warnings.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `tubular/tests/retirement_helpers.py` | Adds `deletion_warning_days` to the fake YAML config used by tests. |
| `tubular/scripts/retirement_partner_report.py` | Introduces expiration scanning, warning comment creation, and new CLI options to control/override retention and warning windows. |
| `tubular/google_api.py` | Adds `DriveApi.list_comments_for_file` for retrieving comments on a Drive file. |
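The helper's implementation isn't shown in this summary, but a minimal sketch of what a paginated `list_comments_for_file` might look like against the Drive v3 comments API is below. This is an illustration, not the PR's actual code: the `service` parameter is a hypothetical injected client so the pagination logic can be followed (and exercised) without network access.

```python
def list_comments_for_file(service, file_id, fields='content'):
    """
    Return all comments on a Drive file, following pagination.

    `service` is assumed to behave like a google-api-python-client Drive v3
    service object; the v3 comments().list call requires an explicit `fields`
    selector, so one is always built here.
    """
    comments = []
    page_token = None
    while True:
        response = service.comments().list(
            fileId=file_id,
            fields='nextPageToken,comments({})'.format(fields),
            pageToken=page_token,
        ).execute()
        comments.extend(response.get('comments', []))
        page_token = response.get('nextPageToken')
        if not page_token:
            return comments
```

The caller in the report script would then only need `drive.list_comments_for_file(file_id, fields='content')` and could stay unaware of paging.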
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
robrap
left a comment
I haven't gotten through this yet, but is this ready for review, and has it been through internal review?
I also would need to think through things again, but when I originally thought about this I thought we would need 2 toggles, one for when we drop the special prefix. Is that something that will be coming later?
Internal review is in progress. For the prefix-related change, yes, it is part of a separate PR.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
I reviewed the PR. It's okay to go.
robrap
left a comment
@ttak-apphelix: Thank you. Here are some high-level comments:
- I want you to get to the P2 work sooner rather than later, so please update whatever you wish from my comments. Mostly, I want something that isn't broken, and I didn't review tests or code carefully enough to ensure that. I'm hoping that was covered internally.
- I can get very picky over language, and I got caught up in that. Apologies. Again - pick what you want, and try to move quickly.
- If you need me to review certain logic more carefully, let me know. And again, I really want to get to the more challenging follow-up PR about the files we are trying to delete that don't match the prefix.
Context from the docstrings under discussion:

> """
> Add comments to the uploaded csv files, triggering email notification.
>
> Extract external email addresses from partner folder permissions.
- I'm trying to separate what it does (first line) from how it does it (second line).
- Is "external" address meaningful to you? Are there "internal" email addresses that are different? Does this word help? Maybe `external_emails` below should just be `partner_emails`?
- Also, since the files contain learner emails as PII, I think calling these "partner email addresses" in the docs adds some clarity.
Suggested change:

> """
> - Add comments to the uploaded csv files, triggering email notification.
> - Extract external email addresses from partner folder permissions.
> + Returns partner email addresses for provided partner names.
> + Partner email addresses are extracted from the partner folder permissions.
> This shared helper ensures consistent handling of permissions and denied domains
> across different notification paths (upload comments and deletion warnings).
Nit:

Suggested change:

> This shared helper ensures consistent handling of permissions and denied domains
> - across different notification paths (upload comments and deletion warnings).
> + across different notification paths (e.g. upload comments, deletion warnings, etc.).
> def _add_comments_to_files(config, file_ids):
> def _get_external_emails_for_partners(drive, config, partners=None):
See other comment for more details. Does this simpler name make sense?
Suggested change:

> - def _get_external_emails_for_partners(drive, config, partners=None):
> + def _get_partner_emails(drive, config, partners=None):
> Add comments to the uploaded csv files, triggering email notification.
>
> Args:
>     file_ids (dict): Mapping of partner names to Drive file IDs corresponding to the newly uploaded csv files.
In place of `file_ids`, how do you like:
- `partner_to_file_ids_map`, or
- `partner_file_ids_dict`, or
- `partner_file_ids`?

I don't like that the current name does not seem like it would be a dict, and gives no indication of what the keys would be.
Updated to `partner_file_ids_dict`.
> Args:
>     config (dict): Configuration dictionary containing retention settings and Drive folder mappings.
> """
> retention_days = config.get('age_in_days', 60)
Moved comment out of docstring:

Suggested change:

> - retention_days = config.get('age_in_days', 60)
> + # CLI parameter "--age_in_days" represents the retention period.
> + retention_days = config.get('age_in_days', 60)
> if file_created <= warning_threshold:
>     file_id = file_info.get('id')
>
>     # Check for filename in existing comments (stable even if message wording changes)
I'm not completely clear on this comment. The first half seems reversed? The second half I'm unsure about. Is this mostly what you are trying to say?

Suggested change:

> - # Check for filename in existing comments (stable even if message wording changes)
> + # First check if the file already has the deletion warning to avoid
> + # duplicate comments on re-run.
That's correct. Updated.
> # Check for filename in existing comments (stable even if message wording changes)
> existing_comments = drive.list_comments_for_file(file_id, fields='content')
> has_warning = any(
- Remind me (and we should probably document here) the relationship between file comments and email notifications. Are these two different things, or one and the same?
- How do you know that the file couldn't have comments that are not deletion warning comments? Can someone add a comment to a file manually, and would this mess things up? I'm not clear, and we should document this once I am.

I am not suggesting the following yet, but we could protect against wording changes while also ensuring we are looking at deletion warning comments only:

> # NOTE: The following could be added above DELETION_WARNING_MESSAGE_TEMPLATE
> # Used to verify deletion comments; must exist in DELETION_WARNING_MESSAGE_TEMPLATE
> DELETION_WARNING_PHRASE = "will be automatically deleted"

Note: Then here, you could...
- Ensure the template still includes DELETION_WARNING_PHRASE. If not, maybe fail hard.
- Check only for comments that include DELETION_WARNING_PHRASE.
- File comments and email notifications are the same thing. The POC will get an email from comments-noreply@docs.google.com with the same content that got posted as a file comment.
- The idea was to keep something unique in the comments, which was the filename, but yes, somebody could reply to the comments with the filename and that would have skipped the notification. Updated with the DELETION_WARNING_PHRASE logic.
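A minimal sketch of the phrase-based idempotency check described in this thread. Only `DELETION_WARNING_PHRASE` comes from the review comment above; the template text, the function name, and the comment-dict shape are illustrative assumptions.

```python
# Used to verify deletion comments; must exist in DELETION_WARNING_MESSAGE_TEMPLATE.
DELETION_WARNING_PHRASE = "will be automatically deleted"

# Illustrative template; the real wording in the PR may differ.
DELETION_WARNING_MESSAGE_TEMPLATE = (
    "File {filename} " + DELETION_WARNING_PHRASE + " in {days} days."
)

def has_deletion_warning(existing_comments):
    """
    Return True if any existing comment already contains the warning phrase.

    existing_comments is assumed to be a list of dicts with a 'content' key,
    as returned by the Drive comments listing.
    """
    # Fail hard if the template ever drifts away from the phrase we search for.
    assert DELETION_WARNING_PHRASE in DELETION_WARNING_MESSAGE_TEMPLATE
    return any(
        DELETION_WARNING_PHRASE in comment.get('content', '')
        for comment in existing_comments
    )
```

Matching on a fixed phrase rather than the filename means a manual reply quoting the filename can no longer suppress the notification, which is the failure mode discussed above.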
> if days_until_deletion > warning_days:
>     # File is older than warning_threshold but not yet in the warning window;
>     # this shouldn't normally happen but guard against clock skew or config changes.
>     LOG('File {} has {} days until deletion, outside warning window, skipping'.format(
Will we be ok with a log message for every file, or will we be flooding the log?
We can keep this since the job runs weekly, but let me know if you think otherwise.
> seconds_until_deletion = (deletion_datetime - now).total_seconds()
> days_until_deletion = int(seconds_until_deletion / 86400)
>
> if days_until_deletion <= 0:
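The deletion-window arithmetic quoted above can be sketched as a standalone function. Parameter names mirror the discussion (`retention_days` is the `--age_in_days` value); the real code in the PR may differ.

```python
from datetime import datetime, timedelta, timezone

def days_until_deletion(file_created, retention_days, now=None):
    """
    Whole days remaining before a file hits its retention-based deletion date.

    Zero or negative means the file is already due for deletion; values in
    (0, warning_days] would fall inside the warning window.
    """
    now = now or datetime.now(timezone.utc)
    deletion_datetime = file_created + timedelta(days=retention_days)
    # int() truncates toward zero, matching the quoted / 86400 computation.
    return int((deletion_datetime - now).total_seconds() / 86400)
```

For example, a file created 55 days ago under a 60-day retention has 5 days until deletion, so with `warning_days = 7` it would be inside the warning window.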
Notes:
- My understanding is that you are implementing in parts to keep PR size under control. I think that is fine, as long as:
  - We test the combination in Stage once all parts are complete, and we don't roll out any individual parts to Prod until the whole thing is complete.
  - We make sure that we are focused primarily on the minimum number of changes to resolve the P2, which is handling these files: `not filename.startswith(file_prefix)`.
- We discussed having a deletion notification (as opposed to this deletion warning notification), but maybe we do not need that for this P2 work (and possibly ever)? It's something for us to consider, but especially to unblock the P2 work. Also, I remember there was some complication around the timing of notifying on a file that is going to be deleted, and how to ensure that the notifications actually get sent before the deletion happens.
- I'd love for the next PR to be around sending deletion warning notifications for the files that we care about (i.e. `not filename.startswith(file_prefix)`). I think this will get tricky and needs some thought around how you'll handle the fact that they are already past due for deletion, you want to give a 7-day warning, and you want to not delete for at least 7 days. Since most of this is a unique rollout situation, with temporary toggles, I'd like you to be able to get to this as soon as possible.
> '--age_in_days',
> type=int,
> default=None,
> help='Number of days before files are deleted (matches cleanup job AGE_IN_DAYS).'
I'm not a fan of having to keep parameters in sync across jobs. What are the jobs, and do they still make sense to keep separate? Do we need linting somewhere where this is configured to ensure they stay in sync?
This is the retirement-partner-report-cleanup job, where a CSV file gets deleted after AGE_IN_DAYS has elapsed. This has been updated in RetirementJobs.groovy in PR edx/jenkins-job-dsl#1830.
JIRA ticket
Related PR: