
using epidata metadata to determine staleness for nhsn pulling #2142

Open
wants to merge 6 commits into base: main

Conversation

@aysim319 (Contributor) commented Apr 7, 2025

Description

Use the Epidata metadata API to determine whether the NHSN source data is stale before pulling.
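
For context, a minimal sketch of the approach, assuming the metadata is fetched with the Epidata Python client's covidcast_meta() call and that the source appears there as "nhsn"; the nhsn_meta_df name and the epoch-seconds last_update column mirror the diff below, while the client call and filtering details are illustrative:

from datetime import datetime, timezone

import pandas as pd
from delphi_epidata import Epidata  # assumed client; covidcast_meta() returns a dict with an "epidata" list

# Fetch covidcast metadata and keep only the rows belonging to the nhsn source.
meta = Epidata.covidcast_meta()
nhsn_meta_df = pd.DataFrame(meta["epidata"])
nhsn_meta_df = nhsn_meta_df[nhsn_meta_df["data_source"] == "nhsn"]

# last_update is epoch seconds; interpret it as UTC (see the review discussion below).
last_updated = datetime.fromtimestamp(nhsn_meta_df["last_update"].max(), tz=timezone.utc)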

Associated Issue(s)

@minhkhul (Contributor) left a comment

LGTM. One nit was the utcfromtimestamp thing but everything else looks good.

@aysim319 changed the title from "moved time diff into constants and extended to 36 hours" to "using epidata metadata to determine staleness for nhsn pulling" on Apr 9, 2025
@aysim319 requested a review from melange396 on April 9, 2025 at 21:54
@aysim319 requested a review from minhkhul on April 9, 2025 at 21:58

@melange396 (Contributor) left a comment

In theory, I really like the idea of using our own metadata, but in practice it's going to have problems that make it unworkable. The biggest is that there is an indeterminate delay between data getting inserted into the database and when the metadata is updated to include that data (the delay is bounded, but long enough to throw a wrench in this). Another is that patches will affect the metadata, thereby disrupting scheduling.

Comment on lines +56 to +57
est = timezone(timedelta(hours=-5))
last_updated = datetime.fromtimestamp(nhsn_meta_df["last_update"].min(), tz=est)

This is going to have issues because of DST changes; however, this timestamp should already be in UTC.

It shouldn't make too much of a difference, because the probability of it biting us is low, but I think you'll also want a max instead of a min (in case we change signal names or discontinue signals, among other things).

Suggested change
- est = timezone(timedelta(hours=-5))
- last_updated = datetime.fromtimestamp(nhsn_meta_df["last_update"].min(), tz=est)
+ last_updated = datetime.fromtimestamp(nhsn_meta_df["last_update"].max(), tz=timezone.utc)
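
To make those two points concrete, a small illustrative sketch (the column name mirrors the diff; the sample values are made up). An aware datetime built with a fixed -05:00 offset denotes the right instant but carries the wrong wall-clock label during daylight saving time, and since the epoch value is UTC-based anyway, timezone.utc is the natural choice; meanwhile min() would get pinned to a renamed or discontinued signal that never updates again, whereas max() tracks the newest update across the source's signals.

from datetime import datetime, timedelta, timezone

import pandas as pd

epoch = 1717243200  # 2024-06-01 12:00:00 UTC, i.e. during US daylight saving time
fixed_est = timezone(timedelta(hours=-5))
print(datetime.fromtimestamp(epoch, tz=fixed_est))     # 2024-06-01 07:00:00-05:00, not Eastern local time in summer
print(datetime.fromtimestamp(epoch, tz=timezone.utc))  # 2024-06-01 12:00:00+00:00, unambiguous

# One stale row (e.g. a discontinued signal) pins min() to 2020 forever;
# max() reflects the latest successful update for the source.
nhsn_meta_df = pd.DataFrame({"last_update": [1717243200, 1577836800]})
last_updated = datetime.fromtimestamp(nhsn_meta_df["last_update"].max(), tz=timezone.utc)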

last_updated = datetime.fromtimestamp(nhsn_meta_df["last_update"].min(), tz=est)

# currently set to run twice a week, RECENTLY_UPDATED_DIFF may need adjusting based on the cadence
recently_updated_source = (updated_timestamp - last_updated) > RECENTLY_UPDATED_DIFF

I don't think this math is quite right... why wouldn't we want to proceed any time Socrata has a newer timestamp than we do? The form you have here has the potential to delay processing if updates are frequent enough or if RECENTLY_UPDATED_DIFF is too large.

Suggested change
- recently_updated_source = (updated_timestamp - last_updated) > RECENTLY_UPDATED_DIFF
+ socrata_ts = updated_timestamp
+ delphi_ts = last_updated
+ recently_updated_source = socrata_ts > delphi_ts
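
Putting the suggestion in context, a minimal sketch of the resulting gate, with made-up values standing in for the Socrata and Epidata timestamps that the pipeline would actually fetch:

from datetime import datetime, timezone

# Illustrative values only; in the pipeline these come from Socrata's dataset
# metadata and from the Epidata metadata shown above.
updated_timestamp = datetime(2025, 4, 9, 18, 0, tzinfo=timezone.utc)  # Socrata's last update
last_updated = datetime(2025, 4, 8, 6, 0, tzinfo=timezone.utc)        # newest last_update on our side

# Proceed whenever upstream is newer than what we have ingested; no
# RECENTLY_UPDATED_DIFF window that could delay or skip closely spaced updates.
socrata_ts = updated_timestamp
delphi_ts = last_updated
recently_updated_source = socrata_ts > delphi_ts  # True here, so the pull would run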
