using epidata metadata to determine staleness for nhsn pulling #2142
base: main
Conversation
LGTM. One nit was the `utcfromtimestamp` thing, but everything else looks good.
In theory, I really like the idea of using our own metadata, but in practice it's going to have problems that make it unworkable. The biggest is that there is an indeterminate delay between data being inserted into the database and the metadata being updated to include that data (the delay is bounded, but long enough to throw a wrench in this). Another is that patches will affect the metadata, thereby disrupting scheduling.
```python
est = timezone(timedelta(hours=-5))
last_updated = datetime.fromtimestamp(nhsn_meta_df["last_update"].min(), tz=est)
```
This is going to have issues because of DST changes; however, this timestamp should already be in UTC. It shouldn't make too much of a difference, because the probability of it biting us should be low, but I think you'll also want a max instead of a min (in case we change signal names or discontinue signals, among other things).
Suggested change:
```diff
-est = timezone(timedelta(hours=-5))
-last_updated = datetime.fromtimestamp(nhsn_meta_df["last_update"].min(), tz=est)
+last_updated = datetime.fromtimestamp(nhsn_meta_df["last_update"].max(), tz=timezone.utc)
```
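To illustrate the two points above, here is a minimal, self-contained sketch. The epoch values are hypothetical stand-ins for `nhsn_meta_df["last_update"]` (a plain list is used so the example runs without pandas):

```python
from datetime import datetime, timezone

# Hypothetical epoch-second values standing in for nhsn_meta_df["last_update"];
# in the real pipeline these come from the epidata metadata endpoint.
last_update_epochs = [1700000000, 1710000000, 1720000000]

# The epoch values are already UTC, so interpret them with timezone.utc rather
# than a fixed UTC-5 offset (which is EST year-round and wrong during DST).
# max() tracks the most recently touched signal, so a renamed or discontinued
# signal's stale timestamp cannot drag the result backwards the way min() can.
last_updated = datetime.fromtimestamp(max(last_update_epochs), tz=timezone.utc)

print(last_updated.isoformat())  # 2024-07-03T09:46:40+00:00
```

With `min()`, the same data would yield the November 2023 timestamp, making the source look far staler than it actually is.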
```python
last_updated = datetime.fromtimestamp(nhsn_meta_df["last_update"].min(), tz=est)

# currently set to run twice a week, RECENTLY_UPDATED_DIFF may need adjusting based on the cadence
recently_updated_source = (updated_timestamp - last_updated) > RECENTLY_UPDATED_DIFF
```
I don't think this math is quite right... why wouldn't we want to proceed any time Socrata has a newer timestamp than we do? The form you have here has the potential to delay processing if updates are frequent enough or if RECENTLY_UPDATED_DIFF is too large.
Suggested change:
```diff
-recently_updated_source = (updated_timestamp - last_updated) > RECENTLY_UPDATED_DIFF
+socrata_ts = updated_timestamp
+delphi_ts = last_updated
+recently_updated_source = socrata_ts > delphi_ts
```
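A runnable sketch of the suggested comparison, with hypothetical timestamps standing in for the real values (`updated_timestamp` from Socrata's metadata, `last_updated` from the epidata metadata):

```python
from datetime import datetime, timezone

# Hypothetical stand-in values; in the real pipeline these are derived from
# the Socrata dataset metadata and the epidata metadata, respectively.
updated_timestamp = datetime(2024, 7, 3, 12, 0, tzinfo=timezone.utc)   # Socrata
last_updated = datetime(2024, 7, 3, 9, 46, 40, tzinfo=timezone.utc)    # Delphi

# Proceed whenever the source is strictly newer than our last ingestion.
# A plain comparison never skips an update, whereas the threshold form
# (difference > RECENTLY_UPDATED_DIFF) can silently drop frequent updates.
socrata_ts = updated_timestamp
delphi_ts = last_updated
recently_updated_source = socrata_ts > delphi_ts

print(recently_updated_source)  # True
```

Both datetimes must be timezone-aware (ideally both UTC); comparing an aware datetime against a naive one raises a `TypeError`.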
Description
Using the metadata API to determine staleness.
Associated Issue(s)