2085 add proportions nhsn #2111
base: main
Conversation
Are you planning on adding the new RSV signals in this PR or in a separate one?
nhsn/delphi_nhsn/pull.py (outdated)
```python
updated_timestamp = datetime.utcfromtimestamp(int(response["rowsUpdatedAt"]))
now = datetime.utcnow()
recently_updated = (now - updated_timestamp) < timedelta(days=1)
```
issue: I think this "recently-updated" logic is sufficient but not robust. For example, if we failed to pull data for multiple days, then on the next run we would not pull data we had never seen before unless it was posted in the last day.

The more robust solution would be to save the last pull's updated_timestamp to a local file. We would then load that and compare the new updated_timestamp against it: if exactly equal, skip the update; if unequal, pull the data.
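The file-based approach suggested above could be sketched roughly like this; the helper name, state-file path, and ISO-format storage are illustrative assumptions, not the indicator's actual code:

```python
from datetime import datetime
from pathlib import Path


def should_pull(current_updated: datetime, state_file: Path) -> bool:
    """Hypothetical sketch: compare the source's updatedAt against the last
    saved one. Pull only when the timestamp is new (or on the first run)."""
    if state_file.exists():
        last = datetime.fromisoformat(state_file.read_text().strip())
        if last == current_updated:
            return False  # exactly equal: we already pulled this version
    # unequal or first run: record the new timestamp and signal a pull
    state_file.write_text(current_updated.isoformat())
    return True
```

Because only the single most recent timestamp is stored, the state file never grows; each run overwrites one line.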
That definitely makes sense and is something I didn't think about! The only thing I did differently was use the API instead of scanning the files, since I imagine the file list is going to grow and it doesn't make much sense to scan it every day.
Yeah, checking the API could make sense, too. The one thing I'd caution about is timezones: your previous approach explicitly used UTC on both the "old" and "now" timestamps, but I don't know what the API uses.

Second, the API only has dates, not times. Would that ever cause problems? E.g., if we want to check for updates multiple times a day.
> Yeah, checking the API could make sense, too. The one thing I'd caution is timezones -- your previous approach explicitly used UTC on both "old" and "now" timestamps, but I don't know what the API uses.

Since the data and the dates are just dates, not datetimes, I didn't take timezones into account. Hmm, I also don't know for sure which timezone; I believe it's EST, but I have to double-check.

> The API only has dates, not times. Would that ever cause problems? E.g. we want to check for updates multiple times a day.

Since this is data that generally updates weekly, I was planning on running just once a day, so I thought the timezone wouldn't be as much of an issue.
Okay, given these complications, I'm thinking reading/writing to a file is easier. We wouldn't need to keep a complete list of all update datetimes ever, just the single most recent one. So the file wouldn't keep getting bigger and bigger; we could just read a single line.

This lets us store a UTC datetime (no timezones to worry about), there's no API date-processing to deal with, and we can store a full datetime to be extra precise.
I wasn't a fan of having metadata files; it seems like overkill and introduces more complexity than I would like. So after talking things through with Nolan just now, I decided to simplify the logic: create backups daily, but still do a simple recently-updated check to decide whether to actually continue processing and create the CSV files. That way, if there are outages after the initial pulls, we can go back and do patches for them.

Nolan also mentioned that in the future we could look into creating a generic tool/script specifically to dedup things, and I like that direction since it would separate the complexity away from this codebase.
I was originally planning for a separate one. I considered adding the new signal in this PR, but when I tried to add just the new columns and ran the tests locally, I ran into issues; it looks like it might be more involved, so I think it'd be better to create a separate PR. I also kind of shoved in other issues here (retry and daily checking) and didn't want to add more on top.
Some small questions
```diff
-df = df.astype(TYPE_DICT)
+try:
+    df = df.astype(TYPE_DICT)
+except KeyError:
```
Why does this just pass?
The idea was that some of the older data in the source backup didn't have the newer signals (rsv, reporting hospitals); it would log, but resume the patching process.

Previously I tried modifying TYPE_DICT, but it being mutable caused some issues in patching runs, so this was the next solution. Is it a good one? Ehhh... I log that the signal is unavailable earlier (line 150), and I thought I shouldn't log basically the same message twice.

Since you brought it up... I should also check whether the rest of the columns actually changed data types, and maybe look into a less janky way.
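One less janky alternative might be to build a per-call subset of the type map rather than catching the KeyError or mutating the shared dict. A minimal sketch (the column name and TYPE_DICT contents here are illustrative, not the real NHSN schema):

```python
import pandas as pd

# Illustrative stand-in for the module-level TYPE_DICT in pull.py
TYPE_DICT = {"total_adm": float, "total_rsv_adm": float}


def cast_available(df: pd.DataFrame, type_dict: dict) -> pd.DataFrame:
    """Cast only the columns actually present in df.

    Building a fresh subset per call avoids both the KeyError on
    older backups and the shared-mutable-dict issue in patch runs.
    """
    available = {col: t for col, t in type_dict.items() if col in df.columns}
    return df.astype(available)
```

Missing columns are simply skipped, so the caller could still log the unavailable signals once, in one place.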
```python
    logger.info(f"{prelim_prefix}NHSN data is stale; Skipping", updated_timestamp=updated_timestamp)
# pylint: disable=W0703
except Exception as e:
    logger.info("error while processing socrata metadata; continuing processing", error=str(e))
```
It seems like this function would return true if it encounters an error?
Yeah, because it's just checking whether the data is stale or not. If we already have the data, I don't want this check to be the reason the indicator stops; it was only ever meant to limit duplicating data.
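That fail-open behavior could be isolated in a small predicate, roughly like this sketch (function name and signature are assumptions; only the return-True-on-error shape mirrors the discussion):

```python
from datetime import datetime, timedelta, timezone


def check_last_updated(rows_updated_at, now=None):
    """Return True when data should be processed.

    Any error while reading the Socrata metadata also returns True,
    so a metadata hiccup never halts the indicator; the check exists
    only to skip re-processing data we already have.
    """
    try:
        updated = datetime.fromtimestamp(int(rows_updated_at), tz=timezone.utc)
        now = now or datetime.now(timezone.utc)
        return (now - updated) < timedelta(days=1)
    except Exception:  # pylint: disable=broad-except
        return True  # fail open: process rather than silently skip
```

The trade-off is that a persistent metadata failure means the staleness check is effectively disabled, which is why logging the error (as the snippet above does) matters.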
```python
    time.sleep(2 + random.randint(0, 1000) / 1000.0)
    page = client.get(dataset_id, limit=limit, offset=offset)
else:
    raise err
```
this should probably log as well
Ah yup, I missed that.
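A retry loop with logging added at the retry point might look something like this sketch; `fetch`, `is_retryable`, and the sleep bounds are stand-ins for the Socrata client call and the 503 check in pull.py, not the actual implementation:

```python
import logging
import random
import time

logger = logging.getLogger(__name__)


def get_with_retry(fetch, max_attempts=3, is_retryable=lambda err: True):
    """Call fetch() with jittered sleeps between retryable failures,
    logging each retry so transient 503s leave a trace in the logs."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as err:  # pylint: disable=broad-except
            if attempt == max_attempts or not is_retryable(err):
                logger.error("request failed; giving up", exc_info=err)
                raise
            logger.info("retryable error on attempt %d: %s", attempt, err)
            # short jitter for the sketch; pull.py uses ~2-3 seconds
            time.sleep(0.01 + random.randint(0, 10) / 1000.0)
```

Logging before the sleep means every retry is visible even if the process is later killed mid-backoff.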
Description
Addresses #2085
Changelog
- added a new function that checks the metadata for the last update
  - the function also checks for 503 errors
- added signals for total reporting hospitals
Associated Issue(s)