2085 add proportions nhsn #2111

Open · wants to merge 19 commits into main
Conversation

aysim319 (Contributor)

Description

Addresses #2085.

Changelog

- added a new function that checks the metadata for the last update
  - the function also checks for 503 errors
- added signals for total reporting hospitals

Associated Issue(s)

@nmdefries (Contributor) left a comment:

Are you planning on adding the new RSV signals in this PR or in a separate one?

nhsn/delphi_nhsn/constants.py (outdated; resolved)
Comment on lines 29 to 31:

```python
updated_timestamp = datetime.utcfromtimestamp(int(response["rowsUpdatedAt"]))
now = datetime.utcnow()
recently_updated = (now - updated_timestamp) < timedelta(days=1)
```
Contributor:

issue: I think this "recently-updated" logic is sufficient but not robust. For example, if we fail to pull data for multiple days, the next day we run we would not pull data we had never seen before if it was not posted in the last day.

The more robust solution would be to save last pull's updated_timestamp to a local file. We would then load that and compare updated_timestamp to that -- if exactly equal, skip update; if unequal, pull data.

@aysim319 (Contributor, Author), Feb 4, 2025:

That definitely makes sense and is something I didn't think about! The only thing I did differently was to use the API instead of scanning the files, since I imagine the file list is going to grow and it doesn't make much sense to scan the whole list every day.

Contributor:

Yeah, checking the API could make sense, too. The one thing I'd caution is timezones -- your previous approach explicitly used UTC on both "old" and "now" timestamps, but I don't know what the API uses.

Second, the API only has dates, not times. Would that ever cause problems? E.g. we want to check for updates multiple times a day.

@aysim319 (Contributor, Author), Feb 5, 2025:

> Yeah, checking the API could make sense, too. The one thing I'd caution is timezones -- your previous approach explicitly used UTC on both "old" and "now" timestamps, but I don't know what the API uses

Since the data and the dates are just dates and not datetimes, I didn't take timezones into account... Hmm, I also don't know for sure which timezone it is; I believe it's EST, but I have to double check.

> the API only has dates, not times. Would that ever cause problems? E.g. we want to check for updates multiple times a day.

Since this is data that generally updates weekly, I was planning on just running once a day, so I thought timezones wouldn't be as much of an issue.

@nmdefries (Contributor), Feb 5, 2025:

Okay, given these complications, I'm thinking reading/writing to a file is easier. We wouldn't need to keep a complete list of all update datetimes ever, just the single most recent datetime. So the file wouldn't keep getting bigger and bigger, we could just read a single line.

This lets us store a UTC date (no timezones to worry about), no API date-processing to worry about, and we can store a datetime to be extra precise.
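
For illustration, a minimal sketch of the single-file approach described above -- the file path and helper name are hypothetical, not code from this PR:

```python
from datetime import datetime, timezone
from pathlib import Path

LAST_UPDATE_FILE = Path("last_updated_timestamp.txt")  # hypothetical location

def source_was_updated(response: dict) -> bool:
    """Return True if rowsUpdatedAt differs from the timestamp saved on the last pull."""
    current = datetime.fromtimestamp(int(response["rowsUpdatedAt"]), tz=timezone.utc)
    if LAST_UPDATE_FILE.exists():
        previous = datetime.fromisoformat(LAST_UPDATE_FILE.read_text().strip())
        if current == previous:
            return False  # already pulled this update; skip
    # record the latest update time so the next run compares against it
    LAST_UPDATE_FILE.write_text(current.isoformat())
    return True
```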

@aysim319 (Contributor, Author), Feb 6, 2025:

I wasn't a fan of having metadata files; it seems like overkill and introduces more complexity than I would like. So after talking things through with Nolan just now, I decided to simplify the logic and create backups daily, but still do a simple recently-updated check before actually continuing processing and creating the CSV files. That way, if there are outages that happen after the initial pulls, we can go back and do patches for them.

Nolan also mentioned that, in the future, we could look into creating a generic tool/script specifically for deduplicating things. I like that direction since it would separate that complexity out of this codebase.

nhsn/delphi_nhsn/pull.py (outdated; resolved)
@aysim319 (Contributor, Author) commented Feb 4, 2025:

> Are you planning on adding the new RSV signals in this PR or in a separate one?

I was originally planning for a separate one. I did consider adding the new signal in this PR, but when I tried to add just the new columns and ran the tests locally, I ran into issues, and it looks like it might be more involved, so I think it'd be better to create a separate PR. I also kinda shoved other issues into this one (retry and daily checking) and didn't want to add more things on top.

@aysim319 requested a review from nmdefries, February 5, 2025 15:57
nhsn/delphi_nhsn/pull.py (outdated; resolved)
nhsn/delphi_nhsn/pull.py (outdated; resolved)
@aysim319 requested a review from nolangormley, February 5, 2025 20:19
@nolangormley (Contributor) left a comment:

Some small questions

```diff
-    df = df.astype(TYPE_DICT)
+    try:
+        df = df.astype(TYPE_DICT)
+    except KeyError:
```
Contributor:

Why does this just pass?

@aysim319 (Contributor, Author), Feb 7, 2025:

The idea was that some of the older data in the source backups didn't have the newer signals (rsv, reporting hospitals); it would log that, but resume the patching process.

Previously I tried modifying TYPE_DICT, but it being mutable caused some issues in patching runs. So this was the next solution... is it a good one? Ehhh... I log that the signal is unavailable earlier (line 150), and I thought I shouldn't log basically the same message twice.

Since you brought it up... I should also check whether the rest of the columns actually changed data types, and maybe look into a less janky way.
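
As an aside, one possible less-janky variant, sketched here only as an illustration (it assumes TYPE_DICT, df, and the module's structured logger are in scope; it is not code from this PR): cast only the columns that are actually present and log the rest.

```python
# Sketch: restrict the dtype mapping to the columns present in this (possibly older) dataset.
available_types = {col: dtype for col, dtype in TYPE_DICT.items() if col in df.columns}
missing_cols = set(TYPE_DICT) - set(df.columns)
if missing_cols:
    logger.info("columns unavailable in this dataset; skipping cast for them", missing=sorted(missing_cols))
df = df.astype(available_types)
```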

logger.info(f"{prelim_prefix}NHSN data is stale; Skipping", updated_timestamp=updated_timestamp)
# pylint: disable=W0703
except Exception as e:
logger.info("error while processing socrata metadata; continuing processing", error=str(e))
Contributor:

It seems like this function would return true if it encounters an error?

@aysim319 (Contributor, Author):

Yeah, because it's just checking whether the data is stale or not; even if we already have the data, I don't want this check to be the reason that the indicator stops. It was only added to limit duplicating data.
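
For context, a minimal sketch of a fail-open staleness check along these lines (the function name and the metadata call are assumptions; only rowsUpdatedAt and the log messages come from the snippets above):

```python
from datetime import datetime, timedelta, timezone

def is_fresh_enough(client, dataset_id, logger, prelim_prefix=""):
    """Return True when processing should continue; fail open if the metadata check errors."""
    try:
        metadata = client.get_metadata(dataset_id)  # assumed metadata endpoint
        updated_timestamp = datetime.fromtimestamp(int(metadata["rowsUpdatedAt"]), tz=timezone.utc)
        if datetime.now(timezone.utc) - updated_timestamp >= timedelta(days=1):
            logger.info(f"{prelim_prefix}NHSN data is stale; Skipping", updated_timestamp=updated_timestamp)
            return False
        return True
    except Exception as e:  # pylint: disable=broad-except
        # the check only exists to limit duplicate pulls, so an error here should not stop the indicator
        logger.info("error while processing socrata metadata; continuing processing", error=str(e))
        return True
```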

```python
            time.sleep(2 + random.randint(0, 1000) / 1000.0)
            page = client.get(dataset_id, limit=limit, offset=offset)
        else:
            raise err
```
Contributor:

this should probably log as well

@aysim319 (Contributor, Author):

Ah yup, I missed that.
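
A rough sketch of that retry loop with logging added (the retry condition, attempt count, and exception type are assumptions; client, dataset_id, limit, offset, and the jittered 2-3 second sleep come from the snippet above):

```python
import random
import time

from requests.exceptions import HTTPError

MAX_RETRIES = 3  # assumption for illustration

for attempt in range(MAX_RETRIES + 1):
    try:
        page = client.get(dataset_id, limit=limit, offset=offset)
        break
    except HTTPError as err:
        retryable = err.response is not None and err.response.status_code >= 500
        if retryable and attempt < MAX_RETRIES:
            wait = 2 + random.randint(0, 1000) / 1000.0
            logger.info(
                "server error from Socrata; retrying",
                status_code=err.response.status_code,
                attempt=attempt + 1,
                wait_seconds=wait,
            )
            time.sleep(wait)
        else:
            logger.error("Socrata request failed; giving up", error=str(err))
            raise err
```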

Development

Successfully merging this pull request may close these issues.

add retry for 50x error for socrata api and also throw error when that happens
3 participants