Example H3 data sets #841

Open
ajfriend opened this issue May 18, 2024 · 3 comments

@ajfriend
Contributor

ajfriend commented May 18, 2024

It would be nice to have a collection of data sets using H3 that folks can use for examples or are just generally useful.

Some ideas:

  • countries, US states, US zip codes as H3 cells at various resolutions
  • water vs land cells

https://geodatasets.readthedocs.io/en/latest/introduction.html is a Python package that does something similar, but for general geographic datasets.

Aside from what examples we want, I think we'd also need to decide:

  • what data format we'd use, or if we'd use multiple
  • how we store the examples---in the repo, or point to external hosting
@isaacbrodsky
Collaborator

> It would be nice to have a collection of data sets using H3 that folks can use for examples or are just generally useful.

This seems like it could be helpful as a reference dataset.

> Some ideas:

Another one that comes to mind is the set of US census geometries (essentially, anything in the TIGER dataset).

> https://geodatasets.readthedocs.io/en/latest/introduction.html is a Python package that does something similar, but for general geographic datasets.
>
> Aside from what examples we want, I think we'd also need to decide:
>
> * what data format we'd use, or if we'd use multiple

I think it would make sense to have multiple formats: some users might want a simple text-based format like CSV or JSON, while others may prefer efficient binary formats like Parquet (with cells stored as uint64).
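To make the format trade-off concrete, here is a minimal sketch that writes the same cells as CSV, as JSON, and as packed uint64 values. It relies only on the fact that an H3 index string is the hexadecimal form of a 64-bit integer. The sample cell IDs and all helper names are illustrative assumptions, and the packed bytes only stand in for a binary column; writing real Parquet would need a library such as pyarrow, which is not shown.

```python
import csv
import io
import json
import struct

# Sample H3 cell IDs (illustrative values, assumed to be resolution-9 cells).
CELLS = ["8928308280fffff", "8928308280bffff", "89283082873ffff"]


def str_to_int(cell: str) -> int:
    """Parse the hexadecimal H3 string into its 64-bit integer form."""
    return int(cell, 16)


def int_to_str(value: int) -> str:
    """Format a 64-bit H3 integer back into its lowercase hex string."""
    return format(value, "x")


def to_csv(cells) -> str:
    """One cell per row under a single 'h3' header (assumed layout)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["h3"])
    writer.writerows([c] for c in cells)
    return buf.getvalue()


def to_json(cells) -> str:
    """Cells as a JSON array of strings under an 'h3' key (assumed layout)."""
    return json.dumps({"h3": list(cells)})


def to_uint64_bytes(cells) -> bytes:
    """Pack cells as little-endian unsigned 64-bit integers, 8 bytes each."""
    ints = [str_to_int(c) for c in cells]
    return struct.pack(f"<{len(ints)}Q", *ints)


def from_uint64_bytes(data: bytes) -> list:
    """Inverse of to_uint64_bytes: recover the hex strings."""
    count = len(data) // 8
    return [int_to_str(v) for v in struct.unpack(f"<{count}Q", data)]


if __name__ == "__main__":
    print(len(to_csv(CELLS)), "bytes as CSV")
    print(len(to_json(CELLS)), "bytes as JSON")
    print(len(to_uint64_bytes(CELLS)), "bytes as packed uint64")
```

The integer form halves the per-cell storage relative to the 15-character hex string, which is the main reason a uint64 column is attractive for binary formats.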

> * how we store the examples---in the repo, or point to external hosting

Considering the format duplication, the fact that the text files can be very large, and the relatively independent maintenance concerns, I recommend hosting them outside of the repo. I believe we already do that in master for the country geometries used in testing.
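As a rough illustration of how quickly file sizes can grow, the sketch below counts the cells that cover the whole globe at a given resolution (2 + 120 * 7^res) and multiplies by an assumed per-cell size. The 16-byte text figure (15 hex characters plus a newline) and the function names are assumptions for illustration; real datasets cover only part of the globe and usually compress well, so treat these as uncompressed upper bounds.

```python
HEX_TEXT_BYTES = 16  # 15 hex characters plus a newline (assumed CSV layout)
UINT64_BYTES = 8     # one unsigned 64-bit integer per cell


def num_cells(res: int) -> int:
    """Total number of H3 cells covering the globe at resolution 0-15."""
    if not 0 <= res <= 15:
        raise ValueError("H3 resolutions range from 0 to 15")
    return 2 + 120 * 7**res


def size_estimate(res: int) -> tuple:
    """Return (text_bytes, binary_bytes) for a full-globe cell list."""
    n = num_cells(res)
    return n * HEX_TEXT_BYTES, n * UINT64_BYTES


if __name__ == "__main__":
    for res in (0, 3, 6, 9):
        text_bytes, binary_bytes = size_estimate(res)
        print(f"res {res}: {num_cells(res):,} cells, "
              f"{text_bytes:,} B as text, {binary_bytes:,} B as uint64")
```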

@ajfriend
Contributor Author

ajfriend commented May 19, 2024

> I think it would make sense to have multiple formats, some users might want a simple text based format like CSV or JSON, while others may prefer efficient binary formats like Parquet (as uint64).

Agreed.

> Considering the format duplication, the fact that the text files can be very large, and the relatively independent maintenance concerns, I recommend outside of the repo. I believe we already do that in master for country geometries used in testing.

Yes, I definitely agree we should host these through a separate repo (maybe something like h3datasets?). My question was more whether that repo should host the raw data itself or point to some other storage location. The geodatasets package uses the latter strategy. If we used the former, I was curious whether we might run into GitHub file and repo size limits (the repo we point to here comes in at 17GB). Maybe we can start with the in-repo approach and pivot to external hosting if necessary. If we do end up needing external storage, any ideas on what services we might use?

@isaacbrodsky
Collaborator

> Yes, I definitely agree we should host these through a separate repo (maybe something like h3datasets?). It was more that I was wondering if in that repo we host the raw data, or if it should point to some other storage location. The geodatasets package uses the latter strategy. If we were using the former strategy, I was curious if we thought we might run into github file and repo size limits (the repo we point to here comes in at 17GB). Maybe we can start with the in-repo approach and pivot to external hosting if necessary. If we do end up needing external storage, any ideas on what services we might use?

Ah, I see. The two options I'd suggest are S3 and Cloudflare R2. R2 is cheaper and more modern (which incidentally can cause issues if you happen to use HTTP-only software, as it enforces SSL). In the meantime, hosting in the repo seems like an OK place to start.
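If the data does move to external storage, the "point to some other location" strategy could look roughly like the sketch below: a small registry maps each dataset name to a URL and a checksum, and a fetch helper downloads, verifies, and caches the file. Everything here is hypothetical: the registry entry, the example.com URL, the placeholder hash, and the `fetch` function are assumptions for illustration, not an existing h3 or geodatasets API.

```python
import hashlib
import urllib.request
from pathlib import Path

# Hypothetical registry: dataset name -> (URL, expected SHA-256 hex digest).
# The URL and hash are placeholders, not real hosted files.
REGISTRY = {
    "countries_res5.csv": (
        "https://example.com/h3datasets/countries_res5.csv",
        "0" * 64,
    ),
}


def fetch(name, registry=REGISTRY, cache_dir="h3_cache"):
    """Return a local path for `name`, downloading it on first use.

    The content is checked against the registry's SHA-256 digest, and a
    ValueError is raised if it does not match.
    """
    url, expected_sha256 = registry[name]
    target = Path(cache_dir) / name

    if target.exists():
        data = target.read_bytes()  # reuse the cached copy
    else:
        with urllib.request.urlopen(url) as response:
            data = response.read()

    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch for {name}")

    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)  # cache only verified content
    return target
```

Because the registry holds only URLs and hashes, the same small repo could start by pointing at files stored in the repo itself and later switch to S3 or R2 by changing the URLs.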
