Add Euclid MER HATS Parquet notebook #73
base: main
Conversation
Minor comments. If you plan to push many commits here while developing, we should consider temporarily turning off execution for the rendering, too.
I'm not sure we should do the numpy uninstall trick in a notebook; it's bad enough that we have installs in there 😅 (That said, I wonder why the install command is not picking up on the numpy upgrade. After all, the minimum dependency changed due to the lsdb/hats requirements.)
OK, some suggestions for swapping out the pip uninstall line.
Sorry about the conf.py conflict; you may want to rebase now.
Co-authored-by: Brigitta Sipőcz <[email protected]>
Rebased and force-pushed.
@troyraen this notebook was a good exercise for me to learn how to access HATS-format data.
The code looks good and was easy to follow; I mostly have comments about the text. Please note that my comments come from the POV of someone who is new to HATS, LSDB, Dask, etc., so feel free to ignore the ones you think are too obvious for an average reader of this tutorial.
> supporting the application of both column and row filters when reading the data (very similar to a SQL query)
> so that only the desired data is loaded into memory.
>
> HATS is a spatial partitioning scheme based on HEALPix that aims to
Maybe add a hyperlink to "HEALPix"?
Found this through the HATS docs: https://healpix.jpl.nasa.gov/healpixBackgroundPurpose.shtml - the figure there (along with the text, especially points 1 and 2) made it intuitive for me what it means.
> It does this by adapting the HEALPix order at which data is partitioned in a given catalog based
> on the on-sky density of the rows it contains.
> In other words, data from dense regions of sky will be partitioned at a higher order
> (i.e., higher resolution; smaller pixel size) than data in sparse regions.
Consider changing to something like:

```diff
- (i.e., higher resolution; smaller pixel size) than data in sparse regions.
+ (i.e., higher resolution; more pixels/tiles with smaller area) than data in sparse regions.
```
Maybe it's just me: I got confused about the relationship between order and pixel resolution - does a higher order mean we are zoomed in or out of the sphere? My first guess was zoomed out, but that's not the case.
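To make the order/resolution relationship concrete, here's a small sketch in plain Python (no healpy needed; only the standard HEALPix counting rule that order k divides the sphere into 12 * 4**k equal-area pixels):

```python
import math

# HEALPix: at order k, the sphere is split into 12 * 4**k equal-area pixels
# (nside = 2**order). Each +1 in order quadruples the pixel count, so each
# pixel covers a quarter of the area: higher order = zoomed IN, smaller pixels.
def healpix_npix(order):
    """Number of HEALPix pixels at a given order."""
    return 12 * 4**order

def healpix_pixel_area_deg2(order):
    """Area of one HEALPix pixel in square degrees."""
    sphere_deg2 = 4 * math.pi * (180 / math.pi) ** 2  # whole sky ~ 41253 deg2
    return sphere_deg2 / healpix_npix(order)

for order in (0, 5, 10):
    print(order, healpix_npix(order), healpix_pixel_area_deg2(order))
```

Printing a few orders shows the pixel count exploding while the per-pixel area shrinks, which is exactly the "dense regions get higher order" trade-off the text describes.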
```python
euclid_hats = hats.read_hats(euclid_s3_path)

# Visualize the on-sky distribution of objects in the Q1 MER Catalog.
hats.inspection.plot_density(euclid_hats)
```
The units on the colorbar of the resultant plot are `count / deg2 sq`, but they should be `count / deg2` or `count / deg sq` - is this a bug in the upstream hats library?
```python
# Visualize the HEALPix orders of the dataset partitions.
hats.inspection.plot_pixels(euclid_hats)
```
Consider adding a textual explanation of what we notice in the resultant plot, to make sure readers of this tutorial are following along.
At first it was not obvious to me how to read this plot (since I have never seen such plots) until I saw this: https://hats.readthedocs.io/en/latest/guide/directory_scheme.html#partitioning-scheme. The bullets there were very helpful; maybe we should add something similar here (a higher-order partition means smaller-area tiles, i.e., smaller pixel size), or at least link to that page.
```python
client = dask.distributed.Client(
    n_workers=os.cpu_count(), threads_per_worker=2, memory_limit="auto"
)
```
I got confused here: I was expecting subsequent lsdb methods to take this `client` explicitly as a parameter, but that doesn't happen. The lsdb docs were not very clear on this either, so after some googling (and asking ChatGPT) I learned that LSDB's internal Dask-aware code implicitly picks up the `client` we created (we're essentially overriding Dask's default synchronous scheduler with the parallelized/distributed scheduler we created).
Maybe we should add this "gotcha" to the narrative for readers unfamiliar with Dask?
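To illustrate the gotcha, here is a minimal pure-Python analogue of the pattern (illustrative only; this is not Dask's or LSDB's actual code): creating a client registers it as a process-wide default, and later library calls look that default up rather than receiving the client as a parameter.

```python
# Minimal analogue of "the most recent Client becomes the default scheduler".
_default_scheduler = "synchronous"  # fallback when no client has been created

class Client:
    def __init__(self, name):
        self.name = name
        # On creation, the client registers itself as the process-wide default.
        global _default_scheduler
        _default_scheduler = self

def library_compute(task):
    # Library code (think: lsdb internals) never receives the client as a
    # parameter; it looks up whatever default is currently registered.
    sched = _default_scheduler
    label = sched.name if isinstance(sched, Client) else sched
    return f"ran {task} on {label}"

print(library_compute("partition-0"))  # uses the synchronous fallback
client = Client("distributed")
print(library_compute("partition-1"))  # implicitly picks up the new client
```

Simply constructing the client changes the behavior of every subsequent call, which is why no lsdb method needs it passed in.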
```python
# Set up the query for likely stars.
star_cuts = "FLUX_VIS_PSF > 0 & FLUX_Y_TEMPLFIT > 0 & FLUX_J_TEMPLFIT > 0 & FLUX_H_TEMPLFIT > 0 & POINT_LIKE_FLAG == 1"
euclid_stars = euclid_lsdb.query(star_cuts)
```
```diff
  euclid_stars = euclid_lsdb.query(star_cuts)
+ euclid_stars
```

The output of this makes it clear what the text above says. (Also I was just curious to see :))
```python
star_cuts = "FLUX_VIS_PSF > 0 & FLUX_Y_TEMPLFIT > 0 & FLUX_J_TEMPLFIT > 0 & FLUX_H_TEMPLFIT > 0 & POINT_LIKE_FLAG == 1"
euclid_stars = euclid_lsdb.query(star_cuts)
```
Maybe add some text here to clarify that `head()` implicitly calls `compute()` to load the starting slice of data into memory, so the following cell will take some time. It took 20-30 s for me locally, and I wasn't sure why until I read this: https://docs.lsdb.io/en/stable/tutorials/filtering_large_catalogs.html#Previewing-part-of-the-data
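A tiny sketch of the lazy-evaluation pattern may help readers see why `head()` triggers real work (this is an analogy only, not LSDB's internals; the `LazyFrame` class is made up for illustration):

```python
# Lazy evaluation: operations only record a plan; head() forces execution.
class LazyFrame:
    def __init__(self, rows, plan=None):
        self._rows = rows
        self._plan = plan or []  # deferred filter operations

    def query(self, predicate):
        # Record the filter instead of running it now; return a new lazy frame.
        return LazyFrame(self._rows, self._plan + [predicate])

    def compute(self):
        # Actually run every deferred operation and materialize the rows.
        rows = self._rows
        for pred in self._plan:
            rows = [r for r in rows if pred(r)]
        return rows

    def head(self, n=5):
        # head() implicitly computes, then returns the starting slice.
        return self.compute()[:n]

lf = LazyFrame(list(range(10))).query(lambda r: r % 2 == 0)
print(lf.head(3))  # → [0, 2, 4]
```

Nothing is filtered when `query` runs; the cost only shows up at `head()`/`compute()`, which is why that cell is the slow one.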
> We peeked at the data but we haven't loaded all of it yet.
> What we really need in order to create a CMD is the magnitudes, so let's calculate those now.
> Appending `.compute()` to the commands will trigger Dask to actually load this data into memory.
> It is not strictly necessary, but will allow us to look at the data repeatedly without having to re-load it each time.
I find it helpful when the text warns about a long-running cell, to avoid assuming that I did something wrong. The following cell took ~12 min for me locally (VPNed at home). Maybe we can add a rough estimate here?
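For readers unfamiliar with the flux-to-magnitude step the quoted text mentions, a small sketch: assuming the catalog fluxes are in microjanskys, the AB magnitude is m = 23.9 - 2.5*log10(F_uJy) (23.9 is the standard AB zero point for uJy fluxes). The flux values below are hypothetical, though the variable names follow the MER columns:

```python
import math

def ab_mag(flux_ujy):
    """AB magnitude from a flux in microjanskys: m = 23.9 - 2.5*log10(F_uJy)."""
    return 23.9 - 2.5 * math.log10(flux_ujy)

# Hypothetical fluxes for one source (numbers made up for illustration).
flux_y_templfit, flux_h_templfit = 50.0, 80.0
mag_y = ab_mag(flux_y_templfit)
mag_h = ab_mag(flux_h_templfit)
color = mag_y - mag_h  # a CMD plots a color (e.g. Y - H) against a magnitude
print(mag_y, mag_h, color)
```

Since the magnitude only depends on the ratio of the flux to the zero-point flux, the color depends only on the flux ratio between the two bands.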
```python
client.close()
```

> ## 4. Schema
I found this section to be somewhat standalone rather than a continuation of the story that sections 1-3 were building.
Maybe change the name a bit to reflect that?

```diff
- ## 4. Schema
+ ## 4. Inspecting MER Catalog's Parquet Schema
```
```python
hats.inspection.plot_pixels(euclid_hats)
```

> ## 3. CMD of stars in Euclid Q1
To highlight the USP of this notebook!
```diff
- ## 3. CMD of stars in Euclid Q1
+ ## 3. CMD of ALL stars in Euclid Q1
```
This PR adds a notebook with an introduction to the HATS version of the Euclid Q1 MER catalogs that IRSA is preparing to release. The dataset is currently in a testing bucket that is available from Fornax and IPAC networks only (see nasa-fornax/fornax-demo-notebooks#394 for details).
Note: Before release, I plan to update both the dataset and the notebook to include the data from the Q1 PHZ (photo-z) catalogs along with the MER data that is already there. Many Euclid use cases will require a redshift; making this product will give users easier access to that information because they won't have to join the tables themselves. We are interested in adding the spectroscopy catalogs as well, but that may or may not happen in this first round.