@@ -18,43 +18,53 @@ kernelspec:
 ## Learning Goals
 By the end of this tutorial, you will:
 
-- Understand the format, partitioning, and schema of this dataset.
-- Be able to query this dataset for likely stars.
+- Access basic metadata to understand the format and schema of this unified HATS Parquet dataset.
+- Visualize the HATS partitioning of this dataset.
+- Query this dataset for likely stars and create a color-magnitude diagram. (Recreate the figure from
+  [Introduction to Euclid Q1 MER catalog](https://caltech-ipac.github.io/irsa-tutorials/tutorials/euclid_access/2_Euclid_intro_MER_catalog.html),
+  this time with *all* likely stars.)
 
 +++
 
 ## Introduction
 
 +++
 
-This notebook demonstrates accesses to a copy of the
+This notebook demonstrates access to a version of the
 [Euclid Q1](https://irsa.ipac.caltech.edu/data/Euclid/docs/overview_q1.html) MER Catalogs
 that is in Apache Parquet format, partitioned according to the
 Hierarchical Adaptive Tiling Scheme (HATS), and stored in an AWS S3 bucket.
-Parquet is a file format that enables flexible and efficient data access by, among other things,
-supporting the application of both column and row filters when reading the data (very similar to a SQL query)
-so that only the desired data is loaded into memory.
 
-This is a single parquet dataset which comprises all three MER Catalogs
+The catalog version accessed here is a single dataset which comprises all three MER Catalogs
 -- MER, MER Morphology, and MER Cutouts -- which have been joined by Object ID.
 Their schemas (pre-join) can be seen at
 [Euclid Final Catalog description](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/merdpd/dpcards/mer_finalcatalog.html).
 Minor modifications were made to the parquet schema to accommodate the join (de-duplicating column names)
 and for the HATS standard. These differences are shown below.
 
+Parquet is a file format that enables flexible and efficient data access by, among other things,
+supporting the application of both column and row filters when reading the data (very similar to a SQL query)
+so that only the desired data is loaded into memory.
+
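+For a concrete toy illustration of those two filter types, the following sketch writes a tiny
+throwaway parquet file and reads it back with a column filter and a row filter applied (the file
+and values are illustrative only; the real catalog below is read the same way, via `lsdb`):
+
+```{code-cell}
+import pandas as pd
+
+# Tiny throwaway file, purely for demonstration.
+pd.DataFrame({"OBJECT_ID": [1, 2, 3], "FLUX_VIS_PSF": [-1.0, 0.5, 2.0]}).to_parquet("demo.parquet")
+
+df = pd.read_parquet(
+    "demo.parquet",
+    columns=["FLUX_VIS_PSF"],  # column filter: only this column is loaded
+    filters=[("FLUX_VIS_PSF", ">", 0)],  # row filter: only rows passing the predicate are loaded
+)
+print(df)
+```
+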
 HATS is a spatial partitioning scheme based on HEALPix that aims to
 produce partitions (files) of roughly equal size.
-This makes them more efficient to work with,
+This makes the files more efficient to work with,
 especially for large-scale analyses and/or parallel processing.
-This notebook demonstrates the basics.
+It does this by adapting the HEALPix order at which data is partitioned in a given catalog based
+on the on-sky density of the rows it contains.
+In other words, data from dense regions of sky will be partitioned at a higher order
+(i.e., higher resolution; smaller pixel size) than data in sparse regions.
+HATS-aware Python packages are being developed to take full advantage of the partitioning.
+In this notebook, we will use the [hats](https://hats.readthedocs.io/) library to visualize the
+catalog and access the schema, and [lsdb](https://docs.lsdb.io/) to query for all likely stars.
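+
+To make the order-to-resolution relationship concrete, here is a small sketch of the standard
+HEALPix bookkeeping (pure math; nothing here is specific to this dataset):
+
+```{code-cell}
+import math
+
+# Each HEALPix order has 12 * 4**order pixels covering the full sky (~41,253 sq deg),
+# so every increment of the order quarters the pixel area.
+full_sky_deg2 = 4 * math.pi * (180 / math.pi) ** 2
+for order in (0, 5, 10):
+    npix = 12 * 4**order
+    print(f"order {order:2d}: {npix:>13,} pixels, {full_sky_deg2 / npix:.6f} sq deg per pixel")
+```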
 
 +++
 
 ## Installs and imports
 
 ```{code-cell}
-# !pip uninstall -y numpy pyerfa  # Helps resolve numpy>=2.0 dependency issues.
-# !pip install 'hats>=0.5' 'lsdb>=0.5' matplotlib numpy s3fs
+# Uncomment the next line to install dependencies if needed.
+# !pip install 'hats>=0.5' 'lsdb>=0.5' matplotlib 'numpy>=2.0' 'pyerfa>=2.0.1.3' s3fs
 ```
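+
+If the pins resolve in a surprising way, a quick version check can help (a minimal sketch using
+only the standard library):
+
+```{code-cell}
+# Print the installed versions of this tutorial's dependencies.
+import importlib.metadata
+
+for pkg in ("hats", "lsdb", "numpy", "pyerfa", "s3fs"):
+    print(pkg, importlib.metadata.version(pkg))
+```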
 
 ```{code-cell}
@@ -74,27 +84,28 @@ If you run into an error that starts with,
 make sure you have restarted the kernel since doing `pip install`. Then re-run the cell.
 ```
 
++++
+
 ## 1. Setup
 
 ```{code-cell}
-# Need UPath for the testing bucket. Otherwise hats will ignore the credentials that Fornax
-# provides under the hood. Will be unnecessary after the dataset is released in a public bucket.
-from upath import UPath
-
 # AWS S3 path where this dataset is stored.
 s3_bucket = "irsa-fornax-testdata"
 s3_key = "EUCLID/q1/mer_catalogue/hats"
-euclid_s3_path = UPath(f"s3://{s3_bucket}/{s3_key}")
-
-# Note: If running from IPAC, you need an anonymous connection. Uncomment the next line.
-# euclid_s3_path = UPath(f"s3://{s3_bucket}/{s3_key}", anon=True)
-```
-
-We will use [`hats`](https://hats.readthedocs.io/) to visualize the catalog and access the schema.
-
-```{code-cell}
-# Load the parquet dataset using hats.
-euclid_hats = hats.read_hats(euclid_s3_path)
+euclid_s3_path = f"s3://{s3_bucket}/{s3_key}"
+
+# Temporary try/except to handle credentials in different environments before public release.
+try:
+    # If running from within IPAC's network (maybe VPN'd in with "tunnel-all"),
+    # your IP address acts as your credentials and this should just work.
+    hats.read_hats(euclid_s3_path)
+except FileNotFoundError:
+    # If running from Fornax, credentials are provided automatically under the hood, but
+    # hats ignores them in the call above and raises a FileNotFoundError.
+    # Construct a UPath, which will pick up the credentials.
+    from upath import UPath
+
+    euclid_s3_path = UPath(f"s3://{s3_bucket}/{s3_key}")
 ```
 
 ## 2. Visualize the on-sky density of Q1 Objects and HATS partitions
@@ -105,20 +116,17 @@ Euclid Q1 covers four non-contiguous fields: Euclid Deep Field North (22.9 sq de
 We can visualize the Object density in the four fields using `hats`.
 
 ```{code-cell}
+# Load the dataset.
+euclid_hats = hats.read_hats(euclid_s3_path)
+
 # Visualize the on-sky distribution of objects in the Q1 MER Catalog.
 hats.inspection.plot_density(euclid_hats)
 ```
 
-HATS does this by adjusting the partitioning order (i.e., HEALPix order at which data is partitioned)
-according to the on-sky density of the objects or sources (rows) in the dataset.
-In other words, dense regions are partitioned at a
-higher HEALPix order (smaller pixel size) to reduce the number of objects in those partitions towards the mean;
-vice versa for sparse regions.
-
-We can see this by plotting the partitioning orders.
+We can see how the on-sky density maps to the HATS partitions by calling `plot_pixels`.
 
 ```{code-cell}
-# Visualize the HEALPix order of each partition.
+# Visualize the HEALPix orders of the dataset partitions.
 hats.inspection.plot_pixels(euclid_hats)
 ```
 
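+The same information can be summarized numerically. Here is a minimal sketch, assuming the
+catalog object returned by `hats.read_hats` exposes its partition list via `get_healpix_pixels()`:
+
+```{code-cell}
+from collections import Counter
+
+# Count the partitions at each HEALPix order; each corresponds to one partition (file).
+orders = Counter(pixel.order for pixel in euclid_hats.get_healpix_pixels())
+for order, n_partitions in sorted(orders.items()):
+    print(f"order {order}: {n_partitions} partitions")
+```
+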
@@ -128,12 +136,10 @@ hats.inspection.plot_pixels(euclid_hats)
 
 In this section, we query the Euclid Q1 MER catalogs for likely stars and create a color-magnitude diagram (CMD), following
 [Introduction to Euclid Q1 MER catalog](https://caltech-ipac.github.io/irsa-tutorials/tutorials/euclid_access/2_Euclid_intro_MER_catalog.html).
-Here, we'll use [`lsdb`](https://docs.lsdb.io/) to query the parquet files that are sitting in an S3 bucket (the intro notebook uses `pyvo` to query the TAP service).
+Here, we use `lsdb` to query the parquet files stored in an S3 bucket (the intro notebook uses `pyvo` to query the TAP service).
 `lsdb` enables efficient, large-scale queries on HATS catalogs, so let's look at *all* likely stars in Euclid Q1 instead of limiting to 10,000.
 
-+++
-
-`lsdb` uses Dask for parallelization. Set up the workers.
+`lsdb` uses Dask for parallelization, so first set up the workers.
 
 ```{code-cell}
 client = dask.distributed.Client(
@@ -144,7 +150,7 @@ client = dask.distributed.Client(
 The data will be lazy-loaded. This means that commands like `query` are not executed until the data is actually required.
 
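+As a minimal sketch of what "lazy" means here (the filter is an illustrative placeholder, not the
+star/galaxy cut used below):
+
+```{code-cell}
+# Nothing is read from S3 yet; .query() only records the plan.
+demo = lsdb.read_hats(euclid_s3_path, columns=["FLUX_VIS_PSF"])
+planned = demo.query("FLUX_VIS_PSF > 0")  # returns immediately
+planned  # displaying the object shows the plan, not the data; .compute() would execute it
+```
+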
 ```{code-cell}
-# Load the parquet dataset using lsdb.
+# Load the dataset.
 columns = [
     "TILEID",
     "FLUX_VIS_PSF",
@@ -209,7 +215,9 @@ notebook shows how to work with parquet schemas.
 
 ```{code-cell}
 # Fetch the pyarrow schema from hats.
+euclid_hats = hats.read_hats(euclid_s3_path)
 schema = euclid_hats.schema
+
 print(f"{len(schema)} columns in the combined Euclid Q1 MER Catalogs")
 ```
 
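+Individual columns can be drilled into as well; a short sketch using the standard pyarrow schema
+accessors:
+
+```{code-cell}
+# Peek at the first few column names and the type of one known column.
+print(schema.names[:5])
+print(schema.field("TILEID").type)
+```
+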
@@ -254,6 +262,6 @@ print(schema.field("RIGHT_ASCENSION-CUTOUTS").metadata)
 
 **Authors:** Troy Raen (Developer; Caltech/IPAC-IRSA) and the IRSA Data Science Team.
 
-**Updated:** 2025-03-25
+**Updated:** 2025-03-29
 
 **Contact:** [IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or problems.