diff --git a/toc.yml b/toc.yml
index 4d34fb0e..c999ec93 100644
--- a/toc.yml
+++ b/toc.yml
@@ -21,6 +21,7 @@ project:
   - file: tutorials/euclid_access/4_Euclid_intro_PHZ_catalog.md
   - file: tutorials/euclid_access/5_Euclid_intro_SPE_catalog.md
   - file: tutorials/cloud_access/euclid-cloud-access.md
+  - pattern: tutorials/parquet-catalog-demos/euclid-q1-hats/*-euclid-q1-hats-*.md
   - title: WISE
     children:
       - title: AllWISE

diff --git a/tutorials/cloud_access/euclid-cloud-access.md b/tutorials/cloud_access/euclid-cloud-access.md
index 66b7b95d..c34e56f6 100644
--- a/tutorials/cloud_access/euclid-cloud-access.md
+++ b/tutorials/cloud_access/euclid-cloud-access.md
@@ -227,7 +227,7 @@ s3.ls(f'{BUCKET_NAME}/q1/catalogs/MER_FINAL_CATALOG/{mer_tile_id}')
As per the "Browsable Directories" section in the [user guide](https://irsa.ipac.caltech.edu/data/Euclid/docs/euclid_archive_at_irsa_user_guide.pdf), we can use `catalogs/MER_FINAL_CATALOG/{tile_id}/EUC_MER_FINAL-CAT*.fits` for listing the objects catalogued. We can read the identified FITS file as a table and filter on the ra and dec columns to find object ID(s) only for the target we picked. But that would be an expensive operation, so we will instead use astroquery (in the next section) to do a spatial search in the MER catalog provided by IRSA.

```{note}
-Once the catalogs are available as Parquet files in the cloud, we can efficiently do spatial filtering directly on the cloud-hosted file to identify object ID(s) for our target. But for the time being, we can use catalog VO services through astroquery to do the same.
+The Euclid Q1 catalogs are now also available as cloud-hosted Parquet files, on which spatial filtering can be done directly and efficiently; see the Euclid Q1 HATS tutorials added in this repository. Here, we use catalog VO services through astroquery to do the same.
```

+++

diff --git a/tutorials/parquet-catalog-demos/euclid-q1-hats/1-euclid-q1-hats-intro.md b/tutorials/parquet-catalog-demos/euclid-q1-hats/1-euclid-q1-hats-intro.md
new file mode 100644
index 00000000..ced0061c
--- /dev/null
+++ b/tutorials/parquet-catalog-demos/euclid-q1-hats/1-euclid-q1-hats-intro.md
@@ -0,0 +1,383 @@
---
short_title: "HATS: Introduction"
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
  jupytext_version: 1.16.1
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

(euclid-q1-hats-intro)=
# Euclid Q1 Merged Objects HATS Catalog: Introduction

+++

This tutorial is an introduction to the content and format of the Euclid Q1 Merged Objects HATS Catalog.
Later tutorials in [this series](#euclid-q1-hats) will show how to load quality samples.

## Learning Goals

In this tutorial, we will:

- List the Euclid Q1 tables that were joined to create the Merged Objects catalog, and the Euclid papers that describe how the data were produced.
- Find columns of interest.
- Perform a basic spatial query in each of the Euclid Deep Fields using PyArrow.

+++

## 1. Introduction

The [Euclid Q1](https://irsa.ipac.caltech.edu/data/Euclid/docs/overview_q1.html) catalogs were derived from Euclid photometry and spectroscopy, taken by the Visible Camera (VIS) and the Near-Infrared Spectrometer and Photometer (NISP), and from photometry taken by other ground-based instruments.
The data include several flux measurements per band, several redshift estimates, several morphology parameters, etc.
Each was derived for different science goals using different algorithms or configurations.

The Euclid Q1 Merged Objects HATS Catalog was produced by joining 14 of the original catalogs on object ID (column: `object_id`).
Following the HATS framework, the data were then partitioned spatially (by right ascension and declination) and written as an Apache Parquet dataset.
A visualization of the Euclid Q1 on-sky density and the HATS partitioning of this catalog can be seen in these [skymaps](https://irsa.ipac.caltech.edu/data/download/parquet/euclid/q1/merged_objects/hats/euclid_q1_merged_objects-hats/skymap.png).

- Columns: 1,594
- Rows: 29,953,430 (one per Euclid Q1 object)
- Size: 400 GB

The catalog is served from an AWS S3 cloud storage bucket.
Access is free and no credentials are required.

+++

## 2. Imports

```{code-cell}
# # Uncomment the next line to install dependencies if needed.
# %pip install hpgeom pyarrow
```

```{code-cell}
import hpgeom  # Find HEALPix indexes from RA and Dec
import pyarrow.compute as pc  # Filter the catalog
import pyarrow.dataset  # Load the catalog
import pyarrow.fs  # Simple S3 filesystem pointer
import pyarrow.parquet  # Load the schema
```

## 3. Load Parquet Metadata

First we'll load the Parquet schema (column information) of the Merged Objects catalog so we can use it in later sections.
The Parquet schema is accessible from a few locations, all of which include the column names and types.
Here, we load it from the `_common_metadata` file because it also includes the column units and descriptions.

```{code-cell}
# AWS S3 paths.
s3_bucket = "nasa-irsa-euclid-q1"
dataset_prefix = "contributed/q1/merged_objects/hats/euclid_q1_merged_objects-hats/dataset"

dataset_path = f"{s3_bucket}/{dataset_prefix}"
schema_path = f"{dataset_path}/_common_metadata"

# S3 pointer. Use `anonymous=True` to access without credentials.
s3 = pyarrow.fs.S3FileSystem(anonymous=True)
```

```{code-cell}
# Load the Parquet schema.
schema = pyarrow.parquet.read_schema(schema_path, filesystem=s3)

# There are almost 1600 columns in this dataset.
print(f"{len(schema)} columns in the Euclid Q1 Merged Objects catalog")
```

## 4. Merged Objects Catalog Contents

+++

The Merged Objects catalog contains data from 14 Euclid Q1 tables, joined on the column `object_id`.
The tables were produced by three Euclid processing functions: MER, PHZ, and SPE.
The subsections below include the table names, links to reference papers, URLs to the original table schemas, and examples of how the column names were transformed for the Merged Objects catalog.

The original tables' column names are mostly in all caps.
In the Merged Objects catalog and the catalogs available through IRSA's TAP service, all column names have been lower-cased.
In addition, all non-alphanumeric characters have been replaced with an underscore for compatibility with various libraries and services.
Finally, the original table name has been prepended to column names in the Merged Objects catalog, both for provenance and to avoid duplicates.
An example that includes all of these transformations is: `E(B-V)` --> `physparamqso_e_b_v_`.
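
As a rough sketch, the renaming convention can be reproduced with a few lines of Python (this is only an illustration of the rules described above, not the actual pipeline code):

```{code-cell}
import re


def normalize_column_name(original_name, table_name):
    """Illustrate the Merged Objects column-naming convention."""
    # Lower-case, then replace every non-alphanumeric character with an underscore.
    cleaned = re.sub(r"[^a-z0-9]", "_", original_name.lower())
    # Prepend the originating table name for provenance and to avoid duplicates.
    return f"{table_name}_{cleaned}"


normalize_column_name("E(B-V)", "physparamqso")  # -> 'physparamqso_e_b_v_'
```

Three columns have special names that differ from the standard naming convention described above:

- `object_id` : Euclid MER Object ID. Unique identifier of a row in this dataset. No table name prepended.
- `ra` : 'RIGHT_ASCENSION' from the 'mer' table. Name shortened to match other IRSA services. No table name prepended.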
+- `dec` : 'DECLINATION' from the 'mer' table. Name shortened to match other IRSA services. No table name prepended. + +Seven additional columns have been added to the Merged Objects catalog that are not in the original Euclid tables. +They are described below, after the Euclid tables. + +### 4.1 MER tables + +The Euclid MER processing function produced three tables. +The reference paper is [Euclid Collaboration: Romelli et al., 2025](https://arxiv.org/pdf/2503.15305) (hereafter, Romelli). +The tables are: + +**Main table (mer)** +- Described in Romelli sections 6 & 8 ("EUC_MER_FINAL-CAT") +- Original schema: [Main catalog FITS file](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/merdpd/dpcards/mer_finalcatalog.html#main-catalog-fits-file) +- Example column name transform: `FLUX_DETECTION_TOTAL` --> `mer_flux_detection_total` + +**Morphology (morph)** +- Described in Romelli sections 7 & 8 ("EUC_MER_FINAL-MORPH-CAT") +- Original schema: [Morphology catalog FITS file](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/merdpd/dpcards/mer_finalcatalog.html#morphology-catalog-fits-file) +- Example column name transform: `CONCENTRATION` --> `morph_concentration` + +**Cutouts (cutouts)** +- Described in Romelli section 8 ("EUC_MER_FINAL-CUTOUTS-CAT") +- Original schema: [Cutouts catalog FITS file](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/merdpd/dpcards/mer_finalcatalog.html#cutouts-catalog-fits-file) +- Example column name transform: `CORNER_0_RA` --> `cutouts_corner_0_ra` + +Find all columns from these tables in the Parquet schema: + +```{code-cell} +mer_prefixes = ["mer_", "morph_", "cutouts_"] +mer_col_counts = {p: len([n for n in schema.names if n.startswith(p)]) for p in mer_prefixes} + +print(f"MER tables: {sum(mer_col_counts.values())} columns total") +for prefix, count in mer_col_counts.items(): + print(f" {prefix}: {count}") +``` + +### 4.2 PHZ tables + +The Euclid PHZ processing function produced eight tables. +The reference paper is [Euclid Collaboration: Tucci et al., 2025](https://arxiv.org/pdf/2503.15306) (hereafter, Tucci). 
+The tables are: + +**Photometric Redshifts (phz)** +- Described in Tucci section 5 ("phz_photo_z") +- Original schema: [Photo Z catalog](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputcatalog.html#photo-z-catalog) +- Example column name transform: `PHZ_PDF` --> `phz_phz_pdf` + +**Classifications (class)** +- Described in Tucci section 4 ("phz_classification") +- Original schema: [Classification catalog](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputforl3.html#classification-catalog) +- Example column name transform: `PHZ_CLASSIFICATION` --> `class_phz_classification` + +**Galaxy Physical Parameters (physparam)** +- Described in Tucci section 6 (6.1; "phz_physical_parameters") +- Original schema: [Physical Parameters catalog](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputforl3.html#physical-parameters-catalog) +- Example column name transform: `PHZ_PP_MEDIAN_REDSHIFT` --> `physparam_phz_pp_median_redshift` + +**Galaxy SEDs (galaxysed)** +- Described in Tucci appendix B (B.1 "phz_galaxy_sed") +- Original schema: [Galaxy SED catalog](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputcatalog.html#galaxy-sed-catalog) +- Example column name transform: `FLUX_4900_5000` --> `galaxysed_flux_4900_5000` + +**QSO Physical Parameters (physparamqso)** +- Described in Tucci section 6 (6.2; "phz_qso_physical_parameters") +- Original schema: [QSO Physical Parameters catalog](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputforl3.html#qso-physical-parameters-catalog) +- Example column name transform: `E(B-V)` --> `physparamqso_e_b_v_` + +**Star Parameters (starclass)** +- Described in Tucci section 6 (6.3; "phz_star_template") +- Original schema: [Star Template](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputforl3.html#star-template) +- Example column name transform: `FLUX_VIS_Total_Corrected` --> `starclass_flux_vis_total_corrected` + +**Star SEDs (starsed)** +- Described in Tucci appendix B (B.1 "phz_star_sed") +- Original schema: [Star SED catalog](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputcatalog.html#star-sed-catalog) +- Example column name transform: `FLUX_4900_5000` --> `starsed_flux_4900_5000` + +**NIR Physical Parameters (physparamnir)** +- Described in Tucci section 6 (6.4; "phz_nir_physical_parameters") +- Original schema: [NIR Physical Parameters catalog](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputforl3.html#nir-physical-parameters-catalog) +- Example column name transform: `E(B-V)` --> `physparamnir_e_b_v_` + +Find all columns from these tables in the Parquet schema: + +```{code-cell} +phz_prefixes = ["phz_", "class_", "physparam_", "galaxysed_", "physparamqso_", "starclass_", "starsed_", "physparamnir_"] +phz_col_counts = {p: len([n for n in schema.names if n.startswith(p)]) for p in phz_prefixes} + +print(f"PHZ tables: {sum(phz_col_counts.values())} columns total") +for prefix, count in phz_col_counts.items(): + print(f" {prefix}: {count}") +``` + +### 4.3 SPE tables + +The Euclid SPE processing function produced three tables from which data are included in the Merged Objects catalog. +The reference paper is [Euclid Collaboration: Le Brun et al., 2025](https://arxiv.org/pdf/2503.15308) (hereafter, Le Brun). 
These tables required special handling because they contain multiple rows per object (identified by column `object_id`).
The tables were pivoted before being joined so that the Merged Objects catalog contains one row per object.
The pivoted columns were named by combining at least the table name, the original column name, and the rank of the redshift estimate (i.e., the value in the original 'SPE_RANK' column), as sketched below.
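
As a toy illustration of this pivot (using made-up values and a single hypothetical column, `spe_z`, rather than the real schema), the reshaping looks roughly like this:

```{code-cell}
import pandas as pd

# Original (long) layout: up to 5 ranked redshift estimates per object.
long_table = pd.DataFrame({
    "object_id": [101, 101, 101, 102, 102],
    "spe_rank": [0, 1, 2, 0, 1],
    "spe_z": [1.2, 0.8, 2.1, 0.4, 0.5],
})

# Pivot to one row per object, one column per rank.
wide_table = long_table.pivot(index="object_id", columns="spe_rank", values="spe_z")
# Name the columns following the convention described above (table name 'z',
# assumed type 'galaxy_candidates', rank appended).
wide_table.columns = [f"z_galaxy_candidates_spe_z_rank{rank}" for rank in wide_table.columns]
wide_table
```

The tables are:

**Spectroscopic Redshifts (z)**
- Described in Le Brun section 2 ("spectro_zcatalog_spe_quality", "spectro_zcatalog_spe_classification", "spectro_zcatalog_spe_galaxy_candidates", "spectro_zcatalog_spe_star_candidates", and "spectro_zcatalog_spe_qso_candidates")
- Original schema: [Redshift catalog](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/spedpd/dpcards/spe_spepfoutputcatalog.html#redshift-catalog)
- Pivot and naming: For each object, the original table contains up to 5 redshift estimates (ranked by confidence) produced by separately assuming the object is a galaxy, a star, and a QSO -- for a total of up to 15 rows per object.
  The table was pivoted to one row per object and the resulting columns were named by prepending the table name (`z`) and the assumed type (`galaxy_candidates`, `star_candidates`, and `qso_candidates`), and appending the rank (`rank[0-4]`).
- Example column name transform: `SPE_PDF` --> `z_galaxy_candidates_spe_pdf_rank0` (top-ranked redshift PDF, assuming galaxy)

**Spectral Line Measurements (lines)**
- Described in Le Brun section 5 ("spectro_line_features_catalog_spe_line_features_cat").
  _Notice that lines were identified by assuming the object is a **galaxy**._
- Original schema: [Lines catalog](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/spedpd/dpcards/spe_spepfoutputcatalog.html#lines-catalog) (HDU#1 only)
- Pivot and naming: For each object and line, the original table contains up to 5 rows, one per galaxy redshift estimate from the **z** table.
  The Merged Objects catalog only contains the Halpha line measurements.
  The table was pivoted to one row per object and the resulting columns were named by prepending the table name (`lines`) and appending both the redshift rank (`rank[0-4]`) and the name of the line (`halpha`).
- Example column name transform: `SPE_LINE_FLUX_GF` --> `lines_spe_line_flux_gf_rank0_halpha` (Halpha line flux of the top-ranked redshift, assuming galaxy)

**Models (models)**
- Described in Le Brun section 5 ("spectro_model_catalog_spe_lines_catalog")
- Original schema: [Models catalog](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/spedpd/dpcards/spe_spepfoutputcatalog.html#models-catalog) (HDU#2 only)
- Pivot and naming: The original table has the same structure as the **z** table.
  The Merged Objects catalog only contains the galaxy solutions.
  The table was pivoted to one row per object and the resulting columns were named by prepending the table name (`models`) and the assumed type (`galaxy`), and appending the redshift rank (`rank[0-4]`).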
+- Example column name transform: `SPE_VEL_DISP_E` --> `models_galaxy_spe_vel_disp_e_rank0` (velocity dispersion of the top-ranked redshift, assuming galaxy) + +Find all columns from these tables in the Parquet schema: + +```{code-cell} +spe_prefixes = ["z_", "lines_", "models_"] +spe_col_counts = {p: len([n for n in schema.names if n.startswith(p)]) for p in spe_prefixes} + +print(f"SPE tables: {sum(spe_col_counts.values())} columns total") +for prefix, count in spe_col_counts.items(): + print(f" {prefix[:-1]}: {count}") +``` + +### 4.4 Additional columns + +The following columns were added to the Merged Objects catalog but do not appear in the original Euclid tables. + +**Euclid columns:** + +- `tileid` : ID of the Euclid Tile the object was detected in. + The Euclid tiling is described in Romelli section 3.1. + +**HEALPix columns:** + +These HEALPix indexes correspond to the object's RA and Dec coordinates. +They are useful for spatial queries, as demonstrated in the Euclid Deep Fields section below. + +- `_healpix_9` : HEALPix order 9 pixel index. + Order 9 pixels have a resolution (square root of area) of ~400 arcseconds or ~0.1 degrees. +- `_healpix_19` : HEALPix order 19 pixel index. + Order 19 pixels have a resolution of ~0.4 arcseconds. +- `_healpix_29` : HEALPix order 29 pixel index. + Order 29 pixels have a resolution of ~4e-4 arcseconds. + +The HEALPix, Euclid object ID, and Euclid tile ID columns appear first: + +```{code-cell} +schema.names[:5] +``` + +**HATS columns:** + +These are the HATS partitioning columns. +They appear in the Parquet file names but are not included inside the files. +However, PyArrow automatically makes them available as regular columns when the dataset is loaded as demonstrated in these tutorials. + +- `Norder` : (hats column) HEALPix order at which the data is partitioned. +- `Npix` : (hats column) HEALPix pixel index at order Norder. +- `Dir` : (hats column) Integer equal to 10_000 * floor[Npix / 10_000]. + +The HATS columns appear at the end: + +```{code-cell} +schema.names[-3:] +``` + ++++ + +### 4.5 Find columns of interest + +The subsections above show how to find all columns from a given Euclid table as well as the additional columns. +Here we show some additional techniques for finding columns. + +```{code-cell} +# Access the data type using the `field` method. +schema.field("mer_flux_y_2fwhm_aper") +``` + +```{code-cell} +# The column metadata includes unit and description. +schema.field("mer_flux_y_2fwhm_aper").metadata +``` + +Euclid Q1 offers many flux measurements, both from Euclid detections and from external ground-based surveys. +They are given in microjanskys, so all flux columns can be found by searching the metadata for this unit. + +```{code-cell} +# Find all flux columns. +flux_columns = [field.name for field in schema if field.metadata[b"unit"] == b"uJy"] + +print(f"{len(flux_columns)} flux columns. First four are:") +flux_columns[:4] +``` + +Columns associated with external surveys are identified by the inclusion of "ext" in the name. + +```{code-cell} +external_flux_columns = [name for name in flux_columns if "ext" in name] +print(f"{len(external_flux_columns)} flux columns from external surveys. First four are:") +external_flux_columns[:4] +``` + +## 5. Euclid Deep Fields + ++++ + +Euclid Q1 includes data from three Euclid Deep Fields: EDF-N (North), EDF-S (South), EDF-F (Fornax; also in the southern hemisphere). 
There is also a small amount of data from a fourth field: LDN1641 (Lynds' Dark Nebula 1641), which was observed for technical reasons during Euclid's verification phase and is mostly ignored here.
The fields are described in [Euclid Collaboration: Aussel et al., 2025](https://arxiv.org/pdf/2503.15302).

The regions are well separated, so we can distinguish them using a simple cone search without having to be too picky about the radius.
We can load data more efficiently using the HEALPix order 9 pixels that cover each area rather than using RA and Dec values directly.
These pixel lists will be used in later tutorials.

```{code-cell}
# EDF-N (Euclid Deep Field - North)
ra, dec, radius = 269.733, 66.018, 4  # 20 sq deg
edfn_k9_pixels = hpgeom.query_circle(hpgeom.order_to_nside(9), ra, dec, radius, inclusive=True)

# EDF-S (Euclid Deep Field - South)
ra, dec, radius = 61.241, -48.423, 5  # 23 sq deg
edfs_k9_pixels = hpgeom.query_circle(hpgeom.order_to_nside(9), ra, dec, radius, inclusive=True)

# EDF-F (Euclid Deep Field - Fornax)
ra, dec, radius = 52.932, -28.088, 3  # 10 sq deg
edff_k9_pixels = hpgeom.query_circle(hpgeom.order_to_nside(9), ra, dec, radius, inclusive=True)

# LDN1641 (Lynds' Dark Nebula 1641)
ra, dec, radius = 85.74, -8.39, 1.5  # 6 sq deg
ldn_k9_pixels = hpgeom.query_circle(hpgeom.order_to_nside(9), ra, dec, radius, inclusive=True)
```

## 6. Basic Query

To demonstrate a basic query, we'll search for objects with a galaxy photometric redshift estimate of 6.0 (the largest possible value).
Other tutorials in this series will show more complex queries and describe the redshifts and other data in more detail.

```{code-cell}
dataset = pyarrow.dataset.dataset(dataset_path, partitioning="hive", filesystem=s3, schema=schema)

highz_objects = dataset.to_table(columns=['object_id', 'phz_phz_median'], filter=pc.field('phz_phz_median') == 6).to_pandas()
highz_objects
```

## About this notebook

**Authors:** Troy Raen (Developer; Caltech/IPAC-IRSA) and the IRSA Data Science Team.

**Updated:** 17 December 2025

**Contact:** [IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or problems.

diff --git a/tutorials/parquet-catalog-demos/euclid-q1-hats/4-euclid-q1-hats-magnitudes.md b/tutorials/parquet-catalog-demos/euclid-q1-hats/4-euclid-q1-hats-magnitudes.md
new file mode 100644
index 00000000..e60bc375
--- /dev/null
+++ b/tutorials/parquet-catalog-demos/euclid-q1-hats/4-euclid-q1-hats-magnitudes.md
@@ -0,0 +1,384 @@
---
short_title: "HATS: Magnitudes"
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
  jupytext_version: 1.16.1
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Euclid Q1 Merged Objects HATS Catalog: Magnitudes

+++

This tutorial explores the different flux measurements available in the Euclid Q1 Merged Objects HATS Catalog and demonstrates how to work with magnitudes in the four Euclid bands.
It assumes you are familiar with the [first tutorial](#euclid-q1-hats-intro) in this [series](#euclid-q1-hats), which covers the content, format, and basic access to this catalog.

## Learning Goals

By the end of this tutorial, you will be able to:

- Distinguish between aperture and template-fit flux measurements and understand when to use each type.
- Apply color corrections to NIR band fluxes to obtain accurate total flux estimates.
- Query the HATS Catalog for magnitude data across the four Euclid bands (VIS: I; NISP: Y, J, H).
+- Visualize magnitude distributions as a function of object classification. +- Compare aperture and template-fit magnitudes to understand their systematic differences for extended vs. point-like sources. + ++++ + +## 1. Introduction + +Euclid Q1 contains two main types of flux measurements: **aperture photometry** and **template-fit photometry**. +These measurements are provided for both Euclid bands (VIS: I; NISP: Y, J, H) and external survey bands. +Additional flux measurements are also available but not covered in this tutorial, including: PSF-fit fluxes (VIS only), Sérsic-fit fluxes (computed for parametric morphology), and fluxes that were corrected based on PHZ classification. + +**Aperture fluxes** measure the total light within a defined aperture (e.g., within a Kron radius or a fixed circular aperture). +They are generally more accurate for point-like sources, especially bright stars in the NIR bands, likely due to better handling of PSF-related effects. + +**Template-fit fluxes** are expected to be more accurate than aperture fluxes for extended sources because the templates do a better job of excluding contamination from nearby sources. +However, Euclid recommends scaling the measured NIR fluxes with a color term based on VIS photometry to obtain the best estimate of the total flux in each NIR band. +This color correction accounts for systematic differences between the VIS and NIR observations. + +See also: +- MER [Photometry Cookbook](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/merdpd/merphotometrycookbook.html) +- [Euclid Collaboration: Romelli et al., 2025](https://arxiv.org/pdf/2503.15305) (hereafter, Romelli) + ++++ + +## 2. Imports + +```{code-cell} +# # Uncomment the next line to install dependencies if needed. +# %pip install hpgeom matplotlib numpy pandas pyarrow s3fs +``` + +```{code-cell} +import hpgeom # Find HEALPix indexes from RA and Dec +import matplotlib.pyplot as plt # Create figures +import numpy as np # Math +import pandas as pd # Manipulate query results +import pyarrow.compute as pc # Filter dataset +import pyarrow.dataset # Load the dataset +import pyarrow.fs # Simple S3 filesystem pointer +import pyarrow.parquet # Load the schema + +# Copy-on-write will become the default in pandas 3.0 and is generally more performant +pd.options.mode.copy_on_write = True +``` + +## 3. Setup + ++++ + +### 3.1 Load the catalog as a PyArrow dataset + +```{code-cell} +# AWS S3 paths. +s3_bucket = "nasa-irsa-euclid-q1" +dataset_prefix = "contributed/q1/merged_objects/hats/euclid_q1_merged_objects-hats/dataset" + +dataset_path = f"{s3_bucket}/{dataset_prefix}" +schema_path = f"{dataset_path}/_common_metadata" +metadata_path = f"{dataset_path}/_metadata" + +# S3 pointer. Use `anonymous=True` to access without credentials. +s3 = pyarrow.fs.S3FileSystem(anonymous=True) +``` + +```{code-cell} +# Load the catalog as a PyArrow dataset. This is used in the examples below. +# Include partitioning="hive" so PyArrow understands the file naming scheme and can navigate the partitions. +schema = pyarrow.parquet.read_schema(schema_path, filesystem=s3) +dataset = pyarrow.dataset.dataset(dataset_path, partitioning="hive", filesystem=s3, schema=schema) +# dataset = pyarrow.dataset.parquet_dataset(metadata_path, partitioning="hive", filesystem=s3) +``` + +### 3.2 Euclid Deep Fields + +We'll restrict our query to the Euclid Deep Field - Fornax to reduce the amount of data loaded. 
+This part of the filter will be clearly indicated so you can comment it out if you want an all-sky search. + +```{code-cell} +# EDF-F (Euclid Deep Field - Fornax) +ra, dec, radius = 52.932, -28.088, 3 # 10 sq deg +edff_k9_pixels = hpgeom.query_circle(hpgeom.order_to_nside(9), ra, dec, radius, inclusive=True) +``` + +### 3.3 Flux <-> magnitude conversions + +The [MER Photometry Cookbook](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/merdpd/merphotometrycookbook.html) defines the magnitude-flux conversion as: + +MAG = -2.5 * log10(FLUX) + 23.9 + +where FLUX is the total flux in microjanskys (µJy) and the zeropoint is 23.9 for all bands. + +```{code-cell} +def magnitude_to_flux(magnitude: float) -> float: + """Convert magnitude to flux following the MER Photometry Cookbook.""" + zeropoint = 23.9 + flux = 10 ** ((magnitude - zeropoint) / -2.5) + return flux +``` + +For convenience, we'll have PyArrow compute magnitudes from fluxes during the read operation. +This allows us to avoid loading and handling flux columns directly. +The function below constructs PyArrow expressions that can be used to filter and/or compute new columns on-the-fly. + +```{code-cell} +def flux_to_magnitude(flux_col_name: str, color_col_names: tuple[str, str] | None = None) -> pc.Expression: + """Construct an expression for the magnitude of `flux_col_name` following the MER Photometry Cookbook. + + MAG = -2.5 * log10(TOTAL FLUX) + 23.9. + If `color_col_names` is None, `flux_col_name` is taken as the total flux in the band. + If not None, it should list the columns needed for the color correction such that + TOTAL FLUX = flux_col_name * color_col_names[0] / color_col_names[1]. + + Returns a `pyarrow.compute.Expression` which can be used in the `filter` (to filter based on it) and/or + `columns` (to return it) keyword arguments when loading from the PyArrow dataset. + """ + scale = pc.scalar(-2.5) + zeropoint = pc.scalar(23.9) + + total_flux = pc.field(flux_col_name) + if color_col_names is not None: + color_scale = pc.divide(pc.field(color_col_names[0]), pc.field(color_col_names[1])) + total_flux = pc.multiply(total_flux, color_scale) + + log10_flux = pc.log10(total_flux) + mag_expression = pc.add(pc.multiply(scale, log10_flux), zeropoint) + return mag_expression +``` + +### 3.4 Frequently used columns + ++++ + +The following columns are used throughout this tutorial. +Descriptors generally come from Romelli unless noted. + +```{code-cell} +# Object ID set by the MER pipeline. +OBJECT_ID = "object_id" + +# Whether the source was detected in the VIS mosaic (1) or only in the NIR-stack mosaic (0). +VIS_DET = "mer_vis_det" + +# Best estimate of the total flux in the detection band. From aperture photometry within a Kron radius. +# Detection band is VIS if mer_vis_det=1. +# Otherwise, this is a non-physical NIR-stack flux and there was no VIS detection (aka, NIR-only). +FLUX_TOTAL = "mer_flux_detection_total" +FLUXERR_TOTAL = "mer_fluxerr_detection_total" + +# Peak surface brightness minus the magnitude used for mer_point_like_prob. +# Point-like: <-2.5. Compact: <-2.6. (Tucci) +MUMAX_MINUS_MAG = "mer_mumax_minus_mag" + +# Whether the detection has a >50% probability of being spurious (1=Yes, 0=No). +SPURIOUS_FLAG = "mer_spurious_flag" + +# PHZ classification: 1=Star, 2=Galaxy, 4=QSO. Combinations indicate multiple probability thresholds exceeded. +PHZ_CLASS = "phz_phz_classification" +PHZ_CLASS_MAP = { + 1: "Star", + 2: "Galaxy", + 4: "QSO", # In Q1, this includes luminous AGN. 
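    # Note: these codes behave like bit flags (1=Star, 2=Galaxy, 4=QSO), so a combined
    # class is the sum of its members (e.g., 6 = 2 + 4 = "Galaxy and QSO").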
+ # If multiple probability thresholds were exceeded, a combination of classes was reported. + 3: "Star and Galaxy", + 5: "Star and QSO", + 6: "Galaxy and QSO", + 7: "Star, Galaxy, and QSO", + # Two other integers (-1 and 0) and nulls are present, indicating that the class was not determined. + **{i: "Undefined" for i in [-1, 0, np.nan]}, +} +``` + +## 4. Magnitude distributions by object type + ++++ + +In this section, we query the catalog for magnitudes in the four Euclid bands and examine their distributions as a function of PHZ classification. +We use template-fit fluxes with color corrections as our baseline, following Euclid's recommendations for extended sources. + +Construct the filter for quality objects with VIS detections: + +```{code-cell} +# Construct a basic filter. +mag_filter = ( + (pc.field(VIS_DET) == 1) # No NIR-only objects. For simpler total-flux calculations; not required. + & (pc.field(PHZ_CLASS).isin([1, 2, 3, 4, 5, 6, 7])) # Stars, Galaxies, QSOs, and mixed classes. + & (pc.field(SPURIOUS_FLAG) == 0) # Basic quality cut. + # Comment out the next line to do an all-sky search. + & pc.field("_healpix_9").isin(edff_k9_pixels) # Restrict to Euclid Deep Field - Fornax. +) +``` + +Define the magnitude columns we want to load. +Because we're having PyArrow construct new columns (magnitudes from fluxes), we pass a dictionary mapping column names to PyArrow expressions rather than a simple list of column names. + +For the VIS band (I), we use `FLUX_TOTAL` which is the best estimate for the total flux in the detection band. +This comes from aperture photometry within a Kron radius. + +For the NIR bands (Y, J, H), we compute color-corrected total magnitudes from both template-fit and aperture fluxes following the MER Photometry Cookbook. +The color correction scales the measured NIR flux by the ratio of VIS total flux to VIS template (or aperture) flux. + +```{code-cell} +# Columns we want to load. It needs to be a dict because we're defining new columns (magnitudes). +# We'll have PyArrow return only magnitudes so that we don't have to handle all the flux columns in memory. +_mag_columns = { + # FLUX_TOTAL is the best estimate for the total flux in the detection band (here, VIS) and comes from + # aperture photometry. VIS provides the template for NIR bands. It has no unique templfit flux itself. + "I total": flux_to_magnitude(FLUX_TOTAL), + # Template-fit fluxes with color corrections. + "Y templfit total": flux_to_magnitude("mer_flux_y_templfit", color_col_names=(FLUX_TOTAL, "mer_flux_vis_to_y_templfit")), + "J templfit total": flux_to_magnitude("mer_flux_j_templfit", color_col_names=(FLUX_TOTAL, "mer_flux_vis_to_j_templfit")), + "H templfit total": flux_to_magnitude("mer_flux_h_templfit", color_col_names=(FLUX_TOTAL, "mer_flux_vis_to_h_templfit")), + # Aperture fluxes with color corrections. + "Y aperture total": flux_to_magnitude("mer_flux_y_2fwhm_aper", color_col_names=(FLUX_TOTAL, "mer_flux_vis_2fwhm_aper")), + "J aperture total": flux_to_magnitude("mer_flux_j_2fwhm_aper", color_col_names=(FLUX_TOTAL, "mer_flux_vis_2fwhm_aper")), + "H aperture total": flux_to_magnitude("mer_flux_h_2fwhm_aper", color_col_names=(FLUX_TOTAL, "mer_flux_vis_2fwhm_aper")), +} +mag_columns = {**_mag_columns, PHZ_CLASS: pc.field(PHZ_CLASS), MUMAX_MINUS_MAG: pc.field(MUMAX_MINUS_MAG)} +``` + +Load the data. This query filters for quality VIS-detected objects and returns magnitudes in all four Euclid bands. 
```{code-cell}
mags_df = dataset.to_table(columns=mag_columns, filter=mag_filter).to_pandas()
```

Plot magnitude distributions for each band, separated by PHZ classification.
We'll use template-fit magnitudes (with color corrections) as our baseline.
Include multiply-classed objects and separate point-like sources, since classification thresholds were tuned differently for different science goals.

```{code-cell}
# Galaxy + any. Star + galaxy. QSO + galaxy.
classes = {"Galaxy": (2, 3, 6, 7), "Star": (1, 3), "QSO": (4, 6)}
class_colors = ["tab:green", "tab:blue", "tab:orange"]

bands = ["I total", "Y templfit total", "J templfit total", "H templfit total"]
mag_limits = (14, 28)  # Excluding all magnitudes outside this range.
hist_kwargs = dict(bins=20, range=mag_limits, histtype="step")

fig, axes = plt.subplots(3, 4, figsize=(18, 12), sharey="row", sharex=True)
for (class_name, class_ids), class_color in zip(classes.items(), class_colors):
    # Get the objects that are in this class only.
    class_df = mags_df.loc[mags_df[PHZ_CLASS] == class_ids[0]]
    # Plot histograms for each band. Galaxies on top row, then stars, then QSOs.
    axs = axes[0] if class_name == "Galaxy" else (axes[1] if class_name == "Star" else axes[2])
    for ax, band in zip(axs, bands):
        ax.hist(class_df[band], label=class_name, color=class_color, **hist_kwargs)

    # Get the objects that were accepted as multiple classes.
    class_df = mags_df.loc[mags_df[PHZ_CLASS].isin(class_ids)]
    label = "+Galaxy" if class_name != "Galaxy" else "+any"
    # Of those objects, restrict to the ones that are point-like.
    classpt_df = class_df.loc[class_df[MUMAX_MINUS_MAG] < -2.5]
    # Plot histograms for both sets of objects.
    for ax, band in zip(axs, bands):
        ax.hist(class_df[band], color=class_color, label=label, linestyle=":", **hist_kwargs)
        ax.hist(classpt_df[band], color=class_color, linestyle="-.", label=label + " and point-like", **hist_kwargs)

# Add axis labels, etc.
for ax in axes[:, 0]:
    ax.set_ylabel("Counts")
    ax.legend(framealpha=0.2, loc=2)
for axs, band in zip(axes.transpose(), bands):
    axs[0].set_title(band.split()[0])
    axs[-1].set_xlabel(f"{band} (mag)")
plt.tight_layout()
```

Euclid is tuned to detect galaxies for cosmology studies, so there are many more galaxies than other object types.
The green distributions (top row) show the galaxy magnitudes across all four bands.
The green dash-dot line highlights the population of point-like "galaxies", which are likely misclassified stars or QSOs but mostly appear at faint magnitudes.

The star distributions (middle row, blue) are broader and peak at brighter magnitudes than the galaxy distributions, as expected.
Adding objects classified as both star and galaxy (dotted line) adds significant numbers, especially near the peak and toward the faint end where confusion is more likely.
Restricting these to point-like objects (dash-dot line) shows that many bright objects surpassing both probability thresholds are likely true stars.
However, this doesn't hold at the faint end, where even some star-only classified objects fail the point-like cut.

The bottom panel (orange) shows very few point-like QSOs, reminding us that most QSO classifications in Q1 should be treated with skepticism.
The few high-confidence QSOs are concentrated in regions with additional external data (particularly u-band from UNIONS in EDF-N).

+++

## 5. Template-fit vs. aperture magnitudes
+++

Now we compare template-fit and aperture magnitudes by plotting their differences.
This comparison reveals systematic offsets that depend on factors including morphology (extended vs. point-like) and brightness.

This figure is inspired by Romelli Fig. 6 (top panel).

```{code-cell}
# Only consider objects within these mag and mag difference limits.
mag_limits, mag_diff_limits = (16, 24), (-1, 1)
mag_limited_df = mags_df.loc[(mags_df["I total"] > mag_limits[0]) & (mags_df["I total"] < mag_limits[1])]

fig, axes = plt.subplots(2, 3, figsize=(18, 9), sharey=True, sharex=True)
bands = [("Y templfit total", "Y aperture total"), ("J templfit total", "J aperture total"), ("H templfit total", "H aperture total")]
hexbin_kwargs = dict(bins="log", extent=(*mag_limits, *mag_diff_limits), gridsize=25)
annotate_kwargs = dict(xycoords="axes fraction", ha="left", fontweight="bold", bbox=dict(facecolor="white", alpha=0.8))

# Plot
for axs, (ref_band, aper_band) in zip(axes.transpose(), bands):
    # Extended objects, top row. (Use the magnitude-limited sample defined above.)
    ax = axs[0]
    extended_df = mag_limited_df.loc[mag_limited_df[MUMAX_MINUS_MAG] >= -2.5]
    extended_mag_diff = extended_df[ref_band] - extended_df[aper_band]
    cb = ax.hexbin(extended_df["I total"], extended_mag_diff, **hexbin_kwargs)
    plt.colorbar(cb)
    ax.set_ylabel(f"{ref_band} - {aper_band}")
    # Annotate top (bottom) with the fraction of objects having a magnitude difference greater (less) than 0.
    frac_tmpl_greater = len(extended_mag_diff.loc[extended_mag_diff > 0]) / len(extended_mag_diff)
    ax.annotate(f"{frac_tmpl_greater:.3f}", xy=(0.01, 0.99), va="top", **annotate_kwargs)
    frac_tmpl_less = len(extended_mag_diff.loc[extended_mag_diff < 0]) / len(extended_mag_diff)
    ax.annotate(f"{frac_tmpl_less:.3f}", xy=(0.01, 0.01), va="bottom", **annotate_kwargs)

    # Point-like objects, bottom row.
    ax = axs[1]
    pointlike_df = mag_limited_df.loc[mag_limited_df[MUMAX_MINUS_MAG] < -2.5]
    pointlike_mag_diff = pointlike_df[ref_band] - pointlike_df[aper_band]
    cb = ax.hexbin(pointlike_df["I total"], pointlike_mag_diff, **hexbin_kwargs)
    plt.colorbar(cb)
    ax.set_ylabel(f"{ref_band} - {aper_band}")
    # Annotate top (bottom) with the fraction of objects having a magnitude difference greater (less) than 0.
    frac_tmpl_greater = len(pointlike_mag_diff.loc[pointlike_mag_diff > 0]) / len(pointlike_mag_diff)
    ax.annotate(f"{frac_tmpl_greater:.3f}", xy=(0.01, 0.99), va="top", **annotate_kwargs)
    frac_tmpl_less = len(pointlike_mag_diff.loc[pointlike_mag_diff < 0]) / len(pointlike_mag_diff)
    ax.annotate(f"{frac_tmpl_less:.3f}", xy=(0.01, 0.01), va="bottom", **annotate_kwargs)

# Add axis labels, etc.
for i, ax in enumerate(axes.flatten()):
    ax.axhline(0, color="gray", linewidth=1)
    if i == 1:
        ax.set_title("Extended objects")
    if i == 4:
        ax.set_title("Point-like objects")
    if i > 2:
        ax.set_xlabel("I total")
plt.tight_layout()
```

The template - aperture magnitude difference is fairly tightly clustered around 0 for extended objects (top row), but with asymmetric outliers.
Fractions annotated in the plots show the proportion of objects with magnitude differences above (top number) and below (bottom number) zero.
There is a positive offset indicating fainter template-fit magnitudes, as expected: templates better exclude contaminating light from nearby sources.

The offset is more pronounced for point-like objects (bottom row), likely due to PSF-related issues in template fitting.
This confirms that aperture magnitudes are more reliable for point sources, consistent with recommendations in the MER Photometry Cookbook.

+++

## About this notebook

**Authors:** Troy Raen (Developer; Caltech/IPAC-IRSA) and the IRSA Data Science Team.

**Updated:** 10 December 2025

**Contact:** [IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or problems.

diff --git a/tutorials/parquet-catalog-demos/irsa-hats-with-lsdb.md b/tutorials/parquet-catalog-demos/irsa-hats-with-lsdb.md
index 00a73fd7..95d5d6de 100644
--- a/tutorials/parquet-catalog-demos/irsa-hats-with-lsdb.md
+++ b/tutorials/parquet-catalog-demos/irsa-hats-with-lsdb.md
@@ -11,6 +11,7 @@ kernelspec:
   name: python3
 ---
 
+(hats-collections-lsdb-intro)=
 # Access HATS Collections Using LSDB: Euclid Q1 and ZTF DR23
 
 +++
@@ -243,7 +244,7 @@ euclid_schema_df[
 "phz_phz_median" gives us the median photometric redshift of sources, while "phz_phz_classification" can be used to filter the Euclid catalog to only galaxy-type sources.
 You can explore the schema DataFrame further to identify any other columns of interest.
 
-It is also useful to go through the description of Euclid Q1 data products and papers that contain different PHZ, MER, etc. catalog tables that are merged in this HATS catalog, which are linked in the [Euclid Q1 HATS Parquet notebook](https://caltech-ipac.github.io/irsa-tutorials/#todo-add-when-available).
+It is also useful to go through the descriptions of the Euclid Q1 data products and the papers covering the PHZ, MER, and SPE catalog tables that are merged into this HATS catalog; these are linked in the [Euclid Q1 Merged Objects HATS Catalog: Introduction](#euclid-q1-hats-intro).
 
 +++