diff --git a/tutorials/cosmodc2/cosmoDC2_TAP_access.md b/tutorials/cosmodc2/cosmoDC2_TAP_access.md index a4fcd851..ca6c37dd 100644 --- a/tutorials/cosmodc2/cosmoDC2_TAP_access.md +++ b/tutorials/cosmodc2/cosmoDC2_TAP_access.md @@ -5,7 +5,7 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.16.2 + jupytext_version: 1.18.1 kernelspec: display_name: Python 3 (ipykernel) language: python @@ -14,13 +14,30 @@ execution: timeout: 2600 --- +# Querying the CosmoDC2 Mock v1 Catalogs +This tutorial demonstrates how to access and query the **CosmoDC2 Mock v1** catalogs using IRSA’s Table Access Protocol (TAP) service. Background information on the catalogs is available on the [IRSA CosmoDC2 page](https://irsa.ipac.caltech.edu/Missions/cosmodc2.html). -# Querying CosmoDC2 Mock v1 catalogs +The catalogs are served through IRSA’s Virtual Observatory–standard **TAP** [interface](https://irsa.ipac.caltech.edu/docs/program_interface/TAP.html), which you can access programmatically in Python via the **PyVO** library. TAP queries are written in the **Astronomical Data Query Language (ADQL)** — a SQL-like language designed for astronomical catalogs (see the [ADQL specification](https://www.ivoa.net/documents/latest/ADQL.html)). -This tutorial demonstrates how to access the CosmoDC2 Mock V1 catalogs. More information about these catalogs can be found here: https://irsa.ipac.caltech.edu/Missions/cosmodc2.html +If you are new to PyVO’s query modes, the documentation provides a helpful comparison between **synchronous** and **asynchronous** execution: [PyVO: Synchronous vs. Asynchronous Queries](https://pyvo.readthedocs.io/en/latest/dal/index.html#synchronous-vs-asynchronous-query) -These catalogs can be accessed through IRSA's Virtual Ovservatory Table Access Protocol (TAP) service. See https://www.ivoa.net/documents/TAP/ for details on the protocol. This service can be accessed through Python using the PyVO library. + +## Tips for Working with CosmoDC2 via TAP + +- **Use indexed columns for fast queries.** + CosmoDC2 is indexed on the following fields: + `ra`, `dec`, `redshift`, `mag*_lsst`, `halo_mass`, `stellar_mass` + Queries involving these columns generally return much faster. + +- **Ensure your positional queries fall within the survey footprint.** + CosmoDC2 covers the area specified by the + following (R.A., decl.) coordinate pairs (J2000): + (71.46,−27.25), (52.25,−27.25), + (73.79,−44.33), (49.42,−44.33). + +- **Avoid overloading the TAP service.** + Preferentially use **asynchronous** queries for long running queries to avoid timing out. The whole system will slow down if a lot of people are using it for large queries, or if you decide to kick off many large queries at the same time. ```{code-cell} ipython3 # Uncomment the next line to install dependencies if needed. @@ -29,13 +46,16 @@ These catalogs can be accessed through IRSA's Virtual Ovservatory Table Access P ```{code-cell} ipython3 import pyvo as vo +import numpy as np +import matplotlib.mlab as mlab +import matplotlib.pyplot as plt ``` ```{code-cell} ipython3 service = vo.dal.TAPService("https://irsa.ipac.caltech.edu/TAP") ``` -## List the available DC2 tables +## 1. List the available DC2 tables ```{code-cell} ipython3 tables = service.tables @@ -45,7 +65,7 @@ for tablename in tables.keys(): tables[tablename].describe() ``` -## Choose the DC2 catalog you want to work with. +## 2. Choose the DC2 catalog you want to work with. IRSA currently offers 3 versions of the DC2 catalog. @@ -64,31 +84,7 @@ If you are new to the DC2 catalog, we recommend that you start with ``cosmodc2mo tablename = 'cosmodc2mockv1_heavy' ``` -## How many rows are in the chosen table? - -With TAP, you can query catalogs with constraints specified in IVOA Astronomical Data Query Language (ADQL; https://www.ivoa.net/documents/latest/ADQL.html), which is based on SQL. - -```{code-cell} ipython3 -# For example, this snippet of ADQL counts the number of elements in -# the redshift column of the table we chose. -adql = f"SELECT count(redshift) FROM {tablename}" -adql -``` - -In order to use TAP with this ADQL string using pyvo, you can do the following: - -```{code-cell} ipython3 -# Uncomment the next line to run the query. Beware that it can take awhile. -# service.run_async(adql) -``` - -The above query shows that there are 597,488,849 redshifts in this table. -Running ``count`` on an entire table is an expensive operation, therefore we ran it asynchronously to avoid any potential timeout issues. -To learn more about synchronous versus asynchronous PyVO queries please read the [relevant PyVO documentation](https://pyvo.readthedocs.io/en/latest/dal/index.html#synchronous-vs-asynchronous-query). - -+++ - -## What is the default maximum number of rows returned by the service? +## 3. What is the default maximum number of rows returned by the service? This service will return a maximum of 2 billion rows by default. @@ -102,7 +98,7 @@ This default maximum can be changed, and there is no hard upper limit to what it print(service.hardlimit) ``` -## List the columns in the chosen table +## 4. List the columns in the chosen table This table contains 301 columns. @@ -118,79 +114,152 @@ for col in columns: print(f'{f"{col.name}":30s} {col.description}') ``` -## Create a histogram of redshifts +## 5. Retrieve a list of galaxies within a small area -Let's figure out what redshift range these galaxies cover. Since we found out above that it's a large catalog, we can start with a spatial search over a small area of 0.1 deg. The ADQL that is needed for the spatial constraint is: +Since we know that cosmoDC2 is a large catalog, we can start with a spatial search over a small square area. The ADQL that is needed for the spatial constraint is shown below. We then show how to make a redshift histogram of the sample generated. ```{code-cell} ipython3 -adql = f"SELECT redshift FROM {tablename} WHERE CONTAINS(POINT('ICRS', RAMean, DecMean), CIRCLE('ICRS',54.218205903,-37.497959343,.1))=1" -adql -``` +# Setup the query +adql = f""" +SELECT redshift +FROM {tablename} +WHERE CONTAINS( + POINT('ICRS', ra, dec), + CIRCLE('ICRS', 54.0, -37.0, 0.05) +) = 1 +""" -Now we can use the previously-defined service to execute the query with the spatial contraint. +cone_results = service.run_sync(adql) +``` ```{code-cell} ipython3 -cone_results = service.run_sync(adql) +#how many redshifts does this return? +print(len(cone_results)) ``` ```{code-cell} ipython3 -# Plot a histogram -import numpy as np -import matplotlib.mlab as mlab -import matplotlib.pyplot as plt +# Now that we have a list of galaxy redshifts in that region, we can +# create a histogram of the redshifts to see what redshifts this survey includes. +# Plot a histogram num_bins = 20 # the histogram of the data n, bins, patches = plt.hist(cone_results['redshift'], num_bins, facecolor='blue', alpha = 0.5) plt.xlabel('Redshift') plt.ylabel('Number') -plt.title('Redshift Histogram CosmoDC2 Mock Catalog V1 abridged') +plt.title(f'Redshift Histogram {tablename}') ``` -We can easily see form this plot that the simulated galaxies go out to z = 3. +We can see form this plot that the simulated galaxies go out to z = 3. +++ -## Visualize galaxy colors at z ~ 0.5 - -Now let's visualize the galaxy main sequence at z = 2.0. First, we'll do a narrow redshift cut with no spatial constraint. +## 6. Visualize galaxy colors: redshift search -Let's do it as an asynchronous search since this might take awhile, too. +First, we'll do a narrow redshift cut with no spatial constraint. Then, from that redshift sample we will visualize the galaxy main sequence at z = 2.0. ```{code-cell} ipython3 -service = vo.dal.TAPService("https://irsa.ipac.caltech.edu/TAP") -adql = f"SELECT Mag_true_r_sdss_z0, Mag_true_g_sdss_z0, redshift FROM {tablename} WHERE redshift > 0.5 and redshift < 0.54" -results = service.run_async(adql) +# Setup the query +adql = f""" +SELECT TOP 50000 + mag_r_lsst, + (mag_g_lsst - mag_r_lsst) AS color, + redshift +FROM {tablename} +WHERE redshift BETWEEN 1.95 AND 2.05 +""" +redshift_results = service.run_sync(adql) ``` ```{code-cell} ipython3 -len(results['mag_true_r_sdss_z0']) +redshift_results ``` ```{code-cell} ipython3 -# Since this results in almost 4 million galaxies, -# we will construct a 2D histogram rather than a scatter plot. -plt.hist2d(results['mag_true_r_sdss_z0'], results['mag_true_g_sdss_z0']-results['mag_true_r_sdss_z0'], - bins=200, cmap='plasma', cmax=500) +# Construct a 2D histogram of the galaxy colors +plt.hist2d(redshift_results['mag_r_lsst'], redshift_results['color'], + bins=100, cmap='plasma', cmax=500) # Plot a colorbar with label. cb = plt.colorbar() cb.set_label('Number') # Add title and labels to plot. -plt.xlabel('SDSS Mag r') -plt.ylabel('SDSS rest-frame g-r color') +plt.xlabel('LSST Mag r') +plt.ylabel('LSST rest-frame g-r color') +``` + +## 7. Suggestions for further queries: +TAP queries are extremely powerful and provide flexible ways to explore large catalogs like CosmoDC2, including spatial searches, photometric selections, cross-matching, and more. +However, many valid ADQL queries can take minutes or longer to complete due to the size of the catalog, so we avoid running those directly in this tutorial. +Instead, the examples here have so far focused on fast, lightweight queries that illustrate the key concepts without long wait times. +If you are interested in exploring further, here are some additional query ideas that are scientifically useful but may take longer to run depending on server conditions. + +### Count the total number of redshifts in the chosen table +The answer for the `'cosmodc2mockv1_heavy'` table is 597,488,849 redshifts. + +```sql +adql = f"SELECT count(redshift) FROM {tablename}" +``` + +### Count galaxies in a sky region (cone search) +Generally useful for: estimating source density, validating spatial footprint, testing spatial completeness. + +```sql +adql = f""" +SELECT COUNT(*) +FROM {tablename} +WHERE CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', 54.2, -37.5, 0.2)) = 1 +""" +``` + +### Retrieve only a subset of columns (recommended for speed) and rows +This use of "TOP 5000" just limits the number of rows returned. +Remove it if you want all rows, but keep in mind such a query can take a much longer time. + +```sql +adql = f""" +SELECT TOP 5000 + ra, + dec, + redshift, + stellar_mass +FROM {tablename}""" +``` -# Show the plot. -plt.show() +### Explore the stellar–halo mass relation + +```sql +adql = f""" +SELECT TOP 500000 + stellar_mass, + halo_mass +FROM {tablename} +WHERE halo_mass > 1e11""" +``` + +### Find the brightest galaxies at high redshift +Return the results in ascending (ASC) order by r band magnitude. + +```sql +adql = f""" +SELECT TOP 10000 + ra, dec, redshift, mag_r_lsst +FROM {tablename} +WHERE redshift > 2.5 +ORDER BY mag_r_lsst ASC +""" ``` -*** + ++++ ## About this notebook -**Author:** Vandana Desai (IRSA Science Lead) +**Author:** IRSA Data Science Team, including Vandana Desai, Jessica Krick, Troy Raen, Brigitta Sipőcz, Andreas Faisst, Jaladh Singhal -**Updated:** 2024-07-24 +**Updated:** 2025-12-16 **Contact:** [the IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or reporting problems. + +**Runtime:** As of the date above, this notebook takes about 2 minutes to run to completion on a machine with 8GB RAM and 2 CPU. Large variations in this runtime can be expected if the TAP server is busy with many queries at once.