Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
197 changes: 133 additions & 64 deletions tutorials/cosmodc2/cosmoDC2_TAP_access.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ jupytext:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.16.2
jupytext_version: 1.18.1
kernelspec:
display_name: Python 3 (ipykernel)
language: python
Expand All @@ -14,13 +14,30 @@ execution:
timeout: 2600
---

# Querying the CosmoDC2 Mock v1 Catalogs

This tutorial demonstrates how to access and query the **CosmoDC2 Mock v1** catalogs using IRSA’s Table Access Protocol (TAP) service. Background information on the catalogs is available on the [IRSA CosmoDC2 page](https://irsa.ipac.caltech.edu/Missions/cosmodc2.html).

# Querying CosmoDC2 Mock v1 catalogs
The catalogs are served through IRSA’s Virtual Observatory–standard **TAP** [interface](https://irsa.ipac.caltech.edu/docs/program_interface/TAP.html), which you can access programmatically in Python via the **PyVO** library. TAP queries are written in the **Astronomical Data Query Language (ADQL)** — a SQL-like language designed for astronomical catalogs (see the [ADQL specification](https://www.ivoa.net/documents/latest/ADQL.html)).

This tutorial demonstrates how to access the CosmoDC2 Mock V1 catalogs. More information about these catalogs can be found here: https://irsa.ipac.caltech.edu/Missions/cosmodc2.html
If you are new to PyVO’s query modes, the documentation provides a helpful comparison between **synchronous** and **asynchronous** execution: [PyVO: Synchronous vs. Asynchronous Queries](https://pyvo.readthedocs.io/en/latest/dal/index.html#synchronous-vs-asynchronous-query)

These catalogs can be accessed through IRSA's Virtual Ovservatory Table Access Protocol (TAP) service. See https://www.ivoa.net/documents/TAP/ for details on the protocol. This service can be accessed through Python using the PyVO library.

## Tips for Working with CosmoDC2 via TAP

- **Use indexed columns for fast queries.**
CosmoDC2 is indexed on the following fields:
`ra`, `dec`, `redshift`, `mag*_lsst`, `halo_mass`, `stellar_mass`
Queries involving these columns generally return much faster.

- **Ensure your positional queries fall within the survey footprint.**
CosmoDC2 covers the area specified by the
following (R.A., decl.) coordinate pairs (J2000):
(71.46,−27.25), (52.25,−27.25),
(73.79,−44.33), (49.42,−44.33).

- **Avoid overloading the TAP service.**
Preferentially use **asynchronous** queries for long running queries to avoid timing out. The whole system will slow down if a lot of people are using it for large queries, or if you decide to kick off many large queries at the same time.

```{code-cell} ipython3
# Uncomment the next line to install dependencies if needed.
Expand All @@ -29,13 +46,16 @@ These catalogs can be accessed through IRSA's Virtual Ovservatory Table Access P

```{code-cell} ipython3
import pyvo as vo
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
```

```{code-cell} ipython3
service = vo.dal.TAPService("https://irsa.ipac.caltech.edu/TAP")
```

## List the available DC2 tables
## 1. List the available DC2 tables

```{code-cell} ipython3
tables = service.tables
Expand All @@ -45,7 +65,7 @@ for tablename in tables.keys():
tables[tablename].describe()
```

## Choose the DC2 catalog you want to work with.
## 2. Choose the DC2 catalog you want to work with.

IRSA currently offers 3 versions of the DC2 catalog.

Expand All @@ -64,31 +84,7 @@ If you are new to the DC2 catalog, we recommend that you start with ``cosmodc2mo
tablename = 'cosmodc2mockv1_heavy'
```

## How many rows are in the chosen table?

With TAP, you can query catalogs with constraints specified in IVOA Astronomical Data Query Language (ADQL; https://www.ivoa.net/documents/latest/ADQL.html), which is based on SQL.

```{code-cell} ipython3
# For example, this snippet of ADQL counts the number of elements in
# the redshift column of the table we chose.
adql = f"SELECT count(redshift) FROM {tablename}"
adql
```

In order to use TAP with this ADQL string using pyvo, you can do the following:

```{code-cell} ipython3
# Uncomment the next line to run the query. Beware that it can take awhile.
# service.run_async(adql)
```

The above query shows that there are 597,488,849 redshifts in this table.
Running ``count`` on an entire table is an expensive operation, therefore we ran it asynchronously to avoid any potential timeout issues.
To learn more about synchronous versus asynchronous PyVO queries please read the [relevant PyVO documentation](https://pyvo.readthedocs.io/en/latest/dal/index.html#synchronous-vs-asynchronous-query).

+++

## What is the default maximum number of rows returned by the service?
## 3. What is the default maximum number of rows returned by the service?

This service will return a maximum of 2 billion rows by default.

Expand All @@ -102,7 +98,7 @@ This default maximum can be changed, and there is no hard upper limit to what it
print(service.hardlimit)
```

## List the columns in the chosen table
## 4. List the columns in the chosen table

This table contains 301 columns.

Expand All @@ -118,79 +114,152 @@ for col in columns:
print(f'{f"{col.name}":30s} {col.description}')
```

## Create a histogram of redshifts
## 5. Retrieve a list of galaxies within a small area

Let's figure out what redshift range these galaxies cover. Since we found out above that it's a large catalog, we can start with a spatial search over a small area of 0.1 deg. The ADQL that is needed for the spatial constraint is:
Since we know that cosmoDC2 is a large catalog, we can start with a spatial search over a small square area. The ADQL that is needed for the spatial constraint is shown below. We then show how to make a redshift histogram of the sample generated.

```{code-cell} ipython3
adql = f"SELECT redshift FROM {tablename} WHERE CONTAINS(POINT('ICRS', RAMean, DecMean), CIRCLE('ICRS',54.218205903,-37.497959343,.1))=1"
adql
```
# Setup the query
adql = f"""
SELECT redshift
FROM {tablename}
WHERE CONTAINS(
POINT('ICRS', ra, dec),
CIRCLE('ICRS', 54.0, -37.0, 0.05)
) = 1
"""

Now we can use the previously-defined service to execute the query with the spatial contraint.
cone_results = service.run_sync(adql)
```

```{code-cell} ipython3
cone_results = service.run_sync(adql)
#how many redshifts does this return?
print(len(cone_results))
```

```{code-cell} ipython3
# Plot a histogram
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
# Now that we have a list of galaxy redshifts in that region, we can
# create a histogram of the redshifts to see what redshifts this survey includes.

# Plot a histogram
num_bins = 20
# the histogram of the data
n, bins, patches = plt.hist(cone_results['redshift'], num_bins,
facecolor='blue', alpha = 0.5)
plt.xlabel('Redshift')
plt.ylabel('Number')
plt.title('Redshift Histogram CosmoDC2 Mock Catalog V1 abridged')
plt.title(f'Redshift Histogram {tablename}')
```

We can easily see form this plot that the simulated galaxies go out to z = 3.
We can see form this plot that the simulated galaxies go out to z = 3.

+++

## Visualize galaxy colors at z ~ 0.5

Now let's visualize the galaxy main sequence at z = 2.0. First, we'll do a narrow redshift cut with no spatial constraint.
## 6. Visualize galaxy colors: redshift search

Let's do it as an asynchronous search since this might take awhile, too.
First, we'll do a narrow redshift cut with no spatial constraint. Then, from that redshift sample we will visualize the galaxy main sequence at z = 2.0.

```{code-cell} ipython3
service = vo.dal.TAPService("https://irsa.ipac.caltech.edu/TAP")
adql = f"SELECT Mag_true_r_sdss_z0, Mag_true_g_sdss_z0, redshift FROM {tablename} WHERE redshift > 0.5 and redshift < 0.54"
results = service.run_async(adql)
# Setup the query
adql = f"""
SELECT TOP 50000
mag_r_lsst,
(mag_g_lsst - mag_r_lsst) AS color,
redshift
FROM {tablename}
WHERE redshift BETWEEN 1.95 AND 2.05
"""
redshift_results = service.run_sync(adql)
```

```{code-cell} ipython3
len(results['mag_true_r_sdss_z0'])
redshift_results
```

```{code-cell} ipython3
# Since this results in almost 4 million galaxies,
# we will construct a 2D histogram rather than a scatter plot.
plt.hist2d(results['mag_true_r_sdss_z0'], results['mag_true_g_sdss_z0']-results['mag_true_r_sdss_z0'],
bins=200, cmap='plasma', cmax=500)
# Construct a 2D histogram of the galaxy colors
plt.hist2d(redshift_results['mag_r_lsst'], redshift_results['color'],
bins=100, cmap='plasma', cmax=500)

# Plot a colorbar with label.
cb = plt.colorbar()
cb.set_label('Number')

# Add title and labels to plot.
plt.xlabel('SDSS Mag r')
plt.ylabel('SDSS rest-frame g-r color')
plt.xlabel('LSST Mag r')
plt.ylabel('LSST rest-frame g-r color')
```

## 7. Suggestions for further queries:
TAP queries are extremely powerful and provide flexible ways to explore large catalogs like CosmoDC2, including spatial searches, photometric selections, cross-matching, and more.
However, many valid ADQL queries can take minutes or longer to complete due to the size of the catalog, so we avoid running those directly in this tutorial.
Instead, the examples here have so far focused on fast, lightweight queries that illustrate the key concepts without long wait times.
If you are interested in exploring further, here are some additional query ideas that are scientifically useful but may take longer to run depending on server conditions.

### Count the total number of redshifts in the chosen table
The answer for the `'cosmodc2mockv1_heavy'` table is 597,488,849 redshifts.

```sql
adql = f"SELECT count(redshift) FROM {tablename}"
```

### Count galaxies in a sky region (cone search)
Generally useful for: estimating source density, validating spatial footprint, testing spatial completeness.

```sql
adql = f"""
SELECT COUNT(*)
FROM {tablename}
WHERE CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', 54.2, -37.5, 0.2)) = 1
"""
```

### Retrieve only a subset of columns (recommended for speed) and rows
This use of "TOP 5000" just limits the number of rows returned.
Remove it if you want all rows, but keep in mind such a query can take a much longer time.

```sql
adql = f"""
SELECT TOP 5000
ra,
dec,
redshift,
stellar_mass
FROM {tablename}"""
```

# Show the plot.
plt.show()
### Explore the stellar–halo mass relation

```sql
adql = f"""
SELECT TOP 500000
stellar_mass,
halo_mass
FROM {tablename}
WHERE halo_mass > 1e11"""
```

### Find the brightest galaxies at high redshift
Return the results in ascending (ASC) order by r band magnitude.

```sql
adql = f"""
SELECT TOP 10000
ra, dec, redshift, mag_r_lsst
FROM {tablename}
WHERE redshift > 2.5
ORDER BY mag_r_lsst ASC
"""
```
***

+++

## About this notebook

**Author:** Vandana Desai (IRSA Science Lead)
**Author:** IRSA Data Science Team, including Vandana Desai, Jessica Krick, Troy Raen, Brigitta Sipőcz, Andreas Faisst, Jaladh Singhal

**Updated:** 2024-07-24
**Updated:** 2025-12-16

**Contact:** [the IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or reporting problems.

**Runtime:** As of the date above, this notebook takes about 2 minutes to run to completion on a machine with 8GB RAM and 2 CPU. Large variations in this runtime can be expected if the TAP server is busy with many queries at once.