
Commit ee8a47a

code to generate Gauss stats plot
1 parent 3de4524 commit ee8a47a

3 files changed

Lines changed: 126 additions & 8 deletions


book/chapters/chapter11.md

Lines changed: 8 additions & 8 deletions
@@ -23,12 +23,12 @@ Most of the statistical methods used in paleomagnetism have direct analogies to
 
 Any statistical method for determining a mean (and confidence limit) from a set of observations is based on a probability density function. This function describes the distribution of observations for a hypothetical, infinite set of observations called a population. The Gaussian probability density function (normal distribution) has the familiar bell-shaped form shown in [Figure %s](#fig:gauss)a. The meaning of the probability density function $f(z)$ is that the proportion of observations within an interval of incremental width $dz$ centered on $z$ is $f(z) dz$.
 
-:::{figure} ../figures/chapter11/gauss.png
+:::{figure} ../figures/chapter11/gauss_code.png
 :name: fig:gauss
-:alt: Four-panel plot: a) bell-shaped Gaussian PDF, b) histogram of 1000 bed thickness measurements with normal curve overlay, c) narrow histogram of 100 sample means, d) skewed chi-squared histogram of variances.
+:alt: Four-panel plot: a) bell-shaped Gaussian PDF, b) histogram of 1000 simulated bed thickness measurements with normal curve overlay, c) narrow histogram of 100 sample means, d) skewed chi-squared histogram of variances.
 :width: 100%
 
-a) The Gaussian probability density function (normal distribution, [Equation %s](#eq:normal)). The proportion of observations within an interval $dz$ centered on $z$ is $f(z)dz$. b) Histogram of 1000 measurements of bed thickness in a sedimentary formation. Also shown is the smooth curve of a normal distribution with a mean of 10 and a standard deviation of 3. c) Histogram of the means from 100 repeated sets of 1000 measurements from the same sedimentary formation. The distribution of the means is much tighter. d) Histogram of the variances ($s^2$) from the same set of experiments as in c). The distribution of variances is not bell shaped; it is $\chi^2$.
+a) The Gaussian probability density function (normal distribution, [Equation %s](#eq:normal)). The proportion of observations within an interval $dz$ centered on $z$ is $f(z)dz$. b) Histogram of 1000 simulated measurements of bed thickness in a sedimentary formation, drawn from a normal distribution with a mean of 15 and a standard deviation of 3. Also shown is the smooth curve of the generating distribution. c) Histogram of the means from 100 repeated sets of 1000 measurements from the same distribution. The distribution of the means is much tighter. d) Histogram of the variances ($s^2$) from the same set of experiments as in c). The distribution of variances is not bell shaped; it is $\chi^2$.
 :::
 
 The Gaussian probability density function is given by:
@@ -55,7 +55,7 @@ $$
 
 where $N$ is the number of measurements and $x_i$ is an individual measurement.
 
-The mean estimated from the data shown in [Figure %s](#fig:gauss)b is 10.09. If we had measured an infinite number of bed thicknesses, we would have gotten the bell curve shown as the dashed line and calculated a mean of 10.
+The mean estimated from the data shown in [Figure %s](#fig:gauss)b is 14.91. If we had measured an infinite number of bed thicknesses, we would have gotten the bell curve shown as the dashed line and calculated a mean of 15.
 
 The "spread" in the data is characterized by the *variance* $\sigma^2$. Variance for normal distributions can be estimated by the statistic $s^2$:
@@ -65,17 +65,17 @@ $$ (eq:sigma)
 
 In order to get the units right on the spread about the mean (cm -- not cm$^2$), we have to take the square root of $s^2$. The statistic $s$ gives an estimate of the standard deviation $\sigma$ and defines the bounds around the mean that include 68% of the values. The 95% confidence bounds are given by 1.96$s$ (this is what a "2-$\sigma$ error" is), and should include 95% of the observations. The bell curve shown in [Figure %s](#fig:gauss)b has a $\sigma$ (standard deviation) of 3, while the $s$ is 2.97.
 
-If you repeat the bed measuring experiment a few times, you will never get exactly the same measurements in the different trials. The mean and standard deviations measured for each trial then are "sample" means and standard deviations. If you plotted up all those sample means, you would get another normal distribution whose mean should be pretty close to the true mean, but with a much more narrow standard deviation. In [Figure %s](#fig:gauss)c we plot a histogram of means from 100 such trials of 1000 measurements each drawn from the same distribution of $\mu = 10, \sigma = 3$. In general, we expect the standard deviation of the means (or *standard error of the mean*, $s_m$) to be related to $s$ by
+If you repeat the bed-measuring experiment several times, you will never get exactly the same measurements across trials. The mean and standard deviation measured for each trial are then "sample" means and standard deviations. If you plotted all of those sample means, you would get another normal distribution whose mean should be close to the true mean, but with a much narrower standard deviation. In [Figure %s](#fig:gauss)c we plot a histogram of means from 100 such trials of 1000 measurements each, drawn from the same distribution of $\mu = 15, \sigma = 3$. In general, we expect the standard deviation of the means (or *standard error of the mean*, $s_m$) to be related to $s$ by
 
 $$
 s_m = \frac{s}{\sqrt{N}},
 $$
 
 where $N$ is the number of measurements in each trial. What if we were to plot a histogram of the estimated variances as in [Figure %s](#fig:gauss)d? Are these also normally distributed? The answer is no, because variance is a squared parameter relative to the original units. In fact, the distribution of variance estimates from normal distributions is expected to be *chi-squared* ($\chi^2$). The width of the $\chi^2$ distribution is also governed by how many measurements were made. The so-called number of *degrees of freedom* ($\nu$) is given by the number of measurements made minus the number of measurements required to make the estimate, so $\nu$ for our case is $N-1$. Therefore we expect the variance estimates to follow a $\chi^2$ distribution with $\nu = N-1$ degrees of freedom, $\chi^2_{\nu}$.
 
-The estimated standard error of the mean, $s_m$, provides a confidence limit for the calculated mean. Of all the possible samples that can be drawn from a particular normal distribution, 95% have means, $\bar x$, within 2$s_m$ of $\bar x$. (Only 5% of possible samples have means that lie farther than 2$s_m$ from $\bar x$.) Thus the 95% confidence limit on the calculated mean, $\bar x$, is 2$s_m$, and we are 95% certain that the true mean of the population from which the sample was drawn lies within 2$s_m$ of $\bar x$. The estimated standard error of the mean, $s_m$ decreases 1/$\sqrt{N}$. Larger samples provide more precise estimations of the true mean; this is reflected in the smaller confidence limit with increasing $N$.
+The estimated standard error of the mean, $s_m$, provides a confidence limit for the calculated mean. Of all the possible samples that can be drawn from a particular normal distribution, 95% have means, $\bar x$, within 2$s_m$ of the true mean $\mu$. (Only 5% of possible samples have means that lie farther than 2$s_m$ from $\mu$.) Thus, the 95% confidence limit on the calculated mean, $\bar x$, is 2$s_m$, and we are 95% certain that the true mean of the population from which the sample was drawn lies within 2$s_m$ of $\bar x$. The estimated standard error of the mean, $s_m$, decreases as 1/$\sqrt{N}$. Larger samples provide more precise estimates of the true mean; this is reflected in the smaller confidence limit with increasing $N$.
 
-We often wish to consider ratios of variances derived from normal distributions (for example to decide if the data are more scattered in one data set relative to another). In order to do this, we must know what ratio would be expected from data sets drawn from the same distributions. Ratios of such variances follow a so-called $F$ distribution with $\nu_1$ and $\nu_2$ degrees of freedom for the two data sets. This is denoted $F[\nu_1,\nu_2]$. Thus if the ratio $F$, given by:
+We often wish to consider ratios of variances derived from normal distributions (for example, to decide whether one data set is more scattered than another). In order to do this, we must know what ratio would be expected from data sets drawn from the same distribution. Ratios of such variances follow a so-called $F$ distribution with $\nu_1$ and $\nu_2$ degrees of freedom for the two data sets. This is denoted $F[\nu_1,\nu_2]$. Thus, if the ratio $F$, given by:
 
 $$
 F = \frac{s_1^2}{s_2^2},
@@ -99,7 +99,7 @@ Here $\nu = N_1 + N_2 - 2$. If this number is below a critical value for $t$ the
 
 ## Statistics of Vectors
 
-We turn now to the trickier problem of sets of measured vectors. We will consider the case in which all vectors are assumed to have a length of one, i.e., these are unit vectors. Unit vectors are just "directions". Paleomagnetic directional data are subject to a number of factors that lead to scatter. These include:
+We turn now to the trickier problem of sets of measured vectors. We will consider the case in which all vectors are assumed to have a length of one, i.e., these are unit vectors. Unit vectors are just "directions." Paleomagnetic directional data are subject to a number of factors that lead to scatter. These include:
 
 1. uncertainty in the measurement caused by instrument noise or specimen alignment errors,
 2. uncertainties in sample orientation,
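Two of the quantitative claims in the chapter diff above — a sample mean near the true mean of 15, and trial means scattered with standard error $s_m = s/\sqrt{N}$, where $N$ is the number of measurements per trial — can be checked with a short NumPy simulation. This is a sketch, not part of the commit; the seed and tolerances are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; exact values are illustrative
mu, sigma = 15.0, 3.0
n_trials, n_per_trial = 100, 1000

# 100 repeated trials of 1000 "bed thickness" measurements each
trials = rng.normal(mu, sigma, size=(n_trials, n_per_trial))

# One trial's sample mean estimates the true mean mu = 15
xbar = trials[0].mean()

# Scatter of the 100 trial means vs. the predicted standard error
observed_sm = trials.mean(axis=1).std(ddof=1)
predicted_sm = sigma / np.sqrt(n_per_trial)
print(xbar, observed_sm, predicted_sm)
```

With 1000 measurements per trial, the predicted standard error is $3/\sqrt{1000} \approx 0.095$, which is why the histogram of means in panel c is so much tighter than the histogram of raw measurements in panel b.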
Binary file changed (174 KB; diff not rendered)
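The $F$-ratio and two-sample $t$-test machinery described in the chapter text can be exercised directly with SciPy's standard `scipy.stats.f.ppf` and `scipy.stats.ttest_ind` calls. A sketch on synthetic data drawn from a single distribution, so both tests should fail to find a difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
a = rng.normal(15.0, 3.0, size=1000)  # two synthetic data sets drawn
b = rng.normal(15.0, 3.0, size=1000)  # from the same normal distribution

# Ratio of sample variances; near 1 when both sets share a distribution
F = a.var(ddof=1) / b.var(ddof=1)

# 95% critical value of F[nu1, nu2], with nu = N - 1 for each data set
F_crit = stats.f.ppf(0.95, len(a) - 1, len(b) - 1)

# Two-sample t test on the means (nu = N1 + N2 - 2)
t_stat, p_val = stats.ttest_ind(a, b)
print(F, F_crit, p_val)
```

If the computed $F$ exceeded `F_crit`, we would conclude (at 95% confidence) that the two data sets have different variances.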

scripts/chapter11_gauss.py

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@ (new file)

"""Generate four-panel Gaussian/CLT/chi-squared demonstration figure.

Reproduces the logic of gauss.png (Chapter 11):
a) Standard normal probability density function with square markers
b) Histogram of N=1000 draws from N(mu=15, sigma=3) with PDF overlay
c) Histogram of sample means from 100 repeated trials (CLT demonstration)
d) Histogram of sample variances from the same trials (chi-squared shape)
"""

import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

# Use project figure style
sys.path.insert(0, str(Path(__file__).parent))
from figure_style import apply_mpl_style

apply_mpl_style()

# Reproducible random state
rng = np.random.default_rng(42)

# Parameters matching the original figure description
mu = 15.0
sigma = 3.0
n_single = 1000
n_trials = 100
n_per_trial = 1000

# --- Generate data ---
# Panel b: one set of 1000 bed-thickness measurements
bed_thickness = rng.normal(loc=mu, scale=sigma, size=n_single)

# Panels c & d: 100 repeated trials of 1000 measurements each
repeated_trials = rng.normal(loc=mu, scale=sigma, size=(n_trials, n_per_trial))
trial_means = repeated_trials.mean(axis=1)
trial_variances = repeated_trials.var(axis=1, ddof=1)

# --- Create figure ---
fig, axes = plt.subplots(2, 2, figsize=(8, 7))

# --- Panel a: Standard normal PDF ---
ax = axes[0, 0]
z = np.linspace(-4, 4, 500)
pdf_z = norm.pdf(z)

# Red dashed line with red square markers (matching original)
z_markers = np.linspace(-3.5, 3.5, 50)
ax.plot(z, pdf_z, 'r--', lw=1.5)
ax.plot(z_markers, norm.pdf(z_markers), 's', color='red',
        markersize=4, markeredgecolor='black', markeredgewidth=0.5)
ax.axvline(0.0, color='grey', lw=0.8, alpha=0.5)
ax.set_xlim(-4, 4)
ax.set_ylim(0, 0.42)
ax.set_xlabel('z')
ax.set_ylabel('f(z)')
ax.text(0.05, 0.90, 'a)', transform=ax.transAxes, fontsize=14,
        fontweight='bold')

# --- Panel b: Histogram of bed thickness with normal PDF overlay ---
ax = axes[0, 1]
bin_width_b = 0.5
bins_b = np.arange(mu - 5 * sigma, mu + 5 * sigma, bin_width_b)
ax.hist(bed_thickness, bins=bins_b,
        histtype='step', color='black', linewidth=0.8)

x_fit = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 400)
ax.plot(x_fit, n_single * bin_width_b * norm.pdf(x_fit, loc=mu, scale=sigma),
        'r--', lw=2)
ax.set_xlabel('Bed thickness (cm)')
ax.set_ylabel('Count')
ax.set_xlim(mu - 4 * sigma, mu + 4 * sigma)
ax.text(0.05, 0.90, 'b)', transform=ax.transAxes, fontsize=14,
        fontweight='bold')
ax.text(0.65, 0.90, f'N = {n_single}', transform=ax.transAxes,
        fontsize=11)

# --- Panel c: Histogram of sample means ---
ax = axes[1, 0]
sigma_mean = sigma / np.sqrt(n_per_trial)
bins_c = np.linspace(trial_means.min() - 0.05, trial_means.max() + 0.05, 25)
bin_width_c = bins_c[1] - bins_c[0]
ax.hist(trial_means, bins=bins_c,
        histtype='step', color='black', linewidth=0.8)

x_fit_c = np.linspace(trial_means.min() - 3 * sigma_mean,
                      trial_means.max() + 3 * sigma_mean, 400)
ax.plot(x_fit_c, n_trials * bin_width_c * norm.pdf(x_fit_c, loc=mu, scale=sigma_mean),
        'r--', lw=2)
ax.set_xlabel('Means of repeat trials')
ax.set_ylabel('Count')
ax.text(0.05, 0.90, 'c)', transform=ax.transAxes, fontsize=14,
        fontweight='bold')
ax.text(0.60, 0.90, f'N = {n_trials}', transform=ax.transAxes,
        fontsize=11)

# --- Panel d: Histogram of sample variances ---
ax = axes[1, 1]
bins_d = np.linspace(trial_variances.min(), trial_variances.max(), 25)
ax.hist(trial_variances, bins=bins_d,
        histtype='step', color='black', linewidth=0.8)
ax.set_xlabel('Variance')
ax.set_ylabel('Count')
ax.text(0.05, 0.90, 'd)', transform=ax.transAxes, fontsize=14,
        fontweight='bold')
ax.text(0.60, 0.90, f'N = {n_trials}', transform=ax.transAxes,
        fontsize=11)

fig.tight_layout()

outpath = Path(__file__).parent.parent / 'book' / 'figures' / 'chapter11' / 'gauss_code.png'
fig.savefig(outpath, dpi=300, bbox_inches='tight', facecolor='white')
print(f'Saved to {outpath}')
plt.close(fig)
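The chi-squared shape plotted in panel d can also be checked numerically: for normal data, $(N-1)s^2/\sigma^2$ follows $\chi^2_{N-1}$, so the scaled variances should average near $N-1$ and the raw variances near $\sigma^2$. A sketch with an arbitrary seed and deliberately loose tolerances:

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
mu, sigma, n = 15.0, 3.0, 1000

# 100 repeated trials, as in panels c and d of the figure
s2 = rng.normal(mu, sigma, size=(100, n)).var(axis=1, ddof=1)

# (N - 1) s^2 / sigma^2 ~ chi^2 with N - 1 degrees of freedom, so its
# mean should sit near N - 1 = 999; the raw variances average near 9
scaled = (n - 1) * s2 / sigma**2
print(s2.mean(), scaled.mean())
```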
