
Commit ee8a47a

code to generate Gauss stats plot
1 parent 3de4524 commit ee8a47a

3 files changed

Lines changed: 126 additions & 8 deletions


book/chapters/chapter11.md

Lines changed: 8 additions & 8 deletions
@@ -23,12 +23,12 @@ Most of the statistical methods used in paleomagnetism have direct analogies to
 
 Any statistical method for determining a mean (and confidence limit) from a set of observations is based on a probability density function. This function describes the distribution of observations for a hypothetical, infinite set of observations called a population. The Gaussian probability density function (normal distribution) has the familiar bell-shaped form shown in [Figure %s](#fig:gauss)a. The meaning of the probability density function $f(z)$ is that the proportion of observations within an interval of incremental width $dz$ centered on $z$ is $f(z) dz$.
 
-:::{figure} ../figures/chapter11/gauss.png
+:::{figure} ../figures/chapter11/gauss_code.png
 :name: fig:gauss
-:alt: Four-panel plot: a) bell-shaped Gaussian PDF, b) histogram of 1000 bed thickness measurements with normal curve overlay, c) narrow histogram of 100 sample means, d) skewed chi-squared histogram of variances.
+:alt: Four-panel plot: a) bell-shaped Gaussian PDF, b) histogram of 1000 simulated bed thickness measurements with normal curve overlay, c) narrow histogram of 100 sample means, d) skewed chi-squared histogram of variances.
 :width: 100%
 
-a) The Gaussian probability density function (normal distribution, [Equation %s](#eq:normal)). The proportion of observations within an interval $dz$ centered on $z$ is $f(z)dz$. b) Histogram of 1000 measurements of bed thickness in a sedimentary formation. Also shown is the smooth curve of a normal distribution with a mean of 10 and a standard deviation of 3. c) Histogram of the means from 100 repeated sets of 1000 measurements from the same sedimentary formation. The distribution of the means is much tighter. d) Histogram of the variances ($s^2$) from the same set of experiments as in c). The distribution of variances is not bell shaped; it is $\chi^2$.
+a) The Gaussian probability density function (normal distribution, [Equation %s](#eq:normal)). The proportion of observations within an interval $dz$ centered on $z$ is $f(z)dz$. b) Histogram of 1000 simulated measurements of bed thickness in a sedimentary formation, drawn from a normal distribution with a mean of 15 and a standard deviation of 3. Also shown is the smooth curve of the generating distribution. c) Histogram of the means from 100 repeated sets of 1000 measurements from the same distribution. The distribution of the means is much tighter. d) Histogram of the variances ($s^2$) from the same set of experiments as in c). The distribution of variances is not bell shaped; it is $\chi^2$.
 :::
 
 The Gaussian probability density function is given by:
@@ -55,7 +55,7 @@ $$
 
 where $N$ is the number of measurements and $x_i$ is an individual measurement.
 
-The mean estimated from the data shown in [Figure %s](#fig:gauss)b is 10.09. If we had measured an infinite number of bed thicknesses, we would have gotten the bell curve shown as the dashed line and calculated a mean of 10.
+The mean estimated from the data shown in [Figure %s](#fig:gauss)b is 14.91. If we had measured an infinite number of bed thicknesses, we would have gotten the bell curve shown as the dashed line and calculated a mean of 15.
 
 The "spread" in the data is characterized by the *variance* $\sigma^2$. Variance for normal distributions can be estimated by the statistic $s^2$:
@@ -65,17 +65,17 @@ $$ (eq:sigma)
 
 In order to get the units right on the spread about the mean (cm -- not cm$^2$), we have to take the square root of $s^2$. The statistic $s$ gives an estimate of the standard deviation $\sigma$ and defines the bounds around the mean that include 68% of the values. The 95% confidence bounds are given by 1.96$s$ (this is what a "2-$\sigma$ error" is), and should include 95% of the observations. The bell curve shown in [Figure %s](#fig:gauss)b has a $\sigma$ (standard deviation) of 3, while the $s$ is 2.97.
 
-If you repeat the bed measuring experiment a few times, you will never get exactly the same measurements in the different trials. The mean and standard deviations measured for each trial then are "sample" means and standard deviations. If you plotted up all those sample means, you would get another normal distribution whose mean should be pretty close to the true mean, but with a much more narrow standard deviation. In [Figure %s](#fig:gauss)c we plot a histogram of means from 100 such trials of 1000 measurements each drawn from the same distribution of $\mu = 10, \sigma = 3$. In general, we expect the standard deviation of the means (or *standard error of the mean*, $s_m$) to be related to $s$ by
+If you repeat the bed-measuring experiment several times, you will never get exactly the same measurements across trials. The mean and standard deviation measured for each trial are then "sample" means and standard deviations. If you plotted all of those sample means, you would get another normal distribution whose mean should be close to the true mean, but with a much narrower standard deviation. In [Figure %s](#fig:gauss)c we plot a histogram of means from 100 such trials of 1000 measurements each, drawn from the same distribution of $\mu = 15, \sigma = 3$. In general, we expect the standard deviation of the means (or *standard error of the mean*, $s_m$) to be related to $s$ by
 
 $$
 s_m = \frac{s}{\sqrt{N}},
 $$
 
 where $N$ is the number of measurements in each trial. What if we were to plot a histogram of the estimated variances as in [Figure %s](#fig:gauss)d? Are these also normally distributed? The answer is no, because variance is a squared parameter relative to the original units. In fact, the distribution of variance estimates from normal distributions is expected to be *chi-squared* ($\chi^2$). The width of the $\chi^2$ distribution is also governed by how many measurements were made. The so-called number of *degrees of freedom* ($\nu$) is given by the number of measurements made minus the number of measurements required to make the estimate, so $\nu$ for our case is $N-1$. Therefore we expect the variance estimates to follow a $\chi^2$ distribution with $\nu = N-1$ degrees of freedom, $\chi^2_{\nu}$.
 
-The estimated standard error of the mean, $s_m$, provides a confidence limit for the calculated mean. Of all the possible samples that can be drawn from a particular normal distribution, 95% have means, $\bar x$, within 2$s_m$ of $\bar x$. (Only 5% of possible samples have means that lie farther than 2$s_m$ from $\bar x$.) Thus the 95% confidence limit on the calculated mean, $\bar x$, is 2$s_m$, and we are 95% certain that the true mean of the population from which the sample was drawn lies within 2$s_m$ of $\bar x$. The estimated standard error of the mean, $s_m$ decreases 1/$\sqrt{N}$. Larger samples provide more precise estimations of the true mean; this is reflected in the smaller confidence limit with increasing $N$.
+The estimated standard error of the mean, $s_m$, provides a confidence limit for the calculated mean. Of all the possible samples that can be drawn from a particular normal distribution, 95% have means, $\bar x$, within 2$s_m$ of the true mean $\mu$. (Only 5% of possible samples have means that lie farther than 2$s_m$ from $\mu$.) Thus, the 95% confidence limit on the calculated mean, $\bar x$, is 2$s_m$, and we are 95% certain that the true mean of the population from which the sample was drawn lies within 2$s_m$ of $\bar x$. The estimated standard error of the mean, $s_m$, decreases as 1/$\sqrt{N}$. Larger samples provide more precise estimates of the true mean; this is reflected in the smaller confidence limit with increasing $N$.
 
-We often wish to consider ratios of variances derived from normal distributions (for example to decide if the data are more scattered in one data set relative to another). In order to do this, we must know what ratio would be expected from data sets drawn from the same distributions. Ratios of such variances follow a so-called $F$ distribution with $\nu_1$ and $\nu_2$ degrees of freedom for the two data sets. This is denoted $F[\nu_1,\nu_2]$. Thus if the ratio $F$, given by:
+We often wish to consider ratios of variances derived from normal distributions (for example, to decide whether one data set is more scattered than another). In order to do this, we must know what ratio would be expected from data sets drawn from the same distribution. Ratios of such variances follow a so-called $F$ distribution with $\nu_1$ and $\nu_2$ degrees of freedom for the two data sets. This is denoted $F[\nu_1,\nu_2]$. Thus, if the ratio $F$, given by:
 
 $$
 F = \frac{s_1^2}{s_2^2},
@@ -99,7 +99,7 @@ Here $\nu = N_1 + N_2 - 2$. If this number is below a critical value for $t$ the
 
 ## Statistics of Vectors
 
-We turn now to the trickier problem of sets of measured vectors. We will consider the case in which all vectors are assumed to have a length of one, i.e., these are unit vectors. Unit vectors are just "directions". Paleomagnetic directional data are subject to a number of factors that lead to scatter. These include:
+We turn now to the trickier problem of sets of measured vectors. We will consider the case in which all vectors are assumed to have a length of one, i.e., these are unit vectors. Unit vectors are just "directions." Paleomagnetic directional data are subject to a number of factors that lead to scatter. These include:
 
 1. uncertainty in the measurement caused by instrument noise or specimen alignment errors,
 2. uncertainties in sample orientation,
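Two of the quantitative claims in the chapter diff above — a sample mean near the true mean of 15, and trial means scattered with standard error $s_m = s/\sqrt{N}$, where $N$ is the number of measurements per trial — can be checked with a short NumPy simulation. This is a sketch, not part of the commit; the seed and tolerances are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; exact values are illustrative
mu, sigma = 15.0, 3.0
n_trials, n_per_trial = 100, 1000

# 100 repeated trials of 1000 "bed thickness" measurements each
trials = rng.normal(mu, sigma, size=(n_trials, n_per_trial))

# One trial's sample mean estimates the true mean mu = 15
xbar = trials[0].mean()

# Scatter of the 100 trial means vs. the predicted standard error
observed_sm = trials.mean(axis=1).std(ddof=1)
predicted_sm = sigma / np.sqrt(n_per_trial)
print(xbar, observed_sm, predicted_sm)
```

With 1000 measurements per trial, the predicted standard error is $3/\sqrt{1000} \approx 0.095$, which is why the histogram of means in panel c is so much tighter than the histogram of raw measurements in panel b.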
Binary file changed (174 KB; diff not rendered)
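The $F$-ratio and two-sample $t$-test machinery described in the chapter text can be exercised directly with SciPy's standard `scipy.stats.f.ppf` and `scipy.stats.ttest_ind` calls. A sketch on synthetic data drawn from a single distribution, so both tests should fail to find a difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
a = rng.normal(15.0, 3.0, size=1000)  # two synthetic data sets drawn
b = rng.normal(15.0, 3.0, size=1000)  # from the same normal distribution

# Ratio of sample variances; near 1 when both sets share a distribution
F = a.var(ddof=1) / b.var(ddof=1)

# 95% critical value of F[nu1, nu2], with nu = N - 1 for each data set
F_crit = stats.f.ppf(0.95, len(a) - 1, len(b) - 1)

# Two-sample t test on the means (nu = N1 + N2 - 2)
t_stat, p_val = stats.ttest_ind(a, b)
print(F, F_crit, p_val)
```

If the computed $F$ exceeded `F_crit`, we would conclude (at 95% confidence) that the two data sets have different variances.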

scripts/chapter11_gauss.py

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@ (new file)

"""Generate four-panel Gaussian/CLT/chi-squared demonstration figure.

Reproduces the logic of gauss.png (Chapter 11):
a) Standard normal probability density function with square markers
b) Histogram of N=1000 draws from N(mu=15, sigma=3) with PDF overlay
c) Histogram of sample means from 100 repeated trials (CLT demonstration)
d) Histogram of sample variances from the same trials (chi-squared shape)
"""

import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

# Use project figure style
sys.path.insert(0, str(Path(__file__).parent))
from figure_style import apply_mpl_style

apply_mpl_style()

# Reproducible random state
rng = np.random.default_rng(42)

# Parameters matching the original figure description
mu = 15.0
sigma = 3.0
n_single = 1000
n_trials = 100
n_per_trial = 1000

# --- Generate data ---
# Panel b: one set of 1000 bed-thickness measurements
bed_thickness = rng.normal(loc=mu, scale=sigma, size=n_single)

# Panels c & d: 100 repeated trials of 1000 measurements each
repeated_trials = rng.normal(loc=mu, scale=sigma, size=(n_trials, n_per_trial))
trial_means = repeated_trials.mean(axis=1)
trial_variances = repeated_trials.var(axis=1, ddof=1)

# --- Create figure ---
fig, axes = plt.subplots(2, 2, figsize=(8, 7))

# --- Panel a: Standard normal PDF ---
ax = axes[0, 0]
z = np.linspace(-4, 4, 500)
pdf_z = norm.pdf(z)

# Red dashed line with red square markers (matching original)
z_markers = np.linspace(-3.5, 3.5, 50)
ax.plot(z, pdf_z, 'r--', lw=1.5)
ax.plot(z_markers, norm.pdf(z_markers), 's', color='red',
        markersize=4, markeredgecolor='black', markeredgewidth=0.5)
ax.axvline(0.0, color='grey', lw=0.8, alpha=0.5)
ax.set_xlim(-4, 4)
ax.set_ylim(0, 0.42)
ax.set_xlabel('z')
ax.set_ylabel('f(z)')
ax.text(0.05, 0.90, 'a)', transform=ax.transAxes, fontsize=14,
        fontweight='bold')

# --- Panel b: Histogram of bed thickness with normal PDF overlay ---
ax = axes[0, 1]
bin_width_b = 0.5
bins_b = np.arange(mu - 5 * sigma, mu + 5 * sigma, bin_width_b)
ax.hist(bed_thickness, bins=bins_b,
        histtype='step', color='black', linewidth=0.8)

x_fit = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 400)
ax.plot(x_fit, n_single * bin_width_b * norm.pdf(x_fit, loc=mu, scale=sigma),
        'r--', lw=2)
ax.set_xlabel('Bed thickness (cm)')
ax.set_ylabel('Count')
ax.set_xlim(mu - 4 * sigma, mu + 4 * sigma)
ax.text(0.05, 0.90, 'b)', transform=ax.transAxes, fontsize=14,
        fontweight='bold')
ax.text(0.65, 0.90, f'N = {n_single}', transform=ax.transAxes,
        fontsize=11)

# --- Panel c: Histogram of sample means ---
ax = axes[1, 0]
sigma_mean = sigma / np.sqrt(n_per_trial)
bins_c = np.linspace(trial_means.min() - 0.05, trial_means.max() + 0.05, 25)
bin_width_c = bins_c[1] - bins_c[0]
ax.hist(trial_means, bins=bins_c,
        histtype='step', color='black', linewidth=0.8)

x_fit_c = np.linspace(trial_means.min() - 3 * sigma_mean,
                      trial_means.max() + 3 * sigma_mean, 400)
ax.plot(x_fit_c, n_trials * bin_width_c * norm.pdf(x_fit_c, loc=mu, scale=sigma_mean),
        'r--', lw=2)
ax.set_xlabel('Means of repeat trials')
ax.set_ylabel('Count')
ax.text(0.05, 0.90, 'c)', transform=ax.transAxes, fontsize=14,
        fontweight='bold')
ax.text(0.60, 0.90, f'N = {n_trials}', transform=ax.transAxes,
        fontsize=11)

# --- Panel d: Histogram of sample variances ---
ax = axes[1, 1]
bins_d = np.linspace(trial_variances.min(), trial_variances.max(), 25)
ax.hist(trial_variances, bins=bins_d,
        histtype='step', color='black', linewidth=0.8)
ax.set_xlabel('Variance')
ax.set_ylabel('Count')
ax.text(0.05, 0.90, 'd)', transform=ax.transAxes, fontsize=14,
        fontweight='bold')
ax.text(0.60, 0.90, f'N = {n_trials}', transform=ax.transAxes,
        fontsize=11)

fig.tight_layout()

outpath = Path(__file__).parent.parent / 'book' / 'figures' / 'chapter11' / 'gauss_code.png'
fig.savefig(outpath, dpi=300, bbox_inches='tight', facecolor='white')
print(f'Saved to {outpath}')
plt.close(fig)
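The chi-squared shape plotted in panel d can also be checked numerically: for normal data, $(N-1)s^2/\sigma^2$ follows $\chi^2_{N-1}$, so the scaled variances should average near $N-1$ and the raw variances near $\sigma^2$. A sketch with an arbitrary seed and deliberately loose tolerances:

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
mu, sigma, n = 15.0, 3.0, 1000

# 100 repeated trials, as in panels c and d of the figure
s2 = rng.normal(mu, sigma, size=(100, n)).var(axis=1, ddof=1)

# (N - 1) s^2 / sigma^2 ~ chi^2 with N - 1 degrees of freedom, so its
# mean should sit near N - 1 = 999; the raw variances average near 9
scaled = (n - 1) * s2 / sigma**2
print(s2.mean(), scaled.mean())
```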
