stephens999 · jgblanc · Mar 20, 2019 · Apr 5, 2019 · Apr 5, 2019 · Apr 5, 2019
diff --git a/analysis/Cont_Coal_Pic.jpg b/analysis/Cont_Coal_Pic.jpg
diff --git a/analysis/Discrete_Coal_Pic.jpg b/analysis/Discrete_Coal_Pic.jpg
diff --git a/analysis/The Ancestral Process.Rmd b/analysis/The Ancestral Process.Rmd
@@ -0,0 +1,195 @@
+---
+title: "The Ancestral Process"
+author: "Jennifer Blanc"
+date: "3/14/2019"
+output: html_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+rm(list = ls())
+library(reshape2)
+library(ggplot2)
+library(expm)
+library(ggpubr)
+set.seed(12)
+```
+
+### Pre-requisites 
+
+* The Coalescent Process
+* Continuous Time Markov Chains 
+
+### Ancestral Process  
+
+In coalescent theory, the ancestral process is the number of active ancestral lineages at time t. Recall that we can think of the coalescent process as modeling the relationship back in time between a sample of haploid individuals at a single neutral locus, in a population of constant size 2N. In the previous vignette, we showed the time until two individuals coalesce, $T_2$, is an exponential random variable with parameter 1. We then extended this idea to show that for a sample of size n, each coalescent coalescent time, $T_n, T_{n-1}, ... T_2$, is an independent random variable where $T_i \sim exp{{i}\choose{2}}$. So each time a coalescent event happens, we are transitioning from $i$ to $i-1$ lineages after waiting $T_i \sim exp{{i}\choose{2}}$ amount of time.  
+
+Keeping this in mind, we can define the ancestral process as a continuous time Markov chain where the states are the number of active lineages (1,2, .., n) in generation t:  
+
+$$A_n(t) := \text{number of distinct lineages in generation t of a sample of size n at time 0} $$
+
+### Deriving Kingman's Coalescent
+
+We will first re-derive the results from the previous vignette while explicitly considering the ancestral process as a continuous time Markov chain. Mainly, we will show that the coalescent process describes the ancestral process for a sample of fixed size n in the limit as N approaches infinity in the Wright-Fisher model. Remember that in the wright-fisher model we assume a fixed population size (in this case N) and that the j parents of i individuals are sampled with replacement from the N possible parents in the previous generation. This process is described by the expression below where i is the number of active lineages in the current generation and j is the number of ancestral lineages in the previous generation. 
+
+$$G_{i,j} := P(\text{i individuals have j separate parents)}= \frac{N(N-1)...(N-j+1)\delta_k^{(j)}}{N^i} \\ 1 \le j \le i$$
+
+This probability can be broken into three parts:  
+$\bullet \ N(N-1)...(N-j+1)$ is a descending factorial that describes the number of ways that you can choose j parents  
+$\bullet \ \delta_k^{(j)}$ is a Stirling number of the second kind, it describes the number of ways to assign k individuals to their j parents  
+$\bullet \ N^i$ is the total number of ways of assigning the number of i individuals to their parents (there are N possible parents) 
+
+Taken together, $G_{i,j}$ describes the probability of i individuals having j separate ancestors (remember that $j \le i$ because individuals cannot have more than one parent). Since our CTMC is defined as the number of lineages at a given time, $G_{ij}$ describes the transition probabilities for our CTMC:  
+
+$$P(A_n(t+1) = i|A_n(t) = j) = G_{ij}$$
+
+When we look closer at this transition probability, we can see that Kingman's coalescent does not apply when we have a fixed population size N. Right now, i individuals can have anywhere from i to 1 distinct parents. However, we know that Kingman's coalescent only allows only one coalescent event per generation, in other words, i individuals can have only i or i - 1 parents. We will now show that for large N, we recover Kingman's coalescent.    
+
+First let's consider the case where i individuals have i - 1 parents.  
+
+$$G_{i,i-1} = \delta_i^{(i-1)}\frac{N(N-1)...N(N-k+1)}{N^i} =  {{i}\choose{2}}\frac{1}{N}  + O(N^{-2})$$
+
+
+We can arrive at the above statement by first recalling that $\delta_k^{(k-1)} = {{k}\choose{2}}$ and recognizing that the other terms are proportional to $1/N^2$ and will decay quickly when N is is very large. 
+
+Next, we will consider the other two possibilities. First that i individuals have fewer than i - 1 parents and the case where they have exactly i parents:  
+
+$$G_{i,j} = \delta_i^{(j)}\frac{N(N-1)...N(N-j+1)}{N^i}= O(N^{-2}) \ \text{for} \ j <i-1$$
+
+$$G_{i,i} =  \delta_i^{(i)}N^{-i} N(N-1)..(N-i+1) = 1-{{i}\choose{2}}\frac{1}{N} + O(N^{-2})$$  
+
+Now we can show formally, that in the limit as N tends to infinity, the ancestral process converges to the continuous-time coalescent process. In order to do this we are again going to re-scale time so that one unit of t corresponds to N generations.   
+
+$$P(T_i^{(N)} > t) = G_{i,i}^{[Nt]} \rightarrow e^{-{{i}\choose{2}}t} \ \text{as} \ N \rightarrow \infty$$
+
+Here the notation $[Nt]$ means only the integer part of Nt. So while t could be any value greater than zero, the probability of $G_{i,i}^{[Nt]}$ only makes sense for whole generations. We have shown that the probability of a coalescent event, $T_i$ happens after time t happens with probability $e^{-{{i}\choose{2}}}$. This is that same coalescent time that we arrived at in the last vignette.   
+
+### Properties of the Ancestral Process  
+
+In the last section, we showed that as N tends to infinity, the ancestral process converges to the coalescent process. We can use the ancestral process transition probabilities, which we defined above to write the general transition matrix for this process:  
+
+$$G_{i,j}^{Nt} = (I + N^{-1}Q + O(N^{-2}))^{Nt} \rightarrow e^{Qt}, \ \text{as} \ N \rightarrow \infty$$
+
+Thus, the ancestral process is approximated by the Markov chain $A_n(t)$ whose behave is determined by the rate matrix Q and $A_n(0) = n$. Here Q is is an n x n matrix with non-zero entries given by:  
+
+$$q_{i,i} = -{{i}\choose{2}}, q_{i, i-1} = {{i}\choose{2}}, i = n, n-1, ..., 2 $$
+
+$$Q =
+  \left[ {\begin{array}{ccccc}
+   0 &  &  & \\
+   {{i}\choose{2}} & -{{i}\choose{2}} &  & \\
+   0 & {{i}\choose{2}} & -{{i}\choose{2}} &  \\
+   0 & 0 & {{i}\choose{2}} & -{{i}\choose{2}} \\
+   \vdots & & & & \ddots \\
+  \end{array} } \right]$$
+
+Intuitively, we can think about $q_{i,i-1}$ as the rate individuals coalesce, going from i to i-1 lineages. Therefore, $q_{i,i}$ is the rate that process is leaving the state corresponding to i=1. Since these are the only two options, we know the $A_n(t)$ is pure death process that starts at $A_n(0)$ and decreases by only one jump at a time. We can use Q to calculate the distribution of $A_n(t)$.  
+
+One way we can do this is write $Q = RDL$, where D is a diagonal matrix of eigenvalues $\lambda_i = - {{i}\choose{2}}$ and L and R are matrices of right and left eigenvalues of Q, normalized so that RL = LR = I. Using this method and some calculation, its possible to solve for the the distribution of $A_n(t)$:    
+
+$$G_{n,j} = P(A_n(t) = j) = \sum\limits_{i=j}^n e^{-i(i-1)t/2}\frac{(2i-1)(-1)^{i-j}j_{(i-1)}n_{[i]}}{j!(i-j)!n_{(i)}}  \\ s_{(n)} = s(s+1)...(s + n-1) \\ s_{[n]} = s(s-1) ... (s -n +1) \\ s_{(0)} = s_{[0]} = 1$$  
+
+The expectation of this distribution: 
+
+$$E[A_n(t)] = \sum\limits_{k=1}^n e^{-i(i-1)t/2}\frac{(2i-1)n_{[i]}}{n_{(i)}}$$
+
+Now we have a both a distribution and its expectation for the number of ancestral lineages for a sample of size n and different times, t. Lets plots the mean number of ancestors over time for different values of n:  
+
+
+```{r}
+up <- function(a,n) { # Function for s(n) calculations
+  x <- seq(a, (a+n-1))
+  return(prod(x))
+}
+
+down <- function(a,n) { #Function for s[n] calculations
+  x <- seq(a, (a-n+1))
+  return(prod(x))
+}
+
+
+mean_ancestral_lineages <- function(t, n) { 
+  tmp <- rep(NA, n)
+  for (i in 1:n) { # Evaluate espression for every value of i
+    tmp[i] <- exp(-i * (i-1) * (t /2)) * ((((2*i)-1) * down(n,i)) / up(n,i))
+  }
+  return(sum(tmp)) # Return sum over values of i 
+}
+```
+
+```{r}
+# Calculate the mean number of ancestral lineages for n = 5, 10, 50 
+x <- seq(0, 2, 0.01)
+df <- as.data.frame(cbind(x,(sapply(x,FUN = mean_ancestral_lineages, n = 5)), (sapply(x,FUN = mean_ancestral_lineages, n = 10)), (sapply(x,FUN = mean_ancestral_lineages, n = 50))))
+dat <- melt(df, id = "x")
+
+# Plot the data 
+ggplot(data = dat, aes(x = x, y = value, color = variable)) + geom_line() + theme_bw() + xlab("time") + ylab("Expecent Number Ancestral Lineages") + scale_color_manual(name = "n", labels = c("10", "50", "100"), values = c("red4", "navy", "darkgreen")) + geom_hline(yintercept = 1, linetype = 2)
+```
+
+
+Again, we can see that most coalescent events happen quickly as the number of active lineages quickly decreases and that most of the time is take up by the last coalescent event.  
+
+
+### Example  
+
+Let's do an example for a sample size of n = 5. Recalling that the ancestral process is a pure death CTMC with death rate of ${{i}\choose{2}}$, we can write down the Q matrix for our example:  
+
+$$Q =
+  \left[ {\begin{array}{ccccc}
+   0 &   \\
+   1 & -1 &  \\
+   0 & 2 & -2 \\
+   0 & 0 & 6 & -6 \\
+   0 & 0 & 0 & 10 & -10
+  \end{array} } \right]$$  
+
+
+Now, we can use the fact the the transition matrix, $G_{i,j}(t)$, is equal to $e^{Qt}$ when $N \rightarrow \infty$ to calculate the transition matrix for different time points.  
+
+```{r}
+# Enter Q matrix 
+Q <- matrix(c(0,0,0,0,0,1,-1,0,0,0,0,2,-2,0,0,0,0,6,-6,0,0,0,0,10,-10), nrow = 5, byrow = T)
+
+# Calculate transition matrix at different time points 
+P0.1 <- expm(Q * 0.1)
+P0.5 <- expm(Q * 0.5)
+P1 <- expm(Q * 1)
+P10 <- expm(Q * 10)
+```
+
+
+Let's visualize the transition matrix with heat maps for each time point:  
+```{r, fig.width=12}
+# Prepare data for plotting 
+dat_plot1 <- cbind(melt(P0.1, varnames = names(dimnames(P0.1)), na.rm = FALSE, as.is = FALSE, value.name = "value"), 0.1)
+dat_plot2 <- cbind(melt(P0.5, varnames = names(dimnames(P0.5)), na.rm = FALSE, as.is = FALSE, value.name = "value"), 0.5)
+dat_plot3 <- cbind(melt(P1, varnames = names(dimnames(P1)), na.rm = FALSE, as.is = FALSE, value.name = "value"), 1)
+dat_plot4 <- cbind(melt(P10, varnames = names(dimnames(P10)), na.rm = FALSE, as.is = FALSE, value.name = "value"), 10)
+
+# Make a heat map for each tranisition matrix
+p1 <- ggplot(data = dat_plot1, aes(x=Var2, y=Var1, fill=value)) + 
+  geom_tile() + theme_classic() + scale_fill_gradient2(low = "white", high = "navy") + geom_text(aes(Var2, Var1, label = round(value,2)), color = "black", size = 4) + scale_x_reverse() + scale_y_reverse() + xlab("j") + ylab("i")
+p2 <- ggplot(data = dat_plot2, aes(x=Var2, y=Var1, fill=value)) + 
+  geom_tile() + theme_classic() + scale_fill_gradient2(low = "white", high = "navy") + geom_text(aes(Var2, Var1, label = round(value,2)), color = "black", size = 4) + scale_x_reverse() + scale_y_reverse() + xlab("j") + ylab("i")
+p3 <- ggplot(data = dat_plot3, aes(x=Var2, y=Var1, fill=value)) + 
+  geom_tile() + theme_classic() + scale_fill_gradient2(low = "white", high = "navy") + geom_text(aes(Var2, Var1, label = round(value,2)), color = "black", size = 4) + scale_x_reverse() + scale_y_reverse() + xlab("j") + ylab("i")
+p4 <- ggplot(data = dat_plot4, aes(x=Var2, y=Var1, fill=value)) + 
+  geom_tile() + theme_classic() + scale_fill_gradient2(low = "white", high = "navy") + geom_text(aes(Var2, Var1, label = round(value,2)), color = "black", size = 4) + scale_x_reverse() + scale_y_reverse() + xlab("j") + ylab("i")
+
+# View plots 
+ggarrange(p1,p2,p3,p4, common.legend = T, ncol = 4, labels = c("0.1", "0.5", "1", "10"), hjust = -.3)
+```
+
+In each transition matrix, each cell in the lower diagonal matrix represents the probability of transitioning from i lineages to j lineages for the given time period. If we focus on the fifth row in each transition matrix, we can see that for t = 0.1 the highest probability is that one coalescent event will happen and the process will go from 5 lineages to 4. If we look at t = 0.5 and t = 1, the highest probability is to go from i = 5 to j = 3 and j = 2, respectively. Unsurprisingly, given that the coalescent is a pure death process, by t = 10 all of 5 of our samples will have coalesced and there will be only 1 lineage left. 
+
+```{r}
+sessionInfo()
+```
+
+
+
+
+
+
+