From our last meeting, we can do something like the following:
1. pick a number of alleles
2. pick frequencies for the alleles, for example by randomly sampling probabilities for a multinomial distribution
3. sample alleles accordingly
We can augment this approach by applying a missing data rate. After sampling alleles, we simply "mask" each [genotype|allele] as missing by sampling a uniform [0, 1) and asking if it is <= the missing rate.
We should stick this code in a module that is #[cfg(test)] and its members are mostly pub.
In order to be general, I think it would help if what we return first is simple a Vec/iterable over genotypes. We can then process such a thing to make the values for a VCF record, etc..
Getting boundary conditions right is essential but applying crates like proptest should help us identify corner cases.
From our last meeting, we can do something like the following:
We can augment this approach by applying a missing data rate. After sampling alleles, we simply "mask" each [genotype|allele] as missing by sampling a uniform [0, 1) and asking if it is <= the missing rate.
We should stick this code in a module that is
#[cfg(test)]and its members are mostlypub.In order to be general, I think it would help if what we return first is simple a Vec/iterable over genotypes. We can then process such a thing to make the values for a VCF record, etc..
Getting boundary conditions right is essential but applying crates like
proptestshould help us identify corner cases.