Merged
2 changes: 1 addition & 1 deletion .Rbuildignore
@@ -16,4 +16,4 @@ KinformR.Rproj
.pre-commit-config.yaml
.Rproj.user
.git
.github
.github
4 changes: 3 additions & 1 deletion .github/workflows/main.yaml
@@ -34,4 +34,6 @@ jobs:
run: |
conda create -n test_r r-base r-devtools r-testthat
conda activate test_r
Rscript -e "testthat::test_local()"
Rscript -e "testthat::test_local()"
# - name: PreCommit
# uses: pre-commit/action@v3.0.1
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Changed

### Added
- use of precommit spelling, not making a CI check so as to keep cran compatibility.

### Fixed
- linting and spelling errors resolved with pre-commit usage.
18 changes: 9 additions & 9 deletions DESCRIPTION
@@ -1,18 +1,18 @@
Package: KinformR
Title: Relationship-Informed Pedigree and Variant Scoring
Version: 0.1.0
Authors@R:
Authors@R:
person("Cameron M.", "Nugent", , "cam.nugent@sequencebio.com", role = c("aut", "cre"),
comment = c(ORCID = "0000-0002-1135-2605"))
Author: Cameron M. Nugent
Maintainer: Cameron M. Nugent <cam.nugent@sequencebio.com>
Description:
The KinformR R package is meant to aid in comparative evaluation of families
and candidate variants in rare-variant association studies. The package can be used for
two methodologically overlapping but distinct purposes. First, the prior to any genetic or genomic
evaluation, evaluation of relative detection power of pedigrees, can direct recruitment
efforts by showing which unsampled individuals would be the most meaningful additions to a study.
Second, after sequencing and analysis, variants based on association with disease status
Description:
The KinformR R package is meant to aid in comparative evaluation of families
and candidate variants in rare-variant association studies. The package can be used for
two methodologically overlapping but distinct purposes. First, the prior to any genetic or genomic
evaluation, evaluation of relative detection power of pedigrees, can direct recruitment
efforts by showing which unsampled individuals would be the most meaningful additions to a study.
Second, after sequencing and analysis, variants based on association with disease status
and familial relationships of individuals, aids in variant prioritization.
License: MIT + file LICENSE
Encoding: UTF-8
@@ -22,5 +22,5 @@ VignetteBuilder: knitr
Suggests:
devtools,
testthat,
knitr,
knitr,
rmarkdown
5 changes: 1 addition & 4 deletions R/io.R
@@ -43,7 +43,7 @@ read.relation.mat <- function(fname){
#' status encoded in the indivudal's names
#'
#' Note - ensure the status in the names match your desired encoding!
#' There are individuals with ambigious statues, that you may require to
#' There are individuals with ambiguous statues, that you may require to
#' be encoded in a specific fashion for you current purposes.
#'
#'
@@ -80,6 +80,3 @@ read.var.table <- function(fname){
"variant" = in.variants)
return(out.df)
}



2 changes: 1 addition & 1 deletion R/pedigree.r
@@ -146,7 +146,7 @@ score.pedigree <- function(h){
for (i in seq_len(nrow(h))) {
family <- h[i,"Family"]
max.a <- h[i, "max_a"]
#Yeezy yeezy whats good its ya boy
#Yeezy yeezy what's good, its ya boy
max.b <- h[i, "max_b"]
max.c <- h[i, "max_c"]
max.d <- h[i, "max_d"]
10 changes: 5 additions & 5 deletions README.md
@@ -16,7 +16,7 @@ The development version of `KinformR` can be installed directly from GitHub. You
```
#install.packages("devtools")
#install.packages("knitr") #required if build_vignettes = TRUE
#library(devtools)
#library(devtools)
devtools::install_github("SequenceBio/KinformR", build_vignettes = TRUE)
library(KinformR)
```
@@ -25,14 +25,14 @@ library(KinformR)

The package's vignette contains detailed explanations of the functions and parameters.

For a walk through of the `KinformR` functions for scoring the value of *families* based on penetrance and IBD, see the corresponging vignette file:
For a walk through of the `KinformR` functions for scoring the value of *families* based on penetrance and IBD, see the corresponding vignette file:
`vignettes/KinformR-penetrance_and_ibd.Rmd`
or within R, run:
```
vignette('KinformR-penetrance_and_ibd')
```

For a walk through of the `KinformR` functions for scoring the value of *variants* within families, see the corresponging vignette file:
For a walk through of the `KinformR` functions for scoring the value of *variants* within families, see the corresponding vignette file:
`vignettes/KinformR-variant_scoring.Rmd`

or within R, run:
@@ -59,7 +59,7 @@ and scoring then performed:

## Scoring Variants

When looking at shared rare variants across families, not all sets of affected and unaffected individuals are equal. This R package is designed to score rare variants, assigning values based on the disease status of individuals, the presence or absence of a rare variant in those individuals, and their pairwise coefficients of relatedness. The package uses a custom formula to assign value to a variant that gives more weight to shared variants common to distantly related affected individuals. The variant status for unaffected individuals can optionally be considered as well, with the highest scoring values being given to closely related individuals that *do not* share a variant of interst. Since variants can be incompletely penetrant, the scoring can be based solely on the affected individuals, or the weight of unaffected evidence can be customized.
When looking at shared rare variants across families, not all sets of affected and unaffected individuals are equal. This R package is designed to score rare variants, assigning values based on the disease status of individuals, the presence or absence of a rare variant in those individuals, and their pairwise coefficients of relatedness. The package uses a custom formula to assign value to a variant that gives more weight to shared variants common to distantly related affected individuals. The variant status for unaffected individuals can optionally be considered as well, with the highest scoring values being given to closely related individuals that *do not* share a variant of interest. Since variants can be incompletely penetrant, the scoring can be based solely on the affected individuals, or the weight of unaffected evidence can be customized.


### The relationship matrix
@@ -89,4 +89,4 @@ The two streams of information can then be combined to score a variant based off

```
score.example <- score.fam(rel.mat, ind.df.status)
```
```
2 changes: 1 addition & 1 deletion man/add.fam.scores.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/calc.rv.score.Rd


2 changes: 1 addition & 1 deletion man/ibd.Rd


2 changes: 1 addition & 1 deletion man/penetrance.Rd


2 changes: 1 addition & 1 deletion man/read.pedigree.Rd


2 changes: 1 addition & 1 deletion man/read.var.table.Rd


2 changes: 1 addition & 1 deletion man/score.Rd


2 changes: 1 addition & 1 deletion man/score.fam.Rd


2 changes: 1 addition & 1 deletion man/score.pedigree.Rd


2 changes: 1 addition & 1 deletion man/subset.mat.Rd


4 changes: 2 additions & 2 deletions tests/testthat/test_encoding.R
@@ -37,10 +37,10 @@ test_that("Families are correctly encoded.", {
expect_equal(scores$statvar.cat, expected.scores)

print("theoretical.max high score values for a family")
ther.scores <- score.variant.status(indiv.df, theoretical.max=TRUE)
theory.scores <- score.variant.status(indiv.df, theoretical.max=TRUE)

expected.thermax.scores <- c("A.c","U.c","A.c","A.c","A.c" ,"U.c", "A.c", "U.c")
expect_equal(ther.scores$statvar.cat, expected.thermax.scores)
expect_equal(theory.scores$statvar.cat, expected.thermax.scores)


})
15 changes: 6 additions & 9 deletions vignettes/KinformR-penetrance_and_ibd.Rmd
@@ -3,7 +3,7 @@ title: "KinformR - penetrance and idb informed scoring of families"
author: "Cameron M. Nugent"
date: "`r format(Sys.time(), '%d %B, %Y')`"
data: "`r Sys.Date()`"
output: rmarkdown::pdf_document # rmarkdown::html_vignette #
output: rmarkdown::pdf_document # rmarkdown::html_vignette #
pdf_document:
df_print: kable
vignette: >
@@ -37,12 +37,12 @@ show <- function(df){
The family power calculations depend on a single tab-delimited input file, where each row represents a family. The input file is read in using the `read.pedigree` function.

```{r}
example.pedigree.file <- system.file('extdata/example_pedigree_encoding.tsv',
example.pedigree.file <- system.file('extdata/example_pedigree_encoding.tsv',
package = 'KinformR')

example.pedigree.df <- read.pedigree(example.pedigree.file)
```
The input file is expected to have the following 11 columns (with a header).
The input file is expected to have the following 11 columns (with a header).
```{r}
colnames(example.pedigree.df)

@@ -51,7 +51,7 @@ colnames(example.pedigree.df)
### Simplified summary of pedigrees

For now, this file should be constructed through careful manual inspection of the pedigrees. To encode the rows for each family, you should first prune down pedigrees to informative allele transfers. For
the purposes of this tool, we exclude young generations (non-adults, younger than age of onset) and large (more than two sequential generations) trees of exclusively unaffected family members. Additionally all individuals require a binary A/U status, there should be no ambigious individuals. There will be some judgment calls required here.
the purposes of this tool, we exclude young generations (non-adults, younger than age of onset) and large (more than two sequential generations) trees of exclusively unaffected family members. Additionally all individuals require a binary A/U status, there should be no ambiguous individuals. There will be some judgment calls required here.


### Encoding categories of relationships
@@ -73,11 +73,11 @@ show(example.pedigree.df)
```

All columns with the prefix `max_` are meant to count the total number of each category in the pedigree, while
the columns without this prefix are the number of each category for whom samples have been collected.
the columns without this prefix are the number of each category for whom samples have been collected.

The categories correspond to A, B, and C as defined above.

Category D is represented by two numbers, d and n. n is the number of offspring in a tree of unaffecteds; d is the number of those types of trees across the pedigree. Multiple types of trees are encoded with commas separating the values. For example, the following represents a family with three total trees of unaffecteds. One tree (d=1) has three offspring (n=3); two trees (d=2) each have one offspring (n=1).
Category D is represented by two numbers, d and n. n is the number of offspring in a tree of unaffecteds; d is the number of those types of trees across the pedigree. Multiple types of trees are encoded with commas separating the values. For example, the following represents a family with three total trees of unaffecteds. One tree (d=1) has three offspring (n=3); two trees (d=2) each have one offspring (n=1).

```
d n
@@ -138,6 +138,3 @@ we only count the parent. (d=1, n=0; equivalently, c=1)
2. You have collected one or more children, but not the parent. In this case,
each of the children contribute a portion of what the parent would have contributed
to our understanding. (d=1, n>0)
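
As a rough illustration of the comma-separated `d`/`n` encoding for category D, the sketch below expands a pair of encoded strings into one row per type of tree. `parse.tree.encoding` is a hypothetical helper written for this example, not a KinformR function, and it assumes both fields arrive as character strings with the same number of comma-separated values.

```r
# Hypothetical helper (not part of KinformR): split the comma-separated
# d and n fields into one row per type of unaffected tree.
parse.tree.encoding <- function(d, n) {
  d.vals <- as.integer(strsplit(as.character(d), ",", fixed = TRUE)[[1]])
  n.vals <- as.integer(strsplit(as.character(n), ",", fixed = TRUE)[[1]])
  # Each d value must be paired with exactly one n value.
  stopifnot(length(d.vals) == length(n.vals))
  data.frame(d = d.vals, n = n.vals)
}

# One tree with three offspring, plus two trees with one offspring each.
trees <- parse.tree.encoding("1,2", "3,1")
print(trees)

# Total number of trees of unaffecteds in the pedigree.
total.trees <- sum(trees$d)
print(total.trees)
```

Keeping `d` and `n` as parallel vectors makes it easy to check that the two fields were encoded with the same number of entries before any scoring is attempted.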



44 changes: 21 additions & 23 deletions vignettes/KinformR-variant_scoring.Rmd
@@ -3,7 +3,7 @@ title: "KinformR - pedigree-informed rare variant association scoring"
author: "Cameron M. Nugent"
date: "`r format(Sys.time(), '%d %B, %Y')`"
data: "`r Sys.Date()`"
output: rmarkdown::pdf_document #rmarkdown::html_vignette #
output: rmarkdown::pdf_document #rmarkdown::html_vignette #
pdf_document:
df_print: kable
vignette: >
@@ -43,7 +43,7 @@ To read in the data, one uses the function `read.relation.mat`.
mat.name1<-system.file('extdata/1234_ex2.mat', package = 'KinformR')
rel.mat <- read.relation.mat(mat.name1)
show(rel.mat)
```
```


### The status file
@@ -60,15 +60,15 @@ tsv.name1<-system.file('extdata/1234_ex2.tsv', package = 'KinformR')
status.df <- read.indiv(tsv.name1)

show(status.df)
```
```

The disease-genotype scoring can then be encoded using the `score.variant.status` function to produce the status-variant category for all individuals. This creates a df with the new column: `statvar.cat`.

```{r}

full.df.status <- score.variant.status(status.df)
show(full.df.status)
```
```
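
To make the status-variant categories concrete, here is a minimal sketch of how such an encoding could be derived. It is not the implementation of `score.variant.status`: `encode.statvar` and the `has.variant` column are hypothetical names, and the `A.i` label for an affected non-carrier is an assumption made by analogy with `A.c`, `U.c`, and `U.i`.

```r
# Hypothetical helper (not part of KinformR): combine disease status
# ("A" or "U") with a logical carrier flag into a status-variant category.
encode.statvar <- function(status, has.variant) {
  stopifnot(all(status %in% c("A", "U")), is.logical(has.variant))
  # A genotype is "correct" when carriers are affected and
  # non-carriers are unaffected.
  correct <- (status == "A") == has.variant
  paste0(status, ifelse(correct, ".c", ".i"))
}

toy <- data.frame(status = c("A", "A", "U", "U"),
                  has.variant = c(TRUE, FALSE, TRUE, FALSE),
                  stringsAsFactors = FALSE)
toy$statvar.cat <- encode.statvar(toy$status, toy$has.variant)
print(toy)
```

In this toy table the unaffected carrier is the only `U.i` row, matching the "unaffected incorrect" description used in the vignette.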



@@ -80,7 +80,7 @@ For most real-world applications, you will likely want to score family members i

ex.score.default <- score.fam(rel.mat, full.df.status)
show(ex.score.default)
```
```


By default `score.fam` returns:
@@ -92,26 +92,26 @@ As previously noted, if an individual is present in the relationship matrix and
The scoring can be changed to summing across all combinations as opposed to the mean by passing the following options. Note using the program in this way will return higher scores for more dense pedigrees.
```{r}

ex.score.sum <- score.fam(rel.mat, full.df.status,
ex.score.sum <- score.fam(rel.mat, full.df.status,
return.sums = TRUE, return.means = FALSE)
show(ex.score.sum)
```
```
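
The effect of pedigree density on summed scores can be seen with a small toy comparison. The numbers below are made up for illustration and are not output from `score.fam`; the "dense" pedigree simply repeats the same per-individual scores twice.

```r
# Hypothetical per-individual scores (not produced by score.fam).
sparse.scores <- c(2, 4, 6)
dense.scores <- rep(sparse.scores, times = 2)

comparison <- data.frame(
  pedigree = c("sparse", "dense"),
  sum = c(sum(sparse.scores), sum(dense.scores)),
  mean = c(mean(sparse.scores), mean(dense.scores))
)
print(comparison)
```

The sum doubles when the pedigree doubles in size while the mean stays the same, which is why the mean is the safer choice when comparing families of different sizes.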


To obtain a long form table with the scores for variants expressed relative to each individual, set both `return.sums` and `return.means` to `FALSE`. This output can aid in identifying which individuals are carrying the most weight in a family's score.
```{r}

ex.score.table <- score.fam(rel.mat, full.df.status,
ex.score.table <- score.fam(rel.mat, full.df.status,
return.sums = FALSE, return.means = FALSE)
show(ex.score.table)
```
```

## How scoring works
## How scoring works
### A Minimal example, scoring a variant from perspective of a single individual.

This section is meant to demonstrate how the variant scoring is accomplished on a finer scale. A user does not need to interact with the package on this level of granularity. This section is for explanatory purposes only, demonstrating how the `score.fam` function operated "under the hood".
This section is meant to demonstrate how the variant scoring is accomplished on a finer scale. A user does not need to interact with the package on this level of granularity. This section is for explanatory purposes only, demonstrating how the `score.fam` function operated "under the hood".

The `score.fam` function runs the scoring method once for each affected individual in the status dataframe (or for each individual regardless of status if `affected.only = FALSE`). To do this, for each individual, the program takes corresponding row of the relationship matrix to determine the relations to all other individuals in the pedigree.
The `score.fam` function runs the scoring method once for each affected individual in the status dataframe (or for each individual regardless of status if `affected.only = FALSE`). To do this, for each individual, the program takes corresponding row of the relationship matrix to determine the relations to all other individuals in the pedigree.

For example, the degrees of relationship of all other members of the example family relative to the reference individual `"MS-1234-1001"` are shown in the following subset of the matrix:

@@ -135,25 +135,25 @@ name.stat.dict
```{r}
rel.dict<-build.relation.dict(rel.mat.proband, name.stat.dict)
rel.dict
```
```
In this example, the proband, two first degree relations, and a third degree relations are all affected and share the candidate variant. For the affected correct (`A.c`) category we therefore see the following encoded:

```{r}
rel.dict$A.c
```
```

Since one first degree unaffected relative has the variant, they are categorized as "unaffected incorrect"(`U.i`) and we see:
```{r}
rel.dict$U.i
```
```

Deriving a relatedness-weighted score for the variant from the perspective of the given individual is then performed by `calc.rv.score`

For each degree-encoded relationship, the coefficient of relatedness is used to weight the evidence for or against a variant. The coefficients for different degress of relationship are:
For each degree-encoded relationship, the coefficient of relatedness is used to weight the evidence for or against a variant. The coefficients for different degrees of relationship are:
```{r}
for(i in 0:7){
print(paste0("Degree of relatedness: ", i,
" coefficient of relatedness: ", 1 / (2 ** (i))))
print(paste0("Degree of relatedness: ", i,
" coefficient of relatedness: ", 1 / (2 ** (i))))
}
```
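
The same coefficients can also be computed in one vectorized call, which may be handy when weighting many relationships at once. `relatedness.coef` is a hypothetical helper name used only for this sketch; it applies the same `1 / 2^degree` rule as the loop in the vignette.

```r
# Hypothetical helper (not part of KinformR): coefficient of relatedness
# for one or more degrees of relationship.
relatedness.coef <- function(degree) {
  stopifnot(is.numeric(degree), all(degree >= 0))
  1 / (2^degree)
}

coefs <- relatedness.coef(0:7)
names(coefs) <- paste0("degree_", 0:7)
print(coefs)
```

Each additional degree of separation halves the coefficient, so a third degree relative contributes one eighth of the weight of the reference individual.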

@@ -200,19 +200,17 @@ The final score for the variant would then be:
```
Giving a final score of 10 for the variant.

This is all accomplished by the function `calc.rv.score`.
This is all accomplished by the function `calc.rv.score`.
```{r}
calc.rv.score(rel.dict)
```
```

The weights of the scoring can be adjusted, for example if we wanted to consider only `affected`-based evidence, we could turn off the unaffected part of the calculation by setting the unaffected weighting to 0. This can be useful for incompletely penetrant variants, where disease status and genotype of unaffected individuals are more likely to have imperfect concordance.

Additionally, families with few sequenced affected individuals and many unaffected individuals may have inflated, potentially misleading variant scores; restricting the scoring algorithm to the affected individuals can overcome this bias.

```{r}
calc.rv.score(rel.dict, unaffected.weight=0)
```
```

The `score.fam` function automatically walks through this process from all specified perspectives in the pedigree and by default returns the average score. The use of the averages and different perspectives is meant to eliminate pedigree-associated bias, such as for instances when a proband is distantly related to all other members in a family (considering the relationships from only the perspective of the proband in this case would give an inflated score for the variant's value).

