Update annotation pipeline documentation #117

Open: wants to merge 10 commits into base: mp-term
Binary file not shown.
59 changes: 51 additions & 8 deletions annotation_pipeline/README.md
@@ -1,3 +1,44 @@
# IMPC Annotation Pipeline
The IMPC annotation pipeline assigns Mammalian Phenotype (MP) terms to significant genetic effects based on a p-value threshold of 0.0001. The goal is to associate phenotypic observations with the corresponding genetic modifications.
At the IMPC, genetic effects are identified using three statistical analysis frameworks:
1. Linear Mixed Model (MM)
2. Fisher's Exact Test (FE)
3. Reference Range Plus (RR)

## Continuous data
Continuous data are typically analysed using a linear mixed model framework. These continuous measurements are particularly informative because the direction of change can be determined through the effect size.

However, due to the complexity of the data, not all continuous variables can be analysed using this framework. In such cases, the IMPC often employs the Reference Range Plus (RR) method. Control data are first discretised into three categories: low, normal, and high. Mutant data points are then classified into one of these reference categories. Finally, a Fisher's Exact Test is applied to determine whether there is a statistically significant deviation from the normal category.
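
The Reference Range Plus steps described above can be sketched roughly as follows. This is a minimal illustration, not the OpenStats implementation: the 2.5th/97.5th percentile cut-offs, the function names, and the single "normal vs. not normal" 2x2 table are all assumptions made for the example, and Fisher's Exact Test is written out by hand so the block has no dependencies.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )


def reference_cutoffs(controls, lower_q=0.025, upper_q=0.975):
    """Cut-offs that split control values into low / normal / high (assumed quantiles)."""
    ordered = sorted(controls)
    last = len(ordered) - 1
    return ordered[int(lower_q * last)], ordered[int(upper_q * last)]


def classify(value, low_cut, high_cut):
    """Assign a value to one of the three reference categories."""
    if value < low_cut:
        return "low"
    if value > high_cut:
        return "high"
    return "normal"


def rr_p_value(controls, mutants):
    """Test whether mutants deviate from the normal category more often than controls."""
    low_cut, high_cut = reference_cutoffs(controls)

    def abnormal_count(values):
        return sum(classify(v, low_cut, high_cut) != "normal" for v in values)

    mutant_abnormal = abnormal_count(mutants)
    control_abnormal = abnormal_count(controls)
    return fisher_exact_two_sided(
        mutant_abnormal, len(mutants) - mutant_abnormal,
        control_abnormal, len(controls) - control_abnormal,
    )


# Toy data: most mutant values fall above the control reference range.
controls = list(range(1, 101))
mutants = [150] * 8 + [50] * 2
p_value = rr_p_value(controls, mutants)
print(p_value)
```

Discretising before testing trades the effect size of the continuous measurement for robustness, which is why a direction of change is not read directly from the test itself.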

## Categorical data
Categorical data in the IMPC encompasses a range of qualitative measurements and is analysed using Fisher’s Exact Test, as implemented in the R package OpenStats.

# How the IMPC Annotation Pipeline Works
The `annotationChooser` function processes a statistical analysis result, called a statpacket. It makes calls based on significance levels, then maps those calls to Mammalian Phenotype (MP) ontology terms using a provided `mp_chooser_file`. Finally, it updates the JSON component of the input statpacket with the identified MP terms. If no relevant annotation is found, or the statistical result is not significant, it returns the original statpacket with no MP terms added.

*Note:* Several MP terms can be assigned to one statpacket, because a statpacket can have several "Active sex levels".
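
As a rough illustration of the behaviour described above, the sketch below makes significance calls and writes MP terms back into a statpacket's JSON. It is not the `annotationChooser` implementation: the function names, the `mp_terms` key, the shape of the p-value mapping, and the strict `<` comparison are assumptions made for the example. Only the 0.0001 threshold comes from this README.

```python
import json

P_VALUE_THRESHOLD = 0.0001  # threshold stated in this README


def significant_levels(p_values, threshold=P_VALUE_THRESHOLD):
    """Return the sex levels whose p-value passes the threshold.

    p_values is a hypothetical mapping such as
    {"FEMALE": 0.00001, "MALE": 0.2, "COMBINED": None}.
    """
    return [
        level
        for level, p in p_values.items()
        if p is not None and p < threshold  # strict comparison is an assumption
    ]


def add_mp_terms(statpacket_json, mp_terms):
    """Return the statpacket JSON with MP terms added, or unchanged if there are none."""
    if not mp_terms:
        # Nothing significant or nothing mapped: hand back the original statpacket.
        return statpacket_json
    packet = json.loads(statpacket_json)
    packet["mp_terms"] = mp_terms  # hypothetical key name
    return json.dumps(packet)


calls = significant_levels({"FEMALE": 0.00001, "MALE": 0.2, "COMBINED": None})
print(calls)
```

Because one statpacket can carry calls for more than one sex level, `mp_terms` is modelled as a list rather than a single value.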

The annotation pipeline requires a reference table that summarises the available MP terms for a given IMPC parameter. This reference can be retrieved from [IMPReSS](https://www.mousephenotype.org/impress/index).
The ETL pipeline handles this by generating the `mp_chooser.json` file.

- We will denote p-value calls by the data they are made from:
  - ♂ Male-only data
  - ♀ Female-only data
  - ⚤ All data combined

- We will denote `mp_chooser.json` terms as they appear in the file:
  - MALE
  - FEMALE
  - UNSPECIFIED

In the `mp_chooser.json` file each MP term can have different levels:
- Ontology term levels: ABNORMAL, INCREASED, DECREASED.
- Sex levels: FEMALE, MALE, UNSPECIFIED.

*Note:* Sex-specific MP terms (those with FEMALE or MALE sex levels) and UNSPECIFIED terms never occur together for the same parameter. In other words, the `mp_chooser.json` file provides either FEMALE/MALE terms or an UNSPECIFIED term.
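
The two kinds of level and the exclusivity rule above can be pictured with a small lookup sketch. The real layout of `mp_chooser.json` is not shown here, so the nested dictionary, the function names, and the placeholder term identifiers are all hypothetical. The text above also does not say how a ⚤ call is mapped when only FEMALE/MALE terms exist, so the sketch handles ♀ and ♂ calls only.

```python
SEX_SPECIFIC = ("FEMALE", "MALE")


def uses_sex_specific_terms(entry):
    """True if a parameter's entry offers FEMALE/MALE terms rather than UNSPECIFIED."""
    has_sexed = any(sex in entry for sex in SEX_SPECIFIC)
    if has_sexed and "UNSPECIFIED" in entry:
        # The README states these never occur together for one parameter.
        raise ValueError("entry mixes sex-specific and UNSPECIFIED terms")
    return has_sexed


def find_term(entry, call_sex, levels):
    """Look up a term for a FEMALE or MALE call, trying ontology levels in order."""
    sex_key = call_sex if uses_sex_specific_terms(entry) else "UNSPECIFIED"
    available = entry.get(sex_key, {})
    for level in levels:
        if level in available:
            return available[level]
    return None


# Hypothetical entries for two parameters; the identifiers are placeholders.
sexed_entry = {
    "FEMALE": {"ABNORMAL": "MP:PLACEHOLDER_F_ABN", "INCREASED": "MP:PLACEHOLDER_F_INC"},
    "MALE": {"ABNORMAL": "MP:PLACEHOLDER_M_ABN"},
}
unsexed_entry = {"UNSPECIFIED": {"ABNORMAL": "MP:PLACEHOLDER_U_ABN"}}

print(find_term(sexed_entry, "FEMALE", ["INCREASED", "ABNORMAL"]))
print(find_term(unsexed_entry, "MALE", ["DECREASED", "ABNORMAL"]))
```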

The MP term assignment logic is shown below:

```mermaid
%%{
init: {
@@ -11,11 +52,13 @@
}
}%%
graph TD;
A[Repeat separately for Overall, Females and Males] --> B{Is genotype effect significant?}
B -- No --> C[Do not assign MP term]
B -- Yes --> D{Is the direction of genotype effect specified?}
D -- No --> E[Select Abnormal term]
D -- Yes --> F{Is there any conflict of direction? <br> example: Low.Decrease/High.Increase <br> or Male.Decrease/Female.Increase}
F -- No --> G[Choose an MP term corresponding to the direction of change]
F -- Yes --> H[Select Abnormal term]
```

```mermaid
graph TD;
Start{Which method is used for the analysis?} --> |MM| MM[Prioritise INCREASED/DECREASED MP term, otherwise use ABNORMAL] --> A
Start --> |FE or RR| FE_RR[Only use ABNORMAL MP term] --> A

A{"Is sex-specific MP term available in the mp_chooser file?"}
A --> |Yes| A2[Use FEMALE/MALE term] --> B{Is ♀ or ♂ call observed?}
A --> |No| A3[Use UNSPECIFIED term] --> B

B --> |Yes| B1[Drop ⚤ call and report MP term for ♀ or ♂ call]
B --> |No| B2[Report ⚤ call]
```
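
The method branch and the ♀/♂ versus ⚤ branch of the flowchart above can be sketched as two small functions. This is an illustrative reading of the diagram, not the pipeline's code: the function names, the `"COMBINED"` label for the ⚤ call, and the way the direction of change is passed in are assumptions.

```python
def candidate_levels(method, direction=None):
    """Ontology term levels to try, in priority order, for an analysis method.

    method is "MM", "FE" or "RR"; direction is "INCREASED", "DECREASED" or None.
    """
    if method == "MM" and direction in ("INCREASED", "DECREASED"):
        # MM: prioritise the directional term, otherwise fall back to ABNORMAL.
        return [direction, "ABNORMAL"]
    # FE and RR (and MM without a usable direction) only use ABNORMAL.
    return ["ABNORMAL"]


def calls_to_report(significant):
    """Decide which significant calls are reported.

    significant is a set drawn from {"FEMALE", "MALE", "COMBINED"}.
    """
    sex_calls = [sex for sex in ("FEMALE", "MALE") if sex in significant]
    if sex_calls:
        # A sex-specific call was observed: drop the combined call.
        return sex_calls
    return ["COMBINED"] if "COMBINED" in significant else []


print(candidate_levels("MM", "DECREASED"))
print(calls_to_report({"FEMALE", "COMBINED"}))
```

Keeping the two decisions separate mirrors the diagram: the method fixes which ontology term levels are eligible, while the observed calls decide which sex levels are reported.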