Skip to content

Commit 36cd878

Browse files
Initial jupyterbook version
Move old content to subdir & jupyter book Convert msprime content into jupyter book form
1 parent 46dba61 commit 36cd878

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

52 files changed

+552
-6626
lines changed

Makefile

+9
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
2+
# Simple makefile for dev.
3+
4+
all:
5+
# Catch warnings for development
6+
jupyter-book build -W -n --keep-going .
7+
8+
clean:
9+
rm -fR _build

_config.yml

+27
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
# Book settings
2+
# Learn more at https://jupyterbook.org/customize/config.html
3+
4+
title: Tskit Tutorial
5+
author: Tskit Developers
6+
# logo: logo.png
7+
8+
# Force re-execution of notebooks on each build.
9+
# See https://jupyterbook.org/content/execute.html
10+
execute:
11+
execute_notebooks: cache
12+
13+
# Information about where the book exists on the web
14+
repository:
15+
url: https://github.com/tsk-dev/tutorials
16+
branch: main
17+
18+
# Add GitHub buttons to your book
19+
# See https://jupyterbook.org/customize/config.html#add-a-link-to-your-repository
20+
html:
21+
use_issues_button: true
22+
use_repository_button: true
23+
24+
only_build_toc_files: true
25+
26+
parse:
27+
myst_extended_syntax: true

_toc.yml

+12
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
# Table of content
2+
# Learn more at https://jupyterbook.org/customize/toc.html
3+
#
4+
- file: intro
5+
- file: msprime
6+
sections:
7+
- file: bottlenecks
8+
- file: introgression
9+
- file: forward_sims
10+
# TODO port/add forward simulation contents
11+
12+
- file: development

bottlenecks.md

+195
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,195 @@
1+
---
2+
jupytext:
3+
text_representation:
4+
extension: .md
5+
format_name: myst
6+
format_version: 0.12
7+
jupytext_version: 1.9.1
8+
kernelspec:
9+
display_name: Python 3
10+
language: python
11+
name: python3
12+
---
13+
14+
# Bottlenecks
15+
16+
**Konrad Lohse and Jerome Kelleher**
17+
18+
The site frequency spectrum (SFS) summarises variants by their frequency in a sample and is a fundamental summary of sequence variation that forms the basis of many modern inference approaches (e.g. sweepfinder, DFE-alpha, dadi). First, the SFS is a lossless summary of unlinked variants, so any summary of sequence variation that ignores linkage (e.g. pairwise measures of diversity and divergence, F_st, Tajima's D and D) are summaries of the SFS.
19+
20+
The SFS is convenient analytically, because it only depends on the mean length and frequency of genealogical branches. For many demographic models of interest the means can be derived analytically either using coalescent theory (cite Huang, TPB) or diffusion equations (cite dadi). A number of composite likelihood approaches have been developed based on either analytic results for the SFS (cite dadi Excoffier, Jaada). However, analytic expectations for the SFS break down for large samples and/or complex demographic models.
21+
22+
In the following section we show how the SFS can be approximated using coalescence simulations and compare such approximations to analytic results. We will assume a simple toy history of a single panmictic population that is affected by an instaneous bottleneck at time T with strenght s (cite Galtier et al). The effect of this bottleneck is to induce sudden burst of coalescence, which simultaneous multiple merges. We measure bottleneck strength as the probability that a pair of lineages coalesces during the bottleneck (we could could of course convert s into on (imaginary) time period that would give the same probability of coalescence $s=1-e^{-T}$).
23+
24+
We assume a sample of size 10 and use msprime to simulate 10,000 replicate genealogies. For each genealogy the function bottSFS records the unfolded SFS as the mean length of branches with n leafnodes (normalized by the total length of the genealogy) by iterating through all nodes in the tree.sequence. Note that we are simulating genealogies only, i.e. we do not need to simulate mutations.
25+
26+
We use a for loop to record the SFS for a range of bottleneck strengths parameters in a dictionary:
27+
28+
```{code-cell} ipython3
29+
%matplotlib inline
30+
%config InlineBackend.figure_format = 'svg'
31+
import msprime
32+
import numpy as np
33+
import seaborn as sns
34+
import matplotlib.pyplot as plt
35+
```
36+
37+
```{code-cell} ipython3
38+
def run_bott_sims(num_rep, num_samp, T, s):
39+
demographic_events = [
40+
msprime.InstantaneousBottleneck(time=T, strength=s, population=0)]
41+
reps = msprime.simulate(
42+
sample_size=num_samp, Ne=Ne, num_replicates=num_rep,
43+
demographic_events=demographic_events)
44+
return reps
45+
46+
def approx_SFS(reps):
47+
B = np.zeros((num_rep, num_samp))
48+
for rep_index, ts in enumerate(reps):
49+
assert ts.num_trees == 1
50+
tree = ts.first()
51+
for u in tree.nodes():
52+
nleaves = tree.num_samples(u)
53+
if tree.parent(u) != msprime.NULL_NODE:
54+
B[rep_index, nleaves] += tree.branch_length(u)
55+
data = np.mean(B, axis=0)
56+
data /= np.sum(data)
57+
return data
58+
59+
num_rep = 1000
60+
num_samp = 10
61+
Ne = 1
62+
T = 0.5
63+
taulist= np.array([0,1,2,3])
64+
datalist = {}
65+
for tau in taulist:
66+
datalist[tau]= approx_SFS(run_bott_sims(num_rep, num_samp, T, tau))
67+
68+
# My guess/assumption is that currently bottleneck strength in msprime is scaled as an (imaginary) time tau (in units of 4N_e) generations.
69+
# It makes a lot more sense to express the bottleneck strength as the probability of pairwise coalescence
70+
# during the bottelenck s=1-np.exp(-tau/2)
71+
```
72+
73+
With increasing bottleneck strength the SFS becomes increasingly skewed (the leftmost blue bars show the SFS for a population of constant size). However, bottlenecks have a complex effect on the different frequency classes of the SFS: while the relative frequency of singletons increases, other frequency classes (e.g. doubletons) have a non-monotonic relationship with bottleneck strength:
74+
75+
```{code-cell} ipython3
76+
bar_width=0.2
77+
index = np.arange(1,num_samp)
78+
j = 0
79+
for s, B in datalist.items():
80+
plt.bar(index + j * bar_width, B[1:], bar_width, label=str(s))
81+
j += 1
82+
```
83+
84+
## Comparison with analytic predictions
85+
86+
+++
87+
88+
How does the approximate SFS compare to analytic expectations? For a population of constant size, the SFS is simply given by Watterson's correction factor, that is the total length branches with i leafnodes is given is 1/i. Reassuringly, in the limit of s=0 (no bottleneck), our SFS approximation based on simulated genealogies agrees with this prediction:
89+
90+
```{code-cell} ipython3
91+
expsfs=[(1/i) for i in range(1,10)]
92+
expsfs/=np.sum(expsfs)
93+
94+
fig, ax = plt.subplots()
95+
index = np.arange(1,10)
96+
bar_width = 0.4
97+
opacity = 0.9
98+
99+
simsfs = ax.bar(index, datalist[0][1:], bar_width, alpha=opacity, label='sim')
100+
expextsfs = ax.bar(index+ bar_width, expsfs, bar_width, alpha=opacity, label='exp')
101+
102+
fig.tight_layout()
103+
plt.show()
104+
```
105+
106+
The analytic prediction for the SFS under a bottleneck model is more complicated (Bunnefeld et al. 2015, Appendix). For a sample of n=4 lineages the SFS is:
107+
108+
```{code-cell} ipython3
109+
#We are assuming a bottleneck of strength tau = 4 N_e generations
110+
#and a bottleneck time of T=1 (2 in units of 4 Ne)
111+
#I am pretty sure the analytic prediction for the SFS is correct: the limit mfor s->0 is correct and
112+
#itr matches the automatically generated expression in the Mathematica .nb...
113+
114+
T=2
115+
slist=[1-np.exp(-tau) for tau in taulist]
116+
117+
for s in slist:
118+
p=s*(-6 + 15*s - 20 * np.power(s,2) + 15 * np.power(s,3) - 6 * np.power(s,4) + np.power(s,5))
119+
expsfsBottlN= [2/15*(np.exp(-6*T)*(15 *np.exp(6*T) - 9 *np.exp(5*T)*s -
120+
5*np.exp(3*T)*s*(3 - 3*s + np.power(s,2)) + p)),
121+
1/5*np.exp(-6*T)*(5*np.exp(6*T) - 6*np.exp(5*T)*s - p),
122+
2/15*np.exp(-6*T)*(5*np.exp(6*T) - 9*np.exp(5*T)*s + 5*np.exp(3*T)*s*(3-3*s + np.power(s,2)) + p)]
123+
124+
expsfsBottlN/=np.sum(expsfsBottlN)
125+
print(expsfsBottlN)
126+
```
127+
128+
The fit between the SFS simulated with msprime and the analytic prediction is noty convincing (given the 100,000 replicates):
129+
130+
```{code-cell} ipython3
131+
num_samp = 4
132+
num_rep = 1000
133+
data4 = {}
134+
T = 1
135+
for tau in taulist:
136+
data4[tau]= approx_SFS(run_bott_sims(num_rep, num_samp, T, tau/2))
137+
```
138+
139+
```{code-cell} ipython3
140+
fig, ax = plt.subplots()
141+
index = np.arange(1,4)
142+
bar_width = 0.4
143+
print(data4[0][1:])
144+
print(data4[1][1:])
145+
print(data4[2][1:])
146+
print(data4[3][1:])
147+
148+
simsfs = ax.bar(index, data4[3][1:], bar_width, alpha=opacity, label='sim')
149+
expextsfs = ax.bar(index+ bar_width, expsfsBottlN, bar_width, label='exp')
150+
151+
fig.tight_layout()
152+
plt.show()
153+
```
154+
155+
## The distribution of nton branches
156+
157+
+++
158+
159+
Given that the SFS only depends on mean branch lengths, it is interesting to inspect the probability density distribution of the underlying genealogical branches. Given the discrete event, the pfd of nton branches are discontinuous.
160+
161+
```{code-cell} ipython3
162+
s=1
163+
demographic_events = [msprime.InstantaneousBottleneck(time=T, strength=s, population=0)]
164+
reps = msprime.simulate(
165+
sample_size=num_samp, Ne=Ne, num_replicates=num_rep,
166+
demographic_events=demographic_events)
167+
B = np.zeros((num_rep, num_samp))
168+
for rep_index, ts in enumerate(reps):
169+
tree = next(ts.trees())
170+
for u in tree.nodes():
171+
nleaves = tree.num_samples(u)
172+
if tree.parent(u) != msprime.NULL_NODE:
173+
B[rep_index, nleaves]+=tree.branch_length(u)
174+
```
175+
176+
```{code-cell} ipython3
177+
Btrans=np.array(B).T.tolist()
178+
sns.distplot(Btrans[1],axlabel="f(t)")
179+
sns.distplot(Btrans[2],axlabel="f(t)")
180+
sns.distplot(Btrans[3],axlabel="f(t)");
181+
```
182+
183+
### To Do
184+
185+
+++
186+
187+
1) Fix the scaling of the strength in msprime
188+
2) Fix the pdf plot above:
189+
- Label axes on the pdf plot above: y-> f(t), x -> t
190+
- Restrict X range to 15
191+
3) Fix the x axes on all the barplots so that these correspond to classes in the SFS
192+
193+
```{code-cell} ipython3
194+
195+
```

content.md

+5
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
Content in Jupyter Book
2+
=======================
3+
4+
There are many ways to write content in Jupyter Book. This short section
5+
covers a few tips for how to do so.

development.md

+5
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
# Development
2+
3+
:::{note}
4+
Add content for helping developers to add more tutorials.
5+
:::

intro.md

+8
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
# Introduction
2+
3+
This is the home of tskit tutorials.
4+
5+
6+
:::{warning}
7+
This site is under heavy development.
8+
:::

0 commit comments

Comments
 (0)