Skip to content

Commit 8ecaf46

Browse files
committed
Fill out some more phylo concepts
And adjust minor wording in other tutes so that they can be linked sensibly
1 parent fec91c3 commit 8ecaf46

File tree

3 files changed

+110
-43
lines changed

3 files changed

+110
-43
lines changed

analysing_trees.md

+2
Original file line numberDiff line numberDiff line change
@@ -420,6 +420,8 @@ to remember that the algorithm is minimising the number of state transitions;
420420
this may not correspond always to what we might consider the most parsimonious
421421
explanation.
422422

423+
424+
(sec_trees_numba)=
423425
## Fast tree-based algorithms using numba
424426

425427
:::{todo}

phylogen.md

+100-37
Original file line numberDiff line numberDiff line change
@@ -49,7 +49,7 @@ Calculations on these huge trees can be very efficient:
4949
traversed_nodes = 0
5050
for u in big_tree.nodes(order="postorder"):
5151
traversed_nodes += 1
52-
print("Postorder traversal through {traversed_nodes} took:")
52+
print(f"Postorder traversal through {traversed_nodes} took:")
5353
```
5454

5555
```{code-cell}
@@ -73,30 +73,30 @@ import tsconvert # used for reading tree sequences from different formats
7373
7474
# example code reading in a large file, timed
7575
76-
# example of reading from a string (smaller tree) into a tree sequence
76+
# Or read smaller trees from strings (here we create a tree spanning 1000 genomic units)
7777
# Todo: add sequence length: https://github.com/tskit-dev/tsconvert/issues/40
78-
ts = tsconvert.from_newick("(A:6,((B:1,C:1):2,(D:2,E:2):1):3);")
78+
ts = tsconvert.from_newick("(A:6,((B:1,C:1):2,(D:2,E:2):1):3);", span=100)
7979
```
8080

8181
The "succinct tree sequence" format used by `tskit` can also store mutations
8282
(and optionally a reference genome) along with the tree(s). This results in a
8383
single unified representation of large genomic datasets, storing trees,
8484
sequence data and metadata in a single efficient structure. Examples are given
85-
in the {ref}`end of this page<sec_phylogen_unified_structure>`.
85+
in the section below entitled {ref}`sec_phylogen_unified_structure`.
8686

8787
As the name suggests, a tree sequence can also store and analyse a sequence of
8888
trees along a genome (i.e. a "phylogenetic network"). This is necessary to
8989
account for recombination between lineages, and may be important even when looking at
9090
species-level phylogenies due to the effects of hybridization and incomplete lineage
91-
sorting. An overview, and links to further details are given
92-
{ref}`sec_phylogen_multiple_trees`
91+
sorting. An overview, and links to further details are given at the
92+
{ref}`end of this page <sec_phylogen_multiple_trees>`.
9393

9494
## Hints for phylogeneticists
9595

9696
Unlike other phylogenetic libraries, `tskit` is designed to efficiently store not just
97-
single trees, but entire , and focusses primarily on dealing with
98-
genetic sequence data. This means that the library has some features not found in
99-
more standard phylogenetic libraries. Here we focus on the {ref}`sec_python_api`,
97+
single trees, but sequences of correlated trees along a genome. This means that the
98+
library has some features not found in more standard phylogenetic libraries.
99+
Here we focus on the {ref}`sec_python_api`,
100100
introducing eight `tskit` concepts that may be useful to those with a background in
101101
phylogenetics (each is linked to a separate section below):
102102

@@ -208,8 +208,8 @@ important in tree sequences that contain more than one tree.
208208
The {class}`Tree` object has {ref}`methods<sec_python_api_trees>` to perform basic operations
209209
on a tree such as traversing the nodes, identifying parents, children, and common
210210
ancestors, etc. {ref}`Several methods<sec_python_api_trees_node_measures_array>`
211-
also return numpy arrays for use in efficient algorithms
212-
(e.g. using numba - link to tutorial).
211+
also return numpy arrays for use in
212+
{ref}`efficient algorithms using numba<sec_trees_numba>`
213213

214214
```{code-cell}
215215
for n_id in tree.nodes(order="postorder"):
@@ -284,28 +284,34 @@ tree.tree_sequence.nodes_time
284284
(sec_phylogen_node_time)=
285285
### Nodes must have times
286286

287-
Perhaps the biggest difference between `tskit` and other phylogenetic libraries is that
288-
each node *must* have a time associated with it. Node times can be arbitrary (marked
289-
by setting {attr}`TreeSequence.time_units` to `"uncalibrated"`), but they must
290-
be present. This means that `tskit` trees are always directional (i.e. they are
291-
"rooted").
287+
Perhaps the most noticable different between a `tskit` tree and the encoding of trees
288+
in other phylogenetic libraries is that `tskit` does not explicitly store branch lengths.
289+
Instead, each node has a *time* associated with it. Branch lengths can therefore be
290+
found by calculating the difference between the time of a node and the time of its
291+
parent node.
292292

293-
The primary reason for this strict requirement is to ensure temporal consistency across
294-
the trees in the tree sequence. In particular it ensures that
295-
a node cannot be a parent of a node in one tree, and a child of the same node in another
296-
tree in the tree sequence, a logically impossibility.
293+
Since nodes *must* have a time, `tskit` trees aways have these (implicit) branch
294+
lengths. To represent a tree ("cladogram") in which the branch lengths are not
295+
meaningful, the {attr}`TreeSequence.time_units` of a tree sequence can be
296+
specified as `"uncalibrated"` (see below)
297297

298-
The units in which time is measured are stored in the {attr}`TreeSequence.time_units`
299-
attribute: if not known, this defaults to "unknown":
298+
Another implication of storing node times rather than branch lengths is that `tskit`
299+
trees are always directional (i.e. they are "rooted"). The reason that `tskit` stores
300+
times of nodes (rather than e.g. genetic distances between them) is to ensure temporal
301+
consistency. In particular it makes it impossible for a node to be an ancestor of a
302+
node in one tree, and a descendant of the same node in another tree in the tree sequence.
303+
This is of critical importance when extending the concept of genetic ancestry to
304+
{ref}`sec_phylogen_multiple_trees` along a genome.
305+
306+
The {attr}`TreeSequence.time_units` attribute stores the units in which time is
307+
measured: if not known, this defaults to "unknown":
300308

301309
```{code-cell}
302310
print("Time units are", tree.tree_sequence.time_units)
303311
tree.draw_svg(y_axis=True)
304312
```
305313

306-
The fact that nodes have times also means that `tskit` does not explictly store
307-
branch lengths: a branch length is simply the difference between the time of a
308-
node and the time of its parent node. For convenience, however, `tskit` provides a
314+
Although branch lengths are not stored explicitly, for convenience `tskit` provides a
309315
{meth}`Tree.branch_length` method:
310316

311317
```{code-cell}
@@ -412,30 +418,87 @@ has not been simulated.
412418
## Phylogenetic methods
413419

414420
:::{todo}
415-
Demo some phylo methods. e.g.
421+
Demo some phylogenetic methods. e.g.
416422
1. Total branch length - demo quick calculation across multiple trees - incremental algorithm used extensively in population genetics. ("bringing tree thinking to popgen").
417423
2. KC distance
418424
3. Balance metrics
419-
4. Topology rankings
425+
4. Topology rankings (see https://github.com/tskit-dev/tutorials/issues/93)
420426
:::
421427

422428

423429
(sec_phylogen_unified_structure)=
424430
## Storing and accessing genetic data
425431

426-
:::{todo}
427-
Add content and link to {ref}`sec_what_is_dna_data`.
428-
:::
432+
`Tskit` has been designed to capture both evolutionary tree topologies and the genetic
433+
sequences that evolve along the branches of these trees. This is achieved by defining
434+
{ref}`sec_terminology_mutations_and_sites` which are associated with specific positions
435+
along the genome.
436+
437+
```{code-cell}
438+
import msprime # The `msprime` package can throw mutations onto a tree sequence
439+
mutated_ts = msprime.sim_mutations(ts, rate=3e-3, random_seed=321)
440+
mutated_tree = mutated_ts.first()
441+
print("Variable sites with the following IDs generated")
442+
for site in mutated_tree.sites():
443+
print(
444+
f"Site ID {site.id} @ genomic position {site.position:g}:",
445+
f"{site.ancestral_state} -> {site.mutations[0].derived_state}"
446+
)
447+
mutated_tree.draw_svg()
448+
```
449+
450+
Mutations occur above nodes in a tree, with all the descendant
451+
nodes inheriting that specific mutation (unless replaced by a subsequent
452+
mutation at the same site). This allows genetic variation to be
453+
{ref}`efficiently represented<sec_what_is_dna_data>` using the tree topology.
454+
To obtain the genetic variation at each site across the entire genome, you can use the
455+
{meth}`TreeSequence.sites` method, or (less efficiently), you can use
456+
{meth}`TreeSequence.alignments` to output the
457+
entire sequences for each sample node:
458+
459+
```{code-cell}
460+
for node_id, alignment in zip(
461+
mutated_ts.samples(),
462+
mutated_ts.alignments(missing_data_character="."),
463+
):
464+
print(f"Node {node_id}: {alignment}")
465+
```
429466

430467

431468
(sec_phylogen_multiple_trees)=
432469
## Multiple trees
433470

434-
:::{todo}
435-
Add content showing how `tskit` allows easy extension of phylogenetics to multiple trees
436-
along a genome. We could e.g. edit the example tree (assuming we can extend it to
437-
e.g. 1000 bp sequence length, see https://github.com/tskit-dev/tsconvert/issues/40)
438-
by shortening instead of removing one of the edges, which
439-
should create two trees, one with 2 roots, and one with one.
440-
:::
471+
Where `tskit` really shines is when the ancestry of your dataset cannot be adequately
472+
represented by a single tree. This is a pervasive issue in genomes (even from different
473+
species) that have undergone recombination in the past. The resulting series of
474+
"local trees" along a genome are highly correlated (see {ref}`sec_concepts`).
475+
476+
Instead of storing each tree along a genome separately, `tskit` records the genomic
477+
coordinates of each edge, which leads to enormous efficiencies in storage and
478+
analysis. As a basic demonstration, we can repeat the edge removal example
479+
{ref}`above <sec_phylogen_multiroot>`, but only remove the ancestral link above node 4
480+
for the first half of the genome.
481+
482+
```{code-cell}
483+
tables = ts.dump_tables()
484+
edge_id_above_node_4 = ts.first().edge(4)
485+
left_coord_for_edges = tables.edges.left
486+
left_coord_for_edges[edge_id_above_node_4] = 50
487+
tables.edges.left = left_coord_for_edges # reset the right coords
488+
tables.sort()
489+
multi_ts = tables.tree_sequence()
490+
491+
multi_ts.draw_svg()
492+
```
493+
494+
For the left hand side of the genome we lack information about the ancestry of
495+
node 4, but for the right hand side we know this information. The result is to
496+
generate 2 trees in the tree sequence, which differ only in the presence of absence of
497+
a single branch. We do not have to separately store the entire tree on the right: all
498+
the edges that are shared between trees are stored only once.
499+
500+
The rest of the `tskit` tutorials will lead you through the concepts involved with
501+
storing and analysing sequences of many correlated trees. For a simple introduction, you
502+
might want to start with {ref}`sec_what_is`.
503+
441504

terminology_and_concepts.md

+8-6
Original file line numberDiff line numberDiff line change
@@ -359,8 +359,8 @@ is not always possible.
359359

360360
## Concepts
361361

362-
There are some basic population genetic concepts which can be helpful with thinking about
363-
tree sequences. For reference, here is the tree sequence topology that we have been using:
362+
There are some basic concepts which can be helpful with thinking about tree sequences.
363+
For reference, here is the tree sequence topology that we have been using:
364364

365365
```{code-cell} ipython3
366366
:"tags": ["hide-input"]
@@ -384,10 +384,12 @@ with 3 or more children in a particular tree (these are known as *polytomies*).
384384

385385
### Tree changes, ancestral recombinations, and SPRs
386386

387-
The reason that tree sequences are efficient is that very few edges
388-
{ref}`change from tree to tree<fig_what_is_edge_diffs>`. More specifically, recombination
389-
results in adjacent trees that differ by only a few "tree edit" or SPR
390-
(subtree-prune-and-regraft) operations. This is seen in the example tree sequence above.
387+
The process of recombination usually results in trees along a genome where adjacent
388+
trees differ by only a few "tree edit" or SPR (subtree-prune-and-regraft) operations.
389+
The result is a tree sequence in which very few edges
390+
{ref}`change from tree to tree<fig_what_is_edge_diffs>`.
391+
This is the underlying reason that `tskit` is so
392+
efficient, and is well illustrated in the example tree sequence above.
391393

392394
In this (simulated) tree sequence, each tree differs from the next by a single SPR.
393395
The subtree defined by node 7 in the first tree has been pruned and regrafted onto the

0 commit comments

Comments
 (0)