Skip to content

Commit a86f77a

Browse files
authored
Merge pull request HadrienG#2 from jonasoh/jonas-patches
Jonas patches
2 parents 990bb23 + ce36813 commit a86f77a

File tree

7 files changed

+99
-87
lines changed

7 files changed

+99
-87
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ bioinformatics subjects:
1111
* [Assembly](assembly.md)
1212
* [Annotation](annotation.md)
1313
* [Variant Calling](variant_calling.md)
14-
* [Pan-genome Analysis](pan-genome.md)
14+
* [Pan-genome Analysis](pan_genome.md)
1515
* [RNA-Seq](rna_seq.md)
1616
* [16s Metabarcoding Analysis](16s.md)
1717
* [Whole Metagenome Sequencing](wms.md)

file_formats.md

Lines changed: 12 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# File Formats
22

3-
This lecture is aimed at making you discover the most popular file formats used in bioinforatics. You're expected to have basic working knowledge of linux to be able to follow the lesson.
3+
This lecture is aimed at making you discover the most popular file formats used in bioinformatics. You're expected to have basic working knowledge of Linux to be able to follow the lesson.
44

55
## Table of Contents
66
* [The fasta format](#the-fasta-format)
@@ -11,7 +11,7 @@ This lecture is aimed at making you discover the most popular file formats used
1111

1212
### The fasta format
1313

14-
The fasta format was invented in 1988 and designed to represent nucleotide or peptide sequences. It originates from the [FASTA](https://en.wikipedia.org/wiki/FASTA) software package, but is now a standard in the world of bioinformatics
14+
The fasta format was invented in 1988 and designed to represent nucleotide or peptide sequences. It originates from the [FASTA](https://en.wikipedia.org/wiki/FASTA) software package, but is now a standard in the world of bioinformatics.
1515

1616
The first line in a FASTA file starts with a ">" (greater-than) symbol followed by the description or identifier of the sequence. Following the initial line (used for a unique description of the sequence) is the actual sequence itself in standard one-letter code.
1717

@@ -31,7 +31,7 @@ HDGVLSVNAKRDSFNDESDSEGNVIASERSYGRFARQYSLPNVDESGIKAKCEDGVLKLTLPKLAEEKIN
3131
GNHIEIE
3232
```
3333

34-
A fasta file can contain multiple sequence. Each sequence will be separated by their "header" line, starting by ">"
34+
A fasta file can contain multiple sequence. Each sequence will be separated by their "header" line, starting by ">".
3535

3636
Example:
3737

@@ -49,7 +49,7 @@ SQFIGYPITLFVEK
4949

5050
### The fastq format
5151

52-
The fastq format is also a text based format to represent nucleotide sequences, but also contains the coresponding quality of each nucleotide. It is the standard for storing the output of high-throughput sequencing instruments such as the Illumina machines.
52+
The fastq format is also a text based format to represent nucleotide sequences, but also contains the corresponding quality of each nucleotide. It is the standard for storing the output of high-throughput sequencing instruments such as the Illumina machines.
5353

5454
A fastq file uses four lines per sequence:
5555

@@ -69,9 +69,9 @@ GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
6969

7070
#### Quality
7171

72-
The quality, also called phred scores is the probability that the corresponding basecall is incorrect.
72+
The quality, also called phred score, is the probability that the corresponding basecall is incorrect.
7373

74-
Phred scores use a logarithmic scale, and are represented by ASCII characters, mapping to a quality usually going from 0 to 40
74+
Phred scores use a logarithmic scale, and are represented by ASCII characters, mapping to a quality usually going from 0 to 40.
7575

7676
| Phred Quality Score | Probability of incorrect base call | Base call accuracy
7777
| --- | --- | ---
@@ -90,9 +90,9 @@ SAM (file format) is a text-based format for storing biological sequences aligne
9090

9191
The SAM format consists of a header and an alignment section. The binary representation of a SAM file is a BAM file, which is a compressed SAM file.[1] SAM files can be analysed and edited with the software SAMtools.
9292

93-
The SAM format has a really extensive and complex specification that you can find [here](https://samtools.github.io/hts-specs/SAMv1.pdf)
93+
The SAM format has a really extensive and complex specification that you can find [here](https://samtools.github.io/hts-specs/SAMv1.pdf).
9494

95-
In brief it consists in a header section and reads (with other information) in tab delimited format.
95+
In brief it consists of a header section and reads (with other information) in tab delimited format.
9696

9797
#### Example header section
9898

@@ -111,9 +111,9 @@ M01137:130:00-A:17009:1352/14 * 0 0 * * 0 0 AGCAAAATACAACGATCTGGATGGTAGCATTAGCGA
111111

112112
### the vcf format
113113

114-
the vcf is also a text-based file format. VCF stands for Variant Call Format and is used to store gene sequence variations (SNVs, indels). The format hs been developped for genotyping projects, and is the standard to represent variations in the genome of a species.
114+
The vcf format is also a text-based file format. VCF stands for Variant Call Format and is used to store gene sequence variations (SNVs, indels). The format has been developped for genotyping projects, and is the standard to represent variations in the genome of a species.
115115

116-
A vcf is a tab-delimited file with 9 columns, described [here](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
116+
A vcf is a tab-delimited file, described [here](https://samtools.github.io/hts-specs/VCFv4.2.pdf).
117117

118118
#### VCF Example
119119

@@ -163,9 +163,9 @@ chr2 215634055 . C T
163163

164164
### the gff format
165165

166-
The general feature format (gff) is another text file format. used for describing genes and other features of DNA, RNA and protein sequences. It is the standrad for annotation of genomes.
166+
The general feature format (gff) is another text file format, used for describing genes and other features of DNA, RNA and protein sequences. It is the standard for annotation of genomes.
167167

168-
A gff file should, as a vcf, conatins 9 columns described [here](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)
168+
A gff file should contain 9 columns, described [here](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)
169169

170170
#### Example gff
171171

qc.md

Lines changed: 14 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ The sequencing was done as paired-end 2x150bp.
77

88
## Downloading the data
99

10-
The Raw data were deposited at the European nucleotide archive, under the accession number SRR957824, go the the ENA [website](http://www.ebi.ac.uk/ena) and search for the run with the accession SRR957824. Download the two fastq files associated with the run:
10+
The raw data were deposited at the European Nucleotide Archive, under the accession number SRR957824. Go to the ENA [website](http://www.ebi.ac.uk/ena) and search for the run with the accession SRR957824. Download the two fastq files associated with the run:
1111

1212
```
1313
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_1.fastq.gz
@@ -18,18 +18,18 @@ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_2.fastq.gz
1818

1919
To check the quality of the sequence data we will use a tool called FastQC. With this you can check things like read length distribution, quality distribution across the read length, sequencing artifacts and much more.
2020

21-
FastQC has a graphical interface and can be downloaded and ran on a Windows or LINUX computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
21+
FastQC has a graphical interface and can be downloaded and run on a Windows or Linux computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
2222

23-
However, FastQC is also available as a command line utility on the training server you are using. You can load the module and execute the program as follow:
23+
However, FastQC is also available as a command line utility on the training server you are using. You can load the module and execute the program as follows:
2424

2525
```
26-
module load fastqc
26+
module load FastQC
2727
fastqc $read1 $read2
2828
```
2929

3030
which will produce both a .zip archive containing all the plots, and a html document for you to look at the result in your browser.
3131

32-
Open the html file with your favourite web browser, and try to interpret them
32+
Open the html file with your favourite web browser, and try to interpret them.
3333

3434
Pay special attention to the per base sequence quality and sequence length distribution. Explanations for the various quality modules can be found [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/). Also, have a look at examples of a [good](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and a [bad](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) illumina read set for comparison.
3535

@@ -49,7 +49,9 @@ cd scythe
4949
make all
5050
```
5151

52-
Then, copy or move "scythe" to a directory in your $PATH.
52+
Then, copy or move "scythe" to a directory in your $PATH, for example like this:
53+
54+
`cp scythe $HOME/bin/`
5355

5456
Scythe can be run minimally with:
5557

@@ -73,6 +75,10 @@ cd sickle
7375
make
7476
```
7577

78+
Copy sickle to a directory in your $PATH:
79+
80+
`cp sickle $HOME/bin/`
81+
7682
Sickle has two modes to work with both paired-end and single-end reads: sickle se and sickle pe.
7783

7884
Running sickle by itself will print the help:
@@ -95,8 +101,8 @@ What did the trimming do to the per-base sequence quality, the per sequence qual
95101

96102
What is the sequence duplication levels graph about? Why should you care about a high level of duplication, and why is the level of duplication very low for this data?
97103

98-
Based on the FastQC report, there seems to be a population of shorter reads that are technical artefacts. We will ignore them for now as they will not interfere with our analysis.
104+
Based on the FastQC report, there seems to be a population of shorter reads that are technical artifacts. We will ignore them for now as they will not interfere with our analysis.
99105

100106
## Extra exercises
101107

102-
Perform quality control on the extra datasets given by your instructors
108+
Perform quality control on the extra datasets given by your instructors.

rna_seq.md

Lines changed: 38 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -3,10 +3,10 @@
33
## Load salmon
44

55
```
6-
module load salmon
6+
module load Salmon
77
```
88

9-
## Downloading the data.
9+
## Downloading the data
1010

1111
For this tutorial we will use the test data from [this](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393) paper:
1212

@@ -27,7 +27,7 @@ So to summarize we have:
2727
* HBR + ERCC Spike-In Mix2, Replicate 2
2828
* HBR + ERCC Spike-In Mix2, Replicate 3
2929

30-
You can download the data from [here](http://139.162.178.46/files/tutorials/toy_rna.tar.gz)
30+
You can download the data from [here](http://139.162.178.46/files/tutorials/toy_rna.tar.gz).
3131

3232
Unpack the data and go into the toy_rna directory
3333

@@ -36,13 +36,13 @@ tar xzf toy_rna.tar.gz
3636
cd toy_rna
3737
```
3838

39-
## indexing transcriptome
39+
## Indexing transcriptome
4040

4141
```
4242
salmon index -t chr22_transcripts.fa -i chr22_index
4343
```
4444

45-
## quantify reads using salmon
45+
## Quantify reads using salmon
4646

4747
```bash
4848
for i in *_R1.fastq.gz
@@ -64,9 +64,9 @@ Salmon exposes many different options to the user that enable extra features or
6464

6565
After the salmon commands finish running, you should have a directory named `quant`, which will have a sub-directory for each sample. These sub-directories contain the quantification results of salmon, as well as a lot of other information salmon records about the sample and the run. The main output file (called quant.sf) is rather self-explanatory. For example, take a peek at the quantification file for sample `HBR_Rep1` in `quant/HBR_Rep1/quant.sf` and you’ll see a simple TSV format file listing the name (Name) of each transcript, its length (Length), effective length (EffectiveLength) (more details on this in the documentation), and its abundance in terms of Transcripts Per Million (TPM) and estimated number of reads (NumReads) originating from this transcript.
6666

67-
## import read counts using tximport
67+
## Import read counts using tximport
6868

69-
Using the tximport R package, you can import salmon’s transcript-level quantifications and optionally aggregate them to the gene level for gene-level differential expression analysis
69+
Using the tximport R package, you can import salmon’s transcript-level quantifications and optionally aggregate them to the gene level for gene-level differential expression analysis.
7070

7171
First, open up your favourite R IDE and install the necessary packages:
7272

@@ -86,59 +86,57 @@ library(GenomicFeatures)
8686
library(readr)
8787
```
8888

89-
Salmon did the quantifiation of the transcript level. We want to see which genes are differentially expressed, so we need to link the transcripts name to the gene names. We can use our .gtf annotation for that, and the GenomicFeatures package:
89+
Salmon did the quantifiation of the transcript level. We want to see which genes are differentially expressed, so we need to link the transcript names to the gene names. We can use our .gtf annotation for that, and the GenomicFeatures package:
9090

9191
```R
9292
txdb <- makeTxDbFromGFF("chr22_genes.gtf")
9393
k <- keys(txdb, keytype = "GENEID")
94-
df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
95-
tx2gene <- df[, 2:1]
94+
tx2gene <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
9695
head(tx2gene)
9796
```
9897

99-
now we can import the salmon quantification:
98+
Now we can import the salmon quantification. First, download the file with sample descriptions from [here](https://raw.githubusercontent.com/HadrienG/tutorials/master/data/samples.txt) and put it in the toy_rna directory. Then, use that file to load the corresponding quantification data.
10099

101100
```R
102101
samples <- read.table("samples.txt", header = TRUE)
103-
files <- file.path("quant", samples$quant, "quant.sf")
102+
files <- file.path("quant", samples$sample, "quant.sf")
104103
names(files) <- paste0(samples$sample)
105104
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, reader = read_tsv)
106105
```
107106

108-
take a look at the data:
107+
Take a look at the data:
109108

110109
```R
111110
head(txi.salmon$counts)
112111
```
113112

114-
## differential expression using DeSeq2
113+
## Differential expression using DESeq2
115114

116-
install the necessary package
115+
Install the necessary package:
117116

118117
```R
119118
biocLite('DESeq2')
120119
```
121120

122-
then load it:
121+
Then load it:
123122

124123
```R
125124
library(DESeq2)
126125
```
127126

128-
Instantiate the DESeqDataSet and generate result table. See ?DESeqDataSetFromTximport and ?DESeq for more information about the steps performed by the program.
129-
127+
Instantiate the DESeqDataSet and generate result table. See `?DESeqDataSetFromTximport` and `?DESeq` for more information about the steps performed by the program.
130128

131129
```R
132130
dds <- DESeqDataSetFromTximport(txi.salmon, samples, ~condition)
133131
dds <- DESeq(dds)
134132
res <- results(dds)
135133
```
136134

137-
run the `summary` command to have an idea of how many genes are up and down-regulated between the two conditions
135+
Run the `summary` command to get an idea of how many genes are up- and downregulated between the two conditions:
138136

139137
`summary(res)`
140138

141-
DESeq uses a negative binomial distribution. Such distribution has two parameters: mean and dispersion. The dispersion is a parameter describing how much the variance deviates from the mean.
139+
DESeq uses a negative binomial distribution. Such distributions have two parameters: mean and dispersion. The dispersion is a parameter describing how much the variance deviates from the mean.
142140

143141
You can read more about the methods used by DESeq2 in the [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8) or the [vignette](https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq/inst/doc/DESeq.pdf)
144142

@@ -218,17 +216,17 @@ library(gageData)
218216
Let’s use the `mapIds` function to add more columns to the results. The row.names of our results table has the Ensembl gene ID (our key), so we need to specify `keytype=ENSEMBL`. The column argument tells the `mapIds` function which information we want, and the `multiVals` argument tells the function what to do if there are multiple possible values for a single input value. Here we ask to just give us back the first one that occurs in the database. Let’s get the Entrez IDs, gene symbols, and full gene names.
219217

220218
```R
221-
res$symbol = mapIds(org.Hs.eg.db,
219+
res$symbol <- mapIds(org.Hs.eg.db,
222220
keys=row.names(res),
223221
column="SYMBOL",
224222
keytype="ENSEMBL",
225223
multiVals="first")
226-
res$entrez = mapIds(org.Hs.eg.db,
224+
res$entrez <- mapIds(org.Hs.eg.db,
227225
keys=row.names(res),
228226
column="ENTREZID",
229227
keytype="ENSEMBL",
230228
multiVals="first")
231-
res$name = mapIds(org.Hs.eg.db,
229+
res$name <- mapIds(org.Hs.eg.db,
232230
keys=row.names(res),
233231
column="GENENAME",
234232
keytype="ENSEMBL",
@@ -237,51 +235,55 @@ res$name = mapIds(org.Hs.eg.db,
237235
head(res)
238236
```
239237

240-
We’re going to use the [gage](http://bioconductor.org/packages/release/bioc/html/gage.html) package for pathway analysis, and the [pathview](http://bioconductor.org/packages/release/bioc/html/pathview.html) package to draw a pathway diagram.
241-
238+
We’re going to use the [gage](https://bioconductor.org/packages/release/bioc/html/gage.html) package for pathway analysis, and the [pathview](https://bioconductor.org/packages/release/bioc/html/pathview.html) package to draw a pathway diagram.
242239

243240
The gageData package has pre-compiled databases mapping genes to KEGG pathways and GO terms for common organisms:
244241

245242
```R
246243
data(kegg.sets.hs)
247244
data(sigmet.idx.hs)
248-
kegg.sets.hs = kegg.sets.hs[sigmet.idx.hs]
245+
kegg.sets.hs <- kegg.sets.hs[sigmet.idx.hs]
249246
head(kegg.sets.hs, 3)
250247
```
251248

252-
Run the pathway analysis. See help on the gage function with ?gage. Specifically, you might want to try changing the value of same.dir
249+
Run the pathway analysis. See help on the gage function with `?gage`. Specifically, you might want to try changing the value of same.dir.
253250

254251
```R
255-
foldchanges = res$log2FoldChange
256-
names(foldchanges) = res$entrez
257-
keggres = gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE)
252+
foldchanges <- res$log2FoldChange
253+
names(foldchanges) <- res$entrez
254+
keggres <- gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE)
258255
lapply(keggres, head)
259256
```
260257

261-
pull out the top 5 upregulated pathways, then further process that just to get the IDs. We’ll use these KEGG pathway IDs downstream for plotting.
258+
Pull out the top 5 upregulated pathways, then further process that just to get the IDs. We’ll use these KEGG pathway IDs downstream for plotting. The `dplyr` package is required to use the pipe (`%>%`) construct.
262259

263260
```R
261+
library(dplyr)
262+
264263
# Get the pathways
265-
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
264+
keggrespathways <- data.frame(id=rownames(keggres$greater), keggres$greater) %>%
266265
tbl_df() %>%
267266
filter(row_number()<=5) %>%
268267
.$id %>%
269268
as.character()
270269
keggrespathways
271270

272271
# Get the IDs.
273-
keggresids = substr(keggrespathways, start=1, stop=8)
272+
keggresids <- substr(keggrespathways, start=1, stop=8)
274273
keggresids
275274
```
276275

277-
Finally, the pathview() function in the pathview package makes the plots. Let’s write a function so we can loop through and draw plots for the top 5 pathways we created above.
276+
Finally, the `pathview()` function in the pathview package makes the plots. Let’s write a function so we can loop through and draw plots for the top 5 pathways we created above.
278277

279278
```R
280279
# Define plotting function for applying later
281-
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa", new.signature=FALSE)
280+
plot_pathway <- function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa", new.signature=FALSE)
281+
282+
# Unload dplyr since it conflicts with the next line
283+
detach("package:dplyr", unload=T)
282284

283285
# plot multiple pathways (plots saved to disk and returns a throwaway list object)
284-
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa"))
286+
tmp <- sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa"))
285287
```
286288

287289
#### Thanks

0 commit comments

Comments
 (0)