Genome assembly and evaluation
During the lecture we discussed the difference between short- and long-read DNA/RNA sequencing approaches and how they are applied to solve different assembly problems. In this exercise you will assemble the bacterial reads you have just processed and explore how the k-mer size affects the assembly result.
- Read all instructions carefully before you start the exercise.
- In the examples below I use the username studentX but you should use the one that has been assigned to you, so please make sure you update all commands with the correct username.
- Run the exercise in two terminal tabs, one where you log into Albiorix and one for accessing the files on your local computer.
- Install the program Bandage on your local computer.
- Make sure your working directory is correct.
cd /nobackup/data18/Assembly_exercise/studentX
- This directory contains the script runMegahit.sge, which sends the assembly analysis to the queue system. Open this file in your favourite text editor (vim, nano, ...) and look at the content. Carefully read the comments in the file and make sure you understand what each line of code does; ask a teacher if you need help interpreting it.
nano runMegahit.sge
- Set a value for the KMER, SAMPLE, FILE1 and FILE2 variables. The KMER variable sets the k-mer size used by the assembly algorithm. In this exercise we will explore the effect of the k-mer size on the assembly result. The SAMPLE variable is used to give the output files unique names that also reflect which k-mer size was used for the analysis. FILE1 and FILE2 are the trimmed FASTQ files you produced in the previous exercise.
- Run the analysis by submitting your script to the computer cluster queue system.
qsub runMegahit.sge
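To get a feel for what the KMER value controls, the sketch below splits a short sequence into all of its overlapping k-mers. It is a toy illustration only: the function name list_kmers and the example sequence are made up for this purpose, and this is not how Megahit itself is implemented.

```shell
# Print every overlapping k-mer of a sequence, one per line.
# Usage: list_kmers SEQUENCE K   (hypothetical helper, not part of Megahit)
list_kmers() {
    awk -v seq="$1" -v k="$2" 'BEGIN {
        # Slide a window of width k along the sequence, one base at a time.
        for (i = 1; i + k - 1 <= length(seq); i++)
            print substr(seq, i, k)
    }'
}
```

For example, list_kmers ACGTAC 3 prints ACG, CGT, GTA and TAC. A sequence of length L contains L - k + 1 k-mers, so a larger k yields fewer but longer and more specific k-mers from the same read.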
- You can monitor the state of your analysis using the command qstat, and by looking at the content of the output files. Your analysis will disappear from the list once it has finished.
- After the analysis has finished you will find a new file ending with .fastg in the directory where you ran your analysis from. Copy this file to your local computer using this command (remember to run this command from the second tab in your terminal, the one you can access your local filesystem from):
rsync -hav [email protected]:/nobackup/data18/Assembly_exercise/studentX/*.fastg .
- Open the file in Bandage and press Draw Graph to look at the assembly graph from your analysis. We know that the data originates from a bacterium with a circular chromosome, so we would ideally like to see at least one circular edge in our graph.
- What does it mean if you see no circles in your graph?
- What does it mean if you see more than one circle in your graph?
- Try running the analysis again using a different k-mer size to see if you can improve the result.
- What effect does the k-mer size have on the number of contigs produced?
- What effect does it have on the overall size of the assembly?
- Hint - You can get an estimate of the assembly size by looking at the size of the assembly file.
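If you want exact numbers rather than the file-size estimate, the sketch below counts the contigs and the total number of assembled bases in a FASTA file. It assumes your assembly produced a contigs file in FASTA format; the function name assembly_stats is made up for this example, and the file you pass to it depends on how runMegahit.sge names its output.

```shell
# Report the number of contigs and the total number of bases in a FASTA file.
# Usage: assembly_stats CONTIGS.fa   (hypothetical helper; file name is yours)
assembly_stats() {
    awk '
        /^>/ { contigs++; next }    # each header line starts a new contig
        {
            gsub(/[[:space:]]/, "") # drop stray whitespace and line endings
            bases += length($0)     # add this sequence line to the total
        }
        END { printf "contigs=%d bases=%d\n", contigs, bases }
    ' "$1"
}
```

Running this on the contigs from each k-mer size gives one line per assembly that you can compare side by side.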
- In your working directory, there is a file named runBUSCO.sge. Open this file and look at its contents.
- To understand what each of the command options means, run module load BUSCO/v3.1.0, then run_BUSCO.py --help to bring up the BUSCO manual.
- Set the variables INFILE and OUTDIR to define both the input file for the analysis and the desired name of the results directory.
- Submit the job to the queue.
- Once the analysis is complete, take a look at the output files.
- A short summary of the results can be found in run_${OUTDIR}/short_summary_${OUTDIR}.txt.
- Repeat steps 11 and 12 for some more of your assemblies. How do the results compare?
- Do the better assemblies (based on the assembly graph) give better BUSCO results?
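To compare several BUSCO runs at a glance, the sketch below collects the one-line score from every short summary in the current directory. It assumes the summaries contain a score line in BUSCO notation of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:.. ; check one of your own files first to confirm. The function name busco_scores is made up for this example.

```shell
# Print "<summary file><TAB><BUSCO score line>" for every run directory.
# Usage: busco_scores   (hypothetical helper; run it from your working directory)
busco_scores() {
    for f in run_*/short_summary_*.txt; do
        [ -e "$f" ] || continue   # skip if the pattern matched no files
        # Extract the first token that looks like C:<number>%[S:...
        score=$(grep -o 'C:[0-9.]*%\[S:[^[:space:]]*' "$f" | head -n 1)
        printf '%s\t%s\n' "$f" "$score"
    done
}
```

If you included the k-mer size in each OUTDIR name, the output lines up the completeness scores against the k-mer sizes you tried.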
Congratulations - you now have a reference genome to do science with!