Genome assembly and evaluation

Introduction

In the lecture we discussed the difference between short- and long-read DNA/RNA sequencing approaches and how they are applied to different assembly problems. In this exercise you will assemble the bacterial reads you have just processed and explore how the k-mer size affects the assembly result.

Preparations

Read all instructions carefully before you start the exercise.
The examples below use the username studentX. Replace it with the username assigned to you in every command.
Run the exercise in two terminal tabs, one where you log into Albiorix and one for accessing the files on your local computer.
Install the program Bandage on your local computer.

Exercise

Make sure your working directory is correct.

cd /nobackup/data18/Assembly_exercise/studentX

This directory contains the script runMegahit.sge, which sends the assembly analysis to the queue system. Open this file in your favourite text editor (vim, nano, ...) and look at the content. Make sure you understand what each line of code does, and read the comments in the file carefully. Ask a teacher if you need help interpreting the file.

nano runMegahit.sge

Set a value for the KMER, SAMPLE, FILE1 and FILE2 variables. The KMER variable sets the k-mer size used by the assembly algorithm. In this exercise we will explore the effect of the k-mer size on the assembly result. The SAMPLE variable gives the output files unique names, which should also reflect the k-mer size used for the analysis. FILE1 and FILE2 are the trimmed FASTQ files you produced in the previous exercise.
Run the analysis by submitting your script to the computer cluster queue system.

qsub runMegahit.sge
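A k-mer value that the assembler rejects only surfaces once the job has started, so a quick check before running qsub can save a trip through the queue. According to MEGAHIT's help text, k-mer sizes must be odd and lie between 15 and 255. The sketch below assumes that runMegahit.sge passes KMER straight to MEGAHIT; the function name check_kmer is hypothetical and not part of the course script.

```shell
# Hypothetical helper: succeeds only for an odd integer between 15 and 255.
check_kmer() {
    k="$1"
    # Reject empty or non-numeric input.
    case "$k" in
        ''|*[!0-9]*) echo "KMER must be a positive integer: '$k'" >&2; return 1 ;;
    esac
    # Reject values outside the range MEGAHIT documents.
    if [ "$k" -lt 15 ] || [ "$k" -gt 255 ]; then
        echo "KMER must be between 15 and 255: $k" >&2
        return 1
    fi
    # Reject even values by looking at the last digit.
    case "$k" in
        *[02468]) echo "KMER must be odd: $k" >&2; return 1 ;;
    esac
    echo "KMER=$k looks valid"
}

check_kmer 51
```

Paste the function into your terminal and call it with the value you intend to put in the script.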

You can monitor the state of your analysis using the command qstat, and by looking at the content of the output files. Your analysis will disappear from the list once it has finished.
After the analysis has finished you will find a new file ending in .fastg in the directory you submitted the job from. Copy this file to your local computer with the command below (run it from the second terminal tab, the one with access to your local filesystem):

rsync -hav [email protected]:/nobackup/data18/Assembly_exercise/studentX/*.fastg .
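Rather than rerunning qstat by hand while waiting for a job, the check can be wrapped in a polling loop on the cluster. The sketch below assumes an SGE-style qstat that prints one line per job with the job ID in the first column. The helper name job_in_list and the job ID 12345 are hypothetical.

```shell
# Hypothetical helper: reads qstat output on stdin and succeeds if the
# given job ID appears in the first column of any line.
job_in_list() {
    awk -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Usage on the cluster (replace 12345 with the ID printed by qsub):
# while qstat -u studentX | job_in_list 12345; do sleep 30; done
# echo "Job 12345 has left the queue"
```

Once the loop exits, the job is no longer listed and the .fastg file should be ready to copy.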

Open the file in Bandage and press Draw Graph to look at the assembly graph from your analysis. We know that the data originates from a bacterium with a circular chromosome, so we would ideally like to see at least one circular edge in our graph.

What does it mean if you see no circles in your graph?
What does it mean if you see more than one circle in your graph?

Try running the analysis again using a different k-mer size to see if you can improve the result.

Things to consider when interpreting the result

What effect does the k-mer size have on the number of contigs produced?
What effect does it have on the overall size of the assembly?
- Hint - You can get an estimate of the assembly size by looking at the size of the assembly file.
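File size is only a rough proxy, because it also counts header lines and newline characters. A more direct estimate counts the contigs and bases in the assembly. The sketch below assumes a plain FASTA file of contigs; the helper name assembly_stats and the file name in the usage line are hypothetical.

```shell
# Hypothetical helper: prints the number of contigs and the total number
# of bases in a FASTA file.
assembly_stats() {
    awk '
        # Header line: count one more contig.
        /^>/ { contigs++; next }
        # Sequence line: strip whitespace, then add its length.
        { gsub(/[ \t\r]/, ""); bases += length($0) }
        END { printf "contigs=%d bases=%d\n", contigs, bases }
    ' "$1"
}

# Usage: assembly_stats my_sample_k51.contigs.fa
```

Running this on the assemblies from different k-mer sizes gives two numbers per run that can be compared directly.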

Assembly evaluation

In your working directory, there is a file named runBUSCO.sge. Open this file and look at its contents.
- To understand what each of the command options means, run module load BUSCO/v3.1.0, then run_BUSCO.py --help to bring up the BUSCO manual.
Set the variables INFILE and OUTDIR to define the input file for the analysis and the desired name of the results directory.
Submit the job to the queue.
Once the analysis is complete, take a look at the output files.
- A short summary of the results can be found in run_${OUTDIR}/short_summary_${OUTDIR}.txt.
Repeat the steps above (set INFILE and OUTDIR, submit the job, inspect the summary) for some more of your assemblies. How do the results compare?
- Do the better assemblies (based on the assembly graph) give better BUSCO results?
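Comparing several runs is easier when the one-line score from each summary is collected side by side. The sketch below assumes that each BUSCO v3 short summary contains a score line of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:... The helper name busco_scores is hypothetical, and the glob in the usage line assumes the results directories sit in your working directory.

```shell
# Hypothetical helper: prints "file<TAB>score line" for each summary file given.
busco_scores() {
    for f in "$@"; do
        # Take the first line containing the C:..% score string and
        # strip the surrounding whitespace.
        score=$(grep -m 1 'C:[0-9.]*%' "$f" | tr -d ' \t')
        printf '%s\t%s\n' "$f" "$score"
    done
}

# Usage: busco_scores run_*/short_summary_*.txt
```

C is the percentage of complete BUSCOs (S single-copy, D duplicated), F fragmented and M missing, so a higher C and a lower M generally indicate a more complete assembly.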

Congratulations - you now have a reference genome to do science with!
