Quality control of high throughput sequence data

Preparing for the exercises

Make sure you have an account on the computer cluster Albiorix.
Usernames and passwords will be passed out before we begin.
Run the exercises in two terminal tabs - one where you log into Albiorix and one for accessing the files on your local computer.
After each analysis has run, take a look at the output produced, to be sure that the process hasn't encountered any unexpected errors.
- Files to check include log files in the analysis result subfolders, and any files ending in .sge.e##### and .sge.o#####.
- To keep an eye on the progress of a log file, you can follow the end of the file in real time using tail -f <logfile> (press Ctrl+C to stop following).
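
Part of this check can be scripted. The following is a minimal sketch, assuming the queue system writes error output to files matching *.sge.e* in the directory you pass in. The function name list_nonempty_errors and the demo file names are made up for illustration. A non-empty error file is not always a failure, since some programs may write ordinary progress messages there, so treat the output as a list of files worth reading.

```bash
# List SGE error files (*.sge.e<jobid>) that contain any output.
list_nonempty_errors() {
    dir="${1:-.}"
    for f in "$dir"/*.sge.e*; do
        # Skip the unexpanded pattern when no files match.
        [ -e "$f" ] || continue
        # -s is true only if the file exists and its size is above zero.
        if [ -s "$f" ]; then
            echo "$f"
        fi
    done
}

# Demo with made-up file names in a temporary directory.
demo_dir=$(mktemp -d)
: > "$demo_dir/runFastQC.sge.e12345"                          # empty file
echo "command not found" > "$demo_dir/runCutadapt.sge.e12346" # has content
list_nonempty_errors "$demo_dir"   # prints only the runCutadapt error file
```

In your own working directory you would call list_nonempty_errors with no argument.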

Start by logging in to Albiorix (note - for all exercises, replace studentX with the username you have been given).

ssh -Y studentX@<Albiorix address>

When you first log in you will be located in your home directory. However, for this exercise you should move to a different part of the file system.

cd /nobackup/data18/Assembly_exercise/studentX

For this part of the exercise, you will be using the files runFastQC.sge and runCutadapt.sge. These are templates for scripts that can be sent to the queue system in order to run FastQC and Cutadapt, once you have filled in the relevant commands or variables.

For this exercise, you will need to view the help files for some programs, so first you'll have to load a few modules by running the following command:

module load FastQC/v0.11.8 Anaconda3/v2019.10 TrimGalore/v0.6.0

Assessing the raw data

The raw, compressed FASTQ files you'll be using in today's exercises can be found in /db/Teaching/Assembly/Sulfitobacter/raw
Make a local copy of these files to work on using rsync:

rsync -hav /db/Teaching/Assembly/Sulfitobacter/raw/*.fastq.gz .
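
After copying, it can be worth confirming that the compressed files are readable. The sketch below uses gzip -t, which reads each file and reports whether the compressed stream is intact. It does not compare the copy against the source. The function name check_gz and the demo files are invented for this example.

```bash
# Test every compressed FASTQ file in a directory for gzip corruption.
check_gz() {
    dir="${1:-.}"
    status=0
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue          # no matching files
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "BROKEN  $f"
            status=1
        fi
    done
    return "$status"
}

# Demo: one valid gzip file and one file that only has the right name.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$demo_dir/good.fastq.gz"
printf 'this is not gzip data\n' > "$demo_dir/bad.fastq.gz"
check_gz "$demo_dir" || echo "At least one file failed the integrity test"
```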

Inspect the first few lines of the FASTQ files with zless (and close the file with q when you're done)
- Does the format of the files look as you expect?
Run FastQC on your FASTQ files, by adding the relevant command to the file runFastQC.sge
- Hint - the basic form of the command, shown when you run fastqc --help, should be sufficient.
- Once you're happy with the command, submit it to the queue with qsub runFastQC.sge
Explore the FastQC reports (firefox *.html) and note your observations. For example:
a. Does the per base sequence quality look acceptable?
b. Does the per base sequence content suggest any bias?
c. Do any adapters need removing?
- Hint - the makers of FastQC provide information on interpreting the plots.

If Firefox is slow, rsync the files to your local computer and view them there.

Discuss your observations with another student or an instructor.
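
For the question about file format at the start of this section, a quick automated check can complement looking at the file by eye. This is a minimal sketch, assuming every record occupies exactly four lines (header starting with @, sequence, separator starting with +, qualities), which is the usual layout but not the only one the FASTQ format permits. The function name check_fastq_format and the demo reads are invented for illustration.

```bash
# Check that a gzipped FASTQ file follows the 4-line record layout and
# that each quality string is as long as its sequence.
check_fastq_format() {
    gzip -dc "$1" | awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            bad = 1; print "line " NR ": header does not start with @" }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            bad = 1; print "line " NR ": separator does not start with +" }
        NR % 4 == 0 && length($0) != seqlen {
            bad = 1; print "line " NR ": quality length differs from sequence length" }
        END {
            if (NR % 4 != 0) { bad = 1; print "file ends with an incomplete record" }
            if (!bad) printf "format OK, records: %d\n", NR / 4
            exit (bad ? 1 : 0)
        }'
}

# Demo: one well-formed record, and one with a short quality string.
demo_dir=$(mktemp -d)
printf '@read1 1:N:0\nACGTN\n+\nIIII#\n' | gzip > "$demo_dir/good.fastq.gz"
printf '@read1\nACGT\n+\nIII\n' | gzip > "$demo_dir/bad.fastq.gz"
check_fastq_format "$demo_dir/good.fastq.gz"
check_fastq_format "$demo_dir/bad.fastq.gz" || echo "problems found"
```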

Trimming the raw data

Run Cutadapt based on your observations, by adding the relevant variables to the file runCutadapt.sge
There are a few variables you'll need to decide on and add to the script:
- INFILE1 and INFILE2 are the raw input files (forward and reverse reads, respectively).
- OUTFILE1 and OUTFILE2 are the names you want to give to the output files (forward and reverse reads, respectively).
  - Hint - The files will be output as compressed FASTQ files, so be sure to name them accordingly.
- TRIM1 and TRIM2 are the number of bases to be hard-trimmed from the 5' end of the forward and reverse reads, respectively.
- QUALITY is the Phred quality score threshold; bases below the given threshold will be trimmed from the 3' end of each read.
- You should also run cutadapt --help to make sure you understand the options which have already been set.
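
As an illustration of how these variables might fit together, here is a minimal sketch. The file names and numeric values are placeholders invented for this example, not recommendations. The mapping to options is an assumption based on Cutadapt's documented paired-end options (-u and -U for fixed trimming, -q for quality trimming, -o and -p for the two outputs); check runCutadapt.sge itself to see how the template really uses each variable. The command is only printed, not run.

```bash
# Placeholder values - replace with choices based on your FastQC reports.
INFILE1="sample_R1.fastq.gz"            # hypothetical forward reads
INFILE2="sample_R2.fastq.gz"            # hypothetical reverse reads
OUTFILE1="sample_R1.trimmed.fastq.gz"   # .gz suffix, as the output is compressed
OUTFILE2="sample_R2.trimmed.fastq.gz"
TRIM1=10      # bases to remove from the 5' end of forward reads
TRIM2=10      # bases to remove from the 5' end of reverse reads
QUALITY=20    # Phred cutoff for 3' quality trimming

# Assumed mapping of variables to Cutadapt options; the template may differ.
CMD="cutadapt -u $TRIM1 -U $TRIM2 -q $QUALITY -o $OUTFILE1 -p $OUTFILE2 $INFILE1 $INFILE2"

# Print the command for review instead of running it.
echo "$CMD"
```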

Once you're happy with the command, submit it to the queue with qsub runCutadapt.sge

Assessing the trimmed data

Re-run FastQC on the newly-trimmed files, by editing the previous command to target the new files.
Explore the new FastQC reports, and check whether the data have improved compared with the reports for the raw files.
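
Alongside the FastQC reports, a simple count of reads and mean read length before and after trimming can show how much data was removed. The sketch below assumes four-line FASTQ records; the function name fastq_stats and the demo files are made up for illustration.

```bash
# Print the number of reads and the mean read length of a gzipped FASTQ file.
fastq_stats() {
    # With 4-line records, the sequence is the second line of each record.
    gzip -dc "$1" | awk '
        NR % 4 == 2 { n++; bases += length($0) }
        END {
            if (n > 0) printf "%d reads, mean length %.1f\n", n, bases / n
            else print "0 reads"
        }'
}

# Demo: a tiny "raw" file and a shorter "trimmed" version of it.
demo_dir=$(mktemp -d)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n' | gzip > "$demo_dir/raw.fastq.gz"
printf '@r1\nACGTAC\n+\nIIIIII\n@r2\nACGT\n+\nIIII\n' | gzip > "$demo_dir/trimmed.fastq.gz"
fastq_stats "$demo_dir/raw.fastq.gz"       # 2 reads, mean length 7.0
fastq_stats "$demo_dir/trimmed.fastq.gz"   # 2 reads, mean length 5.0
```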

Are you happy with how your data looks?

No - Try rerunning Cutadapt with some different parameters, generate some new FastQC files for the results, and see whether this looks any better.
Yes - Excellent! You can either continue directly to the Assembly section of the exercise, or try out the alternative QC approach below.

Optional - An alternative approach

As mentioned in the lecture, Trim Galore! is a wrapper script around Cutadapt and FastQC. It limits how you can interact with the two programs, but you may still find it more convenient than running Cutadapt and FastQC separately. When multiple tools are available, it's worth trying them out to see which works best for you!

S1. Re-run the trimming using trim_galore, by adding the relevant command to the file runTrimGalore.sge

- Hint - run trim_galore --help
- Once you're happy with the command, submit it to the queue with qsub runTrimGalore.sge
How do the results compare to running Cutadapt and FastQC separately?
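
To give a feel for how a Trim Galore! command differs from the Cutadapt one, here is a minimal sketch that prints a possible command without running it. The file names, output directory and numeric values are placeholders invented for this example, and the contents of runTrimGalore.sge are not assumed. The options shown are taken from the trim_galore --help text.

```bash
# Placeholder inputs and settings, invented for illustration.
INFILE1="sample_R1.fastq.gz"
INFILE2="sample_R2.fastq.gz"
OUTDIR="trim_galore_out"

# --paired       treat the two files as forward/reverse mates
# --quality      Phred cutoff for 3' quality trimming
# --clip_R1/R2   bases to remove from the 5' end of each read
# --fastqc       run FastQC on the trimmed output
# --output_dir   where to write the results
CMD="trim_galore --paired --quality 20 --clip_R1 10 --clip_R2 10 --fastqc --output_dir $OUTDIR $INFILE1 $INFILE2"

# Print the command for review instead of running it.
echo "$CMD"
```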

Next: Genome assembly and evaluation
