Analysis for ABT480/CS485G genome assembly. The genome is for the organism Pyricularia oryzae, which is a fungus that infects monocots such as wheat, rice, ryegrass, etc. The host organism for the specimen which provided the original sequence data was Panicum sabulorum (a type of grass).
The F1 and R1 sequence datasets were analyzed using FASTQC:
ssh -Y <[email protected]>
cd MyGenome
fastqc &
Load in F1 and R1 datasets into GUI interface. Take screen shots of output files:
java -jar ~/trimmomatic-0.38.jar PE -threads 16 -phred33 -trimlog file.txt \
U249_1.fq.gz U249_2.fq.gz U249_1_paired.fastq U249_1_unpaired.fastq U249_2_paired.fastq U249_2_unpaired.fastq \
ILLUMINACLIP:adaptors.fasta:2:30:10 SLIDINGWINDOW:20:20 MINLEN:100
First, run head to view the headers of each read to find a suitable string for running grep
head U249_1_paired.fastq
Use the beginning of the header for each read to count the number of reads with grep
grep -c @A00261:902:HGC52DSX7: U249_1_paired.fastq
The output shows there are 12042994 forward reads remaining after trimming.
The genome assembly was done on the University of Kentucky Morgan Compute Cluster. Before running the assembly ensure that your forward and reverse trimmed, paired files are present in the directory, and that they are uncompressed. Then, rename your files to match the following format (.fastq is ok too): genomeID_1_paired.fq; genomeID_2_paired.fq (make sure you use mv and not cp as it is critical the directory only contain two files with a paired.fq suffix).
Velvet was run as a SLURM script as follows:
sbatch velvetoptimiser_noclean.sh U249 61 131 10
This command submits the script containing the velvet command with arguments corresponding to the location of the script, the genome name, starting k-mer value, ending k-mer value, and step size as a SLURM job into the supercomputer job queue. Progress can be monitored in the SLURM log file (cat slurm-jobID.out).
The SLURM script was then rerun a second time with a narrower k-mer range and step size of 2 to reach the best possible assembly:
sbatch velvetoptimiser_noclean.sh U249 91 131 2
The optimized assembly can be found in the newly generated velvet_U249_91_131_2_noclean directory (name may vary) as contigs.fa. The assembly was then renamed to U249.fasta.
The assembly headers were renamed to a more standard format using the SimpleFastaHeaders.pl perl script:
perl SimpleFastaHeaders.pl U249.fasta U249
The arguments correspond to the location of the assembly and the genome ID to be used in the new headers. This produces a new FASTA file U249_nh.fasta.
Full log file here: velvet_optimized_logfile.txt
In preparation for submitting the assembled genome to NCBI, contigs that comprise mostly mitochondrial sequences must be identified. NCBI also does not accept contigs with less than 200 base pairs, so these were removed from the assembly using the CullShortContigs.pl perl script:
perl CullShortContigs.pl U249_nh.fasta
(Note: this script produces a new FASTA file U249_final.fasta. In later steps this new file was used in place of U249.fasta, even though this is not reflected in the example commands)
To determine which contigs in the resulting assembly correspond to the mitochondrial genome, run the following command:
blastn -query MoMitochondrion.fasta -subject U249_nh.fasta -evalue 1e-50 -max_target_seqs 20000 -outfmt '6 qseqid sseqid slen length qstart qend sstart send btop' > B71v2sh.U249.BLAST
Details on the output format can be read at http://www.metagenomics.wiki/tools/blast/blastn-output-format-6. The results were then outputted to U249_mitochondrion.csv using the following command:
awk '$3/$4 > 0.9 {print $2 ",mitochondrion"}' B71v2sh.U249.BLAST > U249_mitochondrion.csv
Run the following command to perform a BLAST search against the B71 reference genome (not provided):
blastn -query B71v2sh_masked.fasta -subject U249.fasta -evalue 1e-50 -max_target_seqs 20000 -outfmt '6 qseqid sseqid qstart qend sstart send btop' -out B71v2sh.U249.BLAST
Then, create a directory named U249_BLAST and move your BLAST output file into that directory.
Run the CallVariants.sh SLURM script:
sbatch CallVariants.sh U249_BLAST/
A summary of the total number of variants (SNPs) identified is in the SNP_counts_out_Apr_9_14-32-41_2024 file. Information on each SNP is present in the B71v2sh_v_U249_out file.
Genome completeness was analyzed using BUSCO, which uses BLAST to search for a set of single-copy genes that are known to have orthologs in genomes of related species. For this analysis, the odb10_ascomycota (a fungal phylum) ortholog database was used.
BUSCO was ran as a SLURM script as follows:
sbatch BuscoSingularity.sh U249.fasta
Gene prediction was first done using the standalone gene finders SNAP and AUGUSTUS, which both utilize hidden Markov models (HHMs) to search for regions in the genome which have a high likelihood of being genes. For the final gene prediction, MAKER was utilized. MAKER utilizes results from SNAP, AUGUSTUS, and other gene finding programs, as well as EST and protein sequences to create a more accurate gene prediction which takes both predictions and evidences into account.
Note: the following sections were run using the screen command
The first step is generating the training data. The data used for this were the gene annotations from the reference genome of the related B71 strain. The annotations were in GFF format, but they need to be converted to ZFF format for SNAP using the following command:
maker2zff B71Ref2.gff3
This command generates the files genome.ann (ZFF format), containing the positions of the exons and genes in each contig, and genome.dna (FASTA format), containing the contig sequences. To extract the gene sequences from these files, we can run fathom (a sequence analysis and extraction tool included in SNAP):
fathom genome.ann genome.dna -categorize 1000
The categorize subcommand creates pairs of ZFF and FASTA corresponding to various regions such as those with errors, overlapping genes, etc. It takes as a argument the max number of base pairs of intergenic sequence on both sides of each gene to train the HMM on sequences frequently found near genes.
Next, the genome, transcript, and protein sequences need to be extracted using the export subcommand of fathom. The plus subcommand will also be used to flip genes that are on the reverse strand. For this step, we will be using the uni.ann and uni.dna files, which contain unique, non-overlapping genes without alternative splicing, as this is what performs best for SNAP training:
fathom uni.ann uni.dna -export 1000 -plus
This command outputs four files: export.ann, containing ZFF annotations of the exons of each gene, export.dna, containing the DNA sequence of each genes, including introns and flanking regions, export.tx, containing the DNA sequence for each transcript, and export.aa, containing the protein sequence of each gene.
Now, to train the HMM, we will use the forge command included in SNAP:
forge export.ann export.dna
This command outputs a large number of files, so we use the hmm-assembler.pl script included with SNAP to consolidate these into one file:
hmm-assembler.pl Moryzae . > Moryzae.hmm
The first argument is the name to be used in the header for identification. The second argument is the directory containing the files to be consolidated. We then redirect the output of the script into a new file, Moryzae.hmm.
snap-hmm Moryzae.hmm U249.fasta > U249-snap.zff
Full results in U249-snap.zff. Summary of results can be obtained using the following command:
fathom U249-snap.zff U249.fasta -gene-stats
AUGUSTUS uses a different HMM implementation than SNAP, so results may vary from the SNAP results. We do not need to train AUGUSTUS since it already includes a trained model for the species Magnaporthe grisea, which is very closely related to Pyricularia Oryzae. Run AUGUSTUS using the following command:
augustus --species=magnaporthe_grisea --gff3=on \
--singlestrand=true --progress=true \
U249.fasta > U249-augustus.gff3
The first argument is the parameter file to be used. The next argument changes the output format to GFF3. The next argument tells AUGUSTUS to predict genes seperately on each strand. The next argument shows the progress while AUGUSTUS is running. The last argument is the path to the genome sequence FASTA file. Finally, the output is piped to a new file, U249-augustus.gff3.
Since MAKER has a very large number of options, we used a configuration file to specify our desired behavior. MAKER configuration files must first be generated using the command maker -CTL
. The generated config files can then be modified to the user's liking. The config file we used can be found in the maker_opts.ctl file. This file was modified to include the paths to the genome, SNAP HMM trained in the earlier step, as well as an additional Magnaporthe protein sequence file obtained from NCBI. Options for the model organism to be used for AUGUSTUS as well as other options were also modified. With the config file ready, we can run maker as such:
maker 2>&1 | tee maker.log
Note that we use the tee
command to redirect the output to a log file as well as printing the output to the terminal.
Finally, we merge all the results into one GFF file, since MAKER seperates results for each contig into its own directory:
gff3_merge -d U249.maker.output/U249_master_datastore_index.log \
-o U249-annotations.gff