Skip to content

wjyyang24/MyGenome

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 

Repository files navigation

MyGenome

Analysis for ABT480/CS485G genome assembly. The genome is for the organism Pyricularia oryzae, which is a fungus that infects monocots such as wheat, rice, ryegrass, etc. The host organism for the specimen which provided the original sequence data was Panicum sabulorum (a type of grass).

1. Sequence quality assessment and trimming

The F1 and R1 sequence datasets were analyzed using FASTQC:

ssh -Y <[email protected]>
cd MyGenome
fastqc &

Load in F1 and R1 datasets into GUI interface. Take screen shots of output files:

F1screenshot.png

Running trimmomatic

java -jar ~/trimmomatic-0.38.jar PE -threads 16 -phred33 -trimlog file.txt \
U249_1.fq.gz U249_2.fq.gz U249_1_paired.fastq U249_1_unpaired.fastq U249_2_paired.fastq U249_2_unpaired.fastq \
ILLUMINACLIP:adaptors.fasta:2:30:10 SLIDINGWINDOW:20:20 MINLEN:100

Count number of forward reads remaining

First, run head to view the headers of each read to find a suitable string for running grep

head U249_1_paired.fastq

Use the beginning of the header for each read to count the number of reads with grep

grep -c @A00261:902:HGC52DSX7: U249_1_paired.fastq

The output shows there are 12042994 forward reads remaining after trimming.

2. Genome assembly using Velvet

The genome assembly was done on the University of Kentucky Morgan Compute Cluster. Before running the assembly ensure that your forward and reverse trimmed, paired files are present in the directory, and that they are uncompressed. Then, rename your files to match the following format (.fastq is ok too): genomeID_1_paired.fq; genomeID_2_paired.fq (make sure you use mv and not cp as it is critical the directory only contain two files with a paired.fq suffix).

Velvet was run as a SLURM script as follows:

sbatch velvetoptimiser_noclean.sh U249 61 131 10

This command submits the script containing the velvet command with arguments corresponding to the location of the script, the genome name, starting k-mer value, ending k-mer value, and step size as a SLURM job into the supercomputer job queue. Progress can be monitored in the SLURM log file (cat slurm-jobID.out).

The SLURM script was then rerun a second time with a narrower k-mer range and step size of 2 to reach the best possible assembly:

sbatch velvetoptimiser_noclean.sh U249 91 131 2

The optimized assembly can be found in the newly generated velvet_U249_91_131_2_noclean directory (name may vary) as contigs.fa. The assembly was then renamed to U249.fasta.

Rename sequence headers to a standard format

The assembly headers were renamed to a more standard format using the SimpleFastaHeaders.pl perl script:

perl SimpleFastaHeaders.pl U249.fasta U249

The arguments correspond to the location of the assembly and the genome ID to be used in the new headers. This produces a new FASTA file U249_nh.fasta.

Optimized assembly results

Full log file here: velvet_optimized_logfile.txt

velvet_optimized_results.png

3. BLASTing assembled genome against mitochondrial genome and the B71v2sh reference genome

In preparation for submitting the assembled genome to NCBI, contigs that comprise mostly mitochondrial sequences must be identified. NCBI also does not accept contigs with less than 200 base pairs, so these were removed from the assembly using the CullShortContigs.pl perl script:

perl CullShortContigs.pl U249_nh.fasta

(Note: this script produces a new FASTA file U249_final.fasta. In later steps this new file was used in place of U249.fasta, even though this is not reflected in the example commands)

To determine which contigs in the resulting assembly correspond to the mitochondrial genome, run the following command:

blastn -query MoMitochondrion.fasta -subject U249_nh.fasta -evalue 1e-50 -max_target_seqs 20000 -outfmt '6 qseqid sseqid slen length qstart qend sstart send btop' > B71v2sh.U249.BLAST

Details on the output format can be read at http://www.metagenomics.wiki/tools/blast/blastn-output-format-6. The results were then outputted to U249_mitochondrion.csv using the following command:

awk '$3/$4 > 0.9 {print $2 ",mitochondrion"}' B71v2sh.U249.BLAST > U249_mitochondrion.csv

BLAST genome assembly against a repeat-masked version of the B71 reference genome

Run the following command to perform a BLAST search against the B71 reference genome (not provided):

blastn -query B71v2sh_masked.fasta -subject U249.fasta -evalue 1e-50 -max_target_seqs 20000 -outfmt '6 qseqid sseqid qstart qend sstart send btop' -out B71v2sh.U249.BLAST

Then, create a directory named U249_BLAST and move your BLAST output file into that directory.

Identify genetic variants between the B71v2sh genome and the genome assembly

Run the CallVariants.sh SLURM script:

sbatch CallVariants.sh U249_BLAST/

A summary of the total number of variants (SNPs) identified is in the SNP_counts_out_Apr_9_14-32-41_2024 file. Information on each SNP is present in the B71v2sh_v_U249_out file.

4. Assess genome completeness using BUSCO

Genome completeness was analyzed using BUSCO, which uses BLAST to search for a set of single-copy genes that are known to have orthologs in genomes of related species. For this analysis, the odb10_ascomycota (a fungal phylum) ortholog database was used.

BUSCO was ran as a SLURM script as follows:

sbatch BuscoSingularity.sh U249.fasta

Results

busco_results.png

5. Gene prediction

Gene prediction was first done using the standalone gene finders SNAP and AUGUSTUS, which both utilize hidden Markov models (HHMs) to search for regions in the genome which have a high likelihood of being genes. For the final gene prediction, MAKER was utilized. MAKER utilizes results from SNAP, AUGUSTUS, and other gene finding programs, as well as EST and protein sequences to create a more accurate gene prediction which takes both predictions and evidences into account.

Note: the following sections were run using the screen command

SNAP

The first step is generating the training data. The data used for this were the gene annotations from the reference genome of the related B71 strain. The annotations were in GFF format, but they need to be converted to ZFF format for SNAP using the following command:

maker2zff B71Ref2.gff3

This command generates the files genome.ann (ZFF format), containing the positions of the exons and genes in each contig, and genome.dna (FASTA format), containing the contig sequences. To extract the gene sequences from these files, we can run fathom (a sequence analysis and extraction tool included in SNAP):

fathom genome.ann genome.dna -categorize 1000

The categorize subcommand creates pairs of ZFF and FASTA corresponding to various regions such as those with errors, overlapping genes, etc. It takes as a argument the max number of base pairs of intergenic sequence on both sides of each gene to train the HMM on sequences frequently found near genes.

Next, the genome, transcript, and protein sequences need to be extracted using the export subcommand of fathom. The plus subcommand will also be used to flip genes that are on the reverse strand. For this step, we will be using the uni.ann and uni.dna files, which contain unique, non-overlapping genes without alternative splicing, as this is what performs best for SNAP training:

fathom uni.ann uni.dna -export 1000 -plus

This command outputs four files: export.ann, containing ZFF annotations of the exons of each gene, export.dna, containing the DNA sequence of each genes, including introns and flanking regions, export.tx, containing the DNA sequence for each transcript, and export.aa, containing the protein sequence of each gene.

Now, to train the HMM, we will use the forge command included in SNAP:

forge export.ann export.dna

This command outputs a large number of files, so we use the hmm-assembler.pl script included with SNAP to consolidate these into one file:

hmm-assembler.pl Moryzae . > Moryzae.hmm

The first argument is the name to be used in the header for identification. The second argument is the directory containing the files to be consolidated. We then redirect the output of the script into a new file, Moryzae.hmm.

Running SNAP

snap-hmm Moryzae.hmm U249.fasta > U249-snap.zff

Full results in U249-snap.zff. Summary of results can be obtained using the following command:

fathom U249-snap.zff U249.fasta -gene-stats

snap_gene_stats.png

AUGUSTUS

AUGUSTUS uses a different HMM implementation than SNAP, so results may vary from the SNAP results. We do not need to train AUGUSTUS since it already includes a trained model for the species Magnaporthe grisea, which is very closely related to Pyricularia Oryzae. Run AUGUSTUS using the following command:

augustus --species=magnaporthe_grisea --gff3=on \
--singlestrand=true --progress=true \
U249.fasta > U249-augustus.gff3

The first argument is the parameter file to be used. The next argument changes the output format to GFF3. The next argument tells AUGUSTUS to predict genes seperately on each strand. The next argument shows the progress while AUGUSTUS is running. The last argument is the path to the genome sequence FASTA file. Finally, the output is piped to a new file, U249-augustus.gff3.

MAKER

Since MAKER has a very large number of options, we used a configuration file to specify our desired behavior. MAKER configuration files must first be generated using the command maker -CTL. The generated config files can then be modified to the user's liking. The config file we used can be found in the maker_opts.ctl file. This file was modified to include the paths to the genome, SNAP HMM trained in the earlier step, as well as an additional Magnaporthe protein sequence file obtained from NCBI. Options for the model organism to be used for AUGUSTUS as well as other options were also modified. With the config file ready, we can run maker as such:

maker 2>&1 | tee maker.log

Note that we use the tee command to redirect the output to a log file as well as printing the output to the terminal.

Finally, we merge all the results into one GFF file, since MAKER seperates results for each contig into its own directory:

gff3_merge -d U249.maker.output/U249_master_datastore_index.log \
-o U249-annotations.gff

Results

Log: U249_master_datastore_index.log

GFF: U249-annotations.gff.tar.gz

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published