Command Line Configuration

Moritz Smolka edited this page Jul 21, 2017 · 11 revisions

Note: This page describes how to define data sets in a benchmark configuration file. In most cases, we suggest using the graphical web interface instead. The web interface generates benchmark configuration files automatically; these are stored in the setups_generated directory and can be shared and used on other machines.

Defining Data Sets

This page shows examples of benchmark configuration files that define the different types of data sets supported by Teaser. Configuration files should be placed in the setups directory and have a .yaml file extension. For detailed information on data set types and evaluation, see Data Sets.

For more examples, we suggest browsing some of the default configuration files in the setups directory.

After creating your configuration file, run ./teaser.py <my_filename>.yaml to benchmark mappers on the data sets it defines.

Built-in Simulation

For a list of available simulation parameters, see Table of Simulation Parameters.

Example Configuration File

include:
  - base_teaser

teaser:
  tests:
    my_customized_ecoli_dataset: #Name of your data set
       type: simulated_teaser
       reference: E_coli.fasta
       platform: illumina
       simulator: mason
       paired: No
       read_length: 150
       mutation_rate: 0.05
       mutation_indel_frac: 0.02

       sampling:
          enable: Yes
          ratio: 0.15

       #(Optional) Title of the data set to be shown in reports
       title: Dros Test

evaluation:
   threshold: 75

Running Teaser using this benchmark configuration will cause the data set to be generated and evaluated.

Custom Simulation

To import a custom simulation, Teaser requires the read file(s) in FASTQ format and the gold standard file in SAM format. The gold standard file should contain an entry for each read; most importantly, the RNAME and POS fields of each entry must be set to the simulated source position. See Data Sets for more information on how we evaluate simulated data sets.

Teaser does not apply subsampling to custom simulations. Read files will be imported directly.
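Before importing, it can be useful to confirm that the gold standard file really covers every read. The following is a minimal sketch of such a check, not part of Teaser: the function names check_gold_standard and fastq_read_names are hypothetical, the FASTQ input is assumed to use strict four-line records, and reads are matched by name only (mate information is ignored).

```python
def fastq_read_names(fastq_lines):
    """Collect read names from FASTQ text, assuming strict 4-line records."""
    names = []
    for index, line in enumerate(fastq_lines):
        # Every fourth line, starting at the first, is a header line "@name ...".
        if index % 4 == 0 and line.startswith("@"):
            parts = line[1:].split()
            if parts:
                names.append(parts[0])
    return names


def check_gold_standard(sam_lines, read_names):
    """Return the sorted names of reads without a usable gold standard entry.

    An entry counts as usable when RNAME is not '*' and POS is a positive
    integer, i.e. the simulated source position is actually set.
    """
    placed = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM alignment record
        qname, rname, pos = fields[0], fields[2], fields[3]
        if rname != "*" and pos.isdigit() and int(pos) > 0:
            placed.add(qname)
    return sorted(set(read_names) - placed)


if __name__ == "__main__":
    # Hypothetical in-memory data; in practice, pass open file handles.
    fastq = ["@r1\n", "ACGT\n", "+\n", "IIII\n",
             "@r2\n", "ACGT\n", "+\n", "IIII\n"]
    sam = ["@HD\tVN:1.0\n",
           "r1\t0\tchr\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"]
    print(check_gold_standard(sam, fastq_read_names(fastq)))
```

Any read name reported by the check has no entry with RNAME and POS set, so its simulated source position is missing from the gold standard.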

Example Configuration File

include:
  - base_teaser

teaser:
   tests:
      #Custom simulation import example 1
      custom_se:
         type: simulated_custom
         reference: E_coli.fasta
         paired: No
         import_read_files: [/path/to/my/reads.fastq]
         import_gold_standard_file: /path/to/my/alignments.sam

      #Custom simulation import example 2
      custom_pe:
         type: simulated_custom
         reference: E_coli.fasta
         paired: Yes
         import_read_files: [/path/to/my/reads1.fastq,/path/to/my/reads2.fastq]
         import_gold_standard_file: /path/to/my/alignments.sam

Running Teaser using this benchmark configuration will cause the data sets to be imported and evaluated. The field import_read_files must be a list of the absolute paths of either one or two FASTQ files, depending on the value of the paired field, which may be either Yes or No. The field import_gold_standard_file must be set to the absolute path of the SAM file containing the simulated source positions for each read. Teaser will create copies of these files during the import process.
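The field rules above can be checked on a parsed data set entry before running Teaser. The sketch below is illustrative only and is not Teaser's own validation: validate_import_fields is a hypothetical helper, and the entry is assumed to be a plain dictionary in which Yes and No have already been parsed to booleans (as a YAML 1.1 loader does).

```python
import os


def validate_import_fields(entry):
    """Return a list of problems found in one data set entry (empty if none)."""
    problems = []
    files = entry.get("import_read_files") or []
    # One FASTQ file for single-end data, two for paired-end data.
    expected = 2 if entry.get("paired", False) else 1
    if len(files) != expected:
        problems.append(
            "expected %d read file(s), found %d" % (expected, len(files)))
    for path in files:
        if not os.path.isabs(path):
            problems.append("read file path is not absolute: %s" % path)
    # Only custom simulations carry a gold standard file.
    if entry.get("type") == "simulated_custom":
        gold = entry.get("import_gold_standard_file")
        if not gold:
            problems.append("import_gold_standard_file is missing")
        elif not os.path.isabs(gold):
            problems.append("gold standard path is not absolute: %s" % gold)
    return problems


if __name__ == "__main__":
    custom_pe = {
        "type": "simulated_custom",
        "reference": "E_coli.fasta",
        "paired": True,
        "import_read_files": ["/path/to/my/reads1.fastq",
                              "/path/to/my/reads2.fastq"],
        "import_gold_standard_file": "/path/to/my/alignments.sam",
    }
    print(validate_import_fields(custom_pe))
```

The check only inspects the configuration values; it does not test whether the files exist or are readable.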

Real Read Data

For real read data, Teaser cannot evaluate the correctness of alignments. However, the percentage of mapped reads and the performance metrics are still available and may be useful for comparison with results from a simulation. See Data Sets for more information on how we evaluate real data sets.

Example Configuration File

include:
  - base_teaser

teaser:
   tests:
      #Real data import example 1
      #Default. This will calculate the read count to sample 
      #based on estimated average read length and size of the reference.
      real_se:
         type: real
         reference: E_coli.fasta
         paired: No
         import_read_files: [/path/to/my/reads.fastq]

      #Real data import example 2
      #Sample 10000 reads
      real_se_use_custom:
         type: real
         reference: E_coli.fasta
         paired: No
         import_read_files: [/path/to/my/reads.fastq]
         read_count: 10000

      #Real data import example 3
      #Sampling disabled - all reads will be imported
      real_se_use_all:
         type: real
         reference: E_coli.fasta
         paired: No
         import_read_files: [/path/to/my/reads.fastq]
         sampling: {enable: No}

      #Real data import example 4
      #Paired-end example
      real_pe:
         type: real
         reference: E_coli.fasta
         paired: Yes
         import_read_files: [/path/to/my/reads_1.fastq,/path/to/my/reads_2.fastq]

Running Teaser using this benchmark configuration will cause the real read data sets to be imported and evaluated. Set the type field of the data set to real and the import_read_files field to a list containing the absolute paths of either one or two FASTQ files (the paired field must be set to No or Yes accordingly). By default, Teaser will automatically sample a number of reads from the input.
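The four examples above use three sampling behaviours: automatic sampling, a fixed read_count, and sampling disabled. The sketch below summarizes which one a parsed entry selects. It is an illustration, not Teaser's implementation: sampling_mode is a hypothetical helper, Yes and No are assumed to be parsed to booleans, and giving a disabled sampling block precedence over read_count is an assumption made for this example.

```python
def sampling_mode(entry):
    """Classify a real data set entry as ('all' | 'fixed' | 'auto', read count)."""
    sampling = entry.get("sampling") or {}
    # Assumption: an explicit "enable: No" wins over any read_count setting.
    if sampling.get("enable", True) is False:
        return ("all", None)
    if "read_count" in entry:
        return ("fixed", entry["read_count"])
    # Default: the read count is derived from the estimated average read
    # length and the size of the reference, as noted in example 1.
    return ("auto", None)


if __name__ == "__main__":
    print(sampling_mode({"type": "real"}))
    print(sampling_mode({"type": "real", "read_count": 10000}))
    print(sampling_mode({"type": "real", "sampling": {"enable": False}}))
```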
