This repository contains the Imputation Server 2 workflow to facilitate genotype imputation at scale. It serves as the underlying workflow of the Michigan Imputation Server.
Das S*, Forer L*, Schönherr S*, Sidore C, Locke AE, Kwong A, Vrieze S, Chew EY, Levy S, McGue M, Schlessinger D, Stambolian D, Loh PR, Iacono WG, Swaroop A, Scott LJ, Cucca F, Kronenberg F, Boehnke M, Abecasis GR, Fuchsberger C. Next-generation genotype imputation service and methods. Nature Genetics 48, 1284–1287 (2016). *Shared first authors
imputationserver2 is MIT Licensed and was developed at the Institute of Genetic Epidemiology, Medical University of Innsbruck, Austria.
If you have any questions about imputationserver2 please contact
If you encounter any problems, feel free to open an issue here.
Version 2.0.3 - Version 2.0.6 - Fix QC issues and remove HTSJDK index creation for input validation and QC.
Version 2.0.2 - Set minimac4 tmp directory (required for larger sample sizes).
Version 2.0.1 - Provide statistics to users in case QC failed; check normalized multiallelic variants in reference panel.
Version 2.0.0 - First stable release; migration of the imputation workflow to Nextflow.
The pipeline provides small test data to verify installation:

```
nextflow run main.nf -c conf/test_single_vcf.config
```

Example `job.config`:

```
params {
    project = "my-test-project"
    build = "hg19"
    files = "tests/data/input/three/*.vcf.gz"
    allele_frequency_population = "eur"
    mode = "imputation"
    refpanel_yaml = "tests/hapmap-2/2.0.0/imputation-hapmap2.yaml"
    output = "output"
}
```

Run the pipeline with the `job.config` configuration:
```
nextflow run main.nf -c job.config
```

| Parameter | Default Value | Description |
|---|---|---|
| `project` | `null` | Project name |
| `project_date` | `date` | Project date |
| `files` | `null` | List of input files |
| `allele_frequency_population` | `null` | Allele Frequency Population information |
| `refpanel_yaml` | `null` | Reference panel YAML file |
| `mode` | `imputation` | Processing mode (e.g., `imputation` or `qc-only`) |
| `chunksize` | `20000000` | Chunk size for processing |
| `min_samples` | `20` | Minimum number of samples needed |
| `max_samples` | `50000` | Maximum number of samples allowed |
| `merge_samples` | `true` | Execute compression and encryption workflow |
| `password` | `null` | Password for encryption |
| `send_mail` | `false` | Enable or disable email notifications |
| `service.name` | `Imputation Server 2` | Service name |
| `service.email` | `null` | Service email |
| `service.url` | `null` | Service URL |
| `user.name` | `null` | User's name |
| `user.email` | `null` | User's email |
| `phasing.engine` | `eagle` | Phasing method (e.g., `eagle` or `beagle`) |
| `phasing.window` | `5000000` | Phasing window size |
| `imputation.enabled` | `true` | Enable or disable imputation |
| `imputation.window` | `500000` | Imputation window size |
| `imputation.minimac_min_ratio` | `0.00001` | Minimac minimum ratio |
| `imputation.min_r2` | `0` | R2 filter value |
| `imputation.meta` | `false` | Enable or disable empirical output creation |
| `imputation.md5` | `false` | Enable or disable md5 sum creation for results |
| `imputation.create_index` | `false` | Enable or disable index creation for imputed files |
| `imputation.decay` | `0` | Set minimac decay |
| `encryption.enabled` | `true` | Enable or disable encryption |
| `encryption.aes` | `false` | Enable or disable AES method for encryption |
| `ancestry.enabled` | `false` | Enable or disable ancestry analysis |
| `ancestry.dim` | `10` | Ancestry analysis dimension |
| `ancestry.dim_high` | `20` | High dimension for ancestry analysis |
| `ancestry.batch_size` | `50` | Batch size for ancestry analysis |
| `ancestry.reference` | `null` | Ancestry reference data |
| `ancestry.max_pcs` | `8` | Maximum principal components for ancestry |
| `ancestry.k` | `10` | K value for ancestry analysis |
| `ancestry.threshold` | `0.75` | Ancestry threshold |
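
To make the `chunksize` parameter more concrete, the following Python sketch splits a position range into consecutive chunks of at most `chunksize` base pairs. It is an illustration only: the function name `chunk_ranges` and the 45 Mb example length are hypothetical, and the pipeline's actual chunk boundaries may be computed differently (for example, aligned to multiples of the chunk size).

```python
def chunk_ranges(start, end, chunksize=20_000_000):
    """Split the inclusive range [start, end] into consecutive chunks.

    Hypothetical helper for illustration; not taken from the pipeline.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    ranges = []
    pos = start
    while pos <= end:
        # Each chunk covers at most `chunksize` positions.
        ranges.append((pos, min(pos + chunksize - 1, end)))
        pos += chunksize
    return ranges


# Hypothetical region of 45 Mb, split with the default chunk size.
print(chunk_ranges(1, 45_000_000))
# [(1, 20000000), (20000001, 40000000), (40000001, 45000000)]
```

A smaller `chunksize` yields more, shorter chunks, which typically means more but smaller tasks.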
This document describes the structure of a YAML file used to configure a reference panel for Imputation Servers. Reference panels are essential for genotype imputation, allowing the server to infer missing genotype data accurately.
| Field | Description |
|---|---|
| `name` | The name of the reference panel. |
| `description` | A brief description of the reference panel. |
| `version` | The version of the reference panel. |
| `website` | The website where more information about the panel can be found. |
| `category` | The category to which the reference panel belongs. Currently has to be `RefPanel`. |
| `properties` | A section containing specific properties of the reference panel. |

The properties section contains the following key-value pairs:
| Property | Description | Required |
|---|---|---|
| `id` | An identifier for the reference panel. | yes |
| `genotypes` | The location of the genotype files for the reference panel data. | yes |
| `sites` | The location of the site files for the reference panel data. | yes |
| `mapEagle` | The location of the genetic map file used for phasing with Eagle. | yes |
| `refEagle` | The location of the BCF file for the reference panel data for Eagle. | yes |
| `mapBeagle` | The location of the genetic map file used for phasing with Beagle. | no |
| `refBeagle` | The location of the BCF file for the reference panel data for Beagle. | no |
| `build` | The genome build version used for the reference panel (e.g., hg19 or hg38). | yes |
| `range` | Specify a range that is used for imputation (e.g. HLA). | no |
| `mapMinimac` | The location of the map file for Minimac. | no |
| `populations` | A dictionary mapping population identifiers to their names. | yes |
| `qcFilter` | A dictionary mapping quality filters to their values. | no |
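
The "Required" column above can be turned into a quick pre-flight check. The Python sketch below assumes the `properties` section has already been parsed into a dictionary (for example with a YAML library); the helper name `missing_properties` is hypothetical and not part of the pipeline.

```python
# Properties marked "yes" in the Required column of the table above.
REQUIRED_PROPERTIES = (
    "id", "genotypes", "sites", "mapEagle", "refEagle", "build", "populations",
)


def missing_properties(properties):
    """Return the required property names absent from a parsed `properties` dict."""
    return [key for key in REQUIRED_PROPERTIES if key not in properties]


# Deliberately incomplete example to show the check.
incomplete = {"id": "hapmap-2", "build": "hg19"}
print(missing_properties(incomplete))
# ['genotypes', 'sites', 'mapEagle', 'refEagle', 'populations']
```

An empty list means all required properties are present; it says nothing about whether the referenced files exist.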
The populations section contains a dictionary mapping population identifiers to their names and sample size. This mapping helps categorize and label the populations represented in the reference panel.
| Field | Description |
|---|---|
| `id` | The id of the population (e.g. `eur`) |
| `name` | The label of the population (e.g. `EUR`) |
| `samples` | Number of samples in the reference panel |

Note: the population id has to be the same as in the sites files.
| Filter | Description | Default |
|---|---|---|
| `overlap` | Minimal overlap between GWAS data and reference panel | 0.5 |
| `minSnps` | Minimal number of SNPs per chunk | 3 |
| `sampleCallrate` | Minimal sample call rate | 0.5 |
| `mixedGenotypeschrX` | - | 0.1 |
| `strandFlips` | Maximal allowed strand flips | 100 |

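
One way to read the defaults above: values given in a panel's `qcFilter` section replace the defaults, and everything else keeps its default. The Python sketch below illustrates that reading only. The merge behaviour and the helper name `effective_qc_filter` are assumptions, not taken from the pipeline source.

```python
# Defaults copied from the table above.
QC_DEFAULTS = {
    "overlap": 0.5,
    "minSnps": 3,
    "sampleCallrate": 0.5,
    "mixedGenotypeschrX": 0.1,
    "strandFlips": 100,
}


def effective_qc_filter(overrides=None):
    """Return the defaults with any panel-specific values layered on top."""
    return {**QC_DEFAULTS, **(overrides or {})}


# Hypothetical panel that only tightens the sample call rate.
print(effective_qc_filter({"sampleCallrate": 0.9}))
# {'overlap': 0.5, 'minSnps': 3, 'sampleCallrate': 0.9, 'mixedGenotypeschrX': 0.1, 'strandFlips': 100}
```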
Here's an example YAML configuration for a reference panel. This configuration describes a reference panel named "HapMap 2" for an Imputation Server, including details about its version, data sources, and represented populations. The files are stored in subdirectories of the application and can be consumed by the pipeline from there.
```yaml
name: HapMap 2 (GRCh37/hg19)
description: HapMap2 Reference Panel for Michigan Imputation Server
version: 2.0.0
website: http://imputationserver.sph.umich.edu
category: RefPanel
id: hapmap-2

properties:
  id: hapmap-2
  genotypes: ${CLOUDGENE_APP_LOCATION}/msavs/hapmap_r22.chr$chr.CEU.hg19.recode.msav
  sites: ${CLOUDGENE_APP_LOCATION}/sites/hapmap_r22.chr$chr.CEU.hg19_impute.sites.gz
  mapEagle: ${CLOUDGENE_APP_LOCATION}/map/genetic_map_hg19_withX.txt.gz
  refEagle: ${CLOUDGENE_APP_LOCATION}/bcfs/hapmap_r22.chr$chr.CEU.hg19.recode.bcf
  build: hg19
  qcFilter:
    alleleSwitches: 100
  populations:
    - id: eur
      name: EUR
      samples: 60
    - id: "off"
      name: Off
      samples: -1
```

A full example of a reference panel, including all data and the cloudgene.yaml, can be downloaded here.
In the example YAML configuration provided, you may have noticed the presence of the $chr variable in some URLs. This variable is a placeholder for the chromosome number and will be replaced by the Nextflow pipeline.
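
The effect of the `$chr` placeholder can be reproduced with Python's standard `string.Template`. This is only a sketch of the substitution idea: the pipeline performs the replacement itself, and the helper name `resolve_chromosome` is hypothetical. `safe_substitute` is used so that other placeholders such as `${CLOUDGENE_APP_LOCATION}` are left untouched.

```python
from string import Template


def resolve_chromosome(path_template, chromosome):
    """Replace $chr in a path template, leaving unknown placeholders as they are."""
    return Template(path_template).safe_substitute(chr=chromosome)


genotypes = "${CLOUDGENE_APP_LOCATION}/msavs/hapmap_r22.chr$chr.CEU.hg19.recode.msav"
print(resolve_chromosome(genotypes, "20"))
# ${CLOUDGENE_APP_LOCATION}/msavs/hapmap_r22.chr20.CEU.hg19.recode.msav
```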
A site file is a tab-delimited file consisting of 9 columns: ID, CHROM, POS, REF, ALT, AAF_EUR, AAF_ALL, MAF_EUR, and MAF_ALL. The first five columns (ID, CHROM, POS, REF, and ALT) are required, while the Allele Frequency (AAF) and Minor Allele Frequency (MAF) columns are optional.
The optional AAF and MAF columns provide allele frequency information for different populations supported by the reference panel. Specifically, AAF_EUR and MAF_EUR represent allele frequencies for the European population, while AAF_ALL and MAF_ALL represent allele frequencies for all populations combined.
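
The column layout described above can be checked with a few lines of Python. The sketch assumes the site file starts with a header row naming the columns; the rows shown are made up for illustration, and `read_sites` is a hypothetical helper. Real site files are gzip-compressed, so they would be opened with `gzip.open(path, "rt")` instead of the in-memory text used here.

```python
import csv
import io

REQUIRED_COLUMNS = ["ID", "CHROM", "POS", "REF", "ALT"]


def read_sites(handle):
    """Read a tab-delimited site file and verify the five required columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return list(reader)


# Made-up content: required columns plus one optional frequency column.
sample = io.StringIO(
    "ID\tCHROM\tPOS\tREF\tALT\tAAF_EUR\n"
    "20:100:A:G\t20\t100\tA\tG\t0.25\n"
)
rows = read_sites(sample)
print(rows[0]["POS"], rows[0]["AAF_EUR"])
# 100 0.25
```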
- Install Nextflow
- Docker or Singularity
- Java 14
- Install cloudgene3:
  ```
  curl -fsSL https://get.cloudgene.io | bash
  ```
- Install the imputationserver2 app:
  ```
  ./cloudgene install genepi/imputationserver2@latest
  ```
- Install the hapmap2 reference panel:
  ```
  ./cloudgene install https://imputationserver.sph.umich.edu/resources/ref-panels/imputationserver2-hapmap2.zip
  ```
- Start the cloudgene server:
  ```
  ./cloudgene server
  ```
- Open http://localhost:8082
- Login with default admin account: username `admin` and password `admin1978`
- Imputation can be tested with the following test file
The default configuration runs with Docker and uses Nextflow's local executor.
Configure via web interface (Applications -> imputationserver -> Settings) or adapt/create file apps/imputationserver/nextflow.config and add the following:
```
process {
    executor = 'slurm'
    queue = 'QueueName' // replace with your queue name

    // process directives: retry tasks that exit with status 143, terminate on any other error
    errorStrategy = { task.exitStatus == 143 ? 'retry' : 'terminate' }
    maxErrors = '-1'
    maxRetries = 3
}
```

See more about SLURM in the Nextflow Documentation.
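
The retry rule in the SLURM configuration above can be read as a small decision function. Exit status 143 corresponds to a process ended by SIGTERM (128 + 15), which is commonly what happens when a scheduler stops a job. The Python sketch below is a simplified model of that rule, not Nextflow's implementation; the function name `next_action` and its `retries_so_far` parameter are hypothetical.

```python
def next_action(exit_status, retries_so_far, max_retries=3):
    """Simplified model: retry on exit status 143 until max_retries is used up."""
    if exit_status == 143 and retries_so_far < max_retries:
        return "retry"
    return "terminate"


print(next_action(143, 0))  # retry
print(next_action(143, 3))  # terminate
print(next_action(1, 0))    # terminate
```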
- Create an AWS Batch queue and IAM role (see Nextflow Documentation)
- Configure via web interface (Applications -> imputationserver -> Settings) or adapt/create file `apps/imputationserver/nextflow.config` and add the following:
  ```
  aws {
      region = 'eu-central-1'
      client {
          uploadChunkSize = 10485760
      }
      batch {
          cliPath = '/home/ec2-user/miniconda/bin/aws'
          executionRole = 'arn:aws:iam::***' // replace with your IAM role
      }
  }

  process {
      executor = 'awsbatch'
      queue = 'QueueName' // replace with your queue name
      scratch = false
  }
  ```
- Go to Settings -> General, set Workspace to "S3", and enter the location of a subfolder in an S3 bucket. Currently, it must be a subfolder; a bucket won't work (example: `s3://cloudgene/workspace`).
Optionally, add Wave and Fusion support to improve performance:

```
wave {
    enabled = true
    endpoint = 'https://wave.seqera.io'
}

fusion {
    enabled = true
}
```

- Configure mail server in Settings -> General -> Mail
- Configure Nextflow to use Cloudgene's mail settings by adding the following to the global configuration (Settings -> General -> Nextflow) or adapt/create the file `config/nextflow.config` (see Nextflow Documentation for all available mail settings):
  ```
  mail {
      smtp.host = "${CLOUDGENE_SMTP_HOST}"
      smtp.port = "${CLOUDGENE_SMTP_PORT}"
      smtp.user = "${CLOUDGENE_SMTP_USER}"
      smtp.password = "${CLOUDGENE_SMTP_PASSWORD}"
      smtp.auth = true
      smtp.starttls.enable = true
      smtp.ssl.protocols = 'TLSv1.2'
  }
  ```
- Add `params.config.send_mail = true` to the application-specific configuration to activate mail notifications in the imputationserver2 pipeline
Parameters can be changed in the nextflow.config file of the application. Example:
```
params.chunksize = 500_000
params.imputation.window = 100_000
```

Build a local Docker image:

```
docker build --platform linux/amd64 -t statgen/imputationserver2:latest .
```

Run the tests:

```
nf-test test
```

- Build a local Docker image (as described above).
- Test the image with `nf-test` (as described above).
- Bump version according to `<genepi-version + 0.0.1>-statgen.<sequential-count>`
  - Update version string in all its locations:
    - `nextflow.config`
    - `cloudgene.yaml`
    - `cloudgene.hla.yaml`
    - `cloudgene.pgs.yaml`
  - Create commit bumping version:
    ```
    git add ...
    git commit -m 'Bump version to <version>'
    ```
  - Tag the commit (add info about changes in commit editor):
    ```
    git tag -a <version>
    ```
  - Push commit and tag to GitHub:
    ```
    git push
    git push --tags
    ```
- Tag the latest Docker image with the new version.
  ```
  docker image tag statgen/imputationserver2:latest statgen/imputationserver2:<version>
  ```
- Push the Docker tag to ECR.
  - Look up the image ID in Docker:
    ```
    docker images
    ```
  - Tag the image as an ECR resource:
    ```
    docker tag <image-id> public.ecr.aws/<ecr-public-hash>/<ecr-repo-name>:<version>
    ```
  - Push to ECR:
    ```
    docker push public.ecr.aws/<ecr-public-hash>/<ecr-repo-name>:<version>
    ```
    If your Docker instance is not logged in to AWS ECR, you might need to run the following:
    ```
    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com
    ```
    See the official AWS documentation for more details.
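
The version scheme `<genepi-version + 0.0.1>-statgen.<sequential-count>` from the release steps above can be computed mechanically. The Python sketch below assumes that "+ 0.0.1" means incrementing the patch component of a three-part upstream version; the helper name `statgen_version` is hypothetical.

```python
def statgen_version(genepi_version, sequential_count):
    """Build '<major>.<minor>.<patch + 1>-statgen.<count>' from an upstream version."""
    major, minor, patch = (int(part) for part in genepi_version.split("."))
    return f"{major}.{minor}.{patch + 1}-statgen.{sequential_count}"


# Upstream version 2.0.6, first statgen release on top of it.
print(statgen_version("2.0.6", 1))
# 2.0.7-statgen.1
```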