GATK & GATK best practices notes

GATK & GATK best practices notesUsage0.PipelineIndexVariantDiscoveryGermlineSomaticDatapre-processingSingle-sampleSingle-sampleShortvariants:SNPsandIndelsSingle-sample&JointCallingTumor-Normal&Tumor-OnlyCopyNumberVariants(CNVs)MultisampleT

大家好,欢迎来到IT知识分享网。

0. Pipeline Index

Variant Discovery Germline Somatic
Data pre-processing Single-sample Single-sample
Short variants: SNPs and Indels Single-sample & Joint Calling Tumor-Normal & Tumor-Only
Copy Number Variants (CNVs) Multisample Tumor-Normal & Tumor-Only
Structural Variants (SVs)

Special use cases: Metagenomic analysis (PathSeq), Mitochondrial short variants, Liquid blood biopsy

Pipeline Index

1. Data pre-processing

1) 目的

Raw sequence data (FASTQ/uBAM) -> Analysis Ready BAM.

2) 步骤

  • Map to Reference
    使用BWA把读对比对到参考基因组。由于比对算法单独处理每个读对,因而可以通过并行提高效率。
java -jar picard.jar SamToFastq \ 
        INPUT=$ubam  \                    #ubam
        FASTQ=/dev/stdout  \
        INTERLEAVE=true  \
        NON_PF=true  | \
bwa mem -K 100000000 -p -v 3 -t 16 -Y $ref /dev/stdin - | \ #ref idx?
java -jar picard.jar MergeBamAlignment \
     VALIDATION_STRINGENCY=SILENT \
     EXPECTED_ORIENTATIONS=FR \
     ATTRIBUTES_TO_RETAIN=X0 \
     ATTRIBUTES_TO_REMOVE=NM \
     ATTRIBUTES_TO_REMOVE=MD \
     ALIGNED_BAM=/dev/stdin \
     UNMAPPED_BAM=$ubam \   #ubam
     OUTPUT=$out.bam \      #output bam
     REFERENCE_SEQUENCE=$reff \   #ref fa
     PAIRED_RUN=true \
     SORT_ORDER="unsorted" \
     IS_BISULFITE_SEQUENCE=false \
     ALIGNED_READS_ONLY=false \
     CLIP_ADAPTERS=false \
     MAX_RECORDS_IN_RAM=2000000 \
     ADD_MATE_CIGAR=true \
     MAX_INSERTIONS_OR_DELETIONS=-1 \
     PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
     PROGRAM_RECORD_ID="bwamem" \
     PROGRAM_GROUP_VERSION="${bwa_version}" \   #version
     PROGRAM_GROUP_COMMAND_LINE="bwa mem -K 100000000 -p -v 3 -t 16 -Y ${ref}" \
     PROGRAM_GROUP_NAME="bwamem" \
     UNMAPPED_READ_STRATEGY=COPY_TO_TAG \
     ALIGNER_PROPER_PAIR_FLAGS=true \
     UNMAP_CONTAMINANT_READS=true \
     ADD_PG_TAG_TO_READS=false
  • Mark Duplicates
    Picard 将5’ 端比对位置相同的读对判定为重复。除质量分数最高的读对外,其余会被标记为重复。
     patterned vs. random
    由于需要在大量读对之间进行比较,导致标记重复成为流程的一个性能瓶颈。MarkDuplicatesSpark通过Apache Spark完成并行计算,以优化性能。
    另外,也可以使用单线程工具MarkDuplicates标记重复,之后按照坐标顺序对bam排序,排序可以使用SortSam完成。
java -jar picard.jar \
       MarkDuplicates \
       INPUT=$out.bam \   #in bams
       OUTPUT=$out.mr.bam \   #out bams
       METRICS_FILE=${metrics_filename} \   #metrics
       VALIDATION_STRINGENCY=SILENT \
       OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
       ASSUME_SORT_ORDER="queryname" \
       CREATE_MD5_FILE=true
java -jar picard.jar \
    SortSam \
    INPUT=$out.mr.bam \   #in bam
    OUTPUT=/dev/stdout \
    SORT_ORDER="coordinate" \
    CREATE_INDEX=false \
    CREATE_MD5_FILE=false \
    MAX_RECORDS_IN_RAM=300000 | \
java -jar picard.jar \
    SetNmAndUqTags \
    INPUT=/dev/stdin \
    OUTPUT=$out.mr.sort.bam \   #out bam
    CREATE_INDEX=true \
    CREATE_MD5_FILE=true \
    REFERENCE_SEQUENCE=$reff
  • Base Quality Score Recalibration
    碱基质量分数是鉴定变异的重要依据,但可能受到建库、芯片和仪器等诸多因素的影响。BaseRecalibrator, Apply Recalibration等通过机器学习方法来检测和校正碱基质量分数的系统误差。首先获取数据集中碱基判定的协方差,然后使用协方差建立模型,最后基于模型进行质量校正。
gatk BaseRecalibrator \
    -R $reff \
    -I $out.mr.sort.bam \   #in bam
    --use-original-qualities \
    -O ${recalibration_report_filename} \
    --known-sites ${dbSNP_vcf} \
    --known-sites ${sep=” —known-sites “ known_indels_sites_VCFs}
gatk GatherBQSRReports \
   -I ${ 
   sep=' -I ' input_bqsr_reports} \
   -O ${output_report_filename}
gatk ApplyBQSR \
    -R $reff \
    -I $out.mr.sort.bam \   #in bam
    -O ${output_bam_basename}.bam \
    -bqsr ${recalibration_report} \
    --static-quantized-quals 10 --static-quantized-quals 20 --static-quantized-quals 30 \
    --add-output-sam-program-record \
    --create-output-bam-md5 \
    --use-original-qualities

3) pipeline

###########################
# way1: step-by-step
###########################
# Map to Reference
# Mark Duplicates
# Base Quality Score Recalibration

###########################
# way2: wdl
###########################
java -jar womtools.jar inputs processing-for-variant-discovery-gatk4.wdl > processing_inputs.json
java -jar Cromwell.jar run processing-for-variant-discovery-gatk4.wdl —inputs processing_inputs.json

Data pre-processing for variant discovery
GATK4最佳实践-数据预处理篇

2. Germline SNP/INDEL

1) 目的

Analysis ready BAM -> GVCF

2) 步骤

  • HaplotyperCaller
gatk --java-options "-Xmx6G -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \
    HaplotypeCaller \
    -R ${ref_fasta} \
    -I ${input_bam} \
    -L ${interval_list} \
    -O ${output_filename} \
    -contamination 0 -ERC GVCF
  • Merge
gatk --java-options -Xmx2G  \
    MergeVcfs \
    --INPUT ${ 
   sep=' --INPUT ' input_vcfs} \
    --OUTPUT ${output_filename}

3) pipeline

Reference Implementations

Pipeline Summary
Prod* germline short variant per-sample calling uBAM->GVCF
Prod* germline short variant joint genotyping GVCFs->cohort VCF
$5 Genome Analysis Pipeline uBAM->GVCF/cohort VCF
Generic germline short variant per-sample calling analysis-ready BAM->GVCF
Generic germline short variant joint genotyping GVCFs->cohort VCF
Intel germline short variant per-sample calling uBAM->GVCF

3. Somatic SNP/INDEL

About GATK and GATK best practices

GATK

GATK(GenomeAnalysisToolkit)是一款侧重于从高通量测序数据中找变异(Variant Discovery)的命令行工具集。
Tool Documentation Index

GATK最佳实践

GATK最佳实践(GATK best practice)是Broad研究所使用的reads-to-variants的流程集。纵向上,最佳实践包括预处理(Data Pre-processing)、找变异(Variant Discovery)和处理后(非必须,部分应用的最佳实践需要该步骤)三个步骤。横向上,最佳实践因应用目的(①目标变异;②实验设计,如WGS, WES, panel)的不同而不同。最佳实现的呈现形式包括文档,文档中执行内容的参考实现(即用WDL编写的脚本)等。

Requirements

  • Unix-style OS
  • Java 1.8
  • (some) R: gsalib, ggplot2, reshape, gplots
  • (some) Python 2.6+; Python 3.6.2

What are the requirements for running GATK?

Download and Install

# docker
docker pull broadinstitute/gatk     # pull latest images, specify version by appending :tag
docker run -it broadinstitute/gatk  # start up the GATK container

Run GATK in a Docker container

扩展阅读

1) 测序重复

How PCR duplicates arise in next-generation sequencing
博文介绍了PCR重复的来源,并用简化的泊松模型解释了重复率。

Increased read duplication on patterned flowcells- understanding the impact of Exclusion Amplification
介绍了常见重复的种类,重点介绍了因美纳 Patterned Flow Cells 的ExAmp/Clustering重复。
2) 竞品
Sentieon使用和GATK同样的数学模型;在算法设计上提高了计算效率;使用工业级软件标准实现。
参考

Best Practices Workflows

  • About the GATK Best Practices
  • Getting started with GATK4

免责声明:本站所有文章内容,图片,视频等均是来源于用户投稿和互联网及文摘转载整编而成,不代表本站观点,不承担相关法律责任。其著作权各归其原作者或其出版社所有。如发现本站有涉嫌抄袭侵权/违法违规的内容,侵犯到您的权益,请在线联系站长,一经查实,本站将立刻删除。 本文来自网络,若有侵权,请联系删除,如若转载,请注明出处:https://yundeesoft.com/21531.html

(0)

相关推荐

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注

关注微信