大家好,欢迎来到IT知识分享网。
目录
0. Pipeline Index
Variant Discovery | Germline | Somatic |
---|---|---|
Data pre-processing | Single-sample | Single-sample |
Short variants: SNPs and Indels | Single-sample & Joint Calling | Tumor-Normal & Tumor-Only |
Copy Number Variants (CNVs) | Multisample | Tumor-Normal & Tumor-Only |
Structural Variants (SVs) |
Special use cases: Metagenomic analysis (PathSeq), Mitochondrial short variants, Liquid blood biopsy
1. Data pre-processing
1) 目的
Raw sequence data (FASTQ/uBAM) -> Analysis Ready BAM.
2) 步骤
- Map to Reference
使用BWA把读对比对到参考基因组。由于比对算法单独处理每个读对,因而可以通过并行提高效率。
java -jar picard.jar SamToFastq \
INPUT=$ubam \ #ubam
FASTQ=/dev/stdout \
INTERLEAVE=true \
NON_PF=true | \
bwa mem -K 100000000 -p -v 3 -t 16 -Y $ref /dev/stdin - | \ #ref idx?
java -jar picard.jar MergeBamAlignment \
VALIDATION_STRINGENCY=SILENT \
EXPECTED_ORIENTATIONS=FR \
ATTRIBUTES_TO_RETAIN=X0 \
ATTRIBUTES_TO_REMOVE=NM \
ATTRIBUTES_TO_REMOVE=MD \
ALIGNED_BAM=/dev/stdin \
UNMAPPED_BAM=$ubam \ #ubam
OUTPUT=$out.bam \ #output bam
REFERENCE_SEQUENCE=$reff \ #ref fa
PAIRED_RUN=true \
SORT_ORDER="unsorted" \
IS_BISULFITE_SEQUENCE=false \
ALIGNED_READS_ONLY=false \
CLIP_ADAPTERS=false \
MAX_RECORDS_IN_RAM=2000000 \
ADD_MATE_CIGAR=true \
MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
PROGRAM_RECORD_ID="bwamem" \
PROGRAM_GROUP_VERSION="${bwa_version}" \ #version
PROGRAM_GROUP_COMMAND_LINE="bwa mem -K 100000000 -p -v 3 -t 16 -Y ${ref}" \
PROGRAM_GROUP_NAME="bwamem" \
UNMAPPED_READ_STRATEGY=COPY_TO_TAG \
ALIGNER_PROPER_PAIR_FLAGS=true \
UNMAP_CONTAMINANT_READS=true \
ADD_PG_TAG_TO_READS=false
- Mark Duplicates
Picard 将5’ 端比对位置相同的读对判定为重复。除质量分数最高的读对外,其余会被标记为重复。
由于需要在大量读对之间进行比较,导致标记重复成为流程的一个性能瓶颈。MarkDuplicatesSpark通过Apache Spark完成并行计算,以优化性能。
另外,也可以使用单线程工具MarkDuplicates标记重复,之后按照坐标顺序对bam排序,排序可以使用SortSam完成。
java -jar picard.jar \
MarkDuplicates \
INPUT=$out.bam \ #in bams
OUTPUT=$out.mr.bam \ #out bams
METRICS_FILE=${metrics_filename} \ #metrics
VALIDATION_STRINGENCY=SILENT \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
ASSUME_SORT_ORDER="queryname" \
CREATE_MD5_FILE=true
java -jar picard.jar \
SortSam \
INPUT=$out.mr.bam \ #in bam
OUTPUT=/dev/stdout \
SORT_ORDER="coordinate" \
CREATE_INDEX=false \
CREATE_MD5_FILE=false \
MAX_RECORDS_IN_RAM=300000 | \
java -jar picard.jar \
SetNmAndUqTags \
INPUT=/dev/stdin \
OUTPUT=$out.mr.sort.bam \ #out bam
CREATE_INDEX=true \
CREATE_MD5_FILE=true \
REFERENCE_SEQUENCE=$reff
- Base Quality Score Recalibration
碱基质量分数是鉴定变异的重要依据,但可能受到建库、芯片和仪器等诸多因素的影响。BaseRecalibrator, Apply Recalibration等通过机器学习方法来检测和校正碱基质量分数的系统误差。首先获取数据集中碱基判定的协方差,然后使用协方差建立模型,最后基于模型进行质量校正。
gatk BaseRecalibrator \
-R $reff \
-I $out.mr.sort.bam \ #in bam
--use-original-qualities \
-O ${recalibration_report_filename} \
--known-sites ${dbSNP_vcf} \
--known-sites ${sep=” —known-sites “ known_indels_sites_VCFs}
gatk GatherBQSRReports \
-I ${
sep=' -I ' input_bqsr_reports} \
-O ${output_report_filename}
gatk ApplyBQSR \
-R $reff \
-I $out.mr.sort.bam \ #in bam
-O ${output_bam_basename}.bam \
-bqsr ${recalibration_report} \
--static-quantized-quals 10 --static-quantized-quals 20 --static-quantized-quals 30 \
--add-output-sam-program-record \
--create-output-bam-md5 \
--use-original-qualities
3) pipeline
###########################
# way1: step-by-step
###########################
# Map to Reference
# Mark Duplicates
# Base Quality Score Recalibration
###########################
# way2: wdl
###########################
java -jar womtools.jar inputs processing-for-variant-discovery-gatk4.wdl > processing_inputs.json
java -jar Cromwell.jar run processing-for-variant-discovery-gatk4.wdl —inputs processing_inputs.json
Data pre-processing for variant discovery
GATK4最佳实践-数据预处理篇
2. Germline SNP/INDEL
1) 目的
Analysis ready BAM -> GVCF
2) 步骤
- HaplotyperCaller
gatk --java-options "-Xmx6G -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \
HaplotypeCaller \
-R ${ref_fasta} \
-I ${input_bam} \
-L ${interval_list} \
-O ${output_filename} \
-contamination 0 -ERC GVCF
- Merge
gatk --java-options -Xmx2G \
MergeVcfs \
--INPUT ${
sep=' --INPUT ' input_vcfs} \
--OUTPUT ${output_filename}
3) pipeline
Reference Implementations
Pipeline | Summary |
---|---|
Prod* germline short variant per-sample calling | uBAM->GVCF |
Prod* germline short variant joint genotyping | GVCFs->cohort VCF |
$5 Genome Analysis Pipeline | uBAM->GVCF/cohort VCF |
Generic germline short variant per-sample calling analysis-ready | BAM->GVCF |
Generic germline short variant joint genotyping | GVCFs->cohort VCF |
Intel germline short variant per-sample calling | uBAM->GVCF |
3. Somatic SNP/INDEL
About GATK and GATK best practices
GATK
GATK(GenomeAnalysisToolkit)是一款侧重于从高通量测序数据中找变异(Variant Discovery)的命令行工具集。
Tool Documentation Index
GATK最佳实践
GATK最佳实践(GATK best practice)是Broad研究所使用的reads-to-variants的流程集。纵向上,最佳实践包括预处理(Data Pre-processing)、找变异(Variant Discovery)和处理后(非必须,部分应用的最佳实践需要该步骤)三个步骤。横向上,最佳实践因应用目的(①目标变异;②实验设计,如WGS, WES, panel)的不同而不同。最佳实现的呈现形式包括文档,文档中执行内容的参考实现(即用WDL编写的脚本)等。
Requirements
What are the requirements for running GATK?
Download and Install
# docker
docker pull broadinstitute/gatk # pull latest images, specify version by appending :tag
docker run -it broadinstitute/gatk # start up the GATK container
Run GATK in a Docker container
扩展阅读
1) 测序重复
How PCR duplicates arise in next-generation sequencing
博文介绍了PCR重复的来源,并用简化的泊松模型解释了重复率。
Increased read duplication on patterned flowcells- understanding the impact of Exclusion Amplification
介绍了常见重复的种类,重点介绍了因美纳 Patterned Flow Cells 的ExAmp/Clustering重复。
2) 竞品
Sentieon使用和GATK同样的数学模型;在算法设计上提高了计算效率;使用工业级软件标准实现。
参考
Best Practices Workflows
- About the GATK Best Practices
- Getting started with GATK4
免责声明:本站所有文章内容,图片,视频等均是来源于用户投稿和互联网及文摘转载整编而成,不代表本站观点,不承担相关法律责任。其著作权各归其原作者或其出版社所有。如发现本站有涉嫌抄袭侵权/违法违规的内容,侵犯到您的权益,请在线联系站长,一经查实,本站将立刻删除。 本文来自网络,若有侵权,请联系删除,如若转载,请注明出处:https://yundeesoft.com/21531.html