Sentieon(replacing BWA+GATK)¶

Warning

sentieon license 已过期，已无法使用。推荐使用GATK或deepvariant，或使用GPU进行加速的call变异工具 parabricks。

sentieon介绍¶

sentieon是Sentieon公司开发的用于替代GATK进行变异检测的数据分析套件，其一个软件就可以搞定变异检测所需的全部流程，因为其采用的数学模型与GATK相同，sentieon结果与GATK结果的一致性非常高(99.7%)。Sentieon主要有一下几个特点：

精度高、重复性好：GATK 采用了downsampling的方式进行取样，即在测序深度非常高的区域进行采样然后检测变异，因为变异检测的精度和重复性不是非常高。sentieon没有downsampling，因此精度和重复性会比GATK稍好，这在医疗领域比较重要；
速度快：sentieon改写了BWA，相比原生的BWA，速度提高1倍；同时排序和重复序列标记模块也比GATK模块中的快； sentieon速度的巨大提升主要体现在变异检测这一步，可达20-50倍；从fq->vcf，相比BWA/GATK速度提升至少10倍以上；
大样本 joint calling: 可以支持10万个样本同时joint calling，测试使用水稻3200个样本同时跑joint calling，单节点40小时就可以完成；
对存储IO要求高：如果是单机运行，建议使用固态盘。作重集群存储带宽基本满足要求。之前测试，400个haplotyper任务同时跑，存储读带宽峰值高达90GB/s。3200个水稻样本单节点36线程joint calling，节点读带宽稳定在3-4GB/s；
纯软件解决方案：目前有不少加速GATK的方案，大部分都是采用FPGA专有硬件加速，不仅成本较高而且很难大规模地使用，sentieon采用的纯软件方法，可以直接使用现有的硬件，版本更新也很方便；
持续更新：与GATK官方同步更新，一般延迟3个月。基本GATK更新版本中比较有用的工具，sentieon基本也会添加在后续更新的版本中。从2015年发布至今，已有几十个版本的更新；
大规模实践检验：从2015年发布第一个版本开始，sentieon不仅在FDA举行的各种比赛中表现优异，且在世界范围内为各大药厂、临床机构、科研机构以及生物公司所采用，基本得到了业界的认可
官方支持：在使用过程中如遇到任何问题，均可邮件给sentieon公司，基本一到两天内能收到回复；

官网：https://www.sentieon.com/

官方文档：https://support.sentieon.com/manual/

官方参考脚本：https://github.com/Sentieon/sentieon-scripts

作重集群sentieon使用¶

由于实验室多个课题组有大量的群体项目，其中大部分项目都在使用GATK进行变异检测，耗费的资源和时间都比较长。为提高科研效率，节省大家的时间，作重计算平台已采购了Sentieon部署在集群上免费提供给大家使用。此外杨庆勇老师已在集群用户群内上传的sentieon的相关资料和流程，有需要的可以下载参考学习，该流程已适配了作重集群，因此也可直接使用。集群/public/exercise/sentieon/ 目录内也有测试数据，大家可以练习。

benchmark¶

BWA¶

BWA测试命令

software/version	runtime(second)	cputime(second)	maxmem(MB)	speedup
BWA/0.7.17	7232	220916	42952	1
BWA-MEM2/2.2.1	4577	111560	60952	1.6x
sentieon-BWA/202010.04	5085	130644	15648	1.4x
sentieon-BWA/202112	3888	128449	16078	1.9x

#BSUB -J bwa
#BSUB -n 36
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal


module load sentieon/202112
module load sentieon/202010.04
module load BWA/0.7.17
module load BWA-MEM2/2.2.1
module load SAMtools/1.9


# fastq文件路径
fq1=/public/home/software/test/callsnp/data/fastq/BYU21001_1.fastq.gz
fq2=/public/home/software/test/callsnp/data/fastq/BYU21001_2.fastq.gz
# 参考基因组文件
fasta=/public/home/software/test/callsnp/data/Gbarbadense_genome.fasta

#sentieon-BWA
sentieon bwa mem -R "@RG\tID:${i}\tSM:${i}\tPL:PL" -t $LSB_DJOB_NUMPROC -K 10000000 $fasta $fq1 $fq2 | sentieon util sort -r $fasta -o ${i}_sorted.bam -t $LSB_DJOB_NUMPROC --sam2bam -i -

#bwa
bwa mem -t $LSB_DJOB_NUMPROC $fasta $fq1 $fq2 | samtools sort -@${LSB_DJOB_NUMPROC} -o ${i}_sorted.bam -

#bwa-mem2
bwa-mem2.avx512bw mem -t $LSB_DJOB_NUMPROC $fasta $fq1 $fq2 | samtools sort -@${LSB_DJOB_NUMPROC} -o ${i}_sorted.bam -

minimap2¶

minimap2测试命令

software/version	runtime(second)	cputime(second)	maxmem(MB)	speedup
minimap2/v2.17	1182	23468	6291	1
sentieon-minimap2/202112	961	18750	8464	1.2x

module load sentieon/202112;sentieon minimap2 -t 20 -ax map-hifi YJSM.prefix1.fa YJSM.hifi_reads.fastq.gz > YJSM.minimap2.2.sentieon.sam

module load minimap2/v2.17;minimap2 -t 20 -ax map-hifi YJSM.prefix1.fa YJSM.hifi_reads.fastq.gz > YJSM.minimap2.2.sentieon.sam

sentieon使用技巧¶

sentieon使用如果有什么问题，可以邮件给官方，如果问题比较普遍，解决后可以联系管理员整理到这里给大家参考。

由于sentieon的BWA相比原生的BWA速度提高了一倍，因此即使不是做变异检测，只用使用BWA比对，也同样可以使用sentieon内的BWA模块，以提高计算速度；
杨老师给的流程中使用了cram格式保存BWA比对后的文件，相比bam格式，其占用的存储空间更小，大约只有bam的60%。但这会导致bwa和sort步骤变慢，可以根据自己的需求调整；
有同学使用的参考基因组有200万以上的contig，导致sentieon运行非常缓慢。官方给出的建议是使用--interval 1,2,3,4,5,6,7,8,9,10,Pt,Mt之类的参数，只分析感兴趣的chromosome；或者将大量的细碎的contig合并成一个contig，中间有N连接，保证原有reference的所有contig都分析到，但是需要对最后的VCF做进一步处理以还原为原有的reference；
joint calling (sentieon --algo GVCFtyper，即合并多样本的haplotyper运行结果) 之前，建议先用vcfconvert模块压缩所有的gvcf文件并建立索引， sentieon util vcfconvert input.gvcf output.gvcf.gz，然后再join calling，速度会有几十倍的提高（水稻3200个样本的joint calling时间从800h缩短到40h）；
全基因组运行haplotyper时，不要--bam_output选项，否则速度会变得很慢，而且可能会出错；
Haplotyper: Only a single sample is allowed in GVCF emit mode, 在使用--emit_mode GVCF的时候，bam文件只能包含单一sample，不支持多sample，否则会出现这个报错。如果BAM文件不是多sample的话，那就应该是@RG line设置有问题。
使用GATK call的gvcf文件，也可以使用sentieon进行joint call。

参考脚本¶

拟南芥 sentieon callsnp 流程脚本

说明sentieon_ath50x.lsfsentieon_ath50x.sh

拟南芥NGS的测试文件

/public/exercise/sentieon/tair10_1.fastq.gz
/public/exercise/sentieon/tair10_2.fastq.gz

fasta=/public/exercise/sentieon/reference_Tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.modified.fa

#BSUB -J Sentieon
#BSUB -n 16
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal

# 加载所需软件
# module load sentieon/201808.07
export SENTIEON_LICENSE=mn01:9000
module load SAMtools/1.9
release_dir=/public/home/software/opt/bio/software/Sentieon/201808.07

# 样本名称
i="tair10"
# fastq文件路径
fq1=/public/exercise/sentieon/tair10_1.fastq.gz
fq2=/public/exercise/sentieon/tair10_2.fastq.gz
# 参考基因组文件
fasta=/public/exercise/sentieon/reference_Tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.modified.fa
# 输出文件路径
workdir=`pwd`/ath50x_result
# 需要使用的核心数
nt=16
# 比对信息
group_prefix="read_group_name"
platform="ILLUMINA"
mq=30

[ ! -d $workdir ] && mkdir -p $workdir
cd $workdir
# 输出文件
rawCram=$i.cram
sortedCram=$i.q$mq.sorted.cram
depCram=$i.deduped.cram
realnCram=$i.realn.cram
outvcf=$i.vcf
exec > $workdir/$i.callVCF.log 2>&1 # call vcf的日志文件

# ******************************************
# 1. 利用 BWA-MEM 进行比对并排序
# ******************************************
( $release_dir/bin/sentieon bwa mem -M -R "@RG\tID:${i}\tSM:${i}\tPL:$platform" \
-t $nt -K 10000000 $fasta $fq1 $fq2 || echo -n 'error' ) | samtools sort -@ $nt  --output-fmt CRAM \
--reference $fasta -o $rawCram - && samtools index -@ $nt $rawCram 
samtools view -hCS -T $fasta -q $mq -o $sortedCram $rawCram && \
samtools index -@ $nt $sortedCram
samtools flagstat $rawCram > $i.stat.raw.txt && \
samtools flagstat $sortedCram > $i.stat.q$mq.txt &

# ******************************************
# 2. Calculate data metrics
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo MeanQualityByCycle ${i}_mq_metrics.txt \
--algo QualDistribution ${i}_qd_metrics.txt --algo GCBias --summary ${i}_gc_summary.txt ${i}_gc_metrics.txt \
--algo AlignmentStat --adapter_seq '' ${i}_aln_metrics.txt --algo InsertSizeMetricAlgo ${i}_is_metrics.txt
$release_dir/bin/sentieon plot metrics -o ${i}_metrics-report.pdf gc=${i}_gc_metrics.txt \
qd=${i}_qd_metrics.txt mq=${i}_mq_metrics.txt isize=${i}_is_metrics.txt
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo LocusCollector --fun score_info ${i}_score.txt

# ******************************************
# 3. 去除 Duplicate Reads
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo Dedup --rmdup --cram_write_options version=3.0 \
--score_info ${i}_score.txt --metrics ${i}_dedup_metrics.txt $depCram && rm -f $sortedCram

# ******************************************
# 4. Indel 重排序 (可选)
# 如果只需要最终的比对结果文件，到这里就可以了，这条命令下面的命令都可以注释掉
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $depCram --algo Realigner --cram_write_options version=3.0 \
$realnCram && rm -f $depCram

# ******************************************
# 5. Variant calling
# ******************************************
$release_dir/bin/sentieon driver -t $nt -r $fasta -i $realnCram --algo Genotyper $outvcf

#!/bin/sh
# *******************************************


# Update with the fullpath location of your sample fastq
set -x
data_dir="$( cd -P "$( dirname "$0" )" && pwd )"  #workdir
fastq_1=/public/exercise/sentieon/tair10_1.fastq.gz
fastq_2=/public/exercise/sentieon/tair10_2.fastq.gz

# Update with the location of the reference data files
fasta=/public/exercise/sentieon/reference_Tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.modified.fa

# Set SENTIEON_LICENSE if it is not set in the environment
module load SAMtools/1.9
#module load sentieon/201808.07
 export SENTIEON_LICENSE=mn01:9000

# Update with the location of the Sentieon software package
SENTIEON_INSTALL_DIR=/public/home/software/opt/bio/software/Sentieon/201808.07

# It is important to assign meaningful names in actual cases.
# It is particularly important to assign different read group names.
sample="tair10"
group="G"
platform="ILLUMINA" 

# Other settings
nt=16 #number of threads to use in computation

# ******************************************
# 0. Setup
# ******************************************
workdir=$data_dir/result-tair10
mkdir -p $workdir
logfile=$workdir/run.log
exec >$logfile 2>&1
cd $workdir

#Sentieon proprietary compression
bam_option="--bam_compression 1"

# ******************************************
# 1. Mapping reads with BWA-MEM, sorting
# ******************************************
#The results of this call are dependent on the number of threads used. To have number of threads independent results, add chunk size option -K 10000000 

# speed up memory allocation malloc in bwa
export LD_PRELOAD=$SENTIEON_INSTALL_DIR/lib/libjemalloc.so
export MALLOC_CONF=lg_dirty_mult:-1

( $SENTIEON_INSTALL_DIR/bin/sentieon bwa mem -M -R "@RG\tID:$group\tSM:$sample\tPL:$platform" -t $nt -K 10000000 $fasta $fastq_1 $fastq_2 || echo -n 'error' ) | $SENTIEON_INSTALL_DIR/bin/sentieon util sort $bam_option -r $fasta -o sorted.bam -t $nt --sam2bam -i -

# ******************************************
# 2. Metrics
# ******************************************
$SENTIEON_INSTALL_DIR/bin/sentieon driver -r $fasta -t $nt -i sorted.bam --algo MeanQualityByCycle mq_metrics.txt --algo QualDistribution qd_metrics.txt --algo GCBias --summary gc_summary.txt gc_metrics.txt --algo AlignmentStat --adapter_seq '' aln_metrics.txt --algo InsertSizeMetricAlgo is_metrics.txt
$SENTIEON_INSTALL_DIR/bin/sentieon plot GCBias -o gc-report.pdf gc_metrics.txt
$SENTIEON_INSTALL_DIR/bin/sentieon plot QualDistribution -o qd-report.pdf qd_metrics.txt
$SENTIEON_INSTALL_DIR/bin/sentieon plot MeanQualityByCycle -o mq-report.pdf mq_metrics.txt
$SENTIEON_INSTALL_DIR/bin/sentieon plot InsertSizeMetricAlgo -o is-report.pdf is_metrics.txt

# ******************************************
# 3. Remove Duplicate Reads
# To mark duplicate reads only without removing them, remove "--rmdup" in the second command
# ******************************************
$SENTIEON_INSTALL_DIR/bin/sentieon driver -t $nt -i sorted.bam --algo LocusCollector --fun score_info score.txt
$SENTIEON_INSTALL_DIR/bin/sentieon driver -t $nt -i sorted.bam --algo Dedup --rmdup --score_info score.txt --metrics dedup_metrics.txt $bam_option deduped.bam 

# ******************************************

# ******************************************
# 5. Base recalibration
# ******************************************

# Perform recalibration
$SENTIEON_INSTALL_DIR/bin/sentieon driver -r $fasta -t $nt -i deduped.bam --algo QualCal recal_data.table

# Perform post-calibration check (optional)
$SENTIEON_INSTALL_DIR/bin/sentieon driver -r $fasta -t $nt -i deduped.bam -q recal_data.table --algo QualCal recal_data.table.post
$SENTIEON_INSTALL_DIR/bin/sentieon driver -t $nt --algo QualCal --plot --before recal_data.table --after recal_data.table.post recal.csv   
$SENTIEON_INSTALL_DIR/bin/sentieon plot QualCal -o recal_plots.pdf recal.csv


# ******************************************
# 6. HC Variant caller
# Note: Sentieon default setting matches versions before GATK 3.7.
# Starting GATK v3.7, the default settings have been updated multiple times. 
# Below shows commands to match GATK v3.7 - 4.1
# Please change according to your desired behavior.
# ******************************************

# Matching GATK 3.7, 3.8, 4.0
#$SENTIEON_INSTALL_DIR/bin/sentieon driver -r $fasta -t $nt -i deduped.bam -q recal_data.table --algo Haplotyper  --emit_conf=10 --call_conf=10 output-hc.vcf.gz

# Matching GATK 4.1
$SENTIEON_INSTALL_DIR/bin/sentieon driver -r $fasta -t $nt -i deduped.bam -q recal_data.table --algo Haplotyper  --genotype_model multinomial --emit_conf 30 --call_conf 30 output-hc.vcf.gz

其它sentieon脚本

说明sentieon_people.lsfsentieon_bwa.lsf

人类NGS的测试文件

/public/exercise/sentieon/1.fastq.gz
/public/exercise/sentieon/2.fastq.gz

fasta=/public/exercise/sentieon/reference/ucsc.hg19_chr22.fasta
dbsnp=/public/exercise/sentieon/reference/dbsnp_135.hg19_chr22.vcf
known_1000G_indels=/public/exercise/sentieon/reference/1000G_phase1.snps.high_confidence.hg19_chr22.sites.vcf
known_Mills_indels=/public/exercise/sentieon/reference/Mills_and_1000G_gold_standard.indels.hg19_chr22.sites.vcf

#BSUB -J Sentieon
#BSUB -n 16
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal

# 加载所需软件
# module load sentieon/201808.07
export SENTIEON_LICENSE=mn01:9000
module load SAMtools/1.9
release_dir=/public/home/software/opt/bio/software/Sentieon/201808.07

# 样本名称
i="Test"
# fastq文件路径
fq1=/public/exercise/sentieon/1.fastq.gz
fq2=/public/exercise/sentieon/2.fastq.gz
# 参考基因组文件
fasta=/public/exercise/sentieon/reference/ucsc.hg19_chr22.fasta
# 输出文件路径
workdir=`pwd`/People_result
# 需要使用的核心数
nt=16
# 相应数据库
dbsnp=/public/exercise/sentieon/reference/dbsnp_135.hg19_chr22.vcf
known_1000G_indels=/public/exercise/sentieon/reference/1000G_phase1.snps.high_confidence.hg19_chr22.sites.vcf
known_Mills_indels=/public/exercise/sentieon/reference/Mills_and_1000G_gold_standard.indels.hg19_chr22.sites.vcf
# 比对信息
group_prefix="read_group_name"
platform="ILLUMINA"
mq=30

[ ! -d $workdir ] && mkdir -p $workdir
cd $workdir
# 输出文件
rawCram=$i.cram
sortedCram=$i.q$mq.sorted.cram
depCram=$i.deduped.cram
realnCram=$i.realn.cram
outvcf=$i.vcf
exec > $workdir/$i.callVCF.log 2>&1 # call vcf的日志文件

# ******************************************
# 1. 利用 BWA-MEM 进行比对并排序
# ******************************************
( $release_dir/bin/sentieon bwa mem -M -R "@RG\tID:${i}\tSM:${i}\tPL:$platform" \
-t $nt -K 10000000 $fasta $fq1 $fq2 || echo -n 'error' ) | samtools sort -@ $nt  --output-fmt CRAM \
--reference $fasta -o $rawCram - && samtools index -@ $nt $rawCram 
samtools view -hCS -T $fasta -q $mq -o $sortedCram $rawCram && \
samtools index -@ $nt $sortedCram
samtools flagstat $rawCram > $i.stat.raw.txt && \
samtools flagstat $sortedCram > $i.stat.q$mq.txt &

# ******************************************
# 2. Calculate data metrics
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo MeanQualityByCycle ${i}_mq_metrics.txt \
--algo QualDistribution ${i}_qd_metrics.txt --algo GCBias --summary ${i}_gc_summary.txt ${i}_gc_metrics.txt \
--algo AlignmentStat --adapter_seq '' ${i}_aln_metrics.txt --algo InsertSizeMetricAlgo ${i}_is_metrics.txt
$release_dir/bin/sentieon plot metrics -o ${i}_metrics-report.pdf gc=${i}_gc_metrics.txt \
qd=${i}_qd_metrics.txt mq=${i}_mq_metrics.txt isize=${i}_is_metrics.txt
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo LocusCollector --fun score_info ${i}_score.txt

# ******************************************
# 3. 去除 Duplicate Reads
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo Dedup --rmdup --cram_write_options version=3.0 \
--score_info ${i}_score.txt --metrics ${i}_dedup_metrics.txt $depCram && rm -f $sortedCram

# ******************************************
# 4. Indel 重排序 (可选)
# 如果只需要最终的比对结果文件，到这里就可以了，这条命令下面的命令都可以注释掉
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $depCram --algo Realigner -k ${known_1000G_indels} --cram_write_options version=3.0 \
$realnCram && rm -f $depCram

# ******************************************
# 5. Variant calling
# ******************************************
$release_dir/bin/sentieon driver -t $nt -r $fasta -i $realnCram --algo Genotyper -d ${dbsnp} ${outvcf}

#BSUB -J Sentieon_bwa
#BSUB -n 16
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal

# 加载所需软件
module load sentieon/201911

nt=$LSB_DJOB_NUMPROC

fasta=/public/home/test/haoliu/work/callsnp/data/hg38.fa

fq1=/public/home/test/haoliu/work/callsnp/data/ERR194146_1.fastq.gz
fq2=/public/home/test/haoliu/work/callsnp/data/ERR194146_2.fastq.gz
i=sample1
sortedBam=ERR194146.sorted.bam

(sentieon bwa mem -M -R "@RG\tID:${i}\tSM:${i}\tPL:$platform" -t $nt -K 10000000 $fasta $fq1 $fq2 || echo -n 'error' ) | sentieon util sort -r $fasta -o $sortedBam -t $nt --sam2bam -i -

参考文档¶

Sentieon_QuickStart

Computational_performance_and_accuracy_of_Sentieon_DNASeq_variant_calling_workflow

Introduction_short_Sentieon

Sentieon_DNA-Seq_Pipeline

Sentieon_DNASeq_Variant_Calling_Workflow_Demonstrates_Strong_Computational_Performance_and_Accuracy

本文阅读量次
本站总访问量次