Sentieon(replacing BWA+GATK)¶
Warning
sentieon license 已过期,已无法使用。推荐使用GATK或deepvariant,或使用GPU进行加速的call变异工具 parabricks。
sentieon介绍¶
sentieon是Sentieon公司开发的用于替代GATK进行变异检测的数据分析套件,其一个软件就可以搞定变异检测所需的全部流程,因为其采用的数学模型与GATK相同,sentieon结果与GATK结果的一致性非常高(99.7%)。Sentieon主要有一下几个特点:
- 精度高、重复性好:GATK 采用了downsampling的方式进行取样,即在测序深度非常高的区域进行采样然后检测变异,因为变异检测的精度和重复性不是非常高。sentieon没有downsampling,因此精度和重复性会比GATK稍好,这在医疗领域比较重要;
- 速度快:sentieon改写了BWA,相比原生的BWA,速度提高1倍;同时排序和重复序列标记模块也比GATK模块中的快; sentieon速度的巨大提升主要体现在变异检测这一步,可达20-50倍;从fq->vcf,相比BWA/GATK速度提升至少10倍以上;
- 大样本 joint calling: 可以支持10万个样本同时joint calling,测试使用水稻3200个样本同时跑joint calling,单节点40小时就可以完成;
- 对存储IO要求高:如果是单机运行,建议使用固态盘。作重集群存储带宽基本满足要求。之前测试,400个haplotyper任务同时跑,存储读带宽峰值高达90GB/s。3200个水稻样本单节点36线程joint calling,节点读带宽稳定在3-4GB/s;
- 纯软件解决方案:目前有不少加速GATK的方案,大部分都是采用FPGA专有硬件加速,不仅成本较高而且很难大规模地使用,sentieon采用的纯软件方法,可以直接使用现有的硬件,版本更新也很方便;
- 持续更新:与GATK官方同步更新,一般延迟3个月。基本GATK更新版本中比较有用的工具,sentieon基本也会添加在后续更新的版本中。从2015年发布至今,已有几十个版本的更新;
- 大规模实践检验:从2015年发布第一个版本开始,sentieon不仅在FDA举行的各种比赛中表现优异,且在世界范围内为各大药厂、临床机构、科研机构以及生物公司所采用,基本得到了业界的认可
- 官方支持:在使用过程中如遇到任何问题,均可邮件给sentieon公司,基本一到两天内能收到回复;
官方文档:https://support.sentieon.com/manual/
官方参考脚本:https://github.com/Sentieon/sentieon-scripts
作重集群sentieon使用¶
由于实验室多个课题组有大量的群体项目,其中大部分项目都在使用GATK进行变异检测,耗费的资源和时间都比较长。为提高科研效率,节省大家的时间,作重计算平台已采购了Sentieon部署在集群上免费提供给大家使用。此外杨庆勇老师已在集群用户群内上传的sentieon的相关资料和流程,有需要的可以下载参考学习,该流程已适配了作重集群,因此也可直接使用。集群/public/exercise/sentieon/ 目录内也有测试数据,大家可以练习。
benchmark¶
BWA¶
software/version | runtime(second) | cputime(second) | maxmem(MB) | speedup |
---|---|---|---|---|
BWA/0.7.17 | 7232 | 220916 | 42952 | 1 |
BWA-MEM2/2.2.1 | 4577 | 111560 | 60952 | 1.6x |
sentieon-BWA/202010.04 | 5085 | 130644 | 15648 | 1.4x |
sentieon-BWA/202112 | 3888 | 128449 | 16078 | 1.9x |
#BSUB -J bwa
#BSUB -n 36
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
module load sentieon/202112
module load sentieon/202010.04
module load BWA/0.7.17
module load BWA-MEM2/2.2.1
module load SAMtools/1.9
# fastq文件路径
fq1=/public/home/software/test/callsnp/data/fastq/BYU21001_1.fastq.gz
fq2=/public/home/software/test/callsnp/data/fastq/BYU21001_2.fastq.gz
# 参考基因组文件
fasta=/public/home/software/test/callsnp/data/Gbarbadense_genome.fasta
#sentieon-BWA
sentieon bwa mem -R "@RG\tID:${i}\tSM:${i}\tPL:PL" -t $LSB_DJOB_NUMPROC -K 10000000 $fasta $fq1 $fq2 | sentieon util sort -r $fasta -o ${i}_sorted.bam -t $LSB_DJOB_NUMPROC --sam2bam -i -
#bwa
bwa mem -t $LSB_DJOB_NUMPROC $fasta $fq1 $fq2 | samtools sort -@${LSB_DJOB_NUMPROC} -o ${i}_sorted.bam -
#bwa-mem2
bwa-mem2.avx512bw mem -t $LSB_DJOB_NUMPROC $fasta $fq1 $fq2 | samtools sort -@${LSB_DJOB_NUMPROC} -o ${i}_sorted.bam -
minimap2¶
software/version | runtime(second) | cputime(second) | maxmem(MB) | speedup |
---|---|---|---|---|
minimap2/v2.17 | 1182 | 23468 | 6291 | 1 |
sentieon-minimap2/202112 | 961 | 18750 | 8464 | 1.2x |
module load sentieon/202112;sentieon minimap2 -t 20 -ax map-hifi YJSM.prefix1.fa YJSM.hifi_reads.fastq.gz > YJSM.minimap2.2.sentieon.sam
module load minimap2/v2.17;minimap2 -t 20 -ax map-hifi YJSM.prefix1.fa YJSM.hifi_reads.fastq.gz > YJSM.minimap2.2.sentieon.sam
sentieon使用技巧¶
sentieon使用如果有什么问题,可以邮件给官方,如果问题比较普遍,解决后可以联系管理员整理到这里给大家参考。
- 由于sentieon的BWA相比原生的BWA速度提高了一倍,因此即使不是做变异检测,只用使用BWA比对,也同样可以使用sentieon内的BWA模块,以提高计算速度;
- 杨老师给的流程中使用了cram格式保存BWA比对后的文件,相比bam格式,其占用的存储空间更小,大约只有bam的60%。但这会导致bwa和sort步骤变慢,可以根据自己的需求调整;
- 有同学使用的参考基因组有200万以上的contig,导致sentieon运行非常缓慢。官方给出的建议是使用--interval 1,2,3,4,5,6,7,8,9,10,Pt,Mt之类的参数,只分析感兴趣的chromosome;或者将大量的细碎的contig合并成一个contig,中间有N连接,保证原有reference的所有contig都分析到,但是需要对最后的VCF做进一步处理以还原为原有的reference;
- joint calling (sentieon --algo GVCFtyper,即合并多样本的haplotyper运行结果) 之前,建议先用vcfconvert模块压缩所有的gvcf文件并建立索引,
sentieon util vcfconvert input.gvcf output.gvcf.gz
,然后再join calling,速度会有几十倍的提高(水稻3200个样本的joint calling时间从800h缩短到40h); - 全基因组运行haplotyper时,不要--bam_output选项,否则速度会变得很慢,而且可能会出错;
- Haplotyper: Only a single sample is allowed in GVCF emit mode, 在使用--emit_mode GVCF的时候,bam文件只能包含单一sample,不支持多sample,否则会出现这个报错。如果BAM文件不是多sample的话,那就应该是@RG line设置有问题。
- 使用GATK call的gvcf文件,也可以使用sentieon进行joint call。
参考脚本¶
拟南芥 sentieon callsnp 流程脚本
拟南芥NGS的测试文件
/public/exercise/sentieon/tair10_1.fastq.gz
/public/exercise/sentieon/tair10_2.fastq.gz
fasta=/public/exercise/sentieon/reference_Tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.modified.fa
#BSUB -J Sentieon
#BSUB -n 16
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
# 加载所需软件
# module load sentieon/201808.07
export SENTIEON_LICENSE=mn01:9000
module load SAMtools/1.9
release_dir=/public/home/software/opt/bio/software/Sentieon/201808.07
# 样本名称
i="tair10"
# fastq文件路径
fq1=/public/exercise/sentieon/tair10_1.fastq.gz
fq2=/public/exercise/sentieon/tair10_2.fastq.gz
# 参考基因组文件
fasta=/public/exercise/sentieon/reference_Tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.modified.fa
# 输出文件路径
workdir=`pwd`/ath50x_result
# 需要使用的核心数
nt=16
# 比对信息
group_prefix="read_group_name"
platform="ILLUMINA"
mq=30
[ ! -d $workdir ] && mkdir -p $workdir
cd $workdir
# 输出文件
rawCram=$i.cram
sortedCram=$i.q$mq.sorted.cram
depCram=$i.deduped.cram
realnCram=$i.realn.cram
outvcf=$i.vcf
exec > $workdir/$i.callVCF.log 2>&1 # call vcf的日志文件
# ******************************************
# 1. 利用 BWA-MEM 进行比对并排序
# ******************************************
( $release_dir/bin/sentieon bwa mem -M -R "@RG\tID:${i}\tSM:${i}\tPL:$platform" \
-t $nt -K 10000000 $fasta $fq1 $fq2 || echo -n 'error' ) | samtools sort -@ $nt --output-fmt CRAM \
--reference $fasta -o $rawCram - && samtools index -@ $nt $rawCram
samtools view -hCS -T $fasta -q $mq -o $sortedCram $rawCram && \
samtools index -@ $nt $sortedCram
samtools flagstat $rawCram > $i.stat.raw.txt && \
samtools flagstat $sortedCram > $i.stat.q$mq.txt &
# ******************************************
# 2. Calculate data metrics
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo MeanQualityByCycle ${i}_mq_metrics.txt \
--algo QualDistribution ${i}_qd_metrics.txt --algo GCBias --summary ${i}_gc_summary.txt ${i}_gc_metrics.txt \
--algo AlignmentStat --adapter_seq '' ${i}_aln_metrics.txt --algo InsertSizeMetricAlgo ${i}_is_metrics.txt
$release_dir/bin/sentieon plot metrics -o ${i}_metrics-report.pdf gc=${i}_gc_metrics.txt \
qd=${i}_qd_metrics.txt mq=${i}_mq_metrics.txt isize=${i}_is_metrics.txt
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo LocusCollector --fun score_info ${i}_score.txt
# ******************************************
# 3. 去除 Duplicate Reads
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo Dedup --rmdup --cram_write_options version=3.0 \
--score_info ${i}_score.txt --metrics ${i}_dedup_metrics.txt $depCram && rm -f $sortedCram
# ******************************************
# 4. Indel 重排序 (可选)
# 如果只需要最终的比对结果文件,到这里就可以了,这条命令下面的命令都可以注释掉
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $depCram --algo Realigner --cram_write_options version=3.0 \
$realnCram && rm -f $depCram
# ******************************************
# 5. Variant calling
# ******************************************
$release_dir/bin/sentieon driver -t $nt -r $fasta -i $realnCram --algo Genotyper $outvcf
#!/bin/sh
# *******************************************
# Update with the fullpath location of your sample fastq
set -x
data_dir="$( cd -P "$( dirname "$0" )" && pwd )" #workdir
fastq_1=/public/exercise/sentieon/tair10_1.fastq.gz
fastq_2=/public/exercise/sentieon/tair10_2.fastq.gz
# Update with the location of the reference data files
fasta=/public/exercise/sentieon/reference_Tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.modified.fa
# Set SENTIEON_LICENSE if it is not set in the environment
module load SAMtools/1.9
#module load sentieon/201808.07
export SENTIEON_LICENSE=mn01:9000
# Update with the location of the Sentieon software package
SENTIEON_INSTALL_DIR=/public/home/software/opt/bio/software/Sentieon/201808.07
# It is important to assign meaningful names in actual cases.
# It is particularly important to assign different read group names.
sample="tair10"
group="G"
platform="ILLUMINA"
# Other settings
nt=16 #number of threads to use in computation
# ******************************************
# 0. Setup
# ******************************************
workdir=$data_dir/result-tair10
mkdir -p $workdir
logfile=$workdir/run.log
exec >$logfile 2>&1
cd $workdir
#Sentieon proprietary compression
bam_option="--bam_compression 1"
# ******************************************
# 1. Mapping reads with BWA-MEM, sorting
# ******************************************
#The results of this call are dependent on the number of threads used. To have number of threads independent results, add chunk size option -K 10000000
# speed up memory allocation malloc in bwa
export LD_PRELOAD=$SENTIEON_INSTALL_DIR/lib/libjemalloc.so
export MALLOC_CONF=lg_dirty_mult:-1
( $SENTIEON_INSTALL_DIR/bin/sentieon bwa mem -M -R "@RG\tID:$group\tSM:$sample\tPL:$platform" -t $nt -K 10000000 $fasta $fastq_1 $fastq_2 || echo -n 'error' ) | $SENTIEON_INSTALL_DIR/bin/sentieon util sort $bam_option -r $fasta -o sorted.bam -t $nt --sam2bam -i -
# ******************************************
# 2. Metrics
# ******************************************
$SENTIEON_INSTALL_DIR/bin/sentieon driver -r $fasta -t $nt -i sorted.bam --algo MeanQualityByCycle mq_metrics.txt --algo QualDistribution qd_metrics.txt --algo GCBias --summary gc_summary.txt gc_metrics.txt --algo AlignmentStat --adapter_seq '' aln_metrics.txt --algo InsertSizeMetricAlgo is_metrics.txt
$SENTIEON_INSTALL_DIR/bin/sentieon plot GCBias -o gc-report.pdf gc_metrics.txt
$SENTIEON_INSTALL_DIR/bin/sentieon plot QualDistribution -o qd-report.pdf qd_metrics.txt
$SENTIEON_INSTALL_DIR/bin/sentieon plot MeanQualityByCycle -o mq-report.pdf mq_metrics.txt
$SENTIEON_INSTALL_DIR/bin/sentieon plot InsertSizeMetricAlgo -o is-report.pdf is_metrics.txt
# ******************************************
# 3. Remove Duplicate Reads
# To mark duplicate reads only without removing them, remove "--rmdup" in the second command
# ******************************************
$SENTIEON_INSTALL_DIR/bin/sentieon driver -t $nt -i sorted.bam --algo LocusCollector --fun score_info score.txt
$SENTIEON_INSTALL_DIR/bin/sentieon driver -t $nt -i sorted.bam --algo Dedup --rmdup --score_info score.txt --metrics dedup_metrics.txt $bam_option deduped.bam
# ******************************************
# ******************************************
# 5. Base recalibration
# ******************************************
# Perform recalibration
$SENTIEON_INSTALL_DIR/bin/sentieon driver -r $fasta -t $nt -i deduped.bam --algo QualCal recal_data.table
# Perform post-calibration check (optional)
$SENTIEON_INSTALL_DIR/bin/sentieon driver -r $fasta -t $nt -i deduped.bam -q recal_data.table --algo QualCal recal_data.table.post
$SENTIEON_INSTALL_DIR/bin/sentieon driver -t $nt --algo QualCal --plot --before recal_data.table --after recal_data.table.post recal.csv
$SENTIEON_INSTALL_DIR/bin/sentieon plot QualCal -o recal_plots.pdf recal.csv
# ******************************************
# 6. HC Variant caller
# Note: Sentieon default setting matches versions before GATK 3.7.
# Starting GATK v3.7, the default settings have been updated multiple times.
# Below shows commands to match GATK v3.7 - 4.1
# Please change according to your desired behavior.
# ******************************************
# Matching GATK 3.7, 3.8, 4.0
#$SENTIEON_INSTALL_DIR/bin/sentieon driver -r $fasta -t $nt -i deduped.bam -q recal_data.table --algo Haplotyper --emit_conf=10 --call_conf=10 output-hc.vcf.gz
# Matching GATK 4.1
$SENTIEON_INSTALL_DIR/bin/sentieon driver -r $fasta -t $nt -i deduped.bam -q recal_data.table --algo Haplotyper --genotype_model multinomial --emit_conf 30 --call_conf 30 output-hc.vcf.gz
其它sentieon脚本
人类NGS的测试文件
/public/exercise/sentieon/1.fastq.gz
/public/exercise/sentieon/2.fastq.gz
fasta=/public/exercise/sentieon/reference/ucsc.hg19_chr22.fasta
dbsnp=/public/exercise/sentieon/reference/dbsnp_135.hg19_chr22.vcf
known_1000G_indels=/public/exercise/sentieon/reference/1000G_phase1.snps.high_confidence.hg19_chr22.sites.vcf
known_Mills_indels=/public/exercise/sentieon/reference/Mills_and_1000G_gold_standard.indels.hg19_chr22.sites.vcf
#BSUB -J Sentieon
#BSUB -n 16
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
# 加载所需软件
# module load sentieon/201808.07
export SENTIEON_LICENSE=mn01:9000
module load SAMtools/1.9
release_dir=/public/home/software/opt/bio/software/Sentieon/201808.07
# 样本名称
i="Test"
# fastq文件路径
fq1=/public/exercise/sentieon/1.fastq.gz
fq2=/public/exercise/sentieon/2.fastq.gz
# 参考基因组文件
fasta=/public/exercise/sentieon/reference/ucsc.hg19_chr22.fasta
# 输出文件路径
workdir=`pwd`/People_result
# 需要使用的核心数
nt=16
# 相应数据库
dbsnp=/public/exercise/sentieon/reference/dbsnp_135.hg19_chr22.vcf
known_1000G_indels=/public/exercise/sentieon/reference/1000G_phase1.snps.high_confidence.hg19_chr22.sites.vcf
known_Mills_indels=/public/exercise/sentieon/reference/Mills_and_1000G_gold_standard.indels.hg19_chr22.sites.vcf
# 比对信息
group_prefix="read_group_name"
platform="ILLUMINA"
mq=30
[ ! -d $workdir ] && mkdir -p $workdir
cd $workdir
# 输出文件
rawCram=$i.cram
sortedCram=$i.q$mq.sorted.cram
depCram=$i.deduped.cram
realnCram=$i.realn.cram
outvcf=$i.vcf
exec > $workdir/$i.callVCF.log 2>&1 # call vcf的日志文件
# ******************************************
# 1. 利用 BWA-MEM 进行比对并排序
# ******************************************
( $release_dir/bin/sentieon bwa mem -M -R "@RG\tID:${i}\tSM:${i}\tPL:$platform" \
-t $nt -K 10000000 $fasta $fq1 $fq2 || echo -n 'error' ) | samtools sort -@ $nt --output-fmt CRAM \
--reference $fasta -o $rawCram - && samtools index -@ $nt $rawCram
samtools view -hCS -T $fasta -q $mq -o $sortedCram $rawCram && \
samtools index -@ $nt $sortedCram
samtools flagstat $rawCram > $i.stat.raw.txt && \
samtools flagstat $sortedCram > $i.stat.q$mq.txt &
# ******************************************
# 2. Calculate data metrics
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo MeanQualityByCycle ${i}_mq_metrics.txt \
--algo QualDistribution ${i}_qd_metrics.txt --algo GCBias --summary ${i}_gc_summary.txt ${i}_gc_metrics.txt \
--algo AlignmentStat --adapter_seq '' ${i}_aln_metrics.txt --algo InsertSizeMetricAlgo ${i}_is_metrics.txt
$release_dir/bin/sentieon plot metrics -o ${i}_metrics-report.pdf gc=${i}_gc_metrics.txt \
qd=${i}_qd_metrics.txt mq=${i}_mq_metrics.txt isize=${i}_is_metrics.txt
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo LocusCollector --fun score_info ${i}_score.txt
# ******************************************
# 3. 去除 Duplicate Reads
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $sortedCram --algo Dedup --rmdup --cram_write_options version=3.0 \
--score_info ${i}_score.txt --metrics ${i}_dedup_metrics.txt $depCram && rm -f $sortedCram
# ******************************************
# 4. Indel 重排序 (可选)
# 如果只需要最终的比对结果文件,到这里就可以了,这条命令下面的命令都可以注释掉
# ******************************************
$release_dir/bin/sentieon driver -r $fasta -t $nt -i $depCram --algo Realigner -k ${known_1000G_indels} --cram_write_options version=3.0 \
$realnCram && rm -f $depCram
# ******************************************
# 5. Variant calling
# ******************************************
$release_dir/bin/sentieon driver -t $nt -r $fasta -i $realnCram --algo Genotyper -d ${dbsnp} ${outvcf}
#BSUB -J Sentieon_bwa
#BSUB -n 16
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
# 加载所需软件
module load sentieon/201911
nt=$LSB_DJOB_NUMPROC
fasta=/public/home/test/haoliu/work/callsnp/data/hg38.fa
fq1=/public/home/test/haoliu/work/callsnp/data/ERR194146_1.fastq.gz
fq2=/public/home/test/haoliu/work/callsnp/data/ERR194146_2.fastq.gz
i=sample1
sortedBam=ERR194146.sorted.bam
(sentieon bwa mem -M -R "@RG\tID:${i}\tSM:${i}\tPL:$platform" -t $nt -K 10000000 $fasta $fq1 $fq2 || echo -n 'error' ) | sentieon util sort -r $fasta -o $sortedBam -t $nt --sam2bam -i -
参考文档¶
Computational_performance_and_accuracy_of_Sentieon_DNASeq_variant_calling_workflow
Sentieon_DNASeq_Variant_Calling_Workflow_Demonstrates_Strong_Computational_Performance_and_Accuracy
本文阅读量 次本站总访问量 次