跳转至

deepvariant

DeepVariant是由Google Brain开发的一个开源软件,用于对基因组序列进行变异检测。它利用深度学习技术来实现高质量、高准确性的变异检测,能够在人类基因组的全长上实现高达99%的准确率。

DeepVariant使用卷积神经网络(CNN)来对原始DNA测序数据进行分析,并识别基因组中的单核苷酸变异(SNP)和插入/缺失(indel)。它还能够检测复杂的变异形式,如基因重排和基因剪接变异。DeepVariant能够处理各种类型的测序数据,包括Illumina、PacBio和Nanopore等平台产生的数据。

DeepVariant运行过程中有三个核心步骤,分别用于生成变异候选位点、对候选位点进行筛选和重新分类,并将筛选后的位点转换为VCF格式输出。以下是这三个步骤的具体作用:

  • make_examples:这个步骤用于从原始的测序数据中生成变异候选位点。具体地,make_examples会将原始测序数据划分成许多小片段,然后对每个小片段进行特征提取,得到一组特征向量。接着,make_examples会将特征向量输入卷积神经网络模型,模型会输出该小片段中可能存在的变异候选位点。

  • call_variants:在make_examples步骤生成的变异候选位点中,有许多可能是假阳性(False Positive)或假阴性(False Negative),因此需要对这些位点进行筛选和重新分类。call_variants步骤会对make_examples生成的变异候选位点进行重评估,并根据一些额外的过滤条件对位点进行过滤。这个步骤使用的是支持向量机(Support Vector Machine)模型,对候选位点进行打分和筛选。

  • postprocess_variants:在经过call_variants筛选后,最后的变异位点需要进行重新分类,并输出为VCF格式的文件。这个步骤会使用一些额外的过滤条件,如深度、质量值等,来对最终的变异位点进行重新分类,并将结果输出为VCF文件。

需要注意的是,这三个步骤并不是独立的,而是互相依赖的。make_examples生成的变异候选位点是call_variants和postprocess_variants的输入,而call_variants生成的筛选结果是postprocess_variants的输入。这三个步骤通常会一起运行,形成一个完整的DeepVariant工作流。

安装

$ module load Singularity/3.7.3

# CPU版
$ singularity pull docker://google/deepvariant:latest
# 查看版本号
$ singularity exec deepvariant_latest.sif run_deepvariant  --version
DeepVariant version 1.6.0

# GPU版
$ singularity pull docker://google/deepvariant:latest-gpu

使用

deepvariant 具有CPU和GPU两个版本,在GPU版本中call_variants可使用GPU加速,make_examples和postprocess_variants只能使用CPU。

CPU

#BSUB -J deepvariant
#BSUB -n 36
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal


thread=$LSB_DJOB_NUMPROC

wdr="/public/home/software/test/human_callsnp/"

module load Singularity/3.7.3

time singularity exec $/deepvariant/1.4.0.sif run_deepvariant --model_type=WGS  \
    --ref=${wdr}/hg38.fa  \
    --reads=${wdr}/ERR194146_sort_redup.bam  \
    --output_vcf=${wdr}/deepvariant/ERR194146.vcf \
    --output_gvcf=${wdr}/deepvariant/ERR194146.gvcf \
    --num_shards=${thread}
以人的30x WGS数据为例,各步骤运行时间如下,可以看到主要时间消耗步骤为call_variants,占比92%。
步骤时间最大内存(GB)
make_examples83m5217
call_variants1256m164 GPU版77m31
postprocess_variants23m3217
total1363m46-

CPU+GPU

鉴于上面的步骤,可以将deepvariant分开运行,make_examples和postprocess_variants在普通CPU节点上运行,call_variants在GPU节点运行。这样可以最大化利用少量的GPU资源加速运行群体多样本数据。

  • 单个样本运行

单个样本分3个脚本运行

#BSUB -J make_examples
#BSUB -n 36
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal

thread=$LSB_DJOB_NUMPROC
steps=$((${thread}-1))

wdr="/public/home/software/test/human_callsnp/"
id=ERR194146
module load Singularity/3.7.3
module load parallel/20180222

mkdir tmp/${id}
mkdir output/${id}

time seq 0 ${steps} | parallel -q --halt 2 --line-buffer singularity exec /share/Singularity/deepvariant/1.4.0.sif make_examples --mode calling --ref "/public/home/software/test/human_callsnp//hg38.fa" --reads "/public/home/software/test/human_callsnp//${id}_sort_redup.bam" --examples "tmp/${id}/make_examples.tfrecord@${thread}.gz" --channels "insert_size" --gvcf "tmp/${id}/gvcf.tfrecord@${thread}.gz" --task {}
#BSUB -J call_variants
#BSUB -n 1
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu
#BSUB -m gpu01

thread=$LSB_DJOB_NUMPROC
steps=$((${thread}-1))

wdr="/public/home/software/test/human_callsnp/"
id=ERR194146
module load Singularity/3.7.3
module load parallel/20180222

time singularity exec  /share/Singularity/deepvariant/1.4.0-gpu.sif call_variants  --outfile "output/${id}/call_variants_output.tfrecord.gz"  --examples "tmp/${id}/make_examples.tfrecord@36.gz"  --checkpoint "/opt/models/wgs/model.ckpt"
#BSUB -J call_variants
#BSUB -n 1
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu
#BSUB -m gpu01

thread=$LSB_DJOB_NUMPROC
steps=$((${thread}-1))

wdr="/public/home/software/test/human_callsnp/"
id=ERR194146
module load Singularity/3.7.3
module load parallel/20180222

time singularity exec /share/Singularity/deepvariant/1.4.0.sif postprocess_variants   --ref "/public/home/software/test/human_callsnp//hg38.fa"   --infile "output/${id}/call_variants_output.tfrecord.gz"   --outfile "output/${id}/${id}.vcf.gz" 
  • 多样本运行

多个样本可以使用这个脚本

#BSUB -J deepvariant
#BSUB -n 1
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu
#BSUB -m gpu01

thread=36
steps=$((${thread}-1))

wdr="/public/home/software/test/human_callsnp/"
for id in $(cat list.txt)
do
    time singularity exec /share/Singularity/deepvariant/1.4.0.sif postprocess_variants   --ref "/public/home/software/test/human_callsnp//hg38.fa"   --infile "output/${id}/call_variants_output.tfrecord.gz"   --outfile "output/${id}/${id}.vcf.gz" 

    bsub -K -J make_examples_${id} -n ${thread} -o %J.make_examples_${id}.out -e %J.make_examples_${id}.err -R span[hosts=1] "module load Singularity/3.7.3;module load parallel/20180222;time seq 0 ${steps} | parallel -q --halt 2 --line-buffer singularity exec /share/Singularity/deepvariant/1.4.0.sif make_examples --mode calling --ref "/public/home/software/test/human_callsnp//hg38.fa" --reads "/public/home/software/test/human_callsnp//${id}_sort_redup.bam" --examples "tmp/${id}/make_examples.tfrecord@${thread}.gz" --channels "insert_size" --gvcf "tmp/${id}/gvcf.tfrecord@${thread}.gz" --task {}"

    bsub -K -J call_variants_${id} -n 1 -o %J.call_variants_${id}.out -e %J.call_variants_${id}.err -q gpu -m gpu01 -R span[hosts=1] "module load Singularity/3.7.3;module load parallel/20180222;time singularity exec  /share/Singularity/deepvariant/1.4.0-gpu.sif call_variants  --outfile "output/${id}/call_variants_output.tfrecord.gz"  --examples "tmp/${id}/make_examples.tfrecord@${thread}.gz"  --checkpoint "/opt/models/wgs/model.ckpt""

    bsub -K -J make_examples_${id} -n 1 -o %J.make_examples_${id}.out -e %J.make_examples_${id}.err -R span[hosts=1] "module load Singularity/3.7.3;time singularity exec /share/Singularity/deepvariant/1.4.0.sif postprocess_variants   --ref "/public/home/software/test/human_callsnp//hg38.fa"   --infile "output/${id}/call_variants_output.tfrecord.gz"   --outfile "output/${id}/${id}.vcf.gz""
done
步骤时间最大内存(GB)
make_examples119m1816
call_variants77m314
postprocess_variants19m317
total216m6-
软件流程时间内存其它
bwa-mem2 samtools sort342min153G-
bwa samtools sort391min123G-
bwa-mem2528min132G-
bwa-mem2 samblaster samtools211min131G-
time bwa-mem2 mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz |samtools sort -@ $thread -o ERR194146_srt.bam

time bwa mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' ../hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz |samtools sort -@ $thread -o ERR194146_srt2.bam

# sambamba  --tmpdir=TMPDIR
time bwa-mem2  mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' ./hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz | sambamba view -t $thread -S -f bam /dev/stdin |sambamba sort  /dev/stdin -t $thread -o ERR194146_srt3.bam

# samblaster
time bwa-mem2  mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' ./hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz | samblaster | samtools view -@ $thread -Sb - ERR194146_srt4.bam
本文阅读量  次
本站总访问量  次