deepvariant

DeepVariant是由Google Brain开发的一个开源软件，用于对基因组序列进行变异检测。它利用深度学习技术来实现高质量、高准确性的变异检测，能够在人类基因组的全长上实现高达99％的准确率。

DeepVariant使用卷积神经网络（CNN）来对原始DNA测序数据进行分析，并识别基因组中的单核苷酸变异（SNP）和插入/缺失（indel）。它还能够检测复杂的变异形式，如基因重排和基因剪接变异。DeepVariant能够处理各种类型的测序数据，包括Illumina、PacBio和Nanopore等平台产生的数据。

DeepVariant运行过程中有三个核心步骤，分别用于生成变异候选位点、对候选位点进行筛选和重新分类，并将筛选后的位点转换为VCF格式输出。以下是这三个步骤的具体作用：

make_examples：这个步骤用于从原始的测序数据中生成变异候选位点。具体地，make_examples会将原始测序数据划分成许多小片段，然后对每个小片段进行特征提取，得到一组特征向量。接着，make_examples会将特征向量输入卷积神经网络模型，模型会输出该小片段中可能存在的变异候选位点。
call_variants：在make_examples步骤生成的变异候选位点中，有许多可能是假阳性（False Positive）或假阴性（False Negative），因此需要对这些位点进行筛选和重新分类。call_variants步骤会对make_examples生成的变异候选位点进行重评估，并根据一些额外的过滤条件对位点进行过滤。这个步骤使用的是支持向量机（Support Vector Machine）模型，对候选位点进行打分和筛选。
postprocess_variants：在经过call_variants筛选后，最后的变异位点需要进行重新分类，并输出为VCF格式的文件。这个步骤会使用一些额外的过滤条件，如深度、质量值等，来对最终的变异位点进行重新分类，并将结果输出为VCF文件。

需要注意的是，这三个步骤并不是独立的，而是互相依赖的。make_examples生成的变异候选位点是call_variants和postprocess_variants的输入，而call_variants生成的筛选结果是postprocess_variants的输入。这三个步骤通常会一起运行，形成一个完整的DeepVariant工作流。

安装¶

$ module load Singularity/3.7.3

# CPU版
$ singularity pull docker://google/deepvariant:latest
# 查看版本号
$ singularity exec deepvariant_latest.sif run_deepvariant  --version
DeepVariant version 1.6.0

# GPU版
$ singularity pull docker://google/deepvariant:latest-gpu

使用¶

deepvariant 具有CPU和GPU两个版本，在GPU版本中call_variants可使用GPU加速，make_examples和postprocess_variants只能使用CPU。

CPU¶

#BSUB -J deepvariant
#BSUB -n 36
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal


thread=$LSB_DJOB_NUMPROC

wdr="/public/home/software/test/human_callsnp/"

module load Singularity/3.7.3

time singularity exec $/deepvariant/1.4.0.sif run_deepvariant --model_type=WGS  \
    --ref=${wdr}/hg38.fa  \
    --reads=${wdr}/ERR194146_sort_redup.bam  \
    --output_vcf=${wdr}/deepvariant/ERR194146.vcf \
    --output_gvcf=${wdr}/deepvariant/ERR194146.gvcf \
    --num_shards=${thread}

以人的30x WGS数据为例，各步骤运行时间如下，可以看到主要时间消耗步骤为call_variants，占比92%。

步骤	时间	最大内存(GB)
make_examples	83m52	17
call_variants	1256m16	4 GPU版77m31
postprocess_variants	23m32	17
total	1363m46	-

CPU+GPU¶

鉴于上面的步骤，可以将deepvariant分开运行，make_examples和postprocess_variants在普通CPU节点上运行，call_variants在GPU节点运行。这样可以最大化利用少量的GPU资源加速运行群体多样本数据。

单个样本运行

单样本make_examplescall_variantspostprocess_variants

单个样本分3个脚本运行

#BSUB -J make_examples
#BSUB -n 36
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal

thread=$LSB_DJOB_NUMPROC
steps=$((${thread}-1))

wdr="/public/home/software/test/human_callsnp/"
id=ERR194146
module load Singularity/3.7.3
module load parallel/20180222

mkdir tmp/${id}
mkdir output/${id}

time seq 0 ${steps} | parallel -q --halt 2 --line-buffer singularity exec /share/Singularity/deepvariant/1.4.0.sif make_examples --mode calling --ref "/public/home/software/test/human_callsnp//hg38.fa" --reads "/public/home/software/test/human_callsnp//${id}_sort_redup.bam" --examples "tmp/${id}/make_examples.tfrecord@${thread}.gz" --channels "insert_size" --gvcf "tmp/${id}/gvcf.tfrecord@${thread}.gz" --task {}

#BSUB -J call_variants
#BSUB -n 1
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu
#BSUB -m gpu01

thread=$LSB_DJOB_NUMPROC
steps=$((${thread}-1))

wdr="/public/home/software/test/human_callsnp/"
id=ERR194146
module load Singularity/3.7.3
module load parallel/20180222

time singularity exec  /share/Singularity/deepvariant/1.4.0-gpu.sif call_variants  --outfile "output/${id}/call_variants_output.tfrecord.gz"  --examples "tmp/${id}/make_examples.tfrecord@36.gz"  --checkpoint "/opt/models/wgs/model.ckpt"

#BSUB -J call_variants
#BSUB -n 1
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu
#BSUB -m gpu01

thread=$LSB_DJOB_NUMPROC
steps=$((${thread}-1))

wdr="/public/home/software/test/human_callsnp/"
id=ERR194146
module load Singularity/3.7.3
module load parallel/20180222

time singularity exec /share/Singularity/deepvariant/1.4.0.sif postprocess_variants   --ref "/public/home/software/test/human_callsnp//hg38.fa"   --infile "output/${id}/call_variants_output.tfrecord.gz"   --outfile "output/${id}/${id}.vcf.gz"

多样本运行

多样本样本batch_deepvariant

多个样本可以使用这个脚本

#BSUB -J deepvariant
#BSUB -n 1
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu
#BSUB -m gpu01

thread=36
steps=$((${thread}-1))

wdr="/public/home/software/test/human_callsnp/"
for id in $(cat list.txt)
do
    time singularity exec /share/Singularity/deepvariant/1.4.0.sif postprocess_variants   --ref "/public/home/software/test/human_callsnp//hg38.fa"   --infile "output/${id}/call_variants_output.tfrecord.gz"   --outfile "output/${id}/${id}.vcf.gz" 

    bsub -K -J make_examples_${id} -n ${thread} -o %J.make_examples_${id}.out -e %J.make_examples_${id}.err -R span[hosts=1] "module load Singularity/3.7.3;module load parallel/20180222;time seq 0 ${steps} | parallel -q --halt 2 --line-buffer singularity exec /share/Singularity/deepvariant/1.4.0.sif make_examples --mode calling --ref "/public/home/software/test/human_callsnp//hg38.fa" --reads "/public/home/software/test/human_callsnp//${id}_sort_redup.bam" --examples "tmp/${id}/make_examples.tfrecord@${thread}.gz" --channels "insert_size" --gvcf "tmp/${id}/gvcf.tfrecord@${thread}.gz" --task {}"

    bsub -K -J call_variants_${id} -n 1 -o %J.call_variants_${id}.out -e %J.call_variants_${id}.err -q gpu -m gpu01 -R span[hosts=1] "module load Singularity/3.7.3;module load parallel/20180222;time singularity exec  /share/Singularity/deepvariant/1.4.0-gpu.sif call_variants  --outfile "output/${id}/call_variants_output.tfrecord.gz"  --examples "tmp/${id}/make_examples.tfrecord@${thread}.gz"  --checkpoint "/opt/models/wgs/model.ckpt""

    bsub -K -J make_examples_${id} -n 1 -o %J.make_examples_${id}.out -e %J.make_examples_${id}.err -R span[hosts=1] "module load Singularity/3.7.3;time singularity exec /share/Singularity/deepvariant/1.4.0.sif postprocess_variants   --ref "/public/home/software/test/human_callsnp//hg38.fa"   --infile "output/${id}/call_variants_output.tfrecord.gz"   --outfile "output/${id}/${id}.vcf.gz""
done

步骤	时间	最大内存(GB)
make_examples	119m18	16
call_variants	77m31	4
postprocess_variants	19m3	17
total	216m6	-

软件流程	时间	内存	其它
bwa-mem2 samtools sort	342min	153G	-
bwa samtools sort	391min	123G	-
bwa-mem2	528min	132G	-
bwa-mem2 samblaster samtools	211min	131G	-

time bwa-mem2 mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz |samtools sort -@ $thread -o ERR194146_srt.bam

time bwa mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' ../hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz |samtools sort -@ $thread -o ERR194146_srt2.bam

# sambamba  --tmpdir=TMPDIR
time bwa-mem2  mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' ./hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz | sambamba view -t $thread -S -f bam /dev/stdin |sambamba sort  /dev/stdin -t $thread -o ERR194146_srt3.bam

# samblaster
time bwa-mem2  mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' ./hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz | samblaster | samtools view -@ $thread -Sb - ERR194146_srt4.bam

本文阅读量次
本站总访问量次