deepvariant
DeepVariant是由Google Brain开发的一个开源软件,用于对基因组序列进行变异检测。它利用深度学习技术来实现高质量、高准确性的变异检测,能够在人类基因组的全长上实现高达99%的准确率。
DeepVariant使用卷积神经网络(CNN)来对原始DNA测序数据进行分析,并识别基因组中的单核苷酸变异(SNP)和插入/缺失(indel)。它还能够检测复杂的变异形式,如基因重排和基因剪接变异。DeepVariant能够处理各种类型的测序数据,包括Illumina、PacBio和Nanopore等平台产生的数据。
DeepVariant运行过程中有三个核心步骤,分别用于生成变异候选位点、对候选位点进行筛选和重新分类,并将筛选后的位点转换为VCF格式输出。以下是这三个步骤的具体作用:
make_examples:这个步骤用于从原始的测序数据中生成变异候选位点。具体地,make_examples会将原始测序数据划分成许多小片段,然后对每个小片段进行特征提取,得到一组特征向量。接着,make_examples会将特征向量输入卷积神经网络模型,模型会输出该小片段中可能存在的变异候选位点。
call_variants:在make_examples步骤生成的变异候选位点中,有许多可能是假阳性(False Positive)或假阴性(False Negative),因此需要对这些位点进行筛选和重新分类。call_variants步骤会对make_examples生成的变异候选位点进行重评估,并根据一些额外的过滤条件对位点进行过滤。这个步骤使用的是支持向量机(Support Vector Machine)模型,对候选位点进行打分和筛选。
postprocess_variants:在经过call_variants筛选后,最后的变异位点需要进行重新分类,并输出为VCF格式的文件。这个步骤会使用一些额外的过滤条件,如深度、质量值等,来对最终的变异位点进行重新分类,并将结果输出为VCF文件。
需要注意的是,这三个步骤并不是独立的,而是互相依赖的。make_examples生成的变异候选位点是call_variants和postprocess_variants的输入,而call_variants生成的筛选结果是postprocess_variants的输入。这三个步骤通常会一起运行,形成一个完整的DeepVariant工作流。
安装¶
$ module load Singularity/3.7.3
# CPU版
$ singularity pull docker://google/deepvariant:latest
# 查看版本号
$ singularity exec deepvariant_latest.sif run_deepvariant --version
DeepVariant version 1.6.0
# GPU版
$ singularity pull docker://google/deepvariant:latest-gpu
使用¶
deepvariant 具有CPU和GPU两个版本,在GPU版本中call_variants可使用GPU加速,make_examples和postprocess_variants只能使用CPU。
CPU¶
#BSUB -J deepvariant
#BSUB -n 36
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
thread=$LSB_DJOB_NUMPROC
wdr="/public/home/software/test/human_callsnp/"
module load Singularity/3.7.3
time singularity exec $/deepvariant/1.4.0.sif run_deepvariant --model_type=WGS \
--ref=${wdr}/hg38.fa \
--reads=${wdr}/ERR194146_sort_redup.bam \
--output_vcf=${wdr}/deepvariant/ERR194146.vcf \
--output_gvcf=${wdr}/deepvariant/ERR194146.gvcf \
--num_shards=${thread}
步骤 | 时间 | 最大内存(GB) |
---|---|---|
make_examples | 83m52 | 17 |
call_variants | 1256m16 | 4 GPU版77m31 |
postprocess_variants | 23m32 | 17 |
total | 1363m46 | - |
CPU+GPU¶
鉴于上面的步骤,可以将deepvariant分开运行,make_examples和postprocess_variants在普通CPU节点上运行,call_variants在GPU节点运行。这样可以最大化利用少量的GPU资源加速运行群体多样本数据。
- 单个样本运行
单个样本分3个脚本运行
#BSUB -J make_examples
#BSUB -n 36
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
thread=$LSB_DJOB_NUMPROC
steps=$((${thread}-1))
wdr="/public/home/software/test/human_callsnp/"
id=ERR194146
module load Singularity/3.7.3
module load parallel/20180222
mkdir tmp/${id}
mkdir output/${id}
time seq 0 ${steps} | parallel -q --halt 2 --line-buffer singularity exec /share/Singularity/deepvariant/1.4.0.sif make_examples --mode calling --ref "/public/home/software/test/human_callsnp//hg38.fa" --reads "/public/home/software/test/human_callsnp//${id}_sort_redup.bam" --examples "tmp/${id}/make_examples.tfrecord@${thread}.gz" --channels "insert_size" --gvcf "tmp/${id}/gvcf.tfrecord@${thread}.gz" --task {}
#BSUB -J call_variants
#BSUB -n 1
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu
#BSUB -m gpu01
thread=$LSB_DJOB_NUMPROC
steps=$((${thread}-1))
wdr="/public/home/software/test/human_callsnp/"
id=ERR194146
module load Singularity/3.7.3
module load parallel/20180222
time singularity exec /share/Singularity/deepvariant/1.4.0-gpu.sif call_variants --outfile "output/${id}/call_variants_output.tfrecord.gz" --examples "tmp/${id}/make_examples.tfrecord@36.gz" --checkpoint "/opt/models/wgs/model.ckpt"
#BSUB -J call_variants
#BSUB -n 1
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu
#BSUB -m gpu01
thread=$LSB_DJOB_NUMPROC
steps=$((${thread}-1))
wdr="/public/home/software/test/human_callsnp/"
id=ERR194146
module load Singularity/3.7.3
module load parallel/20180222
time singularity exec /share/Singularity/deepvariant/1.4.0.sif postprocess_variants --ref "/public/home/software/test/human_callsnp//hg38.fa" --infile "output/${id}/call_variants_output.tfrecord.gz" --outfile "output/${id}/${id}.vcf.gz"
- 多样本运行
多个样本可以使用这个脚本
#BSUB -J deepvariant
#BSUB -n 1
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu
#BSUB -m gpu01
thread=36
steps=$((${thread}-1))
wdr="/public/home/software/test/human_callsnp/"
for id in $(cat list.txt)
do
time singularity exec /share/Singularity/deepvariant/1.4.0.sif postprocess_variants --ref "/public/home/software/test/human_callsnp//hg38.fa" --infile "output/${id}/call_variants_output.tfrecord.gz" --outfile "output/${id}/${id}.vcf.gz"
bsub -K -J make_examples_${id} -n ${thread} -o %J.make_examples_${id}.out -e %J.make_examples_${id}.err -R span[hosts=1] "module load Singularity/3.7.3;module load parallel/20180222;time seq 0 ${steps} | parallel -q --halt 2 --line-buffer singularity exec /share/Singularity/deepvariant/1.4.0.sif make_examples --mode calling --ref "/public/home/software/test/human_callsnp//hg38.fa" --reads "/public/home/software/test/human_callsnp//${id}_sort_redup.bam" --examples "tmp/${id}/make_examples.tfrecord@${thread}.gz" --channels "insert_size" --gvcf "tmp/${id}/gvcf.tfrecord@${thread}.gz" --task {}"
bsub -K -J call_variants_${id} -n 1 -o %J.call_variants_${id}.out -e %J.call_variants_${id}.err -q gpu -m gpu01 -R span[hosts=1] "module load Singularity/3.7.3;module load parallel/20180222;time singularity exec /share/Singularity/deepvariant/1.4.0-gpu.sif call_variants --outfile "output/${id}/call_variants_output.tfrecord.gz" --examples "tmp/${id}/make_examples.tfrecord@${thread}.gz" --checkpoint "/opt/models/wgs/model.ckpt""
bsub -K -J make_examples_${id} -n 1 -o %J.make_examples_${id}.out -e %J.make_examples_${id}.err -R span[hosts=1] "module load Singularity/3.7.3;time singularity exec /share/Singularity/deepvariant/1.4.0.sif postprocess_variants --ref "/public/home/software/test/human_callsnp//hg38.fa" --infile "output/${id}/call_variants_output.tfrecord.gz" --outfile "output/${id}/${id}.vcf.gz""
done
步骤 | 时间 | 最大内存(GB) |
---|---|---|
make_examples | 119m18 | 16 |
call_variants | 77m31 | 4 |
postprocess_variants | 19m3 | 17 |
total | 216m6 | - |
软件流程 | 时间 | 内存 | 其它 |
---|---|---|---|
bwa-mem2 samtools sort | 342min | 153G | - |
bwa samtools sort | 391min | 123G | - |
bwa-mem2 | 528min | 132G | - |
bwa-mem2 samblaster samtools | 211min | 131G | - |
time bwa-mem2 mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz |samtools sort -@ $thread -o ERR194146_srt.bam
time bwa mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' ../hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz |samtools sort -@ $thread -o ERR194146_srt2.bam
# sambamba --tmpdir=TMPDIR
time bwa-mem2 mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' ./hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz | sambamba view -t $thread -S -f bam /dev/stdin |sambamba sort /dev/stdin -t $thread -o ERR194146_srt3.bam
# samblaster
time bwa-mem2 mem -t $thread -R '@RG\tID:test146\tPL:illumina\tLB:library\tSM:humen146' ./hg38.fa ../ERR194146_1.fastq.gz ../ERR194146_2.fastq.gz | samblaster | samtools view -@ $thread -Sb - ERR194146_srt4.bam
本站总访问量 次