alphafold3

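Introduction

alphafold3 is an AI model developed by Google DeepMind for predicting the structures of biomolecular complexes. It takes a JSON file describing the biomolecular system as input and outputs predicted structures together with confidence scores. Structure prediction proceeds in two main steps:

data pipeline: under the control of a set of parameters, runs bioinformatics searches on the input sequences to obtain the co-evolutionary information (multiple sequence alignment, MSA) and structural templates needed for prediction, merges them with the input, and writes the merged information out as a JSON file
inference: reads the JSON file containing the MSAs and templates, converts it into model input features, runs inference, and outputs the predicted structures in mmCIF format along with confidence scores

Project repository: https://github.com/google-deepmind/alphafold3/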

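Software image and database downloads

The download links on the Nanjing University mirror site can be used:

https://mirror.nju.edu.cn/alphafold/

For the model weights, use the copy that has already been obtained for the cluster.

Basic usage

Warning

alphafold3 must run in the GPU queue; request access to the GPU queue from the administrator.

Input file input.json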

{
  "name": "2PV7",
  "sequences": [
    {
      "protein": {
        "id": ["A", "B"],
        "sequence": "GMRESYANENQFGFKTINSDIHKIVIVGGYGKLGGLFARYLRASGYPISILDREDWAVAESILANADVVIVSVPINLTLETIERLKPYLTENMLLADLTSVKREPLAKMLEVHTGAVLGLHPMFGADIASMAKQVVVRCDGRFPERYEWLLEQIQIWGAKIYQTNATEHDHNMTYIQALRHFSTFANGLHLSKQPINLANLLALSSPIYRLELAMIGRLFAQDAELYADIIMDKSENLAVIETLKQTYDEALTFFENNDRQGFIDAFHKVRDWFGDYSEQFLKESRQLLQQANDLKQG"
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}
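
Because a malformed input file only fails after the job has waited in the queue, it can help to check it on the login node first. The following is a minimal sketch, not part of AlphaFold 3: check_af3_input is a hypothetical helper name, it assumes python3 is available, and the key check is a crude text search for the top-level keys that appear in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not provided by AlphaFold 3): sanity-check an input file.
check_af3_input() {
  local json="$1" key
  # python3 -m json.tool exits non-zero when the file is not well-formed JSON.
  python3 -m json.tool "$json" > /dev/null || return 1
  # Crude text search for the keys used in the example above; it does not
  # verify that they sit at the top level or hold valid values.
  for key in name sequences modelSeeds dialect version; do
    grep -q "\"$key\"" "$json" || { echo "missing key: $key" >&2; return 1; }
  done
  echo "OK"
}
```

Usage: check_af3_input input.json, run before submitting the job.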

Prepare a run directory and place the input file in it: mkdir af3_run; mv input.json af3_run.

Job script

#BSUB -J alphafold3
#BSUB -n 16
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu

module load Singularity/3.7.3

singularity exec \
    --nv \
    --bind af3_run:/af3_run \
    --bind /public/home/software/opt/af3_weights/:/af3_weights \
    --bind /public/home/software/opt/af3_db/:/af3_db \
    $IMAGE/alphafold/3.0.1.sif \
    /alphafold3_venv/bin/python /app/alphafold/run_alphafold.py \
    --json_path=/af3_run/input.json \
    --model_dir=/af3_weights \
    --db_dir=/af3_db \
    --output_dir=/af3_run

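The result files are in the af3_run/2pv7 directory (AlphaFold 3 lowercases the job name when naming the output directory). For how to interpret them, see the official documentation, AlphaFold 3 Output.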

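Running in separate steps

An alphafold3 run consists of two main steps, data pipeline and inference. The data pipeline uses only the CPU and takes a long time; inference uses the GPU to predict the structure.

Because GPUs on the cluster are limited, when many proteins need to be predicted it is advisable to run the data pipeline in the CPU queue first and then run inference in the GPU queue, which increases throughput.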
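
The two jobs can be chained so that inference starts only after the data pipeline has finished. The sketch below assumes the two job scripts are saved under the hypothetical names data_pipeline.lsf and inference.lsf, and that bsub prints its usual confirmation line of the form "Job <12345> is submitted to queue <normal>."; it relies on the LSF dependency option -w "done(job_id)". The function names are illustrative only.

```shell
#!/usr/bin/env bash
# Pull the numeric job ID out of bsub's confirmation line,
# e.g. "Job <12345> is submitted to queue <normal>."
extract_job_id() {
  sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Submit the CPU job, then the GPU job with a dependency on it.
# Both script paths are arguments; the names used below are hypothetical.
submit_two_step() {
  local data_script="$1" infer_script="$2" data_id
  data_id=$(bsub < "$data_script" | extract_job_id)
  [ -n "$data_id" ] || { echo "could not read job ID from bsub" >&2; return 1; }
  # -w "done(ID)" holds the second job until the first one ends successfully.
  bsub -w "done($data_id)" < "$infer_script"
}

# Example call (hypothetical file names):
#   submit_two_step data_pipeline.lsf inference.lsf
```

With this arrangement the GPU slot is requested only once the intermediate JSON exists, which is the point of splitting the run.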

Step 1: run the data pipeline by adding the --norun_inference flag.

#BSUB -J alphafold3_data_pipeline
#BSUB -n 16
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal

module load Singularity/3.7.3

singularity exec \
    --nv \
    --bind af3_run:/af3_run \
    --bind /public/home/software/opt/af3_weights/:/af3_weights \
    --bind /public/home/software/opt/af3_db/:/af3_db \
    $IMAGE/alphafold/3.0.1.sif \
    /alphafold3_venv/bin/python /app/alphafold/run_alphafold.py \
    --json_path=/af3_run/input.json \
    --model_dir=/af3_weights \
    --db_dir=/af3_db \
    --norun_inference \
    --output_dir=/af3_run

Step 2: run inference, using the intermediate file generated in step 1 as input (--json_path=/af3_run/2pv7/2pv7_data.json) and adding the --norun_data_pipeline flag.

#BSUB -J alphafold3_inference
#BSUB -n 4
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu

module load Singularity/3.7.3

singularity exec \
    --nv \
    --bind af3_run:/af3_run \
    --bind /public/home/software/opt/af3_weights/:/af3_weights \
    --bind /public/home/software/opt/af3_db/:/af3_db \
    $IMAGE/alphafold/3.0.1.sif \
    /alphafold3_venv/bin/python /app/alphafold/run_alphafold.py \
    --json_path=/af3_run/2pv7/2pv7_data.json \
    --model_dir=/af3_weights \
    --db_dir=/af3_db \
    --norun_data_pipeline \
    --output_dir=/af3_run

Result files: the intermediate file from step 1 is in af3_run/2pv7, and the results from step 2 are in af3_run/2pv7_20250213_101747.

$ tree af3_run/
af3_run/
├── 2pv7
│   └── 2pv7_data.json
├── 2pv7_20250213_101747
│   ├── 2pv7_confidences.json
│   ├── 2pv7_data.json
│   ├── 2pv7_model.cif
│   ├── 2pv7_summary_confidences.json
│   ├── ranking_scores.csv
│   ├── seed-1_sample-0
│   │   ├── confidences.json
│   │   ├── model.cif
│   │   └── summary_confidences.json
│   ├── seed-1_sample-1
│   │   ├── confidences.json
│   │   ├── model.cif
│   │   └── summary_confidences.json
│   ├── seed-1_sample-2
│   │   ├── confidences.json
│   │   ├── model.cif
│   │   └── summary_confidences.json
│   ├── seed-1_sample-3
│   │   ├── confidences.json
│   │   ├── model.cif
│   │   └── summary_confidences.json
│   ├── seed-1_sample-4
│   │   ├── confidences.json
│   │   ├── model.cif
│   │   └── summary_confidences.json
│   └── TERMS_OF_USE.md
└── input.json
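
To see which of the five samples scored best, ranking_scores.csv can be sorted on the command line. This is a minimal sketch that assumes the file has one header row followed by seed,sample,ranking_score columns; check the header of your own file before relying on it. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Print the row with the highest ranking score.
# Assumption: header row, then columns seed,sample,ranking_score.
best_sample() {
  local csv="$1"
  # Drop the header, sort numerically (descending) on column 3, keep the top row.
  tail -n +2 "$csv" | LC_ALL=C sort -t, -k3,3gr | head -n 1
}

# Turn that row into the matching sub-directory name, e.g. seed-1_sample-3.
best_sample_dir() {
  best_sample "$1" | awk -F, '{ printf "seed-%s_sample-%s\n", $1, $2 }'
}

# Example call:
#   best_sample_dir af3_run/2pv7_20250213_101747/ranking_scores.csv
```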

Version updates

3.0.2

Warning

3.0.2 does not support P100 GPUs, so inference cannot run on GPU01 (Not supported on Tesla P100-PCIE-16GB.). Add #BSUB -gpu "num=1:gmem=20G" to the LSF job options to avoid being scheduled onto GPU01.
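
As an extra guard, a job script can check the allocated GPU before starting inference and stop early if it is a P100. The sketch below is an assumption-laden illustration: is_p100 and abort_on_p100 are hypothetical helper names, and the check is a simple match on the device name reported by nvidia-smi, not a test of what AlphaFold 3 actually supports.

```shell
#!/usr/bin/env bash
# Return success if the given GPU name looks like a P100.
is_p100() {
  case "$1" in
    *P100*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Query the first visible GPU and exit the job early if it is a P100.
abort_on_p100() {
  local gpu_name
  gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
  if is_p100 "$gpu_name"; then
    echo "Allocated GPU is '$gpu_name'; resubmit with a gmem request" >&2
    exit 1
  fi
}

# In a job script, call abort_on_p100 before the singularity command.
```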

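sharded database

3.0.2 supports sharded databases: large databases such as mgy_clusters_2022_05.fa (120G) and uniprot_all_2021_04.fa (102G) are split into several smaller files so that multiple Jackhmmer processes can search them in parallel, which significantly speeds up the multiple sequence alignment.

Here mgy_clusters_2022_05.fa is split as an example: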

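$ module load seqkit
$ mkdir shards_mgy_clusters
# Count the sequences; the count is passed to AF3 as --mgnify_z_value
$ grep -c "^>" mgy_clusters_2022_05.fa
# Split into 16 equal parts
$ seqkit split2 -p 16 -O shards_mgy_clusters/ mgy_clusters_2022_05.fa
# AF3 has strict naming requirements for sharded databases, so the files must be renamed accordingly
# Renaming script: rename.sh (uses bash arithmetic, so run it with bash)
$ cd shards_mgy_clusters
$ bash rename.sh
$ ls mgy*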
mgy_clusters_2022_05.fa-00000-of-00016  mgy_clusters_2022_05.fa-00004-of-00016  mgy_clusters_2022_05.fa-00008-of-00016  mgy_clusters_2022_05.fa-00012-of-00016
mgy_clusters_2022_05.fa-00001-of-00016  mgy_clusters_2022_05.fa-00005-of-00016  mgy_clusters_2022_05.fa-00009-of-00016  mgy_clusters_2022_05.fa-00013-of-00016
mgy_clusters_2022_05.fa-00002-of-00016  mgy_clusters_2022_05.fa-00006-of-00016  mgy_clusters_2022_05.fa-00010-of-00016  mgy_clusters_2022_05.fa-00014-of-00016
mgy_clusters_2022_05.fa-00003-of-00016  mgy_clusters_2022_05.fa-00007-of-00016  mgy_clusters_2022_05.fa-00011-of-00016  mgy_clusters_2022_05.fa-00015-of-00016

rename.sh

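#!/bin/bash
# Rename the seqkit split2 output (*.part_001.fa ... *.part_016.fa) to the
# <name>-<index>-of-<total> pattern, with indices starting at 00000.
TOTAL=16

for f in *.fa; do
  # Extract the 1-based part number written by seqkit
  num=$(echo "$f" | grep -oP '(?<=part_)\d+')
  # Convert to a 0-based, zero-padded index; 10# forces base 10 so that
  # values such as 008 are not parsed as octal
  idx=$(printf "%05d" $((10#$num - 1)))
  total=$(printf "%05d" "$TOTAL")
  mv "$f" "mgy_clusters_2022_05.fa-${idx}-of-${total}"
done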

The other databases are handled the same way.
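
The renaming step above is tied to one database name and to the .fa extension, so repeating it for the other databases means editing the script each time. The following sketch generalizes it under two assumptions: that seqkit split2 names its output <stem>.part_NNN.<ext>, and that the target pattern is <database file name>-<index>-of-<total> as in the listing above. count_fasta_records and rename_shards are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Count FASTA records; this is the number to pass as the matching *_z_value flag.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Rename every *.part_NNN.* file in DIR to BASE-<index>-of-<total>.
# Pass TOTAL without leading zeros (e.g. 16).
rename_shards() {
  local dir="$1" base="$2" total="$3" f name num idx
  for f in "$dir"/*.part_*; do
    [ -e "$f" ] || continue
    name=${f##*/}
    # 1-based part number written by seqkit (assumed naming scheme)
    num=$(printf '%s\n' "$name" | sed -n 's/.*\.part_\([0-9][0-9]*\).*/\1/p')
    [ -n "$num" ] || continue
    # 0-based, zero-padded index; 10# avoids octal parsing of values like 008
    idx=$(printf '%05d' $((10#$num - 1)))
    mv "$f" "$dir/$(printf '%s-%s-of-%05d' "$base" "$idx" "$total")"
  done
}

# Example for one database (16 shards):
#   count_fasta_records uniref90_2022_05.fa
#   seqkit split2 -p 16 -O shards_uniref90/ uniref90_2022_05.fa
#   rename_shards shards_uniref90 uniref90_2022_05.fa 16
```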

Single-step run

#BSUB -J alphafold3
#BSUB -n 32
#BSUB -gpu "num=1:gmem=20G"
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu

module load Singularity/3.7.3

singularity exec \
    --nv \
    --bind af3_run:/af3_run \
    --bind /public/home/software/opt/af3_weights/:/af3_weights \
    --bind /public/home/software/opt/af3_db/:/af3_db \
    $IMAGE/alphafold/3.0.2.sif  \
    /alphafold3_venv/bin/python /app/alphafold/run_alphafold.py \
    --json_path=/af3_run/input.json \
    --model_dir=/af3_weights \
    --output_dir=/af3_run \
    \
    --uniprot_cluster_annot_database_path=/af3_db/shards_uniprot/uniprot_all_2021_04.fa@16 \
    --mgnify_database_path=/af3_db/shards_mgy_clusters/mgy_clusters_2022_05.fa@16 \
    --uniref90_database_path=/af3_db/shards_uniref90/uniref90_2022_05.fa@16 \
    --small_bfd_database_path=/af3_db/shards_bfd-first/bfd-first_non_consensus_sequences.fasta@16 \
    --uniprot_cluster_annot_z_value=225619586 \
    --mgnify_z_value=623796864 \
    --uniref90_z_value=153742194 \
    --small_bfd_z_value=65984053 \
    \
    --pdb_database_path=/af3_db/mmcif_files \
    --seqres_database_path=/af3_db/pdb_seqres_2022_09_28.fasta \
    \
    --ntrna_database_path=/af3_db/nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta \
    --rfam_database_path=/af3_db/rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta \
    --rna_central_database_path=/af3_db/rnacentral_active_seq_id_90_cov_80_linclust.fasta \
    \
    --jackhmmer_n_cpu=1 \
    --jackhmmer_max_parallel_shards=16 \
    --nhmmer_n_cpu=1 \
    --nhmmer_max_parallel_shards=16

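Running in separate steps

Step 1: run the data pipeline by adding the --norun_inference flag. This step only uses the CPU to run the multiple sequence alignment, so an ordinary node is sufficient.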

#BSUB -J alphafold3
#BSUB -n 16
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal

module load Singularity/3.7.3

singularity exec \
    --nv \
    --bind af3_run:/af3_run \
    --bind /public/home/software/opt/af3_weights/:/af3_weights \
    --bind /public/home/software/opt/af3_db/:/af3_db \
    $IMAGE/alphafold/3.0.2.sif  \
    /alphafold3_venv/bin/python /app/alphafold/run_alphafold.py \
    --json_path=/af3_run/input.json \
    --model_dir=/af3_weights \
    --norun_inference \
    --output_dir=/af3_run \
    \
    --uniprot_cluster_annot_database_path=/af3_db/shards_uniprot/uniprot_all_2021_04.fa@16 \
    --mgnify_database_path=/af3_db/shards_mgy_clusters/mgy_clusters_2022_05.fa@16 \
    --uniref90_database_path=/af3_db/shards_uniref90/uniref90_2022_05.fa@16 \
    --small_bfd_database_path=/af3_db/shards_bfd-first/bfd-first_non_consensus_sequences.fasta@16 \
    --uniprot_cluster_annot_z_value=225619586 \
    --mgnify_z_value=623796864 \
    --uniref90_z_value=153742194 \
    --small_bfd_z_value=65984053 \
    \
    --pdb_database_path=/af3_db/mmcif_files \
    --seqres_database_path=/af3_db/pdb_seqres_2022_09_28.fasta \
    \
    --ntrna_database_path=/af3_db/nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta \
    --rfam_database_path=/af3_db/rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta \
    --rna_central_database_path=/af3_db/rnacentral_active_seq_id_90_cov_80_linclust.fasta \
    \
    --jackhmmer_n_cpu=1 \
    --jackhmmer_max_parallel_shards=16 \
    --nhmmer_n_cpu=1 \
    --nhmmer_max_parallel_shards=16

Step 2: run inference, using the intermediate file generated in step 1 as input (--json_path=/af3_run/2pv7/2pv7_data.json) and adding the --norun_data_pipeline flag. This step must run in the GPU queue.

#BSUB -J alphafold3
#BSUB -n 16
#BSUB -gpu "num=1:gmem=20G"
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q gpu

module load Singularity/3.7.3

singularity exec \
    --nv \
    --bind af3_run:/af3_run \
    --bind /public/home/software/opt/af3_weights/:/af3_weights \
    --bind /public/home/software/opt/af3_db2/:/af3_db \
    $IMAGE/alphafold/3.0.2.sif \
    /alphafold3_venv/bin/python /app/alphafold/run_alphafold.py \
    --json_path=/af3_run/2pv7/2pv7_data.json \
    --model_dir=/af3_weights \
    --norun_data_pipeline \
    --output_dir=/af3_run

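Performance test

Compared with alphafold 3.0.0, alphafold 3.0.1 brings many functional updates and is also considerably faster. Most of the gain comes from step 1, the data pipeline; inference improves only slightly. The test uses the 2PV7 input shown above.

Hardware: 1 × NVIDIA 4090D GPU, 2 × AMD 9654 CPU, 1 TB memory

Software	Run time (s)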
AlphaFold v3.0.0	2010
AlphaFold v3.0.1	635
AlphaFold v3.0.2	270
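
The run times in the table can be turned into speedup factors with a one-line calculation. The helper name speedup is illustrative, and the figures are the ones from the table.

```shell
#!/usr/bin/env bash
# Ratio of two run times, rounded to two decimals.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 2010 635   # 3.0.0 vs 3.0.1, prints 3.17
speedup 635 270    # 3.0.1 vs 3.0.2, prints 2.35
speedup 2010 270   # 3.0.0 vs 3.0.2, prints 7.44
```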

References

https://doc.nju.edu.cn/books/efe93/page/alphafold-3

https://docs.hpc.sjtu.edu.cn/app/bioinformatics/alphafold3.html
