open-omics-alphafold
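open-omics-alphafold is Intel's official build of AlphaFold optimized for Intel Xeon CPUs. Inference uses AVX512 FP32 and AMX-BF16 on 4th Gen Intel Xeon Scalable processors (Sapphire Rapids), with a speedup reported to be comparable to an NVIDIA A100.
Official repository: https://github.com/IntelLabs/open-omics-alphafold
Official benchmark data: Intel Xeon is all you need for AI inference: Performance Leadership on Real World Applications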
Installation and configuration¶
Note
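Installation downloads code from GitHub several times. If network problems block a download, use a GitHub mirror; see GitHub mirror for details.
Install and configure the oneAPI Base Toolkit; see oneAPI for details.
Download the repository
The installation directory is
~/opt/alphafold/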
$ cd ~/opt/alphafold/
$ git clone https://github.com/IntelLabs/open-omics-alphafold.git
Install the dependencies with micromamba
$ cd ~/opt/alphafold/open-omics-alphafold/
# Because of dependency conflicts during installation, remove jax==0.4.8 and jaxlib==0.4.7 from conda_requirements.yml
$ micromamba env create -f conda_requirements.yml
$ micromamba activate iaf2
# update submodules
(iaf2) $ git submodule update --init --recursive
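The manual edit of conda_requirements.yml described above can also be scripted. The following is a minimal sketch, assuming each pin sits on its own line in the file; `strip_jax_pins` is a hypothetical helper name and is not part of open-omics-alphafold.

```shell
#!/usr/bin/env bash
# Sketch: remove the two pins named in the text from a requirements file.
# Assumption: each pin is on its own line, e.g. "    - jax==0.4.8".
# strip_jax_pins is a hypothetical helper, not part of the upstream repository.
strip_jax_pins() {
    local file="$1"
    # Keep a backup so the original file can be restored.
    cp "$file" "$file.bak" || return 1
    # Drop only the exact pinned versions; every other line is kept.
    grep -v -E 'jax==0\.4\.8|jaxlib==0\.4\.7' "$file.bak" > "$file"
    # grep exits 1 when it prints nothing; an empty result is not an error here.
    return 0
}
```

A possible call is `strip_jax_pins conda_requirements.yml`, run before `micromamba env create`; the untouched original stays in `conda_requirements.yml.bak`.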
- Load the oneAPI environment, a recent GCC (GCC >= 9.4.0), and CMake. The versions already installed on the cluster are used here.
(iaf2) $ source /public/home/software/opt/bio/software/oneAPI_BaseKit/2024.0.1.46/setvars.sh intel64
(iaf2) $ source /public/home/software/opt/bio/software/oneAPI_BaseKit/2024.0.1.46/tbb/latest/env/vars.sh
(iaf2) $ module load GCC/9.4.0 CMake/3.19.8
(iaf2) $ export CC=/public/home/software/opt/bio/software/GCC/9.4.0/bin/gcc
- Install the AVX-512-optimized hh-suite
(iaf2) $ cd ~/opt/alphafold/open-omics-alphafold/
(iaf2) $ git clone --recursive https://github.com/IntelLabs/hh-suite.git
(iaf2) $ cd hh-suite
(iaf2) $ mkdir build && cd build
# The cluster CPUs are Skylake, so icelake-server is replaced with skylake-avx512; otherwise an error occurs. For other CPU models, use the matching -march target
(iaf2) $ cmake -DCMAKE_INSTALL_PREFIX=`pwd`/release -DCMAKE_CXX_COMPILER="icpx" -DCMAKE_CXX_FLAGS_RELEASE="-O3 -march=skylake-avx512" ..
(iaf2) $ make -j 4 && make install
(iaf2) $ ./release/bin/hhblits -h
Test the installed hh-suite, with ./input//example.fa as the input file. If the command below runs without problems, hh-suite is installed correctly; otherwise it fails at run time with Illegal instruction.
(iaf2) $ cd ~/opt/alphafold/open-omics-alphafold/
(iaf2) $ ./hh-suite/build/release/bin/hhblits -i ./input//example.fa -cpu 20 -oa3m /tmp/tmpthcmjqar_output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d /public/home/software/opt/alphafold_data//bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d /public/home/software/opt/alphafold_data//uniref30/UniRef30_2021_03
- Install the AVX-512-optimized hmmer
(iaf2) $ cd ~/opt/alphafold/open-omics-alphafold/
(iaf2) $ git clone --recursive https://github.com/IntelLabs/hmmer.git
(iaf2) $ cd hmmer
# Compared with the upstream command, `make clean` is omitted here; including it causes an error
(iaf2) $ cd easel && autoconf && ./configure --prefix=`pwd` && cd ..
(iaf2) $ autoconf && CC=icx CFLAGS="-O3 -march=icelake-server -fPIC" ./configure --prefix=`pwd`/release
(iaf2) $ make -j 4 && make install
(iaf2) $ ./release/bin/jackhmmer -h
- Install the TPP-optimized AlphaFold2 attention module
(iaf2) $ cd ~/opt/alphafold/open-omics-alphafold/
(iaf2) $ git clone https://github.com/libxsmm/tpp-pytorch-extension
(iaf2) $ cd tpp-pytorch-extension
(iaf2) $ git submodule update --init
(iaf2) $ python setup.py install
(iaf2) $ python -c "from tpp_pytorch_extension.alphafold.Alpha_Attention import GatingAttentionOpti_forward"
- Extract the weights
(iaf2) $ cd ~/opt/alphafold/open-omics-alphafold/
(iaf2) $ mkdir weights && mkdir weights/extracted
# The AlphaFold2 database used here was downloaded to /public/home/software/opt/alphafold_data/
(iaf2) $ python extract_params.py --input /public/home/software/opt/alphafold_data/params/params_model_1.npz --output_dir ./weights/extracted/model_1
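The extraction step above covers model_1 only. If the other monomer models are needed, the same command can be repeated per model. The sketch below only prints the commands so they can be reviewed first. It assumes the parameter directory contains params_model_1.npz through params_model_5.npz, as in the standard AlphaFold parameter download; `print_extract_cmds` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) one extract_params.py command per monomer model.
# Assumption: params_model_1.npz ... params_model_5.npz exist in the params directory.
# print_extract_cmds is a hypothetical helper, not part of the upstream repository.
print_extract_cmds() {
    # Strip a trailing slash so the printed paths have single separators.
    local params_dir="${1%/}" out_dir="${2%/}"
    local i
    for i in 1 2 3 4 5; do
        echo "python extract_params.py --input ${params_dir}/params_model_${i}.npz --output_dir ${out_dir}/model_${i}"
    done
}
```

For example, `print_extract_cmds /public/home/software/opt/alphafold_data/params ./weights/extracted` lists five commands; after checking them, the output can be piped to `bash` from the repository directory.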
Testing the installation¶
Create the test directories and test data
(iaf2) $ cd ~/opt/alphafold/open-omics-alphafold/
(iaf2) $ mkdir test
(iaf2) $ mkdir test/{input,output}
# Test data
(iaf2) $ cat test/input/example.fa
>example file
ATGCCGCATGGTCGTC
(iaf2) $ cd ~/opt/alphafold/open-omics-alphafold/
# Upstream usage: bash online_preproc_baremetal.sh <root_home> <data-dir> <input-dir> <output-dir>
(iaf2) $ bash online_preproc_baremetal.sh ~/opt/alphafold/open-omics-alphafold/ /public/home/software/opt/alphafold_data/ test/input/ test/output/
(iaf2) $ cd ~/opt/alphafold/open-omics-alphafold/
# Upstream usage:
# bash online_inference_baremetal.sh <conda_env_path> <root_home> <data-dir> <input-dir> <output-dir> <model_name>
(iaf2) $ bash online_inference_baremetal.sh ~/micromamba/envs/iaf2 ~/opt/alphafold/open-omics-alphafold/ /public/home/software/opt/alphafold_data/ test/input/ test/output/ model_1
(iaf2) $ cd ~/opt/alphafold/open-omics-alphafold/
# Download the stereo_chemical_props.txt file
(iaf2) $ wget -q -P ./alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt --no-check-certificate
# Upstream usage; it lists an extra parameter, <conda_env_path>, that the command actually run below does not pass
# bash one_amber.sh <conda_env_path> <root_home> <data-dir> <input-dir> <output-dir> <model_name>
(iaf2) $ bash one_amber.sh ~/opt/alphafold/open-omics-alphafold/ /public/home/software/opt/alphafold_data/ test/input/ test/output/ model_1
Running¶
#BSUB -J alphafold
#BSUB -n 4
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
module load micromamba/1.3.0
eval "$(micromamba shell hook --shell=bash)"
micromamba activate iaf2
source /public/home/software/opt/bio/software/oneAPI_BaseKit/2024.0.1.46/setvars.sh intel64
module load GCC/9.4.0
# Must be run from the open-omics-alphafold installation directory
cd ~/opt/alphafold/open-omics-alphafold/
echo "alphafold start"
date
echo "preproc"
time bash online_preproc_baremetal.sh ~/opt/alphafold/open-omics-alphafold/ /public/home/software/opt/alphafold_data/ test/input/ test/output/
echo "inference"
time bash online_inference_baremetal.sh ~/micromamba/envs/iaf2 ~/opt/alphafold/open-omics-alphafold/ /public/home/software/opt/alphafold_data/ test/input/ test/output/ model_1
echo "amber relax"
time bash one_amber.sh ~/opt/alphafold/open-omics-alphafold/ /public/home/software/opt/alphafold_data/ test/input/ test/output/ model_1
date
echo "alphafold end"
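Running on the cluster¶
Installing open-omics-alphafold is fairly involved, and downloading code from GitHub is unreliable, so the version already configured on the cluster can be used directly.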
Warning
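Because of CPU instruction-set limits, open-omics-alphafold can run on the cluster only in the normal queue, the high queue, and on nodes s001-s004 of the smp queue; elsewhere it fails or runs very slowly.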
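Both the queue restriction above and the -march choice in the build steps depend on which instruction sets a node's CPU exposes. The sketch below classifies a Linux CPU flag list into a compiler target. The flag-to-target mapping is a heuristic assumption, not something taken from the upstream documentation, and `suggest_march` and `current_cpu_flags` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: map CPU feature flags to a plausible -march target.
# Assumption: the mapping below is a heuristic; check it against your compiler's documentation.
# suggest_march and current_cpu_flags are hypothetical helpers.
suggest_march() {
    # Pad with spaces so each flag can be matched as a whole word.
    local flags=" $1 "
    case "$flags" in
        *" amx_bf16 "*)     echo "sapphirerapids" ;;   # AMX-BF16 present
        *" avx512_vbmi2 "*) echo "icelake-server" ;;   # Ice Lake class AVX-512
        *" avx512f "*)      echo "skylake-avx512" ;;   # baseline AVX-512
        *)                  echo "none" ;;             # no AVX-512 support found
    esac
}

# Read the flag list of the first CPU entry on a Linux node.
current_cpu_flags() {
    grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2
}
```

On a compute node, `suggest_march "$(current_cpu_flags)"` prints one of the targets above; a result of `none` suggests the node lacks AVX-512 and is likely one of the nodes the warning excludes.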
Module version¶
# IAF2 working directory
$ mkdir IAF2_work
$ export IAF2WORK=${PWD}/IAF2_work/
$ cd IAF2_work
# Directories for the IAF2 input and output files
$ mkdir data
$ mkdir data/input
$ mkdir data/output
# example.fa is the input file
$ tree data/
data/
├── input
│ └── example.fa
└── output
$ module load open-omics-alphafold/1.0
$ ln -s ${IAF2ROOT}/iaf2 .
$ ln -s ${IAF2ROOT}/open-omics-alphafold .
#BSUB -J alphafold
#BSUB -n 4
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
cd $IAF2WORK
module load open-omics-alphafold/1.0
module load micromamba/1.3.0
eval "$(micromamba shell hook --shell=bash)"
micromamba activate $IAF2ROOT/iaf2
source /public/home/software/opt/bio/software/oneAPI_BaseKit/2024.0.1.46/setvars.sh intel64
# Must be run from the open-omics-alphafold installation directory
cd ${IAF2ROOT}/open-omics-alphafold
echo "alphafold start"
date
echo "preproc"
time bash online_preproc_baremetal.sh ${IAF2ROOT}/open-omics-alphafold /public/home/software/opt/alphafold_data/ $IAF2WORK/data/input/ $IAF2WORK/data/output/
echo "inference"
time bash online_inference_baremetal.sh ${IAF2ROOT}/iaf2 ${IAF2ROOT}/open-omics-alphafold /public/home/software/opt/alphafold_data/ $IAF2WORK/data/input/ $IAF2WORK/data/output/ model_1
echo "amber relax"
time bash one_amber.sh ${IAF2ROOT}/open-omics-alphafold /public/home/software/opt/alphafold_data/ $IAF2WORK/data/input/ $IAF2WORK/data/output/ model_1
date
echo "alphafold end"
Using the container image¶
#BSUB -J iaf2_monomer
#BSUB -n 10
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
module load Singularity/3.7.3
# Protein sequence preprocessing
time singularity exec \
-B /public/home/software/opt/alphafold_data/:/database/ -B /public/home/software/opt/alphafold_data/iaf2_monomer_weights/:/weights \
-B data/input/:/input/ -B data/output/:/output/ \
--pwd /opt/intel-alphafold2-monomer/ $IMAGE/alphafold/iaf2_monomer.sif conda run -n iaf2 bash online_preproc_baremetal.sh 10
# Structure prediction
time singularity exec \
-B /public/home/software/opt/alphafold_data/:/database/ -B /public/home/software/opt/alphafold_data/iaf2_monomer_weights/:/weights \
-B data/input/:/input/ -B data/output/:/output/ \
--pwd /opt/intel-alphafold2-monomer/ $IMAGE/alphafold/iaf2_monomer.sif conda run -n iaf2 bash online_inference_baremetal.sh 10 model_1,model_2,model_3,model_4,model_5
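- data/input/ and data/output/ are the input and output data directories. Input protein sequence files must end in .fa, and data/output/ must already exist.
- bash online_preproc_baremetal.sh 10 runs protein preprocessing with 10 threads; the default is 4 threads.
- bash online_inference_baremetal.sh 10 model_1,model_2,model_3,model_4,model_5 runs structure prediction with 10 threads (default: 4) using all five models, model_1,model_2,model_3,model_4,model_5. A subset such as model_1,model_2,model_3 can be used instead.
- Apart from the input and output directories, the thread count, and the model selection, no other parameters need to be adjusted.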
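The two preconditions in the notes above (at least one .fa file in the input directory, and an existing output directory) can be checked before a job is submitted. This is a minimal sketch; `check_iaf2_io` is a hypothetical helper name and is not part of the container image.

```shell
#!/usr/bin/env bash
# Sketch: pre-flight check for the input/output directories described in the notes.
# check_iaf2_io is a hypothetical helper, not part of open-omics-alphafold.
check_iaf2_io() {
    local input_dir="$1" output_dir="$2"
    # The output directory must exist before the job starts.
    if [ ! -d "$output_dir" ]; then
        echo "output directory missing: $output_dir" >&2
        return 1
    fi
    # ls fails when the pattern matches no file, i.e. there is no .fa input.
    if ! ls "$input_dir"/*.fa >/dev/null 2>&1; then
        echo "no .fa files in: $input_dir" >&2
        return 1
    fi
    return 0
}
```

One way to use it is to place `check_iaf2_io data/input data/output || exit 1` near the top of the job script, so a misconfigured job stops before the long preprocessing step.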
benchmark¶
The protein sequences are the same as those used for the alphafold benchmark (length about 250).
- seq1
>ghd7
MSMGPAAGEGCGLCGADGGGCCSRHRHDDDGFPFVFPPSACQGIGAPAPPVHEFQFFGNDGGGDDGESVAWLFDDYPPPSPVAAAAGMHHRQPPYDGVVAPPSLFRRNTGAGGLTFDVSLGERPDLDAGLGLGGGGGRHAEAAASATIMSYCGSTFTDAASSMPKEMVAAMADDGESLNPNTVVGAMVEREAKLMRYKEKRKKRCYEKQIRYASRKAYAEMRPRVRGRFAKEPDQEAVAPPSTYVDPSRLELGQWFR
Software | Hardware | Runtime (s) | Peak memory (MB) | Speedup |
---|---|---|---|---|
alphafold2 | CPU(intel 6150) | 232305/12164 | 15024/15084 | 1 |
open-omics-alphafold | CPU(intel 6150) | 1646 | 21669 | 141 |
alphafold2 | GPU(P100) | 4523 | 13711 | 51 |
- seq2
>2LHC_1|Chain A|Ga98|artificial gene (32630)
PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK
Software | Hardware | Runtime (s) | Peak memory (MB) | Speedup |
---|---|---|---|---|
alphafold2 | CPU(intel 6150) | 5431 | 14837 | 1 |
open-omics-alphafold | CPU(intel 6150) | 1061 | 20909 | |
alphafold2 | GPU(P100) | 2380 | 13543 | |
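The speedup figures in the seq1 table are consistent with dividing the first alphafold2 CPU runtime (232305 s) by each runtime and rounding to a whole number. The sketch below reproduces that arithmetic. This reading of how the column was derived is an assumption, and `speedup` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: speedup = baseline runtime / measured runtime, rounded to an integer.
# Assumption: this is how the speedup column of the seq1 table was derived.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.0f\n", base / t }'
}
```

With the seq1 values, `speedup 232305 1646` prints 141 and `speedup 232305 4523` prints 51, matching the table. The same helper can be applied to the seq2 runtimes, whose speedup cells were left blank.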
open-omics-alphafold-multimer¶
Note
open-omics-alphafold-multimer can predict the structures of protein complexes.
Using the container image¶
The job script for the cluster is as follows:
#BSUB -J iaf2_multimer
#BSUB -n 10
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
module load Singularity/3.7.3
# Protein sequence preprocessing
time singularity exec \
-B /public/home/software/opt/alphafold_data/:/database/ -B /public/home/software/opt/alphafold_data/iaf2_multimer_weights/:/weights \
-B data/input/:/input/ -B data/output/:/output/ \
--pwd /opt/intel-alphafold2-multimer/ $IMAGE/alphafold/iaf2_multimer.sif conda run -n iaf2 bash online_preproc_multimer.sh 10
# Structure prediction
time singularity exec \
-B /public/home/software/opt/alphafold_data/:/database/ -B /public/home/software/opt/alphafold_data/iaf2_multimer_weights/:/weights \
-B data/input/:/input/ -B data/output/:/output/ \
--pwd /opt/intel-alphafold2-multimer/ $IMAGE/alphafold/iaf2_multimer.sif conda run -n iaf2 bash online_inference_multimer.sh 10 model_2_multimer_v3,model_3_multimer_v3,model_4_multimer_v3
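Notes:
- data/input/ and data/output/ are the input and output data directories. Input protein sequence files must end in .fa, and data/output/ must already exist.
- bash online_preproc_multimer.sh 10 runs protein preprocessing with 10 threads; the default is 4 threads.
- online_inference_multimer.sh 10 model_2_multimer_v3,model_3_multimer_v3,model_4_multimer_v3 runs structure prediction with 10 threads (default: 4) using the models model_2_multimer_v3,model_3_multimer_v3,model_4_multimer_v3. All five models, model_1_multimer_v3,model_2_multimer_v3,model_3_multimer_v3,model_4_multimer_v3,model_5_multimer_v3, can also be used together.
- Apart from the input and output directories, the thread count, and the model selection, no other parameters need to be adjusted.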
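The model argument described above is a comma-separated list of names that follow the pattern model_N_multimer_v3. A small helper can build that list from model numbers, which avoids typing errors in long names. This is a sketch only; `multimer_models` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: build the comma-separated model list from model numbers.
# multimer_models is a hypothetical helper; the name pattern is taken from the commands above.
multimer_models() {
    local out="" i
    for i in "$@"; do
        # Add a comma only between entries, never before the first one.
        out="${out:+$out,}model_${i}_multimer_v3"
    done
    echo "$out"
}
```

For example, `multimer_models 2 3 4` yields the list used in the job script, and the result can be passed as the last argument of online_inference_multimer.sh.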
benchmark¶
seq1
>example1
MPADAKIRWGELEEDDGGDLDFLLPPRVVIGPDENGFKKTIEYRFDDDGNKVKVTTTTRVRKLARARLSKAAVERRSWGKFGDAASGDDASARLTVVSTEEILLERPRAPGSKADEPSASGDPLAMASKGGAVLMVCRTCGKKGDHWTSKCPYKDLAPPTDASDTPPTSDGPAALGGPAKGSYVAPRLRAGAVHTDAGHDMRRRNDENSVRVTNLSEDTREPDLLELFRTFGAVSRVYVAVDQKTGMSRGFGFVNFVHREDAEKAISKLNGYGYDNLILHVEMAAPRPT
>example2
MAQGEQGALAQFGEWLWSNPIEPDQNEELVDAQEEEGQILYLDQQAGLRYSYSQSTTLRPTPQGQSSSVPTFRNAQRFQVEYSSPTTVTRSQTSRLSLSHTRPPLQSAQCLLNSTLRAHNQPWVATLTHSPSQNQQPKTSPPNRLTGRNSGRAR
Software | Hardware | Runtime (s) | Peak memory (MB) | Speedup |
---|---|---|---|---|
alphafold2 | CPU(intel 6150) | 628010 | 24589 | - |
open-omics-alphafold | CPU(intel 6150) | 10112/35225 | 30520 | - |
alphafold2 | GPU(P100) | 65301 | 23289 | - |
seq2
>2MX4
PTRTVAISDAAQLPHDYCTTPGGTLFSTTPGGTRIIYDRKFLLDR
>2MX4
PTRTVAISDAAQLPHDYCTTPGGTLFSTTPGGTRIIYDRKFLLDR
>2MX4
PTRTVAISDAAQLPHDYCTTPGGTLFSTTPGGTRIIYDRKFLLDR
Software | Hardware | Runtime (s) | Peak memory (MB) | Speedup |
---|---|---|---|---|
alphafold2 | CPU(intel 6150) | 50218 | 16811 | - |
open-omics-alphafold | CPU(intel 6150) | 1626/4900 | 21431 | - |
alphafold2 | GPU(P100) | 7499 | 15526 | - |
seq3
>igmfc_a
HHHHHHHGHLVVITIIEPSLEDMLMNKKAQLVCDVNELVPGFLSVKWENDNGKTLTSRKGVTDKIAILDITYEDWSNGTVFYCAVDHMENLGDLVKKAYKRETGGVPQRPSVFLLAPAEQTSDNTVTLTCYVKDFYPKDVLVAWLVDDEPVERTSSSALYQFNTTSQIQSGRTYSVYSQLTFSNDLWKNEEVVYSCVVYHESMIKSTNIIMRTIDRTSNQPNLVNLSLNVPQRCMAQ
>igmfc_b
HHHHHHHGHLVVITIIEPSLEDMLMNKKAQLVCDVNELVPGFLSVKWENDNGKTLTSRKGVTDKIAILDITYEDWSNGTVFYCAVDHMENLGDLVKKAYKRETGGVPQRPSVFLLAPAEQTSDNTVTLTCYVKDFYPKDVLVAWLVDDEPVERTSSSALYQFNTTSQIQSGRTYSVYSQLTFSNDLWKNEEVVYSCVVYHESMIKSTNIIMRTIDRTSNQPNLVNLSLNVPQRCMAQ
>igmfc_c
HHHHHHHGHLVVITIIEPSLEDMLMNKKAQLVCDVNELVPGFLSVKWENDNGKTLTSRKGVTDKIAILDITYEDWSNGTVFYCAVDHMENLGDLVKKAYKRETGGVPQRPSVFLLAPAEQTSDNTVTLTCYVKDFYPKDVLVAWLVDDEPVERTSSSALYQFNTTSQIQSGRTYSVYSQLTFSNDLWKNEEVVYSCVVYHESMIKSTNIIMRTIDRTSNQPNLVNLSLNVPQRCMAQ
>igmfc_d
HHHHHHHGHLVVITIIEPSLEDMLMNKKAQLVCDVNELVPGFLSVKWENDNGKTLTSRKGVTDKIAILDITYEDWSNGTVFYCAVDHMENLGDLVKKAYKRETGGVPQRPSVFLLAPAEQTSDNTVTLTCYVKDFYPKDVLVAWLVDDEPVERTSSSALYQFNTTSQIQSGRTYSVYSQLTFSNDLWKNEEVVYSCVVYHESMIKSTNIIMRTIDRTSNQPNLVNLSLNVPQRCMAQ
>igmfc_e
HHHHHHHGHLVVITIIEPSLEDMLMNKKAQLVCDVNELVPGFLSVKWENDNGKTLTSRKGVTDKIAILDITYEDWSNGTVFYCAVDHMENLGDLVKKAYKRETGGVPQRPSVFLLAPAEQTSDNTVTLTCYVKDFYPKDVLVAWLVDDEPVERTSSSALYQFNTTSQIQSGRTYSVYSQLTFSNDLWKNEEVVYSCVVYHESMIKSTNIIMRTIDRTSNQPNLVNLSLNVPQRCMAQ
>igmfc_f
HHHHHHHGHLVVITIIEPSLEDMLMNKKAQLVCDVNELVPGFLSVKWENDNGKTLTSRKGVTDKIAILDITYEDWSNGTVFYCAVDHMENLGDLVKKAYKRETGGVPQRPSVFLLAPAEQTSDNTVTLTCYVKDFYPKDVLVAWLVDDEPVERTSSSALYQFNTTSQIQSGRTYSVYSQLTFSNDLWKNEEVVYSCVVYHESMIKSTNIIMRTIDRTSNQPNLVNLSLNVPQRCMAQ
>igmfc_g
HHHHHHHGHLVVITIIEPSLEDMLMNKKAQLVCDVNELVPGFLSVKWENDNGKTLTSRKGVTDKIAILDITYEDWSNGTVFYCAVDHMENLGDLVKKAYKRETGGVPQRPSVFLLAPAEQTSDNTVTLTCYVKDFYPKDVLVAWLVDDEPVERTSSSALYQFNTTSQIQSGRTYSVYSQLTFSNDLWKNEEVVYSCVVYHESMIKSTNIIMRTIDRTSNQPNLVNLSLNVPQRCMAQ
>igmfc_h
HHHHHHHGHLVVITIIEPSLEDMLMNKKAQLVCDVNELVPGFLSVKWENDNGKTLTSRKGVTDKIAILDITYEDWSNGTVFYCAVDHMENLGDLVKKAYKRETGGVPQRPSVFLLAPAEQTSDNTVTLTCYVKDFYPKDVLVAWLVDDEPVERTSSSALYQFNTTSQIQSGRTYSVYSQLTFSNDLWKNEEVVYSCVVYHESMIKSTNIIMRTIDRTSNQPNLVNLSLNVPQRCMAQ
Software | Hardware | Runtime (s) | Peak memory (MB) | Speedup |
---|---|---|---|---|
alphafold2 | CPU(intel 6150) | - | - | - |
open-omics-alphafold | CPU(intel 6150) | 43306 | 61838 | - |
alphafold2 | GPU(P100) | 282203 | 25755 | - |