interproscan

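InterProScan is a tool for the functional annotation of protein and nucleotide sequences. It predicts the function, structure and domains of a sequence by scanning it against the annotations and signatures of several databases (such as InterPro, Pfam, PRINTS and PROSITE), using sequence comparison and statistical analysis to infer functional features.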

Web API

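EBI provides a remote computing service. The iprscan5.pl or iprscan5.py client uploads a local sequence file to the InterProScan servers, the analysis runs there, and the results are downloaded to the local machine. No local compute resources are consumed, but network access is required, so the client can be run directly on a login node. One result file is returned per sequence.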

$ wget https://raw.githubusercontent.com/ebi-wp/webservice-clients/master/perl/iprscan5.pl
$ chmod +x iprscan5.pl

$ ./iprscan5.pl --multifasta test_all_appl.fasta --maxJobs 25 --useSeqId --email test@test.com --outformat tsv
Submitting job for: UPI00043D6473
JobId: iprscan5-R20231012-030441-0244-77128907-p1m
RUNNING
RUNNING
RUNNING
RUNNING
FINISHED
Creating result file: UPI00043D6473.tsv.tsv
Submitting job for: UPI0004FABBC5
JobId: iprscan5-R20231012-030556-0901-62528903-p1m
RUNNING
RUNNING
RUNNING
RUNNING
FINISHED
Creating result file: UPI0004FABBC5.tsv.tsv
Submitting job for: UPI0002E0D40B
JobId: iprscan5-R20231012-030716-0570-58059556-p1m

$ ls
UPI00043D6473.tsv.tsv UPI0004FABBC5.tsv.tsv
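
Because one result file is returned per sequence, it is often convenient to merge the files into a single table afterwards. The following is a minimal sketch; the function name merge_ipr_results and the output name all_results.tsv are hypothetical, and it assumes the result files are headerless TSV so that plain concatenation is safe.

```shell
# Merge per-sequence result files into one table.
# Usage: merge_ipr_results OUTPUT FILE...
# Assumption: the result files carry no header line.
merge_ipr_results() {
    out=$1
    shift
    : > "$out"                       # start from an empty output file
    for f in "$@"; do
        [ -s "$f" ] || continue      # skip missing or empty result files
        awk '{ print }' "$f" >> "$out"   # awk also restores a missing final newline
    done
}
```

For example, `merge_ipr_results all_results.tsv UPI*.tsv.tsv` would collect the files listed above. The output name should not match the input pattern, otherwise the merged file would be read as an input.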

Running locally

Official documentation: https://interproscan-docs.readthedocs.io

$ module load interproscan/5.55-88.0
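$ interproscan.sh -i <input_file> -o <output_file> [options]

-i: input file; it may contain protein or nucleotide sequences.
-o: explicit output file name; it requires a single output format to be set with -f. To choose an output directory instead, use -d.
-f: output format(s) as a comma-separated list. Supported formats are tsv (tab-separated values), xml, json and gff3; if omitted, protein input produces TSV, XML and GFF3 output.
-cpu: number of CPU cores used for parallel computation; the default is 1.
-goterms: add Gene Ontology (GO) annotation.
-pathways: add KEGG pathway annotation.
-appl: run only the specified analyses. For example, -appl CDD runs only CDD, and -appl Pfam,Smart runs both Pfam and Smart.
-iprlookup: look up the corresponding InterPro entries to further annotate the predicted domains.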
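
To illustrate how these options combine, the sketch below assembles a command line and prints it instead of running it, so it can be inspected first. The function name ipr_command and the file name proteins.fasta are hypothetical; only options described above are used, and paths containing spaces are not handled.

```shell
# Print (do not run) an interproscan.sh command line for a FASTA file.
# Usage: ipr_command INPUT CPUS [APPLICATIONS]
ipr_command() {
    input=$1
    cpus=$2
    apps=${3:-}
    cmd="interproscan.sh -i $input -f tsv -cpu $cpus -iprlookup -goterms -pathways"
    # Restrict the run to selected analyses only when a list is given
    if [ -n "$apps" ]; then
        cmd="$cmd -appl $apps"
    fi
    printf '%s\n' "$cmd"
}
```

Calling `ipr_command proteins.fasta 8 Pfam,Smart` prints a command that can be copied into a terminal or a job script once it looks right.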

Basic usage

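InterProScan is compute-intensive, and each sequence can take a considerable time to analyse. By default the program therefore first compares the user's sequences with the results that InterPro has already computed: if a submitted sequence exactly matches an existing entry, the precomputed result is returned directly. This lookup requires network access; without a connection the program reports an error.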

$ module load interproscan/5.55-88.0
$ interproscan.sh -i test_single_protein.fasta -f tsv -cpu 20
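
Once a run finishes, the TSV output can be summarised with standard tools. The sketch below counts matches per analysis, relying on the fourth TSV column holding the analysis name (for example Pfam). The function name count_by_analysis is hypothetical, and the output file name test_single_protein.fasta.tsv in the usage note is an assumption about where the default output lands.

```shell
# Count matches per analysis (column 4) in an InterProScan TSV file.
# Usage: count_by_analysis RESULT.tsv
count_by_analysis() {
    awk -F '\t' '
        { n[$4]++ }                                    # tally each analysis name
        END { for (a in n) printf "%s\t%d\n", a, n[a] }
    ' "$1" | sort
}
```

For example, `count_by_analysis test_single_protein.fasta.tsv` prints one line per analysis with its number of matches.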

Offline usage

In an offline environment, add the -dp option so that all computation is performed locally. This takes longer.

$ module load interproscan/5.55-88.0
$ interproscan.sh -i test_single_protein.fasta -f tsv -cpu 20 -dp

Tips

Because the cluster's compute nodes have no network access, the -dp option must be added when running InterProScan locally on them.
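
A script that runs on both login and compute nodes can decide at run time whether -dp is needed. The following is a minimal sketch under stated assumptions: the function name ipr_lookup_flag and the IPR_PROBE override are hypothetical, and probing https://www.ebi.ac.uk with curl is only a rough stand-in for checking that the lookup service is reachable.

```shell
# Print "-dp" when the network probe fails, and nothing when it succeeds.
# IPR_PROBE (hypothetical) replaces the default connectivity check.
ipr_lookup_flag() {
    probe=${IPR_PROBE:-"curl -s --max-time 5 -o /dev/null https://www.ebi.ac.uk"}
    if $probe >/dev/null 2>&1; then
        :                        # online: keep the default lookup behaviour
    else
        printf '%s\n' "-dp"      # offline: compute everything locally
    fi
}
```

It could then be used as `interproscan.sh -i test_single_protein.fasta -f tsv -cpu 20 $(ipr_lookup_flag)`.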
