kingfisher

Introduction

kingfisher quickly downloads public sequencing data from multiple sources (EBI ENA, NCBI SRA, Amazon AWS, and Google Cloud). The user supplies one or more run accession numbers, such as ERR1739691, or BioProject accession numbers, such as PRJNA621514 or SRP260223.

Website: https://wwood.github.io/kingfisher-download/

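The software has three main modes:

get mode: download sequencing data from multiple sources
annotate mode: retrieve sample metadata
extract mode: convert formats, mainly sra files to fastq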

Installation

Install with conda:

$ conda create -n kingfisher -c conda-forge -c bioconda kingfisher
$ conda activate kingfisher
(kingfisher)$ kingfisher get -r SRR12118866 -m ena-ftp

Alternatively, pull the image with singularity:

$ singularity pull docker://wwood/kingfisher:0.3.1

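A manual installation is also possible and makes software version management easier. First install the dependencies sratoolkit, aria2, and aspera-connect, along with the Python packages pandas, tqdm, requests, extern, argparse-manpage-birdtools, and awscli.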

$ git clone https://github.com/wwood/kingfisher-download
$ cd kingfisher-download
$ python setup.py install --prefix=<install-path>
$ export PATH="<install-path>/bin:$PATH"
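Before a manual installation, it can help to confirm that the external tools are reachable on PATH. The following is a minimal sketch, not part of kingfisher: `check_tools` is a hypothetical helper, and the executable names are assumed to be the ones shipped by sratoolkit (prefetch, fasterq-dump), aria2 (aria2c), aspera-connect (ascp), and awscli (aws).

```shell
#!/usr/bin/env bash
# Sketch: report which external tools are missing from PATH.
# "check_tools" is a hypothetical helper, not part of kingfisher.

check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    # Exit status is 0 when every tool was found, 1 otherwise.
    return "$missing"
}

# Usage (executable names are assumptions, see the text above):
#   check_tools prefetch fasterq-dump aria2c ascp aws
```

The function prints one line per missing tool, so it can be used in a setup script to stop early before running `setup.py`.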

On the cluster:

$ module load kingfisher/0.3.1

Usage

get mode: downloading data

Main parameters

-r, --run-identifiers RUN_IDENTIFIERS [RUN_IDENTIFIERS ...] Run accession(s) to download/extract, e.g. ERR1739691
--run-identifiers-list RUN_IDENTIFIERS_LIST Text file containing a newline-separated list of run identifiers, i.e. a one-column CSV file.
-p, --bioprojects BIOPROJECTS [BIOPROJECTS ...] BioProject ID(s) to download/extract from, e.g. PRJNA621514 or SRP260223
-t, --extraction-threads EXTRACTION_THREADS Number of threads to use when converting sra files to fastq
-f, --output-format-possibilities {sra,fastq,fastq.gz,fasta,fasta.gz} Format(s) of the downloaded data, default "fastq fastq.gz"
-m, --download-methods {aws-http,prefetch,aws-cp,gcp-cp,ena-ascp,ena-ftp} [{aws-http,prefetch,aws-cp,gcp-cp,ena-ascp,ena-ftp} ...] How to download the .sra file. If several methods are given, they are tried in order until one succeeds [required].

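Method	Description
ena-ascp	Download .fastq.gz files from ENA using Aspera; these can then be converted further. This is the fastest method because no fasterq-dump step is needed.
ena-ftp	Download .fastq.gz files from ENA using curl; these can then be converted further. This is relatively fast because no fasterq-dump step is needed.
prefetch	Download the .SRA file using NCBI prefetch from SRA-Tools, then extract it with fasterq-dump.
aws-http	Download the .SRA file from the AWS Open Data Program using aria2c with multiple connection threads, then extract it with fasterq-dump.
aws-cp	Download the .SRA file from AWS using aws s3 cp, then extract it with fasterq-dump. Usually requires neither payment nor an AWS account.
gcp-cp	Download the .SRA file from Google Cloud using Google Cloud gsutil, then extract it with fasterq-dump. Requires payment and a Google Cloud account.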
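The "try each method in order until one succeeds" behavior of -m can be pictured with the sketch below. kingfisher already does this itself when several methods are given, so the wrapper is only an illustration, useful mainly when you want to log which method worked. `get_with_fallback` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: try one download method at a time and report which one worked.
# "get_with_fallback" is a hypothetical helper; kingfisher already performs
# this fallback itself when several methods follow -m.

get_with_fallback() {
    local run="$1" method
    shift
    for method in "$@"; do
        if kingfisher get -r "$run" -m "$method"; then
            echo "downloaded $run via $method"
            return 0
        fi
        echo "method $method failed for $run, trying next" >&2
    done
    echo "all methods failed for $run" >&2
    return 1
}

# Usage (requires kingfisher on PATH and network access):
#   get_with_fallback ERR1739691 ena-ascp ena-ftp prefetch
```

Listing the ENA methods first follows the table above, since they skip the fasterq-dump step.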

Example

$ kingfisher get -r ERR1739691 -m ena-ascp aws-http prefetch
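For more than a handful of runs, the accessions can be written to a one-per-line file and passed with --run-identifiers-list. A minimal sketch using only the flags listed above; the file name `runs.txt`, the helper name `download_runs`, the chosen methods, and the thread count are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: download every run listed in a one-column text file.
# "download_runs" and "runs.txt" are hypothetical names chosen for illustration.

download_runs() {
    local list_file="$1"
    # Try ENA over FTP first, then fall back to NCBI prefetch.
    # Request gzip-compressed fastq and use 8 threads for any sra conversion.
    kingfisher get --run-identifiers-list "$list_file" \
        -m ena-ftp prefetch \
        -f fastq.gz \
        -t 8
}

# Usage (requires kingfisher on PATH and network access):
#   printf '%s\n' ERR1739691 SRR12118866 > runs.txt
#   download_runs runs.txt
```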

extract mode: format conversion

Converts sra files to fastq format

Main parameters

-t Number of threads
-f {sra,fastq,fastq.gz,fasta,fasta.gz} Output format after conversion

Example

$ kingfisher extract --sra ERR1739691.sra -t 16 -f fastq.gz
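When a directory already holds several .sra files, the same command can be applied to each in turn. This is a sketch under stated assumptions: `extract_all` is a hypothetical helper, and the default of 16 threads simply mirrors the example above.

```shell
#!/usr/bin/env bash
# Sketch: convert every .sra file in a directory to fastq.gz.
# "extract_all" is a hypothetical helper; the thread count is an example value.

extract_all() {
    local dir="$1" threads="${2:-16}" sra
    for sra in "$dir"/*.sra; do
        # Skip the unexpanded pattern when the directory has no .sra files.
        [ -e "$sra" ] || continue
        kingfisher extract --sra "$sra" -t "$threads" -f fastq.gz
    done
}

# Usage (requires kingfisher on PATH):
#   extract_all ./sra_files 16
```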

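annotate mode: retrieving sample information

Retrieves sample information such as data size, library preparation strategy, and sequencing platform

Main parameters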

-r, --run-identifiers RUN_IDENTIFIERS [RUN_IDENTIFIERS ...] Run accession(s) to download/extract, e.g. ERR1739691
--run-identifiers-list, --run-accession-list RUN_IDENTIFIERS_LIST Text file containing a newline-separated list of run identifiers, i.e. a one-column CSV file.
-p, --bioprojects BIOPROJECTS [BIOPROJECTS ...] BioProject ID(s) to download/extract from, e.g. PRJNA621514 or SRP260223
-o, --output-file OUTPUT_FILE File name to write the output to [default: stdout]
-f, --output-format {human,csv,tsv,json,feather,parquet} Output format [default: human]
-a, --all-columns Print all metadata columns [default: print only a few selected columns]

Example

$ kingfisher annotate -r ERR1739691
run        | bioproject | Gbp   | library_strategy | library_selection | model               | sample_name | taxon_name
---------- | ---------- | ----- | ---------------- | ----------------- | ------------------- | ----------- | ----------
ERR1739691 | PRJEB15706 | 2.382 | WGS              | RANDOM            | Illumina HiSeq 2500 | MM1_1       | metagenome
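The annotate output can also be saved in a machine-readable format and reused, for example to build a run list for get mode. A minimal sketch: `annotate_project` and `runs_from_tsv` are hypothetical helpers, and the TSV header is assumed to contain a column named `run`, as in the human-readable table above.

```shell
#!/usr/bin/env bash
# Sketch: save a project's metadata as TSV, then list its run accessions.
# "annotate_project" and "runs_from_tsv" are hypothetical helpers. The column
# name "run" is assumed to match the header shown in the human-readable output.

annotate_project() {
    local project="$1" out_file="$2"
    kingfisher annotate -p "$project" -f tsv -o "$out_file"
}

runs_from_tsv() {
    # Locate the "run" column from the header row, then print it for each record.
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "run") col = i; next }
        col     { print $col }
    ' "$1"
}

# Usage (requires kingfisher on PATH and network access):
#   annotate_project PRJNA621514 meta.tsv
#   runs_from_tsv meta.tsv > runs.txt
#   kingfisher get --run-identifiers-list runs.txt -m ena-ftp prefetch
```

TSV is used here instead of CSV so that commas inside metadata fields do not break the simple awk split.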
