SRA toolkit

Retrieving SRA data with SRAdbV2

Install the SRAdbV2 package:
install.packages('BiocManager')
BiocManager::install('seandavi/SRAdbV2')

To use SRAdbV2, first create an instance of the R6 class Omicidx:

library(SRAdbV2)
oidx = Omicidx$new()

Once the Omicidx instance exists, use oidx$search() to query the data:
query=paste(
paste0('sample_taxon_id:', 10116),
'AND experiment_library_strategy:"rna seq"',
'AND experiment_library_source:transcriptomic',
'AND experiment_platform:illumina')
z = oidx$search(q=query,entity='full',size=100L)

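Here, the entity parameter is the type of SRA entity to retrieve through the API, and the size parameter is the number of records returned in each chunk of results.

Because a result set can be very large, we use a Scroller to retrieve and work through the results: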
s = z$scroll()
s
s$count

s$count gives a quick look at how many records the query returned.

Note that the call fails if the API host cannot be resolved:

Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: api-omicidx.cancerdatasci.org

1.1 The Scroller provides two ways to access the data
The first loads all of the query results into R's memory, which can be slow:
res = s$collate(limit = 1000)
head(res)
Then call reset() to reset the Scroller:

s$reset()
s

The second way is to iterate over the data with the yield method:
j = 0
## fetch only 500 records, but
## `yield` will return NULL
## after ALL records have been fetched
while(s$has_next() && s$fetched < 500) {
    res = s$yield()
    # do something interesting with `res` here if you like
    j = j + 1
    message(sprintf('total of %d fetched records, loop iteration # %d', s$fetched, j))
}

If the full result set has not yet been fetched, the Scroller's has_next() method returns TRUE.
Calling reset() moves the cursor back to the start of the result set.

2. Query syntax
See:
https://bioconductor.github.io/BiocWorkshops/public-data-resources-and-bioconductor.html#query-syntax

3. Using the raw API without R/Bioconductor
The data can also be retrieved through the raw API, without R/Bioconductor.
SRAdbV2 wraps a web API, so the same data can be accessed through that web API directly:

sra_browse_API()

The web-based API provides a useful interface for querying experiment data. The URL of its JSON (Swagger) description is returned by:

sra_get_swagger_json_url()
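
Outside R, the query text that oidx$search() receives as q has to be assembled by hand. The following is a minimal shell sketch of that step only, assuming the raw API accepts the same query string as the R client. build_query is a hypothetical helper, not part of SRAdbV2, and no endpoint is called because none is given above.

```shell
# Join query clauses with " AND ", mirroring the paste() call in the R example.
# build_query is a hypothetical helper name used only for illustration.
build_query() {
    [ "$#" -gt 0 ] || return 1      # need at least one clause
    query=$1
    shift
    for clause in "$@"; do
        query="$query AND $clause"
    done
    printf '%s\n' "$query"
}

# Same clauses as the R example above.
build_query \
    'sample_taxon_id:10116' \
    'experiment_library_strategy:"rna seq"' \
    'experiment_library_source:transcriptomic' \
    'experiment_platform:illumina'
```

The resulting string still has to be URL-encoded before it is placed in a request URL.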

===========================================

Installing the SRA Toolkit

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

wget --output-document sratoolkit.tar.gz http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gz

tar -vxzf sratoolkit.tar.gz

export PATH=$PATH:$PWD/sratoolkit.2.10.9-centos_linux64/bin
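
The directory name in the export line depends on the version that "current" resolved to at download time. As a sketch, the extracted directory can be located with a glob instead of a hard-coded version number. find_sratoolkit_bin is a hypothetical helper, and the sratoolkit.*/bin layout is assumed from the path used above.

```shell
# Print the first sratoolkit.*/bin directory found under the given base
# directory (default: current directory). Returns 1 if none exists.
# find_sratoolkit_bin is a hypothetical helper name.
find_sratoolkit_bin() {
    base_dir=${1:-$PWD}
    for candidate in "$base_dir"/sratoolkit.*/bin; do
        if [ -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Append the toolkit to PATH only if it was actually found.
if toolkit_bin=$(find_sratoolkit_bin "$PWD"); then
    PATH=$PATH:$toolkit_bin
    export PATH
fi
```

If several versions are extracted side by side, this picks the first one in glob order, which may not be the newest.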

===========================================


Tool: prefetch

Usage:

prefetch [options] <path/SRA file | path/kart file> [<path/file> ...]

prefetch [options] <SRA accession>

prefetch [options] --list <kart_file>

Frequently Used Options:

General:
-h | --help    Displays ALL options, general usage, and version information.
-V | --version    Display the version of the program.
Data transfer:
-f | --force <value>    Force object download. One of: no, yes, all. no [default]: Skip download if the object is found and complete; yes: Download it even if it is found and is complete; all: Ignore lock files (stale locks or if it is currently being downloaded: use at your own risk!).
     --transport <value>    Value one of: ascp (only), http (only), both (first try ascp, fallback to http). Default: both.
-l | --list    List the contents of a kart file.
-s | --list-sizes    List the content of kart file with target file sizes.
-N | --min-size <size>    Minimum file size to download in KB (inclusive).
-X | --max-size <size>    Maximum file size to download in KB (exclusive). Default: 20G.
-o | --order <value>    Kart prefetch order. One of: kart (in kart order), size (by file size: smallest first). Default: size.
-a | --ascp-path <ascp-binary|private-key-file>    Path to ascp program and private key file (asperaweb_id_dsa.openssh).
-p | --progress <value>    Time period in minutes to display download progress (0: no progress). Default: 1.
     --option-file <file>    Read more options and parameters from the file.

Use examples:

prefetch cart_0.krt

Download the files listed in the kart file.

prefetch -l cart_0.krt

Lists the contents of the kart file.

prefetch -X 200G cart_0.krt

Sets the maximum download file size to 200GB and downloads the files listed in the kart.

prefetch -o kart cart_0.krt

Downloads the contents in the order listed in the kart. Preferred for large run sets (example: 100+) where calculating the download sizes may cause a delay to the start of downloads.

prefetch -a "/opt/aspera/bin/ascp|/opt/aspera/etc/asperaweb_id_dsa.openssh" SRR390728

When the toolkit is unable to locate an installed version of Aspera, the location of ascp and the SSH key can be provided (-a "/opt/aspera/bin/ascp|/opt/aspera/etc/asperaweb_id_dsa.openssh").

prefetch -t ascp -a "/opt/aspera/bin/ascp|/opt/aspera/bin/asperaweb_id_dsa.openssh" --option-file file.txt

Forces the download to go through Aspera only (-t ascp) and prevents HTTP download; the default behavior is to attempt ascp first and fall back to HTTP if Aspera is not found or fails. Sequentially downloads the SRA data files and references required for the list of accessions in "file.txt". The format for "file.txt" is a newline-separated list of accessions: SRR# SRR# SRR# …
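
Since the accession file must contain one accession per line, it can help to check it before starting a long download. This is a minimal sketch, assuming the file lists run accessions only (SRR, ERR, or DRR followed by digits); check_accession_list is a hypothetical helper and the pattern would need widening for other accession types.

```shell
# Print every line (with its line number) that does not look like a run
# accession. Returns 0 if all lines are valid, 1 if any are not, and 2 if
# the file cannot be read. check_accession_list is a hypothetical helper.
check_accession_list() {
    list_file=$1
    [ -r "$list_file" ] || return 2
    # grep -v selects the non-matching lines; it succeeds only if it found some.
    if grep -n -v -E '^[SED]RR[0-9]+$' "$list_file"; then
        return 1
    fi
    return 0
}
```

A call such as `check_accession_list file.txt && prefetch --option-file file.txt` would then start the download only when every line passes the check.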

prefetch ~/Downloads/SRR390728.sra

If you have already downloaded an SRA datafile (example here: SRR390728.sra, present in the "~/Downloads" directory), this command will retrieve all of the reference sequences required to extract the data. This command is useful for resolving errors of the type "name not found while resolving tree" - meaning that a reference(s) is required, but cannot be located.

prefetch -c SRR390728

This command will check the availability of all needed reference sequences (-c) for a given accession.

===========================================

A non-R solution is to use the SRA toolkit prefetch command on a list of SRA identifiers.

First you need the accession list, which you can download in one batch. In your case, go to https://www.ncbi.nlm.nih.gov/sra?term=SRP026197 and, at the top right, click "Send to", then "File", then "Accession List".

Once you have it saved in a file (default is SraAccList.txt) you can use the command (tested in SRA toolkit 2.9.0):

prefetch $(<SraAccList.txt)
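
The one-liner above passes every accession to a single prefetch call. An alternative sketch is to loop over the file and call prefetch once per accession, so that one failed download is reported without hiding the others. prefetch_each and the PREFETCH_CMD override are hypothetical names introduced here for illustration (the override makes a dry run with echo possible).

```shell
# Run prefetch once per accession listed in a file (one accession per line).
# Blank lines and Windows carriage returns are ignored.
# PREFETCH_CMD is a hypothetical override hook, e.g. PREFETCH_CMD=echo for a dry run.
prefetch_each() {
    list_file=$1
    failed=0
    while IFS= read -r acc || [ -n "$acc" ]; do
        acc=$(printf '%s' "$acc" | tr -d '\r')
        [ -n "$acc" ] || continue
        # stdin is redirected so the command cannot consume the list itself
        if ! "${PREFETCH_CMD:-prefetch}" "$acc" < /dev/null; then
            echo "prefetch failed for $acc" >&2
            failed=$((failed + 1))
        fi
    done < "$list_file"
    # succeed only if every accession was fetched
    [ "$failed" -eq 0 ]
}
```

Usage would be `prefetch_each SraAccList.txt`, or `PREFETCH_CMD=echo prefetch_each SraAccList.txt` to print the accessions without downloading anything.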

===========================================

prefetch cannot show download progress or speed;

wget shows progress and speed;

Xunlei (the Thunder download manager) shows progress and speed.

===========================================

REF

https://www.biostars.org/p/93494/

https://blog.csdn.net/candle_light/article/details/92806204

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=prefetch

https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit

https://bioconductor.github.io/BiocWorkshops/public-data-resources-and-bioconductor.html#usage-1

Original article: https://www.cnblogs.com/emanlee/p/14502070.html