• <Next-generation sequencing> Downloading SRA files from NCBI


    The latest version of this article is kept at:
    http://blog.csdn.net/tanzuozhev/article/details/51077222

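    As sequencing technology keeps improving, the volume of next-generation sequencing data is growing exponentially.
    NCBI provides the SRA database to store this data.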
    http://www.ncbi.nlm.nih.gov/sra

    To make this data easier to analyze, NCBI provides a command-line toolkit for downloading it: sra-toolkit.

    It includes the commands listed below.
    Official documentation:
    http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

    prefetch: Allows command-line downloading of SRA, dbGaP, and ADSP data # downloads the data

    fastq-dump: Convert SRA data into fastq format # converts downloaded SRA data to FASTQ files; supports paired-end (PE) data

    sam-dump: Convert SRA data to sam format # converts SRA to SAM

    sra-pileup: Generate pileup statistics on aligned SRA data

    vdb-config: Display and modify VDB configuration information

    vdb-decrypt: Decrypt non-SRA dbGaP data (“phenotype data”)
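
    Before using any of the commands above, it can help to confirm that they are actually installed. The following is a minimal sketch, assuming the toolkit's bin directory has already been added to PATH; the function name check_sra_tools is made up for this illustration.

```shell
# Report which of the sra-toolkit commands listed above are on PATH.
# Assumption: the toolkit's bin directory has already been added to PATH.
check_sra_tools() {
    for tool in prefetch fastq-dump sam-dump sra-pileup vdb-config vdb-decrypt; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

check_sra_tools
```

    Each tool gets one line of output, so a missing entry points to an incomplete installation or a PATH problem.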

    prefetch

    Commonly used options
    Data transfer:
    # Whether to force the download when the file has already been downloaded; by default it is not forced
    -f  |   --force <value> Force object download. One of: no, yes, all. no [default]: Skip download if the object is found and complete; yes: Download it even if it is found and is complete; all: Ignore lock files (stale locks or if it is currently being downloaded: use at your own risk!).
    
    # Choose the transport, ascp or http; by default ascp is tried first, then http
    --transport <value> Value one of: ascp (only), http (only), both (first try ascp, fallback to http). Default: both.
    
    # List the contents and sizes of a kart file
    # You can put the items you want to download into a kart file
    -l  |   --list  List the contents of a kart file.
    -s  |   --list-sizes    List the content of kart file with target file sizes.
    
    # Set the minimum file size
    -N  |   --min-size <size>   Minimum file size to download in KB (inclusive).
    
    # Set the maximum file size
    -X  |   --max-size <size>   Maximum file size to download in KB (exclusive). Default: 20G.
    
    # Download order
    -o  |   --order <value> Kart prefetch order. One of: kart (in kart order), size (by file size: smallest first). default: size.
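
    The options above can be combined on one command line. The sketch below only prints the command instead of running it, so the result can be reviewed first. The helper name build_prefetch_cmd is hypothetical, and the flags are taken from the listing above; other toolkit releases may name or interpret them differently, so check prefetch --help on your installation.

```shell
# Print (do not run) a prefetch command line built from the options above.
# Arguments: accession, optional maximum size in KB, optional transport
# (ascp, http, or both). Flag names follow the option listing in this article.
build_prefetch_cmd() {
    accession=$1
    max_kb=$2
    transport=${3:-both}
    cmd="prefetch --transport $transport"
    if [ -n "$max_kb" ]; then
        cmd="$cmd --max-size $max_kb"
    fi
    echo "$cmd $accession"
}

# 1048576 KB = 1 GB upper limit, HTTP only.
build_prefetch_cmd ERR732926 1048576 http
```

    Once the printed line looks right, it can be pasted into the terminal or passed to eval.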

    Examples

    prefetch ERR732926

    Downloads the file for sample ERR732926 directly; by default it is placed in the ~/ncbi/public/sra directory

    prefetch cart_0.krt

    Downloads everything listed in the kart file

    prefetch -l cart_0.krt

    Lists the contents of cart_0.krt
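
    When several samples are needed and no kart file is at hand, a plain text list of accessions can be looped over. This is a rough sketch, not part of the toolkit: the function prefetch_list, the file accessions.txt, and the DRY_RUN switch are all made up for illustration, and SRA_DIR assumes the default download directory mentioned above.

```shell
# Download every accession listed in a text file, one per line.
# DRY_RUN=1 (the default here) only prints the commands.
# SRA_DIR is an assumption based on the default location mentioned above.
DRY_RUN=${DRY_RUN:-1}
SRA_DIR=${SRA_DIR:-$HOME/ncbi/public/sra}

prefetch_list() {
    while IFS= read -r acc; do
        # Skip blank lines and comment lines.
        case $acc in ''|'#'*) continue ;; esac
        if [ -f "$SRA_DIR/$acc.sra" ]; then
            echo "skip $acc (already in $SRA_DIR)"
        elif [ "$DRY_RUN" = 1 ]; then
            echo "prefetch $acc"
        else
            # Keep prefetch from consuming the list on stdin.
            prefetch "$acc" </dev/null
        fi
    done < "$1"
}

# Usage with a hypothetical list file.
printf '%s\n' ERR732926 '# a comment' SRR390728 > accessions.txt
prefetch_list accessions.txt
```

    Running it with DRY_RUN=0 performs the real downloads; accessions whose .sra file is already present in SRA_DIR are skipped.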

    fastq-dump

    
    General:
    -h  |   --help  Displays ALL options, general usage, and version information.
    -V  |   --version   Display the version of the program.
    Data formatting:
    # Split paired-end data
    --split-files   Dump each read into separate file. Files will receive suffix corresponding to read number.
    --split-spot    Split spots into individual reads.
    
    # Keep FASTA only, without quality scores
    --fasta <[line width]>  FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
    -I  |   --readids   Append read id after spot id as 'accession.spot.readid' on defline.
    -F  |   --origfmt   Defline contains only original sequence name.
    -C  |   --dumpcs <[cskey]>  Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
    -B  |   --dumpbase  Formats sequence using base space (default for other than SOLiD).
    -Q  |   --offset <integer>  Offset to use for ASCII quality scores. Default is 33 ("!").
    Filtering:
    -N  |   --minSpotId <rowid> Minimum spot id to be dumped. Use with "X" to dump a range.
    -X  |   --maxSpotId <rowid> Maximum spot id to be dumped. Use with "N" to dump a range.
    -M  |   --minReadLen <len>  Filter by sequence length >= <len>
    --skip-technical    Dump only biological reads.
    --aligned   Dump only aligned sequences. Aligned datasets only; see sra-stat.
    --unaligned Dump only unaligned sequences. Will dump all for unaligned datasets.
    
    # Output
    Workflow and piping:
    -O  |   --outdir <path> Output directory, default is current working directory ('.').
    -Z  |   --stdout    Output to stdout, all split data become joined into single stream.
    --gzip  Compress output using gzip.
    --bzip2 Compress output using bzip2.

    Examples

    fastq-dump -X 5 -Z SRR390728

    Prints the first five spots (20 lines) of sample SRR390728 to stdout, without downloading the file first.
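
    The "20 lines" figure follows from the FASTQ layout: each record takes four lines (header, sequence, "+", qualities), so five records give 20 lines. A small sketch of that arithmetic, using two made-up records in place of real fastq-dump output; count_fastq_records is a hypothetical helper name.

```shell
# Count FASTQ records on stdin: every record occupies exactly four lines
# (assuming unwrapped sequence and quality lines).
count_fastq_records() {
    awk 'END { print NR / 4 }'
}

# Two made-up records standing in for real fastq-dump output.
printf '%s\n' \
    '@read.1' 'ACGT' '+' 'IIII' \
    '@read.2' 'TTGA' '+' 'IIII' | count_fastq_records
# prints 2
```

    With network access, piping fastq-dump -X 5 -Z SRR390728 into the same function should report 5.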

    fastq-dump -I --split-files SRR390728

    Handles paired-end data.
    Produces two fastq files (--split-files) containing ".1" and ".2" read suffixes (-I) for paired-end data.
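
    After splitting paired-end data, a quick sanity check is that both mate files hold the same number of records. The sketch below assumes two FASTQ files are already on disk; the file names sample_1.fastq and sample_2.fastq and their contents are invented for the demonstration, as is the helper name same_record_count.

```shell
# Succeed only if two FASTQ files contain the same number of records
# (four lines per record, assuming unwrapped sequence and quality lines).
same_record_count() {
    n1=$(awk 'END { print NR / 4 }' "$1")
    n2=$(awk 'END { print NR / 4 }' "$2")
    [ "$n1" = "$n2" ]
}

# Made-up mate files for the demonstration.
printf '%s\n' '@s.1.1' 'ACGT' '+' 'IIII' '@s.2.1' 'GGCA' '+' 'IIII' > sample_1.fastq
printf '%s\n' '@s.1.2' 'TTAC' '+' 'IIII' '@s.2.2' 'CATG' '+' 'IIII' > sample_2.fastq

if same_record_count sample_1.fastq sample_2.fastq; then
    echo "mate files match"
else
    echo "mate files differ"
fi
```

    A mismatch usually means one of the files was truncated or filtered separately, which most paired-end aligners will reject.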

    fastq-dump --split-files --fasta 60 SRR390728

    Produces two (--split-files) fasta files (--fasta) with 60 bases per line ("60" included after --fasta).
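
    To see what a 60-base line width looks like, the standard fold utility can wrap a sequence the same way. This is only an illustration of the line-wrap idea on a synthetic 130-base sequence, not a claim about how fastq-dump wraps internally.

```shell
# Build a synthetic 130-base sequence, wrap it at 60 characters per line,
# and print the length of each resulting line.
awk 'BEGIN { for (i = 0; i < 130; i++) printf "A"; print "" }' \
    | fold -w 60 \
    | awk '{ print length($0) }'
# prints 60, 60 and 10 on separate lines
```

    Passing 0 to --fasta, as the option description says, disables wrapping and keeps each sequence on one line.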

    fastq-dump --split-files --aligned -Q 64 SRR390728

    Produces two fastq files (--split-files) that contain only aligned reads (--aligned; Note: only for files submitted as aligned data), with a quality offset of 64 (-Q 64). Please see the documentation on vdb-dump if you wish to produce fasta/qual data.
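
    The quality offset decides which ASCII character encodes a given quality score: the character code is the score plus the offset. A minimal sketch of that mapping follows; phred_char is a hypothetical helper, not a toolkit command.

```shell
# Print the ASCII character that encodes a quality score at a given offset.
# Arguments: score, offset (33 is the fastq-dump default, 64 is the older convention).
phred_char() {
    code=$(( $1 + $2 ))
    # Convert the decimal code to a three-digit octal escape for printf.
    printf "\\$(printf '%03o' "$code")\n"
}

phred_char 40 33   # prints I  (40 + 33 = 73)
phred_char 40 64   # prints h  (40 + 64 = 104)
```

    The same score therefore looks different in the file depending on -Q, so downstream tools must be told which offset was used.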

    Only the commonly used options are listed here; for anything else, please read the official documentation.

  • Original article: https://www.cnblogs.com/tlnshuju/p/7182690.html