• TCGAbiolinks: a handy tool for downloading TCGA data


    http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#gdcquery:_searching_tcga_open-access_data

    Example:

    Updates

    Recently the TCGA data were moved from the DCC server to the National Cancer Institute (NCI) Genomic Data Commons (GDC) Data Portal. In this version of the package, we rewrote all the functions that accessed the old TCGA server so that they access the GDC instead.

    The GDC, which receives, processes, harmonizes, and distributes clinical, biospecimen, and genomic data from multiple cancer research programs, has data from the following programs:

    • The Cancer Genome Atlas (TCGA)
    • Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
    • the Cancer Genome Characterization Initiative (CGCI)

    The big change is that the GDC data is harmonized against GRCh38. However, not all data has been harmonized yet. The old TCGA data can be accessed through the GDC Legacy Archive, in which the majority of the data can be found.

    More information about the project can be found in the GDC FAQs.

    The functions TCGAquery, TCGAdownload, TCGAprepare, TCGAquery_maf, and TCGAquery_clinical were replaced by GDCquery, GDCdownload, GDCprepare, GDCquery_maf, and GDCquery_clinical.

    The new functions can access both the GDC and the GDC Legacy Archive.
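
    As a rough sketch of how the GDC functions chain together, the snippet below wraps a query, a download, and a prepare step in one function. The wrapper name fetch_target_aml_counts is hypothetical, and the query arguments are copied from the search examples later in this document, so they may not match what the current GDC release offers. Calling the function needs the TCGAbiolinks package and network access.

```r
# Hypothetical wrapper (not part of TCGAbiolinks): query -> download -> prepare.
# Defining it has no side effects; calling it contacts the GDC servers.
fetch_target_aml_counts <- function() {
  # Step 1: search the GDC for matching files (arguments taken from this document)
  query <- TCGAbiolinks::GDCquery(
    project       = "TARGET-AML",
    data.category = "Transcriptome Profiling",
    data.type     = "Gene Expression Quantification",
    workflow.type = "HTSeq - Counts"
  )
  # Step 2: download the files listed in the query
  TCGAbiolinks::GDCdownload(query)
  # Step 3: read the downloaded files into an R object and return it
  TCGAbiolinks::GDCprepare(query)
}
```

    Keeping the three calls together makes it clear that the same query object drives both the download and the prepare step.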

    Note: Not all the examples in this vignette were updated.

    Introduction

    Motivation: The Cancer Genome Atlas (TCGA) provides us with an enormous collection of data sets, not only spanning a large number of cancers but also a large number of experimental platforms. Even though the data can be accessed and downloaded from the database, the possibility to analyse these downloaded data directly in one single R package has not yet been available.

    TCGAbiolinks consists of three parts or levels. First, it provides options to query and download relevant TCGA data from all current platforms and to pre-process them for commonly used bioinformatics packages in Bioconductor or CRAN. Second, it integrates different data types, supports analyses across all platforms such as differential expression, network inference, or survival analysis, and visualizes the results. Third, it adds a social level where a researcher can find others with similar interests in the bioinformatics community, validate results against the literature in PubMed, and retrieve questions and answers from sites such as support.bioconductor.org, biostars.org, and Stack Overflow.

    This document describes how to search, download and analyze TCGA data using the TCGAbiolinks package.

    Installation

    To install use the code below.

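    # biocLite.R has been retired on current Bioconductor releases; use BiocManager.
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("TCGAbiolinks")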

    For a Graphical User Interface, please see TCGAbiolinksGUI. The GUI is under review and will soon be available in the Bioconductor repository.

    Citation

    Please cite the TCGAbiolinks package:

    • “TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data.” Nucleic Acids Research (2015): gkv1507 (Colaprico, Antonio and Silva, Tiago C. and Olsen, Catharina and Garofano, Luciano and Cava, Claudia and Garolini, Davide and Sabedot, Thais S. and Malta, Tathiane M. and Pagnotta, Stefano M. and Castiglioni, Isabella and Ceccarelli, Michele and Bontempi, Gianluca and Noushmehr, Houtan 2016)

    Related publications to this package:

    • “TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages”. F1000Research 10.12688/f1000research.8923.1 (Silva, TC and Colaprico, A and Olsen, C and D’Angelo, F and Bontempi, G and Ceccarelli, M and Noushmehr, H 2016)

    Also, if you have used ELMER analysis please cite:

    • Yao, L., Shen, H., Laird, P. W., Farnham, P. J., & Berman, B. P. “Inferring regulatory element landscapes and transcription factor networks from cancer methylomes.” Genome Biol 16 (2015): 105.
    • Yao, Lijing, Benjamin P. Berman, and Peggy J. Farnham. “Demystifying the secret mission of enhancers: linking distal regulatory elements to target genes.” Critical reviews in biochemistry and molecular biology 50.6 (2015): 550-573.
     

    GDCquery: Searching TCGA open-access data

     

    GDCquery: Searching GDC data for download

    You can easily search GDC data using the GDCquery function.

    Using a summary of filters as used in the TCGA portal, the function works with the following arguments:

    • project A list of valid projects (see table below)
    • data.category A valid data category for the project (see list with getProjectSummary(project))
    • data.type A data type to filter the files to download
    • sample.type A sample type to filter the files to download (See table below)
    • workflow.type GDC workflow type
    • barcode A list of barcodes to filter the files to download
    • legacy Search in the legacy repository? Default: FALSE
    • platform Experimental data platform (HumanMethylation450, AgilentG4502A_07 etc). Used only for legacy repository
    • file.type A string to filter files, based on their names. Used only for legacy repository
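
    The list above marks platform and file.type as used only for the legacy repository. A minimal sketch of checking an argument list against that rule before building a query is shown below. The helper misplaced_legacy_args is hypothetical and not part of TCGAbiolinks, and it only encodes the rule as stated in this document.

```r
# Hypothetical helper (not part of TCGAbiolinks): report arguments that the
# argument list above describes as legacy-only when legacy is not TRUE.
misplaced_legacy_args <- function(args) {
  # With legacy = TRUE, platform and file.type are expected, so nothing to report
  if (isTRUE(args[["legacy"]])) {
    return(character(0))
  }
  # Otherwise return the names of any legacy-only arguments that were supplied
  intersect(names(args), c("platform", "file.type"))
}

# A legacy query: platform is fine here
legacy_args <- list(project = "TCGA-GBM",
                    data.category = "DNA methylation",
                    platform = "Illumina Human Methylation 27",
                    legacy = TRUE)
ok <- misplaced_legacy_args(legacy_args)

# The same arguments without legacy = TRUE: platform is reported
flagged <- misplaced_legacy_args(legacy_args[c("project", "data.category", "platform")])
```

    A checked list could then be passed on with do.call(TCGAbiolinks::GDCquery, legacy_args), which requires the package and network access.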

    The next subsections will detail each of the search arguments. Below, we show some search examples:

    #---------------------------------------------------------------
    #  For available entries and combinations please see the table below
    #---------------------------------------------------------------
    
    # Gene expression aligned against Hg38
    query <- GDCquery(project = "TARGET-AML",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - Counts")
    
    # All DNA methylation data for TCGA-GBM and TCGA-LGG
    query.met <- GDCquery(project = c("TCGA-GBM","TCGA-LGG"),
                          legacy = TRUE,
                          data.category = "DNA methylation",
                          platform = c("Illumina Human Methylation 450", "Illumina Human Methylation 27"))
    
    # Using sample type to get only Primary solid Tumor samples and Solid Tissue Normal
    query.mirna <- GDCquery(project = "TCGA-ACC", 
                            data.category = "Transcriptome Profiling", 
                            data.type = "miRNA Expression Quantification",
                            sample.type = c("Primary solid Tumor","Solid Tissue Normal"))
    
    # Using legacy to access hg19 data and filtering by barcode
    query <- GDCquery(project = "TCGA-GBM",
                      data.category = "DNA methylation", 
                      platform = "Illumina Human Methylation 27", 
                      legacy = TRUE,
                      barcode = c("TCGA-02-0047-01A-01D-0186-05","TCGA-06-2559-01A-01D-0788-05"))
    
    # Gene expression aligned against hg19.
    query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
                               data.category = "Gene expression",
                               data.type = "Gene expression quantification",
                               platform = "Illumina HiSeq", 
                               file.type  = "normalized_results",
                               experimental.strategy = "RNA-Seq",
                               barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                               legacy = TRUE)
    
    # Searching idat file for DNA methylation
    query <- GDCquery(project = "TCGA-OV",
                      data.category = "Raw microarray data",
                      data.type = "Raw intensities", 
                      experimental.strategy = "Methylation array", 
                      legacy = TRUE,
                      file.type = ".idat",
                      platform = "Illumina Human Methylation 450")
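
    The barcodes passed to the barcode argument in the examples above also carry the sample type: in a TCGA barcode the fourth dash-separated field starts with a two-digit sample-type code. The sketch below decodes that code in base R. The helper barcode_sample_type is hypothetical, and the lookup table covers only three codes for illustration, with labels spelled the way the sample.type example above spells them.

```r
# Illustrative lookup: a small subset of TCGA sample-type codes
sample_type_codes <- c("01" = "Primary solid Tumor",
                       "02" = "Recurrent Solid Tumor",
                       "11" = "Solid Tissue Normal")

# Hypothetical helper (not part of TCGAbiolinks): map barcodes to sample types.
barcode_sample_type <- function(barcodes) {
  # Split each barcode on "-"; the fourth field is sample code plus vial, e.g. "01A"
  fields <- strsplit(barcodes, "-", fixed = TRUE)
  codes <- vapply(fields, function(f) substr(f[4], 1, 2), character(1))
  # Codes missing from the lookup table come back as NA
  unname(sample_type_codes[codes])
}

barcodes <- c("TCGA-02-0047-01A-01D-0186-05",
              "TCGA-14-0736-02A-01R-2005-01")
types <- barcode_sample_type(barcodes)
```

    This kind of check can help confirm that a hand-written barcode list matches the sample types intended for a query.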
  • Original article: https://www.cnblogs.com/nkwy2012/p/8044052.html