Book Recommendation for Genomics Cloud Computing: Genomics in the Cloud Using Docker, GATK, and WDL in Terra

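For anyone learning genomics cloud computing alongside me, here is a book recommendation: "Genomics in the Cloud: Using Docker, GATK, and WDL in Terra". The author is the GATK community manager, and the book was published in 2020, so it is still fairly recent.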

GitHub repository:
genomics-in-the-cloud

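The book covers:
- Essential genomics and computing technology background
- Basic cloud computing operations
- Getting started with GATK, plus the three main GATK Best Practices pipelines
- Automating analysis with scripted workflows using WDL and Cromwell
- Scaling up workflow execution in the cloud, including parallelization and cost optimization
- Interactive analysis in the cloud using Jupyter notebooks
- Secure collaboration and computational reproducibility using Terra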
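The book has some shortcomings. It is thick and devotes a lot of space to the Broad Institute's own products, yet most of us will rarely use its cloud platform, Terra. The typesetting is also poor. In addition, it is written with the human genome in mind, so its scope is limited. Still, reading selected chapters is a good option, since there are so few books on this topic.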

The table of contents follows. To get the PDF edition, follow the WeChat official account Bioinfarmer and reply with the keyword: cloud.
1. Introduction
  The Promises and Challenges of Big Data in Biology and Life Sciences
  Infrastructure Challenges
  Toward a Cloud-Based Ecosystem for Data Sharing and Analysis
  Cloud-Hosted Data and Compute
  Platforms for Research in the Life Sciences
  Standardization and Reuse of Infrastructure
  Being FAIR
  Wrap-Up and Next Steps
2. Genomics in a Nutshell: A Primer for Newcomers to the Field
  Introduction to Genomics
  The Gene as a Discrete Unit of Inheritance (Sort Of)
  The Central Dogma of Biology: DNA to RNA to Protein
  The Origins and Consequences of DNA Mutations
  Genomics as an Inventory of Variation in and Among Genomes
  The Challenge of Genomic Scale, by the Numbers
  Genomic Variation
  The Reference Genome as Common Framework
  Physical Classification of Variants
  Germline Variants Versus Somatic Alterations
  High-Throughput Sequencing Data Generation
  From Biological Sample to Huge Pile of Read Data
  Types of DNA Libraries: Choosing the Right Experimental Design
  Data Processing and Analysis
  Mapping Reads to the Reference Genome
  Variant Calling
  Data Quality and Sources of Error
  Functional Equivalence Pipeline Specification
  Wrap-Up and Next Steps
3. Computing Technology Basics for Life Scientists
  Basic Infrastructure Components and Performance Bottlenecks
  Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
  Levels of Compute Organization: Core, Node, Cluster, and Cloud
  Addressing Performance Bottlenecks
  Parallel Computing
  Parallelizing a Simple Analysis
  From Cores to Clusters and Clouds: Many Levels of Parallelism
  Trade-Offs of Parallelism: Speed, Efficiency, and Cost
  Pipelining for Parallelization and Automation
  Workflow Languages
  Popular Pipelining Languages for Genomics
  Workflow Management Systems
  Virtualization and the Cloud
  VMs and Containers
  Introducing the Cloud
  Categories of Research Use Cases for Cloud Services
  Wrap-Up and Next Steps
4. First Steps in the Cloud
  Setting Up Your Google Cloud Account and First Project
  Creating a Project
  Checking Your Billing Account and Activating Free Credits
  Running Basic Commands in Google Cloud Shell
  Logging in to the Cloud Shell VM
  Using gsutil to Access and Manage Files
  Pulling a Docker Image and Spinning Up the Container
  Mounting a Volume to Access the Filesystem from Within the Container
  Setting Up Your Own Custom VM
  Creating and Configuring Your VM Instance
  Logging into Your VM by Using SSH
  Checking Your Authentication
  Copying the Book Materials to Your VM
  Installing Docker on Your VM
  Setting Up the GATK Container Image
  Stopping Your VM…to Stop It from Costing You Money
  Configuring IGV to Read Data from GCS Buckets
  Wrap-Up and Next Steps
5. First Steps with GATK
  Getting Started with GATK
  Operating Requirements
  Command-Line Syntax
  Multithreading with Spark
  Running GATK in Practice
  Getting Started with Variant Discovery
  Calling Germline SNPs and Indels with HaplotypeCaller
  Filtering Based on Variant Context Annotations
  Introducing the GATK Best Practices
  Best Practices Workflows Covered in This Book
  Other Major Use Cases
  Wrap-Up and Next Steps
6. GATK Best Practices for Germline Short Variant Discovery
  Data Preprocessing
  Mapping Reads to the Genome Reference
  Marking Duplicates
  Recalibrating Base Quality Scores
  Joint Discovery Analysis
  Overview of the Joint Calling Workflow
  Calling Variants per Sample to Generate GVCFs
  Consolidating GVCFs
  Applying Joint Genotyping to Multiple Samples
  Filtering the Joint Callset with Variant Quality Score Recalibration
  Refining Genotype Assignments and Adjusting Genotype Confidence
  Next Steps and Further Reading
  Single-Sample Calling with CNN Filtering
  Overview of the CNN Single-Sample Workflow
  Applying 1D CNN to Filter a Single-Sample WGS Callset
  Applying 2D CNN to Include Read Data in the Modeling
  Wrap-Up and Next Steps
7. GATK Best Practices for Somatic Variant Discovery
  Challenges in Cancer Genomics
  Somatic Short Variants (SNVs and Indels)
  Overview of the Tumor-Normal Pair Analysis Workflow
  Creating a Mutect2 PoN
  Running Mutect2 on the Tumor-Normal Pair
  Estimating Cross-Sample Contamination
  Filtering Mutect2 Calls
  Annotating Predicted Functional Effects with Funcotator
  Somatic Copy-Number Alterations
  Overview of the Tumor-Only Analysis Workflow
  Creating a Somatic CNA PoN
  Applying Denoising
  Performing Segmentation and Call CNAs
  Additional Analysis Options
  Wrap-Up and Next Steps
8. Automating Analysis Execution with Workflows
  Introducing WDL and Cromwell
  Installing and Setting Up Cromwell
  Your First WDL: Hello World
  Learning Basic WDL Syntax Through a Minimalist Example
  Running a Simple WDL with Cromwell on Your Google VM
  Interpreting the Important Parts of Cromwell’s Logging Output
  Adding a Variable and Providing Inputs via JSON
  Adding Another Task to Make It a Proper Workflow
  Your First GATK Workflow: Hello HaplotypeCaller
  Exploring the WDL
  Generating the Inputs JSON
  Running the Workflow
  Breaking the Workflow to Test Syntax Validation and Error Messaging
  Introducing Scatter-Gather Parallelism
  Exploring the WDL
  Generating a Graph Diagram for Visualization
  Wrap-Up and Next Steps
9. Deciphering Real Genomics Workflows
  Mystery Workflow #1: Flexibility Through Conditionals
  Mapping Out the Workflow
  Reverse Engineering the Conditional Switch
  Mystery Workflow #2: Modularity and Code Reuse
  Mapping Out the Workflow
  Unpacking the Nesting Dolls
  Wrap-Up and Next Steps
10. Running Single Workflows at Scale with Pipelines API
  Introducing the GCP Genomics Pipelines API Service
  Enabling Genomics API and Related APIs in Your Google Cloud Project
  Directly Dispatching Cromwell Jobs to PAPI
  Configuring Cromwell to Communicate with PAPI
  Running Scattered HaplotypeCaller via PAPI
  Monitoring Workflow Execution on Google Compute Engine
  Understanding and Optimizing Workflow Efficiency
  Granularity of Operations
  Balance of Time Versus Money
  Suggested Cost-Saving Optimizations
  Platform-Specific Optimization Versus Portability
  Wrapping Cromwell and PAPI Execution with WDL Runner
  Setting Up WDL Runner
  Running the Scattered HaplotypeCaller Workflow with WDL Runner
  Monitoring WDL Runner Execution
  Wrap-Up and Next Steps
11. Running Many Workflows Conveniently in Terra
  Getting Started with Terra
  Creating an Account
  Creating a Billing Project
  Cloning the Preconfigured Workspace
  Running Workflows with the Cromwell Server in Terra
  Running a Workflow on a Single Sample
  Running a Workflow on Multiple Samples in a Data Table
  Monitoring Workflow Execution
  Locating Workflow Outputs in the Data Table
  Running the Same Workflow Again to Demonstrate Call Caching
  Running a Real GATK Best Practices Pipeline at Full Scale
  Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery
  Examining the Preloaded Data
  Selecting Data and Configuring the Full-Scale Workflow
  Launching the Full-Scale Workflow and Monitoring Execution
  Options for Downloading Output Data—or Not
  Wrap-Up and Next Steps
12. Interactive Analysis in Jupyter Notebook
  Introduction to Jupyter in Terra
  Jupyter Notebooks in General
  How Jupyter Notebooks Work in Terra
  Getting Started with Jupyter in Terra
  Inspecting and Customizing the Notebook Runtime Configuration
  Opening Notebook in Edit Mode and Checking the Kernel
  Running the Hello World Cells
  Using gsutil to Interact with Google Cloud Storage Buckets
  Setting Up a Variable Pointing to the Germline Data in the Book Bucket
  Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
  Visualizing Genomic Data in an Embedded IGV Window
  Setting Up the Embedded IGV Browser
  Adding Data to the IGV Browser
  Setting Up an Access Token to View Private Data
  Running GATK Commands to Learn, Test, or Troubleshoot
  Running a Basic GATK Command: HaplotypeCaller
  Loading the Data (BAM and VCF) into IGV
  Troubleshooting a Questionable Variant Call in the Embedded IGV Browser
  Visualizing Variant Context Annotation Data
  Exporting Annotations of Interest with VariantsToTable
  Loading R Script to Make Plotting Functions Available
  Making Density Plots for QUAL by Using makeDensityPlot
  Making a Scatter Plot of QUAL Versus DP
  Making a Scatter Plot Flanked by Marginal Density Plots
  Wrap-Up and Next Steps
13. Assembling Your Own Workspace in Terra
  Managing Data Inside and Outside of Workspaces
  The Workspace Bucket as Data Repository
  Accessing Private Data That You Manage Outside of Terra
  Accessing Data in the Terra Data Library
  Re-Creating the Tutorial Workspace from Base Components
  Creating a New Workspace
  Adding the Workflow to the Methods Repository and Importing It into the Workspace
  Creating a Configuration Quickly with a JSON File
  Adding the Data Table
  Filling in the Workspace Resource Data Table
  Creating a Workflow Configuration That Uses the Data Tables
  Adding the Notebook and Checking the Runtime Environment
  Documenting Your Workspace and Sharing It
  Starting from a GATK Best Practices Workspace
  Cloning a GATK Best Practices Workspace
  Examining GATK Workspace Data Tables to Understand How the Data Is Structured
  Getting to Know the 1000 Genomes High Coverage Dataset
  Copying Data Tables from the 1000 Genomes Workspace
  Using TSV Load Files to Import Data from the 1000 Genomes Workspace
  Running a Joint-Calling Analysis on the Federated Dataset
  Building a Workspace Around a Dataset
  Cloning the 1000 Genomes Data Workspace
  Importing a Workflow from Dockstore
  Configuring the Workflow to Use the Data Tables
  Wrap-Up and Next Steps
14. Making a Fully Reproducible Paper
  Overview of the Case Study
  Computational Reproducibility and the FAIR Framework
  Original Research Study and History of the Case Study
  Assessing the Available Information and Key Challenges
  Designing a Reproducible Implementation
  Generating a Synthetic Dataset as a Stand-In for the Private Data
  Overall Methodology
  Retrieving the Variant Data from 1000 Genomes Participants
  Creating Fake Exomes Based on Real People
  Mutating the Fake Exomes
  Generating the Definitive Dataset
  Re-Creating the Data Processing and Analysis Methodology
  Mapping and Variant Discovery
  Variant Effect Prediction, Prioritization, and Variant Load Analysis
  Analytical Performance of the New Implementation
  The Long, Winding Road to FAIRness
  Final Conclusions
https://www.oreilly.com/library/view/genomics-in-the/9781491975183/
https://www.amazon.ca/Genomics-Cloud-GATK-Spark-Docker/dp/1491975199
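This article comes from cnblogs (博客园), author: Bioinfarmer. When reposting, please cite the original link: https://www.cnblogs.com/jessepeng/p/16302307.html. For timely updates, follow the WeChat official account of the same name: Bioinfarmer.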