The Evolution of Computing Architecture

姚伟峰

Landmark

Superscalar Era (1990s)

ILP(Instruction Level Parallelism)

DLP(Data Level Parallelism)

Heterogeneous Parallelism

Multi-Core Era (2000s)

TLP(Thread Level Parallelism)

Physical Multi-Core

Hardware Threading

Heterogeneous Computing Era (2010s+)

CPU Micro-Architecture Characteristics

GPU Micro-Architecture Characteristics

References
Landmark

Superscalar Era (1990s)

The superscalar era focused mainly on single-core performance. The main techniques were:

ILP(Instruction Level Parallelism)

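ILP, as the name suggests, means exploiting opportunities for instruction-level parallelism in order to increase instruction throughput. Instruction throughput is measured in IPC (Instructions Per Cycle), the number of instructions executed per clock cycle. Without any ILP technique, IPC is at most 1.
IPC is increased mainly through pipelining (see the figure below). Pipelining splits the execution of an instruction into multiple stages driven by a common clock, so that on every tick each instruction advances to the next stage of the pipeline. Ideally, in any given cycle there are as many instructions in the pipeline as there are stages, which keeps every stage busy. The number of stages is called the depth of the pipeline.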

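The figure below shows the standard RISC-V pipeline. It has a depth of 5, and its stages are: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (Mem), and Write Back (WB). IF and ID form the front end, also called the Control Unit; Execute, Memory Access, and Write Back form the back end, or the Arithmetic/Logical Unit (ALU) in a broad sense.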
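
To make the fill-and-drain behaviour of such a pipeline concrete, here is a minimal Python sketch. It assumes an ideal pipeline with no stalls or hazards; the function names `pipeline_schedule` and `ideal_ipc` are made up for this illustration and do not come from any real tool.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # the five stages named above


def pipeline_schedule(num_instructions, stages=STAGES):
    """Map each cycle to {stage: instruction index} for an ideal, stall-free pipeline."""
    schedule = {}
    for i in range(num_instructions):
        for s, stage in enumerate(stages):
            # Instruction i enters the pipeline at cycle i and moves one stage per cycle.
            schedule.setdefault(i + s, {})[stage] = i
    return schedule


def ideal_ipc(num_instructions, depth=len(STAGES)):
    """Instructions completed per cycle, counting pipeline fill and drain cycles."""
    total_cycles = depth + num_instructions - 1
    return num_instructions / total_cycles


schedule = pipeline_schedule(8)
for cycle in sorted(schedule):
    row = "  ".join(
        f"{st}:i{schedule[cycle][st]}" if st in schedule[cycle] else f"{st}:--"
        for st in STAGES
    )
    print(f"cycle {cycle:2d}  {row}")
print(f"IPC for 8 instructions: {ideal_ipc(8):.2f}")
```

From cycle 4 onward every stage holds an instruction until the pipeline starts to drain, and as the instruction count grows the IPC of this single-issue model approaches, but never exceeds, 1.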

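With a pipeline in place, instruction parallelism can be raised further by widening the pipeline, that is, by letting it fetch, decode, and issue (dispatch to an execution engine) multiple instructions in the same cycle. Physically, this requires multiple IF/ID paths and multiple execution engines. If a core can issue at most m instructions per cycle, the architecture is called an m-wide multi-issue core. A multi-issue core is also called a superscalar core.
The figure below is a diagram of the x86 Sunny Cove core (the core used in Ice Lake). It has 4 compute ports (execution engines), so we can call it a 4-wide multi-issue core, with a maximum achievable IPC of 4.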
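
The effect of issue width can be sketched with a toy scheduler. The model below is an assumption-laden simplification: every instruction has unit latency, any ready instruction may issue (an out-of-order style pick), and `deps[i]` may only name earlier instructions. The name `issue_cycles` is hypothetical.

```python
def issue_cycles(deps, width):
    """Cycles needed to issue all instructions on a `width`-wide core.

    deps[i] lists the indices of earlier instructions whose results
    instruction i reads. Unit latency: a result is usable one cycle
    after its producer issues.
    """
    issued_at = {}
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        # An instruction is ready once all of its producers issued in an earlier cycle.
        ready = [
            i for i in remaining
            if all(d in issued_at and issued_at[d] < cycle for d in deps[i])
        ]
        for i in ready[:width]:  # at most `width` instructions issue per cycle
            issued_at[i] = cycle
        remaining = [i for i in remaining if i not in issued_at]
        cycle += 1
    return cycle


independent = [[] for _ in range(8)]                  # no dependencies at all
chain = [[]] + [[i - 1] for i in range(1, 8)]         # each reads the previous result

for name, program in (("independent", independent), ("chain", chain)):
    cycles = issue_cycles(program, width=4)
    print(f"{name}: {cycles} cycles, IPC = {len(program) / cycles:.1f}")
```

In this sketch the independent instructions reach the full IPC of 4, while the dependency chain stays at an IPC of 1 no matter how wide the core is, which is why the peak IPC of a multi-issue core is an upper bound rather than a guarantee.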

DLP(Data Level Parallelism)

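The main way to increase data parallelism is to increase the number of data elements each execution engine can process per clock cycle. A traditional CPU performs one scalar operation per clock; we call it a scalar core. DLP is increased by letting one clock process a vector of a given length, which gives a vector core. Today, vector cores mainly implement data parallelism through SIMD (Single Instruction Multiple Data): ARM's NEON, x86's SSE, AVX (Advanced Vector eXtensions), AVX2, and AVX-512, as well as the execution engines behind GPU SIMT (Single Instruction Multiple Threads), are all SIMD.
In the figure below, port 5 of the Sunny Cove core has an AVX-512 FMA512 (512-bit Fused Multiply-Add) unit, which provides a DLP of 16 FP32 multiply-add operations.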
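
The figure of 16 follows from dividing the vector width by the element width. The Python sketch below only emulates the idea in software (it is not real SIMD code); `simd_lanes` and `vector_fma` are illustrative names, and integers are used so the results are exact.

```python
def simd_lanes(vector_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def vector_fma(a, b, c, lanes):
    """Emulate a*b+c where each 'instruction' handles `lanes` elements at once.

    Returns the results and the number of emulated instructions.
    """
    out = []
    instructions = 0
    for start in range(0, len(a), lanes):
        chunk = slice(start, start + lanes)
        out.extend(x * y + z for x, y, z in zip(a[chunk], b[chunk], c[chunk]))
        instructions += 1
    return out, instructions


lanes = simd_lanes(512, 32)  # 512-bit vector, 32-bit (FP32-sized) elements
a = list(range(64))
b = [2] * 64
c = [1] * 64

result, vector_count = vector_fma(a, b, c, lanes)
_, scalar_count = vector_fma(a, b, c, 1)
print(f"lanes = {lanes}, vector instructions = {vector_count}, scalar instructions = {scalar_count}")
```

For 64 elements the scalar version needs 64 emulated instructions and the 16-lane version needs 4, the same results for one sixteenth of the instruction count.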

Heterogeneous Parallelism

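In this era we can also faintly see the first signs of heterogeneous parallelism, in the form of scalar and vector units working in parallel. The figure below illustrates this scalar/vector parallelism.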

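Multi-Core Era (2000s)

While continuing to squeeze out ILP and DLP, the multi-core era gradually turned its attention to TLP (Thread Level Parallelism). The main idea is to scale parallel computing capability horizontally by combining multiple homogeneous cores.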

TLP(Thread Level Parallelism)

Physical Multi-Core

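Physical multi-core is simple: it just means paying for more hardware. For both CPUs and GPUs it means piling on cores; GPUs merely call their cores SMs (Streaming Multiprocessors, NVIDIA), SubSlices (Intel), or Shader Arrays (AMD).
The figure below is an x86 Ice Lake socket with 28 cores.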

The figure below is the full GA100 chip behind the NVIDIA A100, with 128 cores (SMs).

Hardware Threading

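Compared with physical multi-core, hardware threading is about making better use of what already exists. Its basic assumption is that a single program's pipeline suffers stalls caused by various dependencies, which leads to pipeline bubbles, and that it is hard to keep the pipeline full by fixing this within a single program alone, so parallelism should be mined across programs. Given this assumption, a natural idea is to add more program contexts: when one program stalls, the pipeline switches to another, raising the chance of keeping the pipeline full. The diagrams below illustrate this.
CPU:

GPU: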

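This shows why it is called threading: the pipeline itself is not duplicated, only the execution contexts are, so the extra parallelism comes almost for free :). That is the essence of threading. Unlike software threading, these execution contexts are maintained and used by hardware rather than software, hence the name hardware threading.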
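
The bubble-filling argument can be checked with a toy model. The sketch below assumes a single issue slot per cycle, a fixed-priority choice among ready contexts, and stall lengths that are known up front; the function name `pipeline_utilization` and the stall pattern are invented for illustration.

```python
def pipeline_utilization(threads):
    """Fraction of cycles in which the single issue slot is used.

    threads: one list per hardware context; each entry is the number of
    stall cycles that follow that instruction (0 means no stall).
    """
    pc = [0] * len(threads)        # next instruction index per context
    ready_at = [0] * len(threads)  # first cycle each context may issue again
    cycle = busy = 0
    while any(pc[t] < len(threads[t]) for t in range(len(threads))):
        for t in range(len(threads)):
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                pc[t] += 1
                busy += 1
                break  # only one instruction issues per cycle
        cycle += 1     # if no context was ready, this cycle is a bubble
    return busy / cycle


program = [0, 3] * 4  # hypothetical: every second instruction stalls for 3 cycles

one_context = pipeline_utilization([program])
two_contexts = pipeline_utilization([program, program])
print(f"1 context:  {one_context:.2f}")
print(f"2 contexts: {two_contexts:.2f}")
```

With one context the slot is busy in 8 of 17 cycles; adding a second context on the same pipeline raises that to 16 of 19, without adding any execution hardware to the model.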
Because multiple contexts share one pipeline, there are two ways for the front end to issue instructions across them:
- SMT (Simultaneous Multi-Threading)
  Each clock, the pipeline chooses instructions from multiple threads to run on the ALUs. Typical examples of SMT are Hyper-Threading Technology (HT or HTT) on Intel x86 CPUs, with 2 SMT threads per core, and warps on NVIDIA GPUs.
- IMT (Interleaved Multi-Threading)
  Each clock, the pipeline chooses a thread and runs an instruction from that thread on the core's ALUs. Intel Gen GPUs use a hybrid of SMT and IMT.
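
The difference between the two policies can be sketched for a single cycle. This is a deliberately simplified model: `ready[t]` is the number of instructions thread `t` could issue this cycle, IMT picks its thread round-robin, and SMT fills slots in thread order. Real hardware uses more elaborate selection logic, and the names `imt_issue` and `smt_issue` are hypothetical.

```python
def imt_issue(ready, width, cycle):
    """IMT: choose one thread for this cycle and issue only from that thread."""
    t = cycle % len(ready)  # round-robin thread selection (an assumption)
    return {t: min(width, ready[t])}


def smt_issue(ready, width):
    """SMT: fill the issue slots of this cycle with instructions from several threads."""
    issued = {}
    slots = width
    for t, available in enumerate(ready):
        take = min(slots, available)
        if take:
            issued[t] = take
            slots -= take
    return issued


ready = [1, 2, 1]  # ready instructions per thread in this cycle (made-up numbers)
print("IMT:", imt_issue(ready, width=4, cycle=0))
print("SMT:", smt_issue(ready, width=4))
```

With these numbers IMT uses one of the four slots, because the chosen thread has only one ready instruction, while SMT uses all four by drawing from three threads.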

Heterogeneous Computing Era (2010s+)

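As applications demand ever more compute (Higher Computation Capacity) and energy efficiency (Better Power Efficiency), computer architecture has undergone a shift in methodology to meet that demand, moving from One for All to Suit is Best. That is the demand side.
On the supply side, the success of GPUs also indirectly shows that Domain Specific Computing is gradually maturing.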

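For software productivity, two preconditions are needed:
- Unified Data Access
  OpenCL 2.0's SVM (Shared Virtual Memory) and CUDA's UVM (Unified Virtual Memory) are promising here. On the hardware side, coherency-aware data access technologies such as CXL can support this from a performance perspective.
- Unified Programming Model
  A unified, C/C++-like programming language that supports heterogeneous computing is needed; candidates include CUDA, OpenCL, and DPC++.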

The essence of heterogeneous computing is: Use best compute engine for each workload by re-balancing ILP & DLP & TLP. The resulting compute capability is a combination of the three.
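
One common back-of-the-envelope reading of that combination is a product: cores (TLP) times FMA issue ports per core (ILP) times SIMD lanes (DLP) times clock frequency, with 2 floating-point operations per fused multiply-add. The sketch below uses that reading as an assumption; the function name `peak_flops` and both configurations are hypothetical and do not describe any real chip.

```python
def peak_flops(cores, fma_ports_per_core, simd_lanes, clock_hz):
    """Theoretical peak FLOP/s under the product model described above.

    Each FMA lane counts as 2 floating-point operations (multiply + add).
    """
    return cores * fma_ports_per_core * simd_lanes * 2 * clock_hz


# Hypothetical design points, chosen only to contrast the two styles.
few_big = peak_flops(cores=8, fma_ports_per_core=2, simd_lanes=16,
                     clock_hz=3_000_000_000)
many_small = peak_flops(cores=128, fma_ports_per_core=1, simd_lanes=32,
                        clock_hz=1_000_000_000)

print(f"few big cores:    {few_big / 1e12:.3f} TFLOP/s")
print(f"many small cores: {many_small / 1e12:.3f} TFLOP/s")
```

The peak number says nothing about latency or about how much of the peak a given workload can reach, which is exactly where the balance among ILP, DLP, and TLP matters.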

For different workloads, we need to consider whether we lean toward A Few Big or Many Small.

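Today the most common form of heterogeneous computing pairs CPUs with GPUs: the CPU is the representative latency machine and the GPU the representative throughput machine, and each has its own strengths.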

CPU Micro-Architecture Characteristics
- TLP
  tens of cores, each with 2 hardware threads
- ILP
  4 compute ports with OoO (Out-of-Order) issue
- DLP
  SIMD width supported by increased cache ($) bandwidth

GPU Micro-Architecture Characteristics
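- TLP
  hundreds of cores, each with many (e.g. 32) hardware threads;
  simple and efficient thread generation/dispatching/monitoring
- ILP
  2 to 3 compute ports, mainly in-order issue
- DLP
  wider SIMD width, plus large register files that reduce cache ($)/memory bandwidth needs and improve compute capability and efficiency

As the demands for compute and energy efficiency keep rising, beyond exploiting heterogeneous computing opportunities within the von Neumann architecture (such as CPU+GPU or CPU+ASIC heterogeneity), people have also begun to revisit other architectures in search of cross-architecture heterogeneous opportunities, such as the dataflow architecture and spatial computing that have been much discussed recently. The road keeps getting wider!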

References
Original article: https://www.cnblogs.com/Matrix_Yao/p/15870917.html
