• CUDA教程三、cuda的随机数生成


    上一篇我们介绍了runtime库中的一些函数,接下来我们来介绍cuda随机数的生成。


    回顾

    cuda将函数与变量根据其所在位置,分割成两部分。其中主机端(host)的函数与变量可以互相自由调用,设备端(device)的函数与变量也可自由调用,不过设备端有一种特殊的函数——__kernel__函数(核函数),这是主机端调用设备端函数的唯一方法。

    核函数的调用需要运行时参数,host端和device端都是如此。

    对于变量的访问,host端可以修改和读取__device__和__constant__,却无法访问__shared__;device端可以读取上述三种,却只能修改__device__和__shared__变量。至于host端变量,自然只有host端函数才能修改和读取,device端无法访问。

    问题的提出

    神经网络的初始化、游戏NPC的随机生成等,都需要用到随机数。cuda的访问控制,给了随机数生成以一定的问题,host端当然可以用cstdlib中的rand()函数生成随机数,但设备端如何使用这些随机数?每调用一次rand(),就通过cudaMemcpy传递给显存吗?显然不是,这会消耗太多I/O时间;先调用n次,一次性传到device中吗?虽然可行,但并不是最优解。能否用一种方法,让主机仅仅传给设备一个信号,使得多个随机数在device端被生成,这样就能避免过多的I/O带来的耗时问题;或者调用一个设备端函数,使得设备端可以根据需要动态生成随机数。于是curand库和curand_kernel库进入了我们的视野。

    curand库——host端的随机数生成

    curand库使用了我们的第一种思路,让主机向设备传递信号,指定随机数的生成数量,生成若干随机数以待使用。

    随机数生成器

    curand提供了多种不同的随机数生成器:

     1 /**
     2  * CURAND generator types
     3  */
     4 enum curandRngType {
     5     CURAND_RNG_TEST = 0,
     6     CURAND_RNG_PSEUDO_DEFAULT = 100,           ///< Default pseudorandom generator
     7     CURAND_RNG_PSEUDO_XORWOW = 101,            ///< XORWOW pseudorandom generator
     8     CURAND_RNG_PSEUDO_MRG32K3A = 121,          ///< MRG32k3a pseudorandom generator
     9     CURAND_RNG_PSEUDO_MTGP32 = 141,            ///< Mersenne Twister MTGP32 pseudorandom generator
    10     CURAND_RNG_PSEUDO_MT19937 = 142,           ///< Mersenne Twister MT19937 pseudorandom generator
    11     CURAND_RNG_PSEUDO_PHILOX4_32_10 = 161,     ///< PHILOX-4x32-10 pseudorandom generator
    12     CURAND_RNG_QUASI_DEFAULT = 200,            ///< Default quasirandom generator
    13     CURAND_RNG_QUASI_SOBOL32 = 201,            ///< Sobol32 quasirandom generator
    14     CURAND_RNG_QUASI_SCRAMBLED_SOBOL32 = 202,  ///< Scrambled Sobol32 quasirandom generator
    15     CURAND_RNG_QUASI_SOBOL64 = 203,            ///< Sobol64 quasirandom generator
    16     CURAND_RNG_QUASI_SCRAMBLED_SOBOL64 = 204   ///< Scrambled Sobol64 quasirandom generator
    17 };
    18 typedef enum curandRngType curandRngType_t;

    许多英语比我好的同学,通过看注释就能看明白,枚举值大于等于100且小于200的这六个随机数发生器是pseudorandom(伪随机数),而枚举值大于等于200的这五个随机数发生器则是quasirandom(真随机数)。

    学过高中数学,我们知道,根据算法得到的随机数为伪随机数,而根据现实中的不依赖算法的随机事件得到的随机数为真随机数。换句话说,伪随机数可以被破解,而真随机数是无法预测的。

    比如cstdlib库中的rand函数是这样实现的:

    1 unsigned int _Status = 0;
    2 void __cdecl srand(unsigned int _Seed) {
    3     _Status = _Seed;
    4 }
    5 int __cdecl rand(void) {
    6     _Status = 214013u * _Status + 2531011u;
    7     return (int)(_Status >> 16u & 0x7fffu);
    8 }

    很明显这是通过算法模拟出来的随机,当然是伪随机数。

    而模拟电视在无信号的时候出现的无序噪点,每一个噪点的值都是真随机数生成的。

    curand库生成伪随机数的算法,也是通过算法得到的,如XORWOW算法则是NVIDIA根据XORSHIFT算法改进而成的。而真随机数,则是直接对显卡电流等实际数据的采样,再加上简单的计算处理得到的。在密码学加密中,随机数尽量由真随机数生成器生成,防止敌人预测出序列,轻松破解。

    随机数的生成

    生成随机数的方法也很简单,首先要初始化随机数生成器,然后调用“传递信号”的函数,让device端生成若干随机数。device端使用时只需知道随机数数组的指针即可。

    比如通过下面的操作:

     1 curandGenerator_t gen;        //随机数生成器
     2 curandStatus_t cst;        //curand库函数返回值暂存
     3 cudaError_t cet;        //runtime库函数返回值暂存
     4 unsigned int* arr;        //随机数存储位置
     5 const size_t N = 128;
     6 
     7 //初始化生成器
     8 cst = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
     9 assert(CURAND_STATUS_SUCCESS == cst);
    10 //动态申请内存
    11 cet = cudaMalloc(&arr, sizeof(unsigned int) * N);
    12 assert(cudaSuccess == cet);
    13 //将N个随机数生成在数组arr中
    14 cst = curandGenerate(gen, arr, N);
    15 assert(CURAND_STATUS_SUCCESS == cst);

    本例中使用的是XORWOW生成器,同理我们也可以用其他生成器,包括真随机数生成器,生成其他类型的数、其他类型的分布,同样也可以简略来写:

     1 curandGenerator_t gen;        //随机数生成器
     2 unsigned long long* uint64_arr;    //64位无符号整数数组
     3 double* double_arr;        //64位浮点数数组
     4 const size_t N = 128;
     5 
     6 //初始化生成器
     7 assert(CURAND_STATUS_SUCCESS == curandCreateGenerator(&gen, CURAND_RNG_QUASI_SOBOL64));
     8 //动态申请内存
     9 assert(cudaSuccess == cudaMalloc(&uint64_arr, sizeof(unsigned long long) * N));
    10 assert(cudaSuccess == cudaMalloc(&double_arr, sizeof(double) * N));
    11 //将随机数生成在两个数组中
    12     //生成unsigned long long类型的若干个随机数
    13 assert(CURAND_STATUS_SUCCESS == curandGenerateLongLong(gen, uint64_arr, N));
    14     //生成符合μ=0、σ²=1标准正态分布的double类型的若干个随机数
    15 assert(CURAND_STATUS_SUCCESS == curandGenerateNormalDouble(gen, double_arr, N, 0.0, 1.0));

    除此之外,curand库依然有很多生成不同类型随机数的函数,其中常用的函数原型如下:

     1 curandStatus_t CURANDAPI curandGenerate ( curandGenerator_t generator, unsigned int* outputPtr, size_t num );
     2 //Generate 32-bit pseudo or quasirandom numbers.
     3 
     4 curandStatus_t CURANDAPI curandGenerateLogNormal ( curandGenerator_t generator, float* outputPtr, size_t n, float  mean, float  stddev );
     5 //Generate log-normally distributed floats.
     6 
     7 curandStatus_t CURANDAPI curandGenerateLogNormalDouble ( curandGenerator_t generator, double* outputPtr, size_t n, double  mean, double  stddev );
     8 //Generate log-normally distributed doubles.
     9 
    10 curandStatus_t CURANDAPI curandGenerateLongLong ( curandGenerator_t generator, unsigned long long* outputPtr, size_t num );
    11 //Generate 64-bit quasirandom numbers.
    12 
    13 curandStatus_t CURANDAPI curandGenerateNormal ( curandGenerator_t generator, float* outputPtr, size_t n, float  mean, float  stddev );
    14 //Generate normally distributed doubles.
    15 
    16 curandStatus_t CURANDAPI curandGenerateNormalDouble ( curandGenerator_t generator, double* outputPtr, size_t n, double  mean, double  stddev );
    17 //Generate normally distributed doubles.
    18 
    19 curandStatus_t CURANDAPI curandGeneratePoisson ( curandGenerator_t generator, unsigned int* outputPtr, size_t n, double  lambda );
    20 //Generate Poisson-distributed unsigned ints.
    21 
    22 curandStatus_t CURANDAPI curandGenerateUniform ( curandGenerator_t generator, float* outputPtr, size_t num );
    23 //Generate uniformly distributed floats.
    24 
    25 curandStatus_t CURANDAPI curandGenerateUniformDouble ( curandGenerator_t generator, double* outputPtr, size_t num );
    26 //Generate uniformly distributed doubles.

    编译错误的处理

    需要注意的是,用VS直接创建的cuda工程,直接编译上述代码很可能出现一些编译错误:

    error LNK2019: 无法解析的外部符号 curandCreateGenerator,函数 main 中引用了该符号
    error LNK2019: 无法解析的外部符号 curandGenerate,函数 main 中引用了该符号

    有编程经验的朋友们看到LNK就知道这是链接器报错,肯定是少链接了一些文件。没错,只需要在项目-属性-链接器-输入-附加依赖项中添加一项"curand.lib"即可。

    curand_kernel库——device端的动态随机数生成

    也行有人觉得上述方法太过呆板,能生成多少随机数必须在调用device端函数前指定,实际工程中可能会生成过多的随机数,占用更多内存,也有可能生成的随机数不够,以致访问无效内存或随机数从头循环。

    但用第二种思路,使用一些__device__函数动态生成随机数,则会避免这些问题。

    可惜的是,国内国外论坛,甚至包括github,很少介绍和使用curand_kernel库,真随机数生成器的初始化方法更是无人介绍,连cuda文档也没有给出例子。

    不过也好,我的机会来了——本篇将是历史上第一篇详细介绍curand_kernel库使用方法,并给出使用实例的文章(狗头)。

    随机数生成器

    curand_kernel提供了八种随机数生成器,并分别封装为结构体:

     1 //伪随机数生成器原型
     2 typedef struct curandStateMtgp32 curandStateMtgp32_t;
     3 typedef struct curandStatePhilox4_32_10 curandStatePhilox4_32_10_t;
     4 typedef struct curandStateXORWOW curandStateXORWOW_t;
     5 typedef struct curandStateMRG32k3a curandStateMRG32k3a_t;
     6 //真随机数生成器原型
     7 typedef struct curandStateSobol32 curandStateSobol32_t;
     8 typedef struct curandStateScrambledSobol32 curandStateScrambledSobol32_t;
     9 typedef struct curandStateSobol64 curandStateSobol64_t;
    10 typedef struct curandStateScrambledSobol64 curandStateScrambledSobol64_t;

    名字都很眼熟吧,没错,前四个依旧是伪随机数生成器,后四个依旧是真随机数生成器。除了使用了MTGP32算法的curandStateMtgp32_t生成器外,其他七个生成器都可以用__device__函数——curand_init()初始化。

    不过,curand_kernel库的生成器类型并不和curand库的生成器一样是枚举类型,而是定义成了不同的结构体,甚至没使用继承或共用体。这也暗示了不同生成器的初始化需要传入不同的参数。

    MRG32K3A、Philox4_32_10与XORWOW生成器的初始化

    除去MTGP32生成器外,剩下的三个生成器的初始化方法非常相似:

    1 __device__​ void curand_init ( unsigned long long seed, unsigned long long subsequence, unsigned long long offset, curandStateMRG32k3a_t* state )
    2 //Initialize MRG32k3a state.
    3 
    4 __device__​ void curand_init ( unsigned long long seed, unsigned long long subsequence, unsigned long long offset, curandStatePhilox4_32_10_t* state )
    5 //Initialize Philox4_32_10 state.
    6 
    7 __device__​ void curand_init ( unsigned long long seed, unsigned long long subsequence, unsigned long long offset, curandStateXORWOW_t* state )
    8 //Initialize XORWOW state.

    以XORWOW为例,我们可以在核函数中这么写来初始化多个curandStateXORWOW_t:

     1 __global__ void initialize_generator(curandStateXORWOW_t* states, unsigned long long seed, size_t size) {
     2     size_t Idx = threadIdx.x + blockDim.x * blockIdx.x;
     3     if(Idx >= size) return;
     4     
     5     curand_init(seed, Idx, 0, &states[Idx]);
     6 }
     7 
     8 //调用参数准备
     9 curandStateXORWOW_t* states;
    10 size_t size = 640 * 360;
    11 assert(cudaSuccess == cudaMalloc(&states, sizeof(curandStateXORWOW_t) * size));
    12 //调用运行时参数准备
    13 size_t thread_count = 1024;
    14 size_t block_count = (size - 1) / thread_count + 1;
    15 //调用核函数
    16 initialize_generator<<<block_count, thread_count>>>(states, time(nullptr), size);
    17 cudaDeviceSynchronize();
    18 assert(cudaSuccess == cudaGetLastError());

    MTGP32生成器的初始化

    而MTGP32生成器的初始化,需要包含curand_mtgp32_host.h库。库中有两个函数:

    1 __host__​ __forceinline__ curandStatus_t curandMakeMTGP32Constants ( const mtgp32_params_fast_t params[], mtgp32_kernel_params_t* p )
    2 //Set up constant parameters for the mtgp32 generator.
    3 __host__ __forceinline__ curandStatus_t CURANDAPI curandMakeMTGP32KernelState ( curandStateMtgp32_t* s, mtgp32_params_fast_t params[], mtgp32_kernel_params_t* k, int  n, unsigned long long seed )
    4 //Set up initial states for the mtgp32 generator.

    具体初始化方法如下:

    1 mtgp32_kernel_params* devKernelParams;
    2 curandStateMtgp32_t* states;
    3 size_t size = 128;
    4 
    5 assert(cudaSuccess == cudaMalloc(&devKernelParams, sizeof(mtgp32_kernel_params)));
    6 assert(cudaSuccess == cudaMalloc(&states, size * sizeof(curandStateMtgp32_t)));
    7 assert(CURAND_STATUS_SUCCESS == curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams));
    8 assert(CURAND_STATUS_SUCCESS == curandMakeMTGP32KernelState(states, mtgp32dc_params_fast_11213, devKernelParams, size, time(nullptr)));

    真随机数生成器的初始化

    curand_kernel库中的四个真随机数生成器的初始化都需要用到方向向量:

    1 __device__​ void curand_init ( curandDirectionVectors64_t direction_vectors, unsigned long long scramble_c, unsigned long long offset, curandStateScrambledSobol64_t* state );
    2 //Initialize Scrambled Sobol64 state.
    3 __device__​ void curand_init ( curandDirectionVectors64_t direction_vectors, unsigned long long offset, curandStateSobol64_t* state );
    4 //Initialize Sobol64 state.
    5 __device__​ void curand_init ( curandDirectionVectors32_t direction_vectors, unsigned int  scramble_c, unsigned int  offset, curandStateScrambledSobol32_t* state );
    6 //Initialize Scrambled Sobol32 state.
    7 __device__​ void curand_init ( curandDirectionVectors32_t direction_vectors, unsigned int  offset, curandStateSobol32_t* state );
    8 //Initialize Sobol32 state.

    curand库有获取方向向量的函数:

    1 curandStatus_t CURANDAPI curandGetDirectionVectors32 ( curandDirectionVectors32_t* vectors, curandDirectionVectorSet_t set );
    2 //Get direction vectors for 32-bit quasirandom number generation.
    3 curandStatus_t CURANDAPI curandGetDirectionVectors64 ( curandDirectionVectors64_t* vectors, curandDirectionVectorSet_t set );
    4 //Get direction vectors for 64-bit quasirandom number generation.

    同时有两个随机数生成器也需要施加干扰。curand库也有获取干扰因子的函数:

    1 curandStatus_t CURANDAPI curandGetScrambleConstants32 ( unsigned int** constants )
    2 //Get scramble constants for 32-bit scrambled Sobol'.
    3 curandStatus_t CURANDAPI curandGetScrambleConstants64 ( unsigned long long** constants )
    4 //Get scramble constants for 64-bit scrambled Sobol'.

    具体的生成方法就比较冗长了,这里以Scrambled Sobol 32为例:

     1 __global__ void initialize_generator(curandStateScrambledSobol32_t* states, curandDirectionVectors32_t* devVectors,
     2                      unsigned int* devScrambleConstants, size_t size) {
     3     size_t Idx = threadIdx.x + blockDim.x * blockIdx.x;
     4     if(Idx >= size) return;
     5     
     6     curand_init(devVectors[Idx], devScrambleConstants[Idx], Idx * 1024, &states[Idx]);
     7 }
     8 
     9 //host端变量
    10 curandDirectionVectors32_t* hostVectors;
    11 unsigned int* hostScrambleConstants;
    12 //device端变量
    13 curandStateScrambledSobol32_t* states;
    14 curandDirectionVectors32_t* devVectors;
    15 unsigned int* devScrambleConstants;
    16 
    17 const size_t size = 128;
    18 //获取方向向量和扰动量
    19 assert(CURAND_STATUS_SUCCESS == curandGetDirectionVectors32(&hostVectors, CURAND_SCRAMBLED_DIRECTION_VECTORS_32_JOEKUO6));
    20 assert(CURAND_STATUS_SUCCESS == curandGetScrambleConstants32(&hostScrambleConstants));
    21 //申请device端内存
    22 assert(cudaSuccess == cudaMalloc(&states, size * sizeof(curandStateScrambledSobol32_t)));
    23 assert(cudaSuccess == cudaMalloc(&devVectors, size * sizeof(curandDirectionVectors32_t)));
    24 assert(cudaSuccess == cudaMalloc(&devScrambleConstants, size * sizeof(unsigned int)));
    25 //将获取到的方向向量和扰动量拷贝到device端
    26 assert(cudaSuccess == cudaMemcpy(devVectors, hostVectors, size * sizeof(curandDirectionVectors32_t), cudaMemcpyHostToDevice));
    27 assert(cudaSuccess == cudaMemcpy(devScrambleConstants, hostScrambleConstants, size * sizeof(unsigned int), cudaMemcpyHostToDevice));
    28 
    29 size_t thread_count = 256;
    30 size_t block_count = (size - 1) / thread_count + 1;
    31 initialize_generator<<<block_count, thread_count>>>(states, devVectors, devScrambleConstants, size);
    32 cudaDeviceSynchronize();
    33 assert(cudaSuccess == cudaGetLastError());

    由于真随机数的初始化没有subsequence参数,所以不同线程随机数的区分要靠offset来完成,因此curand_init的offset参数被我填了Idx * 1024,这个1024不是固定的,只要足够大,使得不同线程生成的随机数互不重合,但又不会太大以至于访问无效内存即可。

    生成器的使用

    实际使用随机数的时候则直接使用curand函数即可,非常方便。curand函数原型如下:

     1 __device__​ unsigned int     curand ( curandStateMtgp32_t* state )
     2 //Return 32-bits of pseudorandomness from a mtgp32 generator.
     3 
     4 __device__​ unsigned long long     curand ( curandStateScrambledSobol64_t* state )
     5 //Return 64-bits of quasirandomness from a scrambled Sobol64 generator.
     6 
     7 __device__​ unsigned long long     curand ( curandStateSobol64_t* state )
     8 //Return 64-bits of quasirandomness from a Sobol64 generator.
     9 
    10 __device__​ unsigned int     curand ( curandStateScrambledSobol32_t* state )
    11 //Return 32-bits of quasirandomness from a scrambled Sobol32 generator.
    12 
    13 __device__​ unsigned int     curand ( curandStateSobol32_t* state )
    14 //Return 32-bits of quasirandomness from a Sobol32 generator.
    15 
    16 __device__​ unsigned int     curand ( curandStateMRG32k3a_t* state )
    17 //Return 32-bits of pseudorandomness from an MRG32k3a generator.
    18 
    19 __device__​ unsigned int     curand ( curandStatePhilox4_32_10_t* state )
    20 //Return 32-bits of pseudorandomness from an Philox4_32_10 generator.
    21 
    22 __device__​ unsigned int     curand ( curandStateXORWOW_t* state )
    23 //Return 32-bits of pseudorandomness from an XORWOW generator.

    依旧以XORWOW为例,我们来使用上述代码生成若干可以被2或3整除的随机数:

     1 __global__ void use_generator(curandStateXORWOW_t* states, unsigned int* arr, size_t size) {
     2     size_t Idx = threadIdx.x + blockDim.x * blockIdx.x;
     3     if(Idx >= size) return;
     4 
     5     arr[Idx] = curand(&states[Idx]);
     6     while((arr[Idx] % 2 != 0) && (arr[Idx] % 3 != 0)) {
     7         arr[Idx] = curand(&states[Idx]);
     8     }
     9 }
    10 
    11 unsigned int* arr;
    12 assert(cudaSuccess == cudaMalloc(&arr, sizeof(unsigned int) * size));
    13 use_generator<<<block_count, thread_count>>>(states, arr, size);
    14 cudaDeviceSynchronize();
    15 assert(cudaSuccess == cudaGetLastError());

    除curand()函数外,还有若干其他的生成函数,下例代码我们用MTGP32生成器生成[公式]区间内均匀分布的double型变量:

     1 __global__ void use_generator(curandStateMtgp32_t* states, double* arr, size_t size) {
     2     size_t Idx = threadIdx.x + blockDim.x * blockIdx.x;
     3     if(Idx >= size) return;
     4 
     5     arr[Idx] = curand_uniform_double(&states[Idx]);
     6 }
     7 
     8 double* arr;
     9 assert(cudaSuccess == cudaMalloc(&arr, sizeof(double) * size));
    10 size_t thread_count = 256;
    11 size_t block_count = (size - 1) / thread_count + 1;
    12 use_generator<<<block_count, thread_count>>>(states, arr, size);
    13 cudaDeviceSynchronize();
    14 assert(cudaSuccess == cudaGetLastError());

    其他常用的分布函数如下(注意,下文中gen_type可被替换为任何一种curand_kernel库的生成器类型):

     1 __device__​ float curand_log_normal ( gen_type* state, float  mean, float  stddev );
     2 //Return a log-normally distributed float
     3 
     4 __device__​ double curand_log_normal_double ( gen_type* state, double  mean, double  stddev );
     5 //Return a log-normally distributed double
     6 
     7 __device__​ float curand_normal ( gen_type* state );
     8 //Return a normally distributed float
     9 
    10 __device__​ double curand_normal_double ( gen_type* state );
    11 //Return a normally distributed double
    12 
    13 __device__​ unsigned int curand_poisson ( gen_type* state, double lambda );
    14 //Return a Poisson-distributed unsigned int
    15 
    16 __device__​ float curand_uniform ( gen_type* state );
    17 //Return a uniformly distributed float
    18 
    19 __device__​ double curand_uniform_double ( gen_type* state );
    20 //Return a uniformly distributed double
  • 相关阅读:
    Java中单态设计模式
    Google地图,Baidu地图数据供应商
    栅格数据和矢量数据
    Java学习之Iterator(迭代器)的一般用法 (转)
    highcharts图表组件入门教程:如何监听柱状图柱子点击事件动态更新当前数据点数值和所对应X轴刻度
    relative、absolute和float
    【转】20个令人敬畏的jQuery插件
    启动tomcat显示成功,但是访问tomcat首页是404错误页
    手机测试Android程序
    Android数据的四种存储方式SharedPreferences、SQLite、Content Provider和File (三) —— SharePreferences
  • 原文地址:https://www.cnblogs.com/ybqjymy/p/15883441.html
Copyright © 2020-2023  润新知