• Matrix and linear algebra in F#, Part IV: profile your program, find the bottleneck and speed it up: using matrix multiplication as an example[z]


    We will use the face recognition program we developed in Part III to illustrate how to profile programs.

    The performance problem of the face recognition program is that it takes about 25 seconds to process 280 faces (each face has 10304=112*92 pixels). I want to find which part takes most of the time. Is it due to the eigen decomposition?

    Profile

    Profiling means knowing how much time and memory every part of your program take. Good profiling tools could give a detail report on the running behavior of a program. In .Net, there are several good tools for this purpose. Some of them are free.

    However, these tools are too big to use in a explorative/interactive programming style. Following Matlab’s tic and toc, I have written them for F#:

    let stopWatch = new System.Diagnostics.Stopwatch()

    let tic() =
    stopWatch.Reset()
    stopWatch.Start()

    let toc(msg:string) =
    stopWatch.Stop()
    printfn "%s %f ms" msg stopWatch.Elapsed.TotalMilliseconds
    stopWatch.Reset()
    stopWatch.Start()

    I also use the same style for my C programs:

    #include <time.h>
    time_t stopWatch;
    void tic()
    {
    stopWatch = clock();
    }
    void toc(const char *msg)
    {
    double s = (clock() - stopWatch) * 1000.0 / CLOCKS_PER_SEC;
    printf("%s: %.2lf ms\n", msg, s);
    stopWatch = clock();
    }

    This implementation enables us to use a sequence of tocs to profile a code block line by line without the need to write tic every time. However this simple profiler can not be nested as there is only a global timer.

    After putting several tocs into my eigCov:

    tic()
    let colm = colMean faces
    toc("mean face")
    let A = Matrix.init nrow ncol (fun i j -> faces.[i,j] - colm.[i])
    toc("minus mean face")
    let L = A.Transpose * A
    toc("get L")
    let val1, vec1 = eig L
    toc("lapack eig")
    let v = val1 * (1.0 / float (ncol - 1))
    let u = A * vec1.Transpose
    toc("get u")
    // normalize eigen vectors
    let mutable cnt = 0
    for i=0 to ncol-1 do
    if
    abs v.[i] < 1e-8 then
    u.[0..nrow-1,i..i] <- Matrix.create nrow 1 0.0
    else
    cnt <- cnt + 1
    let norm = Vector.norm (u.[0..nrow-1,i..i].ToVector())
    u.[0..nrow-1,i..i] <- u.[0..nrow-1,i..i] * ( 1.0 / float norm )
    toc("normolize u")

    It gets the following output:

    mean face 639.891700 ms
    minus mean face 195.484600 ms
    get L 12885.751500 ms
    lapack eig 247.015700 ms
    get u 7169.749100 ms
    normolize u 623.919700 ms

    The bottle neck is in "Get L” and “get u”, both of which are matrix multiplication operations. The main computation part “lapack eig” actually costs very little time (as the 280-by-280 matrix is not big).

    The matrix multiplication in “get L” is 280x10304 multiplies 10304x280, which is quite large!

    Matrix Multiplication

    Matrix multiplication is a great example to show that why we need well tuned libraries to perform numerical computations.

    Table: the time cost for self-made(not optimized) implementations

    implementations operator * for matrix type 3 F# loops(use matrix) 3 F# loops(use float Array2D) 3 C loops (native)
    time(seconds) 13.07 38.81 13.52 8.10

    I used the following 3 loops:

    for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
    double s = 0;
    for (k=0; k<M; k++)
    s += a[i][k] * b[k][j];
    c[i][j] = s;
    }

    The two F# programs are similar to the above one in C. We can find that if we use a lot of F# matrix element access operator [,], the performance is very poor. The * operator of matrix type does not have any optimization in it, so it costs the same amount of time as 3-loops (Array2D) does. Native code is faster than the managed ones as it does not do boundary checking for index.

    Matlab, R and Numpy/Python all wrap optimized matrix implementations. Let us next see how they perform:

    Table: the time cost for optimized implementations

    Software Matlab 2009a R 2.10.1 Numpy 1.3.0
    time 0.25 1.98 0.35

    Matlab is the fastest, but it uses 200% CPU. While R and Numpy are single threaded. R takes 2 seconds. Matlab uses Intel MKL, Numpy uses Lapack(maybe optimized version), I don’t know what algebra routines R uses.

    We can see the difference between optimized versions and non-optimized ones is very large. A similar report from a .Net math software vender is here. The conclusion is that we should use optimized matrix multiplication procedure when matrices are large.

    Add optimized matrix multiplication support to math-provider

    The math-provider has made an wrapper for dgemm_, which is a BLAS procedure for computing:

    C  :=
    alpha*op( A )*op( B ) + beta*C
    where op (A) = A or A’.

    This is general case of matrix multiplication. The math-provider also has an wrapper for dgemv_, matrix-vector multiplication.

    However, they are not directly callable from Lapack service provider.

    add the following code to linear_algebra_service.fs:
    let
    MM a b =

      Service().dgemm_(a,b)


    and add the following to linear_algebra.fs:

    let MM A B =
    if HaveService() then
    LinearAlgebraService.MM A B
    else
    A * B

    and add the following to linear_algebra.fsi:
    /// BLAS matrix multiplication

    val MM: A:matrix -> B:matrix -> matrix

    Now we define a local name for it:

    let mm = LinearAlgebra.MM 

    Using BLAS matrix multiplication routine costs about 4 seconds, which is faster than native C’s performance(8 seconds), but worse than Matlab’s or Numpy’s. The reason is that the BLAS(Netlib BLAS 3.1.1) is not optimized for my platform. If I use ATLAS for BLAS and use an Intel Fortran Compiler, the performance will be close to Numpy’s or Matlab’s.

    By using BLAS’s matrix multiplication routine for the face recognition, the total running reduces from 25 seconds to 9 seconds.

  • 相关阅读:
    Java自学-I/O File类
    Java自学-异常处理 自定义异常
    Java自学-异常处理 Throwable
    Java自学-异常处理 异常分类
    Java自学-异常处理 处理
    Java自学-异常处理 Exception
    Java自学-日期 Calendar
    Java自学-日期 日期格式化
    Java自学-日期 Date
    Java自学-数字与字符串 MyStringBuffer
  • 原文地址:https://www.cnblogs.com/begtostudy/p/1800944.html
Copyright © 2020-2023  润新知