Multiple threads -- GetPrimitiveArrayCritical

After fixing CUBLAS, there are no more deadlocks. I have fully parallelized the computation across threads using streams, with full use of pinned memory and async memcopies.

G1GC: Performance on a test workload running 16 threads (16-core CPU) is essentially identical between critical and no_critical (within 0.5%). However, the heap had to be increased from 1100 MB (no_critical) to 1700 MB (critical). Otherwise, G1 throws OutOfMemoryErrors.

ParallelOldGC: Performance is identical on the same workload. Heap size can be reduced to 700 MB for both versions.
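(In case anyone wants to reproduce the comparison: these are just the usual JVM flags for selecting the collector and the heap size; "Workload" is a placeholder for the actual main class.)

java -XX:+UseG1GC -Xmx1700m Workload          # critical needs the larger heap
java -XX:+UseG1GC -Xmx1100m Workload          # no_critical
java -XX:+UseParallelOldGC -Xmx700m Workload  # both versions fit in 700 MB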

The workload launches very small kernels and uses moderate CPU. I'll update this thread when I test more workloads. Note that I don't launch very long kernels; those should cause longer GC locks and have a greater impact on performance.

Those are some interesting insights. The transition from CUBLAS to “CUBLAS v2” was a little more transparent and silent on the CUDA side: they just introduced a “cublas_v2.h” header, which at some point basically replaced the “cublas.h” header. In favor of some sort of backward compatibility, I decided to reflect this in JCuda with “JCublas” and “JCublas2”.

In view of the original issue that caused this thread, I’m not (yet) entirely sure about the conclusions. From what I understood now:

  • The locks and crashes have not been directly related to the Critical methods
  • The fixes/changes of the no_critical version are not strictly necessary. Nevertheless, some of them are certainly appropriate, because using the Critical methods for things like 3-element arrays is simply not necessary (and might(!) increase the GC stalls)
  • Some other points (referring to the async methods) still have to be fixed separately

I’m a bit confused that the heap had to be increased for the critical version - or did you confuse these two when writing the above post? Also, the role of the pinned memory (also in view of the Get..Elements vs. Get..Region issue) is not yet entirely clear. And finally, I wonder whether there are other application patterns where the no_critical change might have a greater impact on performance or memory consumption. This still has to be sorted out.

In any case, I’m curious about further results of your local tests.

The switch to CUBLAS v2 isn’t that transparent. I noticed that SGEMM performance fell by 30-50% for my small matrices, and that the kernels actually launched switched from maxwell_* to magma_* (running on Kepler). I note that CUBLAS v2 allows choosing the GEMM algorithm, so I’ll refrain from complaining until I’ve tested all the algos.

Yes, the heap on G1 increased for the critical version, probably because of how the GC handles memory locking. (Perhaps it locks regions, and throws an OOM when all regions are locked instead of blocking.)

The role of pinned memory is to enable async memcopies and avoid the use of Get*Elements. All my host<->device memcopies are now async and use pinned memory. (Get*Elements/Get*Critical is presumably only used for passing parameters to kernels.)
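To illustrate the pattern, here is a stripped-down sketch (not my actual code - the names and sizes are made up, and error handling is omitted):

import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

import java.nio.ByteOrder;
import java.nio.FloatBuffer;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.cudaStream_t;

public class PinnedAsyncCopySketch
{
    public static void main(String[] args)
    {
        setExceptionsEnabled(true);

        int n = 1024;
        cudaStream_t stream = new cudaStream_t();
        cudaStreamCreate(stream);

        // Pinned (page-locked) host memory instead of a plain Java float[]
        Pointer h_data = new Pointer();
        cudaMallocHost(h_data, n * Sizeof.FLOAT);

        // Fill the pinned buffer through a FloatBuffer view
        FloatBuffer buffer = h_data.getByteBuffer(0, n * Sizeof.FLOAT)
            .order(ByteOrder.nativeOrder()).asFloatBuffer();
        for (int i = 0; i < n; i++)
        {
            buffer.put(i, i);
        }

        // Device memory and a truly asynchronous host->device copy
        Pointer d_data = new Pointer();
        cudaMalloc(d_data, n * Sizeof.FLOAT);
        cudaMemcpyAsync(d_data, h_data, n * Sizeof.FLOAT,
            cudaMemcpyHostToDevice, stream);

        // ... kernels would be launched on the same stream here ...

        cudaStreamSynchronize(stream);

        cudaFree(d_data);
        cudaFreeHost(h_data);
        cudaStreamDestroy(stream);
    }
}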

Playing with CUBLAS v2 SGEMM algos proved useless. None of them work, much less fix the performance. CUBLAS v1 works fine as long as I synchronize the calls. It might be nice to document this fact.
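(One way to "synchronize the calls", in the sense that no two threads ever issue a legacy v1 call at the same time, would roughly be the following sketch:)

// Simplified sketch (imports: jcuda.Pointer, jcuda.jcublas.JCublas).
// The legacy (v1) API works on one implicit, global handle, so the
// calls are simply serialized with a shared lock here.
private static final Object CUBLAS_V1_LOCK = new Object();

static void sgemmV1Serialized(int n, float alpha, Pointer d_A, Pointer d_B,
    float beta, Pointer d_C)
{
    synchronized (CUBLAS_V1_LOCK)
    {
        JCublas.cublasSgemm('n', 'n', n, n, n,
            alpha, d_A, n, d_B, n,
            beta, d_C, n);
    }
}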

Regarding the performance drop: This is “unusual”, at least. However, CUBLAS internally uses some very, very black magic to select different kernels for different target platforms, as indicated by the kernel name change that you mentioned - so this might be some sort of a regression…

The question regarding the pinned memory: This referred to the fact that there might be drawbacks of the current changes in the no_critical branch that you no longer notice, because you switched to pinned memory.

Playing with CUBLAS v2 SGEMM algos proved useless. None of them work, much less fix the performance.

It is not entirely clear (and not really documented, from what I see) what exactly the difference between these algorithms is. They are likely some sort of “implementation detail”. There seem to be some constraints on when each algorithm may be used. (For example, there might be an algorithm that only works for certain data types, or only for square matrices, or only for matrices whose size is a power of 2, or whatnot…)

Here on my GTX970/Win8.1, the following sample works, but the performance differences are not really significant:

/*
 * JCuda - Java bindings for NVIDIA CUDA
 *
 * Copyright 2008-2016 Marco Hutter - http://www.jcuda.org
 */
package jcuda.jcublas.samples;

import static jcuda.cudaDataType.CUDA_R_32F;
import static jcuda.jcublas.JCublas2.cublasCreate;
import static jcuda.jcublas.JCublas2.cublasDestroy;
import static jcuda.jcublas.JCublas2.cublasGemmEx;
import static jcuda.jcublas.JCublas2.cublasGetVector;
import static jcuda.jcublas.JCublas2.cublasSetVector;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO0;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO2;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO4;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO5;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO6;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO7;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_N;
import static jcuda.runtime.JCuda.cudaDeviceSynchronize;
import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;

import java.util.Arrays;
import java.util.List;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;
import jcuda.samples.utils.JCudaSamplesUtils;

/**
 * This is a sample class demonstrating the application of JCublas2 for
 * performing a BLAS 'sgemm' operation, i.e. for computing the matrix <br>
 * <code>C = alpha * A * B + beta * C</code> <br>
 * for single-precision floating point values alpha and beta, and matrices 
 * A, B and C, using the extended CUBLAS GEMM function
 */
public class JCublas2SgemmExSample
{
    public static void main(String args[])
    {
        JCublas2.setExceptionsEnabled(true);
        testSgemm(2000);
    }
    
    // The list of CUBLAS GEMM algorithms to use. Note that the set of
    // supported algorithms will likely depend on the platform, the
    // size of the matrix, and other factors.
    private static final List<Integer> GEMM_ALGORITHMS = Arrays.asList(
        CUBLAS_GEMM_ALGO2,
        CUBLAS_GEMM_ALGO4,
        CUBLAS_GEMM_ALGO5,
        CUBLAS_GEMM_ALGO6,
        CUBLAS_GEMM_ALGO7
    );
    private static int GEMM_ALGO = CUBLAS_GEMM_ALGO0;

    /**
     * Test the JCublas sgemm operation for matrices of size n x n
     * 
     * @param n The matrix size
     */
    public static void testSgemm(int n)
    {
        float alpha = 0.3f;
        float beta = 0.7f;
        int nn = n * n;

        System.out.println("Creating input data...");
        float h_A[] = JCudaSamplesUtils.createRandomFloatData(nn);
        float h_B[] = JCudaSamplesUtils.createRandomFloatData(nn);
        float h_C[] = JCudaSamplesUtils.createRandomFloatData(nn);

        System.out.println("Performing Sgemm with JCublas...");
        for (int i : GEMM_ALGORITHMS)
        {
            GEMM_ALGO = i;
            try
            {
                sgemmJCublas(n, alpha, h_A, h_B, beta, h_C);
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }

    }

    /**
     * Implementation of sgemm using JCublas
     */
    private static void sgemmJCublas(
        int n, float alpha, float A[], float B[], float beta, float C[])
    {
        int nn = n * n;

        // Create a CUBLAS handle
        cublasHandle handle = new cublasHandle();
        cublasCreate(handle);

        // Allocate memory on the device
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        Pointer d_C = new Pointer();
        cudaMalloc(d_A, nn * Sizeof.FLOAT);
        cudaMalloc(d_B, nn * Sizeof.FLOAT);
        cudaMalloc(d_C, nn * Sizeof.FLOAT);

        // Copy the memory from the host to the device
        cublasSetVector(nn, Sizeof.FLOAT, Pointer.to(A), 1, d_A, 1);
        cublasSetVector(nn, Sizeof.FLOAT, Pointer.to(B), 1, d_B, 1);
        cublasSetVector(nn, Sizeof.FLOAT, Pointer.to(C), 1, d_C, 1);

        // Execute sgemm
        Pointer pAlpha = Pointer.to(new float[] { alpha });
        Pointer pBeta = Pointer.to(new float[] { beta });
        
        long before = System.nanoTime();
        
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, 
            pAlpha, d_A, CUDA_R_32F, n, d_B, CUDA_R_32F, n, 
            pBeta, d_C, CUDA_R_32F, n, CUDA_R_32F, GEMM_ALGO);
        
        cudaDeviceSynchronize();
        
        long after = System.nanoTime();
        double durationMs = (after - before) / 1e6;
        System.out.println(
            "Algorithm " + GEMM_ALGO + " took " + durationMs + " ms");

        // Copy the result from the device to the host
        cublasGetVector(nn, Sizeof.FLOAT, d_C, 1, Pointer.to(C), 1);

        // Clean up
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        cublasDestroy(handle);
    }

}

The output is

Creating input data...
Performing Sgemm with JCublas...
Algorithm 2 took 6.009423 ms
Algorithm 4 took 5.844668 ms
Algorithm 5 took 6.196667 ms
Algorithm 6 took 5.925823 ms
Algorithm 7 took 6.093024 ms

Maybe the differences are larger for other configurations…?


In any case, you mentioned that you have small matrices. If you have many of them, did you already consider using the batched SGEMM? I just added a sample that I still had lying around here, in the jcuda-samples repository on GitHub, at JCudaSamples/src/main/java/jcuda/jcublas/samples/JCublas2SgemmBatched.java (but of course, I did not do any benchmarks here yet…)
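Roughly, the core of the batched call looks like this (a condensed sketch, assuming the same static imports as in the sample above plus cublasSgemmBatched, cudaMemcpy and cudaMemcpyHostToDevice; the sample in the repository is the reference):

    // A condensed sketch of the core of the batched call. The individual
    // matrices are assumed to be n x n, already allocated and filled on
    // the device, and passed in as arrays of their device pointers.
    private static void sgemmBatchedSketch(cublasHandle handle, int n,
        Pointer pAlpha, Pointer d_A[], Pointer d_B[], Pointer pBeta, Pointer d_C[])
    {
        int batchCount = d_A.length;

        // The batched call expects *device* arrays of device pointers
        Pointer d_Aarray = new Pointer();
        Pointer d_Barray = new Pointer();
        Pointer d_Carray = new Pointer();
        cudaMalloc(d_Aarray, batchCount * Sizeof.POINTER);
        cudaMalloc(d_Barray, batchCount * Sizeof.POINTER);
        cudaMalloc(d_Carray, batchCount * Sizeof.POINTER);
        cudaMemcpy(d_Aarray, Pointer.to(d_A),
            batchCount * Sizeof.POINTER, cudaMemcpyHostToDevice);
        cudaMemcpy(d_Barray, Pointer.to(d_B),
            batchCount * Sizeof.POINTER, cudaMemcpyHostToDevice);
        cudaMemcpy(d_Carray, Pointer.to(d_C),
            batchCount * Sizeof.POINTER, cudaMemcpyHostToDevice);

        // One call performs all n x n SGEMMs of the batch
        cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
            pAlpha, d_Aarray, n, d_Barray, n,
            pBeta, d_Carray, n, batchCount);

        cudaFree(d_Aarray);
        cudaFree(d_Barray);
        cudaFree(d_Carray);
    }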

Thank you for the sample. It turns out that cublasSetAtomicsMode (handle, cublasAtomicsMode.CUBLAS_ATOMICS_ALLOWED) breaks cublasGemmEx. I blindly included it because the cuBLAS docs said it could boost performance.

I expanded the sample to be a bit more thorough and to use CUDA events:

/*
 * JCuda - Java bindings for NVIDIA CUDA
 *
 * Copyright 2008-2016 Marco Hutter - http://www.jcuda.org
 */
package jcuda.jcublas.samples;

import static jcuda.cudaDataType.CUDA_R_32F;
import static jcuda.jcublas.JCublas2.*;
import static jcuda.jcublas.cublasGemmAlgo.*;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_N;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_T;
import static jcuda.runtime.JCuda.*;

import java.util.Arrays;
import java.util.List;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas;
import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;
import jcuda.jcublas.cublasAtomicsMode;
import jcuda.jcublas.cublasPointerMode;
import jcuda.runtime.cudaStream_t;
import jcuda.runtime.cudaEvent_t;
import jcuda.runtime.JCuda;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.JCudaDriver;
import jcuda.samples.utils.JCudaSamplesUtils;

/**
 * This is a sample class demonstrating the application of JCublas2 for
 * performing a BLAS 'sgemm' operation, i.e. for computing the matrix <br>
 * <code>C = alpha * A * B + beta * C</code> <br>
 * for single-precision floating point values alpha and beta, and matrices 
 * A, B and C, using the extended CUBLAS GEMM function
 */
public class JCublas2SgemmExSample
{
    public static void main(String args[])
    {
        JCuda.setExceptionsEnabled(true);
        
        CUdevice device = new CUdevice();
        CUcontext context = new CUcontext();
    
        JCudaDriver.cuInit(0);
        JCudaDriver.cuDeviceGet(device, 0);
        JCudaDriver.cuCtxCreate(context, 0, device);
            
        testSgemm();
        
        JCudaDriver.cuCtxDestroy (context);
    }
    
    // Remember, matrices are in column-major order
    // and pitch is also column-major
    final static boolean TRANSPOSE_A = true;
    final static boolean TRANSPOSE_B = true;
    final static int M = 2048; // number of rows of matrix op(A) and rows of matrix C
    final static int N = 2048; // number of columns of matrix op(B) and number of columns of C
    final static int K = 2048; // number of columns of matrix op(A) and number of rows of op(B)
    final static int LDA = 2048; // pitch of matrix A (before the transpose)
    final static int LDB = 2048; // pitch of matrix B (before the transpose)
    final static int LDC = 2048; // pitch of matrix C
    final static float ALPHA = 1f;
    final static float BETA = 0f;
    
    // The list of CUBLAS GEMM algorithms to use. Note that the set of
    // supported algorithms will likely depend on the platform, the
    // size of the matrix, and other factors.
    private static final List<Integer> GEMM_ALGORITHMS = Arrays.asList(
        CUBLAS_GEMM_DFALT,
        CUBLAS_GEMM_ALGO0, // doesn't seem to ever work
        CUBLAS_GEMM_ALGO1, // doesn't seem to ever work
        CUBLAS_GEMM_ALGO2,
        CUBLAS_GEMM_ALGO3,
        CUBLAS_GEMM_ALGO4,
        CUBLAS_GEMM_ALGO5,
        CUBLAS_GEMM_ALGO6,
        CUBLAS_GEMM_ALGO7
    );

    /**
     * Test the JCublas sgemm operation, using the matrix dimensions and
     * parameters defined by the constants above
     */
    public static void testSgemm ()
    {
        System.out.println("Creating input data...");
        float h_A[] = JCudaSamplesUtils.createRandomFloatData(K * LDA);
        float h_B[] = JCudaSamplesUtils.createRandomFloatData(N * LDB);
        float h_C[] = JCudaSamplesUtils.createRandomFloatData(N * LDC);

        System.out.println("Performing Sgemm with cuBLAS v1...");
        {
            try
            {
                sgemmJCublas (h_A, h_B, h_C, false, 0);
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }
        System.out.println("Performing Sgemm with cuBLAS v2...");
        for (int algo : GEMM_ALGORITHMS)
        {
            try
            {
                sgemmJCublas (h_A, h_B, h_C, true, algo);
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }
        System.out.println();
        System.out.println("Timing is inaccurate because the GPU doesn't have time to enter high-performance state.");
        System.out.println("Abnormally quick execution time indicates failure and incompatibility of the given algorithm.");
    }
    
    /**
     * Implementation of sgemm using JCublas
     */
    private static void sgemmJCublas (float A[], float B[], float C[], boolean useJCublas2, int algo)
    {
        // Create a CUBLAS handle
        cublasHandle handle = null;
        if (useJCublas2)
        {
            handle = new cublasHandle();
            cublasCreate(handle);
            // Causes CUBLAS v2 to fail (Linux, CUDA 8)
//            cublasSetAtomicsMode (handle, cublasAtomicsMode.CUBLAS_ATOMICS_ALLOWED);
            cublasSetPointerMode (handle, cublasPointerMode.CUBLAS_POINTER_MODE_HOST);
        }
        else
            JCublas.cublasInit();

        // Allocate memory on the device
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        Pointer d_C = new Pointer();
        cudaMalloc(d_A, K * LDA * Sizeof.FLOAT);
        cudaMalloc(d_B, N * LDB * Sizeof.FLOAT);
        cudaMalloc(d_C, N * LDC * Sizeof.FLOAT);

        // Copy the memory from the host to the device
        cublasSetVector (K * LDA, Sizeof.FLOAT, Pointer.to(A), 1, d_A, 1);
        cublasSetVector (N * LDB, Sizeof.FLOAT, Pointer.to(B), 1, d_B, 1);
        cublasSetVector (N * LDC, Sizeof.FLOAT, Pointer.to(C), 1, d_C, 1);

        // Execute sgemm
        Pointer pAlpha = Pointer.to(new float[] { ALPHA });
        Pointer pBeta = Pointer.to(new float[] { BETA });
        
        cudaEvent_t startEvent = new cudaEvent_t();
        cudaEvent_t stopEvent = new cudaEvent_t();
        
        cudaEventCreate (startEvent);
        cudaEventCreate (stopEvent);
        
        // run twice for a bit more stable numbers
        for (int i = 0; i < 2; i++)
        {
            cudaEventRecord (startEvent, cudaStreamLegacy);
            if (useJCublas2)
                cublasGemmEx(handle, 
                    TRANSPOSE_A ? CUBLAS_OP_T : CUBLAS_OP_N, 
                    TRANSPOSE_B ? CUBLAS_OP_T : CUBLAS_OP_N, 
                    M, N, K, 
                    pAlpha, d_A, CUDA_R_32F, LDA, d_B, CUDA_R_32F, LDB, 
                    pBeta, d_C, CUDA_R_32F, LDC, CUDA_R_32F, algo);
            else
                JCublas.cublasSgemm
                    (TRANSPOSE_A ? 't' : 'n',
                    TRANSPOSE_B ? 't' : 'n',
                    M, N, K, 
                    ALPHA, d_A, LDA, d_B, LDB, 
                    BETA, d_C, LDC);
            cudaEventRecord (stopEvent, cudaStreamLegacy);
        }
        cudaEventSynchronize (stopEvent);
        
        float[] time = new float [1];
        cudaEventElapsedTime (time, startEvent, stopEvent);
        
        if (useJCublas2)
            System.out.println(
                "cuBLAS v2 algorithm " + algo + " took " + time[0] * 1000 + " us");
        else
            System.out.println(
                "cuBLAS v1 took " + time[0] * 1000 + " us");

        // Copy the result from the device to the host
        cublasGetVector (N * LDC, Sizeof.FLOAT, d_C, 1, Pointer.to(C), 1);

        // Clean up
        cudaEventDestroy (startEvent);
        cudaEventDestroy (stopEvent);
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        if (useJCublas2)
            cublasDestroy(handle);
        else
            JCublas.cublasShutdown();
    }

}

There does seem to be a bit of a performance difference between the algos, especially when matrices are transposed:

Performing Sgemm with cuBLAS v1...
cuBLAS v1 took 1600.096 us
Performing Sgemm with cuBLAS v2...
cuBLAS v2 algorithm -1 took 1571.264 us
cuBLAS v2 algorithm 0 took 7.328 us
cuBLAS v2 algorithm 1 took 6.784 us
cuBLAS v2 algorithm 2 took 1523.84 us
cuBLAS v2 algorithm 3 took 1593.0559 us
cuBLAS v2 algorithm 4 took 1719.872 us
cuBLAS v2 algorithm 5 took 1660.3201 us
cuBLAS v2 algorithm 6 took 2125.248 us
cuBLAS v2 algorithm 7 took 2459.808 us

Timing is inaccurate because the GPU doesn't have time to enter high-performance state.
Abnormally quick execution time indicates failure and incompatibility of the given algorithm.

Good catch, the transposed/non-transposed cases are likely to influence the performance of the different algorithms here as well. If you don’t mind, I’ll consider extending the example and maybe adding it to the samples.

More generally: A benchmark of certain (J)CUBLAS functions has already been discussed in the “JCublas DSYRK and DGEMM benchmark” thread. Maybe it contains some information relevant for you, but it’s a rather long thread with some discussion. It would probably really be good to condense this into a sample and/or a “Benchmarks” page on the website.

There, I could also try to cover the batched version. I’d really like to know where the break-even point is between individual SGEMMs and batched SGEMM.

It’s difficult to allocate the time for this, but it might be of general interest after all.