Byte-Welt Forum

09.07.2010, 10:01   #1
Unregistered starter (Guest)
cern.colt.matrix and JOCL

Hi,

I want to parallelize this matrix calculation with OpenCL / JOCL:

static void cal(DoubleMatrix1D Ri_Y, DoubleMatrix2D Mi_YY) {
    for (int r = Ri_Y.size() - 1; r >= 0; r--) {
        Ri_Y.setQuick(r, expE(Ri_Y.getQuick(r)));
        if (Mi_YY != null) {
            for (int c = Mi_YY.columns() - 1; c >= 0; c--) {
                Mi_YY.setQuick(r, c, expE(Mi_YY.getQuick(r, c)));
            }
        }
    }
}


The matrices are from cern.colt.matrix.

My first question is how to map these Java matrix objects to an OpenCL structure.

Regards

09.07.2010, 13:42   #2
Marco13
Registered since: 05.08.2008
Posts: 378
Re: cern.colt.matrix and JOCL

Hello

So you want to replace all elements of a vector and a matrix with the value expE(element)?

Note that, as far as I know, there are currently no OpenCL implementations that support double precision (except, maybe, on MacOS?). So the values of the matrix will probably have to be converted to a 1D array of float values for the computation.

There may be several ways to achieve this, and it's hard to tell which is the best one beforehand. A first approach would be to simply walk through the matrix and write the values into a float array
Code:
for (int r=0; r<rows; r++)
{
    for (int c=0; c<cols; c++)
    {
        floatArray[c+r*cols] = (float)matrix.getQuick(r,c);
    }
}
then copy this array into a cl_mem object, and pass this to the OpenCL kernel, which may be executed by "floatArray.length" threads. Afterwards, the values may be written back from the cl_mem into the array, and finally back to the Matrix using setQuick.
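
The round trip described above (matrix to float array, computation, array back to matrix) can be sketched in plain Java. This is only a sketch under assumptions: a `double[][]` stands in for the Colt matrix, the class and method names are made up for illustration, and the OpenCL step is replaced by a CPU loop that uses `Math.exp` as a placeholder for expE.

```java
final class MatrixFlattenSketch {

    // Copies a rows x cols matrix into a 1D float array in row-major order:
    // element (r, c) ends up at index c + r * cols.
    static float[] flatten(double[][] matrix, int rows, int cols) {
        float[] flat = new float[rows * cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                flat[c + r * cols] = (float) matrix[r][c];
            }
        }
        return flat;
    }

    // Writes the values of the 1D array back into the matrix (inverse of flatten).
    static void unflatten(float[] flat, double[][] matrix, int rows, int cols) {
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                matrix[r][c] = flat[c + r * cols];
            }
        }
    }

    // Placeholder for the OpenCL kernel: one "work-item" per array element.
    // Math.exp is an assumption standing in for expE.
    static void expInPlace(float[] values) {
        for (int i = 0; i < values.length; i++) {
            values[i] = (float) Math.exp(values[i]);
        }
    }

    public static void main(String[] args) {
        double[][] m = {{0.0, 1.0, 2.0}, {3.0, 4.0, 5.0}};
        float[] flat = flatten(m, 2, 3);
        expInPlace(flat);
        unflatten(flat, m, 2, 3);
        System.out.println(java.util.Arrays.deepToString(m));
    }
}
```

With a real Colt matrix, the array accesses in flatten and unflatten would become getQuick and setQuick calls.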

Since you did not use anything like getNonZeros in the existing code, I assume that the matrices are dense (or more specifically, that they are of the type DenseDoubleMatrix1D/2D). IF the OpenCL implementation supported double values, you could even consider using a specific subclass of DenseDoubleMatrix that exposes the internally used array of values via a getter method. This is possible because the array is only protected, not private. It would save the effort of the loop from above, since you could copy this array directly into a cl_mem object.
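
The subclass idea can be illustrated without Colt on the classpath. The sketch below is an assumption-laden stand-in: `DenseVectorStandIn` is NOT the real Colt class, it only mirrors the fact that the backing array is a protected field, and `ExposedVector` and `getElements` are hypothetical names. With the real classes, matrix views may use offsets and strides, so the raw array does not necessarily match the logical layout.

```java
// Stand-in for a dense Colt vector (NOT the real class). It only mirrors
// the relevant detail: the backing array is protected, not private.
class DenseVectorStandIn {
    protected double[] elements;

    DenseVectorStandIn(int size) {
        elements = new double[size];
    }

    double getQuick(int index) {
        return elements[index];
    }

    void setQuick(int index, double value) {
        elements[index] = value;
    }
}

// Hypothetical subclass that exposes the backing array through a getter.
final class ExposedVector extends DenseVectorStandIn {
    ExposedVector(int size) {
        super(size);
    }

    // Returns the internal array itself, not a copy.
    double[] getElements() {
        return elements;
    }
}

final class ExposedVectorDemo {
    public static void main(String[] args) {
        ExposedVector v = new ExposedVector(3);
        v.setQuick(1, 2.5);
        double[] raw = v.getElements();
        // Changes made on the raw array are visible through the vector,
        // so no element-wise copy loop is needed in either direction.
        raw[1] = Math.exp(raw[1]);
        System.out.println(v.getQuick(1));
    }
}
```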

bye

13.07.2010, 07:29   #3
Unregistered starter (Guest)
Re: cern.colt.matrix and JOCL

Hi Marco,

First, thanks a lot for your very detailed response.

Quote:
So you want to replace all elements of a vector and a matrix with the value expE(element)?

Yes, this task is called 200,000 times per hour in a machine learning program.

Thanks for everything.

13.07.2010, 10:42   #4
Marco13
Re: cern.colt.matrix and JOCL

Some more details might be helpful, e.g.
- whether this is a sparse or a dense matrix
- whether it HAS to be stored and/or computed in double precision
- whether this step or additional operations may be processed solely on the graphics card

For example, if you have a large sparse matrix which HAS to be in double precision, and the operation you described is the only one that may be performed on the GPU, the speedup might not be so great. But if you have a dense matrix with float entries, and you do NOT have to copy the data between the host and the device in each step, this could be more beneficial.
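
One way to judge whether float precision would be acceptable is to measure, on the CPU, how far a float computation drifts from the double result. The following is a rough sketch, not a statement about any particular GPU: `Math.exp` is an assumption standing in for expE, the class and method names are made up, and the exp of an actual OpenCL device may have different accuracy than the Java one.

```java
final class FloatPrecisionCheck {

    // Largest relative error made when the input is rounded to float and
    // the exp result is stored as float, compared with the double result.
    // Assumes the inputs are in a range where exp neither underflows to 0
    // nor overflows in double precision.
    static double maxRelativeError(double[] values) {
        double worst = 0.0;
        for (double v : values) {
            double exact = Math.exp(v);
            float approx = (float) Math.exp((float) v);
            double relative = Math.abs(approx - exact) / Math.abs(exact);
            worst = Math.max(worst, relative);
        }
        return worst;
    }

    public static void main(String[] args) {
        // Hypothetical sample: 1000 evenly spaced values in [-10, 10].
        double[] sample = new double[1000];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = -10.0 + 20.0 * i / (sample.length - 1);
        }
        System.out.println("max relative error: " + maxRelativeError(sample));
    }
}
```

Running this on values taken from the real matrices, rather than the synthetic sample, gives a more meaningful number.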

15.07.2010, 10:32   #5
Unregistered starter (Guest)
Re: cern.colt.matrix and JOCL

Hi Marco,

I use dense matrices.
The matrices use double; I need to check whether I can use float instead of double.


I made a very simple benchmark:

1. Convert the 1D matrix to a float array.

2. Run the exp calculation on the GPU with this OpenCL code:

private static String programSource =
    "__kernel void "
    + "sampleKernel(__global const float *a,"
    + " __global float *c)"
    + "{"
    + " int gid = get_global_id(0);"
    + " c[gid] = exp(a[gid]) ;"
    + "}";

The first result is bad: the OpenCL code is 10 times slower. But my configuration is poor: two pseudo GPUs (ATI Stream).

The code measured for this execution time:

// Set the arguments for the kernel
clSetKernelArg(kernel, 0,
    Sizeof.cl_mem, Pointer.to(memObjects[0]));
clSetKernelArg(kernel, 1,
    Sizeof.cl_mem, Pointer.to(memObjects[1]));
System.out.println("clSetKernelArg");

// Set the work-item dimensions
long global_work_size[] = new long[]{nb};
long local_work_size[] = new long[]{1};

// Execute the kernel
clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
    global_work_size, local_work_size, 0, null, null);
System.out.println("Execute the kernel");

// Read the output data
clEnqueueReadBuffer(commandQueue, memObjects[1], CL_TRUE, 0,
    n * Sizeof.cl_float, dst, 0, null, null);


I have a question: how can I reuse the same OpenCL program in each iteration without reinitializing the OpenCL context?

I will run this benchmark on an NVIDIA card.

Thanks a lot

kim

15.07.2010, 19:00   #6
Marco13
Re: cern.colt.matrix and JOCL

Hello,

As I mentioned, there are several aspects that may influence the speedup that can be achieved. When you have Java code like
Java Code:
static void cal(DoubleMatrix1D Ri_Y) {
    for (int r = Ri_Y.size() - 1; r >= 0; r--) {
        Ri_Y.setQuick(r, expE(Ri_Y.getQuick(r)));
    }
}
and convert it to use OpenCL, ending up with something like
Java Code:
static void cal(DoubleMatrix1D Ri_Y)
{
    // Copy matrix into array
    float array[] = new float[Ri_Y.size()];
    for (int r = Ri_Y.size() - 1; r >= 0; r--) {
        array[r] = (float)Ri_Y.getQuick(r);
    }

    // Create memory object from the array
    cl_mem mem = clCreateBuffer(... Pointer.to(array), ...);

    // Set up arguments and call the kernel
    ...
    // Copy back the result to the array
    clEnqueueReadBuffer(..., mem, Pointer.to(array)...);

    // Array into matrix (setQuick needs the index and the value)
    for (int r = Ri_Y.size() - 1; r >= 0; r--) {
        Ri_Y.setQuick(r, array[r]);
    }
}
then it will most likely be slower than the plain Java implementation. The GPU is especially fast for computations that require lots of arithmetic (or the built-in functions, like the ones for trigonometry). In the example above, the computation is memory bound, and most of the time will be spent copying memory between the host and the device. That's why I mentioned that it would be good when... you do NOT have to copy the data between the host and the device in each step.
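
To see what the OpenCL version has to beat, the plain Java loop can be timed on its own. The following is a rough sketch under assumptions: `Math.exp` stands in for expE, the array size and repetition counts are arbitrary, and the names are made up. For a fair comparison, the time measured for the OpenCL path should include writing the buffer, running the kernel, and reading the buffer back.

```java
final class CpuExpBaseline {

    // Plain Java version of the element-wise operation.
    // Math.exp is an assumption standing in for expE.
    static void expInto(double[] src, double[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Math.exp(src[i]);
        }
    }

    // Average wall-clock time per call in nanoseconds. A few warm-up
    // calls run first so that the JIT compiler has a chance to kick in.
    static double averageNanos(double[] src, double[] dst, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            expInto(src, dst);
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            expInto(src, dst);
        }
        return (System.nanoTime() - start) / (double) runs;
    }

    public static void main(String[] args) {
        // Hypothetical input: 100,000 values in [0, 1).
        double[] src = new double[100_000];
        for (int i = 0; i < src.length; i++) {
            src[i] = i / (double) src.length;
        }
        double[] dst = new double[src.length];
        System.out.println("avg ns per call: " + averageNanos(src, dst, 50, 200));
    }
}
```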

Quote:
I have a question: how can I reuse the same OpenCL program in each iteration without reinitializing the OpenCL context?

Yes, of course you can call the same program multiple times, and you should definitely do that. Initializing a new context might be very time-consuming. The basic structure of your program could roughly (!) look like this:
Java Code:
class CLCode
{
    // Private CL specific variables
    private cl_command_queue commandQueue;
    private cl_context context;
    private cl_kernel kernel;

    // Possibly you could also declare the cl_mem object here
    cl_mem mem;

    public void initialize()
    {
        // Initialize the context, command queue and kernel here
        ...
        // If the size of the cl_mem does not change between
        // the calls, you could also initialize the memory object here
        ...
    }

    public void compute(float array[])
    {
        // Write the array data into the cl_mem object
        clEnqueueWriteBuffer(..., mem, Pointer.to(array)...);

        // Set up the arguments and execute the kernel
        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(mem));
        ...
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            global_work_size, local_work_size, 0, null, null);

        // Read the cl_mem object back into the array
        clEnqueueReadBuffer(..., mem, Pointer.to(array)...);
    }

}

That way, the actual "compute" method only has to copy the data to the device, execute the kernel, and copy the data back to Java.

bye