Multiplying and inverting matrices

EDITED: The first posts have been moved here by splitting this thread.

Thank you very much. I will get Visual Studio to finish my first CUDA implementation.

I don't want to disturb you, but I will tell you what I need to finish my work (maybe you have something similar):

  • a Java method to multiply two matrices with CUDA parallelism (N-way parallel processing, where N is a parameter of the method): public double[][] multiply(double[][] a, double[][] b, int N)
  • a Java method to calculate the inverse of a matrix (N-way parallel processing, where N is a parameter of the method): public double[][] inverse(double[][] a, int N)

Now I will work on the compilation with Visual Studio, but I would appreciate it if you helped me with my real problem.

Thanks

Hello Franja,

First of all: If you want efficient matrix operations, you should have a look at CUBLAS: It already offers all the BLAS functions. Additionally, JCublas is a runtime library, so you don’t have to write or compile your own kernel to use it.

But if you want to (or have to) write the kernel on your own, a few hints:

  • If you want to use 'double' precision, you need one of the newest GPUs - the older ones only support single 'float' precision.

  • This could also become a FAQ entry :wink: : It's usually not so beneficial to work with 2D Java arrays. I wrote about 2D arrays in this post. In short: On the CUDA side, the matrices have to be 1D arrays. It would be easier if you did not have to convert the 2D Java arrays to 1D arrays for CUDA, but instead could represent the matrices as 1D arrays from the start (a small sketch of such a conversion follows after these hints), like in
    public double[] multiply(double[] a, double[] b, int M, int N)

  • For your own implementation of a matrix multiplication, you may want to have a look at the corresponding NVIDIA CUDA sample, which could serve as a starting point.

  • For the matrix inversion, I currently do not know a good approach. I know that this function is contained in the CULA library (or as OpenCL version in ViennaCL, which might possibly be ported to CUDA), but don’t know the specific implementation or whether it may be helpful for your task.
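
Just to illustrate the point about the array layout (this is only a small, untested sketch of mine, not code from an existing sample): a 2D Java array could be flattened into the 1D, column-major layout that CUBLAS expects like this:

```
// Hypothetical helper, only to illustrate the layout: flattens a 2D Java
// array into a 1D array in column-major order, as CUBLAS expects it.
private static double[] toColumnMajor1D(double a[][])
{
    int rows = a.length;
    int cols = a[0].length;
    double result[] = new double[rows * cols];
    for (int c = 0; c < cols; c++)
    {
        for (int r = 0; r < rows; r++)
        {
            result[c * rows + r] = a[r][c];
        }
    }
    return result;
}
```

Keeping the data in this 1D form from the start avoids doing such a copy for every call.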

bye
Marco

Hi Marco:

I do not really need to implement the two methods on my own; I simply need to use them in my project.

I think the CUBLAS library would be perfect, but I have no idea what it is or how to call those methods that, according to you, already exist …

If you can, please do me a favor and give me a small class (with a main method) with the library imports and an example of calling the methods.

Thank you very much
Fran

Hello Franja,

The JCublas library is already included in JCuda. You might want to have a look at the JCublas Sample from the website: It performs a multiplication
C = alpha * A * B + beta * C
where A,B and C are matrices and ‘alpha’ and ‘beta’ are scalar values.

The functions of CUBLAS have … somewhat 'weird' names (for historical reasons): This function is called "Sgemm" (or "Dgemm" for double precision). The full list of available functions is given in the PDF file with the CUBLAS documentation, or in the JCublas JavaDoc.

So for your first task, the matrix multiplication, you could use "cublasDgemm", and this should not be so hard.
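
To give you a rough idea (this is only an untested sketch of mine, assuming square n x n matrices that are already stored as 1D, column-major double arrays), the whole thing could look roughly like this:

```
// Untested sketch: computes c = a * b in double precision, where a, b and c
// are 1D arrays of length n*n in column-major order.
private static void multiplyDouble(double a[], double b[], double c[], int n)
{
    JCublas.cublasInit();

    Pointer dA = new Pointer();
    Pointer dB = new Pointer();
    Pointer dC = new Pointer();
    JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dA);
    JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dB);
    JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dC);

    JCublas.cublasSetVector(n * n, Sizeof.DOUBLE, Pointer.to(a), 1, dA, 1);
    JCublas.cublasSetVector(n * n, Sizeof.DOUBLE, Pointer.to(b), 1, dB, 1);

    // c = 1.0 * a * b + 0.0 * c
    JCublas.cublasDgemm('n', 'n', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n);

    JCublas.cublasGetVector(n * n, Sizeof.DOUBLE, dC, 1, Pointer.to(c), 1);

    JCublas.cublasFree(dA);
    JCublas.cublasFree(dB);
    JCublas.cublasFree(dC);
    JCublas.cublasShutdown();
}
```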

The second task, the matrix inversion, may be more challenging. The function that is necessary for that would be “DGETRF”, which is unfortunately not contained in CUBLAS. It’s a hot topic, and requested frequently. It should be possible to create it using the CUBLAS functions, by porting the existing Fortran code. Maybe I’ll find the time to give it a try. It might be interesting, but I can’t promise anything.

bye
Marco

Thank you very much Marco:

I will look carefully at the JCublas library and use the method you propose for the matrix multiplication.

And for the matrix inverse, I will try asking a friend who studies math whether I can calculate the inverse using some combination of the methods which CUBLAS offers.

I'll keep you informed of my progress, but I would highly appreciate it if you could continue helping me with the example that I asked for yesterday (if you can, of course).

Thanks
Fran

Hi Marco:

I have found a simple example of the use of the CUBLAS multiply function.

This implementation calls the cublasSgemm method and also implements the same operation in plain Java (without CUDA parallelism). I have added a routine for timing the functions. Here are the results:

Creating input data…
Performing Sgemm with Java…
Tiempo de ejecución igual a 7082 milisegundos
Performing Sgemm with JCublas…
Tiempo de ejecución igual a 407 milisegundos
testSgemm PASSED // (the implementation checks the results too)

That's great.

Here is the implementation:

import java.util.Random;

import jcuda.*;
import jcuda.jcublas.JCublas;


public class JCublasSample
{
    public static void main(String args[])
    {
        testSgemm(1000);
    }

    public static void testSgemm(int n)
    {
        float alpha = 1.0f;
        float beta = 0.0f;
        int nn = n * n;

        System.out.println("Creating input data...");
        float h_A[] = createRandomFloatData(nn);
        float h_B[] = createRandomFloatData(nn);
        float h_C[] = createRandomFloatData(nn);
        float h_C_ref[] = h_C.clone();

        System.out.println("Performing Sgemm with Java...");
        long A=System.currentTimeMillis();
        sgemmJava(n, alpha, h_A, h_B, beta, h_C_ref);
        long B=System.currentTimeMillis()-A;
        System.out.println("Tiempo de ejecución igual a "+B+" milisegundos");
        System.out.println("Performing Sgemm with JCublas...");
        A=System.currentTimeMillis();
        sgemmJCublas(n, alpha, h_A, h_B, beta, h_C);
        B=System.currentTimeMillis()-A;
        System.out.println("Tiempo de ejecución igual a "+B+" milisegundos");
        boolean passed = isCorrectResult(h_C, h_C_ref);
        System.out.println("testSgemm "+(passed?"PASSED":"FAILED"));
    }

    private static void sgemmJCublas(int n, float alpha, float A[], float B[],
                    float beta, float C[])
    {
        int nn = n * n;

        // Initialize JCublas
        JCublas.cublasInit();

        // Allocate memory on the device
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        Pointer d_C = new Pointer();
        JCublas.cublasAlloc(nn, Sizeof.FLOAT, d_A);
        JCublas.cublasAlloc(nn, Sizeof.FLOAT, d_B);
        JCublas.cublasAlloc(nn, Sizeof.FLOAT, d_C);

        // Copy the memory from the host to the device
        JCublas.cublasSetVector(nn, Sizeof.FLOAT, Pointer.to(A), 1, d_A, 1);
        JCublas.cublasSetVector(nn, Sizeof.FLOAT, Pointer.to(B), 1, d_B, 1);
        JCublas.cublasSetVector(nn, Sizeof.FLOAT, Pointer.to(C), 1, d_C, 1);

        // Execute sgemm
        JCublas.cublasSgemm(
            'n', 'n', n, n, n, alpha, d_A, n, d_B, n, beta, d_C, n);

        // Copy the result from the device to the host
        JCublas.cublasGetVector(nn, Sizeof.FLOAT, d_C, 1, Pointer.to(C), 1);

        // Clean up
        JCublas.cublasFree(d_A);
        JCublas.cublasFree(d_B);
        JCublas.cublasFree(d_C);

        JCublas.cublasShutdown();
    }

    private static void sgemmJava(int n, float alpha, float A[], float B[],
                    float beta, float C[])
    {
        for (int i = 0; i < n; ++i)
        {
            for (int j = 0; j < n; ++j)
            {
                float prod = 0;
                for (int k = 0; k < n; ++k)
                {
                    prod += A[k * n + i] * B[j * n + k];
                }
                C[j * n + i] = alpha * prod + beta * C[j * n + i];
            }
        }
    }


    private static float[] createRandomFloatData(int n)
    {
        Random random = new Random();
        float x[] = new float[n];
        for (int i = 0; i < n; i++)
        {
            x[i] = random.nextFloat();
        }
        return x;
    }


    private static boolean isCorrectResult(float result[], float reference[])
    {
        float errorNorm = 0;
        float refNorm = 0;
        for (int i = 0; i < result.length; ++i)
        {
            float diff = reference[i] - result[i];
            errorNorm += diff * diff;
            refNorm += reference[i] * result[i];
        }
        errorNorm = (float) Math.sqrt(errorNorm);
        refNorm = (float) Math.sqrt(refNorm);
        if (Math.abs(refNorm) < 1e-6)
        {
            return false;
        }
        return (errorNorm / refNorm < 1e-6f);
    }
}


You can see that the function really computes
C = alpha * A * B + beta * C
so in my example alpha = 1.0f and beta = 0.0f (I think that is sufficient, but in the future I won't initialize C).

I don't know whether you will be able to find something about the matrix inverse, but I think there is nothing in the CUBLAS library.

Bye
Fran

Hello Franja,

That’s what I already wrote in this post :wink:

The example uses square matrices. I've already tried to generalize it, so that it is closer to the function you are looking for:

    /**
     * Multiplies the single-precision floating point matrices A and B and
     * writes the result into C.
     * 
     * @param A Matrix A
     * @param rA Number of rows of matrix A
     * @param cA Number of columns of matrix A
     * @param B Matrix B
     * @param cB Number of columns of matrix B
     * @param C Matrix C - must have rA rows and cB columns!
     */
    private static void multiply(float A[], int rA, int cA, float B[], int cB, float C[])
    {
        int rB = cA;
        int rC = rA;
        int cC = cB;

        Pointer dA = new Pointer();
        Pointer dB = new Pointer();
        Pointer dC = new Pointer();

        JCublas.cublasInit();

        JCublas.cublasAlloc(rA*cA, Sizeof.FLOAT, dA);
        JCublas.cublasAlloc(rB*cB, Sizeof.FLOAT, dB);
        JCublas.cublasAlloc(rC*cC, Sizeof.FLOAT, dC);
        JCublas.cublasSetVector(rA*cA, Sizeof.FLOAT, Pointer.to(A), 1, dA, 1);
        JCublas.cublasSetVector(rB*cB, Sizeof.FLOAT, Pointer.to(B), 1, dB, 1);

        // C = 1 * A * B + 0 * C,
        JCublas.cublasSgemm('n', 'n', rA, cB, cA, 1, dA, rA, dB, rB, 0, dC, rC);

        JCublas.cublasGetVector(rC*cC, Sizeof.FLOAT, dC, 1, Pointer.to(C), 1);
        JCublas.cublasFree(dA);
        JCublas.cublasFree(dB);
        JCublas.cublasFree(dC);
        JCublas.cublasShutdown();
    }
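
A small usage example (my own addition, just to show the expected layout): CUBLAS assumes column-major storage, so a 2x3 matrix with rows (1,2,3) and (4,5,6) is stored as {1, 4, 2, 5, 3, 6}:

```
float A[] = { 1, 4, 2, 5, 3, 6 };      // 2 rows, 3 columns, column-major
float B[] = { 7, 8, 9, 10, 11, 12 };   // 3 rows, 2 columns, column-major
float C[] = new float[2 * 2];          // result: 2 rows, 2 columns
multiply(A, 2, 3, B, 2, C);
// C is now { 50, 122, 68, 167 } in column-major order, i.e.
// |  50   68 |
// | 122  167 |
```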

But of course, you should consider two things:

  • If you intend to call this function several times with matrices of the same size, you could allocate the memory only once and reuse it (you would not have to call cublasAlloc and cublasFree each time).
  • If you intend to call this function, or any other function which contains JCublas calls, several times, then you should call cublasInit and cublasShutdown only once, outside of the functions - roughly as sketched below.
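
Sketched very roughly (untested - inputsA, inputsB and results are just placeholders for however your application provides the matrices, each stored as a flattened, column-major float[]):

```
// Untested sketch: the same multiplication as above, but with cublasInit/
// cublasShutdown and the device allocations done only once for many calls.
// All input matrices have size rA x cA and cA x cB, respectively.
private static void multiplyMany(float inputsA[][], float inputsB[][], float results[][],
                int rA, int cA, int cB)
{
    JCublas.cublasInit();                               // once

    Pointer dA = new Pointer();
    Pointer dB = new Pointer();
    Pointer dC = new Pointer();
    JCublas.cublasAlloc(rA * cA, Sizeof.FLOAT, dA);     // allocate once
    JCublas.cublasAlloc(cA * cB, Sizeof.FLOAT, dB);
    JCublas.cublasAlloc(rA * cB, Sizeof.FLOAT, dC);

    for (int step = 0; step < inputsA.length; step++)
    {
        JCublas.cublasSetVector(rA * cA, Sizeof.FLOAT, Pointer.to(inputsA[step]), 1, dA, 1);
        JCublas.cublasSetVector(cA * cB, Sizeof.FLOAT, Pointer.to(inputsB[step]), 1, dB, 1);
        JCublas.cublasSgemm('n', 'n', rA, cB, cA, 1.0f, dA, rA, dB, cA, 0.0f, dC, rA);
        JCublas.cublasGetVector(rA * cB, Sizeof.FLOAT, dC, 1, Pointer.to(results[step]), 1);
    }

    JCublas.cublasFree(dA);                             // free once
    JCublas.cublasFree(dB);
    JCublas.cublasFree(dC);
    JCublas.cublasShutdown();                           // once
}
```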

Concerning the matrix inversion: This is far from trivial (maybe that's why you don't find it with a simple web search :wink: ). I started porting the Fortran code, but it contains further functions which have to be ported as well. And some of them should probably be written as a kernel (so it might be better to use a separate CUBIN file for these, because they are not contained in CUBLAS). For a first version, they could be done on the host, although this would not be as efficient as a pure CUDA solution.
I’ll possibly try to continue with this (since it would be a very nice sample application for JCuda :slight_smile: ) but I’m quite busy at the moment and have a lot of other tasks, so I’m not sure when I can finish it…

bye
Marco

Hello Marco:
Wow, your implementation is really perfect. I will use yours, of course.
Regarding your considerations, I think that this method will be used only once in my overall implementation, so I will use your code exactly as you have written it.
About the matrix inversion… it's too difficult; I have continued with my investigation. I asked about it in a CULA forum and someone answered me: "This is not officially supported, but I would imagine you could do it with JNI because CULA is a C-language DLL / shared library."
However, I think that you are right and the best option is to use our own CUBIN file.
Thank you very much for your help.
Fran

Hi

Well, if you intend to write it completely on your own in a CUBIN of your own, it's more than challenging (I think even more so than trying to assemble it from some CUBLAS functions). You probably can not "just write down" an optimized, massively parallel LU decomposition with pivoting, coalescing and shared memory usage on a few rainy afternoons - that's a lot of work :o
However, I will at least try to do it with the help of some CUBLAS functions in the next few days - but again: I can’t promise anything.

BTW: How large are the matrices that you want to multiply and invert?

bye
Marco

Hello Marco:

You are right. In fact I wanted to say exactly what you posted last night, but my English is very bad.

I think, like you, that this method can be built from other CUBLAS methods and, maybe, some methods in a CUBIN implementation (if that is necessary).

About your question: the matrices that I will multiply and invert will be very large (maybe hundreds of thousands of rows and columns)… why?

Bye
Fran

Hi

When the matrix is so large, you will run into problems anyway. Earlier versions of CUBLAS had a limitation to ~4000x4000 matrices, but even if newer versions no longer have this limit, you will run out of memory. When your GPU has 1 GB RAM, a single matrix could not be larger than roughly 16000x16000 in single precision (or roughly 11000x11000 in double precision). Even that would not be realistic, since a multiplication needs three matrices at once - it would not surprise me if the practical limit was something like 9000x9000 or so…
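
The back-of-the-envelope calculation behind this: 16384 * 16384 * 4 bytes = 1 GiB, so a single 16384x16384 single-precision matrix already fills 1 GB on its own, and Sgemm needs A, B and C in memory at the same time.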

This may not be such a big problem for the multiplication: You could, for example, use the Strassen Algorithm and apply the CUBLAS functions only to the submatrices. But I have no idea how you could invert such a large matrix…
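
To sketch the idea (this is the plain 2x2 block decomposition - Strassen then reduces the eight block products to seven): if A and B are each split into four sub-blocks, then

C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22

and each of these block products is small enough to be computed with cublasSgemm on the card, while the additions and the bookkeeping can stay on the host.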

An interesting question would then also be whether your matrices are sparse or dense. When they are so large, I assume that they will be sparse matrices. CUBLAS can only handle dense matrices, and for sparse matrices, the algorithms are even more challenging…

BTW: I think I will move the posts that are about matrix operations into a thread of their own, since this has nothing to do with the UnsatisfiedLinkError, and the discussion about matrix multiplication and inversion may also be interesting for others.

bye
Marco

Hi Marco:

The size of the matrices is not a problem (I will calculate the inverses of submatrices and multiply submatrices). Your method is perfect.

I don't know what "sparse matrices" and "dense matrices" are. I asked a friend (who is a math teacher) and he doesn't know either. My matrices always have an inverse…

Bye

Hi

I have just read something about dense and sparse matrices, and I can answer that my matrices are dense.

Bye

Hi Marco

I have been investigating the best inversion algorithms and I have found two good ones.

The first one is Shipley's method:

```
public static float[][] inverse(float[][] a) { // (method signature added here; the post started inside the method body)
	// deep copy, so that the input matrix is not modified
	// (a.clone() alone would only copy the row references)
	float[][] resul = new float[a.length][];
	for (int r = 0; r < a.length; r++)
		resul[r] = a[r].clone();
	if (a.length == a[0].length) {
		int n = resul.length;
		int k, j, i;
		for (k = 0; k < n; k++) {
			for (i = 0; i < n; i++)
				for (j = 0; j < n; j++) {
					if ((i != k) && (j != k))
						resul[i][j] -= (resul[i][k] * resul[k][j]) / resul[k][k];
				}
			for (j = 0; j < n; j++) {
				if (j != k)
					resul[k][j] = -resul[k][j] / resul[k][k];
			}
			for (i = 0; i < n; i++) {
				if (i != k)
					resul[i][k] = resul[i][k] / resul[k][k];
			}
			resul[k][k] = 1 / resul[k][k];
		}
		return resul;
	} else
		System.out.println("no se pudo"); // "could not be done"
	return null;
}
```

And the second one is:

```
import java.util.*;

public class Matriz {

    public static double[][] matrizInversa(double[][] matriz) {
        double det = 1 / determinante(matriz);
        double[][] nmatriz = matrizAdjunta(matriz);
        multiplicarMatriz(det, nmatriz);
        return nmatriz;
    }

    public static void multiplicarMatriz(double n, double[][] matriz) {
        for (int i = 0; i < matriz.length; i++)
            for (int j = 0; j < matriz.length; j++)
                matriz[i][j] *= n;
    }

    public static double[][] matrizAdjunta(double[][] matriz) {
        return matrizTranspuesta(matrizCofactores(matriz));
    }

    public static double[][] matrizCofactores(double[][] matriz) {
        double[][] nm = new double[matriz.length][matriz.length];
        for (int i = 0; i < matriz.length; i++) {
            for (int j = 0; j < matriz.length; j++) {
                double[][] det = new double[matriz.length - 1][matriz.length - 1];
                double detValor;
                for (int k = 0; k < matriz.length; k++) {
                    if (k != i) {
                        for (int l = 0; l < matriz.length; l++) {
                            if (l != j) {
                                int indice1 = k < i ? k : k - 1;
                                int indice2 = l < j ? l : l - 1;
                                det[indice1][indice2] = matriz[k][l];
                            }
                        }
                    }
                }
                detValor = determinante(det);
                nm[i][j] = detValor * (double) Math.pow(-1, i + j + 2);
            }
        }
        return nm;
    }

    public static double[][] matrizTranspuesta(double[][] matriz) {
        double[][] nuevam = new double[matriz[0].length][matriz.length];
        for (int i = 0; i < matriz.length; i++) {
            for (int j = 0; j < matriz.length; j++)
                nuevam[i][j] = matriz[j][i];
        }
        return nuevam;
    }

    public static double determinante(double[][] matriz) {
        double det;
        if (matriz.length == 2) {
            det = (matriz[0][0] * matriz[1][1]) - (matriz[0][1] * matriz[1][0]);
            return det;
        }
        double suma = 0;
        for (int i = 0; i < matriz.length; i++) {
            double[][] nm = new double[matriz.length - 1][matriz.length - 1];
            for (int j = 0; j < matriz.length; j++) {
                if (j != i) {
                    for (int k = 1; k < matriz.length; k++) {
                        int indice = -1;
                        if (j < i)
                            indice = j;
                        else if (j > i)
                            indice = j - 1;
                        nm[indice][k - 1] = matriz[j][k];
                    }
                }
            }
            if (i % 2 == 0)
                suma += matriz[i][0] * determinante(nm);
            else
                suma -= matriz[i][0] * determinante(nm);
        }
        return suma;
    }

    public static void imprimirMatriz(double[][] mat) {
        for (int i = 0; i < mat.length; i++) {
            System.out.println(Arrays.toString(mat[i]));
        }
    }

    public static void main(String[] args) {
        double[][] deltaS = { {1,1,1,1,1}, {0,0,1,0,1}, {0,-1,0,1,0}, {1,-2,0,0,0}, {0,1,1,0,0} };
        double[][] deltaA = { {30,1,1,1,1}, {14,0,1,0,1}, {1,-1,0,1,0}, {-1,-2,0,0,0}, {10,1,1,0,0} };
        double[][] deltaB = { {1,30,1,1,1}, {0,14,1,0,1}, {0,1,0,1,0}, {1,-1,0,0,0}, {0,10,1,0,0} };
        double[][] deltaC = { {1,1,30,1,1}, {0,0,14,0,1}, {0,-1,1,1,0}, {1,-2,-1,0,0}, {0,1,10,0,0} };
        double[][] deltaD = { {1,1,1,30,1}, {0,0,1,14,1}, {0,-1,0,1,0}, {1,-2,0,-1,0}, {0,1,1,10,0} };
        double[][] deltaE = { {1,1,1,1,30}, {0,0,1,0,14}, {0,-1,0,1,1}, {1,-2,0,0,-1}, {0,1,1,0,10} };
        System.out.println(determinante(deltaA) / determinante(deltaS));
        System.out.println(determinante(deltaB) / determinante(deltaS));
        System.out.println(determinante(deltaC) / determinante(deltaS));
        System.out.println(determinante(deltaD) / determinante(deltaS));
        System.out.println(determinante(deltaE) / determinante(deltaS));
    }
}
```

I think the first one is very clear.
However, the second one is more modular, and maybe some of its methods are already available in the JCublas libraries.

Which one do you think is the best to try to implement in JCuda?

Thanks

Fran

Hello Franja,

There are many approaches for inverting matrices, and I'm not so well versed in the details of advanced matrix inversion algorithms. From the source code it is hard to understand what the two algorithms are doing. I did not find a description of the Shipley method. And what they are doing with the adjugate matrix in the second algorithm is not so obvious to me (however, the web sites I found generally say that computing the inverse via the adjugate should only be done for small matrices…)

But regardless of the "core" algorithm itself, you need a method for computing the inverse of a large matrix. Even if you ported one of these methods to CUDA/CUBLAS, you could not apply it to a 50000x50000 matrix because of the limited memory on the graphics card. This may be the primary problem: No matter which algorithm you are going to use, how do you want to handle such large matrices?

May I ask for which task you are going to use the inverse matrix? In many cases, you do not really need the inverse matrix, but only have to solve a system of linear equations. What exactly are you going to do with the inverse after you have computed it?
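
Just to illustrate what I mean: if what you finally need is, say, y = A^(-1) * b, then it is enough to solve the system A*y = b directly. With a factorization A = L*U this amounts to two triangular solves,

L*z = b    and then    U*y = z,

and the inverse itself is never formed explicitly - which is usually both faster and numerically more stable.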

bye
Marco

Hi Marco:

In certain parts of my implementation I use the inverse of the matrix to solve equations, but for the most important ones I need the inverse for other calculations, so I have to use the inverse itself.

As I told you yesterday, I do not think there are problems with the size of the matrices, because before inverting the matrices and doing the multiplication, I decompose the problem that I must solve. So the matrices will never be larger than 10000x10000 (I mentioned a huge size earlier that will probably never occur).

I know you're busy, but I'm completely lost, so if you have any idea, please tell me, so that I can investigate, and I will tell you everything I find.

Thanks
Fran

Hello Franja,

I'm not sure if it is so easy to decompose the problem. If you have a problem of size 1000x1000 and decomposing it were so easy, then you could decompose it into 1000*1000 sub-problems of size 1x1, because inverting a 1x1 matrix is not so difficult :wink:

However, if you say that it can be decomposed, then the question is still how to efficiently invert a matrix with CUDA or CUBLAS. As I said, I've already started working on this, by trying to port the SGETRF code to CUBLAS, but it's not easy. I hope that I can make some progress there. If it turns out to be too complicated, I would first try to write a single kernel for the Shipley method you posted - maybe it's possible to achieve some results there.

bye
Marco

Hello Marco

I have found a short CUDA program to invert complex matrices through LU decomposition (without permutations) with back substitution, in device mode. This means that one whole inversion runs inside each thread.

This code was adapted from a C code in real double precision.

The code is standalone in the sense that it only requires the standard tools provided by CUDA (without CUBLAS), and it was tested on an NVIDIA GeForce 240M under Linux Fedora 11.

The program is organized into three files:

* cumatrixtools.h: which contains the method itself.
* cuinverse_kernel.cu: which contains the kernel function, which maps the method onto the threads.
* main.cu: which manages the memory and calls the kernel function.

The purpose of this program is to invert several variations of a matrix A. The matrix A is passed to the kernel, and the inverse is evaluated after modifying the first element of A. Each branch inverts a different matrix. The kernel is called with <<<2,32>>>, which means that the threads are organized in two blocks, each block with 32 threads. The more physical processors one has, the more threads per block one can assign in order to gain performance.

main.cu looks like:

#include <stdio.h>      // for printf
#include <math.h>       // for sqrt, sqrtf
#include <complex.h>
#include <cuComplex.h>

#include "cumatrixtools.h"
#include "cuinverse_kernel.cu"

void cuPrintMatrix( cuFloatComplex *C, int N, int M )
{
  int i,j;
  for(i=0;i<N;i++){
    for(j=0;j<M;j++) printf(" (%f,%f)\t ", cuCrealf(C[i*N + j]), cuCimagf(C[i*N + j]) );
    printf("\n");
  }
}

///////////////////////////////////////////////////////////////////////////////
const int N = 3;           // Matrix dimension
const int NBranches = 64;  // Number of matrix variations (one per thread)
///////////////////////////////////////////////////////////////////////////////
// Main program
///////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv){

  int i;
  float complex A[9] =  // Base matrix
  {
     0.f + I/sqrtf(2.f),  0.0f + I/sqrt(2.0f),  0. + 0.*I,
     0.  - I/2.        ,  0.   + I/2         ,  0. + I/sqrt(2.),
    -0.5 + 0.*I        ,  0.5  + 0.*I        , -1./sqrt(2.) + 0.*I
  };

  cuFloatComplex *h_A, *h_invA;
  cuFloatComplex *d_A, *d_invA, *d_WorkSpace;

  printf("...allocating CPU memory.\n");
  h_A    = (cuFloatComplex *) malloc(           N*N*sizeof(cuFloatComplex));
  h_invA = (cuFloatComplex *) malloc( NBranches*N*N*sizeof(cuFloatComplex));

  printf("...allocating GPU memory.\n");
  cudaMalloc((void **)&d_A,         NBranches*N*N*sizeof(cuFloatComplex));
  cudaMalloc((void **)&d_invA,      NBranches*N*N*sizeof(cuFloatComplex));
  cudaMalloc((void **)&d_WorkSpace, NBranches*cgeMatrixInverse_WorkSpace()*N*N*sizeof(cuFloatComplex));

  printf("...Copying memory.\n");
  for(i=0;i<N*N;i++) h_A[i] = make_cuFloatComplex( crealf(A[i]), cimagf(A[i]) );
  cudaMemcpy(d_A, h_A, N*N*sizeof(cuFloatComplex), cudaMemcpyHostToDevice);

  printf("...The base matrix is:\n");
  cuPrintMatrix( h_A, N, N );

  printf("\n...Calling the kernel.\n");
  cudaThreadSynchronize();
  cgeMatrixInverse_kernel<<<2,32>>>(d_invA, d_A, N, d_WorkSpace);   // Dividing the 64 branches into 2 blocks of 32 threads
  cudaThreadSynchronize();

  // Copy back all inverses (note: sizeof(cuFloatComplex), not sizeof(float))
  cudaMemcpy(h_invA, d_invA, NBranches*N*N*sizeof(cuFloatComplex), cudaMemcpyDeviceToHost);

  printf("\n The inverse of the first branch is \n");
  cuPrintMatrix( h_invA, N, N );

  printf("\n The inverse of the second branch is \n");
  cuPrintMatrix( h_invA + N*N, N, N );

  printf("\n and so on ..\n");

  free(h_A);
  free(h_invA);

  cudaFree(d_A);
  cudaFree(d_invA);
  cudaFree(d_WorkSpace);

  cudaThreadExit();
  printf("\n-------------------------------------------------------\n");
  return 0;
}

cumatrixtools.h is:

#ifndef _CUMATRIXTOOLS_H_
#define _CUMATRIXTOOLS_H_


////////////////////////////////////////////////////////////////////
__device__ __host__ static __inline__ void
cgeSquareMatrixProduct(cuFloatComplex *out, cuFloatComplex *A, cuFloatComplex *B, int N  )
{

int i,j,k;
cuFloatComplex sum;

for(i=0;i<N;i++)

for(j=0;j<N;j++){
  //sum = 0. + 0.*I;
    sum  = make_cuFloatComplex(0.,0.);

  for(k=0;k<N;k++){
    sum =  cuCaddf(sum  , cuCmulf(A[i*N+k],B[k*N+j]) );
  }

  out[i*N+j] = sum;
}


}
//////////////////////////////////////////////////////////////////////

__device__ __host__ static __inline__ void cgeTranspose(cuFloatComplex  *At, cuFloatComplex  *A, int N)
{

int i,j;
for( i = 0; i<N; i++)

for( j = 0; j<N; j++)
  At[i+j*N] = A[i*N+j] ;
}

//////////////////////////////////////////////////////////////////////////////////////
//                LU decomposition of complex matrices
/////////////////////////////////////////////////////////////////////////////////////
__device__ __host__ static __inline__ void cgeDoolittle_LU_Decomposition(cuFloatComplex *LU, cuFloatComplex *A, int n)
{

 int i, j, k, p;
 cuFloatComplex *p_k, *p_row, *p_col;

for(k=0; k<n*n; k++) LU[k]=A[k];


for (k = 0, p_k = LU; k < n; p_k += n, k++) {

    for (j = k; j < n; j++) {

       for (p = 0, p_col = LU; p < k; p_col += n,  p++)

//           *(p_k + j) -= *(p_k + p) * *(p_col + j);
       *(p_k + j) = cuCsubf( *(p_k + j) , cuCmulf( *(p_k + p) , *(p_col + j) ));
    }

    if ( cuCabsf(*(p_k + k)) != 0.0 )  //return -1;


    for (i = k+1, p_row = p_k + n; i < n; p_row += n, i++) {

       for (p = 0, p_col = LU; p < k; p_col += n, p++)

         // *(p_row + k) -= *(p_row + p) * *(p_col + k);
    *(p_row + k) = cuCsubf( *(p_row + k) , cuCmulf(  *(p_row + p) , *(p_col + k) ));
         //*(p_row + k) /= *(p_k + k);


  *(p_row + k) = cuCdivf( *(p_row + k) , *(p_k + k)  );
    }
 }

}

//////////////////////////////////////////////////////////////////////////////////////////////////
// Back substitution for lower triangular matrices assuming that the diagonal is filled with 1's
// Given  T x = b ,
// where T is a lower triangular matrix NxN with 1's on the diagonal,
// b is a known vector
// x is an unknown vector
// the routine outputs x = xout
//////////////////////////////////////////////////////////////////////////////////////////////////
__device__ __host__ static __inline__ void cgeBackSubstitutionLowerDiagOne(cuFloatComplex *xout, cuFloatComplex *T , cuFloatComplex *b ,int N  )
{

cuFloatComplex bTemp;
int i,j;

for(i=0;i<N;i++){
  bTemp = b[i];
  for(j=0;j<i;j++) bTemp = cuCsubf( bTemp , cuCmulf(T[ i*N + j], xout[j] ) );
  xout[i] = bTemp;
}
}

/////////////////////////////////////////////////////////////////////////////////////
// Back substitution for upper triangular matrices
////////////////////////////////////////////////////////////////////////////////////
__device__ __host__ static __inline__ void cgeBackSubstitutionUpper(cuFloatComplex *xout, cuFloatComplex *T , cuFloatComplex *b ,int N  )
{

cuFloatComplex bTemp;
int i,j;

for(i=N-1;i>=0;i--){
  bTemp = b[i];
  for(j=i+1;j<N;j++) bTemp = cuCsubf( bTemp , cuCmulf(T[ i*N + j], xout[j] ) );
  xout[i] = cuCdivf( bTemp , T[ i*N + i] );
}
}

///////////////////////
__device__ __host__ static __inline__ void cgeIdentity(cuFloatComplex *Id, int N)
{

int i,j;
for(i=0;i<N;i++)

for(j=0;j<N;j++){
Id[i*N+j] = make_cuFloatComplex(0.f,0.f);
}


for(i=0;i<N;i++){
  Id[i*N+i] = make_cuFloatComplex(1.f,0.f);
}
}

//////////////////////////////////////////////////////////////////////////////////////
//  Inverse of a matrix using the triangularization of matrices
//  Warning:
//          It does not destroy the original matrix A
//          A work space for 3 complex matrices must be supplied in W
//////////////////////////////////////////////////////////////////////////////////////
__device__ __host__ static __inline__ int cgeMatrixInverse_WorkSpace(){ return 3;}

__device__ __host__ static __inline__ void cgeMatrixInverse(cuFloatComplex *inv , cuFloatComplex *A , int N , cuFloatComplex *W)
{
int i;
cuFloatComplex *Id;
cuFloatComplex *invL, *invU;
Id = W;      //Double purpose work space
  invL = W + N*N;
invU = W + 2*N*N;
cgeIdentity( Id, N);
 cgeDoolittle_LU_Decomposition(inv,A,N);
for(i=0;i<N;i++)   cgeBackSubstitutionLowerDiagOne( invL + i*N , inv , Id + i*N ,  N  );
for(i=0;i<N;i++) cgeBackSubstitutionUpper( invU + i*N , inv , Id + i*N ,  N  );
cgeSquareMatrixProduct(Id , invL , invU , N  );
cgeTranspose(inv , Id , N);

}

#endif // _CUMATRIXTOOLS_H_

Here is cuinverse_kernel.cu:

```
// Note: the kernel signature was missing in the post; it is reconstructed here
// from the call cgeMatrixInverse_kernel<<<2,32>>>(d_invA, d_A, N, d_WorkSpace) in main.cu
__global__ void cgeMatrixInverse_kernel(cuFloatComplex *invA, cuFloatComplex *A, int N, cuFloatComplex *Work)
{
  int i;
  int idx = blockDim.x * blockIdx.x + threadIdx.x;

  cuFloatComplex *ThreadWorkSpace;
  ThreadWorkSpace = Work + idx*cgeMatrixInverse_WorkSpace()*N*N;

  // Each thread gets its own copy of the base matrix ...
  for(i=0;i<N*N;i++) A[ i + idx*N*N ] = A[i];

  // ... and modifies the first element, so that each thread inverts a different variation
  A[ idx*N*N ] = make_cuFloatComplex( (float) idx , 1./sqrtf(2) );

  cgeMatrixInverse(invA + idx*N*N , A + idx*N*N , N , ThreadWorkSpace);
}
```

and the makefile is:

#  edit the path of your working directory
BASE_PATH = /home/rcabrera/Documents/source/c/inverseCUDA

#

CUDA_PATH = /usr/local/cuda

CC = gcc
NVCC    := $(CUDA_PATH)/bin/nvcc

CUDA_INCLUDE_PATH = /usr/local/cuda/include
CUDA_LIB_PATH := $(CUDA_PATH)/lib64
CUDA_LIBS = -lcuda -lcudart

a.out: main.cu cuinverse_kernel.cu cumatrixtools.h
	$(NVCC)  -I$(CUDA_INCLUDE_PATH) -L$(CUDA_LIB_PATH) $(CUDA_LIBS) -lm  main.cu -o a.out

# Emulation mode alternative:
#	$(NVCC) -deviceemu  -I$(CUDA_INCLUDE_PATH) -L$(CUDA_LIB_PATH) $(CUDA_LIBS) -lm  main.cu -o a.out

clean :
	rm -f a.out

I think that we can use this implementation for our purpose, can't we?
Bye
Fran

Hello Franja,

Yes, if it is really standalone, it should be possible to compile the kernel into a CUBIN file and load and execute it with the Driver API. (BTW: A cuFloatComplex can be passed to the kernel simply as TWO float values.) I will give it a try, but will probably not have the chance to do this before Monday. (But I will be in the forum occasionally, in case you are trying it on your own.)
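
Just as a rough idea (completely untested - the file name, the kernel parameters and the launch configuration below only mirror the code you posted, so treat everything as an assumption), loading and launching the CUBIN with the JCuda Driver API could look roughly like this:

```
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.*;
import static jcuda.driver.JCudaDriver.*;

public class CuinverseKernelLauncher
{
    public static void main(String args[])
    {
        final int N = 3;          // matrix dimension, as in main.cu
        final int NBranches = 64; // number of matrix variations

        // Initialize the Driver API and create a context
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load the compiled kernel. The kernel probably has to be declared
        // 'extern "C"' so that its name is not mangled by the compiler.
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "cuinverse_kernel.cubin");
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "cgeMatrixInverse_kernel");

        // A cuFloatComplex is just two floats, so the host data is an
        // interleaved (real, imaginary) float array
        float hostA[] = new float[2 * N * N];
        // ... fill hostA with the base matrix ...

        // Device memory for A, invA and the work space (3 matrices per thread)
        CUdeviceptr dA = new CUdeviceptr();
        CUdeviceptr dInvA = new CUdeviceptr();
        CUdeviceptr dWork = new CUdeviceptr();
        cuMemAlloc(dA, NBranches * 2 * N * N * Sizeof.FLOAT);
        cuMemAlloc(dInvA, NBranches * 2 * N * N * Sizeof.FLOAT);
        cuMemAlloc(dWork, NBranches * 3 * 2 * N * N * Sizeof.FLOAT);
        cuMemcpyHtoD(dA, Pointer.to(hostA), 2 * N * N * Sizeof.FLOAT);

        // Kernel parameters: (invA, A, N, Work)
        Pointer kernelParameters = Pointer.to(
            Pointer.to(dInvA), Pointer.to(dA),
            Pointer.to(new int[]{ N }), Pointer.to(dWork));

        // Launch with 2 blocks of 32 threads, like <<<2,32>>> in main.cu
        cuLaunchKernel(function, 2, 1, 1, 32, 1, 1, 0, null, kernelParameters, null);
        cuCtxSynchronize();

        // Copy all the inverses back to the host
        float hostInvA[] = new float[NBranches * 2 * N * N];
        cuMemcpyDtoH(Pointer.to(hostInvA), dInvA, NBranches * 2 * N * N * Sizeof.FLOAT);

        cuMemFree(dA);
        cuMemFree(dInvA);
        cuMemFree(dWork);
    }
}
```

The host-side complex data is simply an interleaved float array (real, imaginary), which is what I meant with the "two float values" above.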

bye
Marco

Hi Marco

I really need it, but I prefer to wait until next week for you to give me the Java code, the .cu file and the .cubin file, because you know much more about this than I do (I am really lost).

In fact I haven't been able to compile my .cu file to a .cubin file yet. I have installed Visual Studio but the famous error continues…

Thank you very much for all the work you’re doing.

Fran