gpucoder.batchedMatrixMultiply

Optimized GPU implementation of batched matrix multiply operation

Description

[D1,D2] = gpucoder.batchedMatrixMultiply(A1,B1,A2,B2) performs matrix-matrix multiplication of a batch of matrices A1,B1 and A2,B2. The gpucoder.batchedMatrixMultiply function performs matrix-matrix multiplication of the form:

D = α A B

where α is a scalar multiplication factor, and A, B, and D are matrices with dimensions m-by-k, k-by-n, and m-by-n respectively. You can optionally transpose or Hermitian-conjugate A and B. By default, α is set to one and the matrices are not transposed. To specify a different scalar multiplication factor and perform transpose operations on the input matrices, use the Name,Value pair arguments.

All the batches passed to the gpucoder.batchedMatrixMultiply function must be uniform. That is, all instances must have the same dimensions m, n, and k.
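The uniformity requirement can be checked before calling the function. The following is a minimal sketch in base MATLAB, not part of GPU Coder: the cell arrays A and B and the variable isUniform are hypothetical names introduced for illustration, and the check assumes that no transpose option is in use.

```matlab
% Hypothetical pre-check (not a GPU Coder API): confirm that every A has
% the same size, every B has the same size, and the inner dimensions agree.
A = {rand(15,42), rand(15,42)};   % example batch of A matrices (m-by-k)
B = {rand(42,30), rand(42,30)};   % example batch of B matrices (k-by-n)

sizeA = size(A{1});
sizeB = size(B{1});

sameA   = all(cellfun(@(M) isequal(size(M), sizeA), A));
sameB   = all(cellfun(@(M) isequal(size(M), sizeB), B));
innerOk = sizeA(2) == sizeB(1);   % columns of A must equal rows of B

isUniform = sameA && sameB && innerOk;
```

If isUniform is false, the pairs do not share a single m, n, k and would not form a valid batch under the rule above.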

[D1,...,DN] = gpucoder.batchedMatrixMultiply(A1,B1,...,AN,BN) performs matrix-matrix multiplication of multiple A,B pairs of the form:

D_i = α A_i B_i,  i = 1, ..., N
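As a point of reference, the batched form computes the same products as an ordinary loop over the pairs. The sketch below uses only base MATLAB operators and does not call GPU Coder. The cell arrays A, B, and D are hypothetical containers chosen for illustration, and the sizes are borrowed from the example on this page.

```matlab
% Reference semantics only: D{i} = alpha * A{i} * B{i} for each pair.
% This is plain MATLAB, not the optimized GPU implementation.
alpha = 0.3;
A = {rand(15,42), rand(15,42)};   % N = 2 matrices, each m-by-k
B = {rand(42,30), rand(42,30)};   % N = 2 matrices, each k-by-n

N = numel(A);
D = cell(1,N);
for i = 1:N
    D{i} = alpha * A{i} * B{i};   % each result is m-by-n
end
```

Each D{i} plays the role of the corresponding output D1,...,DN of the batched call, which makes the loop convenient as a baseline when checking results.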

___ = gpucoder.batchedMatrixMultiply(___,Name,Value) performs the batched matrix multiply operation by using the options specified by one or more Name,Value pair arguments.

Examples

Perform a simple batched matrix-matrix multiplication and use the gpucoder.batchedMatrixMultiply function to generate CUDA® code that calls the appropriate cublas<t>gemmBatched APIs.

In one file, write an entry-point function myBatchMatMul that accepts matrix inputs A1, B1, A2, and B2 and a scalar alpha. Because the input matrices are not transposed, use the 'nn' option.

function [D1,D2] = myBatchMatMul(A1,B1,A2,B2,alpha)

[D1,D2] = gpucoder.batchedMatrixMultiply(A1,B1,A2,B2, ...
    'alpha',alpha,'transpose','nn');

end

To create a type for a matrix of doubles for use in code generation, use the coder.newtype function.

A1 = coder.newtype('double',[15,42],[0 0]);
A2 = coder.newtype('double',[15,42],[0 0]);
B1 = coder.newtype('double',[42,30],[0 0]);
B2 = coder.newtype('double',[42,30],[0 0]);
alpha = 0.3;
inputs = {A1,B1,A2,B2,alpha};

To generate a CUDA library, use the codegen function.

cfg = coder.gpuConfig('lib');
cfg.GpuConfig.EnableCUBLAS = true;
cfg.GpuConfig.EnableCUSOLVER = true;
cfg.GenerateReport = true;
codegen -config cfg -args inputs myBatchMatMul

The generated CUDA code contains kernels myBatchMatMul_kernelNN for initializing the input and output matrices. The code also contains the cublasDgemmBatched API calls to the cuBLAS library. The following code is a snippet of the generated code.

//
// File: myBatchMatMul.cu
//
...
void myBatchMatMul(const double A1[630], const double B1[1260],
                   const double A2[630], const double B2[1260],
                   double alpha, double D1[450], double D2[450])
{
  double alpha1;
  ...
  myBatchMatMul_kernel1<<<...>>>(*gpu_A2, *gpu_A1, *gpu_input_cell_f2,
                                 *gpu_input_cell_f1);
  cudaMemcpy(gpu_B2, (void *)&B2[0], 10080UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_B1, (void *)&B1[0], 10080UL, cudaMemcpyHostToDevice);
  myBatchMatMul_kernel2<<<...>>>(*gpu_B2, *gpu_B1, *gpu_input_cell_f4,
                                 *gpu_input_cell_f3);
  myBatchMatMul_kernel3<<<...>>>(gpu_r3, gpu_r2);
  myBatchMatMul_kernel4<<<...>>>(gpu_r2, *gpu_out_cell);
  myBatchMatMul_kernel5<<<...>>>(gpu_r3, *gpu_out_cell);
  ...
  cublasDgemmBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N,
                     15, 30, 42, (double *)gpu_alpha1,
                     (double **)gpu_Aarray, 15, (double **)gpu_Barray, 42,
                     (double *)gpu_beta1, (double **)gpu_Carray, 15, 2);
  myBatchMatMul_kernel6<<<...>>>(*gpu_D2, *gpu_out_cell, *gpu_D1);
  ...
}
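The constants in the generated snippet follow directly from the input types defined earlier. The short calculation below is a worked check in base MATLAB. The variable names are hypothetical, and the interpretation of the leading dimensions assumes column-major storage with no transposition, as in this example.

```matlab
% Worked arithmetic for the example sizes: A is 15-by-42, B is 42-by-30.
m = 15; k = 42; n = 30;
bytesPerDouble = 8;

numelA = m*k;                      % 630, matches A1[630] and A2[630]
numelB = k*n;                      % 1260, matches B1[1260] and B2[1260]
numelD = m*n;                      % 450, matches D1[450] and D2[450]
bytesB = numelB * bytesPerDouble;  % 10080, the cudaMemcpy size for B1, B2

% Arguments of the cublasDgemmBatched call, in order: m, n, k, then the
% leading dimensions of A, B, and the output, then the batch count.
lda = m;          % 15
ldb = k;          % 42
ldc = m;          % 15
batchCount = 2;   % two A,B pairs
```

Reading the call this way shows that a single cuBLAS invocation covers both pairs, which is the point of batching.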

Input Arguments

Operands, specified as vectors or matrices. A and B must be 2-D arrays. The number of columns in A must be equal to the number of rows in B.

Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Complex Number Support: Yes

Name-Value Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: [D1,D2] = gpucoder.batchedMatrixMultiply(A1,B1,A2,B2,'alpha',0.3,'transpose','CC');

'alpha': Value of the scalar used for multiplication with A. Default value is one.

'transpose': Character vector or string composed of two characters, indicating the operation performed on the matrices A and B prior to matrix multiplication. Possible values are normal ('N'), transposed ('T'), or complex conjugate transpose ('C'). By default, the matrices are not transposed.
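To make the two-character option concrete, the following sketch reproduces its effect with base MATLAB operators. It assumes that the first character applies to A and the second to B, and that the characters are case-insensitive (the examples on this page use both 'nn' and 'CC'). The lookup struct ops and the handles opA and opB are hypothetical names, not part of GPU Coder.

```matlab
% Reference semantics for the 'transpose' option, in plain MATLAB.
A = [1+2i, 3; 4, 5-1i];
B = [2, 1i; 1, 0];
alpha = 0.3;
opts = 'CT';   % assumed order: first character for A, second for B

% Map each option character to the operation it denotes.
ops = struct('N', @(M) M, ...     % normal: use the matrix as is
             'T', @(M) M.', ...   % transpose without conjugation
             'C', @(M) M');       % complex conjugate transpose

opA = ops.(upper(opts(1)));
opB = ops.(upper(opts(2)));

D = alpha * opA(A) * opB(B);      % here: alpha * A' * B.'
```

For real-valued inputs, 'T' and 'C' give the same result, so the distinction matters only for complex data.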

Output Arguments

Product, returned as a scalar, vector, or matrix. Array D has the same number of rows as input A and the same number of columns as input B.

Introduced in R2020a