gpucoder.stencilKernel

Create CUDA code for stencil functions

Description

B = gpucoder.stencilKernel(FUN,A,[M N],shape,param1,param2...) applies the function FUN to each [M,N] sliding window of the input A. The function FUN is called for each [M,N] submatrix of A and computes one element of the output B. The index of this element corresponds to the center of the [M,N] window.

FUN is the handle to a user-defined function that returns a scalar output of the same type as the input.

C = FUN(X,param1,param2, ...)

X is the [M,N] submatrix of the original input A. X can be zero-padded when necessary, for instance at the boundaries of input A. X and the window can also be 1-D.

C is a scalar-valued output of FUN. It is the output computed for the center element of the [M,N] array X and is assigned to the corresponding element of the output array B.

param1,param2 are optional arguments. Pass these arguments if FUN requires parameters in addition to the input window.
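As an illustration of passing an extra parameter through to FUN, here is a minimal sketch. The names scaledSumFilt, scaledSum, and gain are hypothetical and chosen only for this example; the call itself follows the syntax described above, and running it assumes GPU Coder is installed.

```matlab
function B = scaledSumFilt(A, gain) %#codegen
% Hypothetical entry point: applies a 3x3 stencil to A and forwards
% one extra parameter (gain) to the stencil function.
B = gpucoder.stencilKernel(@scaledSum, A, [3 3], 'same', gain);
end

function c = scaledSum(X, gain)
% X is one 3x3 window of A (zero-padded at the borders).
% Return a scalar of the same class as the window.
c = cast(gain * sum(X(:)), class(X));
end
```

The cast keeps the output type equal to the input type, which matches the requirement that FUN return a scalar of the same type as its input.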

The window [M,N] must be less than or equal to the size of A, with the same shape as A.

If A is a 1-D row vector, the window must be [1,N].

If A is a 1-D column vector, the window must be [N,1].
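The 1-D case can be sketched as follows for a row vector, where the window is [1,N]. The names movingMax3 and localMax are hypothetical, and the example assumes GPU Coder is installed.

```matlab
function B = movingMax3(A) %#codegen
% Hypothetical entry point: A is assumed to be a 1-D row vector,
% so the window must be [1 N]; here N = 3.
B = gpucoder.stencilKernel(@localMax, A, [1 3], 'same');
end

function c = localMax(X)
% X is one 1x3 window of A. max preserves the class of X.
c = max(X(:));
end
```

Because border windows are zero-padded, a maximum taken at the edges of an all-negative input would return 0 there; choose the stencil function with that padding in mind.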

shape determines the size of the output array B. It can have one of three possible values:

  • 'same' - Returns output B that is the same size as A.

  • 'full' - (default) Returns the full output. The size of B is greater than the size of A: if A is of size (x,y), the size of B = [x + floor(M/2), y + floor(N/2)].

  • 'valid' - Returns only those parts of the output that are computed without the zero-padded edges of A. If A is of size (x,y), the size of B = [x - floor(M/2), y - floor(N/2)].
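To see how the three shape values affect the output, a short script such as the following can print the size of B for each option. This is a sketch that assumes GPU Coder is installed and that the script is run in MATLAB rather than through code generation; winSum and the 8-by-10 input are hypothetical choices for illustration.

```matlab
% Compare the output size for each shape option on a small input.
A = rand(8, 10);

Bsame  = gpucoder.stencilKernel(@winSum, A, [3 3], 'same');
Bfull  = gpucoder.stencilKernel(@winSum, A, [3 3], 'full');
Bvalid = gpucoder.stencilKernel(@winSum, A, [3 3], 'valid');

disp(size(Bsame));   % same size as A
disp(size(Bfull));   % larger than A
disp(size(Bvalid));  % smaller than A

function c = winSum(X)
% Hypothetical stencil function: sum of the 3x3 window.
c = sum(X(:));
end
```

Checking size(B) directly in this way is a quick means of confirming the output dimensions before fixing the shape for code generation.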

The input A must be a vector or matrix with a numeric type supported by FUN. The class of B is the same as the class of A.

Code generation is supported only for fixed-size outputs. The shape and window arguments must be compile-time constants because they determine the size of the output.

Examples

This example shows how to use the gpucoder.stencilKernel function and generate CUDA® kernels that perform filtering of an image by using stencil operations.

This example performs mean filtering of a 2-D image. In one file, write the entry-point function meanImgFilt that accepts an image matrix A. Create a subfunction my_mean that computes the mean of the 3x3 submatrix.

function B = meanImgFilt(A) %#codegen
    B = gpucoder.stencilKernel(@my_mean,A,[3 3],'same');

    function out = my_mean(A)
        out = cast(mean(A(:)), class(A));
    end
end

Set up the test input image for the meanImgFilt function.

inImage = im2double(imread('cameraman.tif'));

Use the codegen function to generate a CUDA MEX function.

codegen -config coder.gpuConfig('mex') -args {inImage} -report meanImgFilt

GPU Coder creates three kernels: meanImgFilt_kernel1 for initializing memory, meanImgFilt_kernel2 for optimizing the input memory structure, and meanImgFilt_kernel3 for the mean filtering operation. The following is a snippet of the generated code.

cudaMalloc(&gpu_B, 524288ULL);
cudaMalloc(&gpu_A, 524288ULL);
cudaMalloc(&gpu_expanded, 532512ULL);
meanImgFilt_kernel1<<<...>>>(gpu_expanded);
cudaMemcpy((void *)gpu_A, (void *)&A[0], 524288ULL, cudaMemcpyHostToDevice);
meanImgFilt_kernel2<<<...>>>(gpu_A, gpu_expanded);
meanImgFilt_kernel3<<<...>>>(gpu_expanded, gpu_B);
cudaMemcpy((void *)&B[0], (void *)gpu_B, 524288ULL, cudaMemcpyDeviceToHost);

meanImgFilt_kernel3 uses shared memory (__shared__ qualifier) to improve memory bandwidth and data locality.
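Once the MEX function has been generated, one way to sanity-check it is to compare its output against a CPU reference. The sketch below assumes that codegen produced the default MEX name meanImgFilt_mex and uses conv2 with a 3x3 averaging kernel as the reference, since a zero-padded 3x3 mean with the 'same' shape is equivalent to that convolution. The tolerance you accept is a judgment call and is not specified by this page.

```matlab
% Run the generated MEX function on the test image.
inImage = im2double(imread('cameraman.tif'));
B = meanImgFilt_mex(inImage);   % assumed default MEX name from codegen

% CPU reference: 3x3 box filter with zero padding, same size as input.
ref = conv2(inImage, ones(3) / 9, 'same');

% Report the largest elementwise deviation between GPU and CPU results.
maxErr = max(abs(B(:) - ref(:)));
fprintf('Maximum absolute difference: %g\n', maxErr);
```

A small nonzero difference is expected from floating-point rounding, because the two computations accumulate the window in different orders.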

Limitations

  • For very large input sizes, the gpucoder.stencilKernel function may produce CUDA code that does not numerically match the MATLAB® simulation. In such cases, consider reducing the size of the input to produce accurate results.

Version History

Introduced in R2017b