Most of the functionality that Hydra Image Processor (HIP) provides comes from an architecture that removes the burden of data partitioning, device scheduling, and low-level data iteration. Through the use of object-oriented programming and template meta-programming, adding a new function to the HIP library is as easy as starting from a template and changing a few lines of code. This procedure is outlined here.
This is the perfect starting point for any new function. Every function in the HIP library started from this file, and new functions should as well. The template can be found in the project at /src/c/Cuda/_TemplateKernel.cuh or in the repository here.
Some items of note:
The file name ends with .cuh to denote that this is a header file intended to be compiled with NVIDIA's compiler.
The implementation is also contained within this file rather than in a separate file. This is because all functions in HIP are templated to support any data type.
There is a fair amount of code that has not been abstracted away. This is intentional. The code contained in the template file works for most use cases. However, there are cases where this code needs to be modified, and abstracting away that functionality made working with those use cases prohibitively complicated. The code found within the template attempts to minimize the amount of code needed while still allowing a programmer to tailor HIP's functionality to their needs.
Files should be organized by placing the function(s) that will run on the GPU (the kernel) at the top of the file, with code that will run on the host computer placed after. This keeps files consistent with one another, makes them easier to maintain, and is also necessary for proper compilation.
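As a sketch of this layout, a minimal function file might look like the following. All names here are illustrative, not HIP's actual API, and the multi-device scheduling and chunking HIP performs are omitted:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Device code (kernel) first: the per-voxel work that runs on the GPU.
template <class PixelTypeIn, class PixelTypeOut>
__global__ void cudaMyOperation(const PixelTypeIn* imIn, PixelTypeOut* imOut, size_t numVoxels)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numVoxels)
        imOut[i] = (PixelTypeOut)imIn[i]; // placeholder per-voxel operation
}

// Host code second: the wrapper that sets up buffers and launches the kernel above.
template <class PixelTypeIn, class PixelTypeOut>
void cMyOperation(const PixelTypeIn* imIn, PixelTypeOut* imOut, size_t numVoxels,
                  int device = -1)
{
    (void)device; // device selection and multi-GPU scheduling omitted in this sketch

    PixelTypeIn* dIn = NULL;
    PixelTypeOut* dOut = NULL;
    cudaMalloc(&dIn, numVoxels * sizeof(PixelTypeIn));
    cudaMalloc(&dOut, numVoxels * sizeof(PixelTypeOut));
    cudaMemcpy(dIn, imIn, numVoxels * sizeof(PixelTypeIn), cudaMemcpyHostToDevice);

    const int threads = 256;
    int blocks = (int)((numVoxels + threads - 1) / threads);
    cudaMyOperation<<<blocks, threads>>>(dIn, dOut, numVoxels);

    cudaMemcpy(imOut, dOut, numVoxels * sizeof(PixelTypeOut), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}
```

Note that the device function appears above the host wrapper that references it, which is why this ordering is required for compilation.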
Code Walk Through
This documentation will use the term "Host" to mean code to be run on the CPU of the computer attached to the GPU. The term "Device" will indicate a particular GPU where a function is running. "Kernel" will sometimes be used to indicate the function that is to be run on the device.
Here "host code" refers to the C/C++ function that sets up the input and output data for the operation that will be run on the device. It can also be thought of as a wrapper function for a CUDA function call. HIP provides classes to assist with output memory management, device detection, GPU buffer management, and scheduling. This walk through will start from the _TemplateKernel.cuh file and explain the key features, what to reuse, and what needs to be changed for your new function.
Each function in HIP is templated. This is to enable support for each data type. Look at the library header file section to ensure that each data type gets instantiated.
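The mechanism looks roughly like this (a generic C++ sketch; the function and type names are illustrative, not HIP's actual code): the function is written once as a template, and the library header forces an explicit instantiation for each supported pixel type so that each one is compiled into the library.

```cpp
#include <cstdint>

// Written once, works for any pixel type.
template <class PixelType>
PixelType addConstant(PixelType v, double c)
{
    return (PixelType)(v + c);
}

// Explicit instantiations so every supported data type is compiled in.
template uint8_t  addConstant<uint8_t>(uint8_t, double);
template uint16_t addConstant<uint16_t>(uint16_t, double);
template float    addConstant<float>(float, double);
template double   addConstant<double>(double, double);
```

If a type is missing from this list, callers outside the library will get link errors for that type, which is why the header file section must be checked when adding a function.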
Functions should have a minimum of three parameters, in the order: input, output, and device. More often than not, a fourth parameter, called a kernel, is needed. This is not to be confused with the device kernel code mentioned elsewhere. Lastly, a fifth parameter, numIterations, is used to indicate the number of times the operation should be run in sequence. Each of these parameters is described in more detail below.
Input & Output Parameters
Both the input and output parameters are templated ImageContainer classes, allowing the caller to use any combination of input/output types. In practice, and in the library header file, the output is often the same type as the input, but this is not strictly necessary. These two parameters hold the memory spaces that contain the input data (or image) and the output data. The output can be empty, in which case memory for it will be allocated within the function described here (see Setting Up Buffers).
Device Parameter
HIP is capable of automatically scheduling and distributing work across as many devices as are present. However, there may be times when only one device is desired, for example when parallelization is being handled elsewhere. This parameter should be optional and default to -1; a negative value indicates that all devices should be used.
Kernel Parameter
The kernel parameter should not be confused with kernel code. Here a kernel defines the neighborhood (or support) of the operation and the weight that each neighboring value should get. The ImageContainer class is reused here for simplicity and will always contain floating-point values. The kernel can be thought of as a mini image with up to three dimensions (this could change in future implementations).
The most simplistic form of this kernel is a block of ones. The dimensions of this block indicate the neighborhood size or support of the operation. Having all ones indicates that each neighboring value should be used directly (no weighting applied, or more correctly, a weight of one applied).
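Building such a kernel is trivial (a sketch, not HIP code; the helper name is made up for illustration):

```cpp
#include <vector>
#include <cstddef>

// Build a ones kernel of the given dimensions; every voxel in the
// neighborhood contributes directly (a weight of one).
std::vector<float> onesKernel(int x, int y, int z)
{
    return std::vector<float>((size_t)x * y * z, 1.0f);
}
```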
Another common use of the kernel is to have a shaped support. Often this kernel will be a structuring element where the support or neighborhood has some shape to it (typically an ellipse).
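A shaped support can be encoded in the same floating-point kernel by setting voxels inside the shape to one and the rest to zero. The sketch below builds an ellipsoidal structuring element (illustrative only; radii must be at least one):

```cpp
#include <vector>
#include <cstddef>

// Build an ellipsoidal structuring element: 1 inside the ellipsoid, 0 outside.
// Radii rx, ry, rz (each >= 1) give a kernel of extent 2*r+1 per dimension.
std::vector<float> ellipsoidKernel(int rx, int ry, int rz)
{
    int dx = 2 * rx + 1, dy = 2 * ry + 1, dz = 2 * rz + 1;
    std::vector<float> k((size_t)dx * dy * dz, 0.0f);
    for (int z = -rz; z <= rz; ++z)
        for (int y = -ry; y <= ry; ++y)
            for (int x = -rx; x <= rx; ++x)
            {
                // Normalized ellipsoid equation: inside when <= 1.
                double d = (double)x * x / (rx * rx)
                         + (double)y * y / (ry * ry)
                         + (double)z * z / (rz * rz);
                if (d <= 1.0)
                    k[((size_t)(z + rz) * dy + (y + ry)) * dx + (x + rx)] = 1.0f;
            }
    return k;
}
```

With radii of one in every dimension this reduces to the 6-connected cross, the smallest 3-D structuring element.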
The other most common kernel is one that weights the neighborhood. Take, for example, Gaussian smoothing. The kernel defines the support and the weight each value should contribute to the output value. This is similar to a convolution filter where the values are multiplied by the support and summed together. It is important to note here that HIP treats the edges of the image differently than most other software solutions. This type of kernel needs to be treated carefully so as to be energy preserving. See Data Type & Normalization for further details.
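For a weighted kernel, "energy preserving" means the weights sum to one, so applying the filter neither brightens nor darkens the image overall. A minimal sketch (not HIP's actual kernel generation):

```cpp
#include <vector>
#include <cmath>

// Build a 1-D Gaussian kernel and normalize it so the weights sum to 1
// (energy preserving: the filter does not change the image's total intensity).
std::vector<float> gaussianKernel(int radius, double sigma)
{
    std::vector<float> k(2 * radius + 1);
    double sum = 0.0;
    for (int x = -radius; x <= radius; ++x)
    {
        double w = std::exp(-(double)(x * x) / (2.0 * sigma * sigma));
        k[x + radius] = (float)w;
        sum += w;
    }
    for (float& w : k)
        w = (float)(w / sum); // normalize so the weights sum to 1
    return k;
}
```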
Number of Iterations Parameter
This optional parameter is used when an operation should be run more than once in succession. For instance, one might want to smooth an image using a small structuring element in a manner close to that of a larger structuring element; this preserves a more textured edge than the equivalent large structuring element would. Setting this parameter to a value greater than one runs successive iterations on a chunk while it is on the device, mitigating memory copy overhead between iterations.
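The idea can be sketched on the host with plain arrays (illustrative only; in HIP the two buffers live on the device, which is what avoids the per-iteration copy):

```cpp
#include <vector>
#include <utility>
#include <cstddef>

// Run a neighborhood operation numIterations times on one "chunk",
// swapping the two buffers between iterations instead of copying data back.
std::vector<float> iterateSmooth(std::vector<float> cur, int numIterations)
{
    std::vector<float> next(cur.size());
    for (int it = 0; it < numIterations; ++it)
    {
        for (size_t i = 0; i < cur.size(); ++i)
        {
            // 3-point mean with replicated (clamped) edges.
            float l = cur[i == 0 ? 0 : i - 1];
            float r = cur[i + 1 < cur.size() ? i + 1 : cur.size() - 1];
            next[i] = (l + cur[i] + r) / 3.0f;
        }
        std::swap(cur, next); // reuse buffers; no copy between iterations
    }
    return cur;
}
```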
Automatic Device Discovery
One of the best features of HIP is its ability to dynamically adjust to the GPU hardware at runtime. This is done mainly with the CudaDevices class. The constructor of this class takes a function handle and an optional device number. Internally, this class calculates the optimal thread count for best register occupancy and retrieves the capabilities of each device. The CudaDevices variable is used when setting up the data partitioning scheme and starting threads for each of the devices.
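These are the kinds of queries the standard CUDA runtime API provides for this purpose (a sketch, not CudaDevices' actual internals; the kernel name is illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

template <class T>
__global__ void cudaExampleKernel(T* data, size_t n)
{
    // placeholder kernel used only for the occupancy query below
}

void queryDevices()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int d = 0; d < deviceCount; ++d)
    {
        cudaSetDevice(d);
        cudaDeviceProp props;
        cudaGetDeviceProperties(&props, d); // capabilities: vRAM size, SM count, etc.

        int minGridSize = 0, blockSize = 0;
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                           cudaExampleKernel<float>, 0, 0);
        // blockSize is the thread count giving the best theoretical occupancy
        // for this kernel's register usage on device d.
    }
}
```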
Setting Up Buffers
Output Buffer
The function setUpOutIm takes the desired output image dimensions and the output image container pointer. It then checks whether memory has been allocated for the output and, if it has not, allocates it at this time.
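The allocate-if-empty pattern looks roughly like this (a sketch with a minimal stand-in container; neither the struct nor the function name matches HIP's actual types):

```cpp
#include <vector>
#include <array>
#include <cstddef>

// Minimal stand-in for an image container (illustrative, not HIP's ImageContainer).
struct Image
{
    std::array<size_t, 3> dims{{0, 0, 0}};
    std::vector<float> data;
    bool empty() const { return data.empty(); }
};

// Allocate the output only if the caller did not provide one already.
void setUpOutputImage(const std::array<size_t, 3>& dims, Image* imageOut)
{
    if (imageOut->empty())
    {
        imageOut->dims = dims;
        imageOut->data.resize(dims[0] * dims[1] * dims[2]);
    }
}
```

This lets callers either pass a preallocated buffer (e.g. one being reused across calls) or an empty one and let the function handle allocation.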
Creating a Partitioning Scheme
HIP was initially created when vRAM (memory on the device) was quite limited. This meant that any appreciable amount of image data needed to be partitioned into chunks and loaded onto the device sequentially. This functionality still exists within HIP: even though there are devices shipping with more than 32GB of vRAM, microscope data is ever increasing.
The function calculateBuffers is used to calculate the optimal chunk size given the input dimensions, kernel dimensions, and the available device capabilities. If the first three dimensions fit into the vRAM available to the device, no chunking is necessary for that volume. Otherwise, chunk sizes are calculated so that each chunk has an overlapping section. This section is large enough that the kernel has full support without inter-chunk communication. Currently, the fourth and fifth dimensions are always treated as individual chunks; said another way, each chunk is at most three-dimensional. This may change in future iterations.
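The arithmetic behind such a partitioning can be sketched for a single dimension as follows (illustrative only; calculateBuffers works on all three dimensions and real device limits):

```cpp
#include <cstddef>

struct ChunkPlan
{
    size_t chunkExtent; // extent of each chunk along this dimension
    size_t overlap;     // halo on each side so the kernel has full support
    size_t numChunks;   // number of chunks needed to cover the image
};

// Given an image extent, kernel extent, and the largest extent that fits in
// the device's vRAM, compute overlapping chunks so that no chunk ever needs
// data held by another chunk (no inter-chunk communication).
ChunkPlan planChunks(size_t imageExtent, size_t kernelExtent, size_t maxExtentOnDevice)
{
    ChunkPlan p;
    p.overlap = kernelExtent / 2; // half the kernel on each side of a chunk
    if (imageExtent <= maxExtentOnDevice)
    {
        // Whole dimension fits on the device: one chunk, no partitioning.
        p.chunkExtent = imageExtent;
        p.numChunks = 1;
        return p;
    }
    // Each chunk produces (chunkExtent - 2*overlap) voxels of new output.
    size_t step = maxExtentOnDevice - 2 * p.overlap;
    p.chunkExtent = maxExtentOnDevice;
    p.numChunks = (imageExtent + step - 1) / step; // ceiling division
    return p;
}
```

For example, a 100-voxel dimension with a 5-voxel kernel and a 40-voxel device limit needs a 2-voxel halo per side and three chunks.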