Runtime Scheduling and Memory Management of Arbitrarily Large Datasets With the Ability to Distribute Across Many GPUs
Computation speed is often hindered by data I/O. To mitigate this effect, data should be located as close to the processor as possible, ideally in registers (Hennessy, 2011). However, GPUs have limited memory compared to that available to the CPU, and state-of-the-art microscopes can easily produce datasets one to two orders of magnitude larger than GPU memory, or VRAM. HIP hides this limitation from the end user while simultaneously making it easy for programmers to write new image processing algorithms. Keeping data close to the GPU is accomplished through a technique we refer to as "chunking."
Chunking partitions input data for transfer to the GPU based on the input data dimensions, the kernel operation size, and the GPU resources available at runtime. By querying resources at runtime, HIP is able to run on a diverse set of hardware, from laptops with a discrete GPU to servers with many GPUs. The optimal chunk size must balance memory transfer speed against redundant work while ensuring each chunk has access to all necessary data.
The memory available for processing is further limited by the need to store at least the input and output buffers on the GPU device. Complex operations may require additional intermediate buffers; the more buffers an operation requires, the smaller the usable chunk size becomes. Because the current filter operations are 3-D, chunking always occurs across the channel (λ) and time (t) dimensions. If a single channel and frame of the data is too large in the spatial (x, y, z) dimensions to fit in the usable GPU memory, additional partitioning is necessary.
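The memory budget described above can be sketched as a simple calculation: the usable device memory is divided among all required buffers, so each additional intermediate buffer shrinks the voxel count per chunk. The function below is an illustrative helper, not HIP's actual implementation; its name, parameters, and the reserve fraction are assumptions.

```python
def max_chunk_voxels(free_gpu_bytes, num_buffers, bytes_per_voxel, reserve_frac=0.1):
    """Return how many voxels fit in one buffer after dividing usable GPU
    memory among all required buffers (input, output, and intermediates).

    Illustrative sketch only; holds back a safety margin (reserve_frac)
    for driver and runtime allocations.
    """
    usable = free_gpu_bytes * (1.0 - reserve_frac)
    return int(usable // (num_buffers * bytes_per_voxel))
```

Note how requiring a third (intermediate) buffer reduces the chunk size by a third relative to the two-buffer case, matching the trade-off described above.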
When single-channel volumes must be partitioned spatially in x, y, or z, spatial chunks are computed to minimize redundant data processing across individual chunks while remaining within the usable limits of GPU memory. Each spatial chunk overlaps its surrounding chunks. This overlap guarantees that, within the output region of interest (ROI) of a chunk, the kernel operation returns the same values as if it were run on the full volume. By maintaining an appropriate amount of overlap between chunks and discarding output that falls outside the chunk ROI, this partitioning scheme produces exact results for all neighborhood-based linear and non-linear kernel operations. Additionally, because each chunk is completely independent, chunks can be processed in parallel on multiple GPUs, allowing near-linear scaling of large-volume processing with additional GPU hardware.
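The overlap requirement can be made concrete along a single dimension: a chunk's input extent is its output ROI padded by the kernel radius, clamped to the volume bounds, so every output voxel sees the same neighborhood it would in a full-volume run. This is a minimal sketch under that assumption; the function and argument names are hypothetical.

```python
def chunk_with_halo(roi_start, roi_end, kernel_radius, vol_dim):
    """Given a chunk's output ROI along one dimension, return the input
    extent to load: the ROI padded by the kernel radius on each side,
    clamped to the volume. Output computed outside the ROI is discarded,
    so the retained values match a full-volume run exactly.
    """
    in_start = max(roi_start - kernel_radius, 0)
    in_end = min(roi_end + kernel_radius, vol_dim)
    return in_start, in_end
```

At the volume boundary no padding is available, which is exactly where a full-volume run would also fall back to its boundary condition, so correctness is preserved there as well.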
The pseudo-code below summarizes the logic that determines chunk sizes. Note that this technique biases against large overlaps and favors keeping the first dimension (x) intact.
Calculating Buffer Size for Image Chunking
Variables ending in x, y, or z should be interpreted as sizes in that dimension. For brevity, only the case that preserves the x dimension is fully described; the other two cases are identical except that a different dimension is fully preserved.
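The x-preserving case can be sketched as follows: keep the x extent whole and repeatedly halve the larger of the y and z extents until a chunk, including its overlap, fits within the voxel budget. This is an assumed reconstruction of the pseudo-code's logic (halving strategy, names, and parameters are all illustrative), reflecting its stated bias toward small overlaps and an intact first dimension.

```python
import math

def chunk_dims_preserve_x(vol, max_voxels, overlap):
    """Sketch of the x-preserving case: keep x whole and split y and z
    until a chunk plus its overlap fits within max_voxels.

    vol        -- (x, y, z) volume dimensions
    max_voxels -- voxel budget for one buffer
    overlap    -- voxels of overlap added per split dimension
    """
    x, y, z = vol
    cy, cz = y, z
    # Halve the larger of y/z first so chunks stay as cube-like as possible.
    while x * (cy + overlap) * (cz + overlap) > max_voxels:
        if cy >= cz:
            cy = math.ceil(cy / 2)
        else:
            cz = math.ceil(cz / 2)
        if cy == 1 and cz == 1:
            break  # cannot shrink further; x itself would need splitting
    return x, cy, cz
```

If even a single-voxel (y, z) slab exceeds the budget, the corresponding case that splits x would be selected instead.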
Two possible image partitioning schemes
Colored boxes are distinct memory partitions to be processed independently. White areas are processed only once; gray areas are processed more than once, with darker gray areas processed more often than lighter ones. Panel A shows a partitioning that preserves the continuity of the (x, y) plane by partitioning only across z. This scheme reprocesses much of the volume. The partitioning scheme in panel B balances memory continuity against the amount of redundant data in overlapping sections. The large overlapping sections are processed at most twice. Some sections will be processed up to eight times; however, these are small cubes in the most interior regions and should not be as numerous as the other overlapping sections.
After the buffer dimensionality is calculated, the Image Chunk class partitions the data into explicit chunks. A chunk stores the start and end (x, y, z, λ, t) locations to be loaded into the input buffer, along with an output ROI. The ROI is smaller than the input extent, which ensures that every output voxel has the largest support necessary for the current operation. This full-support partitioning scheme is another novel contribution provided by HIP: when partitioning data, the full support is available and should be used, as presented here. An additional benefit of this output scheme is that it is completely lock-free, meaning output memory for many chunks can be written in parallel without risk of collision.
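The chunk record and its lock-free property can be sketched as below. The field and function names are assumptions for illustration, not HIP's actual API: each chunk carries an input extent (including overlap) and an output ROI, and because the ROIs of distinct chunks never share a voxel, their output writes cannot collide.

```python
from dataclasses import dataclass

@dataclass
class ImageChunk:
    """Minimal sketch of a chunk record: the input extent includes the
    overlap; the output ROI is the interior region whose values have
    full kernel support. All coordinates are (x, y, z, ...) tuples."""
    in_start: tuple   # start of the region loaded into the input buffer
    in_end: tuple     # exclusive end of the loaded region
    roi_start: tuple  # start of the output region of interest
    roi_end: tuple    # exclusive end of the output ROI

def rois_disjoint(a, b):
    """True when two chunks' output ROIs share no voxels. Disjoint ROIs
    are what make the output writes lock-free: two ROIs are disjoint if
    they are separated along any one dimension."""
    return any(a.roi_end[d] <= b.roi_start[d] or b.roi_end[d] <= a.roi_start[d]
               for d in range(len(a.roi_start)))
```

Note that input extents of neighboring chunks do overlap (the halo), but inputs are only read; it is the write targets, the ROIs, that must be, and are, disjoint.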