Autotune for GPU Kernels: Ensuring Consistent Peak Performance
Introduction
At the lowest level of abstraction of a deep learning framework lie the kernels. A kernel is an algorithm that accomplishes a relatively simple task on a tensor and can be reused in a broad range of contexts, such as a matrix multiplication (referred to as Matmul hereafter). Because these algorithms are the most basic building blocks, they often sit in the hot loop of an AI model, and it is essential that they execute as fast as possible. That is why their computation is often delegated to the GPU, which is usually faster on highly parallelizable problems like tensor operations, provided its eccentricities are respected.
Indeed, there are many types of GPUs, each with its own cache size, number of registers, and so on. As a result, some kernel implementations may run better on some GPUs than on others [1]. Add to that the fact that some kernels perform better on some tensor shapes, as we will see with the Reduction algorithm.
Selecting the appropriate kernel for a given operation, depending on the device and the shape, can quickly become cumbersome. Hardcoding the choice may pose two significant issues: first, it may not scale well in the future, and second, it might be inaccurate due to unforeseen factors not taken into consideration.
This is why, in Burn, we have chosen to automate the kernel selection process so that the best kernel is always selected, regardless of the device it runs on, the shape of the input, or any other consideration that is relevant for a specific kernel. We call this process Autotune.
After defining Autotune and explaining how we integrated it in Burn, we will look at how it simplifies the kernel selection for two common GPU operations: Reduction and Matmul.
What's Autotune?
You may recognize the word Autotune from the music industry, where Autotune was originally a piece of software by Antares [2] that adjusts a singer's pitch to a precise note in real time. It had such an impact that the word is now often used for any such pitch-correcting algorithm. If you have a shaky voice, like me, you will find Autotune to be a lifesaver when singing. In computer science, Autotune is also a correcting process that happens at runtime. Instead of adjusting to a specific voice, it adjusts to a specific machine, your laptop for instance. And instead of fixing the pitch, it fixes the execution time.
It's particularly relevant in deep learning simply because compute efficiency is such a big deal in that field, although in theory it could be used in any system where high performance is key. To be clear, it is not about tuning hyperparameters or anything that can change an AI model's accuracy or training convergence. It's only about selecting the fastest kernel among kernels that would all output the same result anyway.
Now, how does it work? It's actually rather simple: the first time you encounter an autotunable operation on a specific input when running your model on your machine, Autotune launches the operation with all the kernels registered for it, benchmarks their execution times, and keeps the fastest one in a cache for subsequent calls to that operation in similar settings.
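To make that flow concrete, here is a minimal Rust sketch of such a mechanism. It is not Burn's actual API: the AutotuneKernel trait, the string key, and the single-run timing are simplified placeholders for illustration (Burn, for instance, takes the median over a few executions, as we will see later).

use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical interface: each candidate kernel knows how to run on an input.
trait AutotuneKernel {
    fn run(&self, input: &[f32]) -> Vec<f32>;
}

struct Tuner {
    /// Maps an autotune key to the index of the fastest registered kernel.
    cache: HashMap<String, usize>,
    kernels: Vec<Box<dyn AutotuneKernel>>,
}

impl Tuner {
    fn execute(&mut self, key: String, input: &[f32]) -> Vec<f32> {
        let fastest = match self.cache.get(&key) {
            Some(&index) => index,
            None => {
                // First time we see this key: benchmark every registered kernel.
                let mut best_index = 0;
                let mut best_time = Duration::MAX;
                for (index, kernel) in self.kernels.iter().enumerate() {
                    let start = Instant::now();
                    kernel.run(input);
                    let elapsed = start.elapsed();
                    if elapsed < best_time {
                        best_index = index;
                        best_time = elapsed;
                    }
                }
                self.cache.insert(key, best_index);
                best_index
            }
        };
        // Subsequent calls with the same key skip the benchmarks entirely.
        self.kernels[fastest].run(input)
    }
}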
As you can guess, any autotuning strategy adds some overhead to the first calls; the goal is to save as much time as possible in the long run. We've picked a straightforward strategy that is guaranteed to adapt directly to any hardware, as portability is one of our main objectives. We also want Autotune to work with automatic kernel fusion, which creates never-seen-before kernels just-in-time for highly specialized tasks. Tuning by hand is simply not possible in such a dynamic context.
Autotune in Burn
In a previous blog post [3], I presented Burn-Compute's architecture and mentioned that Autotune was a part of it. Since Burn-Compute is a reusable basis for all of our in-house backends, Autotune will by default be part of our future backends. At the moment, we use it only in our WGPU backend, but it will be trivial to enhance our envisaged CUDA backend with it.
The subtlety of Autotune lies in the key used to cache kernels. If it summarizes the setting in which the kernel is needed too precisely, there will be needless cache misses. For instance, an operation on matrices of shape 231x231 or 233x233 has an extremely high probability of being optimal with the same kernel, so we should not run benchmarks for both. But if we encounter the shape 32x1024 afterwards, it may behave very differently.
More subtle yet, if we meet the shape 512x512 (512 is a power of 2, therefore a very "round" number in computer science), then maybe some kernel will be very fast; but on a 512x511 matrix, that same kernel may need to add padding first to turn it into a round 512x512 matrix with zeros at the end of each row. Adding those zeros means shifting all rows, which is a very costly operation.
This is of course very dependent on the nature of the operation. As we will see, the padding problem appears in Matmul but not in Reduction. In Burn, we therefore chose to make the key a custom trait implementation, customizable for every operation in every backend. Here are, for instance, our keys for the operations we will look at in the next sections:
pub struct ReduceAutotuneKey {
    reduce_dim_length: usize, // Dimension on which we reduce (closest power of 2)
    reduce_dim_stride: usize, // Stride of reduce dimension (for data locality)
    others_product: usize,    // Product of all other dimensions (closest power of 2)
}

pub struct MatmulAutotuneKey {
    round: bool,            // True when all matmul dims are multiples of tile size
    broadcast: bool,        // True when there are differences in inputs batch size
    anchored_m: usize,      // M dimension (closest power of 2)
    anchored_k: usize,      // K dimension (closest power of 2)
    anchored_n: usize,      // N dimension (closest power of 2)
    anchored_batch: usize,  // Number of batches (closest power of 2), topped at 256
}
Since the cache is always local in our current implementation, we do not bother keeping the device information in the key, although it may be included in the future.
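To illustrate what "closest power of 2" means for these anchored fields, here is a small, hypothetical anchor helper. It is not Burn's actual implementation; rounding to the nearest power of 2 and capping with an optional maximum are assumptions made for the example, but the goal is the same: collapse many nearby shapes onto a single key.

/// Round a dimension to the nearest power of 2 so that nearby shapes
/// (e.g. 231 and 233) collapse onto the same autotune key.
/// `max` optionally caps the result (e.g. 256 for the batch dimension).
fn anchor(value: usize, max: Option<usize>) -> usize {
    let exp = (value as f64).log2().round() as u32;
    let anchored = 2usize.pow(exp);
    match max {
        Some(cap) => anchored.min(cap),
        None => anchored,
    }
}

fn main() {
    assert_eq!(anchor(231, None), 256); // 231x231 and 233x233 share a key
    assert_eq!(anchor(233, None), 256);
    assert_eq!(anchor(1024, None), 1024);
    assert_eq!(anchor(3000, Some(256)), 256); // batch dimension topped at 256
}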
Of note, we do not force Autotune to run all benchmarks on the actual input for which we request a kernel. Rather, we choose a random input with a shape that is representative of the key. For instance, we know that the execution time of our Matmul kernels always scales linearly with the batch size. If we need to run thousands of batches in the model, running Autotune on that batch size would take way more time than necessary, for the same information gain.
Also, if for instance all inputs between 512x512 and 1024x1024 shared the same key, should we run benchmarks on 1024x1024, which would likely take longer, or on 512x512 for a shorter Autotune time, at the risk of unreliable results at the higher end of the interval? We believe a representative, such as the median size 768x768, would be a great choice. We haven't explored the idea of a median representative in Burn yet, but it would certainly be interesting to measure how much more precise Autotune would become.
Tensor Operations on GPU
Remember how I argued that Autotune was a necessity: the execution speed of tensor operation kernels, in particular GPU ones, depends heavily on the system in use and on the input tensor shapes. In my personal experience, I find the impact of the tensor shape understandable, but the impact of the device less predictable, which is why I'm so happy to leave it in the hands of Autotune.
Using Reduction and Matmul as examples, we will see how GPU concepts related to the input specifications impact which kernel is the right one to choose.
Since comparing devices is not the objective here, but rather the impact of input shape, all benchmarks I will present are run on my personal laptop (MacBook Pro's Apple M2 Pro 19-Core integrated graphics card, with Metal 3 API), using Burn's WGPU backend.
Dividing the Workload
We will look at some fundamental concepts of GPU computing, using the WebGPU nomenclature, which differs somewhat from that of CUDA. First, the computation is divided into a grid, which is a collection of workgroups, which are themselves collections of invocations.
The grid is an arbitrarily long collection of workgroups. Typically, if a tensor has n elements and one workgroup is able to process m of them, then the grid consists of n/m workgroups (rounded up). The grid's role is to ensure that the whole computation is carried out, despite the fact that workgroups have limited sizes. The grid does not guarantee any order nor parallelism between workgroup computations, therefore we must consider all workgroups to work in silos.
It's within a workgroup that things get interesting. A workgroup is a fixed-size collection of invocations, which are essentially threads (in Burn we often launch 1024 invocations per workgroup). These threads all work at the same time on the GPU and can share data through a shared memory which leverages the GPU cache. They can also be synchronized with a barrier if needed.
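As a tiny example of how the grid size is typically derived, here is an illustrative ceiling-division helper (not part of Burn), assuming 1024 invocations per workgroup and one element per invocation:

/// Number of workgroups needed so that workgroup_size * workgroups >= num_elements.
/// Ceiling division: the last workgroup may be partially idle.
fn workgroup_count(num_elements: usize, workgroup_size: usize) -> usize {
    (num_elements + workgroup_size - 1) / workgroup_size
}

fn main() {
    // A tensor of 1_000_000 elements with 1024 invocations per workgroup.
    assert_eq!(workgroup_count(1_000_000, 1024), 977);
}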
Important Considerations of GPU Algorithm Design
When writing a GPU kernel, for instance in WGSL (the WebGPU Shading Language) [4] as we do in our WGPU backend, you write only the code for one invocation. From its workgroup id in the grid and its invocation id in the workgroup, you can compute which specific input and output tensor elements this invocation should work on.
Some important GPU concepts can have a high impact on the execution speed in different settings:
- The shared memory allows for communication across threads, but not across workgroups. Therefore, if some elements of the input tensor must interact together, they should be managed by the same workgroup. This memory is also closer than the global memory where the input lies, so if a value must be fetched several times (by different threads of the same workgroup), it's probably worth saving it in shared memory.
- One must avoid concurrency errors, both across invocations of a workgroup and across the grid, when writing results. While CUDA offers atomic write primitives [5] that keep two threads from updating a value at the same time, WebGPU does not afford us this luxury, making it more delicate for threads to collaborate on the same output cells.
- Memory coalescing is something GPUs can do that I personally find magical: if several threads with adjacent IDs access adjacent memory locations at the same time, the GPU can perform a single memory read transaction for all of them. When writing GPU kernels, maximizing memory coalescing should be a goal. However, verifying that it occurs can be subtle, and it is precisely the type of behaviour that can vary across devices.
- Somewhat related to that is the concept of branch divergence, which must be minimized. Branch divergence occurs when threads within a workgroup do not take the same path in the code, usually because of an if statement. The best example I can give is the one I mentioned earlier, when I said a kernel can be very good on 512x512 shapes but poor on 512x511. Suppose we have 64 threads working in parallel. If they operate in the middle of the matrix, there is no issue. However, when positioned on the edge, in the rounded case the 64 threads operate on indices 448 to 511, whereas in the truncated shape they should cover only indices 448 to 510 (a total of 63 values). The 64th thread can either do the computation anyway on the data that follows, which will very likely lead to corrupted data, or go through an if to skip the computation the others are doing. This typically breaks all hopes of good memory coalescing within that workgroup, and threads have to wait for each other to be synchronized again. This is why it may (or may not) be best to pad the 511-element row with a zero to reach a round number and avoid the need for if statements.
Reduction
In the textbook 1-dimensional case, reducing means computing one value from a whole vector, leveraging the associativity of an underlying binary operation. Many variations exist (sum, product, maximum, etc.) but they often share the same algorithmic structure. For instance, [12, 3, 5, 4, 15, 2] can be reduced to 41 in the sum case, or to 15 for the maximum. Many different strategies exist, some with a simple for loop over all values, and some that use recursion to leverage parallelism in a divide-and-conquer fashion [6].
In the N-dimensional case, we typically reduce over one dimension, going for instance from an MxKxN tensor to a 1xKxN one when the reduce dimension is 0. In that case, we reduce M values together, KxN times. We will call the M values that must be reduced together a reduce column. In the figure below, the green block is the output KxN tensor, and the purple block is one of the reduce columns, consisting of M elements. For visualization I've included only one reduce column, but there is in fact one for each element of the output tensor.
What is important here is that, along the 0th dimension, the M values of a reduce column fundamentally need to interact together. Therefore, those values cannot be spread across different workgroups, since they would not be able to share information. On the other hand, each of the KxN reduce columns can be treated totally independently from one another.
One Invocation per Reduce Column
Our first reduce kernel gives the responsibility of computing a whole reduce column to a single invocation, which simply executes a for loop over the whole reduce column. This may be a lot of computing for one thread when M is large, but it maximizes parallelization across the threads, which can all work in a well-vectorized way. Also, there is no risk of concurrency error, as each output value is managed by only one thread.
In the above, workgroups are separated by solid lines and invocations by dashed lines.
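Here is a CPU-side sketch of that strategy for a sum reduction on dimension 0 of a contiguous MxKxN tensor. On the GPU, the body of the outer loop is what a single invocation executes, with `column` derived from its workgroup and invocation ids; the function below is only an illustration, not the actual WGSL kernel.

/// CPU analogue of the "one invocation per reduce column" sum-reduce kernel.
/// `input` is a contiguous MxKxN tensor; we reduce over dimension 0.
fn reduce_dim0_one_thread_per_column(input: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let stride = k * n; // distance between two consecutive elements of a reduce column
    let mut output = vec![0.0f32; k * n];
    // On the GPU, each iteration of this outer loop is one invocation.
    for column in 0..k * n {
        let mut acc = 0.0;
        for i in 0..m {
            acc += input[i * stride + column];
        }
        output[column] = acc;
    }
    output
}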
One Workgroup per Reduce Column
Our second reduce kernel instead gives the responsibility of a reduce column to one workgroup. For large M, threads can work together through the shared memory to compute one reduce column much more quickly. Again, in the figure below, workgroups are separated by solid lines and invocations by dashed lines.
There are however two caveats:
- If the other dimensions are large, the grid must consist of a lot of workgroups. Launching many workgroups is much slower than launching many threads.
- When computing a reduce column, threads can use a divide-and-conquer method, as sketched after this list. However, as the column gets reduced, more and more threads become idle, likely causing branch divergence.
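Below is a CPU-side sketch of the tree reduction mentioned in the second caveat, assuming the column's values have already been loaded into shared memory. It is an illustration of the strategy, not the actual kernel: on the GPU, each iteration of the inner loop belongs to a different thread, with a synchronization barrier between the outer steps, and the threads left out of the inner loop are the ones that sit idle.

/// CPU analogue of reducing one column inside a workgroup with a
/// divide-and-conquer (tree) sum reduction. `shared` plays the role of the
/// workgroup's shared memory, already filled with the column's values.
fn workgroup_tree_reduce(shared: &mut [f32]) -> f32 {
    let mut active = shared.len();
    while active > 1 {
        let half = (active + 1) / 2;
        for i in 0..active / 2 {
            // Thread i adds its partner's value; threads with an id >= active / 2
            // do nothing in this step, which is where branch divergence creeps in.
            shared[i] += shared[i + half];
        }
        active = half;
    }
    shared[0]
}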
Autotuning Reduction
Let's see what Autotune has to say about it. The following benchmarks consider two very different reduction scenarios. In both examples, the first line is the Autotune key, the next two give the median computation time over a few executions, and the last line gives the selected, fastest kernel.
Reduce - reduce_dim_length: 2048 reduce_dim_stride: 1024 others_product: 1024
OneColumnPerInvocationReduce<f32, 3> => 8.98ms
OneColumnPerWorkgroupReduce<f32, 3> => 3.88ms
Fastest: OneColumnPerWorkgroupReduce<f32, 3>
Reduce - reduce_dim_length: 32 reduce_dim_stride: 1 others_product: 65536
OneColumnPerInvocationReduce<f32, 3> => 1.36ms
OneColumnPerWorkgroupReduce<f32, 3> => 19.01ms
Fastest: OneColumnPerInvocationReduce<f32, 3>
In the first example, the reduce dimension is much larger than the product of all the others, therefore it makes sense to use one workgroup per column. The difference in time is also due to the large reduce_dim_stride, which prevents memory coalescing from occurring. In the second example, assigning one column per workgroup would mean creating tens of thousands of workgroups, each with far more threads than its 32-element column needs, which is very slow compared to creating 65536 threads that work independently.
Matmul
Let's move on to our second operation. The point here is not to explain matrix multiplication in detail, but to understand its complexity with regard to its inputs. On 2-dimensional input tensors (which are simply matrices), the result of a matrix A of size MxK times a matrix B of size KxN is an MxN output matrix C. Every value of C is the sum of K multiplications of an element of A with an element of B: C[i][j] is the sum over k of A[i][k] x B[k][j]. On N-dimensional tensors, all dimensions other than the last two are simply batch dimensions, which can be seen as other unrelated instances of the kernel. For that reason, we will assume 2-dimensional tensors in the following explanations.
The computation of one output element depends on many elements of the inputs. This makes it impossible to have several threads work on a single output element without simply repeating work. For that reason, all our kernels make each thread responsible for one output element (contrary to our second reduce kernel).
Since every pair-wise multiplication (one specific element of A with one specific element of B) is used exactly once, we cannot hope to reuse intermediate computation results. Therefore the algorithm must necessarily take at least (size of C) x K = MxNxK computation steps.
Naive Approach (with Memory Coalescing)
The naive approach simply consists in launching one invocation per output element. Each thread then iteratively reads values from a row of A and from a column of B, as in the figure below. The secret ingredient of this kernel is memory coalescing. Conceptually, if threads of the same workgroup with ids 0, 1, 2 and 3 read from memory cells i, i+1, i+2 and i+3 at the same time, these reads can be done all at once. Repeat this pattern everywhere and you get a pretty efficient kernel, considering its simplicity.
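The following CPU-side sketch shows the work distribution of this naive kernel: each (row, col) pair corresponds to one invocation. The coalescing itself is a hardware effect that cannot be shown here, but note that invocations with adjacent `col` values read adjacent cells of B, which is what can make coalescing possible on the GPU.

/// CPU analogue of the naive matmul kernel: one "thread" per output element.
/// `a` is MxK, `b` is KxN, both row-major; the result is MxN.
fn naive_matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    // On the GPU, each (row, col) pair is handled by one invocation.
    for row in 0..m {
        for col in 0..n {
            let mut acc = 0.0;
            for i in 0..k {
                // Adjacent `col` values read adjacent cells of `b` for a given `i`.
                acc += a[row * k + i] * b[i * n + col];
            }
            c[row * n + col] = acc;
        }
    }
    c
}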
However, this approach does not use the GPU cache intelligently.
Padded Tiling 2D Approach
The tiling approach is all about saving time through the use of shared memory, which normally lives in the GPU cache and is much faster to access than the inputs in global memory. As mentioned earlier, all pair-wise multiplications are unique, but a single element of A (or B) is used by many threads, many of which are in the same workgroup. We can leverage this.
In the tiling 2D kernel, we divide both input matrices into fixed-size tiles, and we also prepare one tile-sized shared memory buffer per input matrix. Working in synchronization with the other invocations of its workgroup, a thread first contributes to filling the shared memory with the current input tile, then performs all the partial computations needed for the output element it is responsible for, helping itself to values in the shared memory that were loaded by its sibling threads. The process is then repeated for all the tiles along the inner product dimension, to complete the computation of the output value.
I'm leaving out a lot of details as this gets very technical; more details and diagrams can be found in [7]. The important thing to realize is that we add steps that load values from global memory to the GPU cache, relying on thread synchronization within a workgroup, in order to gain data locality when performing the actual additions and multiplications.
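For readers who want a more concrete picture, here is a CPU-side sketch of the tiling idea with a deliberately tiny tile size. The two local arrays play the role of the workgroup's shared memory buffers; the real kernel is more involved (see [7]), and this sketch assumes the "round" case where all dimensions are multiples of the tile size.

const TILE: usize = 2; // tiny tile size, just for illustration

/// CPU sketch of tiled matmul: `a` is MxK, `b` is KxN, both row-major,
/// with M, K and N assumed to be multiples of TILE (the "round" case).
fn tiled_matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for tile_row in (0..m).step_by(TILE) {
        for tile_col in (0..n).step_by(TILE) {
            // These two buffers play the role of the workgroup's shared memory.
            let mut tile_a = [[0.0f32; TILE]; TILE];
            let mut tile_b = [[0.0f32; TILE]; TILE];
            // Walk along the inner (K) dimension one tile at a time.
            for tile_k in (0..k).step_by(TILE) {
                // 1) Cooperative load: on the GPU each thread loads a few cells,
                //    then all threads hit a synchronization barrier.
                for i in 0..TILE {
                    for j in 0..TILE {
                        tile_a[i][j] = a[(tile_row + i) * k + tile_k + j];
                        tile_b[i][j] = b[(tile_k + i) * n + tile_col + j];
                    }
                }
                // 2) Each thread accumulates its output element from the tiles.
                for i in 0..TILE {
                    for j in 0..TILE {
                        let mut acc = 0.0;
                        for p in 0..TILE {
                            acc += tile_a[i][p] * tile_b[p][j];
                        }
                        c[(tile_row + i) * n + tile_col + j] += acc;
                    }
                }
            }
        }
    }
    c
}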
As I said, the tiles have a fixed size, which is configurable, but the optimal value depends on the GPU architecture. Let's say we fix it to 64. Then any input size that is not a multiple of 64 will cause problems, because the last threads in the last tile may wrongly read from the row after theirs.
Hence the need for padding. If the last threads just read zeros instead of the values of the following row, they can do their computation like the other threads but with a value of zero, which will benignly add zero to the output.
Unpadded Tiling 2D Approach
Padding can be extremely costly because it changes the data layout: a new tensor must be allocated, and all rows must be shifted in order to insert zeros in between. In the following picture, we go from a 3x3 tensor to a 4x4 one while keeping it contiguous. Then, after the Matmul, the shifting must be undone to retrieve the original output shape.
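A minimal sketch of that padding step, assuming a row-major matrix and a hypothetical pad_to_tile helper, shows why it is costly: a new buffer is allocated and every row is copied to a shifted offset.

/// Pad a row-major MxK matrix with zeros so that both dimensions become
/// multiples of `tile`. Every row is copied to a new offset in a freshly
/// allocated buffer, which is what makes padding expensive.
fn pad_to_tile(input: &[f32], m: usize, k: usize, tile: usize) -> (Vec<f32>, usize, usize) {
    let round_up = |x: usize| ((x + tile - 1) / tile) * tile;
    let (padded_m, padded_k) = (round_up(m), round_up(k));
    let mut padded = vec![0.0f32; padded_m * padded_k];
    for row in 0..m {
        // Source row starts at row * k, destination row at row * padded_k.
        padded[row * padded_k..row * padded_k + k]
            .copy_from_slice(&input[row * k..(row + 1) * k]);
    }
    (padded, padded_m, padded_k)
}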
Couldn't we just prevent out-of-bounds threads from doing any computation, using an if statement? In a GPU kernel, that is not necessarily the fastest way, because an if will almost certainly create branch divergence. Still, that's what we do in our unpadded version of the tiling 2D algorithm. To be more precise, the conditional statement sits at the stage where we load data into the shared memory, so that the if only guards reads from and writes to global memory, while all computations operate without branching.
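As a small illustration of that idea, a guarded load might look like the hypothetical helper below: the bounds check lives only where global memory is read, and out-of-bounds positions contribute a zero, exactly like padding would, but without touching the data layout.

/// Guarded load for the unpadded tiling kernel: the bounds check lives only
/// in the load from global memory. Out-of-bounds positions contribute a zero,
/// so the compute loops that follow need no `if` at all.
fn load_tile_element(a: &[f32], row: usize, col: usize, m: usize, k: usize) -> f32 {
    if row < m && col < k {
        a[row * k + col]
    } else {
        0.0
    }
}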
Autotuning Matmul
Now let's see what the benchmarks have to say about which kernel is best. While running Burn's text classification training example [8], I selected three examples from the Autotune log. Again, the first line is the key and the last one is the winning kernel.
Matmul - Round:false Broadcast:false m:1024 k:256 n:256 batch:32
NaiveMemoryCoalescingMatmul<f32, 3> => 12.58ms
PaddedTiling2DMatmul<f32, 3> => 6.29ms
UnpaddedTiling2DMatmul<f32, 3> => 3.93ms
Fastest: UnpaddedTiling2DMatmul<f32, 3>
Matmul - Round:true Broadcast:true m:256 k:1024 n:4 batch:32
NaiveMemoryCoalescingMatmul<f32, 3> => 21.59ms
PaddedTiling2DMatmul<f32, 3> => 5.04ms
UnpaddedTiling2DMatmul<f32, 3> => 5.16ms
Fastest: PaddedTiling2DMatmul<f32, 3>
Matmul - Round:false Broadcast:false m:256 k:128 n:4 batch:32
NaiveMemoryCoalescingMatmul<f32, 3> => 1.34ms
PaddedTiling2DMatmul<f32, 3> => 1.43ms
UnpaddedTiling2DMatmul<f32, 3> => 1.35ms
Fastest: NaiveMemoryCoalescingMatmul<f32, 3>
In the first example, the unpadded tiling version wins by a large margin. This is actually a common behaviour on my computer. For the padded version to win, it helps if the matrices are round, as in example 2. Still, it wins only by a small margin. Finally, the third example shows the naive version winning, which typically happens on smaller inputs, a sign that the tiling algorithm's overhead can be overkill.
Conclusion
In theory, it would be feasible to design an algorithm that chooses the right kernel as a function of the input size. However, the speed of a kernel across all input sizes can be hard to predict. Furthermore, some kernels are parameterizable, which means there are actually many more than two or three versions of an operation. But most of all, the GPU device on which the kernel runs adds many uncertainties: when running on some new device, what will be the GPU cache size? Will specialized accelerators be available? How many cores will work in parallel? How will memory coalescing actually occur?
Selecting the right kernel for an extensive AI model computation is an important decision that can save a great deal of compute. But a perfect algorithm that gives the right answer on all machines would be too complex to handcraft. Autotune lets us write new kernels for specific scenarios without worrying about the logistics of running them.
One may worry about the overhead induced by running all kernels at the beginning. For applications where cold start is crucial, for instance cloud applications where new instances spawn very often on the same machine to do similar jobs, our Autotune mechanism will rely on cached, pre-computed benchmarks. That way, using Autotune ensures peak performance in all settings.