Creating High Performance Asynchronous Backends With Burn-Compute
Introduction
When we first published the Burn-Wgpu crate [1], our primary focus was on achieving broad portability without sacrificing correctness, postponing efficiency optimizations for later. From the feedback we soon gathered from our users, we realized that the next priority was to reduce memory consumption. Indeed, in the early stage of the WGPU backend, memory dedicated to tensors that were no longer used was never reused. Each tensor allocated a new chunk of memory through the Graphics API, and that memory was deallocated when the tensor went out of scope. We did have optimized in-place operations, as described in a previous blog post [2], but every new tensor still meant allocating and deallocating memory through the Graphics API, which remained very slow and costly. In addition, deallocations were only triggered after many kernels had executed, often creating high memory peaks.
With that in mind, we needed an intelligent reuse of memory that would reduce memory peaks and avoid allocations and deallocations when possible. After contemplating the idea of a major refactor of the Burn-Wgpu crate, we realized we would face the same problems with all of our future in-house backends (which are self-contained within the Burn project, as opposed to our third-party backends based on LibTorch [3] or Candle [4]). So we decided to go for an abstract approach, relying heavily on Rust traits, to write all the logic that applies to every in-house backend in one place. The specifics of every backend can then be encapsulated within the trait implementations.
This is how we arrived at the concept of Burn-Compute [5]. It is a crate within the Burn project that abstracts many backend mechanics, even beyond memory management. Indeed, we have used this architecture to separate mutable environments from immutable ones, allow for transparent asynchronous kernel execution, and even perform automatic kernel selection, which we call autotuning.
Client-Server Architecture
In a high-performance backend, asynchronous computation is key for parallelization and responsiveness [6]. Put another way, the actual computations on tensors in the model should not interfere with the normal execution of the framework. Burn-Compute's main purpose is to isolate the asynchronous computations from the rest of the software, using a client-server architecture like the one below:
In this figure, squares are concrete structs while rounded squares are abstract traits. The Compute Channel is implemented by either a mutex-locking channel, a multi-producer, single-consumer (MPSC) channel, or a RefCell channel, which is used in single-threaded applications such as no-std environments. Memory Management is implemented by the Simple Memory Management strategy (which can easily be swapped for other strategies once they exist).
Finally, the Compute Server and the Storage must be implemented in every backend based on Burn-Compute. In our WGPU backend, the server encapsulates WGPU concepts such as Command Encoders, Queues and Compute Pipelines, while the storage encapsulates the WGPU buffers [7].
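To make this layering more concrete, here is a rough sketch of how these pieces could relate to one another; the trait and type names are our own illustration, not Burn-Compute's exact API.

```rust
use std::sync::{Arc, Mutex};

// A rough sketch of the layers described above; names and signatures are
// illustrative, not Burn-Compute's exact definitions.

/// Backend-specific: wraps the raw memory resources (e.g. WGPU buffers).
trait ComputeStorage {
    type Resource;
    fn alloc(&mut self, size: usize) -> Self::Resource;
    fn dealloc(&mut self, resource: Self::Resource);
}

/// Backend-specific: owns the storage and executes kernels asynchronously.
trait ComputeServer {
    type Kernel;
    type Handle;
    type Storage: ComputeStorage;

    fn execute(&mut self, kernel: Self::Kernel, handles: &[Self::Handle]);
    fn sync(&mut self);
}

/// Provided by Burn-Compute: how the immutable client reaches the mutable
/// server. Implemented by a mutex-locking channel, an MPSC channel, or a
/// RefCell-based channel for single-threaded (e.g. no-std) use.
trait ComputeChannel<S: ComputeServer> {
    fn execute(&self, kernel: S::Kernel, handles: &[S::Handle]);
    fn sync(&self);
}

/// Example: a mutex-locking channel. The client side stays immutable;
/// mutability is confined to the server behind the lock.
struct MutexChannel<S> {
    server: Arc<Mutex<S>>,
}

impl<S: ComputeServer> ComputeChannel<S> for MutexChannel<S> {
    fn execute(&self, kernel: S::Kernel, handles: &[S::Handle]) {
        self.server.lock().unwrap().execute(kernel, handles);
    }

    fn sync(&self) {
        self.server.lock().unwrap().sync();
    }
}
```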
As you may have noticed, we have also added an arrow going into the Compute Client to emphasize that using Burn-Compute is done through calls to the client, which has a simple API:
- Empty: allocates a memory space of a given size, and returns a handle to this space;
- Create: similar to Empty, but fills the space with the given data;
- Read: returns the data at a given handle;
- Execute: given a kernel and handles to inputs and outputs of the kernel, asks the server to run the kernel;
- Sync: waits for all asynchronous executions to be over;
- Execute_autotune: see the Autotune section below.
When implementing operations for a backend, a few calls to the client are enough for memory to be handled intelligently and automatically, as the sketch below illustrates.
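For illustration, that surface could be written down as a trait along these lines; the method names mirror the list above, but the signatures are simplified and are not Burn-Compute's actual ones.

```rust
// An illustrative view of the client API listed above; signatures are
// simplified assumptions, not Burn-Compute's exact definitions.
trait ComputeClient {
    type Handle; // points to a reserved memory space on the server side
    type Kernel;

    /// Allocates a memory space of the given size and returns a handle to it.
    fn empty(&self, size: usize) -> Self::Handle;

    /// Like `empty`, but fills the space with the given data.
    fn create(&self, data: &[u8]) -> Self::Handle;

    /// Returns the data behind the handle, waiting for pending work if needed.
    fn read(&self, handle: &Self::Handle) -> Vec<u8>;

    /// Asks the server to run the kernel over the given input/output handles.
    fn execute(&self, kernel: Self::Kernel, handles: &[&Self::Handle]);

    /// Waits for all asynchronous executions to be over.
    fn sync(&self);

    /// Benchmarks and selects among equivalent kernels (see the Autotune section).
    fn execute_autotune(&self, kernels: Vec<Self::Kernel>, handles: &[&Self::Handle]);
}
```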
Memory Management
Although Memory Management is written as a trait, we do not require that a backend define its own struct for it (although it could). Indeed, we already provide what we call the Simple Memory Management strategy, which can be used transparently by any backend that leverages Burn-Compute.
Concepts
Before we explain the memory management algorithms in detail, let us define some concepts (a code sketch illustrating them follows the list):
- Allocating memory means asking the storage to reserve actual memory space for data. In WGPU, it means creating a whole new buffer.
- Deallocating memory means freeing that memory, so that no part of the code points to it anymore.
- A chunk of memory is a contiguous, fixed-size region of memory that was reserved at once during one allocation.
- A slice of memory is a portion of a chunk, defined by a starting index over the chunk and a size.
- A free chunk is a chunk which is not used by any tensor or any slice. It is important to distinguish between a free chunk and deallocated memory: the former still takes up space in memory, and it is wasted unless reused cleverly.
- A free slice is, similarly, a slice which is not used by any tensor.
- Reserving memory means finding a place for new data, either as a chunk or a slice, with either new allocation or reuse.
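As a rough illustration of these concepts, chunks and slices could be tracked with types like the ones below. The definitions are our own (the `Rc`-backed ids anticipate the reference-counting trick described in the Cleanup Algorithm section), not Burn-Compute's actual types.

```rust
use std::rc::Rc;

// Illustrative bookkeeping types for the concepts above; Burn-Compute's real
// definitions differ, but the roles are the same.
#[derive(Clone)]
struct ChunkId(Rc<()>); // cloning an id is what bumps the strong count

#[derive(Clone)]
struct SliceId(Rc<()>);

struct Chunk {
    id: ChunkId,
    size: usize,          // fixed size reserved during a single allocation
    slices: Vec<SliceId>, // slices currently carved out of this chunk
}

struct Slice {
    id: SliceId,
    chunk: ChunkId, // the chunk this slice lies on
    offset: usize,  // starting index within the chunk
    size: usize,    // number of bytes covered by the slice
}
```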
We also offer two sub-strategies, both sketched in code after this list:
- The deallocation strategy allows us to configure how often we should go through the process of converting free chunks to deallocated memory. It can be either every n memory reserves (which we call Deallocation period in the Benchmarks section), every n seconds (unavailable in no-std environment), or simply never. Remember, never deallocating does not mean that we are always allocating new chunks for each new tensor because free chunks can be reused before being deallocated.
- The slice strategy allows us to configure in what setting we can use a slice over a chunk. Suppose we have a free chunk of 1,000 bytes and we then need two contiguous spaces of 500 bytes each; it would make sense to reuse the chunk with two slices. But consider another scenario where a very small tensor of 10 bytes arrives first, followed by a new 1,000-byte tensor. If we are not careful, the 10-byte tensor may take a slice over the chunk, leaving only 990 bytes, which will not be enough for the new 1,000-byte tensor, forcing us to allocate more memory. The slice ratio in the Benchmarks section is the minimum fraction of a chunk's length that a slice must cover in order to be placed on that chunk.
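As a sketch, the two sub-strategies could be encoded as follows; the variant names are chosen to match the Benchmarks section below and are illustrative, not necessarily Burn-Compute's exact API.

```rust
use std::time::Duration;

// Illustrative encoding of the two sub-strategies described above.
enum DeallocationStrategy {
    PeriodTick(usize),    // deallocate free chunks every `n` memory reserves
    PeriodTime(Duration), // or every `n` seconds (unavailable in no-std)
    Never,                // never deallocate; rely purely on chunk reuse
}

enum SliceStrategy {
    Never,      // never carve a slice out of an existing chunk
    Ratio(f32), // only slice a chunk if the slice covers at least this fraction of it
}

impl SliceStrategy {
    /// Can a slice of `slice_size` bytes be placed on a chunk of `chunk_size` bytes?
    fn can_use_chunk(&self, chunk_size: usize, slice_size: usize) -> bool {
        match self {
            SliceStrategy::Never => false,
            SliceStrategy::Ratio(ratio) => {
                chunk_size >= slice_size
                    && (slice_size as f32) >= (chunk_size as f32) * ratio
            }
        }
    }
}
```

With a ratio of 0.9, as used in the benchmarks, a 10-byte tensor would not be allowed to take a slice on a free 1,000-byte chunk, which avoids the pathological case described above.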
Memory Reserve Algorithm
The algorithm for reserving memory takes as input a simple integer representing the size of the tensor for which memory is needed.
It begins by searching through all allocated chunks, discarding those that are not free or too small. If it finds a free chunk of exactly the requested size (which may happen rather often in predictable scenarios), it stops the search and uses it.
If no exact match exists, it falls back to creating a new slice on the smallest free chunk that can accept a slice (according to the slice strategy). Only if no such chunk exists is a new chunk allocated, with the exact needed size.
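Using the illustrative Chunk and SliceStrategy types sketched above, the selection logic could look roughly like this; it only plans the decision, and the freeness check is passed in as a closure. This is an approximation, not Burn-Compute's actual code.

```rust
// Where should `size` bytes go? Builds on the illustrative types from the
// previous sketches; the actual bookkeeping (registering the slice,
// allocating through the storage) is omitted.
enum Reservation {
    ReuseChunk(ChunkId),     // an exactly-sized free chunk is reused as-is
    SliceOnChunk(ChunkId),   // a slice is carved out of an existing free chunk
    AllocateNewChunk(usize), // a brand new chunk of exactly `size` bytes
}

fn plan_reservation(
    chunks: &[Chunk],
    is_free: impl Fn(&Chunk) -> bool,
    slice_strategy: &SliceStrategy,
    size: usize,
) -> Reservation {
    // 1. A free chunk of exactly the requested size wins immediately.
    if let Some(chunk) = chunks.iter().find(|&c| is_free(c) && c.size == size) {
        return Reservation::ReuseChunk(chunk.id.clone());
    }

    // 2. Otherwise, slice the smallest free chunk that is large enough and
    //    that the slice strategy allows us to use.
    if let Some(chunk) = chunks
        .iter()
        .filter(|&c| is_free(c) && c.size >= size)
        .filter(|&c| slice_strategy.can_use_chunk(c.size, size))
        .min_by_key(|c| c.size)
    {
        return Reservation::SliceOnChunk(chunk.id.clone());
    }

    // 3. Only then do we allocate new memory, with the exact needed size.
    Reservation::AllocateNewChunk(size)
}
```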
Cleanup Algorithm
How do we know if slices and chunks are free?
We leverage Rust's reference counting. A chunk is free when no tensor or slice points to it; since it is still held by the memory management, that means its strong count is exactly 1. A slice, on the other hand, is free when no tensor points to it, but it is still held by both the memory management and the chunk it lies on; therefore, a slice is free when its strong count is exactly 2.
Because free slices are trivial to delete (we only need to remove the slice ID from the memory management and the underlying chunk), we delete them on every new memory reservation. The deletion of free chunks depends on the deallocation strategy.
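With ids backed by `Rc` as in the earlier sketches, the freeness checks boil down to comparing strong counts. Again, this is an illustration of the idea rather than Burn-Compute's exact code.

```rust
use std::rc::Rc;

// Freeness checks via reference counting, assuming the illustrative ids from
// the Concepts section. Tensors hold clones of these ids, so the strong count
// tells us how many places still point at the memory.
fn chunk_is_free(id: &ChunkId) -> bool {
    // Only the memory management itself still holds the chunk id.
    Rc::strong_count(&id.0) == 1
}

fn slice_is_free(id: &SliceId) -> bool {
    // The memory management and the underlying chunk both hold the slice id,
    // so a slice with no tensor pointing at it has a strong count of exactly 2.
    Rc::strong_count(&id.0) == 2
}
```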
Benchmarks
This simple strategy is already enough to lead to remarkable improvements.
Our benchmarks are done on an NVIDIA graphics card. They compare the peak memory consumption attained for two different models: a static convolution network and a dynamic transformer encoder. We have run the benchmarks on both our previous WGPU implementation (Burn v0.9.0) and on the new WGPU implementation based on Burn-Compute. We also compare against the reference LibTorch backend, which is highly optimized for CUDA devices.
Convolution network:

| Backend | Deallocation Period | Slice Ratio | Max GPU Memory |
|---|---|---|---|
| WGPU - v0.9.0 | - | - | 290 MiB |
| WGPU | 1000 | 0.9 | 154 MiB |
| WGPU | 128 | 0.8 | 146 MiB |
| WGPU | Never | 0.9 | 146 MiB |
| LibTorch | - | - | 264 MiB |
Transformer encoder:

| Backend | Deallocation Period | Slice Ratio | Max GPU Memory |
|---|---|---|---|
| WGPU - v0.9.0 | - | - | 3723 MiB |
| WGPU | 1000 | 0.9 | 3229 MiB |
| WGPU | 128 | 0.8 | 2234 MiB |
| LibTorch | - | - | 2232 MiB |
As we can see, for both models the memory usage of WGPU before Burn-Compute was worse than LibTorch's. For static graphs, our simple memory management strategy is already very effective, cutting memory use roughly in half. In this specific situation, we even beat LibTorch, because the combination of in-place operations with Burn-Compute brings us close to optimal memory reuse for static graphs.
On the other hand, our strategy is still too simplistic to perform optimally in the dynamic setting. However, it's still significantly more efficient than in our previous version, and almost on par with LibTorch if we use the right configuration.
Autotune
Now, let us explain the autotune mechanism of Burn-Compute, which is largely independent of the memory management but still lives within Burn-Compute, since it can be used transparently by any in-house backend. When Execute_autotune is called on the Compute Client, the call first goes through the Tuner struct, which uses an autotuning mechanism to choose which kernel should be executed. Other Compute Client commands are then run depending on the decisions of the Tuner.
The concept of autotuning is rather simple. Suppose you have access to different kernels (or one kernel but with different parameter combinations) that all achieve the same results, albeit at different computing speeds. You will want to use the fastest, but which is it? Some kernels may outperform the others on small inputs but be very slow on large inputs. Furthermore, some may be faster only on specific hardware, making it impossible to have a fixed kernel selection strategy that works for everyone. Autotuning is about finding the fastest kernel in any setting, simply by benchmarking all the possibilities first. For more details, you may refer to [8].
This is a very dynamic and general strategy, but it comes at the cost of some computational overhead for all users. Most of the difficulty in implementing autotune comes from minimizing this overhead.
One obvious strategy is to use a cache, so that the fastest kernel for an already seen setting can be reused without benchmarking again. However, a cache requires a mutable environment. Why, then, did we choose to keep the tuner on the client side, which imposes the use of another mutex, rather than on the server side, which is designed to be mutable? Because the server works at a lower level of abstraction than the tuner: on the server side, the tuner would have had to deal with the handle logic of the memory management, giving it very little flexibility. By sitting on the client side, it can benchmark complex operations that may be composed of several kernels.
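To make the idea concrete, here is a minimal sketch of a tuner with a cache. The struct, its fields, and the key (for example, an input-shape signature) are our own illustration, with kernels represented as plain closures; this is not Burn-Compute's Tuner.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::Instant;

// Minimal autotune sketch: equivalent operations are benchmarked once per key
// (e.g. an input-shape signature) and the winner is cached so subsequent calls
// with the same key skip the benchmark entirely.
struct Tuner<K> {
    cache: HashMap<K, usize>,       // key -> index of the fastest candidate
    candidates: Vec<Box<dyn Fn()>>, // interchangeable ways to compute the same result
}

impl<K: Hash + Eq> Tuner<K> {
    fn execute_autotune(&mut self, key: K) {
        let index = match self.cache.get(&key).copied() {
            Some(i) => i,
            None => {
                // First time this key is seen: run and time every candidate.
                // (In a real asynchronous backend, each timing would also have
                // to wait for the device to finish, e.g. by syncing.)
                let fastest = self
                    .candidates
                    .iter()
                    .enumerate()
                    .min_by_key(|(_, op)| {
                        let start = Instant::now();
                        op();
                        start.elapsed()
                    })
                    .map(|(i, _)| i)
                    .expect("at least one candidate kernel");
                self.cache.insert(key, fastest);
                fastest
            }
        };
        // Run the winning candidate for the actual request.
        let op = &self.candidates[index];
        op();
    }
}
```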
We will not delve deeper into the autotune mechanism for now. Because it is closely linked to other interesting concepts, such as matrix multiplication algorithms and their benchmarks, we prefer to dedicate a full blog post to it in the near future.
Conclusion
It has now become very straightforward to create a new backend from scratch. It suffices to implement a Compute Server, a Storage struct and computation kernels to automatically benefit from clever memory management, asynchronous kernel execution, and autotuning. One of our hopes is that any independent chip manufacturer can write a backend for their hardware and have everything in hand to be competitive.