Improve Rust Compile Time by 108X

Digital art generated by Stable Diffusion.
Wed Jan 15 2025
Nathaniel Simard

Disclaimer

Before you get too excited, the techniques used to reduce compilation time are not applicable to all Rust projects. However, I expect the learnings to be useful for any Rust developer who wants to improve their project's compilation time. Now that this is clarified, let's dive into the results.

Results

We started with a compilation time of 108 seconds for the matmul benchmarks, which was reduced to only 1 second after all the optimizations. The most effective optimization was the element-type generics swap, where we instantiated generic functions with predefined "faked" element types to reduce the amount of LLVM[1] code generated. The second optimization also had a major impact, further reducing the compilation time by nearly 3×. This was achieved by using our comptime system instead of associated constants to represent the matmul instruction sizes. Finally, the last optimization, also the simplest, was to reduce the LLVM optimization level to zero, which is particularly useful for debug builds such as tests.

[Figure: matmul benchmark compilation times after each optimization.]

Compilation times are measured using incremental compilation.

Motivation

First, let me explain the situation that led me to investigate our compile time. During the last iteration of CubeCL[2], we refactored our matrix multiplication GPU kernel to work with many different configurations and element types. CubeCL is a dialect that lets you program GPUs for high-performance computing applications directly in Rust. The project supports multiple compiler backends, namely WebGPU[3], CUDA[4], ROCm[5], and Vulkan[6], with more to come.

The refactoring of the matrix multiplication kernel was done to improve tensor core utilization across many different precisions. In fact, each kernel instance works across 3 different element types: the global memory element type, the staging element type, and the accumulation element type. These are all different since we normally want to accumulate in a higher precision element type, as this is where numerical approximation is most sensitive. Also, the tensor cores don't work across all input precisions; if you have f32 inputs, you need to convert them to the tf32 element type (staging) to call the tensor core instructions. To add to the complexity, tensor core instructions only work with fixed matrix shapes that also depend on the precision. For instance, f16 staging matrices work with any of the shapes (32, 8, 16), (16, 16, 16), and (8, 32, 16), but tf32 only works with (16, 16, 8).

In our first refactoring, we represented the matrix shapes supported by the instructions using associated constants, since this is the abstraction that makes the most sense in this case. For the element types, we naturally used generic arguments on traits and functions - pretty much what any developer would do in this scenario. However, with all the possible combinations, we ended up with a compilation time of 1m48s using the cache.

Yes, you read that right: 1m48s just to rebuild the matmul benchmark if we change anything in the bench file.
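
For reference, here is a minimal sketch of roughly what that first design looked like. The names are simplified stand-ins rather than the actual CubeCL traits (Mma16x16x16 in particular is made up for illustration): the instruction shape is encoded as associated constants and the element type is a generic parameter, so every instruction and precision combination gets monomorphized separately.

  // Simplified sketch of the original abstraction (hypothetical names).
  pub trait Float: Copy {}

  pub struct Matrix; // placeholder for the real matrix handle

  pub trait Instruction<F: Float> {
      // Tile shape (m, n, k) supported by the tensor core instruction.
      const M: u32;
      const N: u32;
      const K: u32;

      fn execute(lhs: Matrix, rhs: Matrix, accumulator: Matrix);
  }

  // Example: a hypothetical instruction handling the (16, 16, 16) tile shape.
  pub struct Mma16x16x16;

  impl<F: Float> Instruction<F> for Mma16x16x16 {
      const M: u32 = 16;
      const N: u32 = 16;
      const K: u32 = 16;

      fn execute(_lhs: Matrix, _rhs: Matrix, _accumulator: Matrix) {
          // ...
      }
  }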

For the purpose of this work, we only consider incremental compilation with cargo caching, since this is what matters most for dev iteration time. Changing one configuration to test whether an optimization worked took almost 2 minutes just to build a binary that executes a few matmuls.

Why is it that slow?

Well, we need to understand that the Rust compiler is actually very fast. The slow parts are the linking and LLVM. The best way to improve compilation time is to reduce the amount of LLVM IR generated.

In our specific case, each combination of the matmul generates a whole new function - this is what zero-cost abstraction means. There is no dynamic dispatch; every distinct set of types duplicates the code to improve performance at the cost of a bigger binary. Before all of our optimizations, the generated binary was 29 MB; after them, it was down to 2.5 MB - a huge difference.
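
As a toy illustration in plain Rust (nothing CubeCL-specific), every distinct combination of type arguments produces its own monomorphized copy of a generic function, so the generated LLVM IR and the binary grow with the number of combinations:

  // Toy example of monomorphization: each distinct set of type arguments
  // produces its own copy of the function in the generated code.
  fn matmul_kernel<EG, ES, EA>() {
      // A real kernel would use the global (EG), staging (ES), and
      // accumulation (EA) element types here.
  }

  fn main() {
      // Three combinations -> three separate copies of `matmul_kernel`.
      // (f32/f64 only stand in for the real f16/tf32/f32 combinations.)
      matmul_kernel::<f32, f32, f32>();
      matmul_kernel::<f32, f64, f64>();
      matmul_kernel::<f64, f64, f64>();
  }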

To reduce the amount of generated code, we had to use different Rust techniques to build our abstractions for the matmul components. In our case, we don't need zero-cost abstractions, since the Rust code for the matmul components only generates, at runtime, the kernel that is then compiled for and executed on the GPU. Only the GPU code needs to be fast; the JIT Rust code takes almost no time at runtime. Zero-cost abstraction would be optimal in a setting where we performed ahead-of-time compilation of kernels.

Ever wonder why LibTorch[7] or cuBLAS[8] have executables that are GIGABYTES in size? Well, it's because all kernels for all precisions with all edge cases must be compiled to speed up runtime execution. This is necessary in a compute-heavy workload like deep learning.

However, CubeCL is different - it performs JIT compilation, so we don't need to compile every possible variation ahead of time when building the binary: we can use dynamic abstractions instead! This is one of the two optimizations we made for the matmul components: instead of relying on associated constants, we leveraged the comptime system to access the instruction sizes dynamically while a kernel is compiled at runtime. This was actually the second optimization we made, and it helped us go from a 14s compilation time to around 5s.
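
Here is a hedged sketch of the idea, with hypothetical names rather than the actual CubeCL comptime API: the instruction size becomes plain data that the kernel-building code reads while expanding a kernel at runtime, instead of a constant baked into a monomorphized type.

  // Hypothetical sketch: the tile shape is ordinary data resolved at
  // kernel-build time, so adding a new instruction size does not add a new
  // monomorphized instance.
  #[derive(Clone, Copy, Debug)]
  pub struct TileShape {
      pub m: u32,
      pub n: u32,
      pub k: u32,
  }

  // One non-generic code path covers every instruction size.
  pub fn expand_matmul_tile(shape: TileShape) {
      // `shape.m`, `shape.n` and `shape.k` drive the GPU IR generated here.
      let _ = (shape.m, shape.n, shape.k);
  }

  pub fn select_shape(staging_is_tf32: bool) -> TileShape {
      if staging_is_tf32 {
          TileShape { m: 16, n: 16, k: 8 }
      } else {
          TileShape { m: 16, n: 16, k: 16 }
      }
  }

Since the shape is only needed while the GPU kernel is being built, keeping it as a value costs nothing in the generated GPU code.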

However, the biggest optimization was quite hard to pull off and is linked to the generic element types passed to each function. We still wanted zero-cost abstractions in this case, since passing around an enum describing which element type each operation works on would be terrible in terms of developer experience. The hint for improving our compilation time was that every function meant to execute on the GPU is annotated with the #[cube] attribute.

We want the code to look and feel like normal Rust, but the macro actually parses the written Rust code and generates another function, which we normally call the expand function, where the GPU IR for that function is built. It is the expand function that runs at runtime, not the code the user writes. The element type generics are only used to convert the generic element type into the IR element type enum. The expand functions also receive a context where all the IR is tracked.

So the optimization was to pass a fake generic element type, called FloatExpand, instead of the actual element type like f32. When compiling a kernel, we first register the real element type in the context, using a const generic position to differentiate element types when a function has several generic element types. Since we always call the expand function with the exact same generic arguments for all element types, we only generate one instance of that function, and the real element types are fetched at runtime from the context.
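
For a function with more than one generic element type, the same trick would look roughly like the following fragment, extrapolated from the snippet further below; EG and ES are assumed names for two generic float types, and the exact signatures may differ.

  // Each real element type is registered under its own const generic position.
  builder
      .context
      .register_elem::<FloatExpand<0u8>>(EG::as_elem_native_unchecked());
  builder
      .context
      .register_elem::<FloatExpand<1u8>>(ES::as_elem_native_unchecked());

  // The expand function is always instantiated with the same generic
  // arguments, so only one copy of it is ever generated.
  expand::<FloatExpand<0u8>, FloatExpand<1u8>, I>(&mut builder.context);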

The most tedious part was actually implementing that optimization without breaking our components. The biggest problem it causes is that, in CubeCL launch functions, a generic trait parameter can no longer depend on the element type.


  // Doesn't compile: the macro calls the expand function with FloatExpand
  // and the provided I, but I is bound to the original element type F.
  #[cube(launch)]
  fn matmul<F: Float, I: Instruction<F>>() {
      // ...
  }
  

It makes sense though - we don't want to recompile all the instruction types for every precision. Since our optimization is activated at the boundary between CPU code and GPU code, where cube functions are identified as launchable, the generic trait must not depend on the element type, because the element types are swapped out by our macro. Instead of traits with generic element types, we use generic associated types.

This is known as the family pattern[9], where a trait describes a family of types.


  pub trait InstructionFamily {
      type Instruction<F: Float>: Instruction<F>;
  }

  pub trait Instruction<F: Float> {
      fn execute(lhs: Matrix, rhs: Matrix, accumulator: Matrix);
  }
  

Using this pattern, we can inject the family type at the boundaries of CPU and GPU code and instantiate the inner instruction type with the expand element type.


  #[cube(launch)]
  fn matmul<F: Float, I: InstructionFamily>() {
     // ...
     I::Instruction::<F>::execute(lhs, rhs, accumulator);
     // ...
  }

  // Launch code generated by the macro #[cube(launch)].
  mod matmul {
    // More code is generated for other stuff ...

    // Expand function generated.
    pub fn expand<F: Float, I: InstructionFamily>(
        context: &mut CubeContext,
    ) {
      // The element type IR can be accessed like this:
      let elem = F::as_elem(context);
      // ..
    } 
     
    // The kernel definition is built here.
    impl<F: Float, I: InstructionFamily, __R: Runtime> Kernel for Matmul<F, I, __R> {
        fn define(&self) -> KernelDefinition {
            let mut builder = KernelBuilder::default();

            // Register the real element type.
            builder
                .context
                .register_elem::<FloatExpand<0u8>>(F::as_elem_native_unchecked());

            // Instantiate the expand function with the expand float element type.
            expand::<FloatExpand<0u8>, I>(&mut builder.context);
            builder.build(self.settings.clone())
        }

        fn id(&self) -> cubecl::KernelId {
            let cube_dim = self.settings.cube_dim.clone();
            // Still use the original element type for the JIT compilation cache.
            KernelId::new::<Self>().info((cube_dim,))
        }
    }
  }
  

By migrating most of the components to the new pattern, we reduced the compile time from 1m48s to about 14s.

It was a lot of work, and I don't expect all projects to face cases like this, but it was worth it! Now, waiting about 5 seconds after changing something to see whether performance improved doesn't break the flow, whereas almost 2 minutes did.

We essentially leveraged the fact that CubeCL is a JIT compiler and not an AOT compiler, which is very appropriate for throughput-focused high-performance applications.

Playing with LLVM optimization settings

Since our benchmarks are compiled with the optimization level set to 3, we could still improve the compilation time further, down to about 1s, by reducing the optimization level to zero: another 5X speedup, obtained simply by adjusting the LLVM optimization level.


  [profile.dev]
  opt-level = 0

We decided not to keep that optimization in production, since we want the benchmarks to have the same LLVM optimization level as user applications. However, we activated it for testing, since we often rerun tests to ensure we don't break correctness when implementing or optimizing kernels.

Not a Rust Compiler Issue

Our optimizations actually involved writing and generating tons of code - we used proc macros, generic associated types, const generics, and plenty of complex features from the Rust type system.

The Rust compiler is actually very fast; the slow part is really the linking and optimizing of the LLVM IR. If there's one thing to take from this post, it's that you shouldn't worry about using complex features of Rust, but make sure you don't generate huge binaries. Reducing the binary size will improve compile time even if you use complex methods to do so! "Less code compiles faster" is not exactly right. "Less generated code compiles faster" is what we have to keep in mind!

References

[1] The LLVM Compiler Infrastructure
[2] CubeCL: Multi-platform high-performance compute language extension for Rust.
[3] WGPU: Cross-platform, safe, pure-rust graphics api
[4] CUDA Toolkit
[5] AMD ROCm Software
[6] Rust implementation of SPIR-V
[7] LibTorch Documentation
[8] cuBLAS: CUDA Basic Linear Algebra Subroutine Library
[9] Generic Associated Types RFC: Associated type constructors of type arguments
