Burn
Explore Burn
Your go-to destination for machine learning and high-performance computing insights. Stay informed with our latest updates.

Recent

Why Quantization Matters

Modern deep learning models, such as large language models (LLMs), are heavily constrained by memory bandwidth. GPUs can execute floating-point operations (FLOPs) much faster than they can fetch weights from memory. For instance, an NVIDIA A10 has a peak computation throughput of 125 TFLOPS and a memory bandwidth of 600 GB/s.

Mon Feb 10 2025
Guillaume Lagrange
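
As a quick back-of-the-envelope illustration of why those two figures matter (a sketch added here, not taken from the post itself), dividing peak compute by memory bandwidth gives the arithmetic intensity a workload needs to keep the GPU busy:

```rust
// Rough roofline check for the A10 numbers quoted above.
// Illustrative only; exact figures depend on precision and clocks.
fn main() {
    let peak_flops = 125e12_f64; // 125 TFLOPS
    let bandwidth_bytes = 600e9_f64; // 600 GB/s

    // FLOPs the GPU can issue per byte fetched from memory (~208).
    let flops_per_byte = peak_flops / bandwidth_bytes;
    println!("~{flops_per_byte:.0} FLOPs per byte of memory traffic");

    // One token of autoregressive decoding touches every weight once,
    // doing roughly 2 FLOPs per weight. With fp16 weights (2 bytes each)
    // that is about 1 FLOP per byte, far below the ratio above, which is
    // why shrinking the weights via quantization helps so much.
    let decode_intensity = 2.0 / 2.0;
    println!("LLM decode intensity: ~{decode_intensity} FLOP per byte");
}
```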

Improve Rust Compile Time by 108X

We started with a compilation time of 108 seconds for the matmul benchmarks, which was reduced to only 1 second after all the optimizations. The most effective optimization was the element-type generics swap, where we instantiated generic functions with predefined "faked" element types to reduce the amount of LLVM code generated. The second optimization also had a major impact, further reducing the compilation time by nearly 3×. This was achieved by using our comptime system instead of associated const generics to represent the matmul instruction sizes. Finally, the last optimization—also the simplest—was to reduce the LLVM optimization level to zero, which is particularly useful for debug builds, such as tests.

Wed Jan 15 2025
Nathaniel Simard
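
To make the first optimization more concrete, here is a hypothetical sketch of the element-type swap idea; the names (`FakeElem`, `ElemType`, `launch`) are illustrative, not Burn's or CubeCL's actual API. Generic kernel code is instantiated once with a placeholder element type, and the real element type travels as a runtime value, so LLVM only has to process a single monomorphized copy:

```rust
// Placeholder element type: the only type the generic code is compiled for.
trait Element: 'static {}
struct FakeElem;
impl Element for FakeElem {}

#[derive(Clone, Copy, Debug)]
enum ElemType {
    F32,
    F16,
}

/// Generic kernel entry point. LLVM only ever sees `launch::<FakeElem>`,
/// which keeps the amount of generated IR (and thus compile time) small.
fn launch<E: Element>(elem: ElemType, len: usize) {
    // The runtime value decides the actual element size, stride, etc.
    let bytes = match elem {
        ElemType::F32 => 4 * len,
        ElemType::F16 => 2 * len,
    };
    println!("launching kernel for {elem:?}, {bytes} bytes");
}

fn main() {
    // Both element types go through the same monomorphized instance.
    launch::<FakeElem>(ElemType::F32, 1024);
    launch::<FakeElem>(ElemType::F16, 1024);
}
```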

Burn 0.16.0 Release Notes

This release brings major performance improvements to tensor operations, particularly in matrix multiplication and convolution, along with experimental ROCm/HIP and SPIR-V support enabled by CubeCL runtimes. It also introduces foundational features for multi-backend compatibility and adds new quantization operations.

Tue Jan 14 2025
Guillaume Lagrange

Technical Posts

Optimal Performance without Static Graphs by Fusing Tensor Operation Streams

This post explores Burn's tensor operation stream strategy, optimizing models through an eager API by creating custom kernels with fused operations. Our custom GELU experiment reveals a remarkable improvement of up to 78 times on our WGPU backend.

Tue Mar 19 2024
Nathaniel Simard
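
As a rough illustration of what gets fused (a sketch in plain Rust, not Burn's tensor API), the tanh approximation of GELU is a chain of elementwise primitives; run eagerly, each step would be its own kernel launch and its own intermediate buffer, which is exactly what operation-stream fusion avoids:

```rust
// Illustrative only: GELU (tanh approximation) written as the chain of
// elementwise primitives an eager API would normally run as separate kernels.
// Fusing the whole chain into one generated kernel avoids materializing
// every intermediate tensor.
fn gelu_fused(input: &[f32]) -> Vec<f32> {
    const SQRT_2_OVER_PI: f32 = 0.797_884_6;
    input
        .iter()
        .map(|&x| {
            // cube, scale, add, tanh, add 1, scale by 0.5*x: six "ops", one pass.
            let inner = SQRT_2_OVER_PI * (x + 0.044_715 * x * x * x);
            0.5 * x * (1.0 + inner.tanh())
        })
        .collect()
}

fn main() {
    let out = gelu_fused(&[-1.0, 0.0, 1.0, 2.0]);
    println!("{out:?}");
}
```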

Autotune for GPU Kernels: Ensuring Consistent Peak Performance

Crafting high-performance GPU kernels for common deep learning operations, such as matrix multiplication (matmul) and reduction, requires finesse. The speed of these kernels varies depending on input shapes and the GPU device in use, meaning the fastest one may change based on the context. In Burn, Autotune automates kernel selection at runtime, allowing one to create a plethora of kernel variations with confidence that the best-performing one will be executed in every situation.

Fri Dec 15 2023
Louis Fortier-Dubois
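
The mechanism itself is simple to sketch (hypothetical names below, not Burn's actual autotune API): benchmark every candidate kernel the first time a given input shape is seen, cache the winner, and reuse it afterwards:

```rust
use std::collections::HashMap;
use std::time::Instant;

// Toy autotuner: times each candidate once per input-shape key and caches
// the fastest. Illustrative only; Burn's real autotune is more involved.
type Kernel = fn(&[f32]) -> f32;

fn sum_naive(x: &[f32]) -> f32 {
    x.iter().sum()
}

fn sum_chunked(x: &[f32]) -> f32 {
    // A second candidate with a different access pattern.
    x.chunks(64).map(|c| c.iter().sum::<f32>()).sum()
}

struct Autotuner {
    candidates: Vec<Kernel>,
    cache: HashMap<usize, Kernel>, // key: input length, standing in for shape
}

impl Autotuner {
    fn run(&mut self, x: &[f32]) -> f32 {
        let key = x.len();
        if !self.cache.contains_key(&key) {
            // First time this shape is seen: benchmark every candidate.
            let best = *self
                .candidates
                .iter()
                .min_by_key(|kernel| {
                    let start = Instant::now();
                    std::hint::black_box(kernel(x));
                    start.elapsed()
                })
                .expect("at least one candidate kernel");
            self.cache.insert(key, best);
        }
        (self.cache[&key])(x)
    }
}

fn main() {
    let mut tuner = Autotuner {
        candidates: vec![sum_naive, sum_chunked],
        cache: HashMap::new(),
    };
    let data = vec![1.0_f32; 4096];
    println!("sum = {}", tuner.run(&data)); // benchmarks, then caches the winner
    println!("sum = {}", tuner.run(&data)); // reuses the cached kernel
}
```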

Creating High Performance Asynchronous Backends With Burn-Compute

Developing new high-performance deep learning backends in Burn has become remarkably easy, as they can be readily enhanced with advanced capabilities such as asynchronous computations, intelligent memory management, and autotuning mechanisms. The innovative Burn-Compute crate lays the architectural foundation for in-house backends, effortlessly equipping them with advanced features to maximize efficiency.

Tue Nov 07 2023
Louis Fortier-Dubois
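
As a rough idea of the kind of interface such a compute layer exposes (the trait and method names below are hypothetical, not burn-compute's actual API), a backend implements a server that allocates buffers, enqueues kernels without blocking, and only synchronizes when results are read back:

```rust
use std::collections::HashMap;

// Hypothetical compute-server shape: `execute` only enqueues work, and the
// caller pays for synchronization only when it needs the bytes back.
trait ComputeServer {
    type Handle;
    /// Allocate (or reuse, via the memory manager) a buffer for this data.
    fn create(&mut self, data: Vec<u8>) -> Self::Handle;
    /// Enqueue a kernel; returns immediately without waiting for completion.
    fn execute(&mut self, kernel: &str, handle: &Self::Handle);
    /// Synchronize pending work and copy the buffer back to the host.
    fn read(&mut self, handle: &Self::Handle) -> Vec<u8>;
}

/// Toy synchronous stand-in, just to show how a backend plugs into the trait.
struct CpuServer {
    buffers: HashMap<usize, Vec<u8>>,
    next_id: usize,
}

impl ComputeServer for CpuServer {
    type Handle = usize;

    fn create(&mut self, data: Vec<u8>) -> usize {
        let id = self.next_id;
        self.next_id += 1;
        self.buffers.insert(id, data);
        id
    }

    fn execute(&mut self, kernel: &str, handle: &usize) {
        // A real server would push this onto a device queue and return.
        println!("enqueue {kernel} on buffer {handle}");
    }

    fn read(&mut self, handle: &usize) -> Vec<u8> {
        // A real server would flush the queue here before copying back.
        self.buffers[handle].clone()
    }
}

fn main() {
    let mut server = CpuServer { buffers: HashMap::new(), next_id: 0 };
    let handle = server.create(vec![1, 2, 3]);
    server.execute("scale_by_two", &handle);
    println!("{:?}", server.read(&handle));
}
```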

Burn's New Cross-Platform GPU Backend

Introducing Burn's new Cross-Platform GPU Backend built using WGPU. Burn now supports running deep learning models on a variety of hardware configurations, leveraging graphics APIs such as Vulkan, DirectX 11/12, Metal, OpenGL, and WebGPU. We discuss the possible applications in various domains and glimpse into the promising future of the framework.

Tue Jul 25 2023
Nathaniel Simard, Louis Fortier-Dubois
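
For reference, selecting the GPU backend from user code is a small change in recent Burn releases; the sketch below assumes the `wgpu` feature is enabled in Cargo.toml and that the backend is exposed as `burn::backend::Wgpu`, which may differ between versions:

```rust
use burn::backend::Wgpu;
use burn::tensor::Tensor;

fn main() {
    // Pick the WGPU-based backend; the default device selects the best
    // available adapter (Vulkan, Metal, DX12, or WebGPU, depending on the
    // platform).
    type Backend = Wgpu;
    let device = Default::default();

    // Any tensor created with this backend now runs on the GPU.
    let tensor = Tensor::<Backend, 2>::ones([2, 3], &device);
    println!("{tensor}");
}
```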

Reduced Memory Usage: Burn's Rusty Approach to Tensor Handling

The latest release of Burn includes significant changes to its memory management strategy, and tensor-allocated memory can now be reused far more often. Overall, these changes significantly reduce memory usage compared to PyTorch, especially on the CPU.

Tue Mar 21 2023
Nathaniel Simard
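
The core idea is easy to sketch with a plain `Vec` standing in for a tensor buffer (illustrative only, not Burn's implementation): when an operand is passed by value, the framework knows the buffer has no other owner and can write the result into it in place:

```rust
// Ownership-based reuse: because the operand is moved into the function,
// nothing else can observe it, so the output is written into the same
// allocation instead of a fresh one.
fn add_scalar(mut buffer: Vec<f32>, scalar: f32) -> Vec<f32> {
    for value in &mut buffer {
        *value += scalar; // in-place: no new allocation
    }
    buffer
}

fn main() {
    let a = vec![1.0_f32, 2.0, 3.0];
    let ptr_before = a.as_ptr();
    let b = add_scalar(a, 10.0); // `a` is moved, so its buffer can be reused
    assert_eq!(ptr_before, b.as_ptr()); // same allocation, reused
    println!("{b:?}");
}
```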

Stay connected

Join our community! We'd love to keep you in the loop with our newsletter.
