Going Big and Small for 2025

Mon Dec 23 2024
Louis Fortier-Dubois

2024 marked a significant evolution in Burn's architecture. Traditional deep learning frameworks often require developers to compromise between performance, portability, and flexibility; we aimed to transcend these trade-offs. Looking ahead to 2025, we are committed to applying this philosophy across the entire computing stack, encompassing everything from embedded devices to data centers.

2024 Review: Breaking Hardware Boundaries

Redefining Kernel Development

This year started with a limitation: our WGPU backend depended on basic WGSL templates, restricting our adaptability. That challenge led to the creation of CubeCL [1], our solution for unified kernel development. The task was complex, requiring a single abstraction that fits diverse hardware while sustaining top performance. The results have validated this strategy, with performance now matching or outstripping LibTorch in the majority of our benchmarks.
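To give a feel for the programming model, here is a minimal kernel sketch written in the spirit of the examples in the CubeCL repository; the exact attribute, trait, and constant names are assumptions that may vary between versions.

```rust
use cubecl::prelude::*;

// A GPU kernel written as plain Rust: the #[cube] macro family compiles the
// body to the target platform (CUDA, HIP, WGPU, ...) at runtime.
// Names follow the published CubeCL examples and may differ by version.
#[cube(launch)]
fn scale<F: Float>(input: &Array<F>, output: &mut Array<F>, factor: F) {
    // ABSOLUTE_POS is the global index of the current execution unit.
    if ABSOLUTE_POS < input.len() {
        output[ABSOLUTE_POS] = input[ABSOLUTE_POS] * factor;
    }
}
```

Because the body is ordinary Rust, the same source can be type-checked and tested once, then specialized per platform by the CubeCL compiler rather than hand-written in each vendor's kernel language.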

Multi-Backend Architecture

The backend ecosystem now includes CUDA [2], HIP/ROCm [3], and an advanced WGPU implementation supporting both WebGPU and Vulkan [4]. The most notable achievement to date is reaching performance parity across different backends on identical hardware: matrix multiplication, for example, performs nearly identically whether executed through CUDA or Vulkan, a direct reflection of our platform-agnostic optimization strategy.
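In day-to-day code, this portability shows up as a single generic parameter. Below is a minimal sketch, assuming recent Burn type paths (`burn::backend::Wgpu`, the `Backend` trait) and the matching crate features; treat both as assumptions tied to the version you use.

```rust
use burn::tensor::{backend::Backend, Distribution, Tensor};

// The same matmul, generic over the backend that executes it.
fn matmul_demo<B: Backend>(device: &B::Device) {
    let a = Tensor::<B, 2>::random([1024, 1024], Distribution::Default, device);
    let b = Tensor::<B, 2>::random([1024, 1024], Distribution::Default, device);
    let c = a.matmul(b);
    println!("{:?}", c.dims());
}

fn main() {
    // Swapping backends is a one-type change; here we pick WGPU.
    type B = burn::backend::Wgpu;
    matmul_demo::<B>(&Default::default());
}
```

Pointing `B` at the CUDA or ROCm backend type instead (with the corresponding feature enabled) leaves `matmul_demo` untouched, which is what makes cross-backend benchmarks directly comparable.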

We’ve introduced new Router and HTTP backends: the router backend enables dynamic mixing of multiple backends, while the HTTP backend supports distributed processing across multiple machines. To address memory management challenges, we’ve implemented a pooling and checkpointing mechanism that allows operation fusion even during backward passes.
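The router backend's public API is still evolving, so the following is a purely illustrative sketch built around a hypothetical `Routed` enum: it spells out the dispatch that dynamic backend mixing implies, which Burn's actual Router backend hides behind a single `Backend` implementation.

```rust
use burn::tensor::{activation, backend::Backend, Tensor};

// Hypothetical: explicit dispatch between two backends chosen at runtime.
// The real router backend makes this transparent to user code.
enum Routed<B1: Backend, B2: Backend> {
    First(B1::Device),
    Second(B2::Device),
}

fn relu_anywhere<B1: Backend, B2: Backend>(route: &Routed<B1, B2>) {
    match route {
        Routed::First(device) => {
            let t = Tensor::<B1, 1>::from_floats([0.5, -1.0, 2.0], device);
            println!("{:?}", activation::relu(t).into_data());
        }
        Routed::Second(device) => {
            let t = Tensor::<B2, 1>::from_floats([0.5, -1.0, 2.0], device);
            println!("{:?}", activation::relu(t).into_data());
        }
    }
}
```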

Hardware-Agnostic Acceleration

Our hardware acceleration strategy marks a significant technical milestone. Rather than depending on platform-specific libraries like cuBLAS [5] or rocBLAS [6], we've developed a compiler stack that harnesses the best features of each platform while ensuring compatibility across them. This involved overcoming intricate challenges in code generation and optimization, especially for operations such as matrix multiplication, which must make efficient use of tensor cores across various hardware architectures.

2025 Roadmap: Embracing Both Extremes

For 2025, we will tackle two fundamental challenges in deep learning deployment.

Going Small: Quantization

Quantization [7] is crucial for computing with limited resources. Our approach builds on the fusion of complex operations, with the "fuse on read" feature enabling seamless integration of tasks like reduction into the computational pipeline. This fusion strategy automatically handles the packing and unpacking of quantized values, so quantized operations run efficiently without manual tweaks. The outcome? High-performance quantized operations that preserve accuracy while cutting resource demands.
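The arithmetic underneath is standard affine integer quantization [7]. Here is a small standalone sketch of the quantize/dequantize round trip; it illustrates the math only and is not Burn's internal API.

```rust
// Per-tensor affine int8 quantization: q = round(x / scale) + zero_point.
fn quantize(values: &[f32], scale: f32, zero_point: i8) -> Vec<i8> {
    values
        .iter()
        .map(|&x| ((x / scale).round() as i32 + zero_point as i32).clamp(-128, 127) as i8)
        .collect()
}

// Dequantization recovers an approximation: x ≈ (q - zero_point) * scale.
fn dequantize(values: &[i8], scale: f32, zero_point: i8) -> Vec<f32> {
    values
        .iter()
        .map(|&q| (q as i32 - zero_point as i32) as f32 * scale)
        .collect()
}

fn main() {
    let data = [0.0_f32, 0.5, 1.0, -1.0];
    // Scale chosen so the range [-1, 1] maps onto the int8 range.
    let (scale, zero_point) = (1.0 / 127.0, 0_i8);
    let q = quantize(&data, scale, zero_point);
    println!("{q:?} -> {:?}", dequantize(&q, scale, zero_point));
}
```

Fusing these conversions into neighboring kernels ("fuse on read") is what lets the quantized pipeline avoid materializing intermediate float tensors.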

Going Big: Scalable Distributed Computing

At the other end of the spectrum lies distributed computing. We will leverage our Router and HTTP backends to build a robust distributed training infrastructure, aiming for a smooth experience where workloads flow effortlessly between different hardware and backend configurations while resource utilization is optimized across heterogeneous computing environments.

To support this vision of universal compatibility, we're expanding our backend ecosystem by:

  • developing a Metal backend to fully leverage Apple Silicon, going beyond what WGPU currently offers [8];
  • implementing a just-in-time vectorized CPU backend in Rust for enhanced CPU performance;
  • opening the door to new backend possibilities, such as FPGA support, to ensure Burn can adapt to any computing environment.

Finally, we will invest heavily in the developer experience, with comprehensive documentation for CubeCL and a drive toward API stabilization in Burn. These improvements will make it easier for developers to take full advantage of our cross-platform capabilities.

In 2024, we proved that cross-platform performance doesn't require compromise. For 2025, we will extend this principle across the computing spectrum – from microcontrollers to server farms. By solving the technical challenges at both extremes, we're working to make deep learning more accessible and efficient, regardless of scale or hardware constraints.

References

[1] CubeCL: Multi-platform high-performance compute language extension for Rust.
[2] CUDA Toolkit.
[3] AMD ROCm Software.
[4] Rust implementation of SPIR-V.
[5] cuBLAS: CUDA Basic Linear Algebra Subroutine Library.
[6] rocBLAS: ROCm Basic Linear Algebra Subprograms Library.
[7] Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation.
[8] Rust Bindings for Metal.
