Announcing Burn-LM (alpha): LLM Inference Engine

Digital art generated by Stable Diffusion.
Mon, Aug 4, 2025
Nathaniel Simard

We're happy to announce the next project we've been working on: an LLM inference engine based on Burn! The goal of Burn-LM[1] is actually bigger than that: we want to support any large model (LLMs, VLMs, and others), not only for inference but also for training (pre-training, post-training, and fine-tuning). We believe training and inference should be tied together, making it trivial to continuously improve your models over time. It's not something we talk about a lot right now, but we believe in continual learning[2]: a model learns from its mistakes and can evolve the more it performs actions in the world.

However, most tools out there are specialized for research, large-scale training, large-scale inference, or small on-device inference, which makes it harder to improve models in a unified manner. Burn-LM is a step in that direction, and by no means a complete solution to that problem... yet. We're not building and training specific models for that purpose, but we're building the tools so that it becomes easier, whether or not you're using Burn-LM. The project is really a testbed for Burn and CubeCL. We don't want to include hardware-specific and model-specific optimizations in Burn-LM directly. Instead, we want to find generalizable solutions that work across all hardware and models and implement those optimizations directly in Burn, benefiting everyone using it for any kind of model. This is a big difference from other projects such as vLLM[3] and llama.cpp[4], which are solely focused on LLM inference performance. The hope is to reach or surpass their level of performance across all hardware with clean implementations that can be customized.

Roadmap

As mentioned in the title, the project is at an early stage; however, we wanted to release it open-source as it may already be useful to some. The roadmap is focused on performance improvements rather than model support for now, since that better aligns with our objective. However, we would love to see Burn-LM adopted for many use cases, and model availability will become important. Therefore, we're really open to contributors who want to add popular model implementations.

Quantization

We've been working on quantization in Burn for some time. We wanted a general solution that works across all backends and hardware. As a reminder, quantization is an optimization technique that reduces the memory footprint of a model's weights by compressing floating-point numbers. Burn 0.19.0 should fully support quantization integrated with our fusion compiler. We are working on supporting block quantization and sub-byte types, making it easy to run larger models efficiently.
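
To make the idea concrete, here is a minimal sketch of symmetric int8 quantization in plain Rust. It is illustrative only and does not use Burn's actual quantization API: a single per-tensor scale maps f32 weights into i8, cutting weight memory by 4x, and block quantization simply applies the same recipe per fixed-size block of weights instead of per tensor.

```rust
// Illustrative sketch of symmetric int8 weight quantization (not Burn's API).
// A per-tensor scale maps f32 weights into i8, shrinking memory 4x at some
// cost in precision.
fn quantize_symmetric_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    // Pick the scale so the largest absolute weight maps to 127.
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

fn dequantize_i8(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let weights = vec![0.02, -0.5, 0.73, -1.2, 0.0];
    let (quantized, scale) = quantize_symmetric_i8(&weights);
    let restored = dequantize_i8(&quantized, scale);
    println!("quantized: {quantized:?}, scale: {scale}, restored: {restored:?}");
}
```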

Flash Attention

We've been pretty busy optimizing our matmul kernels[5] in Burn lately. We wanted to find a proper kernel architecture, autotune kernel selection, and fusion integration with kernels that reach SOTA performance before implementing many other algorithms; doing so earlier would only have increased the maintenance burden with relatively low impact. But now we're ready, and we're implementing flash attention kernels that work across all hardware thanks to CubeCL. This will reduce the memory footprint of LLM inference, especially with long context windows, making it possible to run with less memory and generate tokens faster. The attention method will be available in Burn, speeding up any transformer model built with the framework, not just LLMs in Burn-LM.
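
The core trick behind flash attention is an online softmax: scores are processed tile by tile with a running maximum and a running sum, so the full attention matrix is never materialized. Below is a minimal CPU sketch of that idea for a single query; it is not the CubeCL kernel, only an illustration of why memory stays bounded by the tile size rather than the context length.

```rust
// Online-softmax attention for a single query, processing keys/values in
// tiles. Plain CPU Rust for illustration; real flash attention fuses this
// into a GPU kernel.
fn attention_single_query(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    tile: usize,
) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut running_max = f32::NEG_INFINITY; // max score seen so far
    let mut running_sum = 0.0f32; // softmax denominator so far
    let mut acc = vec![0.0f32; values[0].len()]; // unnormalized weighted sum of values

    for (k_tile, v_tile) in keys.chunks(tile).zip(values.chunks(tile)) {
        // Scores for this tile only; the full score vector never exists.
        let scores: Vec<f32> = k_tile
            .iter()
            .map(|k| q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();

        let tile_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(tile_max);

        // Rescale previous partial results to the new maximum.
        let correction = (running_max - new_max).exp();
        running_sum *= correction;
        acc.iter_mut().for_each(|a| *a *= correction);

        // Accumulate this tile's contribution.
        for (s, v) in scores.iter().zip(v_tile) {
            let w = (s - new_max).exp();
            running_sum += w;
            acc.iter_mut().zip(v).for_each(|(a, x)| *a += w * x);
        }
        running_max = new_max;
    }

    acc.iter().map(|a| a / running_sum).collect()
}

fn main() {
    let q = vec![0.1, 0.2, 0.3, 0.4];
    let keys: Vec<Vec<f32>> = (0..8).map(|i| vec![i as f32 * 0.1; 4]).collect();
    let values: Vec<Vec<f32>> = (0..8).map(|i| vec![i as f32; 4]).collect();
    println!("{:?}", attention_single_query(&q, &keys, &values, 4));
}
```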

Distributed Inference

For big models, a single GPU isn't always enough, so we're going to work on multi-device inference and distributed workloads. For on-device deployment, it might also mean that some parts of the model run on the CPU while others run on the GPU to reduce memory usage.
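
As a rough illustration of the partitioning idea (not Burn-LM's actual distributed implementation), the sketch below shards a linear layer column-wise across two threads standing in for devices: each worker holds half of the weights, computes its slice of the output, and the slices are concatenated.

```rust
use std::thread;

// One output element per weight column: y[c] = sum_i(weight[c][i] * x[i]).
fn matvec(weight_cols: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    weight_cols
        .iter()
        .map(|col| col.iter().zip(x).map(|(w, xi)| w * xi).sum())
        .collect()
}

fn main() {
    let x = vec![1.0, 2.0, 3.0];
    // Full weight matrix stored as columns: 3 inputs -> 4 outputs.
    let columns: Vec<Vec<f32>> = (0..4).map(|c| vec![c as f32; 3]).collect();

    // Shard the columns across two workers; each holds only half the parameters.
    let (left, right) = columns.split_at(2);
    let (left, right) = (left.to_vec(), right.to_vec());
    let (x_left, x_right) = (x.clone(), x.clone());

    let worker_a = thread::spawn(move || matvec(&left, &x_left));
    let worker_b = thread::spawn(move || matvec(&right, &x_right));

    // Concatenate the partial outputs to recover the full result.
    let mut y = worker_a.join().unwrap();
    y.extend(worker_b.join().unwrap());
    println!("{y:?}");
}
```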

Training

Once inference reaches a good level of completeness, we will start working on training. The model training code will leverage burn-train, solidifying our distributed setup. This will likely bring more training features into Burn, benefiting users who train their own models with it.

If you want to help, don't hesitate to contribute. We're a small team working really hard on providing the best ML infrastructure possible that works everywhere with no compromise on flexibility, portability, and performance. 🔥

References

[1] Burn-LM: Burn Large Models Repository
[2] A Comprehensive Survey of Continual Learning: Theory, Method and Application
[3] vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
[4] llama.cpp: LLM Inference in C/C++
[5] State-of-the-Art Multiplatform Matrix Multiplication Kernels
