Overview
This release significantly enhances GPU utilization through a new tensor transaction
mechanism that batches synchronous operations and reads multiple bindings
simultaneously on CubeCL runtimes. It also includes several performance optimizations,
such as mixed-precision support for matrix multiplication and convolution operations,
as well as notable GEMM improvements.
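The transaction mechanism described above can be pictured as collecting several pending reads and resolving them all at a single synchronization point, rather than paying one blocking round-trip per tensor. The sketch below is illustrative only; the `Transaction` type and its methods are hypothetical stand-ins, not Burn's actual API.

```rust
// Illustrative sketch: a "transaction" collects pending tensor reads and
// resolves them in one batched sync. Names are hypothetical, not Burn's API.
struct Transaction {
    // Each registered read is deferred until `execute` is called.
    pending: Vec<Vec<f32>>,
}

impl Transaction {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    // Register a tensor (modeled here as a host-side Vec) for reading.
    fn register(mut self, data: Vec<f32>) -> Self {
        self.pending.push(data);
        self
    }

    // Resolve every registered read at a single synchronization point.
    // A real runtime would issue one batched device-to-host copy here.
    fn execute(self) -> Vec<Vec<f32>> {
        self.pending
    }
}

fn main() {
    let results = Transaction::new()
        .register(vec![1.0, 2.0])
        .register(vec![3.0])
        .execute();
    assert_eq!(results.len(), 2);
    println!("read {} tensors in one sync", results.len());
}
```

The payoff is one synchronization for N tensors instead of N blocking reads.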
Backend capabilities have been expanded with a new remote backend for distributed
computing, improved SPIR-V support, fusion of custom operations, and an experimental
fused matrix multiplication.
Training components have been expanded with support for semantic segmentation and
object detection datasets, new training metrics, and improved training performance
thanks to an asynchronous metric processor.
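The asynchronous metric processor idea can be sketched with a channel and a background thread: the training loop sends metric events and never blocks on aggregation. All names here (`MetricEvent`, `run_processor`) are hypothetical illustrations, not the trainer's actual types.

```rust
use std::sync::mpsc;
use std::thread;

// Metric events sent from the training loop to the background processor.
enum MetricEvent {
    LossUpdate(f64),
    End,
}

// Spawn a worker that aggregates loss updates off the training thread,
// feed it the given losses, and return the mean once processing ends.
fn run_processor(losses: &[f64]) -> f64 {
    let (tx, rx) = mpsc::channel::<MetricEvent>();

    let worker = thread::spawn(move || {
        let (mut sum, mut count) = (0.0, 0u32);
        while let Ok(event) = rx.recv() {
            match event {
                MetricEvent::LossUpdate(loss) => {
                    sum += loss;
                    count += 1;
                }
                MetricEvent::End => break,
            }
        }
        if count == 0 { 0.0 } else { sum / f64::from(count) }
    });

    // The "training loop": fire-and-forget metric updates.
    for &loss in losses {
        tx.send(MetricEvent::LossUpdate(loss)).unwrap();
    }
    tx.send(MetricEvent::End).unwrap();
    worker.join().unwrap()
}

fn main() {
    let mean = run_processor(&[0.9, 0.6, 0.3]);
    println!("mean loss: {mean:.2}"); // prints "mean loss: 0.60"
}
```

The design choice is the usual one for async pipelines: the producer only pays the cost of a channel send, while aggregation happens concurrently.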
As with previous releases, this version includes various bug fixes, further performance
optimizations, new tensor operations, and enhanced documentation.
Module & Tensor
• Add warning in docstring for indices bounds checks
#2462@laggui
• [Breaking] Make .init() method of LR schedulers return Result
#2527@towerpark
• Accept function pointer or closure for freq scaling
#2634@laggui
• Add checks for even padding when kernel size is even
#2677@laggui

Bug Fixes
• Fix unsqueeze dims with multiple trailing negative indices
#2496@laggui
• Fix one_hot implementation for Int Tensors
#2501@maun
• Expose ItemLazy to be able to implement for custom types
#2525@laggui
• Module derive types should inherit visibility
#2610@laggui

Backends
Bug Fixes
• Prevent various OOB accesses and discontiguous buffer bugs
#2467@wingertge
• Fix autodiff memory management by verifying parent nodes' existence
#2488@jnamika

Documentation & Examples
• Add wgpu-spirv and hip-jit features to text-classification example
#2422@syl20bnr

Fixes
• Fix module visitor and mapper trait definition in the book
#2609@laggui

ONNX Support
Enhancements
• Add custom NCHW to NHWC kernel for implicit GEMM (optimization)
#2530@wingertge
• Use float intrinsics for deform_conv2d backward, fix into_data for padded tensors
#2681@wingertge

Refactoring
• Refactor quantization tensor data representation
#2479@laggui
• Add `QTensorOps` docs + refactor tests to simplify inputs
#2557@laggui
• Import code from github-device-flow crate for burnbench
#2667@syl20bnr
• Fix web examples and conflicting feature flags w/ `default-features = false`
#2691@laggui

Miscellaneous
• Update deny.toml to follow the spec changes of cargo-deny
#2408@tiruka
• Add test for int one_hot and update ops docs in the book
#2519@tsanona
• Extend ImageFolderDataset to support import of COCO detection
#2612@jin-eld
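As a closing note on the int `one_hot` work listed above, one-hot encoding of integer indices can be sketched as below. This is a standalone illustration of the operation's semantics, not Burn's implementation.

```rust
// Sketch of one-hot encoding for integer indices: each index i maps to a
// row of zeros with a 1 at position i (illustrative, not Burn's code).
fn one_hot(indices: &[usize], num_classes: usize) -> Vec<Vec<i64>> {
    indices
        .iter()
        .map(|&i| {
            // Indices must lie inside [0, num_classes).
            assert!(i < num_classes, "index {i} out of range");
            let mut row = vec![0i64; num_classes];
            row[i] = 1;
            row
        })
        .collect()
}

fn main() {
    let encoded = one_hot(&[0, 2, 1], 3);
    assert_eq!(encoded[1], vec![0, 0, 1]);
    println!("{encoded:?}");
}
```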