# Math Optimizations

## Fast Math Options
Floating point operations must obey many restrictions to follow the IEEE 754 specification,
especially around special values (Inf/NaN) and signed zero, which are rarely needed in practice.
CubeCL allows marking functions with loosened restrictions to accelerate math operations, at the
cost of some special-value handling or precision.
The effect is backend-dependent, but the API is unified: a set of `FastMath` flags that specify
which optimizations are acceptable. The flags are applied per-function, so they can be limited to
performance-critical sections of the code.
Example:
```rust
/// Only the inverse square root has reduced precision/no special handling.
/// Everything else is full precision.
#[cube(launch_unchecked)]
fn run_on_array<F: Float>(input: &Array<F>, alpha: F, epsilon: F, output: &mut Array<F>) {
    if ABSOLUTE_POS < input.len() {
        output[ABSOLUTE_POS] = alpha * fast_rsqrt::<F>(input[ABSOLUTE_POS]) + epsilon;
    }
}

#[cube(fast_math = FastMath::all())]
fn fast_rsqrt<F: Float>(x: F) -> F {
    F::inverse_sqrt(x)
}
```
## Backend Implementation

### WGPU with Vulkan Compiler
Vulkan supports each flag as a modifier on individual floating-point operations. The compiler
applies all enabled flags, but how they affect the generated code is driver-specific.
### CUDA/HIP
These targets only expose specific intrinsics, which are used when all of their required flags are
present. The intrinsics only exist for `f32`; other float types are not affected by math flags on
CUDA/HIP. Note that some of the flag requirements below are guesswork, because CUDA lacks
documentation on special value handling.
| CubeCL Function | Intrinsic | Required Flags |
| --- | --- | --- |
| `a / b` | `__fdividef(a, b)` | `AllowReciprocal \| ReducedPrecision \| UnsignedZero \| NotInf` |
| `exp(a)` | `__expf(a)` | `ReducedPrecision \| NotNaN \| NotInf` |
| `log(a)` | `__logf(a)` | `ReducedPrecision \| NotNaN \| NotInf` |
| `sin(a)` | `__sinf(a)` | `ReducedPrecision \| NotNaN \| NotInf` |
| `cos(a)` | `__cosf(a)` | `ReducedPrecision \| NotNaN \| NotInf` |
| `tanh(a)` | `__tanhf(a)` | `ReducedPrecision \| NotNaN \| NotInf` |
| `powf(a, b)` | `__powf(a, b)` | `ReducedPrecision \| NotNaN \| NotInf` |
| `sqrt(a)` | `__fsqrt_rn(a)` | `ReducedPrecision \| NotNaN \| NotInf` |
| `inverse_sqrt(a)` | `__frsqrt_rn(a)` | `ReducedPrecision \| NotNaN \| NotInf` |
| `recip(a)` | `__frcp_rn(a)` | `AllowReciprocal \| ReducedPrecision \| UnsignedZero \| NotInf` |
| `normalize(a)` | n/a (`__frsqrt_rn`) | `ReducedPrecision \| NotNaN \| NotInf` |
| `magnitude(a)` | n/a (`__fsqrt_rn`) | `ReducedPrecision \| NotNaN \| NotInf` |
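For instance, to let the CUDA compiler lower `exp` to `__expf` while leaving all other operations
at full precision, only the flags from that row need to be enabled. The sketch below is
illustrative: `fast_exp` is a hypothetical helper, and it assumes the `FastMath` flags compose with
bitwise OR, as the table notation suggests.

```rust
/// Sketch: enable only the flags required for `__expf` on CUDA (assumes
/// `FastMath` flags compose with `|`). On other backends these flags simply
/// loosen the corresponding rules for this function.
#[cube(fast_math = FastMath::ReducedPrecision | FastMath::NotNaN | FastMath::NotInf)]
fn fast_exp<F: Float>(x: F) -> F {
    F::exp(x)
}
```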
### Other Backends
Other backends currently don't support any of these optimizations.
## FastDivmod
A very common operation, especially on GPUs, is integer division and modulo with a divisor that is
uniform but not compile-time constant (e.g. a matrix width). For example:
```rust
#[cube(launch)]
pub fn some_2d_kernel<F: Float>(output: &mut Array<F>, width: u32) {
    let y = ABSOLUTE_POS / width;
    let x = ABSOLUTE_POS % width;
    //...
}

// ...
some_2d_kernel::launch::<F, R>(
    &client,
    // ...,
    matrix.width as u32,
);
```
However, integer division is quite slow, so this can have a noticeable impact on runtime. To
mitigate the cost, you can use `FastDivmod` to precompute the division factors using 64-bit
[Barrett reduction](https://en.wikipedia.org/wiki/Barrett_reduction), and pass those to the kernel
instead of the raw divisor.
This is faster even if you only use division or modulo, and much faster if you use both.
Example:
```rust
#[cube(launch)]
pub fn some_2d_kernel<F: Float>(output: &mut Array<F>, width: FastDivmod) {
    let (y, x) = width.div_mod(ABSOLUTE_POS);
    //...
}

some_2d_kernel::launch::<F, R>(
    &client,
    // ...,
    FastDivmodArgs::new(&client, matrix.width as u32),
);
```
### Backend Support
This is implemented using efficient extended multiplication on CUDA (`__umulhi`) and Vulkan
(`OpUMulExtended`), and using manual casts and shifts on targets that support `u64`. Targets with
neither (possibly WebGPU) fall back to normal division.
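For intuition, here is a minimal CPU-side sketch of the underlying trick, not CubeCL's actual
implementation: the `Divmod` type, its field layout, and the `divisor > 1` restriction are
illustrative assumptions.

```rust
/// Illustrative sketch of Barrett-style division: precompute a magic
/// multiplier so that dividing by a runtime divisor becomes an extended
/// multiplication and a shift. Hypothetical type, not CubeCL's `FastDivmod`.
struct Divmod {
    divisor: u32,
    /// ceil(2^64 / divisor); the computation below wraps for divisor == 1,
    /// which a real implementation must special-case.
    multiplier: u64,
}

impl Divmod {
    fn new(divisor: u32) -> Self {
        assert!(divisor > 1, "divisor == 1 would need special-casing");
        let multiplier = (u64::MAX / divisor as u64) + 1;
        Self { divisor, multiplier }
    }

    fn div_mod(&self, n: u32) -> (u32, u32) {
        // The quotient is the high 64 bits of the 128-bit product; on GPUs
        // this maps to an extended-multiplication instruction.
        let quotient = ((self.multiplier as u128 * n as u128) >> 64) as u32;
        let remainder = n - quotient * self.divisor;
        (quotient, remainder)
    }
}

fn main() {
    let width = Divmod::new(37);
    assert_eq!(width.div_mod(1000), (1000 / 37, 1000 % 37));
}
```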