Aspen's blog

Aspen, I write this blog!

Optimization - Making Rust Code Go Brrrr

Rust code can be fast. Very fast, in fact. If you look at the Benchmarks Game, it goes head-to-head with C and C++.

But performance isn't effortless, although Rust's LLVM backend makes it seem so. I'm going to go over the ways I improve performance in my Rust projects.

Rayon isn't a magic bullet

It's really not. Many people think just slapping par_iter on the smallest operation will magically fix their performance. It won't. With that mindset, synchronization overhead will eat you alive.

Rayon has more than just par_iter. For example, par_chunks is very useful: you can split your task into parallel chunks, with each thread processing a portion of the dataset at a time. This greatly reduces synchronization overhead, especially when you have a large number of small tasks. However, par_iter may still be the better choice for large tasks that take a while per iteration.

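// `par_chunks` comes from rayon's `ParallelSlice` trait (`use rayon::prelude::*;`),
// so `data` must be a slice, or something that derefs to one, like a Vec.
data.par_chunks(4096).for_each(|chunk| {
	for item in chunk {
		item.do_small_thing();
	}
});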
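To see why chunking helps without pulling in Rayon at all, here's a rough std-only sketch of the same idea using std::thread::scope. It is not how Rayon works internally (Rayon uses a work-stealing thread pool, while this just spawns one thread per chunk), and parallel_sum_of_squares is a made-up name for illustration:

```rust
use std::thread;

/// Hypothetical example: sums the squares of `data`, handing each worker
/// thread one large chunk instead of one element at a time.
fn parallel_sum_of_squares(data: &[u64], chunk_size: usize) -> u64 {
    // `chunks` panics on a chunk size of 0, so clamp it to at least 1.
    let chunk_size = chunk_size.max(1);
    thread::scope(|scope| {
        // Spawn one scoped thread per chunk; each thread handles many small items.
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().map(|x| x * x).sum::<u64>()))
            .collect();
        // Joining happens once per chunk, not once per item.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}
```

Something like parallel_sum_of_squares(&data, 4096) gives each thread thousands of items between synchronization points. Shrink the chunk size to 1 and you pay for a thread spawn and a join per element, which is the extreme version of the tiny-task problem.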

Buffering matters!

This is simple. I/O involves syscalls. Syscalls are bad for performance. Therefore, you want to minimize syscalls and optimize I/O.

You should always wrap I/O (whether it be a File, TcpStream, et cetera) in a BufReader or BufWriter. These quite simply buffer I/O operations, preferring to read or write things in a single large batch rather than many small ones. This reduces your total syscalls and increases overall performance.

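Remember!!: If you use a BufWriter, make sure to call flush before it's dropped (and sync_all on the underlying File if you need the data to actually reach the disk)! Dropping a BufWriter does flush it, but any error that happens at that point is silently ignored, so flushing explicitly is the only way to handle it.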

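use std::fs::File;
use std::io::{BufWriter, Write};

// `buffer` is any `Read` source, e.g. a `&[u8]` or another file.
let fd = File::create("example.bin").expect("Failed to create file!");
let mut writer = BufWriter::new(fd);
std::io::copy(&mut buffer, &mut writer).expect("Failed to copy buffer!");
writer.flush().expect("Failed to write file!");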
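If you want to see the batching for yourself, here's a small sketch that counts how many times write gets called on the underlying sink, with and without a BufWriter. CountingSink and both helper functions are hypothetical names I made up for this; each write call on the sink stands in for one syscall on a real File or TcpStream:

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical sink that counts how often `write` is called on it.
struct CountingSink {
    writes: usize,
    bytes: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.writes += 1;
        self.bytes += buf.len();
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Writes `count` 4-byte records straight to the sink: one `write` per record.
fn write_unbuffered(count: u32) -> io::Result<CountingSink> {
    let mut sink = CountingSink { writes: 0, bytes: 0 };
    for i in 0..count {
        sink.write_all(&i.to_le_bytes())?;
    }
    Ok(sink)
}

/// Writes the same records through a `BufWriter` with an 8 KiB buffer.
fn write_buffered(count: u32) -> io::Result<CountingSink> {
    let mut writer = BufWriter::with_capacity(8192, CountingSink { writes: 0, bytes: 0 });
    for i in 0..count {
        writer.write_all(&i.to_le_bytes())?;
    }
    // Flush explicitly so an error surfaces here instead of being ignored on drop.
    writer.flush()?;
    // `into_inner` hands back the wrapped sink so we can inspect the counters.
    Ok(writer.into_inner()?)
}
```

With 1000 records (4000 bytes, which fits in the 8 KiB buffer), the unbuffered version calls write 1000 times, while the buffered version calls it once, when flush runs.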

std isn't always the best.

The Rust standard library is great. I mean, it really is. But it doesn't always offer the best options. Some crates provide near-identical interfaces at greatly increased performance.

Allocating the path to hell

Many Rust developers take types such as String and Vec for granted, without understanding the downsides. These are dynamically allocated types. Allocations are not your friend when you're optimizing for performance.
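As a rough illustration, here's a sketch of the most common fix: allocate once, then reuse the buffer. Both function names are made up for this example, and it assumes the label is only needed inside the loop:

```rust
use std::fmt::Write;

/// Builds a fresh `String` for every id: at least one heap allocation per iteration.
fn labels_allocating(ids: &[u32]) -> usize {
    let mut total_len = 0;
    for id in ids {
        let label = format!("item-{}", id);
        total_len += label.len();
    }
    total_len
}

/// Reuses a single `String` for every id: the buffer is allocated once up front.
fn labels_reusing(ids: &[u32]) -> usize {
    let mut total_len = 0;
    let mut label = String::with_capacity(16);
    for id in ids {
        // `clear` resets the length but keeps the allocated capacity.
        label.clear();
        write!(label, "item-{}", id).expect("writing to a String cannot fail");
        total_len += label.len();
    }
    total_len
}
```

Both versions do the same work, but the second only touches the allocator once (as long as no label outgrows the initial capacity). The same trick works for Vec: Vec::with_capacity up front, clear between iterations.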

In addition, look into alternative allocators which may yield better performance for your project, such as jemallocator or mimalloc.

Advanced Magic Extensions

Modern processors have tons of extremely useful extensions, such as AVX and SSE. Even on non-x86 platforms, extensions with similar functionality are available, such as NEON on ARM, and the proposed P and V extensions for RISC-V.

While Rust allows you to directly interface with these extensions, and there are many packages for higher-level interfacing, such as packed_simd and generic-simd, the LLVM optimizer is capable of automatically optimizing code to use these extensions.

You may need to pass -C target-cpu=native or -C target-feature=+avx through RUSTFLAGS in order to take advantage of this (see rustc --print target-features for the features available for your target, and use something like lscpu to see what your CPU supports).
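If you want to double-check what you actually got, here's a small sketch that compares what the binary was compiled with against what the CPU reports at runtime. The function names are mine; cfg!(target_feature = ...) and is_x86_feature_detected! come from std, and on non-x86 targets the runtime check here simply returns false:

```rust
/// True if the compiler was allowed to assume AVX for this build,
/// e.g. via `-C target-cpu=native` or `-C target-feature=+avx`.
fn avx_enabled_at_compile_time() -> bool {
    cfg!(target_feature = "avx")
}

/// True if the CPU we are running on reports AVX support (x86/x86_64 only).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx_detected_at_runtime() -> bool {
    is_x86_feature_detected!("avx")
}

/// Fallback for other architectures, where AVX does not exist.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx_detected_at_runtime() -> bool {
    false
}
```

If the first function returns false on a machine where the second returns true, your RUSTFLAGS probably aren't being picked up.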

See this function. It converts four f32s into four u8s.

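/// Converts an [f32] into a [u8], truncating toward zero and clamping values above [u8::MAX].
///
/// # Safety
/// `f` must not be NaN and must be greater than -1.0; otherwise the unchecked conversion is undefined behavior.
#[inline]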
pub unsafe fn f32_to_u8(f: f32) -> u8 {
	if f > f32::from(u8::MAX) {
		u8::MAX
	} else {
		f32::to_int_unchecked(f)
	}
}

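/// Converts an array of 4 [f32]s into a tuple of 4 [u8]s, truncating each value in the process.
/// Every element must be non-NaN and greater than -1.0 (see the safety note on f32_to_u8).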
#[must_use]
pub fn f32s4_to_u8(f: [f32; 4]) -> (u8, u8, u8, u8) {
	let f = &f[..4];
	unsafe {
		(
			f32_to_u8(f[0]),
			f32_to_u8(f[1]),
			f32_to_u8(f[2]),
			f32_to_u8(f[3]),
		)
	}
}

Now, we can throw this code into Compiler Explorer to see what assembly it generates. Don't forget the compiler flags!

example::f32s4_to_u8:
        vmovss  xmm0, dword ptr [rip + .LCPI0_0]
        vminss  xmm1, xmm0, dword ptr [rdi]
        vcvttss2si      eax, xmm1
        vminss  xmm0, xmm0, dword ptr [rdi + 4]
        vcvttss2si      ecx, xmm0
        vmovsd  xmm0, qword ptr [rdi + 8]
        vbroadcastss    xmm1, dword ptr [rip + .LCPI0_0]
        vcmpleps        xmm2, xmm1, xmm0
        vblendvps       xmm0, xmm0, xmm1, xmm2
        vcvttps2dq      xmm0, xmm0
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
        vpsllvd xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
        movzx   ecx, cl
        shl     ecx, 8
        movzx   eax, al
        or      eax, ecx
        vmovd   ecx, xmm0
        or      ecx, eax
        vpextrd eax, xmm0, 1
        or      eax, ecx
        ret

Success! It generates AVX instructions, such as VBROADCASTSS and VMOVSS!
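One caveat with f32s4_to_u8: to_int_unchecked is undefined behavior for NaN or negative inputs. If you can't guarantee those never show up, a plain as cast is a safe fallback, since float-to-integer as casts saturate (since Rust 1.45). This is just a sketch with a made-up name, not the version the assembly above was generated from, and it may not compile down to the same instructions:

```rust
/// Safe variant: `as` casts from float to integer truncate toward zero,
/// saturate at the integer's bounds, and map NaN to 0, so no `unsafe` is needed.
#[must_use]
fn f32s4_to_u8_saturating(f: [f32; 4]) -> (u8, u8, u8, u8) {
    (f[0] as u8, f[1] as u8, f[2] as u8, f[3] as u8)
}
```

Throw it into Compiler Explorer with the same flags and compare before assuming it's free.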

Making the compiler brrrr harder

It is entirely possible to configure the compiler to optimize more aggressively! For example, in Cargo.toml (Do note this will increase compile times!!):

[profile.release]
lto = 'thin'
panic = 'abort'
codegen-units = 1

[profile.bench]
lto = 'thin'
codegen-units = 1

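Each option explained:

lto = 'thin': enables "thin" link-time optimization, which lets LLVM optimize across crate boundaries at a lower compile-time cost than full ("fat") LTO.

panic = 'abort': aborts the process on panic instead of unwinding the stack, which removes the unwinding machinery from the binary. Cargo ignores this setting for benchmarks, which is why it's only set for the release profile.

codegen-units = 1: compiles each crate as a single unit instead of splitting it up for parallel compilation, giving the optimizer more to work with.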
