# Optimization - Making Rust Code Go Brrrr

Rust code can be fast. Very fast, in fact. If you look at the Benchmarks Game, it goes head-to-head with C and C++.

But that performance isn't effortless, even though Rust's LLVM backend can make it seem so. I'm going to go over the ways I improve performance in my Rust projects.

## Rayon isn't a magic bullet

It's really not. Many people think that just slapping `par_iter` on the innermost operation will magically fix their performance. It won't. With that mindset, synchronization overhead will eat you alive.

Rayon has more than just `par_iter`. For example, `par_chunks` is very useful: it splits your data into chunks that are processed in parallel, with each thread handling a contiguous portion of the dataset at a time. This greatly reduces synchronization overhead, especially when you have a large number of small tasks. For large tasks that take a while per iteration, plain `par_iter` may still be the better choice.


```rust
// `par_chunks` comes from rayon's `ParallelSlice` trait, so it's called
// on a slice rather than on an iterator.
data.par_chunks(4096).for_each(|chunk| {
    for item in chunk {
        item.do_small_thing();
    }
});
```



## Buffering matters!

This one is simple. I/O involves syscalls. Syscalls are bad for performance. Therefore, you want to minimize syscalls when optimizing I/O.

You should always wrap I/O handles (whether a File, TcpStream, et cetera) in a `BufReader` or `BufWriter`. These quite simply buffer I/O operations, preferring a single large read or write over many small ones. This reduces your total syscall count and increases overall performance.

Remember: if you use a `BufWriter`, make sure to call `flush` (and, for files, `sync_all`) before it's dropped! Dropping the writer flushes implicitly, but any errors are silently discarded; flushing explicitly lets you handle them.


```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};

// `buffer` is assumed to be any reader (anything implementing `Read`).
let file = File::create("example.bin").expect("Failed to create file!");
let mut writer = BufWriter::new(file);
io::copy(&mut buffer, &mut writer).expect("Failed to copy buffer!");
writer.flush().expect("Failed to flush writer!");
```



## std isn't always the best.

The Rust standard library is great. I mean, it really is. But it doesn't always offer the fastest option. Some crates provide near-identical interfaces with greatly improved performance.
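For example, `HashMap`'s default SipHash hasher is DoS-resistant but relatively slow; crates like `fxhash` and `ahash` swap in faster hashers behind the same interface. Here's a std-only sketch of the mechanism they use (the toy FNV-1a hasher is for illustration, not production use):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A toy FNV-1a hasher. Faster than SipHash for tiny keys, but not
// DoS-resistant — real projects should reach for fxhash/ahash instead.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }
}

fn main() {
    // Same HashMap API, different hasher — only the type changes.
    let mut map: HashMap<&str, u32, BuildHasherDefault<Fnv1a>> = HashMap::default();
    map.insert("answer", 42);
    assert_eq!(map.get("answer"), Some(&42));
}
```

Because the hasher is just a type parameter, swapping it (or swapping back to the default) doesn't change any calling code.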

## Allocating the path to hell

Many Rust developers take types such as `String` and `Vec` for granted without understanding the downsides: these are dynamically allocated types, and allocations are not your friend when you're optimizing for performance. Prefer borrowed types (`&str`, `&[T]`) where possible, and pre-size or reuse buffers where allocation is unavoidable.
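As a sketch, pre-sizing a `String` turns many small reallocations into a single allocation (the `join_lines` helper here is made up for illustration):

```rust
// Build one output String with a single up-front allocation, instead of
// letting push_str repeatedly grow (and reallocate) the buffer.
fn join_lines(lines: &[&str]) -> String {
    // One byte per character plus one newline per line.
    let total: usize = lines.iter().map(|l| l.len() + 1).sum();
    let mut out = String::with_capacity(total);
    for line in lines {
        out.push_str(line);
        out.push('\n');
    }
    out
}

fn main() {
    let s = join_lines(&["a", "bb"]);
    assert_eq!(s, "a\nbb\n");
    // The buffer was sized once and never had to grow.
    assert!(s.capacity() >= s.len());
}
```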

In addition, look into alternative allocators which may yield better performance for your project, such as `jemallocator` or `mimalloc`.
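Swapping the global allocator is a one-liner plus a dependency. A sketch assuming the `mimalloc` crate has been added to Cargo.toml:

```rust
use mimalloc::MiMalloc;

// Route every heap allocation in this binary through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```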

## Vectorization

Modern processors have tons of extremely useful extensions, such as AVX and SSE. Even on non-x86 platforms, extensions with similar functionality are available, such as NEON on ARM and the proposed P and V extensions for RISC-V.

While Rust allows you to directly interface with these extensions, and there are crates for higher-level interfacing, such as `packed_simd` and `generic-simd`, the LLVM optimizer is also capable of automatically vectorizing code to use these extensions.

You may need to pass `-C target-cpu=native` or `-C target-feature=+avx` through `RUSTFLAGS` in order to take advantage of this (see `rustc --print target-features` for the features available on your target, and use something like `lscpu` to see what your CPU supports).
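For example, the flags might be passed like this (which flags you actually want depends on your target):

```shell
# Use every extension the build machine supports
# (the resulting binary may not run on older CPUs):
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Or enable specific features, such as AVX:
RUSTFLAGS="-C target-feature=+avx" cargo build --release
```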

- Doing things in groups of 4/8 is good for vectorization.
- Do note, branching will heavily reduce the chances of vectorization.

See the code below: `f32s4_to_u8` converts four `f32`s into four `u8`s, via the helper `f32_to_u8`, which converts one.


```rust
/// Converts an `f32` to a `u8`, clamping at `u8::MAX`.
///
/// # Safety
/// `f` must be finite and non-negative; otherwise
/// `to_int_unchecked` is undefined behavior.
#[inline]
pub unsafe fn f32_to_u8(f: f32) -> u8 {
    if f > f32::from(u8::MAX) {
        u8::MAX
    } else {
        f32::to_int_unchecked(f)
    }
}

/// Converts a slice of 4 [f32]s into a tuple of 4 [u8]s, truncating them in the process
#[must_use]
pub fn f32s4_to_u8(f: [f32; 4]) -> (u8, u8, u8, u8) {
    let f = &f[..4];
    unsafe {
        (
            f32_to_u8(f[0]),
            f32_to_u8(f[1]),
            f32_to_u8(f[2]),
            f32_to_u8(f[3]),
        )
    }
}
```



Now, we can throw this code into Compiler Explorer to see what assembly it generates. Don't forget the compiler flags!


```asm
example::f32s4_to_u8:
        vmovss  xmm0, dword ptr [rip + .LCPI0_0]
        vminss  xmm1, xmm0, dword ptr [rdi]
        vcvttss2si      eax, xmm1
        vminss  xmm0, xmm0, dword ptr [rdi + 4]
        vcvttss2si      ecx, xmm0
        vmovsd  xmm0, qword ptr [rdi + 8]
        vbroadcastss    xmm1, dword ptr [rip + .LCPI0_0]
        vcmpleps        xmm2, xmm1, xmm0
        vblendvps       xmm0, xmm0, xmm1, xmm2
        vcvttps2dq      xmm0, xmm0
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
        vpsllvd xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
        movzx   ecx, cl
        shl     ecx, 8
        movzx   eax, al
        or      eax, ecx
        vmovd   ecx, xmm0
        or      ecx, eax
        vpextrd eax, xmm0, 1
        or      eax, ecx
        ret
```



Success! It generates AVX instructions, such as VBROADCASTSS and VMOVSS!

## Making the compiler brrrr harder

It is entirely possible to configure the compiler to optimize more aggressively! For example, in Cargo.toml (do note this will increase compile times!):


```toml
[profile.release]
lto = 'thin'
panic = 'abort'
codegen-units = 1

[profile.bench]
lto = 'thin'
codegen-units = 1
```



Each option explained:

- `lto = 'thin'` - Quite simply enables Thin LTO. You can also try `lto = 'fat'`; performance gains should be similar.
- `panic = 'abort'` - Abort instead of unwinding on panic. You'll get a smaller, more performant binary, but you won't be able to catch panics anymore. See the Rust Guide for more info.
- `codegen-units = 1` - Ensures that the crate is compiled with only one code generation unit. This reduces the parallelization of the compilation, but allows LLVM to optimize it much better.

## Edits

- 9/30/2020, 3:40 PM EST - Re-phrased the Copy/Clone section (thanks /u/SkiFire13), mentioned sync_all in the buffering section (thanks /u/Freeky), and also mentioned lto = 'fat' (thanks /u/po8)