I'm surprised how much bloat Rust adds and how little it affects the speed. Is ...

snuxoll · on Sept 11, 2019

We can see in this specific case there was better cache locality and more data was served from the L1 and L2 cache with a drop in L3 cache misses (no hits, because it didn’t have to look in L3 for anything).

6 cycles for bounds checks that the branch predictor never had to rewind on is nothing in comparison to saving a couple trips to L3.

topspin · on Sept 12, 2019

> We can see in this specific case there was better cache locality

Dramatically better L1 and L2 cache behavior. It seems clear that the additional instruction load of the Rust driver is partially made up for by the excellent cache utilization.

This "Rust vs C" document is just one part of a larger analysis of network driver implementations in many languages: C, Rust, Go, C#, Java, OCaml, Haskell, Swift, JavaScript and Python. Have a look at the top-level README.md of that GitHub repo.

safercplusplus · on Sept 12, 2019

Presumably a C++ (or SaferCPlusPlus[1] ;) implementation would see a similar cache performance advantage versus C?

Also, isn't it unintuitive that branch mispredictions go up with larger batch sizes? Wouldn't there be fewer branches per unit time?

[1] https://github.com/duneroadrunner/SaferCPlusPlus/blob/master...

eximius · on Sept 11, 2019

Unless/until a lot more is written in Rust... not much. It uses slightly more base RAM to load the binary. Some of the bloat is things that in C programs would be dynamically linked in. It isn't that Rust is doing more, it's that C gets to share a lot of stuff and Rust has to bring its own.

masklinn · on Sept 12, 2019

> It uses slightly more base RAM to load the binary.

It's mostly vmem until / unless the data actually gets used though, no?

eximius · on Sept 12, 2019

I don't know. I wouldn't be surprised if it loaded the whole thing. How could the OS predict how much to load (or wait on)? Waiting for a page to load just for the next function call would be hugely expensive.

masklinn · on Sept 13, 2019

> How could the OS predict how much to load (or wait on)?

The same way it does for every other bit of allocated memory: it allocates the physical page on a page fault in a valid mapping.
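
A rough way to watch this happen, as a sketch under several assumptions: Linux (it reads `/proc/self/statm`), 4 KiB pages, and an allocator that hands back large zeroed allocations as untouched mappings. The helper names `resident_pages` and `touch_pages` are made up for illustration. It shows anonymous memory rather than the binary's own file-backed pages, but the fault-on-first-touch mechanism is the same idea.

```rust
use std::fs;
use std::hint::black_box;

/// Resident set size in pages, read from /proc/self/statm (Linux only).
/// Returns None on other platforms or if the file cannot be parsed.
fn resident_pages() -> Option<u64> {
    let statm = fs::read_to_string("/proc/self/statm").ok()?;
    // Second whitespace-separated field is the resident page count.
    statm.split_whitespace().nth(1)?.parse().ok()
}

/// Writes one byte per assumed 4 KiB page so that each page is faulted in.
/// Returns the number of pages touched.
fn touch_pages(buf: &mut [u8]) -> usize {
    const PAGE: usize = 4096; // assumption: 4 KiB pages
    let mut touched = 0;
    for chunk in buf.chunks_mut(PAGE) {
        chunk[0] = 1;
        touched += 1;
    }
    touched
}

fn main() {
    let len = 64 * 1024 * 1024;
    let before = resident_pages();
    // Large zeroed allocation; typically mapped but not yet backed by RAM.
    let mut buf = vec![0u8; len];
    let mapped = resident_pages();
    let touched = touch_pages(&mut buf);
    let after = resident_pages();
    // Keep the buffer observable so the writes are not optimized away.
    black_box(&buf);
    println!("touched {} pages", touched);
    println!("resident pages: before={:?} mapped={:?} after={:?}", before, mapped, after);
}
```

If the assumptions hold, the resident count should stay roughly flat after the allocation and only grow once the pages are written to. The exact numbers depend on the allocator and kernel settings.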

userbinator · on Sept 12, 2019

In general, the effect of bloat is not visible in benchmarks like these where the goal is to run something small many many times, with ample memory available, and as little else on the system adding noise to the results as possible. It's the same reason you see "Java is faster than C" benchmark results, yet everyone knows how the former actually performs in practice.

The effects of larger memory usage don't become obvious until other applications start contending for it and/or swapping happens, and it's conveniently also something that is not as easily blamed on one application "being slow", which is why it doesn't receive nearly as much attention as it should.

hcs · on Sept 11, 2019

It does take extra space, but ideally you'd store the exceptional error handling code out-of-line, so that it doesn't need to take up cache in the common case.
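
In Rust, one way to ask for that layout is to put the failure path in its own function marked `#[cold]` and `#[inline(never)]`. A minimal sketch; `read_ethertype` and `report_short_packet` are hypothetical names, not code from the driver in the article, and the attributes are hints the compiler is free to weigh as it sees fit.

```rust
/// Rare failure path. `#[cold]` hints that calls are unlikely, and
/// `#[inline(never)]` keeps the formatting code out of the caller's body.
#[cold]
#[inline(never)]
fn report_short_packet(len: usize, min: usize) -> String {
    format!("packet too short: {} bytes, need at least {}", len, min)
}

/// Hot path: returns the EtherType field of an Ethernet frame
/// (bytes 12 and 13, big-endian).
fn read_ethertype(packet: &[u8]) -> Result<u16, String> {
    const MIN_LEN: usize = 14; // Ethernet header length
    if packet.len() < MIN_LEN {
        return Err(report_short_packet(packet.len(), MIN_LEN));
    }
    Ok(u16::from_be_bytes([packet[12], packet[13]]))
}

fn main() {
    let mut frame = [0u8; 14];
    frame[12] = 0x08; // 0x0800 is the EtherType for IPv4
    println!("{:?}", read_ethertype(&frame));
    println!("{:?}", read_ethertype(&frame[..4]));
}
```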

emmericp · on Sept 11, 2019

Our results probably only hold true for workloads with a low IPC. The test case is also a very limited forwarder, but real network functions also have a relatively low IPC in my experience (don't have any numbers to back up this claim, though).

maplant · on Sept 11, 2019

LLVM is a really good backend that optimizes away a lot of it. Bounds checks, for example, usually get removed.

saagarjha · on Sept 11, 2019

…when possible, of course, unless you're using one of the unsafe methods.
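
For anyone following along, a sketch of the three access styles in question (function names are made up for illustration). Whether the check in the first version actually disappears depends on the optimizer and the surrounding code, so treat the comments as the usual expectation rather than a guarantee.

```rust
/// Indexing in a loop: each `data[i]` is bounds-checked in principle,
/// but the optimizer can often prove `i < data.len()` and drop the check.
fn sum_indexed(data: &[u32]) -> u64 {
    let mut total = 0u64;
    for i in 0..data.len() {
        total += u64::from(data[i]);
    }
    total
}

/// Iterator form: there is no index, so no per-element bounds check is needed.
fn sum_iter(data: &[u32]) -> u64 {
    data.iter().map(|&x| u64::from(x)).sum()
}

/// Unchecked form: the caller takes responsibility for staying in bounds.
fn sum_unchecked(data: &[u32]) -> u64 {
    let mut total = 0u64;
    for i in 0..data.len() {
        // SAFETY: the loop range guarantees `i < data.len()`.
        total += u64::from(unsafe { *data.get_unchecked(i) });
    }
    total
}

fn main() {
    let data = [1u32, 2, 3, 4];
    println!("{} {} {}", sum_indexed(&data), sum_iter(&data), sum_unchecked(&data));
}
```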

maplant · on Sept 11, 2019

Unsafe is purely a Rust directive. It doesn’t affect LLVM IR output, AFAIK.

wyldfire · on Sept 11, 2019

What bloat are you referring to specifically?

im3w1l · on Sept 12, 2019

I'm referring to the table in the article.

grenoire · on Sept 11, 2019

I think GP forgot to build with the release flag instead of debug.

muricula · on Sept 11, 2019

If they built with debug there would be dramatically more load/store uops than you see in the benchmark. Debug mode builds disable optimizations and store variables back to memory after most expressions to aid debugging.
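
A quick sanity check for which kind of build you are running, sketched with a made-up helper `build_profile`. It keys off `debug_assertions`, which Cargo enables in the default dev profile and disables in the default release profile. That flag is only a proxy for the optimization level, since a custom profile can set the two independently.

```rust
/// Reports the build flavor based on the `debug_assertions` cfg flag.
/// Assumes default Cargo profiles, where the flag tracks dev vs. release.
fn build_profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

fn main() {
    println!("built as: {}", build_profile());
}
```

Running it once with `cargo run` and once with `cargo run --release` should print the two different values under default profiles.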