We can see in this specific case there was better cache locality: more data was served from the L1 and L2 caches, and L3 cache misses dropped (no hits either, because it didn’t have to look in L3 for anything).
6 cycles for bounds checks that the branch predictor never had to rewind on is nothing in comparison to saving a couple of trips to L3.
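To make the bounds-check cost concrete, here is a minimal sketch (not taken from the driver; the function names are made up for illustration). Indexing a slice in a loop carries a check per access unless the compiler can prove the index is in range, while slicing once up front leaves a single check before the loop:

```rust
/// Sums the first `n` bytes by indexing. Each `buf[i]` is bounds-checked
/// unless the optimizer can prove `i < buf.len()`.
fn sum_indexed(buf: &[u8], n: usize) -> u64 {
    let mut total = 0u64;
    for i in 0..n {
        total += buf[i] as u64; // panics if i >= buf.len()
    }
    total
}

/// Same sum, but `buf[..n]` is checked once and the iterator
/// then walks the sub-slice without per-element checks.
fn sum_sliced(buf: &[u8], n: usize) -> u64 {
    buf[..n].iter().map(|&b| b as u64).sum()
}

fn main() {
    let buf: Vec<u8> = (0..64).collect();
    println!("{} {}", sum_indexed(&buf, 16), sum_sliced(&buf, 16));
}
```

In both versions the check is a compare and a branch that is essentially always predicted correctly, which is why it costs so little next to a cache miss.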
> We can see in this specific case there was better cache locality
Dramatically better L1 and L2 cache behavior. It seems clear that the additional instruction load of the Rust driver is partially made up for by the excellent cache utilization.
This "Rust vs C" document is just one part of a larger analysis of network driver implementations in many languages: C, Rust, Go, C#, Java, OCaml, Haskell, Swift, JavaScript and Python. Have a look at the top-level README.md of that GitHub repo.
Unless/until a lot more is written in Rust... not much. It uses slightly more base RAM to load the binary. Some of the bloat is things that in C programs would be dynamically linked in - it isn't that Rust is doing more, it's that C gets to share a lot of stuff and Rust has to bring its own.
I don't know. I wouldn't be surprised if it loaded the whole thing. How could the OS predict how much to load (or wait on)? Waiting for a page to load just for the next function call would be hugely expensive.
In general, the effect of bloat is not visible in benchmarks like these, where the goal is to run something small many, many times, with ample memory available and as little else as possible on the system adding noise to the results. It's the same reason you see "Java is faster than C" benchmark results, yet everyone knows how the former actually performs in practice.
The effects of larger memory usage don't become obvious until other applications start contending for it and/or swapping happens, and it's conveniently also something that is not as easily blamed on one application "being slow", which is why it doesn't receive nearly as much attention as it should.
It does take extra space, but ideally you'd store the exceptional error-handling code out-of-line, so that it doesn't need to take up cache in the common case.
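As a rough sketch of what out-of-line error handling can look like in Rust: the `#[cold]` and `#[inline(never)]` attributes hint that a function is rarely called and should not be merged into its caller. Everything else here (`PacketError`, `check_len`, the frame-size constants) is hypothetical and only for illustration, and the attributes are hints, so the final code layout is still up to the compiler.

```rust
// Hypothetical frame-size limits, for illustration only.
const MIN_FRAME: usize = 60;
const MAX_FRAME: usize = 1518;

#[derive(Debug)]
struct PacketError {
    len: usize,
}

/// Error construction lives in its own function. `#[cold]` marks it as
/// unlikely to be called; `#[inline(never)]` keeps it out of the caller's body.
#[cold]
#[inline(never)]
fn reject(len: usize) -> PacketError {
    PacketError { len }
}

/// Hot path: a range check and a return. The rare failure case
/// is a call into the cold function above.
#[inline]
fn check_len(len: usize) -> Result<usize, PacketError> {
    if len < MIN_FRAME || len > MAX_FRAME {
        return Err(reject(len));
    }
    Ok(len)
}

fn main() {
    for len in [64, 9000] {
        match check_len(len) {
            Ok(n) => println!("accepted frame of {} bytes", n),
            Err(e) => println!("rejected frame of {} bytes", e.len),
        }
    }
}
```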
Our results probably only hold true for workloads with a low IPC. The test case is also a very limited forwarder, but real network functions also have a relatively low IPC in my experience (don't have any numbers to back up this claim, though).
If they had built in debug mode, there would be dramatically more load/store uops than you see in the benchmark. Debug builds disable optimizations and store variables back to memory after most expressions to aid debugging.
Is there any other downside? Electricity consumption / heat? Evicting other stuff from cache?