
I'm surprised how much bloat Rust adds and how little it affects the speed.

Is there any other downside? Electricity consumption / heat? Evicting other stuff from cache?



We can see in this specific case there was better cache locality and more data was served from the L1 and L2 cache with a drop in L3 cache misses (no hits, because it didn’t have to look in L3 for anything).

6 cycles for bounds checks that the branch predictor never had to rewind on is nothing in comparison to saving a couple trips to L3.


> We can see in this specific case there was better cache locality

Dramatically better L1 and L2 cache behavior. It seems clear that the additional instruction load of the Rust driver is partially made up by the excellent cache utilization.

This "Rust vs C" document is just one part of a larger analysis of network driver implementations in many languages: C, Rust, Go, C#, Java, OCaml, Haskell, Swift, JavaScript and Python. Have a look at the top-level README.md of that GitHub repo.


Presumably a C++ (or SaferCPlusPlus[1] ;) implementation would see a similar cache performance advantage versus C?

Also, isn't it unintuitive that branch mispredictions go up with larger batch sizes? Wouldn't there be fewer branches per unit time?

[1] https://github.com/duneroadrunner/SaferCPlusPlus/blob/master...


Unless/until a lot more is written in Rust... not much. It uses slightly more base RAM to load the binary. Some of the bloat is things that in C programs would be dynamically linked in - it isn't that Rust is doing more, it's that C gets to share a lot of stuff and Rust has to bring its own.


> It uses slightly more base RAM to load the binary.

It's mostly vmem until / unless the data actually gets used though, no?


I don't know. I wouldn't be surprised if it loaded the whole thing. How could the OS predict how much to load (or wait on)? Waiting for a page to load just for the next function call would be hugely expensive.


> How could the OS predict how much to load (or wait on)?

The same way it does for every other bit of allocated memory: it allocates the physical page on a page fault in a valid mapping.
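A rough way to watch this happen, as a sketch only: the helper names `resident_pages` and `touch_pages` below are made up for illustration, the `/proc/self/statm` read is Linux-specific, and the 4096-byte page size in the usage note is an assumption about the machine.

```rust
use std::fs;

/// Resident set size of this process in pages, taken from the second field
/// of /proc/self/statm. Linux only; returns None where that file is missing.
pub fn resident_pages() -> Option<u64> {
    let statm = fs::read_to_string("/proc/self/statm").ok()?;
    statm.split_whitespace().nth(1)?.parse().ok()
}

/// Writes one byte into each `page_size`-sized chunk of `buf`, so every page
/// backing the buffer gets touched at least once. Returns the number of
/// chunks written. `page_size` must be non-zero (`chunks_mut` panics on 0).
pub fn touch_pages(buf: &mut [u8], page_size: usize) -> usize {
    let mut touched = 0;
    for chunk in buf.chunks_mut(page_size) {
        chunk[0] = 1; // chunks are never empty, so index 0 is always valid
        touched += 1;
    }
    touched
}
```

To try it, print `resident_pages()` before allocating something like `vec![0u8; 64 << 20]`, again right after the allocation, and once more after `touch_pages(&mut buf, 4096)`. On a typical Linux setup I would expect most of the growth to show up only after the touch step, but that depends on the allocator and kernel settings, so treat it as an experiment rather than a guarantee.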


In general, the effect of bloat is not visible in benchmarks like these where the goal is to run something small many many times, with ample memory available, and as little else on the system adding noise to the results as possible. It's the same reason you see "Java is faster than C" benchmark results, yet everyone knows how the former actually performs in practice.

The effects of larger memory usage don't become obvious until other applications start contending for it and/or swapping happens, and it's conveniently also something that is not as easily blamed on one application "being slow", which is why it doesn't receive nearly as much attention as it should.


It does take extra space, but ideally you'd store the exceptional error handling code out-of-line, so that it doesn't need to take up cache in the common case.
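In Rust, one way to ask for that layout is to push the error construction into a function marked `#[cold]` and `#[inline(never)]`. This is only a sketch: `TooShort`, `too_short` and `read_ethertype` are hypothetical names and not taken from the driver in the article, and the attributes are hints that the compiler may or may not act on.

```rust
/// Hypothetical error type for a frame that is too short to parse.
#[derive(Debug, PartialEq)]
pub struct TooShort {
    pub len: usize,
    pub needed: usize,
}

/// Error path. `#[cold]` tells the compiler this is unlikely to be called,
/// and `#[inline(never)]` keeps its body out of the caller's instruction stream.
#[cold]
#[inline(never)]
fn too_short(len: usize, needed: usize) -> TooShort {
    TooShort { len, needed }
}

/// Hot path: reads the big-endian EtherType at bytes 12..14 of an Ethernet frame.
#[inline]
pub fn read_ethertype(frame: &[u8]) -> Result<u16, TooShort> {
    const NEEDED: usize = 14;
    if frame.len() < NEEDED {
        return Err(too_short(frame.len(), NEEDED));
    }
    Ok(u16::from_be_bytes([frame[12], frame[13]]))
}
```

The common case is then a length compare, a rarely taken branch and two loads; whatever the error path needs lives in a separate function that only gets pulled into cache when it is actually called.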


Our results probably only hold true for workloads with a low IPC. The test case is also a very limited forwarder, but real network functions also have a relatively low IPC in my experience (don't have any numbers to back up this claim, though).


LLVM is a really good backend that optimizes away a lot of it. Bounds checks, for example, usually get removed.


…when possible, of course, unless you're using one of the unsafe methods.
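For anyone who wants to compare the variants in a disassembler or on a compiler explorer, here is a small sketch. The function names are invented for illustration; whether the check in the indexed version is actually removed depends on the compiler version and optimization level.

```rust
/// Indexed loop: `data[i]` is a checked access in the source. Since `i` is
/// bounded by `data.len()`, the optimizer can often prove the check redundant.
pub fn sum_indexed(data: &[u32]) -> u64 {
    let mut total = 0u64;
    for i in 0..data.len() {
        total += u64::from(data[i]);
    }
    total
}

/// Iterator form: there is no index, so there is no bounds check to remove.
pub fn sum_iter(data: &[u32]) -> u64 {
    data.iter().map(|&x| u64::from(x)).sum()
}

/// Explicitly unchecked access via `get_unchecked`, one of the unsafe methods.
pub fn sum_unchecked(data: &[u32]) -> u64 {
    let mut total = 0u64;
    for i in 0..data.len() {
        // SAFETY: `i` comes from `0..data.len()`, so it is always in bounds.
        total += u64::from(unsafe { *data.get_unchecked(i) });
    }
    total
}
```

All three return the same result; the interesting part is comparing the generated code in a release build.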


Unsafe is purely a Rust directive. It doesn’t affect LLVM IR output AFAIK.


What bloat are you referring to specifically?


I'm referring to the table in the article.


I think GP forgot to build with the release flag instead of debug.


If they built with debug there would be dramatically more load/store uops than you see in the benchmark. Debug mode builds disable optimizations and store variables back to memory after most expressions to aid debugging.



