The basic Erlang assumption is that bugs are predominantly due to state.

You will have tested at least the success path in your code (but probably not all the error conditions). 'Logic bugs' should then fall on the non-critical path, so dropping the state that led you down that path is okay; if it's a critical process, you restart it into a known good state. A trivial example is a user giving you bad input: any validation (or even a lack of validation, provided it breaks something down the line) should cause the data to be dropped, while the listening process stays up (i.e., it crashes and restarts into a known good state), ready to take other user requests, and any other in-flight user requests are left alone.
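To make the bad-input case concrete, here is a minimal sketch. The module name, handle/1, and parse_age/1 are all made up for illustration, and running each request in its own monitored process is just one simple way to get the isolation; a real system would more likely put the handler under a supervisor.

```erlang
-module(let_it_crash_demo).
-export([handle/1, parse_age/1]).

%% Hypothetical request logic with no defensive error handling on purpose.
%% binary_to_integer/1 raises badarg on input such as <<"abc">>, and the
%% match below fails (badmatch) on negative numbers.
parse_age(Bin) when is_binary(Bin) ->
    Age = binary_to_integer(Bin),
    true = Age >= 0,
    Age.

%% Run each request in its own short-lived process and monitor it.
%% If the request process crashes, only that request's state is lost;
%% the caller just sees a 'DOWN' message and carries on.
handle(Input) ->
    {Pid, Ref} = spawn_monitor(fun() -> exit({ok, parse_age(Input)}) end),
    receive
        {'DOWN', Ref, process, Pid, {ok, Age}} -> {ok, Age};
        {'DOWN', Ref, process, Pid, _Crash}    -> {error, dropped}
    end.
```

With this, let_it_crash_demo:handle(<<"42">>) returns {ok,42}, and let_it_crash_demo:handle(<<"abc">>) returns {error,dropped} (plus a crash report in the log), after which the caller can keep handling requests as if nothing happened.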

For load, Erlang makes it easy to write code that scales automatically (up to the hardware limit, obviously). It also makes it easy to write code that can be distributed across multiple machines and scale up that way as well. It's also pretty easy to write rate limiters and the like, to automatically shed load if you're bottlenecking somewhere.
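As a rough idea of what shedding load can look like, here is a sketch that refuses new work when a worker's mailbox is already backed up. The module, submit/3, and the {job, Job} message shape are assumptions for the example; process_info/2 with message_queue_len is a real BIF, but it only works for processes on the local node, and the check is inherently racy, so treat this as a starting point rather than a proper rate limiter.

```erlang
-module(shed_demo).
-export([submit/3]).

%% Hypothetical helper: hand Job to Worker unless its mailbox already
%% holds MaxQueue or more messages, in which case the request is shed.
submit(Worker, Job, MaxQueue) when is_pid(Worker) ->
    case erlang:process_info(Worker, message_queue_len) of
        {message_queue_len, Len} when Len >= MaxQueue ->
            %% Worker is falling behind: drop the request up front.
            {error, overloaded};
        {message_queue_len, _Len} ->
            Worker ! {job, Job},
            ok;
        undefined ->
            %% process_info/2 returns undefined for a dead process.
            {error, noproc}
    end.
```

The caller decides what "shed" means: return an error to the user, retry later, or route to another worker.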

Erlang also makes it harder for humans to break things (in the sense of user error, or production support tinkering with it). You can still do it, but supervisor processes often help mitigate that: if a human puts in bad state such that it affects an important process, the process will crash and restart.

How is it different from a try/catch? Because rather than trying to figure out every location where an error can occur in your code, you instead ask "what happens if -anything- goes wrong with this process? How do I get this process back to a good state?" You already partially answered this question when you created the process and had to ask "how do I start this process in a good state?" Does the process depend on another process being up? Then you would already have had to specify, as part of the supervisor, that the other process is started first, and picked a supervision strategy that will restart that other process if this one restarts, causing both to dump their state and reinitialize. With a try/catch, not only do you have to recognize each individual thing that can go wrong (or have a catch-all that hides errors), you also have to decide how to recover from it, and it is often not at all clear how to do that. You end up with a lot more places where you can miss handling errors, a lot of assumptions being made, and no good way to tie multiple processes together.
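For what that looks like in practice, here is a minimal sketch of a supervisor with two dependent children. The module name, the child ids db and cache, and the toy counter loop are invented for the example; the supervisor behaviour, the map-based child specs, and the one_for_all strategy are standard OTP. Children are started in list order, so db comes up before cache, and one_for_all means a crash in either one restarts both (rest_for_one is the alternative if only the later children should be restarted when an earlier one dies).

```erlang
-module(demo_sup).
-behaviour(supervisor).
-export([start_link/0, init/1, start_worker/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_all: if any child dies, all children are restarted.
    %% Give up if there are more than 5 restarts within 10 seconds.
    SupFlags = #{strategy => one_for_all, intensity => 5, period => 10},
    %% Children start in this order: db first, then cache.
    Children = [
        #{id => db,    start => {?MODULE, start_worker, [db]}},
        #{id => cache, start => {?MODULE, start_worker, [cache]}}
    ],
    {ok, {SupFlags, Children}}.

%% Hypothetical worker: a registered process holding a counter as its state.
%% "Starting in a good state" here simply means starting from 0.
start_worker(Name) ->
    Pid = spawn_link(fun() -> loop(0) end),
    register(Name, Pid),
    {ok, Pid}.

loop(Count) ->
    receive
        bump        -> loop(Count + 1);
        {get, From} -> From ! {count, Count}, loop(Count);
        crash       -> exit(boom)   % stand-in for any unexpected bug
    end.
```

After demo_sup:start_link(), sending db ! bump and then cache ! crash brings both processes back with a count of 0. Nowhere in the worker is there code that tries to recover from the crash; the only recovery logic is "start again from the initial state".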

Restarting an OS process is actually a great example of why this philosophy works well. If the app dies, restart it; it probably died because of bad state. You already understand that this is a great option for unexpected bugs that leave the entire app in a bad state. Often, though, that's overkill: your bad state is probably confined to a very small part of the app, and if the app is written so that unrelated things stay unrelated (and communicate only via shared-nothing message passing), that part can be restarted individually within the app. Subsystems, if you will. That's what Erlang does. The idioms and functionality intrinsic to the language make it far easier to write code encapsulated along process lines, such that an error in one process can be treated as identical to any other error in that process, or that kind of process, and handled in a well-defined way (drop the state; restart the process into a known good state), leaving the rest of the app unaffected.