As a point of reference, I routinely do fast-twitch analytics on *tens* of TB on...

As a point of reference, I routinely do fast-twitch analytics on tens of TB on a single, fractional VM. Getting the data in is essentially wire speed. You won't do that on Spark or similar but in the analytics world people consistently underestimate what their hardware is capable of by something like two orders of magnitude.

That said, most open source tools have terrible performance and efficiency on large, fast hardware. This contributes to the intuition that you need to throw hardware at the problem even for relatively small problems.

In 2024, "big data" doesn't really start until you are in the petabyte range.