Yeah, this is really, really far from an apples-to-apples comparison. First off, the test dataset size is trivially small for the use cases where big data systems are typically applied. I don't know why you'd introduce all the complexity and overhead of a distributed MapReduce framework to ETL a dataset that would fit in memory on consumer-grade hardware. It's not exactly fair to compare a framework running on a single node against one where you've artificially introduced multiple nodes and network overhead for a dataset that would easily fit on one. You'll also notice a pretty stark difference between the level of detail provided for the BlazingSQL test setup and the Spark one, which (unless I'm missing something) is lacking any code or configuration details. I've dipped my toes in the big data space long enough, and seen enough "${FANCY NEW FRAMEWORK} beats ${INDUSTRY-STANDARD FRAMEWORK} by 123x!!" posts, to recognize this as a gigantic red flag. How you manage partition sizes, the order and choice of operations, and tuning parameters can make orders-of-magnitude differences in performance.
Maybe the future of frameworks like this will be on the GPU. I'm just not seeing any evidence of it yet. Right now, Spark fills the space where you can throw globs of memory at TB- to PB-scale problems. I could very well be wrong, but I don't see how this is going to be cost-effective on GPUs given the current cost of memory there.
1. You don't have to fit your whole workload on the GPU; you can process it in batches, just as you would for a workload that doesn't fit into memory in a non-GPU solution. You don't need petabytes of GPU memory to run PB-scale workloads.
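To illustrate the batching idea: stream a dataset that is larger than device memory through a fixed-size buffer, aggregating a partial result per batch. This is a hypothetical sketch, not BlazingSQL code; `BATCH_ROWS` and `process_batch` are invented names, and a plain Python sum stands in for the GPU kernel.

```python
BATCH_ROWS = 1_000  # pretend this is all that fits in GPU memory at once

def process_batch(rows):
    # stand-in for a per-batch GPU kernel, e.g. a filtered aggregation
    return sum(r for r in rows if r % 2 == 0)

def process_in_batches(rows, batch_rows=BATCH_ROWS):
    # walk the "large" dataset one device-sized slice at a time,
    # combining partial results on the host
    total = 0
    for start in range(0, len(rows), batch_rows):
        total += process_batch(rows[start:start + batch_rows])
    return total

data = list(range(10_000))  # dataset living in host memory / on disk
print(process_in_batches(data))  # same answer as processing it in one pass
```

The point is that device memory bounds the batch size, not the workload size.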
2. The dataset is trivially small because this is a new engine built for the RAPIDS ecosystem, and it is limited for the time being to a single node. We are releasing our distributed version for GTC (mid-March) and will be able to give you more reasonable comparisons then. This is a similar path of development to our pre-RAPIDS engine, which went from single node to distributed in about a month, because we have built this engine to be distributed. Right now we are finishing up UCX integration, which is the layer we will be using to communicate between all the nodes.
3. You can always try it out. It's on Docker Hub (see links in this post), and if you want to run distributed workloads right now you can manage that process using Dask by handling the splitting up of the job yourself. In a few weeks you will be able to have the job split up for you automatically, without the user needing to be aware of the size of the cluster or how to distribute data across it.
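"Handling the splitting up of the job yourself" follows the same scatter/compute/gather shape regardless of the scheduler. Here's a rough sketch of that pattern using the stdlib's `concurrent.futures` as a stand-in for Dask so it stays self-contained; `word_count` and the partition count are invented for illustration, and Dask's delayed/futures API plays the role of the executor in a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(chunk):
    # stand-in for the per-partition query each worker would run
    return len(chunk.split())

def run_partitioned(text, n_partitions=4):
    # 1. split the input into partitions yourself
    words = text.split()
    step = max(1, len(words) // n_partitions)
    chunks = [" ".join(words[i:i + step]) for i in range(0, len(words), step)]
    # 2. run one task per partition
    with ThreadPoolExecutor() as pool:
        partials = pool.map(word_count, chunks)
    # 3. combine the partial results
    return sum(partials)

print(run_partitioned("the quick brown fox jumps over the lazy dog"))
```

The automatic version they describe would fold steps 1 and 3 into the engine, so the user never touches partitioning.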
We're pretty excited near-term about getting to sub-second / sub-100ms interactive times on real GB-scale workloads. That's pretty normal in GPU land. More so, where this is pretty clearly going is multi-GPU boxes like the DGX-2, which already has 2 TB/s of memory bandwidth. Unlike multi-node CPU systems, I'd expect better scaling because there's no need to leave the node.
With GPUs, the software progression is single GPU -> multi-GPU -> multi-node multi-GPU. By far the hardest step is single GPU. That's what they're showing here.
1. If you process it in batches, then you have to count the time it takes to transfer each batch's data to and from the GPU.
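To make that objection concrete, here's a hypothetical back-of-envelope model where bus transfer time is charged for every batch. All the figures (dataset size, batch size, bandwidths) are made-up assumptions for illustration, not measurements of any real system.

```python
import math

dataset_gb = 100.0          # total data to process (assumed)
batch_gb = 16.0             # what fits on the GPU at once (assumed)
pcie_gb_s = 12.0            # assumed effective host<->device bandwidth
gpu_compute_gb_s = 200.0    # assumed on-GPU processing throughput

n_batches = math.ceil(dataset_gb / batch_gb)
transfer_s = 2 * dataset_gb / pcie_gb_s   # every byte crosses the bus twice
compute_s = dataset_gb / gpu_compute_gb_s
total_s = transfer_s + compute_s

print(f"{n_batches} batches, transfer {transfer_s:.1f}s, "
      f"compute {compute_s:.1f}s, total {total_s:.1f}s")
```

With these particular numbers the bus, not the GPU, dominates end-to-end time, which is why a benchmark that skips transfer time can look much faster than the job really is.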
2. It's fair to start out with small datasets, but then you shouldn't compare against distributed frameworks like Spark; compare against single-node solutions instead.
Also - Spark is very slow compared to analytic distributed DBMSes.
I used to think that experience meant the difference between believing benchmarks and being skeptical of them. Now I know it's the difference between being skeptical and ignoring them.