They will not measure up. Notice they're comparing it to Gemma, Google's open-weight model, not to Gemini, Sonnet, or GPT. That's fine - this is a tiny model.
They're absolutely worth using for the right tasks. It's hard to go back to GPT-4 level for everything (for me at least), but there's plenty of stuff they're smart enough for.
If you want something closer to the frontier models, Qwen3.6-Plus (not open) is doing quite well[1] (I've not tested it extensively personally):
[1] https://qwen.ai/blog?id=qwen3.6