The vast majority of AI companies I talk to seem to evaluate models mostly based on vibes.
At my company, we use a mix of offline and online evals. I’m primarily interested in search agents, so I’m fortunate that information retrieval is a well-developed research field with clear metrics, methodology, and benchmarks. For most teams, I recommend shipping early/dogfooding internally, collecting real traces, and then hand-curating a golden dataset from those traces.
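For the offline side, here's a minimal sketch of what that ends up looking like once the golden dataset exists. It's not our actual harness: `run_agent` and the JSONL format are placeholders, and recall@k just stands in for whichever IR metric fits your task.

```python
import json

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of known-relevant docs that show up in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / max(len(relevant_ids), 1)

def run_offline_eval(golden_path, run_agent, k=10):
    """golden_path: JSONL of {"query": ..., "relevant_ids": [...]} hand-curated from real traces."""
    scores = []
    with open(golden_path) as f:
        for line in f:
            example = json.loads(line)
            retrieved_ids = run_agent(example["query"])  # placeholder: your search agent
            scores.append(recall_at_k(retrieved_ids, example["relevant_ids"], k))
    return sum(scores) / len(scores)
```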
Many people run simple ablation experiments where they swap out the model and see which one performs best. That approach is reasonable, but I prefer a more rigorous setup.
If you only swap the model, some models may appear to perform better simply because they happen to work well with your prompt or harness. To avoid that bias, I use GEPA to optimize the prompt for each model/tool/harness combination I’m evaluating.
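For concreteness, here's roughly how I wire that up with DSPy's GEPA optimizer. Treat it as a sketch: the program, metric, and dataset splits are placeholders from my own harness, and the exact GEPA constructor arguments (budget, reflection model) may differ across DSPy versions, so check the docs.

```python
import dspy

def compare_models(candidate_lms, metric, trainset, valset, testset):
    """candidate_lms: dict of name -> dspy.LM, one per model/harness combo.
    `metric` is assumed to be compatible with GEPA's expected signature."""
    results = {}
    for name, lm in candidate_lms.items():
        dspy.configure(lm=lm)
        program = dspy.ChainOfThought("question -> answer")  # stand-in for the real agent
        optimizer = dspy.GEPA(
            metric=metric,
            auto="light",                            # optimization budget; assumption, check the GEPA docs
            reflection_lm=dspy.LM("openai/gpt-4o"),  # assumed reflection model, swap for your own
        )
        optimized = optimizer.compile(program, trainset=trainset, valset=valset)
        # Score each model with *its own* optimized prompt, so no model is penalized
        # for a prompt that happened to be tuned for a different one.
        scores = [metric(ex, optimized(question=ex.question)) for ex in testset]
        results[name] = sum(scores) / len(scores)
    return results
```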
Ah, interesting – yeah only swapping out the model isn't super insightful since models perform differently given different prompts. I'm going to look into GEPA, thanks!
Sounds like it could be many things. There was a well-known paper called Voyager from NVIDIA in which an agent was able to write its own skills in the form of code and improve them over time.
Funnily enough, this agent played Minecraft, and its skills were things like collecting materials or crafting items.
https://arxiv.org/abs/2305.16291
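The core idea is simpler than it sounds: skills live in a library as source code the agent wrote, get looked up when they're relevant, and get rewritten when they fail. This is not the paper's actual code, just a toy Python sketch of that loop with a made-up `craft_pickaxe` skill.

```python
class SkillLibrary:
    """Toy version of a Voyager-style skill library: skills are stored as source
    code the agent wrote, and can be retrieved and revised over time."""

    def __init__(self):
        self.skills = {}  # name -> {"description": str, "source": str}

    def add_skill(self, name, description, source):
        self.skills[name] = {"description": description, "source": source}

    def get_skill(self, name):
        # Compile and return the callable; in Voyager the skills were JS acting on Minecraft APIs.
        namespace = {}
        exec(self.skills[name]["source"], namespace)
        return namespace[name]

    def refine_skill(self, name, new_source):
        # When a skill fails, the agent writes an improved version and overwrites the old one.
        self.skills[name]["source"] = new_source

library = SkillLibrary()
library.add_skill(
    "craft_pickaxe",
    "Collect wood and stone, then craft a pickaxe.",
    "def craft_pickaxe(bot):\n    return bot.craft('stone_pickaxe')",
)
```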
Yeah, this is the natural next step. We're currently combining LLMs and compute: first by giving agents tools, then terminal access, and most recently sandboxes. The logical progression is to give them specialized compute engines and frameworks for their tasks.
I've been building SQL agents recently, and nothing is better than just giving it access to Trino.
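For context, the tool I hand the agent is little more than a thin wrapper over the `trino` Python client. The connection details below are placeholders, and the row cap is just one way to keep results prompt-sized.

```python
import trino

# Placeholder connection details; in practice these come from config.
conn = trino.dbapi.connect(
    host="trino.internal", port=8080, user="sql-agent",
    catalog="hive", schema="analytics",
)

def run_query(sql: str, max_rows: int = 100):
    """Tool exposed to the agent: run a query on Trino and return a capped result set."""
    cur = conn.cursor()
    cur.execute(sql)
    columns = [d[0] for d in cur.description]
    rows = cur.fetchmany(max_rows)
    return {"columns": columns, "rows": rows}
```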
Chroma | ONSITE - San Francisco | Full-time | https://trychroma.com
Chroma is the open-source database designed for AI. Help build the future of context engineering!
We're a team of 17, all engineers today. We work in Rust, Python, TypeScript, and Go.
We're hiring for these areas:
- Database Storage
- Distributed Systems
- Product Engineering
- Platform
- Product Design
What we’ve recently done:
- We built and launched a new search API with hybrid search and sparse vectors
- We built a new sync API that handles ingesting, chunking, and embedding data from external sources (currently GitHub repos)
My problem with AI in Zed is not that it's there, but that it always feels a step behind the current AI code editor paradigms. They were late to add edit predictions, and their agent UX lags the competition. Recently, Cursor added background agents, which I think is a game changer, and their absence is now a deal breaker for me when choosing an editor.
It makes me wonder whether choosing to build their own GUI framework in Rust was the right move. Zed is a great code editor, but for me, it's not a great AI code editor.
- Upgrade our NodeJS version because it just got deprecated
- Upgrade our linter to the newest version, add a new rule, and fix all instances of that rule in our code
- Make minor changes to our UI
- Fix small bugs that I know how to fix, and can tell it exactly what to do
The main pain point they're solving for me is that I have many small tasks to do. Writing the code isn't the bottleneck; creating a new branch and then opening a new PR is. With Cursor specifically, I don't even have to check out the branch locally to verify the code.
For any significant work, I'd rather do it manually in the editor.
hey this is pretty cool! my favorite git command right now is `git rebase -i`, so a tool like this would be pretty helpful because I don't always have the best commit messages :)