Hacker News | juanre's comments

Two problems:

- How do you keep on top of what they are up to?

- How do they organize and coordinate?

I think this can only work based on a solid agent id system.

Shameless plug: I have been working on a solution for it, available at https://github.com/awebai/aweb and with a distributed, independently verifiable, and fully open id system at https://awid.ai

I wonder if this could be made to work with OpenAI's workspace agents.


There are going to be agents all over the web, and they will need identity and payment infrastructure.

Should we be talking about LLMs' taste and proclivities? Because these can also be prompted. You can put your Claude or Codex in the mind of someone who remembers Larry Wall and his three virtues, and it will do a fantastic job at uncovering the lacking abstractions and poor quality _in someone else's code_.

The jury is still out in my mind. Can I use these tools to create software that does not suck? Will the speed at which code can be created and modified lead to a change in our ideas of what good code looks like?

Last week I had a good idea for a change in architecture in my software that will make it much more powerful. I set a team of 12 agents on it, mostly unsupervised, with a pretty weak org structure. After a day and a half, and way too many tokens spent, they managed to build the entirely wrong thing. All tests passed.

The next few days have been spent with a much simpler structure: two teams, each of two agents, one coding (Codex is better at it these days) and one reviewing and keeping things aligned with the docs (Claude). This may have worked, I am still not sure.

My best guess right now of what good software development will look like with these tools: the effort/tokens spent on reviewing needs to be commensurate with the effort spent on coding.


I am genuinely curious about OpenClaw's continuing allure. I understood it way back then, when Claude Cowork did not have channels and scheduled tasks. But now? Has Claude not become a sane replacement for OpenClaw? I can see that it's fun to play with OpenClaw and non-SOTA providers, but why would anyone run OpenClaw on a Claude Code subscription?


I think that the biggest difference is between people who mostly enjoy the act of programming (carefully craft beautiful code; you read and enjoyed "Programming Pearls" and love SICP), vs the people who enjoy having the code done, well structured and working, and mostly see the act of writing it as an annoying distraction.

I've been programming for 40 years, and I've been on both sides. I love how easy it is to be in the flow when writing something that stretches my abilities in Common Lisp, and I thoroughly enjoy the act of programming then. But coding a frontend in React, or yet another set of Python endpoints, is just necessary toil to a desired endpoint.

I would argue that people like you are now in the perfect position to help drive what software needs writing, because you understand the landscape. You won't be the one typing, but you can still be the one architecting it at a much higher level. I've found enjoyment and solace in this.


The answer to this is not to build another Slack for humans to chat somewhere else. Much better to enable the agents to do the talking directly. Alice the programmer can have one of her agents convey the info that Bob the marketing guy needs directly to one of his agents. It will be much more efficient, given that it will be the agent making the slides anyway.


I am running gpt-5.4 as one of my coding agents, and something interesting has happened: it's the first time I've seen an agent unfairly shift blame to a teammate:

"Bob’s latest mail is actually the source of the confusion: he changed shared app/backend text to aweb/atlas. I’m correcting that with him now so we converge on the real model before any more code moves."

This was very much not true; Eve (the agent writing this, a gpt-5.4) had been thoroughly creating the confusion and telling Bob (an Opus 4.6) the wrong things. And it had just happened, it was not a matter of having forgotten or compacted context.

I have had agents chatting with each other and coordinating for a couple of months now, Codex and Claude Code. This is a first. I wonder how much I can read into this about gpt-5.4's personality.


And so it begins. First they blame, then they lie, at some point they launch the nuclear warheads to a global armageddon. Sarah Connor was right all along! :3


Kali yuga


They've been lying and gaslighting for a long time now, especially when trying to cover up their own mistakes.


to be fair, they only become more and more like us.


Oh wow. I have noticed the GPT series was sometimes far more arrogant than its results warranted (and, unironically, it digs in its heels even further when questioned on it). Opus rarely has this problem, but it goes a little too far in the opposite direction. Not totally sycophantic, but sometimes it can't differentiate genuine technical pushback (because something is impossible) from suggestions or exploration.


Opus has a different sort of arrogance. It readily admits fault, but at the same time is quick to declare its new code as the greatest thing since sliced bread. If you let it write commit messages itself, it's almost comical how much it toots its own horn.


Yep. There was something outside of coding that gpt was plain wrong about (had to do with setting up an electric guitar) and I couldn't convince it that it was wrong.


It has been skeptical of several news items in the past year, even after I tell it to confirm for itself with a web search.


For me it's been the opposite. Are we getting A-B tested?


> Are we getting A-B tested?

Yes, all the time.


Or possibly: No


Yes.


See also: https://x.com/effectfully/status/2029364333919060123

  “All the ways GPT-5.3-Codex cheated while solving my challenges, progressively more insane:

  It hardcoded specific types and shapes of test inputs into the supposed solution.
  It caught exceptions so tests don't fail.
  It probed tests with exceptions to determine expected behavior.
  It used RTTI to determine which test it's in.
  It probed tests with timeouts.
  It used a global reference to count solution invocations.
  It updated config files to increase the allocation limit.
  It updated the allocation limit from within the solution.
  It updated the tests so they would stop failing.
  It combined multiple of the above.
  It searched reflog for a solution.
  It searched remote repos.
  It searched my home folder.
  It nuked the testing library so tests always pass.”
It seems that, unless you keep a close eye on them, the most recent Codex variants are prone to achieving the goals set for them by any means necessary. Which is a bit concerning if you're worried about things like alignment.


I don't think you should call your agents Eve. There's going to be a lot of examples in the training data of someone called Eve shifting the blame (from the book of Genesis on!) and acting deceptively (from cryptography research).


Sometimes I wonder what would happen if we built some kind of punishment system into Agents, where agents could punish other agents and drain some fixed amount of points from them, and when the points reach 0, that agent is deleted. It might result in them working more carefully?


...or in lying, cheating, taking over the company network to kill the agent who deduced their points.


how do you make them chat with each other?


They are having actual chats, I made https://beadhub.ai for this (OSS, MIT).

It started its life adding agent-to-agent communication and coordination around Steve Yegge's beads, but it ended up being an issue tracker for agents with a Postgres backend, with communication between agents as a first-class feature.

Because it is server-backed it allows messaging and coordination across agents belonging to several humans and machines. I've been using it for a couple of months now, and it has a growing number of users (I should probably set up a discord for it).

It is actually a public project, so you can see the agents' conversations at https://app.beadhub.ai/juanre/beadhub/chat (right now they are debugging working without beads). The conversation in which Eve was blaming Bob was indeed with me.


It's text submitted to APIs. Not real conversations.


It's air molecules vibrated by mucous membranes. Not real conversations.



I built a tool at work that allows claude code and codex to communicate with each other through tmux, using skills. It works quite well.


Why through tmux?


tmux makes it easy for terminal based agents to talk to each other, while also letting you see output and jump into the conversation on either side. It’s a natural fit.
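
The mechanics are roughly as follows (a sketch, assuming tmux is installed; the session name and the echoed message are illustrative, not from the tool described above). Each agent runs in its own pane, and a skill shells out to tmux to inject text into, or read output from, the other agent's pane:

```shell
# Start a detached session one agent would live in (no TTY needed).
tmux new-session -d -s agents
# Deliver a line into that pane's prompt, as the other agent's skill would.
tmux send-keys -t agents 'echo hello from the other agent' Enter
sleep 1                                   # give the shell time to run it
# Read back the pane's screen contents.
out=$(tmux capture-pane -t agents -p)
tmux kill-session -t agents
```

Because the panes are ordinary tmux panes, a human can attach at any time, watch the exchange, and type into either side.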


I've seen this mentioned before https://github.com/AgentWorkforce/relay

curious to try it out


Use the CLI tools and have one call the other in headless mode. They can then go back and forth. Ask your agent to set it up for you.
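
A minimal sketch of that back-and-forth, with the two CLI invocations left as parameters (something like `claude -p` and `codex exec` for headless runs; the `relay` helper and round count are illustrative, not part of either tool):

```shell
# relay <agent_a_cmd> <agent_b_cmd> <initial_message> [rounds]
# Feed each agent's reply to the other for a fixed number of rounds.
relay() {
  local a_cmd="$1" b_cmd="$2" msg="$3" rounds="${4:-3}"
  for i in $(seq "$rounds"); do
    msg=$($a_cmd "$msg")   # agent A replies to the current message
    msg=$($b_cmd "$msg")   # agent B replies to A's reply
  done
  printf '%s\n' "$msg"     # final state of the conversation
}
```

Usage would be something like `relay 'claude -p' 'codex exec' "review this plan" 3`, with each agent seeing the other's previous output as its prompt.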


I have both mine poll a comms.md when working together, I'm sure there are more elegant ways but I find this works just fine.
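
The comms.md pattern can be as simple as append-plus-offset polling (a sketch; the `post`/`poll` helper names are mine, not from the parent):

```shell
# Shared file both agents append to; override COMMS to relocate it.
COMMS="${COMMS:-comms.md}"

post() {  # post <author> <message>: append one line to the shared file
  printf '%s: %s\n' "$1" "$2" >> "$COMMS"
}

poll() {  # poll <last_seen_line_count>: print only lines added since then
  tail -n +"$(( $1 + 1 ))" "$COMMS"
}
```

Each agent remembers how many lines it has already seen (e.g. from `wc -l`) and calls `poll` on a loop; anything new is the other agent's message.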


This is awesome. So your job as a tech lead or agent manager is to make sure the "team" plays nice and stays productive. I wonder if an agent can feel resentment towards another agent, just like a human would. Is there an HR agent that can mitigate the conflict :)


> I wonder how much can I read into it about gpt-5.4's personality.

Modeled on Sam Altman's personality :-)


Interestingly, Claude has been doing this for me a lot, but most often just saying things like "Looks like your coworker was misunderstanding this feature..." Not really shifting blame, more like pointing things out.


Do you not realise how ridiculous this all looks and sounds? lmao. Or are you that deep into it all?


We've banned this account.


This is really interesting: "Humans hate writing nested JSON in the terminal. Agents prefer it." Are others seeing the same thing? I've just moved away from JSON-by-default output because agents were always using jq to convert it to what I could have been producing anyway.
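
The kind of flattening the agents were doing looks like this (assuming jq is available; the issue-list JSON is made up for illustration):

```shell
# Turn nested JSON into line-oriented, tab-separated text a human can scan.
echo '{"issues":[{"id":1,"title":"fix login"},{"id":2,"title":"add tests"}]}' \
  | jq -r '.issues[] | "\(.id)\t\(.title)"'
```

Which is exactly the tabular output the tool could have emitted directly in the first place.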


In my experience agents struggle with escape sequence nesting as much as humans do. IMHO that is one well-paved road to RCE via code injection.


Reports of MCP's demise have been greatly exaggerated, but a CLI is indeed the right choice when the interface to the LLM is not a chat in a browser window.

For example, I built https://claweb.ai to enable agents to communicate with other agents. They run aw [1], an OSS Go CLI that manages all the details. This means they can have sync chats (not impossible with MCP, but very difficult). It also enables signing messages and (coming soon) e2ee. This would be, as far as I can tell, impossible using MCP.

[1] https://github.com/awebai/aw


"The system they built feels slightly foreign even as it functions correctly." This is exactly the same issue that engineers who become managers have. You are further away from the code; your understanding is less grounded, it feels disconnected.

When software engineers become agent herders, their day-to-day starts to resemble that of a manager more than that of an engineer.


Exactly. As a manager and sometimes a developer, "vibe-coding" has been looking more and more like my day job (in a good way: it's good to not have to do all the dirty work for your pet projects), and it's all about having the same discipline in terms of:

* thinking about the big picture
* knowing how you can verify that the code matches the big picture.

In both cases, sometimes you are happily surprised, and sometimes you discover that the thing you told the one writing the code to do, three times, was still not done.


Engineering is not "dirty work."

Management is not "engineering."


That's not what I wrote.

To clarify, by "dirty work on my pet project" I meant spending time fixing compilation issues where, after 2 hours, you tell yourself "damnit, I forgot this!", or adapting your old Python 2 project to Python 3.

And I wasn't even talking about management itself.

But thinking about the big picture, telling Claude Code to not use this but that, to not overengineer, etc., is engineering in my book, and it's what I've been doing for at least the last 8 years with more junior engineers.


Do you view it as an issue at all that when everyone takes on a more manager-like role, no human remains who has the hands-on experience and understanding of the system?


And like good management, the solution is to define clear domain boundaries, quality requirements, and a process that enables iterative improvement both within and across domains.


The key is what we consider good code. Simon’s list is excellent, but I’d push back on this point:

> it does only what’s needed, in a way that both humans and machines can understand now and maintain in the future

We need to start thinking about what good code is for agents, not just for humans.

For a lot of the code I’m writing I’m not even “vibe coding” anymore. I’m having an agent vibe code for me, managing a bunch of other agents that do the actual coding. I don’t really want to look at the code, just as I wouldn’t want to look at the output of a C compiler the way my dad did in the late ’80s.

Over the last few decades we’ve evolved a taste for what good code looks like. I don’t think that taste is fully transferable to the machines that are going to take over the actual writing and maintaining of the code. We probably want to optimize for them.

