The nice thing is that with LLMs using Markdown, we are getting an ecosystem around a universal method for communicating textual information. The downside is that Markdown is starting to look like the https://xkcd.com/927/ cartoon.
The silly part is having n+1 Markdown standards that all end up rendering as HTML anyway. Personally, if it's a plain text file, sure, basic Markdown is fine; beyond that, just give me some kind of rich text editor that stores as HTML and lets me do whatever, so I don't have to hand-format a Markdown table.
I'm a hot dog chef with over 20 years of experience. Credited with inventing 274 hot dog styles. International awards. World-renowned industry figure.
My entire team, very competent hot dog experts, was laid off after a hot dog cooking machine could do in just one day what took us 3 months. I've been out of a job for 12 months. The reason? All hot dog making has been offloaded to Claudog Hotdog. "Sorry, manual hot dog cooking is a thing of the past," one recruiter told me.
I'm working as a software engineer as we speak. I keep applying to hot dog related positions but I get no interviews. Even positions significantly below my pay grade and skillset. No one is hiring. Hot dog cooking is over. We are entering a new era.
I'd take these options from several companies (all selling hot dogs) and wrap them up in Collateralized Hotdog Obligations, which I'd then offer to investors.
I build bicycles. I was shocked when our internal team built a bicycle that goes to the moon.
We are afraid to release it to the public! And thus we are shutting down the company. We don't want humans polluting the Moon, the atmosphere, and space!
I tried using it for a specific web search task. I wrote a skill, got it all set up and deployed. It worked. But it would have worked just as well as a cron job with some LLM looking at Brave API results. Like a lot of AI tools, it was a lot of work for underwhelming results.
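Something like this sketch is what I mean (assuming the requests and openai Python packages; the Brave Search endpoint and auth header exist, but the query, model name, and glue here are placeholders, and the response shape is from memory):

    import os
    import requests
    from openai import OpenAI

    # Pull fresh results from the Brave Search API (token from the env).
    resp = requests.get(
        "https://api.search.brave.com/res/v1/web/search",
        params={"q": "my recurring search"},
        headers={"X-Subscription-Token": os.environ["BRAVE_API_KEY"]},
    )
    results = resp.json()["web"]["results"][:5]

    # Let an LLM look at the results and flag anything noteworthy.
    summary = OpenAI().chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Anything new or noteworthy here? {results}",
        }],
    ).choices[0].message.content
    print(summary)  # cron mails stdout to you by default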
This is probably my favorite gain from AI-assisted coding: the bar for "who cares about this app" has dropped to a minimum of one person for an app to make sense. I recently built an app for grocery shopping that is specific to how and where I shop; it would be useless to anyone other than my wife. Took me 20 minutes. This is the next frontier: if I have a random manual process I do every week, I'll write an app that does it for me.
More than that. Building a throwaway-transient-single-use web app for a single annoying use kind of makes sense now, sometimes.
I had to create a bunch of GitHub and Linear apps. Without me even asking, Codex whipped up a web page and a local server to set them up, collecting the OAuth credentials and forwarding them to the actual app.
Took two minutes, I used it to set up the apps in three clicks each, and then just deleted the thing.
Same energy here. I was sitting on 50+ .env files across various projects with plaintext API keys and it always bothered me but never enough to actually fix it. AI dropped the effort enough that I just had a dedicated agent run at it for a few days — kept making iterations while I was using it day to day until it landed on a pretty solid Touch ID-based setup.
This mix of doing my main work on complex stuff (healthcare) with heavy AI input, and then having 1-2 agents building lighter tools on the side, has been surprisingly effective.
That's fine and all, but how much are you ready to pay to Anthropic and OpenAI to be able to do this? Like, is it worth 100 bucks a month for you to have your own shopping app?
Haha, great. I guess my wider point is that most people won't be ready to pay for it, and in the end there will be only two ways for OpenAI et al to monetize: ads or B2B. And B2B will only work if they invest a lot into sales, or if the business owners see real productivity gains once the hype has died down.
It's not worth 100 bucks a month for me to have my own shopping app, but maybe it's worth 100 bucks a month to have ready access to a software garden hose that I can use if I want to spew out whatever stupid app comes to my mind this morning.
I'd rather not pay monthly for something (like water) that I'm turning on and off and may not even need for weeks. But paying per-liter is currently more expensive so that's what we currently do.
I think the future is going to be local models running on powerful GPUs that you have on-prem or in your homelab, so you don't need your wallet perpetually tethered to a company just to turn the hose on for a few minutes.
Me: a photo editor tool to semi-automate the task of digitizing a few dozen badly scanned old physical photos for a family photo book. I needed something that could auto-straighten and auto-crop the photos, with the ability to quickly make manual adjustments. Gemini single-shotted a working app that, after a few minutes of back-and-forth as I used it and complained about the process, gained full four-point cropping (arbitrary lines) with snapping to lines detected in the image content for minute adjustments.
Before that, it single-shotted an app for me where I can copy-paste a table (or a subsection of one) from Excel and print it out perfectly aligned on label sticker paper. It does instantly what used to take me an hour each time, when I had to fight Microsoft Word (mail merge) and my Canon printer's settings to get the text properly aligned on the labels and not cut off because something along the way decided to scale content or add margins or such.
Neither of these tools is immediately usable by others. They're not meant to be, and that's fine.
My buddy and I are writing our own CRUD web app to track our gaming. I was looking at a ticketing system to use for us to just track bug fixes and improvements. Nothing I found was simple enough or easy enough to warrant installing it.
I vibe'd a basic ticketing system in just under an hour that does what we need. So not 20 mins, but more like 45-60.
I built a small app to emit a 15 kHz beep (that most adults can't hear) every ten minutes, so I can keep time when I'm getting a massage. It took ten minutes, really, but I guess it's in the spirit of the question.
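For reference, the whole thing fits in a dozen lines (a sketch assuming Python with numpy and sounddevice; the frequency and interval match the description above, everything else is illustrative):

    import time
    import numpy as np
    import sounddevice as sd

    RATE = 48_000   # sample rate in Hz, comfortably above 2x the 15 kHz tone
    FREQ = 15_000   # tone most adults can't hear
    BEEP_SECONDS = 0.5
    INTERVAL_SECONDS = 600  # ten minutes

    # Precompute one quiet 15 kHz sine beep.
    t = np.linspace(0, BEEP_SECONDS, int(RATE * BEEP_SECONDS), endpoint=False)
    beep = 0.3 * np.sin(2 * np.pi * FREQ * t)

    while True:
        sd.play(beep, RATE)
        sd.wait()  # block until the beep finishes
        time.sleep(INTERVAL_SECONDS)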
For 20 minutes of time, I had a simple TTS/STT app that allows me to have a voice conversation with my AI assistant.
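For anyone wanting to replicate it, the loop is roughly this (a sketch assuming the speech_recognition, pyttsx3, and openai packages; the model name is a placeholder and the commenter's actual stack may differ):

    import speech_recognition as sr
    import pyttsx3
    from openai import OpenAI

    client = OpenAI()           # reads OPENAI_API_KEY from the environment
    recognizer = sr.Recognizer()
    voice = pyttsx3.init()

    while True:
        # STT: capture one utterance from the default microphone.
        with sr.Microphone() as mic:
            audio = recognizer.listen(mic)
        text = recognizer.recognize_google(audio)

        # Ask the assistant for a reply.
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": text}],
        ).choices[0].message.content

        # TTS: speak the reply out loud.
        voice.say(reply)
        voice.runAndWait()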
> The situation on the Rhine/Danube frontier was complex. The peoples on the other side of the frontier were not strangers to Roman power; indeed they had been trading, interacting and occasionally raiding and fighting over the borders for some time. That was actually part of the Roman security problem: familiarity had begun to erode the Roman qualitative advantage which had allowed smaller professional Roman armies to consistently win fights on the frontier. The Germanic peoples on the other side had begun to adopt large political organizations (kingdoms, not tribes) and gained familiarity with Roman tactics and weapons. At the same time, population movements (particularly by the Huns) further east in Europe and on the Eurasian Steppe began creating pressure to push these ‘barbarians’ into the empire. This was not necessarily a bad thing: the Romans, after conflict and plague in the late second and third centuries, needed troops and they needed farmers and these ‘barbarians’ could supply both. But as we’ve discussed elsewhere, the Romans make a catastrophic mistake here: instead of reviving the Roman tradition of incorporation, they insisted on effectively permanent apartness for the new arrivals, even when they came – as most would – with initial Roman approval.
One of the key infrastructures for the Inca's large transportation network connecting diverse territories in the Andes was a system of grass-rope bridges across the ravines that had to be rebuilt annually. I would imagine their fragility played a substantial role in the invasion/occupation. The most important ones were rebuilt by the Spanish in stone once their position was secure.
No blog post; my LLM expert friend told me this was kinda obvious when I shared it with him, so I didn't think it was worth it.
I can tell you how I got there: I did nanogpt, then tried to be smart and train a model with a loss function that targets the next 2 tokens instead of one. Calculate the loss function and you'll see it's exactly the same during training.
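Here's a small numeric check of that claim (a PyTorch sketch with toy shapes, not the original nanogpt code): with teacher forcing, the "next 2 tokens" objective just double-counts the same per-position cross-entropy terms, up to the boundary terms.

    import torch
    import torch.nn.functional as F

    B, T, V = 2, 16, 50                      # toy batch, length, vocab
    logits = torch.randn(B, T, V)            # stand-in for model outputs
    tokens = torch.randint(0, V, (B, T))     # ground-truth token ids

    def ce(pred_pos, tgt_pos):
        # Cross-entropy of position pred_pos's logits against token tgt_pos.
        return F.cross_entropy(logits[:, pred_pos], tokens[:, tgt_pos],
                               reduction="sum")

    one = sum(ce(t, t + 1) for t in range(T - 1))                     # next token
    two = sum(ce(t, t + 1) + ce(t + 1, t + 2) for t in range(T - 2))  # next 2
    edge = ce(0, 1) + ce(T - 2, T - 1)
    print(torch.isclose(two, 2 * one - edge))  # True: same terms, rescaled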
Sibling commenter also mentions:
> the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b) and then with cross-entropy loss which optimizes for log likelihood this becomes a summation.
Unless I've misunderstood the math myself, I don't think GP's comment is quite right if taken literally, since "predict the next 2 tokens" would literally mean predicting indices t+1 and t+2 off of the same hidden state at index t, which is the much newer field of multi-token prediction and not classic LLM autoregressive training.
Instead, what GP likely means is the observation that the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b), and then with cross-entropy loss, which optimizes for log likelihood, this becomes a summation. So training with teacher forcing to minimize "next token" loss simultaneously across every prefix of the ground truth is equivalent to maximizing the joint probability of that entire ground-truth sequence.
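A quick numeric check of that identity (toy probabilities, purely illustrative):

    import math

    # P(a), P(b|a), P(c|a,b) for some toy sequence (a, b, c).
    p_a, p_b_given_a, p_c_given_ab = 0.5, 0.25, 0.8
    joint = p_a * p_b_given_a * p_c_given_ab

    # Cross-entropy minimizes -log P, and the log turns the product into a
    # sum, i.e. a sum of per-token "next token" losses.
    nll_sum = -(math.log(p_a) + math.log(p_b_given_a) + math.log(p_c_given_ab))
    assert math.isclose(-math.log(joint), nll_sum)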
Practically, even though inference is done one token at a time, you don't do training "one position ahead" at a time. You can optimize the loss function for the entire sequence of predictions at once. This is due to the autoregressive nature of the attention computation: if you start with a chunk of text, as it passes through the layers you don't just end up with the prediction for the next word in the last token's final layer; _all_ of the final-layer residuals for previous tokens will encode predictions for their following index.
So attention on a block of text doesn't give you just the "next token" prediction but simultaneous predictions for each prefix, which makes training quite nice. You can just dump in a bunch of text and it's like you trained on the "next token" objective for all its prefixes. (This is convenient for training, but wasted work for inference, which is what leads to KV caching.)
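In a typical PyTorch training step this is a single shifted cross-entropy over the whole sequence (a sketch with a random stand-in for the model's outputs; shapes are illustrative):

    import torch
    import torch.nn.functional as F

    B, T, V = 4, 128, 1_000                            # batch, length, vocab
    tokens = torch.randint(0, V, (B, T))               # a batch of training text
    logits = torch.randn(B, T, V, requires_grad=True)  # stand-in for model(tokens)

    # Position t's logits are scored against token t+1, for every prefix at once.
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, V),
                           tokens[:, 1:].reshape(-1))
    loss.backward()  # one backward pass trains all the prefix predictions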
Many people also know by now that attention is "quadratic" in nature (the hidden state of token i attends to the states of tokens 1...i), but they don't fully grasp the implication: even though for forward inference you only predict the "next token", for backward training the error at token i can backpropagate to tokens 1...i-1. This is despite the causal masking, since token 1 doesn't attend to token i directly, but the hidden state of token 1 is involved in the computation of the residual stream for token i.
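A tiny check makes this concrete (PyTorch sketch, illustrative sizes): put a "loss" only on the last position, and the first token's hidden state still receives gradient through the attention weights.

    import torch

    T, D = 8, 16
    x = torch.randn(1, T, D, requires_grad=True)  # token hidden states
    causal = torch.triu(torch.ones(T, T), diagonal=1).bool()

    # Single-head self-attention with a causal mask.
    scores = (x @ x.transpose(-2, -1)) / D ** 0.5
    att = scores.masked_fill(causal, float("-inf")).softmax(-1)
    out = att @ x

    out[0, -1].sum().backward()            # "loss" only at the last position
    print(x.grad[0, 0].abs().sum() > 0)    # True: token 1 still gets gradient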
When it comes to the statement
>its not unreasonable to say llms are trained to predict the next book instead of single token.
You have to be careful, since during training there is no actual sampling happening. We've optimized to maximize the joint probability of the ground-truth sequence, but this is not the same as maximizing the probability that the ground truth is generated during sampling. Consider that there could be many sampling strategies: greedy, beam search, etc. While the most likely next token is the "greedy" argmax of the logits, the most likely next N tokens are not always found by greedily sampling N times. It's thought that this is one reason why RL is so helpful: since rollouts do in fact involve sampling, you provide rewards at the "sampled sequence" level, which mirrors how you do inference.
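Here's a toy example of greedy decoding missing the most likely pair (numbers made up purely to show the effect):

    # Step-1 token probabilities, and step-2 probabilities given the first token.
    p_first = {"a": 0.5, "b": 0.4, "c": 0.1}
    p_second = {
        "a": {f"t{i}": 0.1 for i in range(10)},  # after "a": 10 equally likely
        "b": {"x": 1.0},                         # after "b": one certain token
        "c": {"x": 1.0},
    }

    # Greedy: argmax at each step -> "a" then some 0.1 token, joint = 0.05.
    first = max(p_first, key=p_first.get)
    second = max(p_second[first], key=p_second[first].get)
    greedy_joint = p_first[first] * p_second[first][second]

    # Exhaustive search over all pairs -> ("b", "x") with joint = 0.4.
    best = max(((w1, w2, p1 * p2)
                for w1, p1 in p_first.items()
                for w2, p2 in p_second[w1].items()),
               key=lambda triple: triple[2])

    print(greedy_joint, best)  # 0.05 vs ("b", "x", 0.4)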
It would be more accurate to say that they're trained to assign the ground-truth continuation, the whole "next book," the highest joint probability, not merely to assign the most likely next token the highest probability.
The idea I tried to express was purely the loss function thing you mentioned, and how all the tasks (1 vs 2 vs n tokens) lead to identical training runs. At least with nanogpt; I don't know if that extrapolates well to current LLM internals and current training.
My hot take is that as that percentage increases, salaries will go up asymptotically until you get to 100%, then they crash to 0. If 80% of your job can be done by AI, I'm going to give you the work of 5 people. When it's 99%, I will give you the work of 100 people.
If 80% is "done by the AI", who is responsible for the inevitable failures on the AI's behalf? Given that inference is wrong some nonzero percentage of the time… in a word: hmm.
How many 9s until you're comfortable? Even then, knowing 1000 tasks could likely have at least 1 foundational issue… how do you audit? "Pretty please do the needful," and have another "please ensure they do the needful"? Do you review all 1000 inputs/outputs processed? Don't get me wrong, I'm familiar with the "send it" ethos all too well, but at scale it seems like quite the pickle.
Genuinely curious how most people consider these angles… I was tasked with building a model once to perform what literally could've otherwise been a SQL query. When I brought this up, it was met with "well, we need to do it with AI". I don't think a human's gonna want to find that needle in a haystack when 100,000 significant documents are originated… but I don't have to worry about that one anymore, thank goodness.
If you're okay with the work being done poorly and without review, then sure. Otherwise, it'll take the same amount of time and be done worse. I would not trust a single person to review 5 people's work, let alone 100.
You’re arguing semantics. OP is hypothesising a future where the quality of work is comparable to that of a human. If you don’t believe that that’s on the cards, just say it, but you’re intentionally misrepresenting the hypothetical.