Most of the training cost is not in the final training run, it's in all of the R&D (including salaries, equity, etc.) that it takes to get to the final training run. The actual cost of all of the TPUs (or GPUs), power, networking, storage, etc. for the final training run is significant, but it's even more expensive to have this huge R&D team doing frontier model development and using a lot of those same resources during development.
I think you're right that releasing models at a slower cadence would bring down costs to some degree, but it's not clear how much. All of these companies could significantly reduce their opex but at the risk of falling behind in terms of being at the frontier.
A well run public transit system should obviously be cheaper at scale than robotaxis, but the incentives for Waymo (or Uber, or Lyft, etc.) are very different than the city's incentives. It's very possible that in practice private companies can operate more cheaply at scale than buses because they have much higher incentives to reduce costs and increase efficiency.
Yeah, none of this makes sense to me. Allocating memory for stack space is not expensive (and the default isn't even 1MB??) because you're just creating a VMA and probably faulting in one or two pages.
They also say:
>The system spends time managing threads that could be better spent doing useful work.
What do they think the async runtime in their language is doing? It's literally doing the same thing the kernel would be doing. There's nothing that intrinsically makes scheduling 10k coroutines in userspace more efficient than the kernel scheduling 10k threads. Context switches are really only expensive when the switch is happening between different processes; the overhead of a context switch on a CPU between two threads in the same process is very small (and switches are not free when done in userspace anyway).
There are advantages to doing scheduling in the kernel and there are advantages to doing scheduling in userspace, but this article doesn't really touch on any of the actual pros and cons here, it just assumes that userspace scheduling is automatically more efficient.
It's a cargo cult and a bias I see all over the place.
I feel like we're now, what, 20, 25 years on and people still haven't adjusted themselves to the fact that the machines we have now are multicore, have boatloads of cache, or how that cache is shared (or not) between cores.
Nor is there apparently a real understanding of the difference between VSS and RSS.
Nor of the fact that modern machines are really really fast if you can keep stuff in cache. And so you really should be focused on how you can make that happen.
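A rough sketch of the VSS vs RSS difference, assuming Linux (it reads `/proc/self/statm`). The function names and the 64 MiB size are made up for illustration. It maps a large private anonymous region, which is roughly what a thread stack is, touches two pages, and reports how much virtual size and resident size grew:

```python
import mmap
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")


def vss_rss_bytes():
    """Return (virtual size, resident set size) in bytes.

    Linux-only assumption: reads the first two fields of /proc/self/statm.
    """
    with open("/proc/self/statm") as f:
        fields = f.read().split()
    return int(fields[0]) * PAGE_SIZE, int(fields[1]) * PAGE_SIZE


def reserve_then_touch(size=64 * 1024 * 1024, pages_to_touch=2):
    """Map `size` bytes of anonymous memory, touch a few pages, report growth.

    Returns (vss_growth, rss_growth) in bytes, measured after the touches.
    """
    vss_before, rss_before = vss_rss_bytes()
    # Private anonymous mapping: address space is reserved, nothing is resident yet.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        for i in range(pages_to_touch):
            region[i * PAGE_SIZE] = 1  # the first write faults this page in
        vss_after, rss_after = vss_rss_bytes()
    finally:
        region.close()
    return vss_after - vss_before, rss_after - rss_before
```

On a typical system the virtual size should grow by about the full mapping while the resident size grows by only a few pages (more if transparent huge pages kick in), which is why a big default stack size is mostly an address-space cost rather than a memory cost.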
Not the runtime per se, but cooperative scheduling has the advantage that tasks do not yield at adverse code points, e.g., right before giving up a lock, or performing an I/O request. Of course the lack of preemption has its own downsides, but with thread-per-request you tend to run into tail latency issues much earlier than context switching overhead.
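To make the "no yield at adverse code points" part concrete, here is a minimal sketch using Python's asyncio as a stand-in for a cooperative runtime (the `worker` and `main` names are hypothetical). Each task does a two-step critical section with no `await` inside it, so no other task can run between the two steps:

```python
import asyncio


async def worker(name, log, steps):
    for i in range(steps):
        # No await between these two appends, so the scheduler cannot
        # switch to another task in the middle of the critical section.
        log.append((name, "start", i))
        log.append((name, "end", i))
        # Explicit yield point: other tasks may run only here.
        await asyncio.sleep(0)


async def main():
    log = []
    await asyncio.gather(worker("a", log, 3), worker("b", log, 3))
    return log
```

With preemptive threads the same code could be interrupted between "start" and "end" and would need a lock. The flip side is the one mentioned above: a task that never reaches an `await` stalls everything else on that event loop.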
I can't speak to what every team at Google does, but there are machines with Nvidia GPUs in Borg. However Google charges orgs internally for cpu/memory/gpu/tpu usage and TPUs are *way* more efficient in terms of FLOPS/$ than Nvidia GPUs, so there is a *huge* incentive for teams to use TPUs if they can, especially for teams operating large products.
This is true but the relative overhead of this is highly dependent on the protobuf structure in one's schema. For example, fixed integer fields don't need to be decoded (including repeated fixed ints), and the main idea of the "zero copy" here is avoiding copying string and bytes fields. If your protobufs are mostly varints then yes they all have to be decoded, if your protobufs contain a lot of string/bytes data then most of the decoded overhead could be memory copies for this data rather than varint decoding.
In some message schemas even though this isn't truly zero copy it may be close to it in terms of actual overhead and CPU time, in other schemas it doesn't help at all.
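A small sketch of why the schema matters, written against the protobuf wire format rather than any particular library (all function names here are hypothetical). Varints have to be decoded byte by byte, fixed-width integers can be read directly, and a bytes/string payload can be exposed as a view without copying:

```python
import struct


def decode_varint(buf, pos=0):
    """Decode a base-128 varint starting at pos. Returns (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry data
        if not byte & 0x80:               # high bit clear means last byte
            return result, pos
        shift += 7


def read_fixed32(buf, pos=0):
    """Read a little-endian fixed32 directly. Returns (value, next_pos)."""
    return struct.unpack_from("<I", buf, pos)[0], pos + 4


def bytes_field_view(buf, pos, length):
    """Return a view of a length-delimited payload without copying it."""
    return memoryview(buf)[pos:pos + length]
```

The varint path is a data-dependent loop per field, the fixed32 path is a single load, and the view costs the same regardless of payload size, so which one dominates depends entirely on what the messages contain.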
Pretty much all of the history of HN front pages, posts, and comments are surely in the Gemini training corpus. Therefore it seems totally plausible that Gemini would understand HN inside jokes or sentiment outside of what's literally on the front page given in the prompt, especially given that the prompt specifically stated that this is the front page for HN.
I think it's important to point out the distinction between what POSIX mandates and what actual libc implementations, notably glibc, do. Nearly all non-reentrant POSIX functions are actually only non-reentrant if you are using a 1980s computer that for some reason has threads but doesn't have thread-local storage. All of these things like strerror are implemented using TLS in glibc nowadays, so while it is technically true you need to use the _r versions if you want to be portable to computers that nobody has used in 30 years, in practice you usually don't need to worry about these things, especially if you're using Linux, since these functions store results in static thread-local memory rather than static global memory.
As for the string.h stuff, while it is all terrible it's at least well documented that everything is broken unless you use wchar_t, and nobody uses wchar_t because it's the worst possible localization solution. No one is seriously trying to do real localization in C (and if they were they'd be using libicu).
strerror, at least on glibc, was only made thread safe back in 2020[1], which is really not that long ago in the grand scheme of things. It was WONTFIXed when it was initially reported back in 2005(!). There have only been 10 glibc releases since then and the 2.32 branch is still actively maintained.
There is probably a wide breadth of software that is actively not using that glibc version.
But yeah, agreed that trying to do localization with the builtin functions is fraught with traps and pitfalls. Part of the problem though is less about localization and more due to the fact that you can have bugs inflicted on you if you're not careful to just overwrite the locale with the C locale (and make sure to do this everywhere you can).
NUMA has a huge amount of overhead (e.g. in terms of intercore latency), and NUMA server CPUs cost a lot more than single socket boards. If you look at the servers at Google or Facebook they will have some NUMA servers for certain workloads that actually need them, but most servers will be single socket because they're cheaper and applications literally run faster on them. It's a win win if you can fit your workload on a single socket server so there is a lot of motivation to make applications work in a non-NUMA way if at all possible.
The first is that getaddrinfo is specified by POSIX, and the POSIX standards evolve very conservatively and at a glacial pace.
The second reason is that specifying a timeout breaks symmetry with a lot of other functions in Unix/C, both system calls and libc calls. For example, you can't specify a timeout when opening a file, reading from a file, or closing a file, which are all potentially blocking operations. There are ways to do these things in a non-blocking manner with timeouts using aio or io_uring, but those are already relatively complicated APIs for just using simple system calls, and getaddrinfo is much more complicated.
The last reason is that if you use the sockets APIs directly it's not that hard to write a non-blocking DNS resolver (c-ares is one example). The thing is though that if you write your own resolver you have to consider how to do caching, it won't work with NSS on Linux, etc. You can implement these things (systemd-resolved does it, and works with NSS) but they are a lot of work to do properly.
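For completeness, the usual workaround when you want a deadline but still want the system resolver (NSS, caching and all) is to run the blocking call on another thread and give up waiting after a timeout. A minimal sketch in Python, where `resolve_with_timeout` and the pool size are made up for illustration:

```python
import concurrent.futures
import socket

# Small shared pool; each in-flight lookup occupies one worker thread.
_resolver_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def resolve_with_timeout(host, port, timeout_s=2.0):
    """Run the blocking getaddrinfo call on a worker thread.

    Returns the getaddrinfo result list, or None if the lookup did not
    finish within timeout_s. The lookup itself is NOT cancelled: the
    worker thread stays busy until getaddrinfo returns on its own.
    """
    future = _resolver_pool.submit(
        socket.getaddrinfo, host, port, type=socket.SOCK_STREAM
    )
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None
```

This only bounds how long the caller waits, not how long the lookup runs, so a burst of slow lookups can still exhaust the pool. That limitation is a big part of why people end up writing or adopting real asynchronous resolvers.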
> For example, you can't specify a timeout when opening a file, reading from a file, or closing a file, which are all potentially blocking operations.
No they're not. Not really, unless you consider disk access and interacting with the page cache/inode cache inside the kernel to be blocking. But if you do that, you should probably also consider scheduling and really any CPU instruction to be blocking. (If the system is too loaded, anything can be slow).
To be fair, network requests can be considered non-blocking in a similar way, but they're depending on other systems that you generally can't control or inspect. In practice you'll see network timeouts. Note that you (at least normally -- there might be tricky exceptions) won't see EINTR from read() to a filesystem file. But you can see EINTR for network sockets. The difference is that, in Unix terminology, disks are not considered "slow devices".
I'd consider "blocking" anything that given same inputs, state and cpu frequency, may take variable time. That means pretty much every system call and entering the system scheduler, doing something that leads to a page fault, etc. Pretty much only pure math in total functions and function calls to paged functions are acceptable.
Yeah... the sudden paging in also has been noted as a source of latency in the network-oriented software. But that's the problem with our current state of the APIs and their implementations: ideally, you'd have as many independent threads of executions as you want/need, and every time one of them initiates some "blocking" operation, it is quickly and efficiently scheduled, and another ready-to-run thread is switched in. Native threads don't give you that context-switching efficiency, and user-space threads can accidentally cause an underlying native thread to block even on "read a non-local variable".
In a data center, networks can have lower latency than disk (even ssd). Now the real place this all falls on its head is page faults. There are definitely places where you need to have granular control over what can and cannot stall a thread from making progress.
> No they're not. Not really, unless you consider disk access and interacting with the page cache/inode cache inside the kernel to be blocking.
The important point is that the kernel takes locks during all those operations, and will wait an unbounded amount of time if those locks are contended.
So really and truly, yes, any synchronous syscall can schedule out for an arbitrary amount of time, no matter what you do.
It's sort of semantic. The word "block" isn't a synonym for "sleep", it has a specific meaning in POSIX. In that meaning, you're correct, they never "block". But in the generic way most people use the term "block", they absolutely do.
And neither are the tapes. But the pipes, apparently, are.
Well, unfortunately, disk^H^H^H^H large persistent storage I/O is actually slow, or people wouldn't have been writing thread-pools to make it look asynchronous, or sometimes even process-pools to convert disk I/O to pipe I/O, for the last two decades.
There is a misunderstanding. "Slow device" in the POSIX sense is about unpredictability, not maximal possible bandwidth. Reading from a spinning disk might be comparably slow in the bandwidth sense, but it's actually quite deterministic how much data you can shovel to or from it.
A pipe on the other hand might easily stall for an hour. The kernel generally can't know how long it will have to wait for more data. That's why pipe reads (as well as writes) are interruptible.
The absolute bandwidth number of a harddisk doesn't matter --- in principle you can overload any system such that it fails to schedule and complete all processes in time. Putting aside possible system overload, the "slow device" terminology makes a lot of sense.
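One way to see the distinction from userspace, as a sketch assuming a POSIX system (the helper names are hypothetical): `select()` always reports a regular file as readable, because the kernel treats the read as something that will complete, while an empty pipe is reported as not ready for as long as nobody writes to it.

```python
import os
import select
import tempfile


def readable_within(fd, timeout_s):
    """Return True if select() reports fd readable before the timeout."""
    ready, _, _ = select.select([fd], [], [], timeout_s)
    return bool(ready)


def compare_file_and_pipe():
    """Return (file_ready, empty_pipe_ready, pipe_ready_after_write)."""
    read_end, write_end = os.pipe()
    try:
        with tempfile.TemporaryFile() as f:
            f.write(b"data")
            f.flush()
            f.seek(0)
            # Regular files are always reported ready by select().
            file_ready = readable_within(f.fileno(), 0.1)
        # Nothing written yet: the kernel cannot know when data will arrive.
        empty_pipe_ready = readable_within(read_end, 0.1)
        os.write(write_end, b"x")
        pipe_ready_after_write = readable_within(read_end, 0.1)
    finally:
        os.close(read_end)
        os.close(write_end)
    return file_ready, empty_pipe_ready, pipe_ready_after_write
```

On Linux this should return `(True, False, True)`. The file read may still take a while under load, but the readiness API does not model that wait at all, which matches the "slow device" split described above.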
Seeking a tape also takes an unpredictable amount of time; and so does seeking a disk, for that matter (IIRC, historically it was actually quite difficult for UNIX systems to saturate a disk's throughput with random reads).
According to ChatGPT, a tape device is actually considered a "slow device". Even though I'm not sure it's that unpredictable. Maybe for most common use cases it is.
I was under the impression that seeking a disk you can generally calculate well with 10ms? Again, it depends on the file system abstractions built on top, and then the cache and the current system load -- how many seeks will be required?
> And neither are the tapes. But the pipes, apparently, are.
The "slow vs fast" language is really unfortunate, I realize it's traditional but it's unnecessarily confusing.
A better way to conceptualize it IMHO is bounded vs unbounded: a file or a tape contains a fixed amount of data known a priori, a socket or a pipe does not.
I agree. If you actually know what you're doing you can use perf and/or ftrace to get highly detailed processor metrics over short periods of time, and you can see the effects of things like CPU stalls from cache misses, CPU stalls from memory accesses, scheduler effects, and many other things. But most of these metrics are not very actionable anyway (the vast majority of people are not going to know what to do with their IPC or cache hit or branch hit numbers).
What most people care about is some combination of latency and utilization. As a very rough rule of thumb, for many workloads you can get up to about 80% CPU utilization before you start seeing serious impacts on workload latency. Beyond that you can increase utilization but you start seeing your workload latency suffer from all of the effects you mentioned.
To know how much latency is impacted by utilization you need to measure your specific workload. Also, how much you care about latency depends on what you're doing. In many cases people care much more about throughput than latency, so if that's the top metric then optimize for that. If you care about application latency as well as throughput then you need to measure both of those and decide what tradeoffs are acceptable.
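As a back-of-the-envelope illustration of why latency falls off a cliff at high utilization, here is the textbook M/M/1 queue formula. This is an assumption brought in for illustration, not a model of any real workload, and the 80% figure above is an empirical rule of thumb rather than something derived from it. In that model the mean time in system is the service time divided by (1 - utilization):

```python
def mm1_mean_latency(service_time_s, utilization):
    """Mean time in system for an M/M/1 queue (textbook simplification).

    service_time_s: mean time to serve one request on an idle server.
    utilization: fraction of time the server is busy, in [0, 1).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


def latency_multipliers(utilizations=(0.5, 0.8, 0.9, 0.95)):
    """Latency relative to an idle server at each utilization level."""
    return {u: mm1_mean_latency(1.0, u) for u in utilizations}
```

Under this model latency is about 2x the idle service time at 50% utilization, 5x at 80%, 10x at 90%, and 20x at 95%. Real systems have multiple cores, batching, and non-random arrivals, so the exact numbers will differ, which is why you still have to measure your own workload.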