Published January 29, 2026 | Version v1
Video/Audio Open

Ep. 346: GPU Scaling: The "Go Wide or Go Tall" Dilemma

  • 1. My Weird Prompts
  • 2. Google DeepMind
  • 3. Resemble AI

Description

Episode summary: In this episode, Herman and Corn dive deep into the engineering trade-offs of serverless GPU workloads. Using a real-world text-to-speech example on the Modal platform, they explore whether it's better to scale horizontally with many small workers or vertically with a single high-end GPU like the H100. They break down the hidden costs of cold starts, the importance of memory bandwidth over raw compute, and how to find the "sweet spot" on the cost-efficiency curve to get the most bang for your buck.

Show Notes

The world of serverless computing has brought a new level of flexibility to AI development, but with that flexibility comes a complex set of economic and engineering trade-offs. In a recent discussion, Herman Poppleberry and Corn tackled a fundamental question posed by a listener named Daniel: When running heavy workloads like text-to-speech (TTS) on a serverless platform like Modal, is it better to go "wide" or "tall"?

This dilemma, which Herman describes as the "Ferrari versus a fleet of scooters" problem, pits horizontal scaling (using many small, inexpensive GPUs) against vertical scaling (using one massive, expensive GPU). While the instinct for many developers is to save money by using cheaper hardware, the hosts argue that the math of the cloud is rarely linear.

### The Bottleneck: Bandwidth Over Raw Power

One of the most significant insights from the discussion is the role of Video Random Access Memory (VRAM) bandwidth. Herman explains that for many AI tasks, the primary bottleneck isn't actually the raw compute power of the chip, but how fast data can move from memory to the processing cores.

Using the Nvidia L4 and the H200 as examples, Herman points out a staggering disparity. While an L4 is significantly cheaper per hour, its memory bandwidth is roughly 300 gigabytes per second. In contrast, an H200 can reach nearly 4.8 terabytes per second, about sixteen times as much. Because large language models and TTS engines are constantly moving data, a cheaper GPU might spend most of its time sitting idle while it waits for data to arrive. In this scenario, a developer is paying for the GPU to do nothing. On a more expensive card with higher bandwidth, a bandwidth-bound task might finish roughly fifteen times faster, so as long as the hourly price is less than fifteen times higher, the "expensive" card is the more cost-effective choice.
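The arithmetic behind that claim can be sketched in a few lines. This is a rough illustration only: the bandwidth figures are the ones quoted above, while the hourly prices, the `GPUS` table, and the amount of data moved are hypothetical placeholders, and the model assumes a job limited purely by memory bandwidth.

```python
# Bandwidth figures are the ones quoted in the episode; the hourly prices
# are HYPOTHETICAL placeholders, not actual platform rates.
GPUS = {
    "L4": {"bandwidth_gb_s": 300.0, "usd_per_hour": 0.80},
    "H200": {"bandwidth_gb_s": 4800.0, "usd_per_hour": 4.50},
}


def job_seconds(data_moved_gb: float, bandwidth_gb_s: float) -> float:
    """Runtime of an idealized job limited only by memory bandwidth."""
    return data_moved_gb / bandwidth_gb_s


def job_cost_usd(data_moved_gb: float, gpu: dict) -> float:
    """Billed cost: runtime in seconds times the per-second price."""
    seconds = job_seconds(data_moved_gb, gpu["bandwidth_gb_s"])
    return seconds * gpu["usd_per_hour"] / 3600.0


if __name__ == "__main__":
    data_gb = 4800.0  # hypothetical total memory traffic for one job
    for name, gpu in GPUS.items():
        runtime = job_seconds(data_gb, gpu["bandwidth_gb_s"])
        print(f"{name}: {runtime:.1f} s, ${job_cost_usd(data_gb, gpu):.5f}")
```

Under these assumed prices the faster card costs less per job even though it costs more per hour. Real workloads are rarely purely bandwidth-bound, so the gap in practice would likely be smaller.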

### The "Sunk Overhead Trap" of Cold Starts

A recurring theme in the episode is the "cold start" problem inherent to serverless environments. When a worker is spawned, the system must find a machine, pull a container image, and load the model into memory. Herman notes that even with advanced features like Modal's GPU snapshotting, this process can take around ten seconds.

If a task, such as generating a single sentence of audio, only takes two seconds of actual compute, spawning a new worker for that task is a massive waste of resources. The developer ends up paying for twelve seconds of time for only two seconds of work. If this is done across twenty parallel workers, the "setup" costs multiply rapidly. This leads to what the hosts call the "sunk overhead trap," where the cost of preparing the environment dwarfs the cost of the actual computation.
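The numbers in this example can be checked with a small calculation. The sketch below uses the ten-second cold start and two-second task from the discussion; the function names are illustrative, and it assumes every worker pays the full cold start and is billed for it.

```python
def billed_seconds(cold_start_s: float, compute_s: float, workers: int = 1) -> float:
    """Total billed time when each worker pays its own cold start."""
    return workers * (cold_start_s + compute_s)


def overhead_fraction(cold_start_s: float, compute_s: float) -> float:
    """Share of one worker's billed time spent on setup rather than work."""
    return cold_start_s / (cold_start_s + compute_s)


if __name__ == "__main__":
    print(billed_seconds(10, 2))              # one worker
    print(billed_seconds(10, 2, workers=20))  # twenty parallel workers
    print(f"{overhead_fraction(10, 2):.0%} of billed time is overhead")
```

With these figures, twenty workers are billed for 240 seconds in total, of which only 40 seconds is useful compute.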

### The Power of Internal Parallelism: Batching

To combat these overhead costs, Herman and Corn suggest focusing on "internal parallelism" through batching. Instead of spawning twenty different workers to handle twenty lines of dialogue, a developer can use a single, more powerful GPU to process those twenty lines simultaneously in a single batch.

Batching is almost always more cost-effective because it requires only one cold start and one model load. Furthermore, high-end GPUs like the H100 or the Blackwell B200 are designed to handle massive amounts of data at once. If a developer only uses a fraction of a high-end card's capacity, they are "paying for empty seats in the stadium." By saturating the card with a large batch, the cost per generated second of audio drops significantly.
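One way to reason about this trade-off is to ask how much more expensive per second the single large GPU can be before batching stops paying off. The sketch below is an assumption-laden illustration: `batch_compute_s` (how long the large GPU needs for the whole batch) is a hypothetical input that would have to be measured, and the cold start is assumed to be the same on both kinds of hardware.

```python
def fan_out_billed_s(n_items: int, cold_start_s: float, item_compute_s: float) -> float:
    """Billed seconds when every item gets its own worker and cold start."""
    return n_items * (cold_start_s + item_compute_s)


def batched_billed_s(cold_start_s: float, batch_compute_s: float) -> float:
    """Billed seconds for one worker that processes the whole batch."""
    return cold_start_s + batch_compute_s


def break_even_price_ratio(
    n_items: int, cold_start_s: float, item_compute_s: float, batch_compute_s: float
) -> float:
    """How many times pricier per second the batch GPU may be and still tie."""
    return fan_out_billed_s(n_items, cold_start_s, item_compute_s) / batched_billed_s(
        cold_start_s, batch_compute_s
    )


if __name__ == "__main__":
    # Twenty lines of dialogue, 10 s cold start, 2 s per line on a small GPU.
    # Pessimistic case: the big GPU gets no batching speedup at all (40 s).
    print(break_even_price_ratio(20, 10, 2, batch_compute_s=40))
    # Optimistic case: the whole batch runs in the time of a single line (2 s).
    print(break_even_price_ratio(20, 10, 2, batch_compute_s=2))
```

Even in the pessimistic case, the single worker can cost several times more per second and still come out ahead, because it pays the cold start only once.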

### Finding the "U-Shaped" Curve

Herman introduces a practical framework for developers to optimize their spending: the "U-shaped" cost curve. To find the "bang for the buck" sweet spot, developers should calculate the "dollars per inference" rather than just looking at the hourly rate of the hardware.

On one end of the curve, very cheap GPUs result in a high cost per inference because the tasks take too long to complete. On the other end, very expensive GPUs can also result in high costs if the workload isn't large enough to fully utilize the hardware's power. The goal is to find the bottom of that U-curve: the card that gives the lowest cost per inference for the workload at hand. For modern TTS or image generation, this often means using mid-tier cards like the L4 or partitioned A100s, or high-tier cards if the batch size is sufficiently large.
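A toy model can make the shape of that curve concrete. Everything in the sketch below is hypothetical: the tier names, prices, and throughputs in `TIERS` are invented for illustration, and the only cost of an under-filled GPU that it models is a fixed ten-second cold start billed at that GPU's rate.

```python
# HYPOTHETICAL tiers: prices and throughputs are invented for illustration.
TIERS = {
    "small": {"usd_per_hour": 0.80, "items_per_s": 2.0},
    "mid": {"usd_per_hour": 2.00, "items_per_s": 10.0},
    "large": {"usd_per_hour": 6.00, "items_per_s": 40.0},
}
COLD_START_S = 10.0  # fixed setup time, billed at the tier's rate


def usd_per_inference(tier: dict, n_items: int, cold_start_s: float = COLD_START_S) -> float:
    """Dollars per item: billed seconds times per-second price, over item count."""
    seconds = cold_start_s + n_items / tier["items_per_s"]
    return tier["usd_per_hour"] / 3600.0 * seconds / n_items


def cheapest_tier(n_items: int) -> str:
    """Name of the tier with the lowest cost per inference for this batch size."""
    return min(TIERS, key=lambda name: usd_per_inference(TIERS[name], n_items))


if __name__ == "__main__":
    for n in (100, 2000):
        costs = {name: usd_per_inference(t, n) for name, t in TIERS.items()}
        print(n, cheapest_tier(n), costs)
```

With a batch of 100 items, the middle tier sits at the bottom of the curve in this model; at 2,000 items the fixed overhead is amortized and the largest tier becomes the cheapest per inference, which matches the point about sufficiently large batches.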

### Beyond Cost: Quality and Possibility

Finally, the discussion touches on how hardware choices affect the software itself. Corn and Herman note that vertical scaling doesn't just change the speed of a task; it changes what is possible. Smaller GPUs often require "quantized" or "distilled" versions of AI models to fit within VRAM limits. By stepping up to a card with 141 GB of VRAM, such as the H200, developers can run full, uncompressed model weights, which can lead to higher quality outputs, such as more human-sounding voices in a TTS application.
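A back-of-the-envelope check shows why precision and VRAM size interact this way. The sketch below estimates the memory needed for model weights alone; the 30-billion-parameter model and the 80% headroom factor are assumptions made for illustration, and real deployments also need memory for activations and caches.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the weights in GB (1 GB = 10**9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, vram_gb: float, headroom: float = 0.8) -> bool:
    """True if the weights fit in an assumed usable share of the VRAM."""
    return weights_gb(params_billions, precision) <= vram_gb * headroom


if __name__ == "__main__":
    model_b = 30.0  # hypothetical 30-billion-parameter model
    for vram in (24.0, 141.0):  # an L4-class card vs. the 141 GB card above
        for precision in ("fp16", "int4"):
            print(vram, precision, fits(model_b, precision, vram))
```

In this example the hypothetical model only fits on the 24 GB card once it is quantized to 4 bits, while the 141 GB card holds the 16-bit weights with room to spare.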

### Conclusion: The 70% Rule

Herman concludes with a rule of thumb for anyone looking to optimize their serverless GPU bill: aim for 70% utilization. If a GPU is sitting at 10% load, it is over-provisioned and wasting money. If it is pinned at 99% and the task is dragging on, it's time to move up to a more powerful tier. By monitoring the ratio of compute time to overhead time and embracing batching, developers can ensure they aren't just spending money on curiosity, but are truly getting the most out of their silicon.
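The rule of thumb could be turned into a simple check along these lines. This is a minimal sketch: only the 70% target and the 10% and 99% examples come from the episode, while the `low_pct` and `high_pct` cut-offs and the function names are assumptions chosen for illustration.

```python
TARGET_UTILIZATION_PCT = 70.0  # the rule-of-thumb target from the episode


def recommend(utilization_pct: float, low_pct: float = 30.0, high_pct: float = 95.0) -> str:
    """Classify a measured GPU utilization; the cut-offs are assumed values."""
    if not 0.0 <= utilization_pct <= 100.0:
        raise ValueError("utilization must be between 0 and 100")
    if utilization_pct < low_pct:
        return "over-provisioned"
    if utilization_pct > high_pct:
        return "consider a larger tier"
    return "on target"


def compute_to_overhead_ratio(compute_s: float, overhead_s: float) -> float:
    """Seconds of useful compute per second of cold-start and setup overhead."""
    return compute_s / overhead_s


if __name__ == "__main__":
    for pct in (10, 70, 99):
        print(pct, recommend(pct))
    print(compute_to_overhead_ratio(compute_s=2, overhead_s=10))
```

A ratio well below 1, as with a two-second task behind a ten-second cold start, suggests the workload should be batched before changing GPU tiers.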

Listen online: https://myweirdprompts.com/episode/serverless-gpu-scaling-efficiency

Notes

My Weird Prompts is an AI-generated podcast. Episodes are produced using an automated pipeline: voice prompt → transcription → script generation → text-to-speech → audio assembly. Archived here for long-term preservation. AI CONTENT DISCLAIMER: This episode is entirely AI-generated. The script, dialogue, voices, and audio are produced by AI systems. While the pipeline includes fact-checking, content may contain errors or inaccuracies. Verify any claims independently.

Files

serverless-gpu-scaling-efficiency-cover.png

Files (25.3 MB)

Checksum (size):

  • md5:1ad8d81b67422749f62f96c079ed4bf4 (6.3 MB)
  • md5:dfba036ea7c6fbb788f73c80f1cc86bc (1.6 kB)
  • md5:389d4c42b633245dabf522472c600af4 (19.0 MB)
  • md5:6c7a0da8763e2ad931d71e33c3eaf090 (18.1 kB)
