
Memory: How It Works and Why It Sets the Cost of Compute
Why AI Systems Remember Instead of Recompute
Memory is simply any form of information that an AI system can recall instead of having to recompute or guess. In AI systems, memory exists to preserve intermediate results and contextual information so the model can reuse prior work instead of repeating computations at each step.
The actual memory chips are attached to the GPU. These chips are High Bandwidth Memory (HBM), a specialized form of DRAM that sits very close to the GPU die via precise packaging. This memory holds the model weights, the key-value cache, and the activations used during inference. HBM is very fast (bandwidth is measured in terabytes per second) but limited in capacity. The server also contains system DRAM attached to the CPU on the motherboard, but it is accessed over slower interconnects and cannot support high-throughput inference. As a result, GPU-resident HBM is the key limiting memory resource.
At the core of how modern language models operate is a mechanism called the key-value cache, or KV cache. When a model is given a prompt and begins processing text, for every token it encounters it computes and stores two vectors in HBM: a key and a value. These vectors are later used in attention's matrix multiplications to determine which prior tokens are relevant to the current computation. The classic analogy is two sticky notes: as you read a word in a sentence, you write down two notes. The first is the key, which describes what the word is and how it might matter later. The second is the value, which holds the information the token actually contains. Take the name “Drew”: the key tells you it is a noun, a subject, a person, and likely an important part of the sentence, whereas the value is simply the content “Drew.” The model goes token by token, logging each token’s key and value into the cache so it can decide whether and how that token should influence future output. As the sentence grows, the notes accumulate and remain available, because these entries stay in memory for the duration of the request.
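As a rough illustration, here is a minimal Python sketch of a per-request KV cache accumulating key and value vectors token by token; the layer, head, and dimension counts are illustrative assumptions, not the configuration of any particular model.

```python
import numpy as np

# Illustrative dimensions only; not the configuration of any specific model.
NUM_LAYERS = 32
NUM_HEADS = 32
HEAD_DIM = 128

class KVCache:
    """Per-request cache: a growing list of (key, value) vectors per layer."""
    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, key, value):
        # Every new token adds one key and one value vector per layer.
        # Nothing is evicted, so the cache grows linearly with sequence length.
        self.keys[layer].append(key)
        self.values[layer].append(value)

    def num_entries(self):
        return sum(len(layer_keys) for layer_keys in self.keys)

cache = KVCache(NUM_LAYERS)
for _token in range(10):                         # pretend prompt of 10 tokens
    for layer in range(NUM_LAYERS):
        key = np.zeros((NUM_HEADS, HEAD_DIM))    # stand-in for the key projection
        value = np.zeros((NUM_HEADS, HEAD_DIM))  # stand-in for the value projection
        cache.append(layer, key, value)

print(cache.num_entries())  # 10 tokens x 32 layers = 320 (key, value) pairs kept in memory
```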
Caching happens during the prefill stage, when the model processes the user’s prompt in parallel and builds up its memory of tokens. The relationship is linear: the longer the prompt, the larger the KV cache in memory. The next stage is decoding, where the model references the entire current KV cache to generate a token. After a token is generated and added to the cache, the decoding step repeats, now attending over an N+1-length cache.
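To make the linear relationship concrete, a common back-of-envelope estimate multiplies two vectors (key and value) by the layer count, head count, head dimension, bytes per element, and number of tokens. The sketch below uses assumed dimensions for illustration only.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache footprint: 2 vectors (K and V) per token per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

for tokens in (1_000, 8_000, 32_000, 128_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> ~{gib:.1f} GiB of KV cache")
# Doubling the prompt doubles the cache: growth is linear in token count.
```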
As the KV cache grows, it imposes a hard capacity ceiling at the GPU level. Because the cache is resident in memory, it stays for the full request or conversation. GPU HBM is finite, so each ongoing request consumes a slice of HBM and only so many requests can fit, just as a bookshelf can hold only a fixed number of books of a certain size. Capacity is therefore governed not by floating-point operations per second (FLOPS), but by the memory consumed by each request.
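A hypothetical back-of-envelope for the bookshelf point, with all figures assumed for illustration: once the weights are resident, the HBM left over divided by one request’s cache footprint caps how many requests can run at once.

```python
# All figures are assumptions chosen for illustration.
HBM_PER_GPU_GB = 80       # accelerator capacity
WEIGHTS_GB = 40           # model weights already resident in HBM
KV_PER_REQUEST_GB = 4     # KV cache for one long-context request

free_for_cache_gb = HBM_PER_GPU_GB - WEIGHTS_GB
max_concurrent_requests = free_for_cache_gb // KV_PER_REQUEST_GB
print(max_concurrent_requests)  # 10 requests: the ceiling is set by memory, not FLOPS
```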
The decoding step is structurally memory-bound as well. During decode, the GPU is limited not by the number of tensor cores but by the speed at which they can be fed from HBM. At each step, it has to pull the model weights and the KV cache, which is not arithmetic-intensive but a constant migration of large amounts of data, so the GPU typically spends more time waiting for data to arrive than performing the computations themselves. More generally, the achievable compute constraint can be expressed as:
Achievable FLOPs <= min(Peak FLOPs, Memory BW x Arithmetic Intensity)
Because decode has low arithmetic intensity, this bound is almost always set by memory bandwidth rather than the GPU’s theoretical compute capability. In training, the constraint is memory capacity: roughly three to four times the model size must fit in HBM, even though model sizes have grown by about a factor of 400 while accelerator memory has only roughly doubled. In inference, even when the model fits, GPUs may achieve only a small percentage of peak FLOP utilization because the weights and KV cache must be streamed from HBM for every token. The imbalance is structural: over the past two decades, peak FLOPs have grown roughly 60,000 times, while DRAM and interconnect bandwidth have increased only about 100 and 30 times respectively, a disparity that underlies the modern “memory wall” in AI.
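Applied to decode, the roofline bound above makes the imbalance concrete. The sketch below uses hypothetical hardware and intensity figures; the point is simply that decode’s low arithmetic intensity keeps it far below the compute roof.

```python
# Hypothetical accelerator figures, chosen only to illustrate the bound.
PEAK_FLOPS = 1.0e15        # 1 PFLOP/s of dense compute
MEM_BW = 3.3e12            # 3.3 TB/s of HBM bandwidth

def achievable_flops(arithmetic_intensity):
    """Roofline: achievable FLOPS <= min(peak FLOPS, bandwidth * intensity)."""
    return min(PEAK_FLOPS, MEM_BW * arithmetic_intensity)

# Decode reuses each streamed byte for only a handful of operations, so its
# intensity (FLOPs per byte moved) is in the low single digits; prefill batches
# many tokens against the same weights, so its intensity is far higher.
for name, intensity in [("decode", 2.0), ("prefill", 500.0)]:
    fraction_of_peak = achievable_flops(intensity) / PEAK_FLOPS
    print(f"{name}: ~{fraction_of_peak:.0%} of peak FLOPS achievable")
# decode: ~1% of peak, prefill: ~100% -- decode is pinned to the bandwidth roof.
```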
Where Inference Actually Bottlenecks
Inference doesn’t fail because models lack compute; it slows when they run out of fast memory to work with, and that has directed how accelerator generations have evolved. H100s, for example, ship with about 80 GB of HBM and about 3.3 TB/s of bandwidth per GPU. The H200 raised that to 141 GB and 4.8 TB/s, and Blackwell-generation B200s extend it to 180-192 GB and around 8 TB/s per GPU, which aggregates to 1.44 TB of memory and about 64 TB/s of bandwidth in a DGX B200 system. The idea is simple: compute tells you how capable a GPU is, while memory bounds how much of that capability can be exercised and how quickly that intelligence can be utilized. If memory is the limiting factor, additional compute delivers diminishing returns unless paired with more capacity and bandwidth. Expanding memory unlocks higher economic value for all parties, because it allows larger context windows and more users per accelerator, and those constraints surface clearly in how LLMs behave under load.
For example, when your chatbot conversation starts to pause or slow after a long exchange, it is because long prompts trigger the prefill stage, where the model allocates GPU memory to build a KV cache for the tokens in that context. During the decode stage, the model then has to reread a growing cache for every newly generated token, which creates sustained bandwidth pressure. The larger the KV cache, the more memory is allocated and the more data must be moved for the next step of generation. That’s why short prompts feel instantaneous: the cache is small. More memory consumed per request means fewer users served concurrently, forcing the system to continually trade off context length, concurrent users, latency, and throughput.
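One way to see why long conversations slow down is a bandwidth-only lower bound on decode: the bytes that must be streamed each step (weights plus the whole KV cache) divided by HBM bandwidth. The figures below are assumptions for a single, unbatched request.

```python
# Assumed figures for a single, unbatched request; batching amortizes the
# weight reads, but the KV-cache term still grows with context length.
MEM_BW_BYTES = 3.3e12        # 3.3 TB/s of HBM bandwidth
WEIGHT_BYTES = 40e9          # 40 GB of weights streamed per generated token
KV_BYTES_PER_TOKEN = 0.5e6   # ~0.5 MB of KV cache per token of context

def min_decode_time_per_token(context_tokens):
    """Bandwidth-only lower bound: ignore compute and count bytes moved per step."""
    bytes_moved = WEIGHT_BYTES + KV_BYTES_PER_TOKEN * context_tokens
    return bytes_moved / MEM_BW_BYTES

for ctx in (1_000, 32_000, 128_000):
    ms = min_decode_time_per_token(ctx) * 1e3
    print(f"{ctx:>7}-token context: >= {ms:.1f} ms per generated token")
# The floor rises with context length: a longer cache means more data every step.
```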
As a result, conversations with a long cache may be truncated or reset by automatic capacity management. When context is truncated, the model tries to probabilistically infer the missing information rather than directing attention toward earlier tokens. Similarly, when the cache is extremely long, the decode stage must attend over a large number of competing key-value pairs, which dilutes attention over the relevant tokens and degrades the consistency of generated output. This combination of truncation and noisy attention culminates in what frustrates chatbot users far too often: hallucinations. These are not model malfunctions, as some might interpret them, but direct consequences of memory constraints under load.
Memory Stacks - and Prices Jump
HBM does not scale like conventional DRAM. It is produced by stacking multiple DRAM dies vertically using through-silicon vias (TSVs), bonding them to electrically connect each die, underfilling with insulation to manage thermal stress, testing, and finally integrating the stack alongside a GPU via advanced packaging. Because each step introduces yield loss and other capacity constraints, the viable supply base collapses to three primary manufacturers: Micron, Samsung, and SK Hynix.
As the HBM stack grows taller, yield losses compound. For example, if each individual die has a 99% yield, stacking eight dies lowers usable output to ~92%, twelve dies to ~89%, and sixteen dies to ~85%, before accounting for any additional process losses. Even with high per-die yields, moving from eight-high to twelve-high to sixteen-high stacks reduces usable output as TSV defects and interconnect failures accumulate across layers. At these heights, bonding and underfill become first-order constraints because a failure at these stages invalidates the entire stack rather than a single die, much like a multi-story building failing if one floor’s structure is unsound. HBM is also harder to qualify than commodity DRAM: dies require testing both before stacking and after the stack is completed, which makes test capacity a gating factor. These constraints cause HBM capacity to expand in discrete increments, shaping pricing downstream.
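The compounding is simply the per-die yield raised to the stack height. A minimal sketch, assuming independent die failures and ignoring bonding and underfill losses (which only make the numbers worse):

```python
def stack_yield(per_die_yield, stack_height):
    """Probability that every die in the stack is good, assuming independent failures."""
    return per_die_yield ** stack_height

for height in (8, 12, 16):
    print(f"{height}-high stack at 99% per-die yield: ~{stack_yield(0.99, height):.0%}")
# 8-high ~92%, 12-high ~89%, 16-high ~85% -- before bonding or underfill failures.
```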
Even when HBM dies are available, they cannot be deployed without advanced packaging capacity. High-end accelerators need CoWoS packaging, where the GPU and multiple HBM stacks are mounted on a silicon interposer and then integrated onto an organic substrate, a process that is slow to expand and not easily substituted. C.C. Wei, the CEO of TSMC, has repeatedly stated that CoWoS capacity is “very tight” and essentially sold out through 2026, with expansions lagging demand. The majority of this capacity is pre-allocated to large customers like Nvidia, which is estimated to consume more than 50% of available CoWoS capacity to support Blackwell. As a result, supply is gated by access to finite packaging slots rather than fabrication alone.
HBM does not clear through a liquid spot market; it is allocated through long- and short-term contracts. Importantly, this means pricing doesn’t adjust continuously but resets at negotiation boundaries. By late 2025, Micron stated that its entire 2026 HBM supply was fully priced and volume-locked, and SK Hynix and Samsung said they are intentionally moving away from multi-year contracts to capitalize on expected stepwise price increases through 2027. Under this structure, hyperscalers are prioritized while smaller buyers face rationing or delayed access. Industry reporting in late 2025 and early 2026 showed Samsung raising memory prices by up to 60%, alongside commentary pointing to 20%+ year-over-year increases for HBM under new contracts. HBM4, the next-generation HBM standard, has been reported to carry roughly 50% higher ASPs than HBM3. The result isn’t gradual inflation but discrete price jumps tied to contract resets. Once HBM capacity and key packaging slots are committed, the marginal cost of expanding compute is dictated by memory allocation, not silicon capability. Price here is a step function, not a slope.
The economics make the shift explicit: according to Epoch AI’s BOM estimates, HBM alone contributes $2,900 and advanced packaging $1,000 out of a $6,400 total production cost for a B200, roughly 60% of the total. Packaging yield losses add roughly $1,000 more, eating into margins. The marginal cost of deploying the next unit of compute is therefore driven not primarily by transistors or logic efficiency, but by access to memory. In this regime, $ / usable-FLOP is dictated by memory economics, which makes memory the price-setting input for scalable AI compute.
Let’s consider a cost waterfall in which HBM pricing rises 30%, well within contract step changes. Using the rough prices above, this raises the HBM cost per GPU by about $870, and with higher dollar-value packaging yield costs included, roughly $1,100 to $1,200 per unit. At the system level, an eight-GPU server absorbs about $8,000 to $10,000 of incremental cost, and a 10,000-GPU cluster requires roughly $10 to $12 million of additional capital. Because demand for frontier accelerators is relatively price-inelastic, these cost increases are rarely absorbed at the margin. The three primary memory suppliers capture the initial price reset, GPU vendors preserve high gross margins through higher ASPs, and the remaining cost trickles down to hyperscalers and data-center operators. Since pricing and procurement happen at discrete boundaries, these adjustments reprice deployments all at once rather than smoothly, creating abrupt step changes in $ / usable-FLOP and making memory pricing a first-order financial exposure.
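A sketch of that waterfall using the Epoch AI BOM figures cited above, an assumed 30% contract step, and an assumed per-unit packaging-yield adder consistent with the ranges in the text:

```python
# BOM figures from the Epoch AI estimates cited above; the price step and
# yield adder are assumptions consistent with the ranges in the text.
HBM_COST_PER_GPU = 2_900
PACKAGING_COST = 1_000
TOTAL_BOM = 6_400
HBM_PRICE_STEP = 0.30        # assumed 30% contract reset
YIELD_ADDER = 300            # assumed extra per-GPU packaging-yield cost

memory_share = (HBM_COST_PER_GPU + PACKAGING_COST) / TOTAL_BOM
per_gpu_increment = HBM_COST_PER_GPU * HBM_PRICE_STEP + YIELD_ADDER

print(f"memory + packaging share of BOM: {memory_share:.0%}")
print(f"per GPU:            ~${per_gpu_increment:,.0f}")
print(f"8-GPU server:       ~${per_gpu_increment * 8:,.0f}")
print(f"10,000-GPU cluster: ~${per_gpu_increment * 10_000:,.0f}")
# One contract reset reprices the entire deployment at once: a step, not a slope.
```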
The Risk Beneath the Stack
What emerges from the prior sections is less a cost story than a risk story. Memory has become price-setting for usable compute. It is supplied through allocation to larger players, repriced at contract boundaries, and constrained by physical bottlenecks that do not scale smoothly. This combination produces volatility that is discrete, not continuous. When costs reset, the impact propagates through GPU pricing and deployment budgets all at once. In that sense, memory no longer behaves like a background component cost but like a volatile input whose price determines whether compute can be deployed on acceptable terms.
That exposure is broadly unhedged. Hyperscalers and data-center operators commit to multi-year build plans and fixed deployment targets, yet face step-function swings in memory-driven capital intensity. Neo-clouds and GPU service providers sell compute under relatively stable pricing models while their upstream costs reprice episodically. Lenders and financing partners underwrite assets and cash flows assuming predictable replacement costs, even as those costs increasingly hinge on memory availability and contract timing. In this environment, HBM begins to behave economically less like a depreciating component and more like a scarce commodity input: its value is sustained by constrained supply and rising demand from successive compute generations rather than eroded by them.
When a non-substitutable input becomes volatile and price-setting, it ceases to be an engineering variable and becomes a financial one. Markets typically respond to such conditions by creating mechanisms to price, transfer, and hedge that risk, separating operational execution from exposure to upstream shocks. As memory increasingly determines the economics of scalable compute, the absence of such mechanisms becomes a growing mismatch between how AI infrastructure is built and how its risks are managed.