Published January 25, 2026 | Version 1.0.0 | Publication | Open
GPU-64: A 64-bit Inference GPU with Native O(1) KV-Cache for Edge LLM Deployment
Description
GPU-64 is a power-efficient 64-bit GPU architecture optimized for Large Language Model (LLM) inference.
Key innovations:
- Content-Addressable Memory (CAM) based KV-Cache with O(1) lookup latency
- 16,384 KV entries per SM (4× more than GPU-256)
- 8×8 tensor cores for FP16/INT8 inference (see the sketch after this list)
- 75 W TDP for edge/mobile deployment
- 4× inference speedup over traditional GPUs
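To make the tensor-core bullet concrete, here is a minimal Python sketch of one 8×8 matrix-multiply-accumulate tile. Only the 8×8 tile size and FP16 input type come from the feature list; the FP32 accumulation convention and the name `tensor_core_mma_8x8` are assumptions for illustration, not confirmed details of GPU-64 (an INT8 path would be analogous, accumulating in INT32).

```python
import numpy as np

def tensor_core_mma_8x8(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    """One 8x8 tensor-core tile operation: D = A @ B + C."""
    assert a.shape == b.shape == c.shape == (8, 8)
    # FP16 operands widened to FP32 for accumulation (assumed convention,
    # common in inference tensor cores).
    return a.astype(np.float32) @ b.astype(np.float32) + c.astype(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8)).astype(np.float16)
b = rng.standard_normal((8, 8)).astype(np.float16)
c = np.zeros((8, 8), dtype=np.float32)
d = tensor_core_mma_8x8(a, b, c)   # D has shape (8, 8), dtype float32
```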
The architecture uses compact 64-bit registers (KEY[32] + VALUE[32]), allowing 4× more KV-Cache entries than GPU-256 and making it well suited to long-context LLM inference on edge devices.
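As an illustration of that entry layout, here is a minimal Python sketch in the spirit of the published emulator. The 64-bit KEY[32] + VALUE[32] split and the 16,384-entry capacity come from the description above; the names (`pack_entry`, `CamKvCache`) and the full-cache error behavior are assumptions. A plain dict stands in for the hardware CAM, whose parallel match lines give the O(1) lookup.

```python
from __future__ import annotations

MAX_ENTRIES = 16_384      # KV entries per SM, from the description
KEY_MASK = 0xFFFF_FFFF    # lower 32 bits

def pack_entry(key: int, value: int) -> int:
    """Pack a 32-bit key and a 32-bit value into one 64-bit register."""
    return ((key & KEY_MASK) << 32) | (value & KEY_MASK)

def unpack_entry(entry: int) -> tuple[int, int]:
    """Split a 64-bit register back into (key, value)."""
    return (entry >> 32) & KEY_MASK, entry & KEY_MASK

class CamKvCache:
    """Software stand-in for the per-SM CAM: lookup cost is O(1)
    regardless of how many of the 16,384 slots are occupied."""

    def __init__(self) -> None:
        self._slots: dict[int, int] = {}   # key -> packed 64-bit entry

    def insert(self, key: int, value: int) -> None:
        if key not in self._slots and len(self._slots) >= MAX_ENTRIES:
            raise RuntimeError("CAM full: 16,384 entries per SM")
        self._slots[key] = pack_entry(key, value)

    def lookup(self, key: int) -> int | None:
        entry = self._slots.get(key)       # O(1), like a CAM match line
        if entry is None:
            return None
        return unpack_entry(entry)[1]

# Usage: store and retrieve one entry.
cache = CamKvCache()
cache.insert(key=0x1234ABCD, value=0xDEADBEEF)
assert cache.lookup(0x1234ABCD) == 0xDEADBEEF
```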
An RTL implementation and a Python emulator are available on GitHub.
Files
| Name | Checksum | Size |
|---|---|---|
| gpu64.pdf | md5:e6d10697a1e77c6c6b6e0798a229942a | 172.0 kB |
Additional details
Related works
- Is supplemented by: https://github.com/Complexity-ML/gpu64-inference (code repository with the RTL implementation and Python emulator)