
Scaling FastAPI for Inference

2024-10-15 · 5 min

Concurrency patterns, batching strategies, and caching to keep inference under 1s without setting the server on fire.

Low-latency inference depends on batching, proper async I/O, and fast serialization. I treat the API as a performance-critical system, because it is.
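The batching half of that can be sketched with plain asyncio: concurrent requests land on a queue, a worker drains them into a micro-batch, and the model runs once per batch instead of once per request. Everything here (`MicroBatcher`, the `max_batch`/`max_wait` knobs, the stand-in model) is illustrative, not the post's actual implementation.

```python
import asyncio


class MicroBatcher:
    """Coalesce concurrent requests into one batched model call (sketch)."""

    def __init__(self, model_fn, max_batch=8, max_wait=0.01):
        self.model_fn = model_fn    # batched callable: list of inputs -> list of outputs (assumption)
        self.max_batch = max_batch  # flush when this many requests are queued
        self.max_wait = max_wait    # ...or when the oldest request has waited this long
        self.queue = asyncio.Queue()

    def start(self):
        # Background worker that drains the queue into batches.
        asyncio.create_task(self._loop())

    async def _loop(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]          # block until at least one request
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.model_fn([x for x, _ in batch])  # one forward pass for the whole batch
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

    async def infer(self, x):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut                              # resolves when the batch is processed


async def demo():
    # Toy "model" that doubles each input, applied to one batch of 5 requests.
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs])
    batcher.start()
    return await asyncio.gather(*(batcher.infer(i) for i in range(5)))
```

Inside a FastAPI handler you would `await batcher.infer(payload)`; the `max_wait` knob trades a few milliseconds of queueing for much better GPU utilization.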

Caching is non-negotiable for repeated queries. Redis with TTLs and request deduplication reduces p95 latency materially.
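The deduplication pattern is worth spelling out: when ten identical queries arrive while the first is still computing, the other nine should wait on the in-flight result rather than recompute it. A minimal in-process sketch follows; in production the cache dict would be Redis (`SET key value EX ttl`), and all names here are hypothetical.

```python
import asyncio
import time


class TTLCacheSingleFlight:
    """TTL cache plus single-flight dedup of concurrent identical requests.
    A plain dict stands in for Redis here to illustrate the pattern."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self.cache = {}     # key -> (expires_at, value)
        self.inflight = {}  # key -> Future for the request currently computing

    async def get_or_compute(self, key, compute):
        hit = self.cache.get(key)
        if hit and hit[0] > time.monotonic():
            return hit[1]                       # fresh cache hit
        if key in self.inflight:
            return await self.inflight[key]     # piggyback on the in-flight call
        fut = asyncio.get_running_loop().create_future()
        self.inflight[key] = fut                # registered before any await: no race
        try:
            value = await compute()
            self.cache[key] = (time.monotonic() + self.ttl, value)
            fut.set_result(value)
            return value
        except Exception as exc:
            fut.set_exception(exc)              # propagate failure to waiters too
            raise
        finally:
            del self.inflight[key]


CALLS = 0


async def slow_model():
    global CALLS
    CALLS += 1                                  # count real computations
    await asyncio.sleep(0.01)
    return "result"


async def demo():
    cache = TTLCacheSingleFlight(ttl=1.0)
    outs = await asyncio.gather(*(cache.get_or_compute("q", slow_model) for _ in range(10)))
    return outs, CALLS
```

Ten concurrent identical queries trigger exactly one model call; the rest are served from the in-flight future, which is where much of the p95 improvement comes from.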

I profile CPU/GPU utilization and scale horizontally using containerized workers behind a thin API gateway.
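Concretely, the "containerized workers" part usually reduces to a process manager per container plus replica count at the orchestrator level. A typical invocation (module path `app:app` and worker count are placeholders to size against your profiling numbers):

```shell
# Run FastAPI under gunicorn with uvicorn workers; one container, N processes.
# Horizontal scaling then means more container replicas behind the gateway.
gunicorn app:app \
  --worker-class uvicorn.workers.UvicornWorker \
  --workers 4 \
  --bind 0.0.0.0:8000
```

For GPU-bound inference, one worker per GPU is the usual starting point; more processes than devices just adds contention.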

Also: if you don’t cap payload sizes, someone will inevitably try to upload the internet.
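The cap can live in a tiny ASGI middleware, which FastAPI apps accept directly. This sketch only checks the declared `Content-Length`; chunked or streaming uploads would additionally need a wrapped `receive()` that counts bytes. The 10 MiB limit is an arbitrary placeholder.

```python
import asyncio


class BodySizeLimit:
    """ASGI middleware rejecting oversized payloads before the app runs (sketch)."""

    def __init__(self, app, max_body=10 * 1024 * 1024):  # 10 MiB cap (assumption)
        self.app = app
        self.max_body = max_body

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            headers = dict(scope.get("headers") or [])
            declared = headers.get(b"content-length")
            if declared is not None and int(declared) > self.max_body:
                # Short-circuit with 413 Payload Too Large; the app never sees it.
                await send({"type": "http.response.start", "status": 413,
                            "headers": [(b"content-type", b"text/plain")]})
                await send({"type": "http.response.body", "body": b"payload too large"})
                return
        await self.app(scope, receive, send)


async def demo():
    sent = []

    async def app(scope, receive, send):
        sent.append({"type": "app"})            # would mark that the app ran

    async def send(message):
        sent.append(message)

    wrapped = BodySizeLimit(app, max_body=100)
    await wrapped({"type": "http", "headers": [(b"content-length", b"5000")]},
                  None, send)
    return sent
```

With FastAPI you would install it as `app.add_middleware(BodySizeLimit, max_body=...)`, in front of any body parsing.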