Research
Scaling FastAPI for Inference
2024-10-15 · 5 min
Concurrency patterns, batching strategies, and caching to keep inference under 1s without setting the server on fire.
Low-latency inference depends on batching, proper async I/O, and fast serialization. I treat the API as a performance-critical system because it is.
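The most common async mistake is a blocking model call inside an `async def` endpoint, which stalls the whole event loop. A minimal sketch of the offloading pattern using plain `asyncio`, with a hypothetical `predict` standing in for the real model:

```python
import asyncio
import time

def predict(x: int) -> int:
    # Hypothetical blocking model call; the sleep simulates inference time.
    time.sleep(0.01)
    return x * 2

async def handler(x: int) -> int:
    # Offload the blocking call to a worker thread so the event loop stays free.
    return await asyncio.to_thread(predict, x)

async def main():
    # Ten concurrent requests overlap instead of serializing at ~10ms each.
    return await asyncio.gather(*(handler(i) for i in range(10)))
```

In FastAPI the same rule applies: either declare the endpoint as a plain `def` (it runs in a threadpool) or offload explicitly as above; never call a blocking model directly from `async def`.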
Caching is non-negotiable for repeated queries. Redis with TTLs and request deduplication reduces p95 latency materially.
I profile CPU/GPU utilization and scale horizontally using containerized workers behind a thin API gateway.
Also: if you don’t cap payload sizes, someone will inevitably try to upload the internet.
- Batch requests when the model allows it.
- Use async endpoints and avoid blocking calls.
- Cache embeddings and frequent responses.
- Instrument p50/p95 latency for regressions.
- Protect your API from creative input sizes.
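The batching bullet can be sketched as a micro-batcher: callers enqueue requests and await their own futures, while a background task flushes after a size limit or a short wait window. `predict_batch` and the limits are placeholders for the real model and its tuning:

```python
import asyncio

def predict_batch(xs: list[int]) -> list[int]:
    # Stand-in for the real batched model call.
    return [x * 2 for x in xs]

class MicroBatcher:
    """Collects requests for up to max_wait seconds, answers them in one batch."""

    def __init__(self, max_batch: int = 8, max_wait: float = 0.005):
        self.max_batch = max_batch   # assumed limits: tune to the model
        self.max_wait = max_wait
        self.queue: asyncio.Queue = asyncio.Queue()

    async def run(self):
        # Background task: drain the queue into batches and resolve futures.
        while True:
            batch = [await self.queue.get()]   # block for the first request
            loop = asyncio.get_running_loop()
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs, futures = zip(*batch)
            for fut, out in zip(futures, predict_batch(list(inputs))):
                fut.set_result(out)

    async def infer(self, x):
        # Each caller enqueues its input and awaits its own future.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut
```

Start `run()` as a task at application startup and call `infer()` from the endpoint; each caller still sees a single request/response, but the GPU sees batches.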