Guide: Token generation speed
What is this tool?
A free LLM inference speed calculator and token generation latency estimator. Enter prefill (prompt) and decode throughput in tokens per second, plus how many prompt tokens and how many new tokens to generate. The tool shows prefill time, decode time, total latency, and an effective average tok/s over the whole request. Use it for what-if planning — not as a benchmark. Runs in your browser.
Prefill vs decode
Prefill processes the full prompt in parallel across the sequence and is typically compute-bound: large matrix multiplies over many tokens at once. Decode generates one token at a time (autoregressive) and is typically memory-bandwidth-bound, since the weights and KV cache must be read on every step. Real stacks report different tok/s for each phase — this tool keeps them separate so you can match numbers from your profiler or vendor docs.
- High prefill, low decode — Long prompts feel slow to "start" even if streaming is steady afterward.
- Low prefill, high decode — Short prompts start fast; long answers still take time.
How timing works
With prompt token count P, new tokens N, prefill tok/s Tp, and decode tok/s Td:
- Prefill time ≈ P / Tp
- Decode time ≈ N / Td
- Total ≈ prefill time + decode time (no overlap modeled)
- Effective average tok/s ≈ (P + N) / total seconds
Production systems pipeline, batch, and use speculative decoding — this model is intentionally simple.
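The model above can be sketched in a few lines of Python (function and variable names are mine, not the tool's; the clamp mirrors the divide-by-zero guard described in the FAQ):

```python
def estimate_latency(prompt_tokens, new_tokens, prefill_tps, decode_tps):
    # Clamp throughputs to at least 1 tok/s to avoid division by zero.
    prefill_tps = max(prefill_tps, 1.0)
    decode_tps = max(decode_tps, 1.0)
    prefill_s = prompt_tokens / prefill_tps
    decode_s = new_tokens / decode_tps
    total_s = prefill_s + decode_s  # no overlap modeled
    effective_tps = (prompt_tokens + new_tokens) / total_s if total_s else 0.0
    return prefill_s, decode_s, total_s, effective_tps

# 2000-token prompt at 800 tok/s prefill, 256 new tokens at 40 tok/s decode:
# prefill 2.5 s, decode 6.4 s, total 8.9 s, effective ~253.5 tok/s
```

The numbers in the comment are illustrative, not measurements.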
Features
- Presets — CPU-ish prefill; laptop / desktop / strong GPU (7B Q4, 13B Q4); server 70B multi-GPU (illustrative tok/s pairs).
- Custom throughput — Edit prefill and decode tok/s directly.
- Token counts — Prompt tokens and new tokens to generate.
- Results — Prefill s, decode s, total s, effective average tok/s.
How to use
- Pick a preset or type your measured prefill and decode tok/s.
- Set prompt tokens — From a tokenizer or rough estimate.
- Set new tokens — Expected completion length (max output cap, typical reply, etc.).
- Read times — See which phase dominates for your scenario.
Use cases
| Scenario | How this helps |
|---|---|
| RAG / long system prompt | Large P inflates prefill time — compare before trimming context. |
| Chat UI SLA | Estimate time to first token vs time to finish a 256-token reply. |
| Hardware comparison | Plug in tok/s from two GPUs and compare total latency for the same P and N. |
| Education | See why prefill and decode are not interchangeable metrics. |
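For the chat-SLA row, time to first token can be approximated as prefill time plus one decode step (the tool itself reports prefill, decode, and total; this split is a hand calculation with illustrative numbers):

```python
P, N = 1500, 256          # prompt tokens, reply tokens (illustrative)
Tp, Td = 600.0, 45.0      # prefill and decode tok/s (illustrative)
ttft = P / Tp + 1 / Td    # time to first token: prefill plus one decode step
finish = P / Tp + N / Td  # time to finish the full 256-token reply
# ttft ~2.52 s, finish ~8.19 s; trimming the prompt mainly improves ttft
```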
Limits & disclaimers
- Presets are fictional ballparks — Not measured on your machine or model.
- No batching, speculation, or overlap — Real servers hide latency with concurrency.
- Quantization & context affect tok/s — Use numbers from your actual run.
- Benchmark with your stack — vLLM, llama.cpp, TensorRT-LLM, etc. each report differently.
Related terms
People often search for LLM tokens per second calculator, inference latency calculator, prefill vs decode time, time to first token estimate, generation speed calculator, LLM throughput estimator, decode tokens per second, and prompt processing speed. This page splits prefill and decode and sums simple seconds for planning.
FAQ
Is this token speed simulator free?
Yes. It runs client-side in your browser.
Why is my effective tok/s lower than decode tok/s?
Effective average divides total tokens (prompt + output) by total time. When prefill throughput is lower than decode throughput, the prefill phase pulls the average below the decode rate even if decode is fast.
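A worked example with made-up numbers, where prefill tok/s is lower than decode tok/s (e.g., CPU-heavy prompt processing):

```python
P, N = 1000, 100
Tp, Td = 100.0, 200.0        # prefill slower than decode (illustrative)
total = P / Tp + N / Td      # 10.0 s + 0.5 s = 10.5 s
effective = (P + N) / total  # ~104.8 tok/s, well below the 200 tok/s decode rate
```

Note that with this definition, when prefill tok/s exceeds decode tok/s the effective average can land above the decode rate, since prompt tokens count toward the numerator.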
Can I use this as a benchmark?
No — use your framework’s benchmarks and hardware measurements. This is a calculator from numbers you supply.
What if prefill or decode is zero tokens?
Zero prompt tokens makes prefill time 0; zero new tokens makes decode time 0. Throughput inputs are clamped to at least 1 tok/s to avoid divide-by-zero.
Conclusion
Use Token generation speed to reason about prefill vs decode latency from tok/s and token counts. For VRAM planning, use LLM RAM / VRAM; for token counts, use Token calculator or Token & context budget; for wall-clock conversions, see Unix time.