Your files. Your model. Your machine. Nothing leaves the room.
The most dangerous assumption you can bring to any AI system:
It does not. There is always a bottleneck between your document and the model.
Most people assume "free AI" means one thing. It does not. These are opposites.
These lessons do not change when you switch from Phi-4 Mini to ChatGPT, Copilot, or Claude. They are not lessons about local models. They are lessons about how language models process context.
Common errors: task has two verbs (pick one) · no output format (you lose control) · no constraints (model fills gaps) · generic system context (give it a specific role)
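The four parts implied by that list (a specific role, a single task, constraints, an output format) can be sketched as a small prompt builder. This is an illustrative sketch, not part of any tool named here: `PromptSpec`, `build_prompt`, `lint`, and the `GENERIC_ROLES` set are hypothetical names chosen for the example. The "two verbs" error is not checked automatically; it still needs a human read.

```python
from dataclasses import dataclass, field

# Hypothetical examples of roles too generic to steer the model.
GENERIC_ROLES = {"", "assistant", "helpful assistant", "ai"}


@dataclass
class PromptSpec:
    role: str                      # specific system context
    task: str                      # one job, one verb
    output_format: str             # what the answer should look like
    constraints: list[str] = field(default_factory=list)


def build_prompt(spec: PromptSpec) -> str:
    """Assemble the four parts into one plain-text prompt."""
    lines = [f"Role: {spec.role}", f"Task: {spec.task}"]
    if spec.constraints:
        lines.append("Constraints:")
        lines.extend(f"- {c}" for c in spec.constraints)
    lines.append(f"Output format: {spec.output_format}")
    return "\n".join(lines)


def lint(spec: PromptSpec) -> list[str]:
    """Return one warning per common error found in the spec."""
    warnings = []
    if spec.role.strip().lower() in GENERIC_ROLES:
        warnings.append("generic system context: give it a specific role")
    if not spec.output_format.strip():
        warnings.append("no output format: you lose control")
    if not spec.constraints:
        warnings.append("no constraints: model fills gaps")
    return warnings


if __name__ == "__main__":
    spec = PromptSpec(
        role="Contracts paralegal reviewing NDAs",
        task="Summarize the attached NDA",
        output_format="5 bullet points",
        constraints=["Quote clause numbers", "No legal advice"],
    )
    print(build_prompt(spec))
    print(lint(spec))  # an empty list means none of the checked errors were found
```

Running `lint` before pasting a prompt into the chat window is a cheap way to catch three of the four errors; the same check applies unchanged whether the prompt goes to a local model or a hosted one.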
| Tool | Role | Why this choice |
|---|---|---|
| LM Studio | Local model GUI | Desktop app. No Docker. No terminal. Built-in document attach. CPU inference out of the box. |
| Phi-4 Mini Q4_K_M | Primary model | 3.8B parameters. 2.3 GB on disk. Best CPU-only model in 2026. 8–12 tok/s on a 13th-gen i7. |
| Qwen2.5 3B Q4_K_M | Backup model | If Phi-4 Mini has issues on any machine. Similar speed profile. |
| Ollama + qwen2.5-coder:1.5b | Coding demo only | VS Code extension block. Optional. Requires pre-install. |

| Model | Speed | Verdict |
|---|---|---|
| Phi-4 Mini Q4_K_M (3.8B) | 8–12 tok/s | ✓ Usable |
| Qwen2.5 3B Q4_K_M | 8–10 tok/s | ✓ Backup |
| Qwen2.5 7B | 2–4 tok/s | ✗ Too slow for live |

A paragraph response takes 20–30 seconds. Use the wait time to read the output carefully. That is not wasted time. That is evaluation time.
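The arithmetic behind that wait time is simply tokens divided by generation speed. The sketch below assumes a paragraph is about 240 tokens, a figure chosen for illustration because it reproduces the 20–30 second range at 8–12 tok/s; `wait_range` and `PARAGRAPH_TOKENS` are hypothetical names, and real responses vary in length.

```python
PARAGRAPH_TOKENS = 240  # assumption: rough length of a one-paragraph answer


def wait_range(tokens: int, slow_tok_s: float, fast_tok_s: float) -> tuple[float, float]:
    """Return (best case, worst case) generation time in seconds."""
    if slow_tok_s <= 0 or fast_tok_s <= 0:
        raise ValueError("generation speed must be positive")
    return tokens / fast_tok_s, tokens / slow_tok_s


if __name__ == "__main__":
    # Speeds taken from the table above.
    for name, slow, fast in [
        ("Phi-4 Mini Q4_K_M", 8, 12),
        ("Qwen2.5 7B", 2, 4),
    ]:
        best, worst = wait_range(PARAGRAPH_TOKENS, slow, fast)
        print(f"{name}: {best:.0f}-{worst:.0f} s per paragraph")
    # Phi-4 Mini Q4_K_M: 20-30 s per paragraph
    # Qwen2.5 7B: 60-120 s per paragraph
```

Under the same assumption, the 7B model needs one to two minutes per paragraph, which is why it is marked too slow for a live session.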