
Model Recommendations

Which model to pick across local and cloud — by RAM budget, by use case, by language, by cost.

Companion's AI assistant works with any model your provider supports. This page recommends models by use case so you don't have to evaluate the entire ecosystem yourself.

Local — by RAM budget

For local providers (Ollama, LM Studio):

| Available RAM | Recommended model | Why |
| --- | --- | --- |
| 8 GB | qwen2.5:3b or llama3.2:3b | Smallest practical models with tool calling |
| 16 GB | qwen2.5:7b or llama3.1:8b | Sweet spot — good quality, fast on Apple Silicon / mid GPU |
| 32 GB | qwen2.5:14b or mistral-small | Noticeably more capable, especially for complex tool chains |
| 64 GB+ | qwen2.5:32b or llama3.1:70b (quantised) | Top-tier local quality; requires patience or a strong GPU |

If you just want a starting point, qwen2.5:7b is the broad recommendation: good tool-calling support, multilingual, and it runs well on a typical modern laptop.
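
With Ollama, that starting point is a single pull:

```
# downloads the model (roughly 4-5 GB) and makes it available locally
ollama pull qwen2.5:7b
```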

Cloud — by provider

OpenAI

| Model | Best for | Approximate cost |
| --- | --- | --- |
| gpt-4o-mini | Default. Cheap, capable, fast. | Cents per typical chat |
| gpt-4o | When you need top quality | Several cents per chat |
| o1-mini / o3-mini | Reasoning-heavy queries | Higher per-response cost |
| o1 / o3 | Hardest reasoning tasks | Significantly more |

For most Companion workflows, gpt-4o-mini is the right default. Bump to gpt-4o when complex tool chains start failing.

Anthropic

| Model | Best for | Approximate cost |
| --- | --- | --- |
| claude-haiku-4-5 | Default. Fast, cheap, good at tool calling. | Cents per typical chat |
| claude-sonnet-4-5 | Default for serious work. Excellent reasoning. | Several cents per chat |
| claude-opus-4-5 | Top-tier reasoning when you need it | Significantly more |

For Companion, Claude Sonnet 4.5 is the sweet-spot recommendation when budget allows. Claude Haiku 4.5 is a great budget pick for routine queries.

OpenRouter

OpenRouter exposes hundreds of models. Worth trying:

| Model | Why |
| --- | --- |
| qwen/qwen-2.5-72b-instruct | Cheap, capable, strong at tool calling |
| anthropic/claude-sonnet-4-5 | Same Sonnet via OpenRouter (slightly different pricing) |
| openai/gpt-4o-mini | Same gpt-4o-mini via OpenRouter |
| meta-llama/llama-3.3-70b-instruct | Open weights via cloud, good quality |
| deepseek/deepseek-chat | Surprisingly strong cheap option |
| google/gemini-2.5-pro | Long context, alternative reasoning style |

The big win with OpenRouter is trying many models without creating separate provider accounts. Per-model pricing is shown on openrouter.ai/models.
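
Outside Companion, trying a model is just a change to one request field. A minimal sketch against OpenRouter's OpenAI-compatible chat endpoint (the model and prompt here are only examples; assumes your key is in OPENROUTER_API_KEY):

```
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek/deepseek-chat",
        "messages": [{"role": "user", "content": "What is RSSI?"}]
      }'
```

Swap the "model" value for any other ID from the table above to compare answers without touching the rest of the request.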

By use case

Quick conversational Q&A

Pick the smallest / cheapest model that runs comfortably:

  • Local: 3B-7B model.
  • Cloud: gpt-4o-mini / claude-haiku-4-5.

A small model answers "what is RSSI" in two seconds; a large model takes ten and gives the same answer.

Complex tool chains

Larger models plan multi-step tool calls more reliably:

  • Local: 14B+ models.
  • Cloud: gpt-4o, claude-sonnet-4-5+, o1.

For investigations where the assistant chains four or five tools, the better planners are worth the cost.

Code generation (HID scripts, automation)

| Provider | Pick |
| --- | --- |
| Local | qwen2.5-coder:7b or qwen2.5-coder:14b |
| Cloud (OpenAI) | gpt-4o |
| Cloud (Anthropic) | claude-sonnet-4-5 |
| OpenRouter | deepseek/deepseek-coder (budget pick) |

Code-tuned local models trade some general conversational ability for better code output. Cloud models are generally good at code without specialisation.
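
For local use, the coder variants install like any other Ollama model:

```
# code-tuned variant of qwen2.5; trades some chat ability for better code
ollama pull qwen2.5-coder:7b
```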

Multilingual (German, French, Spanish, etc.)

| Provider | Pick |
| --- | --- |
| Local | Qwen family (7B+); EuroLLM for German |
| Cloud | Any major cloud model — they all handle European languages well |

For best results, set the system prompt to instruct the model to respond in your language.
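
If you want to sanity-check this outside Companion, here is a quick sketch against Ollama's chat API with a German system prompt (model and wording are just examples):

```
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [
    {"role": "system", "content": "Antworte immer auf Deutsch."},
    {"role": "user", "content": "Was ist RSSI?"}
  ],
  "stream": false
}'
```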

Reasoning-heavy queries

The "thinking" / "reasoning" model variants trade speed for accuracy on hard questions:

  • Local: DeepSeek-R1 distills, Qwen QwQ.
  • Cloud (OpenAI): o1, o3 family.
  • Cloud (Anthropic): claude-opus-4-5, or any Sonnet 4.5+ with extended thinking.

They're slower per response but handle multi-step reasoning noticeably better.
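
As one concrete example, Anthropic exposes extended thinking as a request option on its Messages API. A minimal sketch (the token budget is an arbitrary example; max_tokens must exceed it):

```
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
        "model": "claude-sonnet-4-5",
        "max_tokens": 16000,
        "thinking": {"type": "enabled", "budget_tokens": 10000},
        "messages": [{"role": "user", "content": "..."}]
      }'
```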

Local — model size vs. quality

A useful rule of thumb:

  • 3B models — good for greetings, simple Q&A, basic tool calls.
  • 7-8B models — usable for genuine work; the floor for serious investigation.
  • 14B models — clearly better at complex tasks; slower.
  • 30B+ models — much closer to cloud-LLM quality; need real hardware.

There's no substitute for trying. Pull two models, run the same investigation question through both, see which fits your patience and your need.
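
With Ollama, that head-to-head takes a couple of minutes (the question is just an example):

```
# run the same prompt through two models; compare answers and wall time
for m in qwen2.5:7b qwen2.5:14b; do
  echo "=== $m ==="
  time ollama run "$m" "What does an RSSI of -75 dBm imply about a Wi-Fi link?"
done
```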

Don't just pull the biggest model your machine can run. The fastest acceptable response is more useful than the slowest possible best response. For most users, "good enough at 5-second response time" beats "perfect at 30-second response time."

Quantisation

Local models come in different quantisation levels (Q4, Q5, Q8, FP16). Fewer bits means smaller files and faster inference, at a slight quality cost. Ollama's defaults usually pick a sensible quantisation; you rarely need to override them.

For tightly RAM-constrained setups, Q4_K_M is the standard "small but still good" choice.
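
With Ollama, a specific quantisation is just a more specific tag (exact tag names vary per model; check the model's page on the registry):

```
ollama pull qwen2.5:7b          # registry's default quantisation
ollama pull qwen2.5:7b-q4_K_M   # explicitly request the Q4_K_M build
```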

Switching models on the fly

Companion lets you switch the active model in Settings → AI → Model. The next message uses the new model.

For workflows that benefit from switching:

  • Quick conversational queries → small / cheap model.
  • Tool-chain queries → larger / more capable model.
  • Sensitive AirLeak queries → local model.
  • General help → cloud model if cost is acceptable.

Updating local models

ollama pull <model> re-pulls the latest version of a model. Models update occasionally; the new version replaces the old one of the same name.

To keep multiple versions, use Ollama's tag syntax: ollama pull qwen2.5:7b-q4_K_M keeps the q4 variant separate from the default qwen2.5:7b.

Local — disk usage

Each local model takes from a couple of gigabytes to tens of gigabytes on disk:

  • 3B model: ~2 GB
  • 7B model: ~4-5 GB
  • 14B model: ~8-10 GB
  • 30B+ model: ~20+ GB

Models live under Ollama's data directory (or your runner's equivalent). Delete unused models with ollama rm <model>.
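
Checking and reclaiming space is one command each:

```
ollama list            # installed models with their on-disk sizes
ollama rm qwen2.5:14b  # remove a model you no longer use
```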

Cost — cloud providers

For cloud providers, Companion shows per-response cost estimates. Rough rules of thumb:

| Pattern | Approximate cost |
| --- | --- |
| Single conversational message, mini-tier model | Cents |
| Long conversation, top-tier model, lots of tool calls | Tens of cents to a few dollars |
| Multi-day investigation with tool chains | Single-digit dollars over the period |

Reasoning models (o1, o3, Claude with extended thinking) cost notably more per response — sometimes 5-10x — because they generate large internal reasoning traces.

For budget-tight users, local Ollama with a 7B model is free and sufficient for most workflows. For users with budget to spare who are comfortable sending queries to the cloud, claude-sonnet-4-5 or gpt-4o is the sweet spot.

Capability detection

Different models — even from the same provider — support different capabilities:

| Capability | Why it matters |
| --- | --- |
| Tool calling | Required for the assistant to drive Companion or call MCP tools |
| Vision | Image inputs (not currently used by Companion's flows) |
| Streaming | Token-by-token rendering |
Companion shows the active model's capabilities under Settings → AI. Some smaller or older models lack tool calling: they work for chat but can't operate Companion's tools.
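
With Ollama you can also check from the CLI; recent builds print a capabilities section for each model (assumption: your Ollama version is new enough to report it):

```
ollama show qwen2.5:7b
# look for "tools" under Capabilities; without it the model can chat
# but cannot operate Companion's tools
```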
