See every token/sec, every GB of RAM, every local call
VirexaLLM surfaces decode speed, memory footprint, eval scores, and per-request traces for every model running on your machine — all computed locally, all inspectable in the desktop app.
Observability for on-device inference
tok/sec
Live Throughput
Watch decode speed for every running model in real time
RAM / VRAM
Memory Profile
See how much memory a loaded model really costs
Local
Quality Evals
Run eval suites on-device against your own datasets
Per-Request
Traces
Prompt, completion, model hash, and timings for every call
What the analytics layer shows you
From device-wide performance to single-request traces — without leaving the app.
Performance Dashboards
Tokens per second, prompt ingest speed, time-to-first-token, and steady-state throughput charted per model on your hardware.
Memory Insights
Exact RAM and VRAM used by each loaded model, including KV cache growth. Pick the quant that actually fits your machine; a back-of-envelope sketch of the KV-cache math follows below.
Response Quality Evals
Plug a test set into VirexaLLM and score local models on accuracy, refusal rate, or custom rubrics — all computed on-device.
Per-Model Breakdown
Compare Llama 3 vs Mistral vs Qwen on your prompts: speed, memory, and quality side by side, with no cloud upload.
Per-Device Rollups
On managed fleets, roll up metrics across every activated workstation so admins can see which models are popular and which are dragging performance down.
Per-Request Tracing
Inspect the prompt, completion, model hash, token counts, latency, and sampling parameters for every local call.
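That KV-cache line item lends itself to a quick back-of-envelope check. The sketch below uses the standard transformer KV-cache formula; the model shape it plugs in (32 layers, 8 KV heads, head dim 128) is typical of an 8B-class model and is illustrative only, not a number read out of VirexaLLM.

```typescript
// Rough KV-cache size estimate: 2 (K and V) * layers * kvHeads * headDim
// * contextTokens * bytesPerElement. Real usage varies with the runtime and
// with how the cache itself is quantized; treat this as an order-of-magnitude check.
interface ModelShape {
  layers: number;
  kvHeads: number;         // heads used for K/V (GQA models have fewer than query heads)
  headDim: number;
  bytesPerElement: number; // 2 for an fp16 cache, 1 for an 8-bit cache
}

function kvCacheBytes(shape: ModelShape, contextTokens: number): number {
  return 2 * shape.layers * shape.kvHeads * shape.headDim * contextTokens * shape.bytesPerElement;
}

// Example: an 8B-class model with 32 layers, 8 KV heads, head dim 128, fp16 cache
const example: ModelShape = { layers: 32, kvHeads: 8, headDim: 128, bytesPerElement: 2 };
const gib = kvCacheBytes(example, 8192) / 1024 ** 3;
console.log(`~${gib.toFixed(2)} GiB of KV cache at 8k context`); // ≈ 1.0 GiB
```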
Pick the right local model for your hardware
Slice throughput and memory use by model, by quantization, and by prompt length. Rolling charts make it obvious which local model is fast enough, which is too heavy, and which gives the best answer per watt.
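For reference, the two headline numbers on those charts are simple to define. The snippet below shows one common way to compute time-to-first-token and steady-state decode speed from stream timestamps; it is a generic illustration, not VirexaLLM's internal metric code.

```typescript
// Generic throughput metrics from a streamed completion.
// All timestamps are in milliseconds (e.g. from performance.now()).
interface RequestTiming {
  requestStart: number;    // when the prompt was submitted
  firstTokenAt: number;    // when the first completion token arrived
  lastTokenAt: number;     // when the final token arrived
  completionTokens: number;
}

// Time-to-first-token: prompt processing plus the first decode step.
const ttftMs = (t: RequestTiming) => t.firstTokenAt - t.requestStart;

// Steady-state decode speed: tokens emitted after the first one,
// divided by the time spent decoding them.
const decodeTokensPerSec = (t: RequestTiming) =>
  (t.completionTokens - 1) / ((t.lastTokenAt - t.firstTokenAt) / 1000);

const timing: RequestTiming = {
  requestStart: 0,
  firstTokenAt: 420,
  lastTokenAt: 5420,
  completionTokens: 251,
};
console.log(ttftMs(timing), "ms TTFT");                       // 420 ms
console.log(decodeTokensPerSec(timing).toFixed(1), "tok/s");  // 50.0 tok/s
```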
From dashboard to root cause
Overview
Tokens/sec, memory, loaded models, and recent requests — the whole device at a glance.
Drill Down
Filter by model, quantization, or prompt type to find the outlier slowing you down.
Inspect
Open any request to see the prompt, completion, sampling parameters, and timing breakdown.
Act
Swap a quant, adjust thread count, or pick a different model — all without restarting the server.
Quality evals you can run on your laptop
Drop in a dataset, pick the models to test, hit run. VirexaLLM scores accuracy, refusal rate, or your custom rubric per model — all without your eval data ever touching the internet.
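If you want a mental model of what such a run involves, here is a minimal sketch. It assumes a JSONL test set of prompt/expected pairs and a local server speaking the OpenAI-compatible /v1/chat/completions API, which many local runtimes expose; the endpoint, port, model names, and dataset format are assumptions for illustration, not VirexaLLM's documented interface.

```typescript
import { readFileSync } from "node:fs";

// Assumed dataset format: one JSON object per line, e.g.
// {"prompt": "What is 2 + 2?", "expected": "4"}
interface EvalCase { prompt: string; expected: string; }

// Hypothetical OpenAI-compatible local endpoint; adjust host, port, and model IDs
// to whatever your local runtime actually exposes.
const BASE_URL = "http://localhost:8080/v1";

async function complete(model: string, prompt: string): Promise<string> {
  const res = await fetch(`${BASE_URL}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }], temperature: 0 }),
  });
  const data = await res.json();
  return data.choices[0].message.content as string;
}

// Simple substring match against the expected answer; swap in your own rubric as needed.
async function evaluate(models: string[], datasetPath: string) {
  const cases: EvalCase[] = readFileSync(datasetPath, "utf8")
    .split("\n").filter(Boolean).map((line) => JSON.parse(line));

  for (const model of models) {
    let correct = 0;
    for (const c of cases) {
      const answer = await complete(model, c.prompt);
      if (answer.trim().includes(c.expected)) correct++;
    }
    console.log(`${model}: ${((correct / cases.length) * 100).toFixed(1)}% match`);
  }
}

evaluate(["llama-3-8b-instruct", "mistral-7b-instruct"], "testset.jsonl").catch(console.error);
```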
Per-request tracing, right on the device
Every call is a first-class object — inspectable, searchable, and never phoned home.
Prompt & Completion
Full request and response, with streaming timings and token counts preserved locally.
Model Fingerprint
Model ID, quantization, hash, context length, and sampling parameters captured on every call.
Hardware Telemetry
CPU threads, GPU layers, memory use, and decode speed broken out by phase.
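Taken together, a single trace is easy to picture as one record. The interface below is an illustrative shape assembled from the fields listed above; the field names are assumptions, not VirexaLLM's actual export schema.

```typescript
// Illustrative shape of a per-request trace, built from the fields described
// above. Field names are made up for this sketch, not an official schema.
interface LocalTrace {
  id: string;
  timestamp: string;               // ISO 8601, local clock

  // Prompt & completion
  prompt: string;
  completion: string;
  promptTokens: number;
  completionTokens: number;

  // Model fingerprint
  model: {
    id: string;                    // e.g. "llama-3-8b-instruct"
    quantization: string;          // e.g. "Q4_K_M"
    weightsHash: string;
    contextLength: number;
    sampling: { temperature: number; topP: number; seed?: number };
  };

  // Timing and hardware telemetry
  timings: { ttftMs: number; decodeTokensPerSec: number; totalMs: number };
  hardware: { cpuThreads: number; gpuLayers: number; ramBytes: number; vramBytes: number };
}
```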
Views for builders and fleet admins
Role-appropriate dashboards for individual developers and teams managing many workstations.
Developer view
Per-request traces, decode speed, memory profile, and local eval scores for the machine in front of you.
Fleet view
Aggregate model adoption, version drift, and performance across every activated workstation.
Eval view
Run datasets against loaded models and compare quality scores — ready for ML bake-offs.
API access
Every metric and trace available from a local endpoint so you can pipe it into your own tooling.
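As a sketch of what piping metrics into your own tooling can look like: the snippet below pulls recent traces from a hypothetical local endpoint and prints the slowest ones. The URL, port, path, and response shape are assumptions made up for this example; check the app for the endpoint it actually exposes.

```typescript
// Hypothetical example: query a local metrics endpoint and list the slowest
// recent requests. The path, port, and JSON shape are assumptions, not the
// documented VirexaLLM API.
interface TraceSummary { id: string; model: string; totalMs: number; decodeTokensPerSec: number; }

async function slowestRequests(limit = 5): Promise<TraceSummary[]> {
  const res = await fetch("http://localhost:41700/api/traces?last=100"); // assumed endpoint
  if (!res.ok) throw new Error(`metrics endpoint returned ${res.status}`);
  const traces: TraceSummary[] = await res.json();
  return traces.sort((a, b) => b.totalMs - a.totalMs).slice(0, limit);
}

slowestRequests().then((slow) => {
  for (const t of slow) {
    console.log(`${t.model}  ${t.totalMs} ms  ${t.decodeTokensPerSec.toFixed(1)} tok/s  (${t.id})`);
  }
}).catch(console.error);
```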
Frequently asked questions
Does data leave the device for analytics?
Can I export the data?
How do I know which quant to pick?
Can I run eval harnesses locally?
Can I see individual requests?
Your laptop is the server now
Download VirexaLLM and run Llama, Mistral, Phi-3, Gemma, or Qwen locally in minutes. Free desktop app for macOS, Windows, and Linux — your prompts never leave the device.