See every token/sec, every GB of RAM, every local call

VirexaLLM surfaces decode speed, memory footprint, eval scores, and per-request traces for every model running on your machine — all computed locally, all inspectable in the desktop app.

Observability for on-device inference

tok/sec

Live Throughput

Watch decode speed for every running model in real time

RAM / VRAM

Memory Profile

See how much memory a loaded model really costs

Local

Quality Evals

Run eval suites on-device against your own datasets

Per-Request

Traces

Prompt, completion, model hash, and timings for every call

What the analytics layer shows you

From device-wide performance to single-request traces — without leaving the app.

Performance Dashboards

Tokens per second, prompt ingest speed, time-to-first-token, and steady-state throughput charted per model on your hardware.

Memory Insights

Exact RAM and VRAM used by each loaded model, including KV cache growth. Pick the quant that actually fits your machine.

Response Quality Evals

Plug a test set into VirexaLLM and score local models on accuracy, refusal rate, or custom rubrics — all computed on-device.

Per-Model Breakdown

Compare Llama 3 vs Mistral vs Qwen on your prompts: speed, memory, and quality side by side, no cloud upload.

Per-Device Rollups

On managed fleets, roll up metrics across every activated workstation so admins can see which models are popular and which are dragging performance down.

Per-Request Tracing

Inspect the prompt, completion, model hash, token counts, latency, and sampling parameters for every local call.

Pick the right local model for your hardware

Slice throughput and memory use by model, by quantization, and by prompt length. Rolling charts make it obvious which local model is fast enough, which is too heavy, and which gives the best answer per watt.
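
As a rough illustration of the math behind that choice (not VirexaLLM's internal accounting), the sketch below ballparks whether a quantized model plus its KV cache fits in memory. The bits-per-weight figures and the 8B-model architecture constants are approximations chosen for the example.

# Back-of-the-envelope memory estimate for a quantized model plus KV cache.
# Bits-per-weight figures and architecture constants are approximate examples.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def weights_gb(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, fp16 elements by default
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Example: an 8B model with grouped-query attention at an 8K context
for quant in BITS_PER_WEIGHT:
    total = weights_gb(8.0, quant) + kv_cache_gb(32, 8, 128, 8192)
    print(f"{quant}: ~{total:.1f} GB")

VirexaLLM measures these numbers directly; the sketch just shows why quant choice dominates the footprint at typical context lengths.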

From dashboard to root cause

1

Overview

Tokens/sec, memory, loaded models, and recent requests — the whole device at a glance.

2

Drill Down

Filter by model, quantization, or prompt type to find the outlier slowing you down.

3

Inspect

Open any request to see the prompt, completion, sampling parameters, and timing breakdown.

4

Act

Swap a quant, adjust thread count, or pick a different model — all without restarting the server.

Quality evals you can run on your laptop

Drop in a dataset, pick the models to test, hit run. VirexaLLM scores accuracy, refusal rate, or your custom rubric per model — all without your eval data ever touching the internet.
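
To make that concrete, here is one plausible way to lay out a test set as JSONL and score exact matches. The field names and file name are illustrative assumptions, not VirexaLLM's documented eval schema.

# Illustrative test-set layout and a trivial scorer; field names are assumptions,
# not VirexaLLM's actual eval format.
import json

rows = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "Translate 'good morning' to Spanish.", "expected": "Buenos días"},
]

with open("my_eval_set.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

def exact_match(expected: str, actual: str) -> bool:
    # The simplest possible rubric; custom rubrics can be arbitrarily stricter.
    return expected.strip().lower() == actual.strip().lower()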

Per-request tracing, right on the device

Every call is a first-class object — inspectable, searchable, and never phoned home.

Prompt & Completion

Full request and response, with streaming timing and token counts preserved locally.

Model Fingerprint

Model ID, quantization, hash, context length, and sampling parameters captured on every call.

Hardware Telemetry

CPU threads, GPU layers, memory use, and decode speed broken out by phase.
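
Putting those three pieces together, a single trace record could look something like the sketch below. Field names and values are illustrative placeholders; the app's actual schema may differ.

# Illustrative per-request trace record; field names and values are placeholders.
example_trace = {
    "request_id": "local-0001",
    "model": {
        "id": "llama-3-8b-instruct",
        "quantization": "Q4_K_M",
        "hash": "sha256:...",        # model fingerprint
        "context_length": 8192,
    },
    "sampling": {"temperature": 0.7, "top_p": 0.9, "max_tokens": 512},
    "prompt": "...",
    "completion": "...",
    "tokens": {"prompt": 214, "completion": 187},
    "timings_ms": {"time_to_first_token": 320, "decode": 4100},
    "hardware": {"cpu_threads": 8, "gpu_layers": 33, "decode_tok_per_sec": 45.6},
}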

Views for builders and fleet admins

Role-appropriate dashboards for individual developers and teams managing many workstations.

Developer view

Per-request traces, decode speed, memory profile, and local eval scores for the machine in front of you.

Fleet view

Aggregate model adoption, version drift, and performance across every activated workstation.

Eval view

Run datasets against loaded models and compare quality scores — ready for ML bake-offs.

API access

Every metric and trace available from a local endpoint so you can pipe it into your own tooling.
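
For example, a small script could poll that endpoint and feed the results into whatever dashboard you already run. The port and path below are placeholders, not VirexaLLM's documented API; check the app's API settings for the real address.

# Hypothetical example of pulling recent traces from the local endpoint.
# The URL below is a placeholder, not a documented API path.
import json
import urllib.request

URL = "http://127.0.0.1:8080/api/traces?limit=50"

with urllib.request.urlopen(URL) as resp:
    traces = json.loads(resp.read())

for t in traces:
    print(t.get("model"), t.get("tokens_per_second"))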

Frequently asked questions

Does data leave the device for analytics?
No. All metrics are computed and stored locally. Fleet rollups only exchange anonymized aggregate counters if the admin explicitly opts in.
Can I export the data?
Yes. Every chart and trace exports to CSV or JSONL so you can pull it into your own notebooks or BI tools; a short loading sketch follows these questions.
How do I know which quant to pick?
Memory Insights shows whether each quantization (Q4_K_M, Q5_K_M, Q8_0) actually fits in your RAM, and the throughput charts show decode speed per quant on your CPU or GPU.
Can I run eval harnesses locally?
Yes. Drop in a dataset of prompts and expected outputs and VirexaLLM scores each loaded model. Useful for picking the right local model per task.
Can I see individual requests?
Yes. Every call captures prompt, completion, model hash, quantization, sampling parameters, and precise timings — inspectable right in the desktop app.
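
On the export question above: once traces are exported as JSONL, a few lines of pandas can summarize them in a notebook. The file name and column names below are placeholders for whatever your export contains.

# Sketch of analyzing an exported trace file; file and column names are assumptions.
import pandas as pd

df = pd.read_json("traces.jsonl", lines=True)

# p95 time-to-first-token and mean decode speed per model
summary = df.groupby("model").agg(
    ttft_p95_ms=("time_to_first_token_ms", lambda s: s.quantile(0.95)),
    tok_per_sec_mean=("tokens_per_second", "mean"),
)
print(summary.sort_values("tok_per_sec_mean", ascending=False))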

Your laptop is the server now

Download VirexaLLM and run Llama, Mistral, Phi-3, Gemma, or Qwen locally in minutes. Free desktop app for macOS, Windows, and Linux — your prompts never leave the device.