Local AI Benchmark Queue and Test Methodology

Current evidence status: the rows below are queued tests, not completed benchmark results. TokenByte should not use them as performance claims until the matching evidence files, screenshots, settings, and measured values are published.

Public test queue

What is planned, what is measured, and what is still missing.

This page separates planned tests from completed evidence. A queued row is a promise of method, not a performance claim.

Structured benchmark queue

The public data file keeps the queue honest: each test has a status, required evidence, target metrics, and the guide it will update.

Test	Machine	Workload	Metrics to publish	Status	Evidence needed
RTX 3090 ComfyUI workflow	RTX 3090 / 24GB VRAM	Image generation, upscale chain, batch output	VRAM use, render time, resolution, workflow file, failure notes	Queued	Workflow screenshot, exact graph, seed/settings, VRAM capture, output sample, failure notes
Mac Mini local model run	Mac Mini + external SSD	Summaries, transcripts, small local LLM prompts	Tokens/sec, memory pressure, model name, prompt set, output quality	Queued	Model name, quantization, app version, prompt set, memory capture, output notes
Storage stress test	NVMe / external SSD	Model library loading, output folder growth, backup flow	Capacity used, load time, heat, reliability notes	Queued	Drive model, enclosure, folder size, transfer notes, heat notes, backup result
Automation folder watcher	Mac Mini or local PC	File summaries and Markdown output	Files processed, runtime, error rate, model cost/privacy notes	Queued	Script/config, file count, runtime log, error cases, before/after output sample
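
As a rough illustration of what one queue entry could look like in the public data file, here is a minimal Python sketch that serializes the first row of the table to JSON. The field names (`metrics_to_publish`, `evidence_files`, `guide`, and so on) are assumptions made for this example; the actual schema of the JSON log is not shown on this page and may differ.

```python
import json
from dataclasses import dataclass, asdict, field


# Hypothetical schema: these field names are illustrative assumptions,
# not the confirmed layout of the public JSON log.
@dataclass
class QueueEntry:
    test: str
    machine: str
    workload: str
    metrics_to_publish: list[str]
    status: str  # "queued", "measured", or "retest"
    evidence_needed: list[str]
    evidence_files: list[str] = field(default_factory=list)  # empty until measured
    guide: str = ""  # the guide this test will update; left blank in this sketch


entry = QueueEntry(
    test="RTX 3090 ComfyUI workflow",
    machine="RTX 3090 / 24GB VRAM",
    workload="Image generation, upscale chain, batch output",
    metrics_to_publish=[
        "VRAM use", "render time", "resolution", "workflow file", "failure notes",
    ],
    status="queued",
    evidence_needed=[
        "Workflow screenshot", "exact graph", "seed/settings",
        "VRAM capture", "output sample", "failure notes",
    ],
)

# A queued entry carries no measured values and no evidence files yet.
print(json.dumps(asdict(entry), indent=2))
```

Keeping `evidence_files` as an explicit, initially empty list makes it easy to see at a glance that a queued row has nothing attached yet.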

Benchmark fields

Every result needs receipts.

TokenByte benchmark posts should include enough context for a reader to reproduce or reject the result.

Hardware

CPU, GPU, VRAM, RAM, storage, OS, power, cooling, and any unusual constraints.

Software

Model names, app versions, drivers, workflow files, quantization, and settings.

Run Data

Speed, memory, VRAM, temperature, output size, time-to-result, and failure modes.

Verdict

Who should buy, who should skip, cheaper alternatives, and what to upgrade first.

Measurement protocol

How a queued row becomes a published result.

A row may influence buying advice only after its status moves from Queued to Measured in the public JSON log.

Minimum evidence gate

Record the exact hardware: CPU, GPU, VRAM, RAM, storage, OS, driver, cooling, and power constraints.
Record the exact software: app version, model or workflow name, quantization, settings, seed when relevant, and prompts or input files.
Run the same workload at least three times when the result is a speed, time, or memory claim.
Publish the failure notes, not only the best run. A setup that crashes, swaps, overheats, or silently degrades output is not a clean win.
Attach screenshots, output samples, workflow files, logs, or photos before changing a buying verdict.
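
The gate above can be sketched as a simple check. This is an illustrative sketch, not TokenByte's actual tooling: the record keys (`hardware`, `software`, `runs`, `failure_notes`, `evidence_files`) and both function names are hypothetical, and the numbers in the usage lines are placeholders, not measurements.

```python
from statistics import median

MIN_RUNS = 3  # the gate asks for at least three runs for speed, time, or memory claims


def gate_problems(record: dict) -> list[str]:
    """Return the reasons a record fails the gate; an empty list means it passes.

    The record keys are hypothetical names chosen for this sketch.
    """
    problems = []
    # Each of these must be present and non-empty. Failure notes are required
    # even for a clean run (for example "none observed").
    for key in ("hardware", "software", "failure_notes", "evidence_files"):
        if not record.get(key):
            problems.append(f"missing {key}")
    runs = record.get("runs", [])
    if len(runs) < MIN_RUNS:
        problems.append(f"only {len(runs)} run(s), need {MIN_RUNS}")
    return problems


def summarize_runs(values: list[float]) -> dict:
    """Report the spread across all runs, not only the best one."""
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
    }


# Placeholder values for illustration only; these are not benchmark results.
placeholder_runs = [10.0, 12.0, 11.0]
print(summarize_runs(placeholder_runs))
print(gate_problems({"runs": placeholder_runs}))
```

Returning the list of problems rather than a bare pass/fail keeps the missing evidence visible, which matches the page's aim of showing what is still missing.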

Status	Allowed claim	Not allowed yet
Queued	TokenByte plans to test this workload and has listed the required evidence.	Speed claims, winner language, product rankings, or completed-review scores.
Measured	The row has original measurements and evidence files attached to the public log.	Broad claims outside the tested hardware, model, settings, or workflow.
Retest	Older data exists, but the recommendation needs a new run before changing buying advice.	Using stale numbers as current evidence for buying advice.
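
The status rules in the table can be expressed as a small lookup. The sketch below is one possible encoding under assumed names (`Status`, `may_support_buying_advice`, `BLOCKED_CLAIMS`); the lowercase status strings are also an assumption about how the JSON log might spell them.

```python
from enum import Enum


class Status(Enum):
    # String values are assumed spellings, not confirmed from the JSON log.
    QUEUED = "queued"
    MEASURED = "measured"
    RETEST = "retest"


# Claims each status does not allow yet, paraphrased from the table.
BLOCKED_CLAIMS = {
    Status.QUEUED: "speed claims, winner language, rankings, review scores",
    Status.MEASURED: "broad claims outside the tested hardware, model, settings, or workflow",
    Status.RETEST: "stale numbers used as current buying proof",
}


def may_support_buying_advice(status: Status) -> bool:
    """Only a measured row, with evidence attached, may back a buying verdict."""
    return status is Status.MEASURED


for status in Status:
    print(status.value, may_support_buying_advice(status), "| blocked:", BLOCKED_CLAIMS[status])
```

Note that a Measured row still has a blocked column: its numbers apply only to the hardware, model, settings, and workflow that were tested.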