Torad. LAB / 001
Torad Edge · inference runs offline

Torad Edge / the model, on your machine

Your machine runs the model. The weights are yours.

Edge runs the model you trained with Etch on hardware you already own, fully offline. No per-token bill, no account, no round trip.

Local · offline · bit-exact at 4-bit

The model lives on your own machine and runs offline. Nothing reports back.

Why it fits

It runs models the cloud software cannot fit.

An 8B or 14B model fits on a card you can buy. Our format shrinks it to about a third of the size at close to the same speed.

31% of the memory · VALIDATED

Qwen3-1.7B's weights take 870 MB in our 4-bit format against 2.7 GB in 16-bit. That headroom is why an 8B or 14B model fits on an ordinary 16 GB graphics card.

4-bit: Torad Quant 4-bit, our own format. It stores each number using 4 bits instead of the usual 16, so the model takes about a quarter of the space.
16-bit: short for bfloat16, the standard 16-bit format AI models normally run in. Accurate, but it takes a lot of memory.
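The arithmetic behind that headroom can be sketched in a few lines. This is a back-of-envelope estimate, not a measurement: `weight_bytes` is a hypothetical helper, the gigabyte is assumed to be decimal, and carrying the Qwen3-1.7B ratio over to larger models is an assumption. It also counts weights only, not the working memory a model needs while it runs.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in bytes. Ignores scales, codebooks and other metadata."""
    return num_params * bits_per_weight / 8


GB = 1e9  # decimal gigabytes; an assumption, the page does not say which unit it uses

# Ratio implied by the page's own Qwen3-1.7B figures (870 MB vs 2.7 GB).
measured_ratio = 870e6 / 2.7e9

for name, params in [("8B", 8e9), ("14B", 14e9)]:
    bf16_gb = weight_bytes(params, 16) / GB
    nominal_gb = weight_bytes(params, 4) / GB
    scaled_gb = bf16_gb * measured_ratio  # assumes the Qwen3-1.7B ratio carries over
    print(f"{name}: bf16 ~{bf16_gb:.0f} GB, pure 4-bit ~{nominal_gb:.0f} GB, "
          f"at the measured ratio ~{scaled_gb:.1f} GB")
```

In bf16 the weights alone come to about 16 GB for an 8B model and 28 GB for a 14B model, which is why neither leaves room to run on a 16 GB card at full size.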

238 / 267 tok/s · about 11% apart · VALIDATED

Qwen3-1.7B writes 238 tokens per second against the cloud software's 267, about 11% slower (238 / 267 ≈ 0.89). Close to the same speed, at a third of the size.

Tokens per second (tok/s): how fast the model writes. A token is roughly a short word or a few letters, so this is close to words per second.

The cloud column cannot fit these models, and with it your data leaves your machine every time. Cut the connection and watch which column stays lit.

We run, they don't. VALIDATED

                        Hosted cloud (online)    Torad Edge (local)
8B model, 16 GB GPU     cannot fit               40 tok/s
14B model, 16 GB GPU    cannot fit               21 tok/s
Your data leaves        always                   never

The runtime

One small program runs the whole thing.

Your prompt goes in, the answer comes out, and the runtime is small enough for one person to read end to end. Retrain with Etch and the new weights drop in place, no redeploy.

[Diagram: one kernel, one artifact. The prompt goes in and the answer comes out; the weights update in place, no redeploy.]

Above / in use

One program answers your prompt, and new weights drop into it without a restart. Nothing leaves the room.

Below / taken apart

serve · train · audit
Roughly 2,500 lines of Rust plus 1,300 of CUDA, against vLLM's 500,000. Small enough that one owner can read it and re-run it end to end. VALIDATED

The megakernel, in full →

Lossless at 4-bit

Shrinking an AI usually wrecks it. Done right, you cannot tell.

We ask the full-size model and the shrunk model the same questions, then score how alike the answers come back. That score is the cosine similarity: 1.00 is identical.

Cosine similarity: a score from 0 to 1.00 for how alike two models' answers are. 1.00 means identical. It measures whether the answers point the same direction, not just whether the words match.
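For readers who want the score spelled out, here is a minimal sketch of cosine similarity between two vectors. The page does not say which vectors Torad compares (output probabilities, embeddings, or something else), so the inputs below are placeholders and `cosine_similarity` is an illustrative helper, not part of Edge.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different length: the score is 1.0 up to floating-point rounding.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Perpendicular vectors share no direction at all.
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction, unrelated)
```

Because the score ignores length and looks only at direction, two answers can differ in scale and still score 1.00, which matches the "point the same direction" wording above.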

How alike is the shrunk model to the original?

VALIDATED · 0.9998 cosine similarity (chart scale: 0.80 to 1.00)

Round-to-nearest: 0.85. Round every number off. The shrunk model drifts into nonsense within a sentence.
Rotation + codebook: 0.91. The careful static methods. Closer, but you would still feel the model get duller.
TQ4 · ours: 0.9998. The same answers, a quarter of the size. The model you ship is the model you trained, down to the last bit.

870 MB vs 2.7 GB in bf16 · a quarter of the size · 0.9997 on Gemma 4 E2B, measured the same way
Measured on Qwen3-1.7B with our 4-bit TQ4 format. At 0.9998, the shrunk model's answers are all but indistinguishable from the original's.
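To make the round-to-nearest baseline concrete, here is a minimal sketch of plain symmetric 4-bit rounding on random stand-in weights. It is not TQ4 and not Torad's benchmark: the Gaussian data, the single shared scale, and the helper names are all assumptions, and it compares weights rather than model answers, so its score will not match the 0.85 in the chart.

```python
import math
import random


def quantize_rtn_4bit(weights):
    """Symmetric round-to-nearest onto the integer levels -7..7 with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map the integer codes back to approximate real values."""
    return [c * scale for c in codes]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in data, not real model weights
codes, scale = quantize_rtn_4bit(weights)
restored = dequantize(codes, scale)

print(f"weight-level cosine similarity: {cosine_similarity(weights, restored):.4f}")
```

Each restored weight lands within half a step of the original, but a single large outlier stretches the step for every other weight, which is one reason naive rounding loses so much detail.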

Start

The model is yours either way. Pricing coming soon.

Tell us the model and what it needs to know, and we will tell you what it can do and when Edge ships. No account, no card, no obligation.

Start a model