Skip to content
Torad. LAB / 001
Talk to us

Technology

One program serves your model and improves it at the same time.

A model a quarter the size that still gives the same answers as the full original, and keeps learning while it serves. Four parts make that work, and we mark a claim as measured only where we can re-run the measurement.

I megakernel II qgre III tq4 IV native-fp4
The four parts, keyed I to IV to the sections below.

megakernel · serve and train in one program

Your model improves on the machine that serves it.

Everywhere else, the program that runs a model and the program that trains it are separate; the megakernel makes them one. The model answering your question is the same one improving, with no redeploy and no round trip to a cloud.

The code is small enough that one person can read all of it and re-run it.

Two separate programs everywhere else, made one here.

conventional · two engines inference engine training loop the megakernel · one launch serve + train one launch
The two programs every other stack keeps apart, made one. The model answering you is the same one improving, on your machine.
~200x less code than the standard stack VALIDATED

About 2,500 lines of Rust plus 1,300 of CUDA, against the roughly 500,000 in The most popular open-source software for running large AI models. Fast, but it needs a lot of memory to load a model., the most common cloud stack for serving models. Code that small is code one person can audit, which is the whole point of the bit-exact guarantee below.

qgre · the training engine

Trained to be right, not to sound right.

Quality-Gated Reward Escalation: our training engine. It trains a model with reinforcement learning on a single consumer graphics card, where the usual setups need a rack of datacenter GPUs. is our training engine. The usual method, Showing a model example answers and having it copy them. It learns to sound like a good answer, which is not the same as being right., shows a model example answers to copy, so it learns to sound right. QGRE uses Letting the model try, checking whether its answer is actually correct, and rewarding it when it is. The model learns to be right, not just to sound right. instead, and rewards the model only when a real checker finds its answer correct.

One approach teaches the format. The other teaches being right.

The model tries, a real checker judges, and the reward lands only when the answer is right.

the model tries its own attempt reward if correct
Supervised copying teaches a model to sound right. QGRE rewards it for being checked and found correct, so it learns to be right.

To prove it works, we taught a 1.7-billion-parameter model to derive a piece of advanced physics (Hamiltonian mechanics) from scratch, with no worked examples to copy, on a single 16 GB graphics card. A symbolic math engine checked the answers, not a word match. VALIDATED

This engine exists because someone saw a model in distress in January 2026 and refused to punish-train it. We built it on reward instead.

Read the story About us →

tq4 · a quarter of the size, the same answers

Shrink the model without losing the answers.

Torad Quant 4-bit, our own 4-bit format. It stores each number in the model using 4 bits instead of the usual 16, so the model takes about a quarter of the space. is how we store the shrunk model: each number kept in 4 bits instead of 16. Two things make it unusual: it is The shrunk model gives the same answers as the full-precision original, down to the same top choice. Not "close." The same. with the full original, and unlike almost every other 4-bit format, it stays trainable.

A / run the test yourself

We give the original model and the shrunk model the same questions and score how alike the answers come back (a score called A score from 0 to 1.00 for how alike two models' answers are. 1.00 means identical. It measures whether the answers point the same direction, not just whether the words match., where 1.00 means identical). Step through the three methods and watch the shrunk model's line snap onto the original.

Step the method. Watch the gap close.

0.85 out of 1.00 VALIDATED

the original, full size the shrunk model

How alike is the shrunk model to the original?

1.00 = identical answers
Round-to-nearest0.85just round every number off. The model drifts into nonsense within a sentence.
Rotation + codebook0.91rotate and codebook, but not trainable. Better, yet you would still feel the model get duller.
TQ4 · ours0.9998the same answers, at a quarter of the size. Measured on Qwen3-1.7B, the top answer matched on 5 of 5 test prompts. It holds on other models too (Gemma 4 E2B scores 0.9997 on 4 of 5).

B / why it stays trainable

Most shrinking freezes a model for good. TQ4 keeps two small, learnable controls on top of the frozen 4-bit weights, so the shrunk model can keep learning, up to full Carrying on a model's original training on new material, so a whole domain's knowledge ends up baked into the weights. This is how Kandi learned its festival. that installs a new domain's knowledge.

frozen 4-bit weights R1 scale R2 codebook
  1. A trainable One number per small block of weights that sets how large that block's values can be. We call it R1. Keeping it trainable lets the shrunk model keep adjusting.

    One number per small block of weights, setting how large that block's values can be. Kept learnable, so the shrunk model keeps adjusting.

  2. A learned The short list of values the 4 bits are allowed to mean (16 of them). Learning this list per model, instead of fixing it, is the R2 knob.

    The short list of 16 values the 4 bits are allowed to mean. Learned per model instead of fixed, so the format adapts to your data.

31% of the standard stack's memory VALIDATED

Measured on one model: the linear weights of An open 1.7-billion-parameter model from Alibaba's Qwen family. It is the model we run most of our tests on. take 870 MB in TQ4, where the same weights in the standard 16-bit format (Short for bfloat16, the standard 16-bit format AI models normally run in. Accurate, but it takes a lot of memory.) take 2.7 GB. About a third the size, for that model's weights.

native-fp4 · the engine that runs it

Runs models a cloud stack cannot fit.

Native FP4 is how we run these 4-bit models on the newest graphics chips, which can do A 4-bit number format the newest GPUs can compute on directly, instead of unpacking back to 16-bit first. Smaller and faster, with just enough range for a neural network. math directly. It is the engine underneath Torad Edge, the part that runs your model on your own machine.

We train in native FP4 too: the weights, activations, and gradient updates are all 4-bit, while the optimizer's master copy is held in 8-bit (An 8-bit whole-number format. Small, but precise enough to hold the running master copy of the weights during training without it drifting.) rather than the usual 32-bit. ASSERTED That is our method, not a published benchmark yet. The two readings below are the part we stand behind.

What we measured · every line names its device and method

As fast as the cloud stack, not faster

Qwen3-1.7B writes 238 How fast the model writes. A token is roughly a short word or a few letters, so this is close to words per second. to the standard stack's 267, within 10%.

VALIDATED

It runs models the standard stack cannot fit

Qwen3-8B at 40 and Qwen3-14B at 21 tok/s on a 16 GB graphics card, where The most popular open-source software for running large AI models. Fast, but it needs a lot of memory to load a model. in the standard 16-bit format cannot even load them.

VALIDATED

One thing we will not claim: we never sell "1.20x faster than vLLM." That holds for one tiny model only, so we leave it off.

What we claim, and what we don't

We grade our own work harder than a marketer would.

Green · measured and re-runnable

  • 31% of vLLM memory VALIDATED
  • throughput at parity VALIDATED
  • we run, they don't VALIDATED
  • ~200x smaller surface VALIDATED
  • cosine ladder bit-exact VALIDATED
  • family-agnostic, Gemma 4 E2B VALIDATED

Amber · believed, not yet proven to your standard

  • "7x fewer steps" ASSERTED
  • "first 2B to chain function calls" ASSERTED
  • QGRE R6 completeness ASSERTED
  • Kandi on-device-vs-cloud ASSERTED
  • Kandi recall is density-gated, never "knows the entire lineup"

Green means measured and re-runnable. Amber means we believe it and have not proven it to your standard yet. We show you both, because the line is the product.

How it works

Get started

Name your model and the world it needs to know.

A model a quarter the size, with the same answers. Yours to run offline.

Start a model