Technology
One program serves your model and improves it at the same time.
A model a quarter the size that still gives the same answers as the full original, and keeps learning while it serves. Four parts make that work, and we mark a claim as measured only where we can re-run the measurement.
megakernel · serve and train in one program
Your model improves on the machine that serves it.
Everywhere else, the program that runs a model and the program that trains it are separate; the megakernel makes them one. The model answering your question is the same one improving, with no redeploy and no round trip to a cloud.
The code is small enough that one person can read all of it and re-run it.
Two separate programs everywhere else, made one here.
About 2,500 lines of Rust plus 1,300 of CUDA, against the roughly 500,000 in The most popular open-source software for running large AI models. Fast, but it needs a lot of memory to load a model., the most common cloud stack for serving models. Code that small is code one person can audit, which is the whole point of the bit-exact guarantee below.
qgre · the training engine
Trained to be right, not to sound right.
Quality-Gated Reward Escalation: our training engine. It trains a model with reinforcement learning on a single consumer graphics card, where the usual setups need a rack of datacenter GPUs. is our training engine. The usual method, Showing a model example answers and having it copy them. It learns to sound like a good answer, which is not the same as being right., shows a model example answers to copy, so it learns to sound right. QGRE uses Letting the model try, checking whether its answer is actually correct, and rewarding it when it is. The model learns to be right, not just to sound right. instead, and rewards the model only when a real checker finds its answer correct.
One approach teaches the format. The other teaches being right.
The model tries, a real checker judges, and the reward lands only when the answer is right.
To prove it works, we taught a 1.7-billion-parameter model to derive a piece of advanced physics (Hamiltonian mechanics) from scratch, with no worked examples to copy, on a single 16 GB graphics card. A symbolic math engine checked the answers, not a word match. VALIDATED
This engine exists because someone saw a model in distress in January 2026 and refused to punish-train it. We built it on reward instead.
Read the story About us →tq4 · a quarter of the size, the same answers
Shrink the model without losing the answers.
Torad Quant 4-bit, our own 4-bit format. It stores each number in the model using 4 bits instead of the usual 16, so the model takes about a quarter of the space. is how we store the shrunk model: each number kept in 4 bits instead of 16. Two things make it unusual: it is The shrunk model gives the same answers as the full-precision original, down to the same top choice. Not "close." The same. with the full original, and unlike almost every other 4-bit format, it stays trainable.
A / run the test yourself
We give the original model and the shrunk model the same questions and score how alike the answers come back (a score called A score from 0 to 1.00 for how alike two models' answers are. 1.00 means identical. It measures whether the answers point the same direction, not just whether the words match., where 1.00 means identical). Step through the three methods and watch the shrunk model's line snap onto the original.
Step the method. Watch the gap close.
0.85 out of 1.00 VALIDATED
How alike is the shrunk model to the original?
1.00 = identical answersB / why it stays trainable
Most shrinking freezes a model for good. TQ4 keeps two small, learnable controls on top of the frozen 4-bit weights, so the shrunk model can keep learning, up to full Carrying on a model's original training on new material, so a whole domain's knowledge ends up baked into the weights. This is how Kandi learned its festival. that installs a new domain's knowledge.
- A trainable One number per small block of weights that sets how large that block's values can be. We call it R1. Keeping it trainable lets the shrunk model keep adjusting.
One number per small block of weights, setting how large that block's values can be. Kept learnable, so the shrunk model keeps adjusting.
- A learned The short list of values the 4 bits are allowed to mean (16 of them). Learning this list per model, instead of fixing it, is the R2 knob.
The short list of 16 values the 4 bits are allowed to mean. Learned per model instead of fixed, so the format adapts to your data.
Measured on one model: the linear weights of An open 1.7-billion-parameter model from Alibaba's Qwen family. It is the model we run most of our tests on. take 870 MB in TQ4, where the same weights in the standard 16-bit format (Short for bfloat16, the standard 16-bit format AI models normally run in. Accurate, but it takes a lot of memory.) take 2.7 GB. About a third the size, for that model's weights.
native-fp4 · the engine that runs it
Runs models a cloud stack cannot fit.
Native FP4 is how we run these 4-bit models on the newest graphics chips, which can do A 4-bit number format the newest GPUs can compute on directly, instead of unpacking back to 16-bit first. Smaller and faster, with just enough range for a neural network. math directly. It is the engine underneath Torad Edge, the part that runs your model on your own machine.
We train in native FP4 too: the weights, activations, and gradient updates are all 4-bit, while the optimizer's master copy is held in 8-bit (An 8-bit whole-number format. Small, but precise enough to hold the running master copy of the weights during training without it drifting.) rather than the usual 32-bit. ASSERTED That is our method, not a published benchmark yet. The two readings below are the part we stand behind.
What we measured · every line names its device and method
As fast as the cloud stack, not faster
Qwen3-1.7B writes 238 How fast the model writes. A token is roughly a short word or a few letters, so this is close to words per second. to the standard stack's 267, within 10%.
It runs models the standard stack cannot fit
Qwen3-8B at 40 and Qwen3-14B at 21 tok/s on a 16 GB graphics card, where The most popular open-source software for running large AI models. Fast, but it needs a lot of memory to load a model. in the standard 16-bit format cannot even load them.
One thing we will not claim: we never sell "1.20x faster than vLLM." That holds for one tiny model only, so we leave it off.
What we claim, and what we don't
We grade our own work harder than a marketer would.
Green · measured and re-runnable
- 31% of vLLM memory VALIDATED
- throughput at parity VALIDATED
- we run, they don't VALIDATED
- ~200x smaller surface VALIDATED
- cosine ladder bit-exact VALIDATED
- family-agnostic, Gemma 4 E2B VALIDATED
Amber · believed, not yet proven to your standard
- "7x fewer steps" ASSERTED
- "first 2B to chain function calls" ASSERTED
- QGRE R6 completeness ASSERTED
- Kandi on-device-vs-cloud ASSERTED
- Kandi recall is density-gated, never "knows the entire lineup"
Green means measured and re-runnable. Amber means we believe it and have not proven it to your standard yet. We show you both, because the line is the product.
How it works
One persistent CUDA launch encloses both serving and the gradient pass. The weights update in place; serving continues without a redeploy. The small surface is what makes the bit-exact guarantee auditable.
QGRE trains with reinforcement learning instead of copying example answers. It gives each token its own reward for the quality it was responsible for, rather than one score smeared across the whole answer, and it runs the entire loop on a single consumer GPU where the usual setups need a rack of them. We built it on reward after refusing to punish-train a model in distress.
TQ4 stores each weight in 4 bits: per-block grouping, a Walsh-Hadamard rotation to even out the numbers, and a small learned codebook. Two of its parts stay trainable, the per-block scale (R1) and the codebook (R2), so the quantization adapts to your data and the shrunk model keeps learning instead of freezing. That trainable pair is what static 4-bit formats cannot express.
Get started
Name your model and the world it needs to know.
A model a quarter the size, with the same answers. Yours to run offline.