Technology

One program serves your model and improves it at the same time.

A model a third the size, the same answers, still learning while it serves. Four parts make it work.

The four parts, keyed I to IV to the sections below.

megakernel · serve and train in one program

Your model improves on the machine that serves it.

Everywhere else, the serving program and the training program are separate. The megakernel is both: the model answering you is the one improving. No redeploy, no cloud round trip.

The code is small enough for one person to read in full.

Two separate programs everywhere else, made one here.

One program serves and trains the same weights on your machine.

~200x less code than the standard stack VALIDATED

About 2,500 lines of Rust plus 1,300 of CUDA, against the roughly 500,000 in , the most common cloud stack for serving models. One person can audit code that small.

qgre · the training engine

Trained to be right, not to sound right.

is our training engine. teaches a model to copy example answers, so it learns to sound right. QGRE uses instead: reward only when a real checker finds the answer correct.

The reward comes from a checker verifying the answer, not from matching example text.

We trained a 1.7-billion-parameter model to derive advanced physics (Hamiltonian mechanics) from scratch, with no worked examples, on a single 16 GB graphics card. A symbolic math engine checked the answers. VALIDATED

Each token earns its own reward, not one score for the whole answer.

QGRE trains on reward for being correct, not punishment for being wrong.

About us →

tq4 · a third of the size, the same answers

Shrink the model without losing the answers.

is how we store the shrunk model: each number kept in 4 bits instead of 16. It is with the full original, and unlike almost every other 4-bit format, it stays trainable.

A / run the test yourself

Same questions to the original and the shrunk model; the score is , 1.00 means identical. Step through the three methods.

Compare three shrinking methods.

0.85 out of 1.00 VALIDATED

the original, full size the shrunk model

Measured on Qwen3-1.7B: the top answer matched on 5 of 5 test prompts. Holds on Gemma 4 E2B too, 0.9997. The traces above are illustrative; the scores are the measured set.

B / why it stays trainable

Most shrinking freezes a model for good. TQ4 keeps two small, learnable controls on top of the frozen 4-bit weights, so the shrunk model can keep learning, up to full . Underneath, a Walsh-Hadamard rotation evens out the numbers before they are stored in 4 bits.

A trainable
One number per small block of weights, setting how large that block's values can be. Kept learnable, so the shrunk model keeps adjusting.
A learned
The short list of 16 values the 4 bits are allowed to mean. Learned per model instead of fixed, so the format adapts to your data.

31% of the standard stack's memory VALIDATED

Measured on one model: the linear weights of take 870 MB in TQ4, where the same weights in the standard 16-bit format () take 2.7 GB. About a third the size, for that model's weights.

native-fp4 · the engine that runs it

Runs models a cloud stack cannot fit.

Native FP4 is how we run these 4-bit models on the newest graphics chips, which can do math directly. It is the engine underneath Torad Edge.

We train in native FP4 too: the weights, activations, and gradient updates are all 4-bit, while the optimizer's master copy is held in 8-bit () rather than the usual 32-bit. ASSERTED

What we measured · every line names its device and method

As fast as the cloud stack, not faster

Qwen3-1.7B writes 238 to the standard stack's 267, within 10%.

VALIDATED

It runs models the standard stack cannot fit

Qwen3-8B at 40 and Qwen3-14B at 21 tok/s on a 16 GB graphics card, where in the standard 16-bit format cannot even load them.

VALIDATED

Get started

Tell us what your model needs to know.

A model a third the size, with the same answers. Yours to run offline.