Integers: Training Neural Networks Without Floating Point
Part 1 of a series on native integer machine learning
When you train a neural network today, you're almost certainly doing it in float32, or if you're lucky enough to have the hardware, bfloat16. The floating point unit is so deeply assumed in modern ML that most frameworks don't even expose a seam where you could swap it out. But floating point is not free, and it's not universal. This series is about what happens when you take that assumption away entirely.
The Problem with Float
The case for integer arithmetic in neural networks is usually framed around inference: quantize your trained float model down to integers, ship it to a phone, profit. This is well understood, well supported, and widely implemented. HuggingFace provides a great overview of these quantization methods in their Transformers documentation.
What's less explored is whether you can skip the float phase entirely: training, backpropagation, and weight updates all in integer arithmetic, on hardware that might not have a floating-point unit (FPU) at all.
The motivation isn't exotic. A large and growing class of compute targets (microcontrollers, FPGAs, custom ASICs, edge accelerators) either lacks FPUs entirely or pays a significant power and area penalty to use them. The standard answer is "train in float, quantize to int for deployment," but this creates a fundamental mismatch: you optimize a float model, then approximate it as integers and hope the gap is small. You also need float hardware to train in the first place, which rules out scenarios where the training itself needs to happen on the edge, e.g. on-device fine-tuning.
Beyond hardware constraints, there's a more fundamental question: is floating point load-bearing for learning, or is it just what we've always used because the math looks nicer? The smooth gradients, the dynamic range, the precise intermediate values — how much of that actually matters for convergence?
That's the question I started trying to answer about a month ago.
What "Native Integer" Means
"Native Integer" in this context means something specific. It doesn't mean:
- Training in float, then quantization to integer afterward
- Using floats for gradients and integers for activations
- Simulating integer arithmetic in float space
It means the core tensor operations (forward pass, backward pass, and weight updates) are computed directly in integer types, with no floating point involved in the hot path.
The main technical challenge this creates is scale management. Floating-point values handle scale implicitly through their exponent field; integers don't. When you multiply two i32 values together, the product has roughly double the bit-width of meaningful signal, and you need to decide how to downcast it back to a representable range without losing too much information. Done naively, the activations either explode to saturation or collapse to zero within a few layers.
A numerical example motivates this better. Say you have a two-layer network where weights and inputs are stored as i32 values with magnitudes up to 2^10 (about ±1000).
Layer 1 has 128 inputs. A single dot product accumulates 128 multiplications, each up to 2^10 * 2^10 = 2^20, so the accumulator can reach 128 * 2^20 = 2^27.
This still fits in an i32 (max 2^31 - 1).
Now pass that accumulator without scaling into layer 2 as the input.
Layer 2 also has 128 inputs, and now each element of the input vector is up to 2^27, so a single product can already reach 2^27 * 2^10 = 2^37, far beyond what an i32 can hold. The accumulator overflows after the first few terms.
The opposite failure is just as easy to reach. If you right-shift the layer 1 output aggressively to bring it back into the original 2^10 range, every activation smaller than the shift amount truncates to exactly zero, and the information it carried is gone.
With three or four layers and naively chosen shifts, the activations reliably collapse to an all-zero tensor within the first few forward passes.
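The overflow half of this failure can be reproduced in a few lines. The snippet below is a standalone sketch with illustrative magnitudes (values up to 2^10, 128 inputs per layer); `checked_mul` makes the layer-2 overflow explicit instead of silently wrapping:

```rust
fn main() {
    // Illustrative magnitudes: inputs and weights up to 2^10 in i32.
    let x: i32 = 1 << 10;
    let w: i32 = 1 << 10;

    // Layer 1: 128 products of up to 2^20 each -> accumulator up to 2^27.
    let acc1: i32 = (0..128).map(|_| x * w).sum();
    assert_eq!(acc1, 1 << 27); // still fits: 2^27 < i32::MAX (2^31 - 1)

    // Layer 2, fed the unscaled accumulator: a single product would be
    // 2^27 * 2^10 = 2^37, far past i32::MAX, so checked_mul reports None.
    assert_eq!(acc1.checked_mul(w), None);
}
```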
The solution that we will use for now is shift-based fixed-point arithmetic:
- Every tensor carries an associated shift value that represents its effective scale as a power of two.
- Multiplying two tensors at shifts s_a and s_b produces an accumulator at shift s_a + s_b.
- Downcasting back to a target shift applies stochastic rounding to avoid the systematic bias that straight truncation introduces.
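A minimal sketch of the multiply-then-downcast rule, with hypothetical names (not the library's API). It widens to i64 so the raw product cannot overflow, then shifts back down; plain truncation stands in for stochastic rounding here:

```rust
/// A value `v` at shift `s` represents the real number v / 2^s.
/// Assumes s_out <= s_a + s_b.
fn fp_mul(a: i32, s_a: u32, b: i32, s_b: u32, s_out: u32) -> i32 {
    // Widen to i64: the raw product lives at shift s_a + s_b.
    let acc = a as i64 * b as i64;
    // Downcast from shift s_a + s_b back to s_out. Truncating here;
    // the real pipeline would apply stochastic rounding at this step.
    (acc >> (s_a + s_b - s_out)) as i32
}

fn main() {
    // 1.5 at shift 8 is 384; 2.0 at shift 8 is 512.
    let p = fp_mul(384, 8, 512, 8, 8);
    assert_eq!(p, 768); // 3.0 at shift 8
}
```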
The Project: integers
For compatibility reasons, and because I really want to get better at Rust, I decided to build the codebase in Rust. The library is built completely from scratch, with no ML framework dependencies.
The code base abstractions are:
Scalar and Numeric traits
Rather than hardcoding a type, every module is generic over a Scalar; i32 and f32 are currently implemented.
This means the same Linear layer, the same ReLU, and the same optimizer can run in either mode.
These traits define the mathematical operations on the underlying data types, e.g. matmul.
The f32 path serves as a correctness baseline: if integer training diverges wildly from float training on the same architecture and seed, that's an indicator that something is wrong in the shift logic.
Scalar covers the storage types used for inference; Numeric covers the wider accumulator types.
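To make the trait split concrete, here is a hypothetical sketch; the names, methods, and the `dot` helper are illustrative, not the crate's actual definitions:

```rust
/// Accumulator-side operations (illustrative).
trait Numeric: Copy {
    fn zero() -> Self;
    fn plus(self, other: Self) -> Self;
}

/// Storage-side type with an associated wider accumulator.
trait Scalar: Copy {
    type Acc: Numeric;
    /// Multiply two storage values into the accumulator type.
    fn mul_acc(self, other: Self) -> Self::Acc;
}

impl Numeric for i64 {
    fn zero() -> Self { 0 }
    fn plus(self, other: Self) -> Self { self + other }
}

impl Scalar for i32 {
    type Acc = i64; // i32 storage accumulates in i64
    fn mul_acc(self, other: Self) -> i64 { self as i64 * other as i64 }
}

/// A dot product written once against the traits runs for any Scalar.
fn dot<S: Scalar>(a: &[S], b: &[S]) -> S::Acc {
    let mut acc = S::Acc::zero();
    for (&x, &y) in a.iter().zip(b) {
        acc = acc.plus(x.mul_acc(y));
    }
    acc
}

fn main() {
    assert_eq!(dot(&[1i32, 2, 3], &[4, 5, 6]), 32i64); // 4 + 10 + 18
}
```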
Tensor<T>
A simple row-major flat array with a shape vector. No striding, no accelerator backend.
The shift value is not stored in the tensor itself; it's tracked as a u32 that threads through the computation graph alongside the tensor.
This is deliberate: keeping shifts out of the tensor type means fewer abstraction layers and easier reasoning about where scale changes happen.
Params<S>
The weight store, which maintains both a master copy (full S::Acc precision for accumulating gradient updates)
and a storage copy (quantized S for use in forward/backward).
The quant_shift on a parameter set controls how aggressively the master weights are compressed when synced to storage.
Params are used in objects that implement the Module trait.
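Illustratively, the master/storage split might look like this (field and method names are hypothetical, and the sync uses a truncating shift where the real code would round stochastically):

```rust
/// Master weights accumulate updates at full i64 precision;
/// the quantized i32 copy is what forward/backward actually read.
struct Params {
    master: Vec<i64>,   // full-precision copy for gradient accumulation
    storage: Vec<i32>,  // quantized copy used in forward/backward
    quant_shift: u32,   // how aggressively master is compressed on sync
}

impl Params {
    /// Re-quantize master into storage by right-shifting.
    fn sync(&mut self) {
        for (s, &m) in self.storage.iter_mut().zip(&self.master) {
            *s = (m >> self.quant_shift) as i32;
        }
    }
}

fn main() {
    let mut p = Params {
        master: vec![1 << 20],
        storage: vec![0],
        quant_shift: 8,
    };
    p.sync();
    assert_eq!(p.storage[0], 1 << 12); // 2^20 compressed by 8 bits
}
```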
Module<S> trait
The standard forward/backward interface. Both methods take a shift argument: s_x is the shift of the input tensor, s_g the shift of the incoming gradient.
```rust
fn forward(&mut self, input: &Tensor<S>, s_x: u32, rng: &mut XorShift64) -> (Tensor<S>, u32);
fn backward(&mut self, grad: &Tensor<S::Acc>, s_g: u32) -> (Tensor<S::Acc>, u32);
```
In addition to taking the incoming shift, each method returns the shift of its output tensor.
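A simplified sketch of how a scale-free module like ReLU threads the shift through forward. The signatures here are reduced (no RNG, concrete i32 tensors) compared to the real trait, which is generic over Scalar:

```rust
struct Tensor<T>(Vec<T>);

trait Module {
    fn forward(&mut self, input: &Tensor<i32>, s_x: u32) -> (Tensor<i32>, u32);
}

struct ReLU;

impl Module for ReLU {
    fn forward(&mut self, input: &Tensor<i32>, s_x: u32) -> (Tensor<i32>, u32) {
        // ReLU is scale-free: clamping negatives to zero doesn't change
        // the fixed-point scale, so the input shift passes through as-is.
        let out = Tensor(input.0.iter().map(|&v| v.max(0)).collect());
        (out, s_x)
    }
}

fn main() {
    let mut relu = ReLU;
    let (out, s) = relu.forward(&Tensor(vec![-3, 0, 5]), 8);
    assert_eq!(out.0, vec![0, 0, 5]);
    assert_eq!(s, 8); // shift unchanged
}
```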
Sequential
A container that chains modules and threads shifts through the full forward and backward passes automatically.
Optimizers
The optimizers (SGD with optional momentum, Adam) are also integer-native,
with learning rates expressed as bit shifts: lr_shift = 4 means a learning rate of 1/16.
This sounds limiting, but it aligns with how you would implement a fixed-point optimizer on hardware anyway.
Note: The f32 implementations of both optimizers use "raw" arithmetic; only the i32 versions use our internal Scalar and Numeric operations.
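As a sketch of what a shift-based learning rate looks like in practice (illustrative, not the crate's code), here is a bare-bones integer SGD step where `lr_shift = 4` scales the update by 1/16:

```rust
/// One plain SGD step: w -= g / 2^lr_shift, as an arithmetic right shift.
fn sgd_step(weights: &mut [i64], grads: &[i64], lr_shift: u32) {
    for (w, &g) in weights.iter_mut().zip(grads) {
        *w -= g >> lr_shift;
    }
}

fn main() {
    let mut w = vec![1000i64];
    sgd_step(&mut w, &[160], 4); // update is 160 >> 4 = 10
    assert_eq!(w[0], 990);
}
```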
Random number generator
Random numbers are generated using a shift-register generator (Xorshift), specifically the 64 bit version.
```rust
#[derive(Debug, PartialEq)]
pub struct XorShift64 {
    pub state: u64,
}

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // edge case handling: an all-zero state would stay zero forever
        let state = if seed == 0 { 0xC0FFEE } else { seed };
        Self { state }
    }

    pub fn next(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Random value in range [0, range)
    #[inline(always)]
    pub fn gen_range(&mut self, range: u32) -> u32 {
        (self.next() as u32) % range
    }
}
```
Stochastic rounding
The downcast path uses the XorShift64 random number generator to round fractional remainders
probabilistically rather than always truncating. Over many samples, this gives unbiased estimates
where truncation would introduce a systematic downward drift.
```rust
pub fn stochastic_downcast(val: i32, shift: u32, rng: &mut XorShift64) -> i32 {
    if shift == 0 {
        return val;
    }
    let mask = (1 << shift) - 1;
    let frac = val & mask;
    let thresh = rng.gen_range(1 << shift) as i32;
    let round_bit = if frac.abs() > thresh { 1 } else { 0 };
    (val >> shift) + round_bit
}
```
Again, a quick numerical example. Say we want to downscale a 5-dimensional vector with elements [7, 13, 2, 9, 5] by shift = 3, i.e. divide every element by 8. Plain truncation gives [0, 1, 0, 1, 0], which sums to 2, while the true scaled sum is 36 / 8 = 4.5: a systematic downward drift. Stochastic rounding instead rounds each element up with probability frac / 8 (here 7/8, 5/8, 2/8, 1/8, and 5/8), so the expected result is [0.875, 1.625, 0.25, 1.125, 0.625], whose sum is exactly 4.5.
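The unbiasedness claim can be checked empirically. This standalone snippet repeats the post's XorShift64 and stochastic_downcast so it runs on its own, then averages 100,000 downcasts of the value 7 at shift 3:

```rust
// XorShift64 and stochastic_downcast repeated from the post so this
// snippet is self-contained.
pub struct XorShift64 {
    pub state: u64,
}

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        Self { state: if seed == 0 { 0xC0FFEE } else { seed } }
    }
    pub fn next(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
    pub fn gen_range(&mut self, range: u32) -> u32 {
        (self.next() as u32) % range
    }
}

pub fn stochastic_downcast(val: i32, shift: u32, rng: &mut XorShift64) -> i32 {
    if shift == 0 {
        return val;
    }
    let mask = (1 << shift) - 1;
    let frac = val & mask;
    let thresh = rng.gen_range(1 << shift) as i32;
    (val >> shift) + if frac.abs() > thresh { 1 } else { 0 }
}

fn main() {
    // Downcasting 7 by shift 3: truncation always gives 0, but the
    // stochastic average should approach 7/8 = 0.875.
    let mut rng = XorShift64::new(42);
    let n = 100_000;
    let sum: i64 = (0..n)
        .map(|_| stochastic_downcast(7, 3, &mut rng) as i64)
        .sum();
    let mean = sum as f64 / n as f64;
    assert!((mean - 0.875).abs() < 0.02);
}
```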
Current State: A Baseline
At this point the project is a working baseline, not a polished library. The immediate goals were to prove out the architecture and establish that integer training can learn something on real tasks.
Both experiments use the same architecture and batch size across float and integer runs.
The f32 path serves as the baseline; the i32 path uses the same Sequential container and module implementations,
with the Scalar type swapped out.
Iris
The Iris dataset has 150 samples, 4 continuous features, and 3 classes. Features are z-score normalized before loading.
Architecture: 4 -> 8 -> 8 -> 3 with ReLU activations and MSE loss.
| | f32 | i32 |
|---|---|---|
| Test accuracy | 90% | 90% |
| Epochs | 500 | 5000 |
| lr_shift | 7 | 3 |
| momentum_shift | 1 | 0 |
| Batch size | 32 | 32 |
| Grad. clip value | - | 512 |
Both runs reach the same final accuracy, but the integer model needs 10x more epochs to get there.
The hyperparameters also diverge significantly:
f32 runs comfortably with a small learning rate (lr_shift: 7, i.e. 1/128), while i32 needs a much larger step (lr_shift: 3, i.e. 1/8) plus gradient clipping to make progress.
That i32 converges at all on Iris is encouraging. That it needs this much more iteration to do so is the first concrete measurement of what integer training costs.
MNIST
The MNIST dataset has 60000 training samples and 10000 test samples, with 784 pixel features per image. Features are z-score normalized.
Architecture: 784 -> 128 -> 128 -> 10 with ReLU activations, MSE loss, and early stopping at 95% test accuracy.
| | f32 | i32 |
|---|---|---|
| Test accuracy | 95.89% | ~20% |
| Epochs | 31 (early stopping) | - |
| Time per epoch | ~7.5s | increasing |
| lr_shift | 6 | 5 |
| momentum_shift | None | None |
| Batch size | 32 | 32 |
The f32 model converges cleanly, hitting the early stopping threshold at epoch 31.
The i32 model currently doesn't. The loss does decrease, so the network is learning something, but test accuracy plateaus around 20%, only twice the 10% chance level for a 10-class problem.
There's also an unexpected symptom:
Per-epoch compute time grows as training progresses, which shouldn't happen with a fixed architecture and batch size. That points to a memory or bookkeeping issue in the integer path that needs investigation before these results can be trusted.
The honest picture after this round of experiments:
Integer training works on small problems; its scaling behavior on larger ones is, for me, an open question. Getting i32 MNIST to match the f32 baseline is the central problem to work on next.
What's Next
This post is only a starting point. The subsequent posts in this series will cover:
- i32 MNIST. What needs to be done to get to 95% accuracy for i32?
- Shift management in depth. The relationship between `input_shift`, `output_shift`, `quant_shift`, and `grad_shift` deserves a full treatment. Getting this wrong is the most common failure mode, and the current codebase has some heuristics (auto-detecting shifts from weight magnitude after Xavier init) that need to be documented and justified properly.
- Other layers. The repo already contains an implementation of an `RNNCell` and will soon receive `Conv2d`. These will get dedicated posts, too.
- The hardware question. What would it actually look like to run this inference (and eventually training) on a microcontroller or FPGA? What changes when you drop from `i32` to `i8`?
The code is open and the approach is straightforward. The point of writing this up isn't to claim that integer training is solved (it clearly isn't), but to document what a ground-up integer ML stack might look like, what problems it runs into, and whether the core assumption holds: that you can train a useful neural network without ever touching a floating point number.
So far, for small problems, the answer looks like yes. Part 2 will find out how far that scales.
The source code is available on GitHub.