Cort33x — 35M-parameter language model, trained locally

Overview

Cort33x is a 35M-parameter mixed-transformer language model trained entirely on local hardware. The goal was never to compete with frontier models — it was to own every decision in the stack: tokenizer, architecture, training loop, and evaluation, with no cloud abstractions hiding the details.

Small models trained under tight compute constraints force the kind of engineering judgment that gets abstracted away at scale. Every parameter has to earn its place.

The stack, from scratch

Custom BPE tokenizer. Built and trained from raw text rather than reusing an off-the-shelf vocabulary. Vocabulary size, merge rules, and special-token handling were tuned against the actual training corpus, so target-domain text encodes into fewer tokens and each parameter sees more useful signal.
Mixed-transformer architecture. The model interleaves attention variants rather than using a uniform stack, trading a small amount of implementation complexity for better quality at fixed parameter count. Final configuration: 12 layers, 6 attention heads, 1024-token context.
Training loop. Hand-written PyTorch training pipeline with gradient accumulation, cosine learning-rate scheduling (peak 3e-4), checkpointing, and loss-curve instrumentation, all sized to run on a single local GPU without memory-saving tricks such as gradient checkpointing becoming the bottleneck.
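
As a rough illustration of the schedule mentioned above, here is a minimal sketch of cosine learning-rate decay from the stated peak of 3e-4, advanced once per optimizer step under gradient accumulation. Only the peak value comes from this write-up. The warmup length, total step count, floor value, accumulation factor, and the helper names cosine_lr and lr_for_micro_batch are assumptions made for the example, not the project's actual settings.

```python
import math

PEAK_LR = 3e-4        # peak learning rate stated in the write-up
MIN_LR = 3e-5         # assumed floor (10% of peak); not from the project
WARMUP_STEPS = 200    # assumed linear warmup length
TOTAL_STEPS = 10_000  # assumed number of optimizer steps
ACCUM_STEPS = 8       # assumed micro-batches per optimizer step


def cosine_lr(step: int) -> float:
    """Return the learning rate for a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup from near zero up to the peak.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step >= TOTAL_STEPS:
        return MIN_LR
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def lr_for_micro_batch(micro_step: int) -> float:
    """With gradient accumulation, the schedule advances once per optimizer
    step, so every micro-batch in the same group shares one learning rate."""
    return cosine_lr(micro_step // ACCUM_STEPS)
```

In a PyTorch loop, the value returned by lr_for_micro_batch would typically be written into each optimizer param group just before the optimizer step. That wiring is left out here to keep the sketch dependency-free.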

What the constraints taught

Data quality dominates. At 35M parameters, a cleaner corpus outperformed a bigger one every time. Deduplication and filtering reduced perplexity more than any architecture tweak.
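
To make the deduplication and filtering step concrete, here is a minimal sketch of exact-match deduplication after light normalization, combined with a simple length filter. It is not the project's pipeline: the normalization rules, the min_chars threshold, and the function names are assumptions for illustration. Near-duplicate detection (for example MinHash) is out of scope for this sketch.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_and_filter(docs, min_chars=20):
    """Keep the first copy of each normalized document above a length floor."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm) < min_chars:
            continue  # assumed quality filter: drop very short fragments
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an identical normalized document was already kept
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing the normalized text rather than storing it keeps memory use roughly constant per document, which matters when the whole corpus has to be processed on one machine.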

Tokenizer choices are architecture choices. A domain-tuned vocabulary effectively lengthens the usable context window and shifts capacity from memorizing subwords to modeling structure.
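
The effect on context length can be seen with a toy BPE trainer: merges learned from in-domain text shrink the number of tokens needed for that text, so the same fixed window covers more of it. The sketch below is a deliberately simplified, whitespace-split BPE with no byte-level fallback or special tokens. It is not the Cort33x tokenizer, and the function names and sample corpus are hypothetical.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of pair with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus: str, num_merges: int):
    """Learn up to num_merges merge rules, starting from characters."""
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))  # count adjacent symbol pairs
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        words = [apply_merge(w, best) for w in words]
    return merges


def encode(text: str, merges):
    """Tokenize text by applying the learned merges in order, word by word."""
    tokens = []
    for word in text.split():
        symbols = list(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


corpus = "tensor tensor tensor gradient gradient kernel"  # toy in-domain text
merges = train_bpe(corpus, num_merges=10)
baseline = encode("tensor gradient", [])    # character-level tokens
tuned = encode("tensor gradient", merges)   # tokens after learned merges
```

Comparing len(baseline) with len(tuned) shows the compression gained on in-domain text. The same merges applied to out-of-domain text would generally compress it less, which is the sense in which tokenizer tuning acts like an architecture decision.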

Instrument before you optimize. The most valuable component was the least glamorous: logging that made every regression attributable to a specific change.
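
One way to make regressions attributable is to stamp every logged metric with a fingerprint of the exact configuration that produced it. The following is a minimal sketch under that assumption. The JSON Lines format, the field names, and the helpers config_fingerprint and log_step are illustrative choices, not a description of the project's actual logging.

```python
import hashlib
import json
import time


def config_fingerprint(config: dict) -> str:
    """Short stable hash of a config so runs can be grouped by exact settings."""
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def log_step(path: str, config: dict, step: int, loss: float, note: str = "") -> dict:
    """Append one JSON line tying a metric to the config that produced it."""
    record = {
        "time": time.time(),
        "config_id": config_fingerprint(config),
        "step": step,
        "loss": loss,
        "note": note,  # free-text description of what changed in this run
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Because the fingerprint is computed from sorted keys, two runs with identical settings share a config_id regardless of key order, and any change to a setting produces a new one that can be lined up against the loss curve.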

Status

Research project, complete as a training exercise. The pipeline is being reused as a testbed for tokenizer and data-mixture experiments.