$1000 tier nanochat run #8
-
Cool. I chatted with the model and it's significantly smarter than @depth20. Can you share the weights for both the base and SFT'ed models? It'd be cool to experiment with more post-training on the base model to see how far we could take it...
-
This is amazing! Any chance you'll release the weights for the d32 model? Pretty please
-
Are you planning to release any video on nanochat in the future?
-
Nice! I ported the weights to transformers: https://huggingface.co/karpathy/nanochat-d32/discussions
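For anyone who wants to poke at the ported checkpoint, here is a minimal loading sketch. It assumes the port exposes the standard AutoTokenizer / AutoModelForCausalLM interface and uses karpathy/nanochat-d32 as the repo id; both are assumptions on my part, so adjust to whatever the actual port uses (it may also need trust_remote_code=True if it ships custom modeling code).

```python
# Minimal sketch for loading a transformers port of nanochat d32.
# The repo id and generation settings are assumptions, not confirmed details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "karpathy/nanochat-d32"  # assumed; point this at the actual port

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # ~3.5 GiB of weights for ~1.9B params
    device_map="auto",           # requires `accelerate`
)

prompt = "The chemical symbol of gold is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```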
-
Okay, so I am applying this to the OpenWebText dataset from nanoGPT, but with just one GPU, and I keep reaching a point where the loss plateaus and then starts climbing: OUTPUT:
-
What is the best way to run this model locally on my GPU with Ollama?
-
This d32 model has 1,879,048,192 parameters (~1.9B), so why is its GPU memory usage 75.2 GB (Peak memory usage: 77017.78MiB)? Scaling memory with parameter count, it should be much lower than 75 GB. Meanwhile, other reports online say the d20 model's peak GPU memory usage is also around 70 GB. I checked the peak-memory source code: print0(f"Peak memory usage: {torch.cuda.max_memory_allocated() / 1024 / 1024:.2f}MiB"). This is just the GPU memory of rank 0. Why does the peak memory stay the same when the parameter count changes?
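A rough way to reason about this (my own napkin accounting, not something taken from the nanochat code or the run report): during training, the weights themselves are a small fraction of peak memory. Gradients, optimizer state, and especially activations dominate, and torch.cuda.max_memory_allocated() reports a high-water mark that includes transient buffers. A minimal sketch, assuming bf16 weights/gradients and fp32 Adam-style state on every parameter (nanochat actually mixes Muon and AdamW, so treat the numbers as illustrative only):

```python
# Back-of-the-envelope training-memory estimate for a ~1.9B-param model.
# All assumptions (dtype choices, Adam-style state on every parameter) are
# illustrative; they are not taken from the nanochat implementation.

n_params = 1_879_048_192

state_bytes = {
    "weights (bf16)":         2 * n_params,
    "gradients (bf16)":       2 * n_params,
    "Adam 1st moment (fp32)": 4 * n_params,
    "Adam 2nd moment (fp32)": 4 * n_params,
    "fp32 master weights":    4 * n_params,  # if the optimizer keeps an fp32 copy
}

def gib(b: int) -> float:
    return b / 2**30

total = sum(state_bytes.values())
for name, b in state_bytes.items():
    print(f"{name:24s} {gib(b):6.1f} GiB")
print(f"{'persistent state total':24s} {gib(total):6.1f} GiB")

# Activations come on top of this and grow roughly with
#   device_batch_size * seq_len * n_layers * d_model,
# so a large per-device batch can add tens of GiB and push the
# high-water mark far above the raw parameter bytes.
```

Under these assumptions the persistent state is only ~28 GiB; the rest of the ~75 GiB high-water mark is plausibly activations plus transient buffers. If the d20 run used a larger per-device batch to fill the same GPUs, that would also explain why its peak memory looks similar despite the smaller parameter count.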
-
Hi
-
Thanks @karpathy
-
By scaling the model with the same dataset, it seems like the experiment becomes how much juice can be squeezed out of the same dataset with different model configurations. Thank you for doing that scaling for us, so we can see what this dataset does at higher scales. My sense is that the smaller models could perform better with higher-quality datasets, and performance would probably scale better in proportion to dataset quality as well. Meaning that at 1.8B params, I think the benchmarks and coherence are more limited by the pretraining data quality than by the model hyperparameters or even training duration. Ideally someone with more knowledge of training data than me would run experiments with multiple model sizes on several different pretraining datasets, showing the same evals and which datasets perform better and scale better.
-
My training just started... While I am an engineer with a 20-year career in a commercial, non-tech department, I am super into AI, and while I have a good grip on the concepts, I wanted to do this project to learn more! The code was overwhelming, but with the help of actual ChatGPT I was able to step into the math and try to learn and understand it as much as I can. I feel my learning of the ML parts of the project reached maybe 25-30% of the math (perhaps if I memorize the equations that would help me add to my personal knowledge eval 😅). I haven't played much with the code; I have replaced the midtraining/SFT data with new contracts/law datasets relevant to my career in supply chain, so I can't wait!!! The pretraining was kept to the original recommended corpus. Honestly, I can't wait to have my own model (obviously based on Karpathy's and the contributors' nanochat, which was 99% of the work on my project). Still, I feel proud to have stepped into an intimidating area. Even if the learning was limited, I took a step that might be expensive, but cheap against the experience and the feeling.
-
I did a bit of work to set up the $1000 run, thought I'd share the napkin math if helpful. Here is the draft of the run1000.sh script. I just kicked it off and now we wait ~31 hours... The rest of the budget (~10 more hours) I am saving for midtraining/SFT, possibly a bit of RL. Here is what I have for pretraining and I'll edit this as we go along:
UPDATE 1: I finished the d32 run. It includes the midtraining bugfix. The full script is below. I'll push to master shortly.
UPDATE 2: I am hosting the d32 chat_web.py here. (Please obviously don't put any sensitive information into these nanochat WebUIs.) I'll probably take it down a bit later. The d32 is about an ~$800 model.
UPDATE 3: Added the RL result, almost at 20% GSM8K, nice.
UPDATE 4: The model is now uploaded to Hugging Face here, have fun.
UPDATE 5: Added the summary "poster" that I tweeted to the bottom of the post.
UPDATE 6: nanochat d32 is now hosted at https://nanochat.karpathy.ai/, nice.
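For anyone who wants to sanity-check the ~31-hour / sub-$1000 estimates, here is the kind of napkin math involved. The node price, MFU, and tokens-per-parameter figures below are my own illustrative assumptions, not numbers pulled from the run report:

```python
# Napkin math for d32 pretraining time and cost.
# Assumptions (not from the run report): 8xH100 node at ~$24/hr, ~40% MFU,
# ~20 training tokens per parameter, standard 6*N*D FLOPs estimate.

n_params = 1_879_048_192            # d32 parameter count
tokens   = 20 * n_params            # assumed ~20 tokens per parameter
flops    = 6 * n_params * tokens    # ~6*N*D total training FLOPs

peak_bf16_per_gpu = 989e12          # H100 dense bf16 peak FLOP/s
n_gpus, mfu, usd_per_hour = 8, 0.40, 24.0

seconds = flops / (peak_bf16_per_gpu * n_gpus * mfu)
hours   = seconds / 3600
print(f"tokens: {tokens/1e9:.1f}B, FLOPs: {flops:.2e}")
print(f"wall clock: ~{hours:.0f} h, cost: ~${hours * usd_per_hour:.0f}")
```

With these assumptions it comes out to roughly 37 hours and around $900; nudging the MFU or the hourly price moves the answer proportionally, so the ~31 hours of pretraining plus ~10 hours for midtraining/SFT/RL is in the same ballpark.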
The full report is as follows:
nanochat training report
Generated: 2025-10-13 20:50:44
Environment
Git Information
Hardware
Software
Bloat
Run started: 2025-10-13 20:50:46
Tokenizer training
timestamp: 2025-10-13 20:53:47
Tokenizer evaluation
timestamp: 2025-10-13 20:53:52
Comparison with GPT-2
Comparison with GPT-4
Base model training
timestamp: 2025-10-15 05:23:04
Base model loss
timestamp: 2025-10-15 16:02:50
Base model evaluation
timestamp: 2025-10-15 12:58:58
Midtraining
timestamp: 2025-10-15 15:48:54
Chat evaluation mid
timestamp: 2025-10-15 16:14:34
Chat SFT
timestamp: 2025-10-15 18:10:56
Chat evaluation sft
timestamp: 2025-10-15 18:22:17
Summary
I redacted the time because it is inaccurate. I actually think this run comes in well below $1000, probably closer to $800 or so. I might experiment with bumping the model size. I will work on export/import of models and post this one.