$1000 tier nanochat run #8
-
Cool. I chatted with the model and it's significantly smarter than @depth20. Can you share the weights for both the base and SFT'ed models? It'd be cool to experiment with more post-training on the base model to see how far we could take it...
-
This is amazing! Any chance you'll release the weights for the d32 model? Pretty please
-
Are you planning to release any video on nanochat in the future?
-
Nice! I ported the weights to transformers: https://huggingface.co/karpathy/nanochat-d32/discussions
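For anyone who wants to poke at the ported checkpoint, here is a minimal loading sketch. It assumes the port exposes the standard AutoTokenizer / AutoModelForCausalLM interface and uses karpathy/nanochat-d32 as the repo id; both are assumptions on my part, so adjust to whatever the actual port uses (it may also need trust_remote_code=True if it ships custom modeling code).

```python
# Minimal sketch for loading a transformers port of nanochat d32.
# The repo id and generation settings are assumptions, not confirmed details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "karpathy/nanochat-d32"  # assumed; point this at the actual port

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # ~3.5 GiB of weights for ~1.9B params
    device_map="auto",           # requires `accelerate`
)

prompt = "The chemical symbol of gold is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```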
-
Okay, so I am applying this to the OpenWebText dataset from nanoGPT, but with just one GPU, and I keep reaching a point where the loss plateaus and then starts climbing: OUTPUT:
-
What is the best way to run this model locally on my GPU with Ollama?
-
This d32 model has 1,879,048,192 parameters (~1.9B), so why is its GPU memory usage 75.2 GB (Peak memory usage: 77017.78MiB)? Scaling memory with parameter count, it should be much lower than 75 GB. Meanwhile, other reports online say the d20 model's peak GPU memory usage is also around 70 GB. I checked the peak-memory source code: print0(f"Peak memory usage: {torch.cuda.max_memory_allocated() / 1024 / 1024:.2f}MiB"). This is just the GPU memory of rank 0. Why does the peak memory stay the same when the parameter count changes?
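A rough way to reason about this (my own napkin accounting, not something taken from the nanochat code or the run report): during training, the weights themselves are a small fraction of peak memory. Gradients, optimizer state, and especially activations dominate, and torch.cuda.max_memory_allocated() reports a high-water mark that includes transient buffers. A minimal sketch, assuming bf16 weights/gradients and fp32 Adam-style state on every parameter (nanochat actually mixes Muon and AdamW, so treat the numbers as illustrative only):

```python
# Back-of-the-envelope training-memory estimate for a ~1.9B-param model.
# All assumptions (dtype choices, Adam-style state on every parameter) are
# illustrative; they are not taken from the nanochat implementation.

n_params = 1_879_048_192

state_bytes = {
    "weights (bf16)":         2 * n_params,
    "gradients (bf16)":       2 * n_params,
    "Adam 1st moment (fp32)": 4 * n_params,
    "Adam 2nd moment (fp32)": 4 * n_params,
    "fp32 master weights":    4 * n_params,  # if the optimizer keeps an fp32 copy
}

def gib(b: int) -> float:
    return b / 2**30

total = sum(state_bytes.values())
for name, b in state_bytes.items():
    print(f"{name:24s} {gib(b):6.1f} GiB")
print(f"{'persistent state total':24s} {gib(total):6.1f} GiB")

# Activations come on top of this and grow roughly with
#   device_batch_size * seq_len * n_layers * d_model,
# so a large per-device batch can add tens of GiB and push the
# high-water mark far above the raw parameter bytes.
```

Under these assumptions the persistent state is only ~28 GiB; the rest of the ~75 GiB high-water mark is plausibly activations plus transient buffers. If the d20 run used a larger per-device batch to fill the same GPUs, that would also explain why its peak memory looks similar despite the smaller parameter count.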
-
Hi
-
Thanks @karpathy
-
By scaling the model with the same dataset, it seems like the experiment becomes how much juice can be squeezed out of the same dataset with different model configurations. Thank you for doing that scaling for us, so we can see what this dataset does at higher scales. My sense is that the smaller models could perform better with higher-quality datasets, and performance would probably scale better in proportion to dataset quality as well. Meaning that at 1.8B params, I think the benchmarks and coherence are more limited by the pretraining data quality than by the model hyperparameters or even training duration. Ideally someone with more knowledge of training data than me would run experiments with multiple model sizes on several different pretraining datasets, showing the same evals and which datasets perform better and scale better.
-
My training just started... While I am an engineer with a 20-year career in a commercial, non-tech department, I am super into AI, and while I have a good grip on the concepts, I wanted to do this project to learn more! The code was overwhelming, but with the help of actual ChatGPT I was able to step into the math and try to learn and understand it as much as I can. I feel my learning of the ML parts of the project reached maybe 25-30% of the math (perhaps if I memorize the equations that would help me add to my personal knowledge eval 😅). I haven't played much with the code; I have replaced the midtraining/SFT data with new contracts/law datasets relevant to my career in supply chain, so I can't wait!!! The pretraining was kept to the original recommended corpus. Honestly, I can't wait to have my own model (obviously based on Karpathy's and the contributors' nanochat, which was 99% of the work on my project). Still, I feel proud to have stepped into an intimidating area. Even if the learning was limited, I took a step that might be expensive, but cheap against the experience and the feeling.
-
I did a bit of work to set up the $1000 run, thought I'd share the napkin math if helpful. Here is the draft of the run1000.sh script. I just kicked it off and now we wait ~31 hours... The rest of the budget (~10 more hours) I am saving for midtraining/SFT, possibly a bit of RL. Here is what I have for pretraining and I'll edit this as we go along:
UPDATE 1: I finished the d32 run. It includes the midtraining bugfix. The full script is below. I'll push to master shortly.
UPDATE 2: I am hosting the d32 chat_web.py here. (Please obviously don't put any sensitive information into these nanochat WebUIs.) I'll probably take it down a bit later. The d32 is about an ~$800 model.
UPDATE 3: Added the RL result, almost at 20% GSM8K, nice.
UPDATE 4: The model is now uploaded to Hugging Face here, have fun.
UPDATE 5: Added the summary "poster" that I tweeted to the bottom of the post.
UPDATE 6: nanochat d32 is now hosted at https://nanochat.karpathy.ai/, nice.
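For anyone who wants to sanity-check the ~31-hour / sub-$1000 estimates, here is the kind of napkin math involved. The node price, MFU, and tokens-per-parameter figures below are my own illustrative assumptions, not numbers pulled from the run report:

```python
# Napkin math for d32 pretraining time and cost.
# Assumptions (not from the run report): 8xH100 node at ~$24/hr, ~40% MFU,
# ~20 training tokens per parameter, standard 6*N*D FLOPs estimate.

n_params = 1_879_048_192            # d32 parameter count
tokens   = 20 * n_params            # assumed ~20 tokens per parameter
flops    = 6 * n_params * tokens    # ~6*N*D total training FLOPs

peak_bf16_per_gpu = 989e12          # H100 dense bf16 peak FLOP/s
n_gpus, mfu, usd_per_hour = 8, 0.40, 24.0

seconds = flops / (peak_bf16_per_gpu * n_gpus * mfu)
hours   = seconds / 3600
print(f"tokens: {tokens/1e9:.1f}B, FLOPs: {flops:.2e}")
print(f"wall clock: ~{hours:.0f} h, cost: ~${hours * usd_per_hour:.0f}")
```

With these assumptions it comes out to roughly 37 hours and around $900; nudging the MFU or the hourly price moves the answer proportionally, so the ~31 hours of pretraining plus ~10 hours for midtraining/SFT/RL is in the same ballpark.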
The full report is as follows:
nanochat training report
Generated: 2025-10-13 20:50:44
Environment
Git Information
Hardware
Software
Bloat
Run started: 2025-10-13 20:50:46
Tokenizer training
timestamp: 2025-10-13 20:53:47
Tokenizer evaluation
timestamp: 2025-10-13 20:53:52
Comparison with GPT-2
Comparison with GPT-4
Base model training
timestamp: 2025-10-15 05:23:04
Base model loss
timestamp: 2025-10-15 16:02:50
Base model evaluation
timestamp: 2025-10-15 12:58:58
Midtraining
timestamp: 2025-10-15 15:48:54
Chat evaluation mid
timestamp: 2025-10-15 16:14:34
Chat SFT
timestamp: 2025-10-15 18:10:56
Chat evaluation sft
timestamp: 2025-10-15 18:22:17
Summary
I redacted the time because it is inaccurate. I actually think this run comes in well below $1000, probably closer to $800 or so. I might experiment with bumping the model size. I will work on export/import of models and post this one.