Hi @karpathy,
First, sincere thanks for creating nanochat. It is an inspiring and beautifully minimal demonstration of how small dense models can be trained and served efficiently from scratch, and it has been incredibly valuable for understanding the core mechanics of LLM training.
I’d like to propose an extension: supporting small Mixture-of-Experts (MoE) models in nanochat. While the current implementation focuses on dense architectures, MoE offers a compelling path to scaling model capacity, enabling a larger effective parameter count without a proportional increase in per-token compute.
An MoE version of nanochat could demonstrate:
- Lightweight expert routing (e.g., top-k gating; see the sketch after this list)
- Sparse activation and parameter updates
- Efficient parallelization of experts across devices (even on a single GPU)
- Comparative benchmarks between dense and sparse architectures at small scale
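
To make the routing point concrete, here is a minimal sketch of a top-k gated MoE feed-forward block in PyTorch. It is illustrative only: the module name `MoEMLP` and the parameters `n_experts` and `top_k` are my own placeholders, not part of nanochat, and a real implementation would also need a load-balancing auxiliary loss and a more efficient dispatch than the per-expert loop shown here.

```python
# Minimal sketch of a top-k gated MoE MLP block (illustrative, not nanochat code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEMLP(nn.Module):
    """Replaces a dense MLP with n_experts small MLPs and a top-k router."""

    def __init__(self, dim: int, hidden: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router produces one gating score per expert for every token.
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        tokens = x.reshape(-1, C)                        # (B*T, C)
        logits = self.router(tokens)                     # (B*T, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep only top-k experts per token
        weights = F.softmax(weights, dim=-1)             # normalize over the selected experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Sparse activation: only tokens routed to expert e pass through it.
            mask = idx == e                              # (B*T, top_k)
            if mask.any():
                token_ids, slot_ids = mask.nonzero(as_tuple=True)
                out[token_ids] += weights[token_ids, slot_ids, None] * expert(tokens[token_ids])
        return out.reshape(B, T, C)


# Usage example: 2 sequences of 8 tokens with model dimension 32.
if __name__ == "__main__":
    layer = MoEMLP(dim=32, hidden=128, n_experts=4, top_k=2)
    y = layer(torch.randn(2, 8, 32))
    print(y.shape)  # torch.Size([2, 8, 32])
```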
This would not only make nanochat even more educational, but also position it as a minimal reference implementation of MoE, something that is still hard to find in the open-source ecosystem.
Thanks again for your groundbreaking work.
Best Regards,
Adonishong