Hi @karpathy,
First, sincere thanks for creating nanochat. It is an inspiring and beautifully minimal demonstration of how small dense models can be trained and served efficiently from scratch, and it has been incredibly valuable for understanding the core mechanics of LLM training.
I’d like to propose an extension: supporting small Mixture-of-Experts (MoE) models in nanochat. While the current implementation focuses on dense architectures, MoE offers a compelling path to scaling model capacity, enabling a larger effective parameter count without a proportional increase in per-token compute.
An MoE version of nanochat could demonstrate:
- Lightweight expert routing (e.g., top-k gating; see the sketch after this list)
- Sparse activation and parameter updates
- Efficient parallelization of experts across devices (even on a single GPU)
- Comparative benchmarks between dense and sparse architectures at small scale
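
To make the routing point concrete, here is a minimal sketch of a top-k gated MoE feed-forward block in PyTorch. It is illustrative only: the module name `MoEMLP` and the parameters `n_experts` and `top_k` are my own placeholders, not part of nanochat, and a real implementation would also need a load-balancing auxiliary loss and a more efficient dispatch than the per-expert loop shown here.

```python
# Minimal sketch of a top-k gated MoE MLP block (illustrative, not nanochat code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEMLP(nn.Module):
    """Replaces a dense MLP with n_experts small MLPs and a top-k router."""

    def __init__(self, dim: int, hidden: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router produces one gating score per expert for every token.
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        tokens = x.reshape(-1, C)                        # (B*T, C)
        logits = self.router(tokens)                     # (B*T, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep only top-k experts per token
        weights = F.softmax(weights, dim=-1)             # normalize over the selected experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Sparse activation: only tokens routed to expert e pass through it.
            mask = idx == e                              # (B*T, top_k)
            if mask.any():
                token_ids, slot_ids = mask.nonzero(as_tuple=True)
                out[token_ids] += weights[token_ids, slot_ids, None] * expert(tokens[token_ids])
        return out.reshape(B, T, C)


# Usage example: 2 sequences of 8 tokens with model dimension 32.
if __name__ == "__main__":
    layer = MoEMLP(dim=32, hidden=128, n_experts=4, top_k=2)
    y = layer(torch.randn(2, 8, 32))
    print(y.shape)  # torch.Size([2, 8, 32])
```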
This would not only make nanochat even more educational, but also position it as a minimal reference implementation of MoE, something that is still hard to find in the open-source ecosystem.
Thanks again for your groundbreaking work.
Best Regards,
Adonishong