
[Feature Request] Extend nanochat to Support Training and Inference of Small MoE Models #289

@adonishong

Hi @karpathy,

First, thank you sincerely for creating nanochat: it's an inspiring and beautifully minimal demonstration of how a small dense model can be trained from scratch and run for inference efficiently. Your work has been incredibly valuable for understanding the core mechanics of LLM training.

I'd like to propose an extension: supporting small Mixture-of-Experts (MoE) models in nanochat. While the current implementation focuses on a dense architecture, MoE offers a compelling path to scaling model capacity efficiently: the total parameter count grows with the number of experts, while per-token compute stays roughly constant because only the top-k experts are active for each token.
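
To make that trade-off concrete, here is a rough back-of-the-envelope sketch. The dimensions (d_model = 768, d_ff = 3072, 8 experts, top-2 routing) are illustrative assumptions, not nanochat's actual configuration:

```python
# Rough parameter / per-token FLOP comparison of a dense FFN block
# versus an 8-expert, top-2 MoE FFN block (all sizes are assumptions).
d_model, d_ff = 768, 3072        # hidden size and FFN width (illustrative)
n_experts, top_k = 8, 2          # MoE configuration (illustrative)

dense_params = 2 * d_model * d_ff             # up- and down-projection weights
moe_params = n_experts * dense_params         # each expert is a full FFN
router_params = d_model * n_experts           # linear gating layer

dense_flops = 2 * dense_params                # ~2 FLOPs per weight per token
moe_flops = 2 * top_k * dense_params + 2 * router_params

print(f"dense FFN: {dense_params/1e6:.1f}M params, {dense_flops/1e6:.1f}M FLOPs/token")
print(f"MoE   FFN: {(moe_params + router_params)/1e6:.1f}M params, {moe_flops/1e6:.1f}M FLOPs/token")
# -> roughly 8x the parameters for only ~2x the per-token FFN compute
```

That is exactly the kind of relationship a small nanochat MoE could demonstrate empirically: capacity scales with the number of experts, per-token compute scales with top-k.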

A MoE version of nanochat could demonstrate:

  • Lightweight expert routing (e.g., top-k gating; see the sketch after this list)
  • Sparse activation and parameter updates
  • Efficient parallelization of experts across devices (even on a single GPU)
  • Comparative benchmarks between dense and sparse architectures at small scale
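
For the routing piece, a minimal sketch of what a top-k gated MoE MLP could look like in plain PyTorch is below. The class and argument names (MoEMLP, n_experts, top_k) are hypothetical and not taken from nanochat, and it loops over experts for clarity rather than speed:

```python
import torch
import torch.nn as nn

class MoEMLP(nn.Module):
    """Sketch of a top-k gated mixture-of-experts MLP (names are illustrative)."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        x_flat = x.reshape(-1, C)                                # (B*T, C)
        logits = self.gate(x_flat)                               # (B*T, n_experts)
        weights, indices = torch.topk(logits.softmax(dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize over chosen experts
        out = torch.zeros_like(x_flat)
        for e, expert in enumerate(self.experts):
            # tokens that routed to expert e in any of their top-k slots
            token_idx, slot_idx = torch.where(indices == e)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert(x_flat[token_idx])
        return out.reshape(B, T, C)
```

A real training setup would usually also add a small load-balancing auxiliary loss on the router probabilities so that tokens spread evenly across experts; that too would make a nice, self-contained teaching point.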

This would not only make nanochat even more educational, but also position it as a minimal reference implementation for MoE, something that is still rare in the open-source ecosystem.

Thanks again for your groundbreaking work.

Best Regards,
Adonishong
