Conversation
thomasw21 left a comment:
Some thoughts on the config. I'll update the README at the end so that we leave a trail with the conclusions we come up with.
NLAYERS=40
NHIDDEN=5120
NHEADS=32
I don't know why we chose 32. We seem to have updated the NHIDDEN value to 5120 because it is divisible by 128, and 5120 // 128 = 40.
cc @VictorSanh @stas00 @mryab (People who were involved in the original post)
FWIW, 530B training used:
NLAYERS=105
NHIDDEN=20480
NHEADS=128
So that's the same proportion as NHEADS=32 with NHIDDEN=5120.
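For what it's worth, here is a quick sanity check of that proportion in plain Python (the per-head-dimension framing is just my way of comparing the two configs):

```python
# Per-head dimension = NHIDDEN / NHEADS; both configs come out to 160,
# and both NHIDDEN values are also divisible by 128.
for nhidden, nheads in [(5120, 32), (20480, 128)]:
    assert nhidden % nheads == 0 and nhidden % 128 == 0
    print(f"NHIDDEN={nhidden} NHEADS={nheads} head_dim={nhidden // nheads}")
# NHIDDEN=5120 NHEADS=32 head_dim=160
# NHIDDEN=20480 NHEADS=128 head_dim=160
```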
Also, @TevenLeScao shared elsewhere a research paper showing that many heads were found to be quite redundant anyway.
I'm not sure if there is research comparing head size vs. number of heads in terms of performance.
--hidden-dropout 0.0 \
--attention-dropout 0.0 \
https://arxiv.org/abs/2010.11934 showed a strong performance loss when using dropout (table 4). Though that was an enc/dec architecture, there's probably no reason dropout would benefit our dec-only arch. We are currently evaluating this at the 1B3 scale: https://huggingface.co/bigscience/tr3o-1B3-pile-no-dropout-logs
--adam-beta2 0.95 \
--adam-eps 1e-8 \
--lr 6e-5 \
--lr 1e-4 \
The GPT3 paper suggests a higher learning rate. Is there a reason why we would use 6e-5?
--lr 1e-4 \
--min-lr 6e-6 \
--lr-decay-style cosine \
--lr-decay-samples 126_953_125 \
you removed this one w/o any commentary?
The original tr1-13B said:
We need lr-decay in samples, so tokens2samples = 260B / 2048 = 126_953_125
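For reference, the same conversion written out as a tiny Python snippet (just restating the arithmetic above):

```python
# --lr-decay-samples is expressed in samples, so convert the token budget:
# each sample is one full sequence of SEQ_LEN tokens.
TOKENS = 260_000_000_000  # 260B tokens
SEQ_LEN = 2048

lr_decay_samples = TOKENS // SEQ_LEN
print(f"{lr_decay_samples:_}")  # 126_953_125
```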
I was looking at setting it by default to the entire number of samples we have
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/fd1e1da967c74e598acfc011031474663ef5845e/megatron/training.py#L341
We have been using this in arch/scaling.
However, I've just re-read the GPT3 paper and they do it for 260B tokens ... so I'm not sure here. cc @TevenLeScao
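To make the trade-off concrete, here is a rough sketch of a standard cosine decay (the usual min_lr + 0.5*(max_lr - min_lr)*(1 + cos(pi*t/T)) form; Megatron's own lr scheduler may differ in details such as warmup, so treat this as illustrative only):

```python
import math

def cosine_lr(consumed_samples,
              max_lr=1e-4, min_lr=6e-6, decay_samples=126_953_125):
    """Illustrative cosine decay from --lr down to --min-lr over --lr-decay-samples."""
    frac = min(consumed_samples / decay_samples, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * frac))

# If decay_samples covers the whole run, the lr only reaches --min-lr at the very
# end; with a shorter horizon (e.g. the 260B-token figure above) it bottoms out
# earlier and then stays flat at --min-lr.
print(cosine_lr(0))            # 1e-4
print(cosine_lr(63_476_562))   # ~5.3e-5, halfway through the decay
print(cosine_lr(126_953_125))  # 6e-6
```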
Thank you for the note, Thomas - it's crucial that we leave a note trail, otherwise we have no idea why some config was added or removed.
--data-path $DATA_PATH \
--data-impl mmap \
--split 900,100,0 \
--split 950,50,0 \
We're currently using a small dataset, so I had to give valid a larger chunk. But for the real training this needs to be restored to the split above.
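For context, my understanding is that --split takes relative train/valid/test proportions that get normalized, so roughly (a quick sketch, not the actual Megatron code):

```python
def split_fractions(split):
    """Turn a --split string like '900,100,0' into train/valid/test fractions."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    return [w / total for w in weights]

print(split_fractions("900,100,0"))  # [0.9, 0.1, 0.0]  -> 10% validation
print(split_fractions("950,50,0"))   # [0.95, 0.05, 0.0] -> 5% validation
```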
This PR is for sorting out the tr10-104B config.