Conversation
thomasw21 left a comment:
Some thoughts on the config. I'll update the README at the end so that we leave a trail with the conclusions we come up with.
NLAYERS=40
NHIDDEN=5120
NHEADS=32
I don't know why we chose 32. We seem to have updated the NHIDDEN value to 5120 because it is divisible by 128, and 5120 // 128 = 40.
cc @VictorSanh @stas00 @mryab (People who were involved in the original post)
FWIW, 530B training used:
NLAYERS=105
NHIDDEN=20480
NHEADS=128
So that's the same proportion as NHEADS=32 with NHIDDEN=5120.
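For what it's worth, here is a quick sanity check of that proportion in plain Python (the per-head-dimension framing is just my way of comparing the two configs):

```python
# Per-head dimension = NHIDDEN / NHEADS; both configs come out to 160,
# and both NHIDDEN values are also divisible by 128.
for nhidden, nheads in [(5120, 32), (20480, 128)]:
    assert nhidden % nheads == 0 and nhidden % 128 == 0
    print(f"NHIDDEN={nhidden} NHEADS={nheads} head_dim={nhidden // nheads}")
# NHIDDEN=5120 NHEADS=32 head_dim=160
# NHIDDEN=20480 NHEADS=128 head_dim=160
```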
Also, @TevenLeScao shared elsewhere a research paper showing that many heads were found to be quite redundant anyway.
I'm not sure if there is research comparing head size vs. number of heads in terms of performance.
--hidden-dropout 0.0 \
--attention-dropout 0.0 \
https://arxiv.org/abs/2010.11934 showed a strong performance loss when using dropout (table 4). Though that was an enc/dec architecture, there's probably no reason dropout would benefit our dec-only arch. We are currently evaluating this at the 1B3 scale: https://huggingface.co/bigscience/tr3o-1B3-pile-no-dropout-logs
--adam-beta2 0.95 \
--adam-eps 1e-8 \
--lr 6e-5 \
--lr 1e-4 \
The GPT3 paper suggests a higher learning rate. Is there a reason why we would use 6e-5?
--lr 1e-4 \
--min-lr 6e-6 \
--lr-decay-style cosine \
--lr-decay-samples 126_953_125 \
you removed this one w/o any commentary?
The original tr1-13B said:
We need lr-decay in samples, so tokens2samples = 260B / 2048 = 126_953_125
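For reference, the same conversion written out as a tiny Python snippet (just restating the arithmetic above):

```python
# --lr-decay-samples is expressed in samples, so convert the token budget:
# each sample is one full sequence of SEQ_LEN tokens.
TOKENS = 260_000_000_000  # 260B tokens
SEQ_LEN = 2048

lr_decay_samples = TOKENS // SEQ_LEN
print(f"{lr_decay_samples:_}")  # 126_953_125
```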
I was looking at setting it by default to the entire number of samples we have
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/fd1e1da967c74e598acfc011031474663ef5845e/megatron/training.py#L341
We have been using this in arch/scaling.
However, I've just re-read the GPT3 paper and they do it for 260B tokens ... so I'm not sure here. cc @TevenLeScao
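To make the trade-off concrete, here is a rough sketch of a standard cosine decay (the usual min_lr + 0.5*(max_lr - min_lr)*(1 + cos(pi*t/T)) form; Megatron's own lr scheduler may differ in details such as warmup, so treat this as illustrative only):

```python
import math

def cosine_lr(consumed_samples,
              max_lr=1e-4, min_lr=6e-6, decay_samples=126_953_125):
    """Illustrative cosine decay from --lr down to --min-lr over --lr-decay-samples."""
    frac = min(consumed_samples / decay_samples, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * frac))

# If decay_samples covers the whole run, the lr only reaches --min-lr at the very
# end; with a shorter horizon (e.g. the 260B-token figure above) it bottoms out
# earlier and then stays flat at --min-lr.
print(cosine_lr(0))            # 1e-4
print(cosine_lr(63_476_562))   # ~5.3e-5, halfway through the decay
print(cosine_lr(126_953_125))  # 6e-6
```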
Thank you for the note, Thomas - it's crucial that we leave a note trail, otherwise we have no idea why some config was added or removed.
--data-path $DATA_PATH \
--data-impl mmap \
--split 900,100,0 \
--split 950,50,0 \
We're currently using a small dataset, so I had to give valid a larger chunk. But for the real training this needs to be restored to the split above.
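For context, my understanding is that --split takes relative train/valid/test proportions that get normalized, so roughly (a quick sketch, not the actual Megatron code):

```python
def split_fractions(split):
    """Turn a --split string like '900,100,0' into train/valid/test fractions."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    return [w / total for w in weights]

print(split_fractions("900,100,0"))  # [0.9, 0.1, 0.0]  -> 10% validation
print(split_fractions("950,50,0"))   # [0.95, 0.05, 0.0] -> 5% validation
```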
This PR is for sorting out the tr10-104B config.