Suppose you are pretraining a BERT model with 8 layers, 768-dimensional hidden states, 8 attention heads, a feed-forward hidden layer of dimension 3072, and a sub-word vocabulary of size 40k. What will be the number of parameters of the model? You may ignore the bias terms and the parameters used for the final loss computation on top of the final encoder representation. The BERT model can take at most 512 tokens in the input.
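A short sketch of the arithmetic, under the question's stated simplifications (biases ignored) plus two further assumptions not stated in the question: LayerNorm gains are also ignored, and standard BERT token-type (segment) embeddings for 2 segments are included. Note the number of attention heads does not affect the count, since the per-head dimensions partition the 768-dim hidden state.

```python
# Parameter count for the configuration in the question:
# 8 layers, d_model = 768, FFN dim = 3072, 40k vocab, 512 max positions.
# Assumptions: no biases, no LayerNorm params (question allows ignoring
# extras); segment embeddings for 2 segment types included (standard BERT).

L, d, d_ff = 8, 768, 3072       # layers, hidden size, FFN hidden size
V, P, S = 40_000, 512, 2        # vocab size, max positions, segment types

embeddings = (V + P + S) * d    # token + position + segment embeddings

attention = 4 * d * d           # W_Q, W_K, W_V, W_O per layer
ffn = 2 * d * d_ff              # up- and down-projection per layer
per_layer = attention + ffn

total = embeddings + L * per_layer
print(f"per layer:  {per_layer:,}")    # 7,077,888
print(f"embeddings: {embeddings:,}")   # 31,114,752
print(f"total:      {total:,}")        # 87,737,856
```

So the model has roughly 87.7M parameters, with about 31.1M in the embedding tables and about 7.08M per encoder layer. If segment embeddings were excluded, the total would drop by only 2 × 768 = 1,536 parameters.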