Suppose you are pretraining a BERT model with 8 layers, 768-dimensional hidden states, 8 attention heads, and a sub-word vocabulary of size 40k. The feed-forward hidden layer has dimension 3072, and the model can take at most 512 tokens as input. What will be the number of parameters of the model? You can ignore the bias terms and the parameters used for the loss computation on the final encoder representation.

asked by guest on Sep 20, 2024 at 10:17 am

Mathbot Says...

I wasn't able to parse your question, but the HE.NET team is hard at work making me smarter.
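For reference, here is a minimal sketch of one way to count the parameters for the configuration described in the question. It assumes the standard BERT encoder layout (word, position, and segment embeddings; Q/K/V/output attention projections; a two-matrix feed-forward block per layer) and, as the question allows, ignores biases, LayerNorm parameters, and the output head. The variable names are illustrative, not from any library.

```python
num_layers = 8        # encoder layers
d_model = 768         # hidden size
d_ff = 3072           # feed-forward hidden size
vocab_size = 40_000   # sub-word vocabulary
max_positions = 512   # maximum input length
num_segments = 2      # token-type (segment) table, standard in BERT (assumed here)

# Embedding tables. Note the number of attention heads does not affect the
# count: the heads just partition the same d_model-sized projections.
word_emb = vocab_size * d_model        # 30,720,000
pos_emb = max_positions * d_model      #    393,216
seg_emb = num_segments * d_model       #      1,536
embeddings = word_emb + pos_emb + seg_emb

# Per encoder layer: Q, K, V, and output projections (each d_model x d_model),
# plus the two feed-forward matrices (d_model x d_ff and d_ff x d_model).
attention = 4 * d_model * d_model      # 2,359,296
ffn = 2 * d_model * d_ff               # 4,718,592
per_layer = attention + ffn            # 7,077,888

total = embeddings + num_layers * per_layer
print(f"{total:,}")  # 87,737,856
```

Under these assumptions the count comes to 87,737,856 parameters; dropping the segment table, which the question does not mention, gives 87,736,320, so either way roughly 87.7M.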