(LLM), which is trained on an extensive corpus of text documents. The primary objective of an LLM is to accurately predict the subsequent word in a given text sequence. Because training these models requires considerable computational resources, it is crucial to understand how performance gains correlate with both model size, measured by the number of parameters (N), and the volume of training data, measured by the number of text tokens (D). In a study by Hoffmann et al. (2022), the prediction error (L) of a particular LLM on a given dataset was described by the power-law equation:

L = 1.61 + 406.4 N^(-0.34) + 410.7 D^(-0.28)

The constant 1.61 reflects the inherent variability in the dataset; perfect prediction is unattainable even with an ideal model. The terms involving N and D show that increasing either variable reduces the prediction error, signifying improved performance.

Suppose researchers wish to apply this scaling law in practice and have already trained a model with N = 10³ and D = 10⁹. They are now evaluating their next step. Which option leads to the greatest improvement in prediction error according to this scaling law? [1 point]

- Increase the model's size (N) twofold, while maintaining a constant data size (D)
- Double the data size (D), with the model size remaining the same
- Insufficient information to make a decision
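One way to reason about the question is to evaluate the scaling law directly at the stated values. The sketch below (using the equation and the values N = 10³, D = 10⁹ given above) compares the predicted error for each option:

```python
# Scaling law from Hoffmann et al. (2022), as stated above:
# L = 1.61 + 406.4 * N**(-0.34) + 410.7 * D**(-0.28)
def loss(N, D):
    """Predicted error for N parameters and D training tokens."""
    return 1.61 + 406.4 * N**-0.34 + 410.7 * D**-0.28

N, D = 1e3, 1e9
base = loss(N, D)
double_N = loss(2 * N, D)   # option 1: double the model size
double_D = loss(N, 2 * D)   # option 2: double the data size

print(f"baseline:  L = {base:.3f}")
print(f"double N:  L = {double_N:.3f}  (improvement {base - double_N:.3f})")
print(f"double D:  L = {double_D:.3f}  (improvement {base - double_D:.3f})")
```

Because the N-term dominates at these values (406.4 · (10³)^(-0.34) is far larger than 410.7 · (10⁹)^(-0.28)), halving-style multiplicative changes to N move L much more than the same change to D.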