
Where neural scaling laws come from: a model-based theory of data structure


If you have a question about this talk, please contact Sven Krippendorf.

Neural scaling laws reveal strikingly robust power-law relationships between the performance of language models and the amount of training data. Yet, a principled explanation of where the scaling exponent comes from—in terms of measurable properties of real data, rather than solvable surrogates that neglect representation learning effects—has remained elusive. In this talk, I introduce a model-based perspective on data structure grounded in random hierarchies: analytically tractable generative models designed to capture the hierarchical and compositional structure of natural language while retaining explicit control over important learning-related statistics. I will then present new work that, building on this framework, ties the scaling exponent observed in autoregressive language modelling to two fundamental, empirically accessible statistics of text: (i) how correlations between two tokens decay with their separation t, and (ii) how the conditional entropy of the next token decreases as a function of context length n. The core message is that the representation-learning mechanism we identified by studying how deep learning methods learn random hierarchies provides the missing link from these descriptive statistics to quantitative predictions, as it yields a concrete formula for the scaling exponent in terms of the joint behaviour of these curves. The resulting prediction matches observed scaling remarkably well for modern neural architectures trained on large text corpora. This provides, to our knowledge, the first theory of neural scaling that depends only on intrinsic properties of the data and remains predictive in the regime of contemporary language modelling.
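The abstract refers to two empirically accessible statistics of text: the decay of two-token correlations with separation t, and the decrease of conditional next-token entropy with context length n. The talk does not specify which estimators are used; the following is a minimal sketch, under assumed definitions, of how one might estimate both curves from a token sequence (the function names and the particular correlation measure are illustrative choices, not the speaker's).

```python
import math
from collections import Counter

def pair_correlation(tokens, t):
    """Estimate a two-point correlation at separation t:
    sum over token pairs (a, b) of [P(x_i=a, x_{i+t}=b) - P(a)P(b)]^2.
    (One common choice of correlation measure; assumed, not from the talk.)"""
    n_pairs = len(tokens) - t
    joint = Counter(zip(tokens[:-t], tokens[t:]))   # empirical joint counts
    marg = Counter(tokens)                          # empirical marginals
    total = len(tokens)
    c = 0.0
    for (a, b), k in joint.items():
        c += (k / n_pairs - (marg[a] / total) * (marg[b] / total)) ** 2
    return c

def conditional_entropy(tokens, n):
    """Empirical entropy (in nats) of the next token given the
    previous n tokens: H = -sum p(c, x) * log p(x | c)."""
    ctx = Counter()
    joint = Counter()
    for i in range(len(tokens) - n):
        c = tuple(tokens[i:i + n])
        ctx[c] += 1
        joint[(c, tokens[i + n])] += 1
    total = sum(joint.values())
    h = 0.0
    for (c, _), k in joint.items():
        h -= (k / total) * math.log(k / ctx[c])
    return h

# Sanity check on a perfectly periodic sequence: strong pair correlation,
# and zero conditional entropy once one token of context is given.
seq = ["a", "b"] * 50
print(pair_correlation(seq, 1))      # well above zero for periodic text
print(conditional_entropy(seq, 1))   # 0.0: next token is determined
```

In practice both curves would be estimated on a large corpus and fit to power laws, whose exponents the talk's formula then combines into a predicted scaling exponent.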

This talk is part of the DAMTP Data Intensive Science Seminar series.
