Where neural scaling laws come from: a model-based theory of data structure
- 👤 Speaker: Francesco Cagnetta, Marie Skłodowska-Curie Fellow at SISSA, Trieste
- 📅 Date & Time: Tuesday 10 February 2026, 14:00 - 15:00
- 📍 Venue: DAMTP, room MR4
Abstract
Neural scaling laws reveal strikingly robust power-law relationships between the performance of language models and the amount of training data. Yet, a principled explanation of where the scaling exponent comes from—in terms of measurable properties of real data, rather than solvable surrogates that neglect representation learning effects—has remained elusive. In this talk, I introduce a model-based perspective on data structure grounded in random hierarchies: analytically tractable generative models designed to capture the hierarchical and compositional structure of natural language while retaining explicit control over important learning-related statistics. I will then present new work that, building on this framework, ties the scaling exponent observed in autoregressive language modelling to two fundamental, empirically accessible statistics of text: (i) how correlations between two tokens decay with their separation t, and (ii) how the conditional entropy of the next token decreases as a function of context length n. The core message is that the representation-learning mechanism we identified by studying how deep learning methods learn random hierarchies provides the missing link from these descriptive statistics to quantitative predictions, as it yields a concrete formula for the scaling exponent in terms of the joint behaviour of these curves. The resulting prediction matches observed scaling remarkably well for modern neural architectures trained on large text corpora. This provides, to our knowledge, the first theory of neural scaling that depends only on intrinsic properties of the data and remains predictive in the regime of contemporary language modelling.
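The two statistics named in the abstract can be estimated directly from a tokenised corpus. The sketch below is a minimal illustration, not the speaker's method. It assumes the correlation at separation t is summarised by the Frobenius norm of the connected co-occurrence matrix, and that the conditional entropy at context length n is a plug-in estimate from n-gram counts. The function names `token_correlation` and `conditional_entropy` are hypothetical, and the talk may define both quantities differently.

```python
import math
from collections import Counter


def token_correlation(tokens, t):
    """Frobenius norm of P(x_i=a, x_{i+t}=b) - P(x_i=a) P(x_{i+t}=b).

    Assumed summary of "correlation between two tokens at separation t".
    """
    n = len(tokens) - t
    left, right = tokens[:n], tokens[t:]
    pair_counts = Counter(zip(left, right))
    left_counts = Counter(left)
    right_counts = Counter(right)
    total = 0.0
    for a, ca in left_counts.items():
        for b, cb in right_counts.items():
            c = pair_counts.get((a, b), 0) / n - (ca / n) * (cb / n)
            total += c * c
    return math.sqrt(total)


def conditional_entropy(tokens, n):
    """Plug-in estimate of H(next token | previous n tokens), in nats."""
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(n, len(tokens)):
        context = tuple(tokens[i - n:i])
        context_counts[context] += 1
        joint_counts[(context, tokens[i])] += 1
    total = len(tokens) - n
    h = 0.0
    for (context, _), k in joint_counts.items():
        h -= (k / total) * math.log(k / context_counts[context])
    return h


# Toy usage on a perfectly periodic sequence: one token of context
# removes all uncertainty about the next token.
toy = list("ab" * 50)
h0 = conditional_entropy(toy, 0)
h1 = conditional_entropy(toy, 1)
c1 = token_correlation(toy, 1)
```

Evaluating both functions over a range of t and n gives the two curves the abstract refers to. On real text the plug-in entropy estimate is biased downward once n grows large relative to the corpus size, so it is only a rough guide there.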
Series
This talk is part of the DAMTP Data Intensive Science Seminar series.