The fact that this worked, and more specifically, that only circuit-sized blocks work, tells us how Transformers organise themselves during training. I now believe they develop a genuine functional anatomy. Early layers encode. Late layers decode. And in the middle, they build circuits: coherent, multi-layer processing units that perform complete cognitive operations. These circuits are indivisible. You can’t speed up a recipe by photocopying one step. But you can run the whole recipe twice.
The beginning of LLM Neuroanatomy?

Before settling on block duplication, I tried something simpler: take a single middle layer and repeat it $n$ times. If the “more reasoning depth” hypothesis was correct, this should work. It seemed plausible too, given the broad boost in math guesstimate results from duplicating intermediate layers. Give the model extra copies of a particular reasoning layer, get better reasoning. So I screened every layer, looking for a boost.
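To make the two setups concrete, here is a minimal sketch of the layer execution orders involved. It treats a model as nothing more than a list of layer indices, so the function names (`repeat_single_layer`, `duplicate_block`, `screen_single_layers`) and the toy 6-layer model are assumptions for illustration, not the code used for the actual experiments.

```python
from typing import Iterator, List, Tuple


def repeat_single_layer(num_layers: int, layer: int, n: int) -> List[int]:
    """Execution order when one layer runs n times in a row (n >= 1)."""
    if not 0 <= layer < num_layers:
        raise ValueError("layer index out of range")
    if n < 1:
        raise ValueError("n must be at least 1")
    order = list(range(num_layers))
    # Everything before the layer, the layer n times, everything after it.
    return order[:layer] + [layer] * n + order[layer + 1:]


def duplicate_block(num_layers: int, start: int, end: int) -> List[int]:
    """Execution order when the block [start, end) runs twice back to back."""
    if not 0 <= start < end <= num_layers:
        raise ValueError("invalid block bounds")
    order = list(range(num_layers))
    # Run through the end of the block, replay the block, then continue.
    return order[:end] + order[start:end] + order[end:]


def screen_single_layers(num_layers: int, n: int) -> Iterator[Tuple[int, List[int]]]:
    """Yield (layer, order) for every candidate layer, ready to be scored."""
    for layer in range(num_layers):
        yield layer, repeat_single_layer(num_layers, layer, n)


if __name__ == "__main__":
    # Hypothetical 6-layer model, purely for illustration.
    print(repeat_single_layer(6, 2, 3))  # [0, 1, 2, 2, 2, 3, 4, 5]
    print(duplicate_block(6, 2, 4))      # [0, 1, 2, 3, 2, 3, 4, 5]
```

In a real model each index would select a Transformer layer to run on the hidden state, and each order from `screen_single_layers` would be scored on a benchmark; that evaluation step is deliberately left out here.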