RYS-XLarge: Critical Editorial
💬 Opinion · Mar 10, 2026 · 8 min read

Our Honest Take on "LLM Neuroanatomy": Clever Hack Tops Old Leaderboard, But It's Not a Paradigm Shift

Verdict at a glance

  • Genuinely impressive: A solo researcher with two gaming GPUs reached #1 on the mid-2024 Hugging Face Open LLM Leaderboard by duplicating seven middle layers of a 72B model without any weight changes or training — a creative systems-level optimization that most teams overlooked.
  • Disappointing: The achievement is tied to an outdated leaderboard version that has since been replaced with significantly harder benchmarks; the technique is a narrow, model-specific hack rather than a generalizable advance in understanding or scaling LLMs.
  • Who it's for: Interpretability hobbyists, efficiency tinkerers running large models on consumer hardware, and mechanistic interpretability researchers looking for fresh (if anecdotal) inspiration.
  • Price/performance verdict: Effectively "free" performance on that specific snapshot of the leaderboard, but the engineering effort required makes it impractical for production; real gains remain marginal compared to proper fine-tuning or newer base models.

What's actually new

The core contribution is the discovery and exploitation of what the author calls "LLM Neuroanatomy." By building a homemade "brain scanner" for Transformer internals, David Noel Ng observed that middle layers appear to perform abstract reasoning largely decoupled from input/output formats.

Clue #1 came from Base64 experiments: 2023-era models could decode Base64 inputs, reason about the underlying question, and re-encode the answer. This suggested early layers act as "translators" to an internal abstract representation, late layers as "writers" back to tokens, and middle layers as the "thinker."
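The Base64 probe is easy to picture with a short sketch. This is not the author's code, just a minimal illustration of the round-trip the experiment checks, using Python's standard `base64` module; the model call is simulated, since the point is only what gets encoded and decoded at each end.

```python
import base64

# Encode a plain-English question the way the Base64 probe feeds it to a model.
question = "What is the capital of France?"
encoded = base64.b64encode(question.encode("utf-8")).decode("ascii")

# Per the post, a 2023-era model given only `encoded` could decode it
# internally, reason about it, and reply in Base64. We simulate that reply
# here to show what the experimenter would verify on the output side.
simulated_reply = base64.b64encode(b"Paris").decode("ascii")
decoded_reply = base64.b64decode(simulated_reply).decode("utf-8")
print(decoded_reply)  # Paris
```

If the decoded reply is a correct answer to the original question, the early layers must have mapped Base64 into the same internal representation plain text reaches, which is the "translator/thinker/writer" reading.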

Clue #2 was the bizarre architecture of Goliath-120B, a 2023 merge that alternated layers from two fine-tuned Llama-2 70B models and fed later-layer outputs into earlier layers of the other model. This violated conventional wisdom about layer ordering yet still performed decently.

These observations led to the key experiment: taking an existing 72B model, identifying seven particularly "thinking-heavy" middle layers via activation analysis, and duplicating them in place. The resulting model — dnhkng/RYS-XLarge — topped the then-current Open LLM Leaderboard across IFEval, BBH, MATH Lvl 5, GPQA, MuSR, and MMLU-PRO without any gradient updates or weight merging.

The method is purely architectural: duplicate selected blocks and stitch them back. The author emphasizes no weights were changed. This is a genuine, if quirky, contribution to the small but growing body of work on test-time architecture modification and layer-level specialization in LLMs.
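The stitching itself can be sketched in a few lines. The post does not disclose which seven layers were chosen, so the indices and the 80-layer depth below are assumptions (80 is typical for 72B-class dense models); layers are represented by their indices rather than real transformer blocks to keep the mechanics visible.

```python
# Schematic sketch of in-place layer duplication as described in the post:
# no weights change, a selected middle block is simply repeated in the stack.
# The start index and depth here are illustrative, not from the post.

def duplicate_layers(layers, start, count):
    """Return a new stack with layers[start:start+count] repeated once."""
    block = layers[start:start + count]
    return layers[:start + count] + block + layers[start + count:]

# Toy 80-layer stack (assumed depth), each layer stood in for by its index.
stack = list(range(80))
expanded = duplicate_layers(stack, start=40, count=7)  # hypothetical mid-stack block

print(len(expanded))  # 87: same weights, seven blocks now run twice
```

In a real Hugging Face checkpoint the analogous move is rebuilding the model's decoder-layer list with the chosen blocks repeated, which is why the author can truthfully say no weights were changed.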

The hype check

The title "How I Topped the HuggingFace Open LLM Leaderboard" is technically accurate for that moment in mid-2024, but the post's framing as a profound "neuroanatomy" discovery overstates the rigor. The piece repeatedly notes the finding is unpublished because "blogging is way more fun than drafting scientific papers." This is charming but also a red flag: the claims lack the statistical controls, ablation studies, and peer review that would elevate this from clever anecdote to reliable science.

Marketing-like language around "pure, abstract reasoning" in middle layers and "LLM Neuroanatomy" sounds insightful but rests on correlational observations from one model family. The Base64 trick, while fun, was already known in jailbreak circles; the insight that middle layers do core cognition is consistent with prior mechanistic interpretability work (e.g., induction heads, residual stream analysis) even if the specific duplication method is novel.

The leaderboard itself has been overhauled. Hugging Face has since released a tougher version with new benchmarks that reordered rankings dramatically; some models moved 59 places. A top score on the old leaderboard is no longer strong evidence of superior capability. The post is dated "Mar 10, 2026", likely a typo for 2025 or 2024, which further dates the claim.

Real-world implications

For individual researchers and small teams, this validates that significant leaderboard movement is still possible through clever inference-time or architecture hacks rather than massive compute. It particularly benefits people running large models on limited hardware (the author used two gaming GPUs), showing that systems optimization can sometimes beat raw scale.

The idea of identifying and amplifying "thinking" layers could inspire more targeted model surgery — perhaps dynamic layer routing, test-time layer duplication for hard problems, or better understanding of where reasoning happens for debugging and safety work.
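To make the dynamic-duplication idea concrete, here is a speculative sketch, not anything from the post: re-run the "thinking" block only on inputs a cheap heuristic flags as hard. The `forward` function, the toy layers, and the difficulty flag are all hypothetical.

```python
# Speculative: conditional test-time duplication. The post only duplicates
# layers statically; this sketches the dynamic variant suggested above.

def forward(layers, x, thinking_block, is_hard):
    """Run the stack; on hard inputs, replay the thinking block once more."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if is_hard and i == thinking_block[-1]:
            for j in thinking_block:  # second pass through the same weights
                x = layers[j](x)
    return x

# Toy stack: layer k adds k, standing in for a transformer block.
layers = [lambda x, k=k: x + k for k in range(4)]
easy = forward(layers, 0, thinking_block=[1, 2], is_hard=False)  # 0+0+1+2+3 = 6
hard = forward(layers, 0, thinking_block=[1, 2], is_hard=True)   # extra +1+2 = 9
print(easy, hard)
```

The appeal is that the extra compute is paid only when the heuristic fires, rather than on every token as with static duplication.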

However, the use cases this unlocks are mostly research-oriented. Production deployments care about consistent gains across updated benchmarks, latency, memory, and reliability. Duplicating layers increases both parameter count and compute (roughly 10% more FLOPs for seven duplicated layers in a 72B model), trading efficiency for performance on specific tasks.
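The "roughly 10%" figure is a back-of-envelope estimate that checks out if one assumes the 72B base model has about 80 roughly uniform decoder layers (the post does not state the depth):

```python
# Overhead from duplicating 7 of ~80 approximately uniform layers.
# Per-layer FLOPs are assumed equal, which is close to true for a
# homogeneous dense decoder stack; the layer count is an assumption.
total_layers = 80
duplicated = 7
overhead_pct = 100 * duplicated / total_layers
print(overhead_pct)  # 8.75, consistent with the "roughly 10%" claim
```

The same fraction applies to parameters and memory at inference time, since the duplicated blocks share weights only conceptually; each extra pass still costs full compute.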

Limitations they're not talking about

Several caveats are underplayed:

  1. Leaderboard obsolescence: The old leaderboard was known to be gameable. The new version uses harder benchmarks precisely because earlier rankings failed to correlate well with real capability.

  2. Lack of generality: The technique was tuned to one specific 72B model. There's no evidence it works across model families, sizes, or post-2024 architectures (e.g., newer Mixture-of-Experts, state-space models, or heavily post-trained reasoning models).

  3. No ablation depth: While the author mentions months of hacking, the post doesn't provide comprehensive ablations — what happens if you duplicate different layers, different numbers, early vs late? How sensitive is performance to which exact seven layers? Without this, it's hard to distinguish signal from lucky cherry-picking.

  4. Inference cost: Duplicating layers increases memory and latency. On gaming GPUs this may be acceptable for occasional use, but it's not a free lunch.

  5. Mechanistic claims exceed evidence: Asserting that middle layers perform "pure abstract reasoning in a representation that has nothing to do with any human language" is a strong claim. The experiments show performance correlation but don't prove the internal representation is format-invariant in the way described. Modern interpretability tools (sparse autoencoders, circuit analysis) would be needed for stronger claims.

  6. Reproducibility: The post is narrative-driven rather than a clean methods section. Exact layer indices, the precise "brain scanner" methodology, and full benchmark numbers aren't presented in a way that makes immediate replication trivial.

How it stacks up

Compared to standard approaches at the time, this beat fine-tuned and merged models on the old leaderboard without training. That's notable. However, it doesn't compete with today's frontier open models (Qwen2-72B-Instruct, Llama-3.1-70B, or newer reasoning-optimized models) on current benchmarks.

It sits in the same category as other architectural tricks — speculative decoding, early exiting, mixture-of-depths, or layer pruning — that offer modest gains at the cost of complexity. The closest analogue is work on "model surgery" and test-time scaling, but those usually involve more principled methods (e.g., DeepSeek's dynamic routing or academic layer-reuse papers).

Constructive suggestions

The author should prioritize turning the blog post into a proper paper or technical report with:

  • Full experimental details and code release
  • Systematic ablations across layer choices and model families
  • Results on the current Open LLM Leaderboard or other standardized suites
  • Comparison against strong baselines (continued pretraining, LoRA, model merging)

The "brain scanner" tool sounds promising — open-sourcing it with clear methodology would benefit the interpretability community more than the narrative alone.

Future work could explore whether identified "thinking layers" can be shared across models, whether dynamic duplication at inference time (only on hard prompts) improves efficiency, or whether this insight helps with safety (e.g., monitoring middle-layer activations for deception).

Finally, the community would benefit from less romantic framing ("neuroanatomy," "pure abstract reasoning") and more precise language about what was measured.

Our verdict

This is a delightful example of independent research punching above its weight through creativity and persistence. The core idea — that you can meaningfully optimize LLMs by understanding and manipulating layer specialization without touching weights — is worth attention and follow-up. However, as a claimed breakthrough in understanding or practical capability, it falls short due to the dated leaderboard, limited generality, and anecdotal presentation.

  • Adopt now if you're an interpretability researcher or efficiency hacker who enjoys model surgery experiments on consumer hardware.
  • Wait if you're looking for production improvements; better to use current strong base models and proper optimization techniques.
  • Skip if you need robust, reproducible gains on modern benchmarks.

The real value is the spirit of curiosity and the reminder that there are still low-hanging empirical discoveries available to determined individuals. That's genuinely inspiring, even if the specific result doesn't rewrite the field.

FAQ

Should we switch from standard fine-tuning to layer duplication techniques?

No. Layer duplication is a curiosity-driven hack that worked once on an outdated leaderboard. Standard fine-tuning, continued pretraining, and model merging remain far more reliable and general. Use this as inspiration for interpretability experiments, not as a replacement workflow.

Is this worth the engineering effort for consumer GPU setups?

Only if your goal is leaderboard chasing on legacy benchmarks or pure research. Months of hacking for what amounts to a ~10% architecture tweak on one model family are a poor ROI for most practical applications. The "two gaming GPUs" angle is more marketing than scalable advantage.

Does this prove LLMs have identifiable "reasoning modules" we can amplify?

It provides weak, correlational evidence consistent with that hypothesis, but falls well short of proof. Stronger mechanistic interpretability work using modern tools would be needed to substantiate the "neuroanatomy" claims. Treat it as a provocative observation, not settled science.

Sources


Original Source

dnhkng.github.io
