LLMSurgeon: Diagnosing Data Mixture of Large Language Models

Luo 2026-05-28
Yaxin Luo, Jiacheng Cui, Xiaohan Zhao

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level
