Approximating Language Model Training Data from Weights

June 21, 2025

by Leonardo

1. Approximating Language Model Training Data from Weights

Some LLMs are open-weights but not open-data. Given access to the initial model parameters (the model state before finetuning) and knowledge of the optimizer used (e.g. SGD vs. Adam), we seek to approximate the training data used for finetuning.

Formally, we assume access to some training algorithm $Τ$ that, starting from initial parameters $𝜃_{0}$, solves an optimization problem:

𝜃 = Τ (𝐿, 𝜃_{0}, 𝐷) = \arg \min_{𝜃} 𝐸_{𝑥 \sim 𝐷} [𝐿 (𝑥, 𝜃)]

Given the finetuned model parameters $𝜃_{𝑓}$, our aim is to find

𝐷^{*} = \arg \min_{𝐷} ‖ 𝜃_{𝑓} - Τ (𝐿, 𝜃_{0}, 𝐷) ‖ = \arg \min_{𝐷} ‖ 𝜃_{𝑓} - \arg \min_{𝜃} 𝐸_{𝑥 \sim 𝐷} [𝐿 (𝑥, 𝜃)] ‖

We cannot optimize this objective directly, for two reasons. First, training a model on each candidate dataset is expensive. Second, computing the loss requires a non-differentiable lookup operation that converts a token sequence into a sequence of dense embedding vectors, so typical dataset distillation approaches no longer apply.

1.1. Method: SELECT

We constrain the problem to data selection instead of data generation: given a large corpus of text data, we search for a small set of datapoints that, after training, produce a model close to the final model.

We can express this goal as a search for data $𝑥$ whose loss gradient (a vector in parameter space) has the largest projection onto the parameter difference $𝜃_{𝑓} - 𝜃_{0}$.

𝑥^{*} = \arg \max_{𝑥 \in 𝐷} [\nabla_{𝑥} 𝐿 (𝑥, 𝜃_{0}) \cdot (𝜃_{𝑓} - 𝜃_{0})]

A naive solution to this problem is to take the examples whose gradients have the highest similarity to the parameter difference. In practice, however, this yields highly redundant samples because it ignores batch-level interactions: when training with stochastic gradient descent, we typically take steps using gradients summed across multiple examples.
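The naive top-k baseline can be sketched as follows. This is a minimal illustration, not the author's implementation: the per-example gradients are assumed to be precomputed and passed in as plain lists, and the function name `top_k_by_projection` and the toy data are hypothetical. Scores follow the formula as written above.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_by_projection(grads, theta_0, theta_f, k):
    """Rank examples by the inner product of their gradient with theta_f - theta_0.

    grads: list of per-example gradient vectors (assumed precomputed at theta_0).
    Returns the indices of the k highest-scoring examples and all scores.
    """
    diff = [f - z for f, z in zip(theta_f, theta_0)]
    scores = [dot(g, diff) for g in grads]
    ranked = sorted(range(len(grads)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Hypothetical toy data: examples 0 and 1 have nearly identical gradients.
grads = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
top, scores = top_k_by_projection(grads, [0.0, 0.0], [1.0, 0.0], k=2)
print(top)  # [0, 1]
```

In this toy case the two selected examples are near-duplicates, which is the redundancy the paragraph above describes.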

In light of this, we instead search for the set of points whose total gradient points in the direction of the parameter difference:

\arg \max_{𝐵 \subseteq 𝐷} [\sum_{𝑥 \in 𝐵} \nabla_{𝑥} 𝐿 (𝑥, 𝜃_{0}) \cdot (𝜃_{𝑓} - 𝜃_{0})]

Solving for $𝐵$ exactly requires enumerating all possible subsets of $𝐷$, which is intractable in general. However, the batch search objective is submodular: it exhibits the diminishing returns property, meaning the marginal gain of adding a new datapoint decreases as the batch grows. Submodular maximization is known to admit an efficient, near-optimal greedy solution.
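A greedy batch search might look like the sketch below. One loud assumption: the batch is scored by the cosine similarity between the summed gradient and the target direction. The text does not specify this normalization, and the sketch makes no claim that this particular score is submodular. The name `greedy_select` and the toy data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is zero."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def greedy_select(grads, target, k):
    """Greedily add the example that most improves the alignment of the
    running gradient sum with `target` (assumed score: cosine similarity)."""
    selected = []
    total = [0.0] * len(target)
    remaining = list(range(len(grads)))
    for _ in range(min(k, len(grads))):
        best_i, best_score = None, -math.inf
        for i in remaining:
            candidate = [t + g for t, g in zip(total, grads[i])]
            score = cosine(candidate, target)
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        remaining.remove(best_i)
        total = [t + g for t, g in zip(total, grads[best_i])]
    return selected

# Hypothetical toy data: examples 0 and 1 are near-duplicates.
grads = [[1.0, 0.0], [1.0, 0.05], [0.0, 0.9]]
batch = greedy_select(grads, [1.0, 1.0], k=2)
print(batch)  # [1, 2]
```

On this toy input, ranking by individual inner product would pick the two near-duplicate examples (1 and 0), while the greedy search picks example 1 and then the complementary example 2.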

State-of-the-art dataset distillation approaches achieve more effective distillation with gradients that match trajectories of several final model checkpoints $𝜃_{𝑗}, 𝑗 \in [1, 𝑃]$. Having only two checkpoints puts us at a significant disadvantage, because an example's gradient at the beginning of training may point in a different direction later in the optimization process. To make up for our lack of additional model checkpoints, we create synthetic checkpoints by linearly interpolating between the initial and final model:

{\hat{𝜃}}_{𝑗} = \frac{𝑗}{𝑃} 𝜃_{0} + (1 - \frac{𝑗}{𝑃}) 𝜃_{𝑓}

where $𝑃$ is the desired number of synthetic checkpoints. We then search for the batch of examples with a gradient that is most aligned, on average, with the direction of the synthetic checkpoints:

\arg \max_{𝐵 \subseteq 𝐷} [\sum_{𝑗 = 1}^{𝑃} \sum_{𝑥 \in 𝐵} \nabla_{𝑥} 𝐿 (𝑥, {\hat{𝜃}}_{𝑗}) \cdot (𝜃_{𝑓} - {\hat{𝜃}}_{𝑗})]
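The interpolation step can be sketched in a few lines. This is an illustration only: parameters are represented as flat Python lists, and the function names and toy values are hypothetical. The weights follow the interpolation formula exactly as written above, so the checkpoint with $𝑗 = 𝑃$ coincides with $𝜃_{0}$.

```python
def synthetic_checkpoints(theta_0, theta_f, P):
    """Return [theta_hat_1, ..., theta_hat_P] with
    theta_hat_j = (j/P) * theta_0 + (1 - j/P) * theta_f."""
    checkpoints = []
    for j in range(1, P + 1):
        w = j / P
        checkpoints.append([w * a + (1.0 - w) * b for a, b in zip(theta_0, theta_f)])
    return checkpoints

def target_directions(theta_f, checkpoints):
    """Direction from each synthetic checkpoint to the final model."""
    return [[f - c for f, c in zip(theta_f, ckpt)] for ckpt in checkpoints]

# Hypothetical toy parameters.
theta_0 = [0.0, 0.0]
theta_f = [4.0, 8.0]
ckpts = synthetic_checkpoints(theta_0, theta_f, P=4)
dirs = target_directions(theta_f, ckpts)
print(ckpts)  # [[3.0, 6.0], [2.0, 4.0], [1.0, 2.0], [0.0, 0.0]]
print(dirs)   # [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
```

Each target direction is a scaled copy of $𝜃_{𝑓} - 𝜃_{0}$. What changes across synthetic checkpoints is therefore the point at which each example's gradient is evaluated.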

Prior work has demonstrated that the gradient of the last layer of a language model can be informative enough for synthetic data generation. Since our approach requires per-example gradients, which are typically expensive to compute, we run backpropagation only through the last layer to save memory and reduce overall computation.
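To see why restricting to the last layer is cheap, consider the sketch below. It assumes the last layer is a linear projection followed by softmax cross-entropy at a single token position. This is an assumption, since the text does not specify the architecture. Under that assumption the per-example gradient has a closed form, the outer product of (probabilities minus one-hot target) with the hidden state, so no backpropagation through earlier layers is needed. The function names are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def last_layer_grad(W, h, y):
    """Gradient of cross-entropy loss with respect to W for one example.

    W: output weights as a list of rows (vocab_size x hidden_dim).
    h: hidden state entering the last layer (hidden_dim).
    y: index of the target token.
    Returns the gradient flattened row by row.
    """
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    p = softmax(logits)
    err = [p_i - (1.0 if i == y else 0.0) for i, p_i in enumerate(p)]
    return [e * x for e in err for x in h]

# Toy example: two-token vocabulary, two hidden units, zero weights.
W = [[0.0, 0.0], [0.0, 0.0]]
g = last_layer_grad(W, h=[1.0, 2.0], y=0)
print(g)  # [-0.5, -1.0, 0.5, 1.0]
```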

Storing all gradients in their original dimension requires $| 𝐷 | \cdot | \nabla ℓ |$ values, which can quickly become prohibitive. To address this, we leverage the classic Johnson-Lindenstrauss lemma, which guarantees that a set of points in $𝑅^{𝑛}$ can be mapped to a lower-dimensional space $𝑅^{𝑘}$ (for $𝑘 ≪ 𝑛$) while approximately preserving inner products with high probability.
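A dense Gaussian random projection is one common way to realize this kind of dimensionality reduction. The sketch below is illustrative only: the text does not say which projection is used, and the function names, dimensions, and seed are all assumptions. Entries are drawn from a normal distribution with variance 1/k, so that inner products are preserved in expectation.

```python
import math
import random

def make_projection(n, k, seed=0):
    """Build a k x n Gaussian projection matrix with entries ~ N(0, 1/k)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]

def project(P, v):
    """Map an n-dimensional vector to k dimensions."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def projected_score(P, grad, diff):
    """Approximate grad . diff using only the k-dimensional projections."""
    return dot(project(P, grad), project(P, diff))
```

In use, each gradient would be projected once and only its k-dimensional image stored. The parameter difference is projected with the same matrix, and selection scores are computed in the reduced space.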
