Instruction Following without Instruction Tuning
What is the nature of instruction following in language models?
Instruction tuning adapts a pretrained language model so that it responds to broad-domain requests with useful responses, instead of, e.g., generating more questions. We discover two forms of adaptation (finetuning) that feel deficient compared to instruction tuning, but still yield language models that follow instructions. We call this implicit instruction tuning.
We first find that training on only responses, with no instructions, yields instruction following:
- Question: What would happen if I were to remove the instructions during finetuning, just training the model to generate responses conditioned on nothing, so the finetuning can’t teach the instruction-response mapping?
- Answer: The resulting response-tuned model follows broad instructions at test time anyway (not quite as well, but still reasonably.)
Next, we find that single-task finetuning also yields instruction following:
- Question: What would happen if I were to finetune on a single-task dataset, like poetry generation or the Grade School Math dataset, so that the distribution of responses is wrong for general-purpose instructions?
- Answer: The resulting models follow broad instructions at test time anyway (not as well, but still reasonably.)
Is there an explanation for why instruction following is so common a result of adapting a language model? We’re not sure, but one hypothesis is that very simple changes to a pretrained model’s distribution transform it into an instruction-following distribution. Past work has shown that learning the mapping is pretty sample-efficient: around 1000 examples (LIMA), or a few in-context examples and some prompting (The Unlocking Spell), are enough to elicit instruction-following behavior. However, just how simple could it be?
- Question: What’s a simple transformation that could yield an instruction-following language model?
- Answer: We write a 3-rule rule-based language model (in ~20 lines of Python) that, in a product with a pretrained model, elicits instruction following. The rules are to (1) slowly upweight the EOS, (2) uniformly penalize a few vocabulary items everywhere, and (3) penalize repeating any word in the response.
Those are the results of the paper. Put another way, it’s pretty hard to get language models to not follow general instructions; for example, when finetuning on grade school math, we might expect math-like output for any instruction, but no: instead, a request for tiramisu largely elicits a normal-looking recipe. One practical takeaway is that if you’ve adapted a language model on some specific task and put it in production, it may yet act like a general-purpose chatbot on novel user inputs, so you should consider putting it through the testing you’d put a general chatbot through!
This blog post is based on the paper Instruction Following without Instruction Tuning by me, Nelson Liu, Percy Liang, and Chris Manning.
Our code for replicating these results is here.
Response Tuning yields Instruction Following
Response tuning is just training a language model to maximize the likelihood of responses – without any specification of what instruction each response is for. Instruction tuning teaches \(p(y\mid x)\), where \(y\) is a response and \(x\) is an instruction; response tuning teaches \(p(y)\). When designing this experiment, I thought response tuning would lead to a nonsense model. I ran the experiment as part of another project; removing the instructions was an extreme case of degrading instruction quality that I just wanted to “make sure” didn’t work.
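Here’s a minimal sketch of the difference, assuming a Hugging Face-style tokenizer and a placeholder chat template (the real training setup is in our code linked above); the only change in response tuning is that the instruction never appears in the context:

```python
import torch

def make_example(tokenizer, instruction, response, response_tuning=False):
    """Build (input_ids, labels) for one finetuning example.

    Instruction tuning maximizes p(response | instruction); response tuning
    drops the instruction entirely, so the model only learns p(response).
    The chat template below is a placeholder for illustration.
    """
    prefix = "" if response_tuning else f"<|user|>\n{instruction}\n"
    prefix += "<|assistant|>\n"
    prefix_ids = tokenizer(prefix, add_special_tokens=False).input_ids
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False).input_ids

    input_ids = torch.tensor(prefix_ids + response_ids)
    # In both cases, the loss is computed only on the response tokens.
    labels = torch.tensor([-100] * len(prefix_ids) + response_ids)
    return input_ids, labels
```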
To evaluate whether a model follows instructions, we (1) greedily decode from the model, so response-like tokens must be the highest-probability tokens at each timestep, and (2) compute LLM-as-a-judge AlpacaEval win rates against a comparable pretrained model that was “standardly” instruction-tuned on the same dataset. We use the OLMo and Llama-2 7B models. For more details (hyperparameter optimization, etc.), see our paper.
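For the decoding side, “greedy” just means the argmax token at every step. A sketch with the Hugging Face API (the model name and chat template are only examples, and this covers only decoding, not the AlpacaEval judging):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Greedy decoding: the generated token must be the single most likely token
# at every step, so instruction-following behavior can't be "sampled into".
name = "meta-llama/Llama-2-7b-hf"  # example; substitute your finetuned model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "<|user|>\nGive me a recipe for tiramisu.\n<|assistant|>\n"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, do_sample=False, max_new_tokens=256)
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
```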
The core result is that response tuning really does yield instruction following: whereas the base models have single-digit win rates against the instruction-tuned models, a response-tuned model has >40% win rates:
Win rates aside, it’s useful to just look at some outputs and see that compared to the base model, the response tuned model really does seem to be following instructions:
In the paper, we study why this might be the case: pretrained models already score good responses for an instruction higher than good responses for a random instruction, but they think many entirely different strings are more likely than any good response. As a simple takeaway, though, it doesn’t seem necessary to teach the model the mapping from instructions to responses.
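As a rough sketch of the kind of comparison we mean (not the exact analysis in the paper), one can score a candidate response under a pretrained model conditioned on the matching instruction versus a random one:

```python
import torch

@torch.no_grad()
def response_logprob(model, tokenizer, instruction, response):
    """log p(response | instruction) under a causal LM (rough sketch;
    ignores tokenization effects at the prompt/response boundary)."""
    prompt = instruction + "\n"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(ids).logits[0]
    # logits[t] predicts token t+1, so the response tokens at positions
    # prompt_len..end are predicted by logits[prompt_len-1 : -1].
    targets = ids[0, prompt_len:]
    logprobs = torch.log_softmax(logits[prompt_len - 1:-1], dim=-1)
    return logprobs.gather(-1, targets.unsqueeze(-1)).sum().item()

# Matched vs. mismatched instructions (illustrative; haiku is any response string):
# matched = response_logprob(model, tok, "Write a haiku about autumn.", haiku)
# mismatched = response_logprob(model, tok, "Explain TCP handshakes.", haiku)
```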
Maybe it’s just necessary to teach the model what a good response is? We test that in the next section.
Single-Task Tuning yields Instruction Following
So, maybe teaching the distribution of good responses—say, that responses look a certain way and are widely varying—is really important to make a language model follow broad instructions. If this is true, then training a language model on a single task, with a narrow distribution of instructions and responses, and a formatting style unlike good responses for most instructions, should not make a language model follow instructions broadly.
We test exactly this, considering five single-task problems: English-to-Python snippets (MBPP), the Grade School Math English-to-math-derivation task (GSM), poem title-to-poem generation (Poetry), recipe title-to-recipe generation (Recipes) and Chess Player ELOs to chess game generation (Chess). Here’s an example from each of the datasets:
We finetune a language model on each of these datasets separately, and then test them on their ability to follow general instructions. Here’s an example response from each model; notice how the formatting of the responses is only loosely related to the formatting of the finetuning:
We test each of these single-task finetuned models against a model instruction-tuned on LIMA using AlpacaEval, finding that the single-task finetuned model winrates are substantially higher than the base models.
Intuitively, while the responses aren’t amazing, the single-task finetuned models don’t simply perform their finetuned behaviors; they seem to exhibit a behavior unlike both the pretrained model (which largely generates more questions or formatting tokens) and the finetuning data (which has its own specific constraints). Put concretely, why is it that the poetry-tuned model doesn’t, well, generate poetry? It’s a fascinating case of generalization that’s out-of-distribution for the finetuning distribution (though likely not for pretraining).
However it happens, many of these single-task finetuning signals (though not Chess) seem to change the pretrained model in a way that yields roughly the broad instruction-following behavior. So, why is instruction following so common a result of adapting language models? We’re not sure, but our last experiment investigates this.
A Rule-Based Adaptation yields Instruction Following
Intuitively, there are some properties or constraints (or a few possible disjoint sets thereof) that an adaptation to a pretrained language model has to have in order to yield instruction following. Why do so many adaptation methods—response tuning, single-task tuning on various tasks—yield instruction following? One interesting hypothesis is that it’s because there are very simple adaptations that yield instruction following. Simple constraints—as opposed to strict requirements—are more likely to be met by a variety of adaptations.
Another way to put this hypothesis is that there are simple functions that map pretrained models’ distributions to a distribution that generates instruction-following strings. We’re inspired by the work of Lin et al., 2023, who found that roughly 78% of tokens greedily generated by an instruction-following model would also have been generated by the pretrained model, and roughly 92% would’ve been in the top-3 most likely tokens. So, not too many tokens need to change. But this doesn’t necessarily mean that the tokens to change are simple to determine—for example, a strong chess engine might agree with a weak engine 95% of the time, but figuring out which 5% of moves to change (and to what) might still be difficult.
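Here’s a rough sketch of that kind of agreement measurement, teacher-forcing one sequence through both models (Lin et al. measure it on the tuned model’s own generations, so this is only an approximation):

```python
import torch

@torch.no_grad()
def topk_agreement(base_model, tuned_model, input_ids, k=3):
    """Fraction of the tuned model's greedy next-token choices that are
    also the base model's top-1 / within its top-k, along one sequence.

    input_ids: [1, seq_len] token ids fed to both models.
    """
    base_logits = base_model(input_ids).logits[0]    # [seq_len, vocab]
    tuned_logits = tuned_model(input_ids).logits[0]
    tuned_choice = tuned_logits.argmax(dim=-1)        # tuned model's greedy tokens
    base_topk = base_logits.topk(k, dim=-1).indices   # [seq_len, k]
    top1 = (tuned_choice == base_topk[:, 0]).float().mean().item()
    topk = (base_topk == tuned_choice.unsqueeze(-1)).any(dim=-1).float().mean().item()
    return top1, topk
```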
To explore this, we hand-write a rule-based language model that, when you take its distribution product with a pretrained language model, leads to instruction following. That is, we construct a simple mapping from pretrained to instruction following distribution; it has only 3 rules, which we’ll detail below.
Our combined pretrained+rules model is a local product-of-experts:
\[p_a(w\mid \mathbf{x}) = p_\text{base}(w\mid \mathbf{x})p_\text{rules}(w\mid \mathbf{x})/Z(\mathbf{x}),\]where the partition function is \(Z(\mathbf{x})=\sum_{w\in\mathcal{V}}p_\text{base}(w\mid \mathbf{x})p_\text{rules}(w\mid \mathbf{x})\). Intuitively, a product distribution is useful because it allows the rule-based probabilities to interact multiplicatively with the pretrained model’s probabilities. This computes a soft AND of the tokens both models find likely.
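In code, the product for one decoding step is just an addition of log-probabilities followed by a renormalizing softmax, which plays the role of \(Z(\mathbf{x})\); a minimal sketch:

```python
import torch
import torch.nn.functional as F

def product_next_token_probs(base_logits, rule_scores):
    """p_a(. | x) proportional to p_base(. | x) * p_rules(. | x), one step.

    base_logits: [vocab] next-token logits from the pretrained model.
    rule_scores: [vocab] summed rule scores r(w, x).
    """
    log_p_base = F.log_softmax(base_logits, dim=-1)
    log_p_rules = F.log_softmax(rule_scores, dim=-1)
    # softmax of the summed log-probs renormalizes, i.e., divides by Z(x).
    return torch.softmax(log_p_base + log_p_rules, dim=-1)
```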
Each rule of our rule-based model assigns a score to each word in the vocabulary; the overall unnormalized probability of the word is the sum of the scores:
\[r(w, \mathbf{x}) = r_1(w, \mathbf{x}) + r_2(w, \mathbf{x}) + r_3(w, \mathbf{x}).\]The probabilities result from the usual softmax of the scores:
\[p_\text{rules}(\cdot \mid \mathbf{x}) = \text{softmax}\left(\left[r(w^{(1)}, \mathbf{x});\cdots;r(w^{(|\mathcal{V}|)}, \mathbf{x})\right]\right)\]The rules are as follows:
- Slowly upweight the EOS token. From token index \(i\) from 0 to 250 of the response, slowly increase the score of the special EOS token by \(15i/250\).
- Uniform token changes. Lower the likelihood of a few tokens that define formatting (like the open bracket `<`, which takes part in `<|user|>` and `<|assistant|>`), or words like `I` or `should`, which the model tends to use to hedge unnecessarily. (Full list in the paper.) These negative scores, like `-4` for `<`, are independent of the prefix.
- Penalize Used Words. Compute the set of all tokens used in the response so far, and penalize their probabilities by a score of `-1.5`.
Intuitively, the first rule is necessary because pretrained models tend to go on and on; the EOS token is not naturally the top-1 most likely token nearly early enough. The second rule is necessary to cut off the model’s core tendency to just repeat the formatting of the instruction prompt, or generate highly likely strings like I don’t know. The third rule is more curious; during development I found that pretrained models were much more likely to generate a reasonable first sentence and then generate variations on that theme, instead of generating more useful content or concluding properly.
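Putting the rules together, here’s roughly what they look like in code (the penalized tokens and exact weights are placeholders; the real ~20-line implementation is in our repo):

```python
import torch

def rule_scores(response_ids, eos_id, vocab_size, penalized_tokens=None):
    """Summed rule scores r(w, x) for the next token of the response.

    response_ids: token ids generated so far in the response.
    penalized_tokens: dict of token id -> fixed negative score (rule 2);
        the actual token list and weights are in the paper.
    """
    scores = torch.zeros(vocab_size)

    # Rule 1: slowly upweight EOS over the first 250 response tokens.
    i = min(len(response_ids), 250)
    scores[eos_id] += 15.0 * i / 250.0

    # Rule 2: uniformly penalize a few formatting / hedging tokens.
    for tok_id, score in (penalized_tokens or {}).items():
        scores[tok_id] += score

    # Rule 3: penalize any token already used in the response.
    if response_ids:
        scores[list(set(response_ids))] += -1.5

    return scores
```

At each decoding step, these scores are softmaxed into \(p_\text{rules}\), multiplied into the pretrained model’s next-token distribution as in the product-of-experts sketch above, and the argmax token is emitted.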
We show that LLAMA-2-7B, in a product with the model defined by these rules, beats LIMA instruction-tuned LLAMA-2-7B about 24% of the time. Further, each of the rules is important to achieving that win rate:
Here are some examples from our rule-based adapter product model. These aren’t cherry-picked, and they aren’t amazing, but you can try it out for yourself and see how the model is much more instruction-following than the base model.
Is this what language models learn when doing single-task finetuning? Probably not. But it’s a constructive proof that some simple operations on a pretrained model can yield instruction following.
Conclusion
Language models are, in some sense, just really prone to following general instructions, even when our adaptation strategies don’t teach the behavior directly. We call this implicit instruction tuning. Instruction tuning certainly teaches important aspects of what we actually want from our responses, but in another sense, finetuning seems to have spotty impact on models’ behaviors; outside of the finetuning distribution, there seems to be reversion to some general instruction-following behavior. Optimistically, this is a fascinating example of out-of-(finetuning)-distribution generalization. Pessimistically, this shows it’s perhaps surprisingly hard to get language models to change their behavior in general, since they are so prone to just following instructions outside the finetuning distribution.