Designing and Interpreting Probes

Probing turns supervised tasks into tools for interpreting representations. But the use of supervision raises a question: did I interpret the representation, or did my probe just learn the task itself?

Wild human languages, confounding neural networks

Human languages are wild, delightfully tricky phenomena, exhibiting structure and variation, suggesting general rules and then breaking them. In some part, it’s for this reason that the best machine learning methods for modeling natural language are extremely flexible, left to their own devices to build internal representations of sentences. However, this flexibility comes at a cost. The most popular models in natural language processing research at this time are considered black boxes, offering few explicit cues to what they learn about how language works.

Earlier machine learning methods for NLP learned combinations of linguistically motivated features—word classes like noun and verb, syntax trees for understanding how phrases combine, semantic labels for understanding the roles of entities—to implement applications involving understanding some aspects of natural language. Though it was difficult at times to understand exactly how these combinations of features led to the decisions made by the models, practitioners at least understood the features themselves.

Recently, large-scale representation learning models like word2vec, GloVe, ELMo, BERT, GPT-2, XLNet, XLM, and others have replaced feature-based foundations of natural language processing systems. The representations of these models—vectors representing each word in each sentence—are used as features instead of linguistically motivated properties.

Probing to test linguistic hypotheses for deep representations

Despite the unsupervised nature of representation learning models in NLP, some researchers intuit that the representations' properties may parallel linguistic formalisms.

Gaining insight into the nature of NLP’s unsupervised representations may help us understand why our models succeed and fail, what they’ve learned, and what we yet need to teach them. An emerging body of research has attempted to do just this – help us understand or interpret the properties of the internal representations of models like ELMo and BERT. To test intuitions about whether properties of representations line up with linguistically specified properties (like parts-of-speech), probing methods train supervised models to predict linguistic properties from representations of language.

A probe is trained to predict properties we care about from representations of a model whose nature we'd like to know more about.

The claim is that achieving high evaluation accuracy (relative to a baseline) in predicting the property—like part-of-speech—from the representation—like ELMo—implies the property was encoded in the representation, and the probe found it.

Revisiting the premises of probing

Though the method seems simple, it has hidden complexity. For example, if you want to find property Y, how hard do you try? In other words, how complex a probe do you train to predict Y: a linear model, or a multi-layer perceptron? How much training data do you use? If the representation losslessly encodes the sentence, then with a complex enough probe, you can find any Y that you could find on the original sentence! For example, we don’t say our camera “learned” image classification just because we can train a CNN to predict labels from the camera’s images. As we’ll show, we should be careful about false positives: saying our network learned Y when it really hasn’t. In a graded notion, we don’t want to overestimate the extent to which high probe accuracy on linguistic tasks reflects properties of the representation.

How do we design probes whose accuracies faithfully reflect (unknown) properties of representations, and how do we interpret the accuracies returned by probes when making statements about the properties of representations?

A step towards more interpretable interpretability methods

In this blog post, we’ll describe control tasks, which put into action the intuition that the more a probe is able to make memorized output decisions independently of the linguistic properties of a representation, the less its performance on a task necessarily reflects properties of the representation. Through control tasks, we define selectivity, which puts a probe’s linguistic task accuracy in context of its ability to do this.

We find that probes, especially complex neural network probes, are able to memorize a large number of labeling decisions independently of the linguistic properties of the representations. We then show that through selectivity, we can gain intuition for the expressivity of probes, and reopen questions about which representations better represent linguistic properties like parts-of-speech.

Our method is described in the forthcoming EMNLP 2019 publication, Designing and Interpreting Probes with Control Tasks; this post draws from the paper, which is joint work with Percy Liang. Every result in the paper, as well as its code and data provenance, can be found on our worksheet on CodaLab, a platform for reproducible science. The repository for the code used in the paper (which you can also use yourself!) is available on GitHub.

Background: Linguistic probing methods

Linguistic probing methods start with a broad hypothesis, like: “I think my representation learner unsupervisedly developed a notion of linguistic property Y, and encodes this notion in its intermediate representations in order to better perform the task it was trained on (like language modeling).”

It is difficult to test this hypothesis directly. Probing methods substitute the following hypothesis, which has the nice property of being testable: “I think my representation has features that are:

  1. predictive of linguistic property Y,
  2. accessible to a specific family of functions (say, linear functions),
  3. when trained in a specific manner on a specific dataset, and
  4. generalize to examples held-out during this training process
  5. better than a baseline representation.”

Task and probe notation

Some notation will help us be precise when talking about probing. As nicely described in Noah Smith’s introduction to contextual representations, we’ll use the term word token to denote an instance of a word in a corpus, and word type to denote the word in abstract.

Denote as \(1:T\) the sequence of integers \(\{1,...,T\}\). Let \(V\) be the vocabulary containing all word types in a corpus. A sentence of length \(T\) is \(x_{1:T}\) , where each \(x_i \in V\) , and the word representations of the model being probed are \(h_{1:T}\) , where \(h_i \in R^d\). A task is a function that maps a sentence to a single output per word, \(f (x_{1:T} ) = y_{1:T}\) , where each output is from a finite set of outputs: \(y_i \in \mathcal{Y}\).

A probe parametrized by \(\theta\) is a function \(f_\theta(h_{1:T}) = \hat{y}_{1:T}\). The accuracy achieved by this probe is the percent of outputs \(i\) such that \(\hat{y_i} = y_i\) on inputs \(h_{1:T}\) that were not used to train the probe.
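Concretely, a probe in this family can be as simple as multiclass logistic regression trained on (representation, label) pairs and evaluated on held-out tokens. Here is a minimal sketch in numpy; the representations are synthetic stand-ins (not real ELMo vectors), and the dimensions and hyperparameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for probing: n word tokens, each with a d-dim
# representation h_i and a label y_i from a small tag set.
# (In practice h would come from a model like ELMo; here it's synthetic,
# drawn around label-dependent class means so there is signal to find.)
n, d, num_labels = 300, 16, 3
class_means = rng.normal(size=(num_labels, d))
y = rng.integers(0, num_labels, size=n)
h = class_means[y] + 0.5 * rng.normal(size=(n, d))

# Held-out split: probe accuracy is always measured on tokens
# not used to train the probe.
h_train, y_train = h[:200], y[:200]
h_test, y_test = h[200:], y[200:]

# A linear probe f_theta(h) = softmax(W h + b), trained by gradient descent
# on the cross-entropy loss.
W = np.zeros((num_labels, d))
b = np.zeros(num_labels)
for _ in range(500):
    logits = h_train @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(len(y_train)), y_train] -= 1.0  # d(loss)/d(logits)
    W -= 0.1 * (probs.T @ h_train) / len(y_train)
    b -= 0.1 * probs.mean(axis=0)

accuracy = np.mean((h_test @ W.T + b).argmax(axis=1) == y_test)
```

On this easily separable toy data the probe reaches high held-out accuracy; the point is only to make the pieces of the notation (representations, probe parameters, held-out evaluation) concrete.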

Control representations

Probing papers tend to acknowledge the uncertainty that task accuracies derived from probing reflect not just the representations \(h\), but also the function \(f_\theta\) that is learned through supervision. Part 5 in the probing hypothesis above reflects this: baseline representations, for example resulting from random contextualization, \(h^\text{random}\), are used as input to the same probing methods as the representations we care about. The claim is, if training a probe on \(h\) leads to higher accuracy than training a probe on \(h^\text{random}\), then \(h\) encodes the linguistic task to some extent.

This control is useful, as it puts accuracies on \(h\) in context of accuracies you could have achieved before you even trained the contextualization part of the unsupervised model. However, it does not directly provide information about the expressivity of the function family: its ability to make decisions on its own, independently of interesting properties of \(h\).

The Probe Confounder Problem

What might it look like for a probe to achieve high accuracy without faithfully reflecting the properties of a representation? In general, this is the case if the probe is able to detect and combine disparate signals, some of which are unrelated to the property we care about, and memorize arbitrary output distinctions based on those signals.

The probe confounder problem occurs when the probe is able to detect and combine disparate signals, some of which are unrelated to the property we care about, and use supervision to memorize arbitrary output distinctions based on those signals.

Let’s consider a toy example, in which our language has the vocabulary \(V=\{a,b,...\}\). Linguists have decided upon the function \(f_\text{part-of-speech}(x_{1:T}) = y_{1:T}\) as the parts-of-speech for the language, where \(y_i \in \{Q, R, S\}\).

We have a representation learner \(\phi\) which emits a sequence of discrete symbol representations of a sentence, \(\phi(x_{1:T}) = h_{1:T}\), where \(h_i \in \{M, N\}\).

We’re wondering if the representation learner has, without supervision, learned the part-of-speech task, \(f_\text{part-of-speech}\).

To test this, we train a probe \(f_\theta(h_{1:T}) = \hat{y}_{1:T}\). Here’s an example of how this turns out for our particular toy language: The representations of \(\phi\) overlap with the parts-of-speech of the language – cool! The supervision of the probe let us learn the rough correspondence between the linguistic labels and the \(\phi\) representations, and the overlap seems interesting but isn’t perfect.

Unfortunately, representations \(h\) of actual models aren’t ever that simple; they’re real-valued vectors instead of symbols, and encode all kinds of information. In particular, we might expect that the word type of each word token ends up in the representations \(h\). If this is so in our toy example, our probe might learn something like the following:

In this case, the probe is able to memorize that word identities help disambiguate which part-of-speech a word token should have, even when the learned representation doesn’t actually make a distinction. Whereas before it was clear that the distinction between the part-of-speech \(R\) and the part-of-speech \(S\) isn’t made by \(\phi\), now our probe doesn’t reveal that distinction.

Control tasks and selectivity

Our goal is to put linguistic task accuracy in context of the probe’s ability to make output decisions that don’t reflect linguistic properties of the representation. We go about this by defining tasks that can’t have been learned a priori by a representation, but can be learned by the probe through memorization. We call these tasks control tasks. At a high level, control tasks have:

  • structure: The output for a word token is a deterministic function of the word type.

  • randomness: The output for each word type is sampled independently at random.

Because of randomness, no representation can have learned the task a priori. But because token outputs are deterministic functions of the word type, the probe itself can learn the task.

Control tasks associate word types with random labels; by construction, they can only be learned by the probe itself.

Constructing a control task for part-of-speech tagging

Let’s go through the construction of control tasks. Recall that a linguistic task is a function \(f(x_{1:T}) = y_{1:T}\), where \(y_i \in \mathcal{Y}\). Each control task is defined in reference to a single linguistic task, and the two share \(\mathcal{Y}\).

In part-of-speech tagging, the set \(\mathcal{Y}\) is the tagset, \(1:45\) (corresponding to NN, NNS, VB,…). To construct a control task, we independently sample a control behavior \(C(v)\) for each \(v \in V\). The control behavior specifies how to define \(y_i \in \mathcal{Y}\) for a word token \(x_i\) with word type \(v\). The part-of-speech control task is the function that maps each token \(x_i\) to the label specified by the behavior \(C(x_i)\):

\[\begin{align} f_{\text{control}}(x_{1:T}) = [C(x_1), C(x_2),...,C(x_T)]\end{align}\]
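In code, the part-of-speech control task amounts to sampling one random tag per word type and applying it to every token of that type. A minimal sketch, with a tiny illustrative vocabulary and an integer stand-in for the tag set:

```python
import random

# Stand-in tag set: 45 integer tags, mirroring the Penn Treebank tagset size.
TAGSET = list(range(45))

def make_control_task(vocab, tagset, seed=0):
    """Sample a control behavior C(v) per word TYPE; map tokens to labels."""
    rng = random.Random(seed)
    # randomness: the output for each word type is sampled independently
    behavior = {v: rng.choice(tagset) for v in vocab}
    def f_control(tokens):
        # structure: each token's output is a deterministic function of its type
        return [behavior[x] for x in tokens]
    return behavior, f_control

vocab = ["the", "cat", "sat", "on", "mat"]
behavior, f_control = make_control_task(vocab, TAGSET)
labels = f_control(["the", "cat", "sat", "on", "the", "mat"])
```

Both occurrences of “the” receive the same (randomly chosen) label, so the task is learnable by memorization but cannot have been anticipated by any representation.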

This task is visualized below:

In our paper, we also construct a control task for a task derived from dependency parsing, but we’ll omit it from this blog post.

Properties of control tasks

To summarize, a control task is defined for a single linguistic task, and shares the linguistic task’s output space \(\mathcal{Y}\). To construct a control task, a control behavior is sampled independently at random for each word type \(v\in V\). The control task is a function mapping \(x_{1:T}\) to a sequence of outputs \(y_{1:T}\) which is fully specified by the sequence of behaviors, \([C(x_1), ..., C(x_{T})]\).

From this construction, we note that the ceiling on control task performance is the fraction of tokens in the evaluation set whose types occur in the training set (plus chance accuracy on all other tokens). Further, \(C(v)\) must be memorized independently for each word type.
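This ceiling is easy to compute from the token streams themselves; a small sketch, with illustrative token lists:

```python
# Ceiling on control task accuracy: the probe can only have memorized C(v)
# for word types seen in training, plus chance accuracy on unseen types.
def control_task_ceiling(train_tokens, eval_tokens, num_labels):
    seen_types = set(train_tokens)
    seen_frac = sum(x in seen_types for x in eval_tokens) / len(eval_tokens)
    return seen_frac + (1.0 - seen_frac) / num_labels

# Here half the evaluation tokens ("the", "sat") have types seen in training.
ceiling = control_task_ceiling(
    ["the", "cat", "sat"], ["the", "dog", "sat", "on"], num_labels=45
)
```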


With a control task defined for each linguistic task, we define selectivity to be the difference between linguistic task accuracy and control task accuracy achievable by a probe family for a given representation.

\[\begin{align} \text{selectivity} = \text{linguistic acc} - \text{control acc}\end{align}\]
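As a function, selectivity is just this difference; the accuracies below are made-up numbers for illustration, not results from the paper:

```python
def selectivity(linguistic_acc, control_acc):
    """Selectivity: linguistic task accuracy minus control task accuracy."""
    return linguistic_acc - control_acc

# Two hypothetical probes with near-identical linguistic accuracy can
# differ sharply in selectivity if one memorizes the control task well.
sel_low = selectivity(linguistic_acc=97.3, control_acc=92.8)    # less selective
sel_high = selectivity(linguistic_acc=97.2, control_acc=71.2)   # more selective
```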

Selectivity puts linguistic task accuracy in context of the probe's ability to memorize arbitrary outputs for word types.

We propose selectivity as a tool for designing probes to reflect properties of a representation, and for interpreting probing accuracies achieved by different probes or on different representations.


In our paper, we conduct a broad study of probe design decisions and hyperparameter choices and see how they affect linguistic task accuracy and selectivity. In this post, we’ll just go over some headline results.

The two most popular designs for probes are linear models and multi-layer perceptrons (MLPs). We train probes from both function families on part-of-speech tagging and its control task to analyze the expressivity of the probe families. We also train bilinear and multi-layer perceptron probes on dependency edge prediction, a task derived from dependency parsing described in our paper. The representations \(h\) used are from the first layer of ELMo, and the data comes from the Penn Treebank.

We find that linear and bilinear probes are considerably more selective than multi-layer perceptron probes. For part-of-speech tagging, moving from linear to MLP probes leads to a slight increase in part-of-speech tagging accuracy but a significant loss of selectivity, suggesting that the slight gain in part-of-speech accuracy may not faithfully reflect properties of the representation.

Popular probe design decisions lead to high control-task accuracy, low-selectivity probes, indicating that they're able to memorize a large number of decisions unmotivated by the representation.

We further find that we can control selectivity through careful complexity control, but that the common use of regularization – to reduce the generalization gap – does not encourage selectivity. In particular, we find that instead of the commonly used hidden state dimensionalities of a few hundred or 1000 for multi-layer perceptron probes, 10-dimensional MLPs achieve high part-of-speech tagging accuracy while being much more selective. This and a few other complexity control methods are visualized in the figure below.
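To see why hidden dimensionality is such a direct lever on memorization capacity, it helps to count parameters. Assuming 1024-dimensional ELMo vectors and a 45-tag output space (the sizes here are illustrative), a one-hidden-layer MLP probe’s parameter count is:

```python
# Parameter count for a one-hidden-layer MLP probe: an input-to-hidden
# affine map plus a hidden-to-output affine map (weights and biases).
def mlp_param_count(d_in, d_hidden, d_out):
    return (d_in + 1) * d_hidden + (d_hidden + 1) * d_out

big = mlp_param_count(1024, 1000, 45)  # a few-hundred-to-1000-dim probe
small = mlp_param_count(1024, 10, 45)  # a 10-dimensional probe
```

The 10-dimensional probe has roughly two orders of magnitude fewer parameters, and correspondingly far less capacity to memorize per-type control behaviors.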

Interpreting probing results for comparing representations

The last result we’ll discuss provides an example of how the probe confounder problem can muddy comparisons of linguistic capabilities of different representations.

Multiple studies have found that probes on the first BiLSTM output of ELMo (ELMo1) achieve higher accuracies than probes on the output of the second BiLSTM, ELMo2. One hypothesis to explain these results is that ELMo1 has higher-quality or more easily accessible part-of-speech representations than ELMo2. However, as we’ve seen, these results depend on the probe as well as the representation; given what we know about probes’ capacity for memorizing at the type-level, we explore an alternative hypothesis.

In particular, word type is a very useful feature in part-of-speech tagging a word token when combined with other features. Since ELMo1 is closer to the word inputs than ELMo2, it may be easier to identify word types. The higher accuracy of probes on ELMo1 may thus be explained by the accessibility of the word type feature instead of differences in part-of-speech representation.

We train probes on both ELMo1 and ELMo2, as well as the random representation baseline Proj0, and report both the linguistic task accuracy and control task accuracy on both models. The results for part-of-speech tagging are as follows:

From these results, we see that while ELMo1 achieves \(0.6\) better part-of-speech tagging accuracy than ELMo2, it comes at a loss of \(5.4\) selectivity; hence, probes on ELMo2 are considerably less able to rely on word identities. As we suggested above, this is consistent with the hypothesis that probes on ELMo1 achieve higher part-of-speech tagging accuracy due to easier access to word identities. However, it does not confirm this hypothesis; it merely reopens the question.

Relatedly, the small gain in part-of-speech tagging accuracy for ELMo2 over the random representation baseline Proj0 might at first suggest that ELMo2 encodes little about part-of-speech tags. In fact, a multi-layer perceptron achieves \(97.1\) part-of-speech tagging accuracy on Proj0, but only \(97.0\) accuracy on ELMo2.

However, by examining selectivity, we see that probes on ELMo2 have considerably less access to word identities than those on Proj0 (a difference of \(10.8\) in selectivity). This indicates that probes on ELMo2 must rely on emergent properties of the representation to predict parts-of-speech, suggesting that ELMo2 does encode part-of-speech information.


Probing methods have shown that a broad range of supervised tasks can be turned into tools for understanding the properties of contextual word representations.

Alain and Bengio (2016)1 suggested we may think of probes as “thermometers used to measure the temperature simultaneously at many different locations.” We instead emphasize the joint roles of representations and probes together in achieving high accuracy on a task; we suggest that probes be thought of as craftspeople; their performance depends not only on the materials they’re given, but also on their expressivity.

As probes are used increasingly to study representations, we hope that control tasks and selectivity, as diagnostic tools, can help us better interpret the results of these probes, ultimately leading us to better understand what is learned by these remarkably effective representations.


  1. Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. ICLR. 2016. 
