Conditional Probing and Usable Information


We introduce a new tool in the toolbox for understanding what knowledge neural networks encode in their representations.

Emergent properties of the representations in neural networks challenge our intuitions for what it means to have knowledge. Probing is a tool that grapples with this concept by relating neural representations to well-understood properties. Probing is a useful tool if you believe, as I argue, that a system knows exactly what it makes easier to predict.

In this blog post, I discuss knowledge in neural networks, the probing methodology, and two contributions of an EMNLP 2021 paper I spearheaded: (1) an extension to the probing toolbox that allows us to more precisely specify what kind of knowledge we want to measure, by excluding any knowledge explainable by a baseline we specify, and (2) a theoretical grounding for probing in estimating usable information, that is, information we can extract when we’re limited in how we can try to extract it.

This post is loosely based on the following paper:

Conditional probing: measuring usable information beyond a baseline
John Hewitt, Kawin Ethayarajh, Percy Liang, and Christopher D. Manning.
EMNLP 2021 (short papers)

The Two Primes

What does it mean to know a prime number? Or, what does it mean for a system to know a prime number?

The following is meant to help us understand our intuitions about what it means to know a prime number; to my knowledge, the argument first appeared in Aaronson, 2013. Here’s the first prime (or really, form of a prime):

\[2^k-1\]

For a specific prime, I’d pass you some \(k\). Okay, keep that in mind; here’s the second form of the prime:

\[\text{Let } z = 2^k-1; \ \ \ \text{the next prime after $z$}\]

For both of these descriptions, you could perform the computation specified and write out the digits of the number I specify, for any \(k\) I tell you. In that sense, both primes (for a given \(k\)) are well-defined, unambiguous, and can be checked to be prime.

Your intuition is your own, but I feel like I “know” the first prime, but not the second. Why is this? Intuitively, I feel like there’s work left to be done before I know the second prime; I have to do some searching! Yet if we were to ask the same question of even numbers (I give you a specification of an even number, and then another specification with the same “the next one” statement), I feel like I’d know both even numbers.

The difference seems to be in the amount or kind of work required to, say, write out the digits of each of our primes. Specifically, there’s a polynomial-time algorithm to take \(2^k-1\) and write out its digits, while there’s no (known) polynomial-time algorithm to take the ‘‘the next one’’ description and write out the digits. We sort of have to fumble along the integers after \(2^k-1\), primality-testing each one until we find a prime; this may take exponentially long! In summary, given a description of some piece of (possible) information, the extent to which the description contains knowledge for us depends on how (hard) we have to work to access that information.
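To make the asymmetry concrete, here is a minimal Python sketch (not from the paper). The helper names `first_form` and `second_form` are hypothetical, and trial division stands in for a real primality test purely for readability, so this is only practical for small \(k\).

```python
def is_prime(n):
    """Trial division: simple, but only practical for small n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def first_form(k):
    """Write out 2^k - 1 directly: no search needed."""
    return 2**k - 1


def second_form(k):
    """Search upward from z = 2^k - 1 for the next prime.

    Returns the prime and how many candidates had to be tested.
    """
    candidate = first_form(k) + 1
    tests = 1
    while not is_prime(candidate):
        candidate += 1
        tests += 1
    return candidate, tests


for k in (5, 7):
    prime, tests = second_form(k)
    print(k, first_form(k), prime, tests)
```

For \(k=5\), the search has to test six candidates (32 through 37) before it can write anything down, while the first form needs no search at all.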


The 3SAT Boxes

What does it mean for a neural network, say one trained on a body of text, to encode knowledge? This fascinating question touches on the philosophical: can a non-agentive neural network know? Can a text-only neural network know? It also touches on the practical: how can we characterize the properties that emerge through training neural networks on language data?

We will start with an esoteric, but simple, example, with three mysterious boxes. First, the gold box: a mysterious box which takes as input hard 3SAT (boolean satisfiability) instances of \(n\) variables, that is, instances of (likely) rather computationally difficult problems. Recall that the goal of boolean satisfiability is to determine whether any of the \(2^n\) possible assignments to the \(n\) boolean variables causes the (input) boolean expression to evaluate to True. In 3SAT form, this is NP-complete; it is conjectured that there is no polynomial-time algorithm to solve this.1

The gold box

Enter gold box: when it takes a 3SAT instance as input, it churns a bit and then spits out an assignment to the \(n\) variables: zeros and ones. (Always the same output sequence for the same input.) When you evaluate the boolean expression with these variables, you find that on some 3SAT instances, it’s a satisfying assignment! Other times, it’s not a satisfying assignment. Now, I’ll reveal the mystery to you: gold box is an oracle 3SAT solver, which solves 3SAT instances in polynomial time and emits a satisfying assignment when the instance is satisfiable (and emits an arbitrary, necessarily non-satisfying, assignment when it’s not). Your intuition is your own, but to me, gold box clearly has knowledge of 3SAT satisfiability.

The unk box

Okay, box number 2: unk box. When provided with an input 3SAT instance, it churns a bit and produces a sequence of \(n\) words, each blep or fren. Here’s the mystery of this box: when unk box was constructed (bear with me), a fair coin was flipped without you knowing. If the coin landed heads, then the box would emit blep when, in a satisfying assignment to the 3SAT instance, the variable should be assigned the value 1, and fren if the variable should be assigned 0. If the coin landed tails, the roles of blep and fren are swapped.

If we were to sample many such unk boxes, there would be no correlation between bleps, frens, and the correct assignments to variables in the 3SAT instance. And perhaps the box doesn’t know, philosophically, what a boolean satisfiability instance is. Because the unk box system isn’t agentive, we never even arrive at the classic Turing Test-style question of whether we care about what’s going on “under the hood” if the system behaves intelligently. The unk box is not trying to convince us of anything, and it wasn’t (necessarily) trained to mimic a specific distribution. And we don’t “speak its language”: we never specified to it through instructions or examples what a 3SAT solution is or should look like.

Yet, an enterprising scientist may observe the behavior of the box, and devise a simple decoder that maps the bleps and frens to an assignment to the \(n\) variables, and find that the box has undeniably encoded a considerable amount of knowledge about 3SAT. My intuition is that the unk box is not too different from the gold box; it’s just a little harder to extract 3SAT knowledge from it. The decoder that helps us characterize this knowledge is a probe.
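As a toy illustration of the enterprising scientist's decoder, the sketch below simulates an unk box and fits a probe to it. Everything here is an assumption made for the example: the brute-force solver stands in for whatever the box does internally, the probe family is just the two possible word-to-bit mappings, and the names (`make_unk_box`, `fit_probe`, `INSTANCE`) are hypothetical.

```python
from itertools import product


def satisfies(clauses, assignment):
    """True if every clause has at least one true literal.

    A literal v > 0 means "variable v is True"; -v means "variable v is False".
    """
    return all(
        any((lit > 0) == assignment[abs(lit) - 1] for lit in clause)
        for clause in clauses
    )


def brute_force_solve(clauses, n):
    """Stand-in for the box's hidden solver (exponential time)."""
    for bits in product([False, True], repeat=n):
        if satisfies(clauses, bits):
            return bits
    return (False,) * n  # unsatisfiable: any assignment will do


def make_unk_box(coin_is_heads):
    """Build an unk box whose vocabulary depends on a hidden coin flip."""
    one_word, zero_word = ("blep", "fren") if coin_is_heads else ("fren", "blep")

    def unk_box(clauses, n):
        return [one_word if bit else zero_word for bit in brute_force_solve(clauses, n)]

    return unk_box


# The probe family: the two possible word-to-bit decoders.
PROBE_FAMILY = [
    {"blep": True, "fren": False},
    {"blep": False, "fren": True},
]


def fit_probe(box, instances):
    """Pick the decoder whose decoded outputs satisfy the most instances."""
    def score(decoder):
        return sum(
            satisfies(clauses, [decoder[word] for word in box(clauses, n)])
            for clauses, n in instances
        )
    return max(PROBE_FAMILY, key=score)


# These four clauses force x1 = True, so flipping every bit breaks the solution.
INSTANCE = ([(1, 2, 3), (1, 2, -3), (1, -2, 3), (1, -2, -3)], 3)
print(fit_probe(make_unk_box(coin_is_heads=False), [INSTANCE]))
```

Fitting the probe costs one pass over a handful of instances, which is tiny compared to the exponential search the box has already done for us.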

The id box

For each unique 3SAT instance that the id box receives, it emits a unique integer. There’s no more structure to it than this: it’s simply a uuid. Your intuition is your own, but to me, the id box cannot be said to have knowledge of satisfiability.

Why? It’s no easier to come to the satisfiability of each 3SAT instance with id box than it was without id box; it’s no help at all! This is in stark contrast to gold box, of course, which just solves the problem for us directly, but also in stark contrast to unk box, which does most of the work for us.

Neural networks, I argue, are frequently like unk box: they’ve done a lot of the work towards constructing, e.g., fascinating linguistic features, for us, and stored the results in their continuous representations. These representations take the place of the unk box’s sequence of bleps and frens: not directly a solution to anything, but instead a record of “useful work” performed. So, just like unk box can be said to have knowledge of satisfiability, neural networks can be said to have knowledge of what they make easier to extract.

Knowledge in Neural Networks

This blog post is about neural networks, I promise. But first, let’s summarize the intuition from the introduction. From the 3SAT boxes and the two primes, I’ve argued that

  1. We humans feel like we have knowledge of a piece of information about a property if there’s a limited amount/kind of work we have to do in order to write out the property, and
  2. a system may contain knowledge about the property if it makes it easier to write out that property when we have the system than when we didn’t.

Our imperfect summary of these things is that a system knows exactly what it makes easier to predict. The system knows the property because it performs work on the input and produces some representation; we detect that the system knows the property (relative to having already known it in the input) because it’s easier to decode the property from the representation than the input, and perhaps from other baselines.

What if the knowledge isn’t used?

A reasonable argument against this interpretation is, what if the system does not use this knowledge in its predictions? Considerable work in interpretability of neural networks centers around determining what knowledge goes into the predictions of specific neural networks. This is an important goal. However, neural networks increasingly form the foundation, bad or good, for a large number of tasks, each one specified by some adaptation from the original network.

Understanding what’s easily extractable from the foundation neural network is in some sense a piece of basic science that can help us understand a range of adapted models–we might not ever even use the final predictions of the models we’re studying anyway! So, for our purposes, we’re not concerned with whether a neural network ‘‘uses’’ knowledge in its final predictions; even if a property is just easily extractable from some internal representation, that’s good enough for us.

Diving a bit deeper into this: consider that the neural network you’re analyzing may be extended, adapted, used as a puzzle piece in a complex system. Interpretability studies that connect the output predictions (e.g., probability distributions of BERT) to features of the input (which words in the input caused this output?) or the training data (which training examples caused this output?) are particularly important if you’re directly interested in using exactly those output predictions. They’re less relevant, however, when you have uncertainty over how the network you’re studying—its representations, gradients, predictions, etc.—will be used. In terms of representations, some representation at some middle layer of the network may be directly used in some other system—or its properties may become more relevant through finetuning. Proofs of ease of extraction–through the probing methods we describe, or through finetuning, or other ways—help us build conceptual understanding of the kinds of things that can be extracted, through various means and at various difficulties, from these neural networks.

So, interpretability work meant to explain individual predictions or make causal claims for the end task of the current model (e.g., masked word prediction in BERT) certainly is useful, but not to the preclusion of non-causal extraction results.

What if a network knows things we can’t extract?

Another possible problem with our loose definition of knowledge in neural networks is the fact that it’s sort of a moving target: if we come up with better ways of extracting knowledge, now the neural network knows more than before? There are at least two ways of dealing with this. The first way is to just accept this unintuitive result—the definition of a network’s knowledge depends on us, the ones trying to access that knowledge, so since we’ve changed, the knowledge has changed. If this is unsatisfying, I offer the following: the new method for extracting knowledge existed before we knew it, and so the knowledge was extractable by us. So, at any given time, we have only a lower-bound on the knowledge in a neural network. Extraction methods are constructive proofs of lower bounds of the network’s knowledge.

The probing methodology

Probing at its simplest is training a supervised classifier to predict a property from a (neural) representation. In doing so, one is studying the relationship between the representation (as from, say, BERT) and the property (say, named entity tags.) I’ve written on the mechanical nuances and philosophical difficulties before; also see Belinkov, 2021.

A probe is the classifier itself–the method of extracting knowledge from the neural network. The probe takes in a representation from the neural network, and produces a prediction of the property. A probe family is the set of classifiers that we decide, as a part of our probing experiment, that we’re willing to choose from in order to extract knowledge. An example of a probe family is the set of log-linear classifiers.

In a probing experiment, we use labeled data—paired samples of inputs like a sentence and word index, and outputs like the part-of-speech of the word at that index—to pick a probe from the probe family. That probe is then used on held-out data to test how well the property is predicted from the representation, given the probe as the decoder or extraction method.
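The experimental loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the two-dimensional "representations" and their labels are made up, the probe family is log-linear classifiers, and the probe is picked by full-batch gradient descent on cross-entropy.

```python
from math import exp


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(W, b, x):
    """Log-linear probe: softmax(Wx + b)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)])


def train_probe(X, y, num_classes, lr=0.5, steps=200):
    """Pick a probe from the family by gradient descent on cross-entropy."""
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(steps):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, label in zip(X, y):
            p = predict_proba(W, b, x)
            for k in range(num_classes):
                err = p[k] - (1.0 if k == label else 0.0)
                grad_b[k] += err / len(X)
                for j in range(dim):
                    grad_W[k][j] += err * x[j] / len(X)
        for k in range(num_classes):
            b[k] -= lr * grad_b[k]
            for j in range(dim):
                W[k][j] -= lr * grad_W[k][j]
    return W, b


def accuracy(W, b, X, y):
    correct = 0
    for x, label in zip(X, y):
        p = predict_proba(W, b, x)
        correct += p.index(max(p)) == label
    return correct / len(y)


# Hypothetical representations (inputs) and property labels (outputs).
X_train = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
y_train = [0, 0, 0, 1, 1, 1]
X_heldout = [[0.7, 0.1], [0.2, 0.7]]
y_heldout = [0, 1]

W, b = train_probe(X_train, y_train, num_classes=2)
print("held-out accuracy:", accuracy(W, b, X_heldout, y_heldout))
```

The held-out accuracy is the number that gets reported; the training data is only used to choose which member of the probe family to evaluate.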

For precise notation, take a look at our paper. But for a nice summary, we’ll just let

\[\text{Perf}_{\mathcal{V}}(R)\]

be the performance of the probe on representation \(R\) under probe family \(\mathcal{V}\), under whatever performance measure – accuracy, \(F_1\), or otherwise.

This measures how extractable the property we’re interested in — part-of-speech, named entities, sentiment, or otherwise — is from the representation under \(\mathcal{V}\).

Baselined probing

In the example of 3SAT, we saw a clear difference in knowledge between unk box and id box. The unk box emitted a sequence of bleps and frens, which corresponded to a satisfying assignment to the 3SAT instance, modulo a simple transformation from the two strings to \(0\) and \(1\). The id box emitted a unique integer for the 3SAT instance.

There exist decoding strategies from the output of each box to the satisfying assignment, but it’s conspicuously (exponentially?) easier to decode from the output of the unk box than it was to decode from its input. For the id box, it’s definitely no easier to decode satisfiability from the output of the box (the unique id of the input) than it was to decode from the input.

The input is a form of baseline; a representation that we can probe like we probe the representation. We then can estimate the following quantity, letting \(B\) be an arbitrary baseline:

\[\text{Perf}_{\mathcal{V}}(R) - \text{Perf}_{\mathcal{V}}(B).\]

This quantity is the baselined probing performance. To the extent that the property is easier to extract from the representation than a baseline (or number of baselines), we suggest that the representation encodes the property to an interesting extent.

Beyond baselined probing

Baselined probing is useful like baselines are useful in general in machine learning; it’s unclear how hard a prediction problem is, or how interesting it is that some structure is encoded, until we try some simple methods for predicting it.

But baselines can serve another purpose. They can help us specify the information we want excluded from our performance metrics. For example, in part-of-speech tagging, a simple baseline is predicting each word’s part of speech from some features of the individual word, outside of context. Part-of-speech is context-dependent, so this baseline could never get 100% accuracy, but most words’ part-of-speech isn’t dependent on the context, so you get pretty good accuracy!

Here’s a picture of probing on a non-contextual word embedding baseline, predicting part-of-speech.

[Figure: a probe predicting part-of-speech from non-contextual word embeddings for an example sentence.]

75% for this sentence–not too bad. The only issue is that ‘‘record’’ can be a noun or a verb, but the probe assigns the most common tag to both instances, since the representation is the same for each. Okay, now we have a representation that we’d like to study; here’s a picture of probing on this mysterious representation:

[Figure: a probe predicting part-of-speech from the mysterious representation.]

Oof! 25%, so the baselined probing performance is -50%; this is a negative probing result. We could stop here! The representation explains less about part-of-speech than does the baseline; hence we may conclude it doesn’t contain interesting knowledge relative to the baseline.

But since I came up with this mysterious representation, I now reveal its mysteries. It actually only encodes the part-of-speech of words in context whose part-of-speech isn’t the most common part-of-speech for that word type! For words whose part-of-speech is the most common for the word type, let’s say the representation is just a \(0\) vector.

Intuitively, this mystery representation encodes a lot about part-of-speech; the ambiguous cases, really, are what make part-of-speech interesting. Yet, we have a negative probing result. This motivates a desire for a probing methodology that measures only what isn’t captured by the baseline, instead of penalizing the representation for not capturing everything that the baseline captures.

Conditional probing

Conditional probing involves training two probes, just like baselined probing. One probe just takes the baseline as input, as before. The other probe, however, takes the representation and the baseline, instead of just the representation. Here’s what that might look like in a picture:

[Figure: a conditional probe that takes both the representation and the baseline as input.]

As before, we subtract the performance of the probe on the baseline from the performance of this probe:

\[\text{Perf}_{\mathcal{V}}([R;B]) - \text{Perf}_{\mathcal{V}}(B).\]

We call this measure conditional probing performance. In the case of our running example, the conditional probing performance is 25% (the probe on both inputs gets 100%, against the baseline’s 75%), showing that the representation does in fact explain an interesting aspect of part-of-speech beyond that captured by the baseline. Intuitively, the fact that the probe has access to the baseline means that it doesn’t need to access the same features in the representation as it would if it had the representation alone.

Conditional probing measures just what the representation captures in the property beyond what's captured by the baseline.
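The running example can be replayed on a toy corpus. The sketch below is an assumption-laden stand-in: the corpus is invented, and a lookup-table probe (most frequent label per feature value) replaces the log-linear probes used in the paper. Because this probe guesses the most common tag for the zero representation, the representation alone scores 50% here rather than the 25% in the figure, but the sign pattern is the same.

```python
from collections import Counter, defaultdict


def fit_lookup_probe(features, labels):
    """Probe that predicts the most frequent training label for each feature value."""
    by_feature = defaultdict(Counter)
    for x, y in zip(features, labels):
        by_feature[x][y] += 1
    fallback = Counter(labels).most_common(1)[0][0]
    table = {x: counts.most_common(1)[0][0] for x, counts in by_feature.items()}
    return lambda x: table.get(x, fallback)


def accuracy(probe, features, labels):
    return sum(probe(x) == y for x, y in zip(features, labels)) / len(labels)


# Hypothetical (word, tag) data; "record" is usually a NOUN, sometimes a VERB.
train = [("the", "DET"), ("record", "NOUN"), ("the", "DET"), ("record", "NOUN"),
         ("they", "PRON"), ("record", "VERB"), ("the", "DET")]
test = [("they", "PRON"), ("record", "VERB"), ("the", "DET"), ("record", "NOUN")]

# Most common tag per word type, estimated from the training data.
tags_by_word = defaultdict(Counter)
for word, tag in train:
    tags_by_word[word][tag] += 1
usual_tag = {w: c.most_common(1)[0][0] for w, c in tags_by_word.items()}


def baseline(word, tag):
    return word  # B: word identity, no context


def mystery(word, tag):
    # R: encodes the tag only when it is NOT the usual tag for this word type.
    return "0" if tag == usual_tag[word] else tag


def both(word, tag):
    return (mystery(word, tag), baseline(word, tag))  # [R; B]


def heldout_accuracy(featurize):
    probe = fit_lookup_probe([featurize(w, t) for w, t in train], [t for _, t in train])
    return accuracy(probe, [featurize(w, t) for w, t in test], [t for _, t in test])


perf_b, perf_r, perf_rb = map(heldout_accuracy, (baseline, mystery, both))
print("baselined:", perf_r - perf_b)     # 0.5 - 0.75 = -0.25
print("conditional:", perf_rb - perf_b)  # 1.0 - 0.75 = 0.25
```

The baselined number is negative even though the representation resolves exactly the ambiguous case; the conditional number credits it for that.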

Why do we call it conditional probing? Read on!

\(\mathcal{V}\)-information and probing

I’ve motivated, through the 3SAT and two primes examples, that how knowledge is encoded in a network relates to our ability to efficiently, or easily, extract information from the network. The intuition behind this idea is as old as probing itself in the NLP community. The idea of probing, or decoder studies, is much older than its use in NLP, and has been used, e.g., in neuroscience. (See Ivanova et al., 2021 for a survey of how insights from neuroscience may inform probing in NLP.)

Still, probing has so many knobs (what probe family should I use? how much training data?) and so much philosophical baggage (what does a probing result mean?) that it’s well worth figuring out what theoretical framework(s) make sense to place the experiments in.

Mutual information

Pimentel et al. (2020) proposed that the goal of probing is to estimate the mutual information between representation \(R\) and property label \(Y\). Mutual information is a measure of how much uncertainty in one random variable is removed when you learn the value of another. An immediate issue with this view is the data processing inequality, a point that Pimentel et al. (2020) note. The data processing inequality states that information can only be destroyed (or preserved) by a deterministic function–it cannot be created.

So, if a deterministic function \(\phi\) takes r.v. \(X\) as input, and one is attempting to predict r.v. \(Y\), the data processing inequality says that the mutual information \(I(X;Y)\) is an upper bound on \(I(\phi(X);Y)\), with equality when \(\phi\) is injective. Intuitively, one can think of the data processing inequality as resulting from imagining that there’s an agent trying to predict, e.g., \(Y\) from \(X\), and being able to use any deterministic transformation to construct a distribution over the space of \(Y\) from each value of \(X\). This includes, for example, possibly first mapping to \(\phi(X)\) and then performing more computation from there, so there can’t be any use in starting from \(\phi(X)\)!

To provide an extreme example, consider \(X\) as a 3SAT instance, and \(Y\) as its satisfiability. Let \(\phi(X)\) be the output of the unk box we’ve discussed. We have that \(I(\phi(X);Y)=I(X;Y)\); under mutual information, it’s just as useful to have the input 3SAT instance as it is to have the output of the unk box (that is, the satisfying (or non-satisfying) assignment, modulo a simple transformation). To predict \(Y\) from \(X\), we (probably) have exponential work left to do! Yet to predict \(Y\) from \(\phi(X)\) we have little remaining work to do. Separately, estimating mutual information is quite hard, despite bounds in absolute error; imagine that in this example, we’d have to solve 3SAT instances in order to know that \(X\) was predictive of \(Y\)!
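A small numerical check of the data processing inequality, as a hedged sketch: the four "instances" and their satisfiability labels are invented, `id_box` plays the role of an injective relabeling (like the id box), and `collapse` is a hypothetical lossy function included for contrast.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """I(X;Y) in bits for the empirical distribution of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Four hypothetical instances, each with a satisfiability label.
instances = ["f1", "f2", "f3", "f4"]
satisfiable = [1, 0, 1, 0]

id_box = {"f1": 9041, "f2": 17, "f3": 305, "f4": 58}   # injective: a unique id each
collapse = {"f1": 0, "f2": 0, "f3": 1, "f4": 1}        # lossy: merges instances

mi_input = mutual_information(list(zip(instances, satisfiable)))
mi_id = mutual_information([(id_box[x], y) for x, y in zip(instances, satisfiable)])
mi_lossy = mutual_information([(collapse[x], y) for x, y in zip(instances, satisfiable)])

print(mi_input, mi_id, mi_lossy)
```

The injective relabeling keeps all of the mutual information even though it makes satisfiability no easier to compute, which is exactly why mutual information cannot distinguish the id box from the unk box.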

Probing attempts to understand the structure of representations, to help us characterize and measure the ‘‘useful work’’ that is performed by \(\phi\) like BERT. So, mutual information is not a good theoretical basis for probing.


\(\mathcal{V}\)-information, introduced by Xu et al. (2020), is a theory of usable information when you hypothesize a constrained set of ways to predict one random variable from another. The set \(\mathcal{V}\) in the name is the predictive family, or set of functions one specifies that one is willing to use to try to predict, say, random variable \(Y\) from r.v. \(R\). Let’s add a bit of notation here: let \(\mathcal{R}\) be the space of \(R\) and \(\mathcal{P}(\mathcal{Y})\) be the space of probability distributions over the space of \(Y\). Then \(\mathcal{V}\) is a set of functions of type \(f: \mathcal{R}\rightarrow \mathcal{P}(\mathcal{Y})\). When \(\mathcal{V}\) is the set of all functions that map from \(\mathcal{R}\) to \(\mathcal{P}(\mathcal{Y})\), \(\mathcal{V}\)-information reduces to mutual information!

The set \(\mathcal{V}\) can be specified as, e.g., the set of log-linear predictors. It has a technical requirement, however–intuitively, \(\mathcal{V}\) needs to contain functions such that it can’t be worse to know \(R\) when predicting \(Y\) than it would’ve been if you didn’t know \(R\). This is called optional ignorance. Intuitively, you can think of it as being able to set the weights in the log-linear predictor (or rows of the first layer of the multi-layer perceptron) to zero, so as to ignore the value of the variable.
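The zero-weights intuition can be checked directly. In this sketch (illustrative weights and a hypothetical `ignore_columns` helper, not the paper's code), a log-linear predictor over the concatenation \([r; b]\) has its \(r\) columns zeroed; the result is still a log-linear predictor, and its output no longer depends on \(r\).

```python
from math import exp


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_linear(W, b, x):
    """softmax(Wx + b) for a weight matrix W given as a list of rows."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)])


def ignore_columns(W, ignored):
    """Copy of W with the given input columns zeroed out."""
    return [[0.0 if j in ignored else w for j, w in enumerate(row)] for row in W]


# Input is the concatenation [r; baseline]: r has 2 dims, baseline has 1.
W = [[1.5, -0.5, 2.0], [-1.0, 0.5, -2.0]]
b = [0.1, -0.1]
W_ignoring_r = ignore_columns(W, ignored={0, 1})

baseline_feature = [1.0]
p1 = log_linear(W_ignoring_r, b, [0.3, 0.9] + baseline_feature)
p2 = log_linear(W_ignoring_r, b, [-7.0, 4.0] + baseline_feature)
print(p1 == p2)  # the prediction no longer depends on r
```

Because such a member always exists in the family, the best predictor with access to \(r\) can never do worse than the best predictor without it.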

The formal specification of this requirement is quite technical; I’d skip over it unless you’re particularly interested:2

Definition (predictive family):
Let \(\Omega = \{f:\mathcal{X}_1\times\cdots\times\mathcal{X}_n \rightarrow \mathcal{P}(\mathcal{Y})\}\). We say that \(\mathcal{V} \subseteq \Omega\) is a predictive family if, for any partition of \(\mathcal{X}_1\times\cdots\times\mathcal{X}_n\) into \(\mathcal{C}, \bar{\mathcal{C}}\), we have

\[\forall f, x_1,\dots,x_n \in \mathcal{V}\times \mathcal{X}_1\times\cdots\times\mathcal{X}_n\\\exists f' \in \mathcal{V} : \forall \bar{c}' \in\bar{\mathcal{C}}, \ f(c,\bar{c}) = f'(c,\bar{c}')\]

Once one fixes a predictive family, our uncertainty in the value of \(Y\) after knowing \(R\) is formalized as the \(\mathcal{V}\)-entropy:

Definition (\(\mathcal{V}\)-entropy):
Let \(X_1,\dots,X_n\) be random variables in \(\mathcal{X}_1,\dots,\mathcal{X}_n\). Let \(C\in\mathcal{C}\) and \(\bar{C}\in\bar{\mathcal{C}}\) form a binary partition of \(X_1,\dots,X_n\). Let \(\bar{a}\in\bar{\mathcal{C}}\). Then the \(\mathcal{V}\)-entropy of \(Y\) conditioned on \(C\) is defined as

\[H_{\mathcal{V}}(Y|C) = \inf_{f\in\mathcal{V}}\mathbb{E}_{c,y}\big[-\log f(c,\bar{a})[y]\big],\]

where \(f(c,\bar{a})\) is overloaded to equal \(f(x_1,\dots,x_n)\) for the values of the \(x_i\) specified by \(c, \bar{a}\).

In the case of having one predictive variable \(R\), this definition means that functions in \(\mathcal{V}\) map from each element of \(\mathcal{R}\) to a distribution over \(\mathcal{Y}\). When we don’t know \(R\), an arbitrary placeholder (like the zero vector) is used to generate a distribution over \(Y\) independent of the value of \(R\).

With \(\mathcal{V}\)-entropy defined, \(\mathcal{V}\)-information quantities are defined analogously to mutual information:

Definition (\(\mathcal{V}\)-information):

\[I_{\mathcal{V}}(X_{i+1} \rightarrow Y | X_1,\dots,X_i) = H_{\mathcal{V}}(Y|X_1,\dots,X_i) - H_{\mathcal{V}}(Y|X_1,\dots,X_{i+1})\]

where we’ve overloaded notation to let \(H_{\mathcal{V}}(Y\vert X_1,\dots,X_i)\) mean \(H_{\mathcal{V}}(Y\vert\{X_1,\dots,X_i\}).\)

Probing estimates \(\mathcal{V}\)-information

The connection between probing and \(\mathcal{V}\)-information is transparent: taking the Perf function in baselined probing to be the negative cross-entropy, we have that baselined probing performance on representation \(R\) with baseline \(B\) and property \(Y\) estimates

\[\begin{align*} \text{Perf}(R) - \text{Perf}(B) &= -H_{\mathcal{V}}(Y|R) - (-H_{\mathcal{V}}(Y|B))\\ &= (H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y|R)) - (H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y|B))\\ &= I_{\mathcal{V}}(R\rightarrow Y) - I_{\mathcal{V}}(B\rightarrow Y) \end{align*}\]

This difference of \(\mathcal{V}\)-informations formalizes how much more extractable \(Y\) is from \(R\) than it is from \(B\). The role of \(\mathcal{V}\), then, is a hypothesis that one makes as to the functional form of the mapping between \(R\) (respectively, \(B\)) and \(Y\). Intuitively, smaller and more constrained \(\mathcal{V}\) lead to more parsimonious, interesting hypotheses as to how structure is encoded.

The probe family specifies a hypothesis about the functional form of the mapping from representation to property.

This formalizes what we knew from the 3SAT box example: the fact that there exists a simple mapping from the output of the box to the satisfying assignment of the example is what makes the output of the box interesting, despite our need to learn that mapping. On the other hand, if we needed to learn a complex mapping to get it to work, then the box wouldn’t have been useful.

Conditional \(\mathcal{V}\)-information probing

We now come to why we came up with the name conditional probing for our simple proposed method of concatenating representations. We have that conditional probing performance on representation \(R\) with baseline \(B\) and property \(Y\), where the performance measure is negative cross-entropy, estimates

\[\begin{align*} \text{Perf}([R;B]) - \text{Perf}(B) &= -H_{\mathcal{V}}(Y|R,B) - (-H_{\mathcal{V}}(Y|B))\\ &= H_{\mathcal{V}}(Y|B) - H_{\mathcal{V}}(Y|R,B)\\ &= I_{\mathcal{V}}(R\rightarrow Y|B), \end{align*}\]

so conditional probing explicitly estimates conditional \(\mathcal{V}\)-information. Nice! Thus, it has the interpretation as measuring what \(R\) captures about \(Y\) that isn’t captured by \(B\), under the functions in \(\mathcal{V}\).
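In practice these quantities are estimated from the probabilities that trained probes assign to the true labels on held-out data. The sketch below shows only that arithmetic; the probability lists are hypothetical numbers chosen to mirror the running example, and the function name `v_entropy_bits` is an assumption, not an API from the paper's code.

```python
from math import log2


def v_entropy_bits(true_label_probs):
    """Estimate a V-entropy as mean negative log2-probability of the true label.

    The trained probe stands in for the infimum over the family, and the
    held-out examples stand in for the expectation.
    """
    return sum(-log2(p) for p in true_label_probs) / len(true_label_probs)


# Hypothetical held-out probabilities each probe assigns to the correct label.
p_baseline = [0.5, 0.5, 0.5, 0.5]      # probe on B
p_repr = [0.25, 0.5, 0.25, 0.5]        # probe on R
p_joint = [0.5, 1.0, 0.5, 1.0]         # probe on [R; B]

h_b = v_entropy_bits(p_baseline)
h_r = v_entropy_bits(p_repr)
h_rb = v_entropy_bits(p_joint)

baselined = h_b - h_r      # estimates I_V(R -> Y) - I_V(B -> Y)
conditional = h_b - h_rb   # estimates I_V(R -> Y | B)
print(baselined, conditional)
```

With these made-up numbers the baselined quantity is negative while the conditional one is positive, the same pattern as the part-of-speech example.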


We ran a small suite of experiments on English language data to explore how information that’s not accessible in the input non-contextual word embedding layer of deep networks is constructed throughout the remaining layers. That is, while considerable probing work has used the input layer as a baseline, we ask whether measuring only information not in the input layer changes our understanding of the accessibility of features throughout the rest of the network.

We ran experiments on part-of-speech tagging on a relatively general 17-tag set (upos) and an English-specific Penn Treebank tagset (xpos), on Universal Dependencies labeling (dep rel), named entity recognition (ner), and binary sentiment analysis (sst2). The data came from OntoNotes and the Stanford Sentiment Treebank (but the GLUE version); take a peek at the paper for data specifics.

Let \(\phi_i(X)_j\) be the neural representation from a model at layer \(i\) and token \(j\). We specify the probe family \(\mathcal{V}\) as the set of log-linear predictors

\[\begin{align*} f(\phi_i(X)_j) = \text{softmax}(W\phi_i(X)_j+b) \end{align*}\]

due to its relative simplicity and inability to learn non-linear combinations of features. For sst2, a sentence-level classification task, we average across \(j\) before applying the linear transformation.
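For concreteness, the forward pass of this probe family might look like the following sketch. The weights and token vectors are illustrative placeholders, and `sentence_probe` is a hypothetical name for the sst2 variant that averages over tokens before the linear map.

```python
from math import exp


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_linear_probe(W, b, vector):
    """Token-level probe: softmax(W * phi_i(X)_j + b)."""
    return softmax([sum(w * v for w, v in zip(row, vector)) + bk for row, bk in zip(W, b)])


def mean_pool(token_vectors):
    """Average the token representations dimension by dimension."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def sentence_probe(W, b, token_vectors):
    """Sentence-level probe: average across tokens j, then apply the same form."""
    return log_linear_probe(W, b, mean_pool(token_vectors))


# Illustrative 2-class probe over 2-dimensional token representations.
W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
tokens = [[1.0, 3.0], [3.0, 5.0]]

print(mean_pool(tokens))
print(sentence_probe(W, b, tokens))
```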

ELMo experiments

ELMo is a two-layer pretrained bidirectional language model. Hewitt and Liang (2019) conjectured that while part-of-speech is better-decodable from the first layer than from the second, differences in ease of memorization from the word identity may explain these differences. Using our conditional probes, we can measure only what’s not explainable by the input layer, answering Hewitt and Liang (2019)’s conjecture.

[Figure: baselined and conditional probing results for ELMo layers 1 and 2.]

The units of the results are in bits of \(\mathcal{V}\)-information; higher is better. The baselined information quantities replicate the result that part-of-speech is in fact more accessible in layer \(1\) of ELMo than in layer \(2\). When we condition on the input layer, the difference between the two layers is cut roughly in half. So, the answer to the conjecture is that the difference between the layers is still there if just considering the ambiguous cases in part-of-speech, but it’s much smaller! The difference between the layers is lessened for dep rel as well, but not for the other properties.

RoBERTa experiments

RoBERTa (base) is a 12-layer Transformer encoder model. The following plots show our results for each of the linguistic properties.

[Figures: layerwise probing results on RoBERTa for upos, xpos, dep rel, ner, and sst2.]

While RoBERTa is deeper than ELMo, the results are qualitatively similar to ELMo. The aspects of part-of-speech not attributable to the input layer are accessible much deeper into RoBERTa than previous (baselined) probing results would have indicated. For dependency relations, conditioning on the input layer makes the differences between the other layers less substantial. For the other properties, conditioning on the input layer doesn’t change the layerwise trends.

Concluding thoughts

The extent to which it’s easier to predict a property given (the representations of) a system than it is to do so without the (representations of the) system is a characterization of what the system knows. Intuitively, we may believe the system knows the thing because it can help teach it. This is not an agentive argument; one could equally say that a book knows (or contains) knowledge, or an information retrieval system.

Intuitively, we may believe the system knows the thing because it can help teach it, but one could equally say that of a book.

Probing as a methodology can help us characterize the knowledge of a system in a fundamental way – when the system wasn’t explicitly built to allow access to that knowledge, and so some kind of ‘‘translation’’ from the representations or behavior of the system to the properties we’re testing for needs to be performed. The type of translation, or decoding, we allow is the probe family; using supervision allows us to pick the member of the probe family via, e.g., gradient descent. The probe family is a hypothesis as to how knowledge is structured in a representation; finding parsimonious probe hypotheses, how complex properties may emerge via simple probe families, is a fascinating line of research.

Conditional probing allows us to explicitly measure only the information not accessible in a baseline, allowing us more precise control over our probing experiments.

Finally, \(\mathcal{V}\)-information is a theory of usable information that well describes both the methodology and aims of probing.

I hope you enjoyed reading this blog post, and I’d be happy to hear your thoughts on it!


  1. The 3SAT part refers to the form of the boolean expression: a conjunction of clauses in which each clause is a disjunction of three literals (variables or their negations).

  2. This specification of optional ignorance is from our EMNLP paper, not from Xu et al. (2020). We extended the \(\mathcal{V}\)-information of Xu et al. (2020) to the setting of multiple predictive variables, and needed to re-define optional ignorance to do so. In our appendix, we show that it is equivalent to theirs in the single-variable setting. 
