Commentary/Errata on selected papers
This page contains my answers to questions about my papers, general comments that may be inappropriate to include in the actual paper, and references to related and follow-up works.
The equation before (7): should be changed to .
Q: Did you have any experiments with the regularized LR? I don't see any
“With the right regularization parameters and bigram features, our plain LR baseline is itself quite strong relative to previous work.” I did not mean plain as in unregularized. The provided code scans over the L2 regularization parameter to show that no choice of L2 strength beats this Gaussian dropout, at least on some datasets…
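As an illustration of what such a scan looks like, here is a minimal sketch. It is not the provided code: the function names, the gradient-descent solver, and the grid of strengths are all assumptions made for this example.

```python
import numpy as np

def fit_l2_logreg(X, y, l2, steps=500):
    """Fit L2-regularized logistic regression by plain gradient descent.

    Hypothetical helper for illustration; labels y are 0/1.
    """
    n, d = X.shape
    # Step size 1/L, where L bounds the curvature of the regularized loss.
    lipschitz = 0.25 * np.linalg.norm(X, 2) ** 2 / n + l2
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / n + l2 * w
        w -= grad / lipschitz
    return w

def accuracy(w, X, y):
    """Fraction of examples whose predicted label matches y."""
    return float(np.mean(((X @ w) > 0) == (y == 1)))

def scan_l2(X_train, y_train, X_val, y_val, grid):
    """Train once per L2 strength and report validation accuracy for each."""
    results = {}
    for l2 in grid:
        w = fit_l2_logreg(X_train, y_train, l2)
        results[l2] = accuracy(w, X_val, y_val)
    best = max(results, key=results.get)
    return best, results
```

The best validation accuracy over the grid is then the number to compare against the dropout result on the same split.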
Q: What is MC Dropout, and Real Dropout?
Sorry about the inconsistency. MC means Monte Carlo, and Real means using MC to do real dropout.
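To make the Monte Carlo side concrete, here is a minimal sketch. It assumes a single linear unit with weights `w`, input `x`, and keep probability `p`; these names and the helper functions are hypothetical and not taken from the released code. It samples Bernoulli dropout masks and compares the resulting pre-activations with the mean and variance that a Gaussian approximation would use.

```python
import numpy as np

def dropout_input_moments(w, x, p):
    """Mean and variance of sum_i w_i x_i z_i with z_i ~ Bernoulli(p), independent."""
    contrib = w * x
    mean = p * contrib.sum()
    var = p * (1.0 - p) * np.sum(contrib ** 2)
    return mean, var

def mc_dropout_inputs(w, x, p, n_samples, rng):
    """Sample the same sum by drawing explicit dropout masks (Monte Carlo)."""
    masks = (rng.random((n_samples, x.size)) < p).astype(float)
    return masks @ (w * x)

rng = np.random.default_rng(0)
w = rng.normal(size=200)
x = rng.random(200)
mean, var = dropout_input_moments(w, x, p=0.5)
samples = mc_dropout_inputs(w, x, p=0.5, n_samples=20000, rng=rng)
print(mean, samples.mean())  # the two means should agree closely
print(var, samples.var())    # and so should the two variances
```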
Q: Where does the approximation formula (7) come from?
I got this trick from this paper:
MacKay, David J.C. The evidence framework applied to classification networks
Firstly, we stress that this trick is non-essential to the main point of the Fast Dropout paper, since accurately computing the value of any smooth function in 1D or 2D is probably quite easy by tabulating and interpolating.
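However, the trick is quite interesting, and it does give us some insight into the effect of dropout. So here is how it goes: Let $\Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt$ be the Gaussian cumulative distribution, with $\phi(x) = \mathcal{N}(x \mid 0, 1)$ being the Gaussian density. The main point is that we have the following integral (Eq. 1):

$$\int_{-\infty}^{\infty} \Phi(\lambda x)\,\mathcal{N}(x \mid \mu, s^2)\,dx = \Phi\left(\frac{\mu}{\sqrt{\lambda^{-2} + s^2}}\right)$$

The substitution rule (chain rule, since $\Phi'(x) = \phi(x)$) suggests that the above can be evaluated analytically. So we substitute $x = \mu + s z$, and we get $\int_{-\infty}^{\infty} \Phi(\lambda(\mu + s z))\,\phi(z)\,dz$, so if we differentiate with respect to $\mu$ we get:

$$\int_{-\infty}^{\infty} \lambda\,\phi(\lambda(\mu + s z))\,\phi(z)\,dz$$

Since the product of two Gaussians is a Gaussian (in $z$), the above integral is just the normalization constant of the Gaussian density in $z$, and this constant is a Gaussian density function in $\mu$, namely $\mathcal{N}(\mu \mid 0, \lambda^{-2} + s^2)$ (a few lines of algebra omitted, which may be a good exercise). Lastly, we can integrate back with respect to $\mu$ to get another Gaussian cumulative distribution in $\mu$ (Eq. 1).

So far everything is exact, and now we make the approximation that $\sigma(x) \approx \Phi(\sqrt{\pi/8}\,x)$, i.e. $\lambda = \sqrt{\pi/8}$, to get the desired approximation: $\int_{-\infty}^{\infty} \sigma(x)\,\mathcal{N}(x \mid \mu, s^2)\,dx \approx \Phi\left(\frac{\mu}{\sqrt{8/\pi + s^2}}\right) \approx \sigma\left(\frac{\mu}{\sqrt{1 + \pi s^2/8}}\right)$. If one were to use probit regression instead of logistic regression, then this whole chain is exact. Page 12 of the slides here plots the errors. However, the inaccuracy from making the Gaussian assumption is a lot larger than the error from this approximation, so this is not at all the weakest link.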
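A quick numerical check of this chain, as a minimal sketch: it assumes the input is x ~ N(mu, s^2), writes the probit scale as `lam`, and uses Gauss-Hermite quadrature as the reference value. The function names and test values are chosen here for illustration. The probit identity should hold to quadrature precision, while the sigmoid version is only approximate.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def Phi(x):
    """Standard Gaussian cumulative distribution, via the error function."""
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / math.sqrt(2.0)))

def gauss_expect(f, mu, s, deg=100):
    """E[f(x)] for x ~ N(mu, s^2) by Gauss-Hermite quadrature.

    Adequate for moderate s; accuracy degrades when f(mu + s*z) is very steep.
    """
    nodes, weights = np.polynomial.hermite.hermgauss(deg)
    values = f(mu + math.sqrt(2.0) * s * nodes)
    return float(np.sum(weights * values) / math.sqrt(math.pi))

def probit_closed_form(mu, s, lam):
    """Right-hand side of the exact Gaussian-CDF identity."""
    return float(Phi(mu / math.sqrt(lam ** -2 + s ** 2)))

def sigmoid_approx(mu, s):
    """Closed-form approximation to E[sigmoid(x)] with lam = sqrt(pi/8)."""
    return float(sigmoid(mu / math.sqrt(1.0 + math.pi * s ** 2 / 8.0)))

mu, s, lam = 1.0, 1.5, math.sqrt(math.pi / 8.0)
print(gauss_expect(lambda x: Phi(lam * x), mu, s), probit_closed_form(mu, s, lam))
print(gauss_expect(sigmoid, mu, s), sigmoid_approx(mu, s))
```

The first pair of numbers should match essentially exactly. The second pair differs slightly, and that gap is the error introduced by replacing the sigmoid with a scaled Gaussian cumulative distribution.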
Baselines and Bigrams
Q: The data structure seems weird, why is it not just a sparse design matrix?
All the presented algorithms indeed just use a sparse design matrix as input. That is, these bag-of-words models do not make use of the order in which words appear. But the .mat data being loaded in does contain order information. It gets a bit weirder with bigrams, which I regret in retrospect, but oh well…
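For readers who want the plain design matrix, here is a minimal sketch of the conversion. It assumes each document is stored as an ordered list of integer word ids, which is a hypothetical layout; the actual fields of the .mat files are not described here. Counts are returned as (row, column, value) triplets, which is the usual input for building a sparse matrix.

```python
from collections import Counter

def unigram_triplets(docs):
    """Collapse ordered word-id sequences into sparse count triplets."""
    rows, cols, vals = [], [], []
    for i, doc in enumerate(docs):
        for word_id, count in sorted(Counter(doc).items()):
            rows.append(i)
            cols.append(word_id)
            vals.append(count)
    return rows, cols, vals

def bigram_docs(docs, bigram_ids=None):
    """Map each adjacent word pair to an integer id, assigned on first sight."""
    bigram_ids = {} if bigram_ids is None else bigram_ids
    out = []
    for doc in docs:
        ids = [bigram_ids.setdefault(pair, len(bigram_ids))
               for pair in zip(doc, doc[1:])]
        out.append(ids)
    return out, bigram_ids

docs = [[0, 1, 0, 2], [2, 2]]
print(unigram_triplets(docs))  # ([0, 0, 0, 1], [0, 1, 2, 2], [2, 1, 1, 2])
print(bigram_docs(docs)[0])    # [[0, 1, 2], [3]]
```

Once the bigrams have ids, `unigram_triplets` can be applied to the bigram sequences as well, so word order is only needed for this one preprocessing step.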