Dropout training (Hinton et al., 2012) aims to prevent feature co-adaptation in logistic models by randomly dropping out (zeroing) features during supervised training. We show how to speed up dropout training by sampling from, or integrating, a Gaussian approximation instead of performing Monte Carlo optimization. We also show that dropout performs a form of adaptive regularization and is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We generalize this regularizer to structured prediction for sequence models and demonstrate its applicability to semi-supervised learning.
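To make the Gaussian approximation concrete, the sketch below compares two estimates of the expected logistic output under dropout for a single input: one averages over sampled dropout masks (Monte Carlo), the other matches the mean and variance of the dropped-out pre-activation to a Gaussian and integrates it approximately in closed form. It is an illustration under stated assumptions, not the procedure of the paper itself: the function names, the keep probability of 0.5, the synthetic weights and inputs, and the use of the standard approximation sigmoid(mu / sqrt(1 + pi * var / 8)) for the Gaussian integral are all choices made here.

```python
import numpy as np


def dropout_preactivation_moments(w, x, keep_prob):
    """Mean and variance of s = sum_i w_i * z_i * x_i, z_i ~ Bernoulli(keep_prob)."""
    wx = w * x
    mean = keep_prob * wx.sum()
    var = keep_prob * (1.0 - keep_prob) * (wx ** 2).sum()
    return mean, var


def expected_sigmoid_gaussian(mean, var):
    """Approximate E[sigmoid(s)] for s ~ N(mean, var).

    Uses the common closed-form approximation
    sigmoid(mean / sqrt(1 + pi * var / 8)); this choice is an assumption
    of the sketch.
    """
    scaled = mean / np.sqrt(1.0 + np.pi * var / 8.0)
    return 1.0 / (1.0 + np.exp(-scaled))


def expected_sigmoid_monte_carlo(w, x, keep_prob, n_samples, rng):
    """Estimate E[sigmoid(s)] by averaging over sampled dropout masks."""
    masks = rng.random((n_samples, x.size)) < keep_prob
    s = (masks * (w * x)).sum(axis=1)
    return (1.0 / (1.0 + np.exp(-s))).mean()


# Synthetic example (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
d = 200
w = rng.normal(scale=0.1, size=d)
x = rng.normal(size=d)

mean, var = dropout_preactivation_moments(w, x, keep_prob=0.5)
approx = expected_sigmoid_gaussian(mean, var)
mc = expected_sigmoid_monte_carlo(w, x, keep_prob=0.5, n_samples=20000, rng=rng)
print(f"Gaussian approximation: {approx:.4f}  Monte Carlo: {mc:.4f}")
```

The Gaussian route needs only two sums over the features per example, whereas the Monte Carlo estimate needs one pass per sampled mask. That difference in cost is the kind of saving the abstract refers to.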
For any comments or questions, please feel free to email Sidaw at Stanford.edu.