The Stanford Question Answering Dataset (SQuAD) is a reading comprehension benchmark with an active and highly competitive leaderboard. Over 17 industry and academic teams have submitted their models (with executable code) since SQuAD’s release in June 2016, spurring the development of novel deep learning architectures that outperform baseline models by wide margins. As teams compete to build the best machine comprehension system, the challenge of matching human-level performance remains open.
SQuAD is a unique large-scale benchmark in that it uses a hidden test set for official evaluation of models. Teams submit their executable code, which is then run on a test set that is not publicly readable. Such a setup preserves the integrity of the test results. Models can be rerun on new test sets, either to obtain tighter confidence bounds on performance or to evaluate how well a model generalizes to new domains. Another advantage of having teams submit executable code is that models can be ensembled to further boost performance, so that the weaknesses of one model are offset by the strengths of another. But accepting arbitrary code poses technical challenges: different programs expect different arguments and command-line options, they often require custom environments and library dependencies, and some models may involve running multiple programs in a sequential pipeline.
This is where CodaLab comes in. CodaLab is an online platform for collaborative and reproducible computational research. With CodaLab Worksheets, you can run your jobs on a cluster and document and share your experiments, all while keeping track of full provenance. The system exposes a simple command-line interface, with which you can upload your code and data as well as submit jobs to run them (see SQuAD data worksheet here). A job consists of 1) a Docker image, containing the environment in which to run your code, 2) a set of dependencies, i.e. the code and data to load into the Docker container where your job is run, and 3) the shell command to run inside this container. The files generated by a job can then be loaded into subsequent jobs as dependencies themselves. All this metadata about your jobs not only allows you to maintain a record of how you ran your code, but also enables others to reproduce your experiments, or even to rerun your pipelines on new datasets by substituting in new dependencies.
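The three pieces of a job described above can be illustrated with a minimal sketch. This is our own toy model, not CodaLab's actual API, and the image, bundle, and file names below are hypothetical:

```python
from dataclasses import dataclass, field

# Illustrative only: a minimal model of the three components of a CodaLab job
# described above. The class and field names are our own, not CodaLab's API.
@dataclass
class Job:
    docker_image: str                                  # 1) environment to run in
    dependencies: dict = field(default_factory=dict)   # 2) name -> code/data bundle
    command: str = ""                                  # 3) shell command run in the container

# A hypothetical SQuAD prediction job: mount a code bundle and a data bundle
# into the container, then run the model on the development set.
job = Job(
    docker_image="python:2.7",  # hypothetical image name
    dependencies={"src": "squad-model-code", "data": "squad-dev-v1.1"},
    command="python src/predict.py data/dev-v1.1.json predictions.json",
)
print(job.docker_image)
```

The output file (here, `predictions.json`) would itself become a bundle that a downstream evaluation job can declare as a dependency.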
These features allow us to run arbitrary code submissions for the SQuAD leaderboard, all while keeping the test set secret using CodaLab access control lists. Once a team has uploaded their code to CodaLab and successfully constructed jobs running the code on the public development dataset, we can reproduce the run by simply substituting the hidden test set for the development dataset. The results can then be queried using the CodaLab REST API to construct a live leaderboard on the web.
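To illustrate the last step, here is a sketch of how a live leaderboard might be assembled from evaluation results. The JSON shape and scores below are hypothetical; the actual CodaLab REST API returns bundle metadata in its own schema:

```python
# Sketch: building a leaderboard from evaluation results fetched via a REST API.
# The record shape and values below are invented for illustration; in practice
# each entry would be parsed from the CodaLab REST API's bundle metadata.
results = [
    {"team": "team-a", "em": 68.2, "f1": 77.5},
    {"team": "team-b", "em": 61.0, "f1": 70.3},
    {"team": "team-c", "em": 55.4, "f1": 65.1},
]

# Rank entries by F1 score, descending, as the SQuAD leaderboard does.
leaderboard = sorted(results, key=lambda r: r["f1"], reverse=True)
for rank, entry in enumerate(leaderboard, start=1):
    print(f"{rank}. {entry['team']}: EM={entry['em']}, F1={entry['f1']}")
```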
Last month, the CodaLab team and course staff organized a competition on SQuAD for Stanford’s popular CS224N (Natural Language Processing with Deep Learning) course. 162 student teams (with 1-3 students each) competed in a tight, four-week sprint to apply their knowledge of deep learning for natural language processing to a real-world challenge task: SQuAD. CodaLab was employed for automated running and evaluation of the student submissions on the hidden test set, and a real-time online leaderboard that interfaced with CodaLab was set up to give instantaneous feedback on submissions. Within a few short weeks, many student teams managed to surpass a competitive EM/F1 score of 60/70, and the very top teams rivaled entries on the external SQuAD leaderboard. The top student submission, at 77.5 F1, would have been a top 3 score on the leaderboard only 3 months ago -- not bad for a 4-week course!
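For readers unfamiliar with the EM/F1 numbers above: SQuAD scores a prediction by exact match (EM) against a reference answer after normalization, and by token-level F1 overlap. Here is a simplified sketch of these two metrics (our own implementation, not the official evaluation script, which also takes the maximum over multiple reference answers):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace
    (mirroring the normalization used by the SQuAD evaluation)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))               # 1.0
print(round(f1_score("the Eiffel Tower in Paris", "Eiffel Tower"), 2))  # 0.67
```

A leaderboard EM/F1 pair like 60/70 is each metric averaged over all questions in the hidden test set.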
We are grateful to Microsoft for their support of CodaLab and for giving students free GPU computing resources on Microsoft Azure, allowing them to build and test complex deep learning models on the large SQuAD dataset.