Code/Data for the paper: Baselines and Bigrams: Simple, Good Sentiment and Topic Classification

Code: scr_nb_acl12.zip

Data: data_nb_acl12 (404.4MB)

Very brief overview of the idea

The paper: compareacl.pdf

Slides of my talk given at ACL 2012: pptx, pdf

Follow up:

At least a couple more ACL/EMNLP 2012 papers also published results on these datasets that are no better than a linear classifier with bigrams, highlighting the importance of establishing these baselines.

Readme

Data, not including the huge IMDB dataset: data_nb_acl12 (108.5MB)

- Make sure liblinear is in the path, or modify the first line of master.m

- The directory structure should be your_folder/scr and your_folder/data

- Put the data directory in parallel with the code directory

- Run master to produce the results from the paper

- Results and details are logged in resultslog.txt and details.txt

- A table with all results will be printed to the screen after master completes

- The misc folder contains various data processing code and other research code that is not quite cleaned up

- The data folder contains datasets collected by others; please cite the original sources if you work with them
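The layout described in the steps above can be checked before running master. The following is only a minimal sketch in Python, not part of the released MATLAB code; the function name check_layout is hypothetical, and the only thing taken from this readme is that your_folder should contain scr and data side by side.

```python
from pathlib import Path


def check_layout(root):
    """Return a list of problems found under root.

    An empty list means both expected directories (scr and data)
    exist side by side. check_layout is a hypothetical helper, not
    something shipped with the code.
    """
    root = Path(root)
    problems = []
    for name in ("scr", "data"):
        if not (root / name).is_dir():
            problems.append(f"missing directory: {root / name}")
    return problems
```

Calling check_layout("your_folder") returns an empty list when the layout matches, and one message per missing directory otherwise.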

Note in 2013: I realize that this code is quite messy and does not conform to the typical ML pipeline. One reason is that it is beneficial to keep the word-order information of each document instead of converting it to bag-of-words right away, although all methods here do end up converting it to bag-of-words. Vectorized, cleaner code should follow with a later project.
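To illustrate the point about order information: bigram features can only be counted while the tokens are still in their original order, so the conversion to a bag of features has to come after that step. The sketch below is in Python rather than MATLAB and is not the released code; the whitespace tokenizer and the names tokenize and ngram_counts are assumptions made for illustration only.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, keeping the original token order.

    This simple tokenizer is an assumption for illustration only.
    """
    return text.lower().split()


def ngram_counts(tokens, max_n=2):
    """Count every n-gram of length 1..max_n in an ordered token list.

    The result is a bag of unigrams and bigrams: order inside each
    n-gram is kept, order between n-grams is discarded.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

For example, ngram_counts(tokenize("not good not bad")) counts "not good" and "not bad" as separate features, which a unigram-only bag built directly from the text could not distinguish from "good" and "bad" alone.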