HistWords: Word Embeddings for Historical Text
William L. Hamilton, Jure Leskovec, Dan Jurafsky

Detailed data description

Each corpus zip-file includes the following post-processed historical word vectors, along with some useful historical statistics:

  • Word2Vec (SGNS) embeddings (/sgns), which can be loaded with the SequentialEmbedding class.
  • SVD-based embeddings (/svd), which can be loaded with the Embedding class.
  • PPMI-based embeddings (/ppmi), which can be loaded with the Explicit class.
  • Historical POS-tags (/pos), including the majority tag per word (/pos/[year]-pos.pkl) and the POS counts per word (/pos/[year]-pos_counts.pkl). Everything is stored as standard Python pickles.
  • Historical word frequencies per-decade (/freqs.pkl) and averaged over the entire corpus (/avg_freqs.pkl).
  • Frequency-rank sorted lists of all words (/word_lists/full-all.pkl) and of all words after removing stop-words and proper nouns (/word_lists/full-nstop_nproper.pkl).
  • Polysemy scores for the top-10,000 words (excluding stop-words and proper nouns), computed by measuring how clustered each word's co-occurrences are (/netstats/full-nstop_nproper-top10000.pkl).
  • Scores for how much the semantics of the top-10,000 non-stop, non-proper-noun words changed between consecutive decades (i.e., rates of change; /volstats/vols.pkl) and between each decade and the last decade in the data (i.e., how different a word's semantics is from the present; /volstats/disps.pkl). Both files contain cosine similarity scores, so they need to be converted to distances (e.g., via the arccosine).
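
The statistics files above are all standard Python pickles and can be read with the `pickle` module. A minimal sketch follows; the per-word structure shown for the frequency data ({word: {decade: relative frequency}}) and the mock values are assumptions for illustration, not part of the released data, and the example round-trips a mock file rather than assuming a particular corpus path:

```python
import os
import pickle
import tempfile

# Assumed layout of a freqs.pkl-style file: {word: {decade: relative frequency}}.
# These values are mock data used only to illustrate reading the pickle.
mock_freqs = {"gay": {1900: 1.2e-5, 1990: 3.4e-5}}

path = os.path.join(tempfile.mkdtemp(), "freqs.pkl")
with open(path, "wb") as f:
    pickle.dump(mock_freqs, f)

# Reading works the same way for the pos, freqs, word_lists,
# netstats, and volstats pickles in each corpus zip-file.
with open(path, "rb") as f:
    freqs = pickle.load(f)
```

The same two-line `open`/`pickle.load` pattern applies to every `.pkl` file listed above; only the structure of the loaded object differs per file.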