src.corpora.auto

auto.py

Default Dataset/Corpus Utilities. Downloads (if necessary) from the Hugging Face datasets Hub, and organizes into de-facto training, validation, and testing sets. Performs additional tokenization and normalization as well.

Functions

auto_detokenize

build_indexed_dataset

Builds Indexed Datasets from a Dataset Dictionary.

get_auto_dataset

Run basic tokenization and grouping to turn a Hugging Face Dataset (via datasets) into a torch.Dataset.

get_lambada

Run special tokenization and grouping for the Lambada dataset.
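
The "grouping" step mentioned for get_auto_dataset is not spelled out on this page. As a rough illustration only, the sketch below shows one common way to group tokenized text for language modeling: concatenate all token sequences and cut the result into fixed-length blocks. The function name group_texts, the block_size parameter, and the "input_ids" key are assumptions made for this example and are not taken from auto.py, which may do this differently.

```python
from itertools import chain
from typing import Dict, List


def group_texts(
    examples: Dict[str, List[List[int]]], block_size: int
) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: concatenate token sequences, then split into blocks.

    This is an assumed sketch of a typical grouping step, not the code in auto.py.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if not examples:
        return {}

    # Flatten the list of sequences under each key into one long sequence.
    concatenated = {key: list(chain.from_iterable(seqs)) for key, seqs in examples.items()}

    # Use the length of the first key; all keys are assumed to be aligned.
    total_length = len(next(iter(concatenated.values())))

    # Drop the trailing remainder so that every block has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size

    return {
        key: [seq[i : i + block_size] for i in range(0, total_length, block_size)]
        for key, seq in concatenated.items()
    }


if __name__ == "__main__":
    batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
    print(group_texts(batch, block_size=4))
```

With a Hugging Face Dataset, a function of this shape would typically be applied through a batched map call after tokenization. Dropping the remainder keeps every block the same length at the cost of discarding a few trailing tokens.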