src.corpora.autoΒΆ
auto.py
Default Dataset/Corpus Utilities. Downloads (if necessary) from the Hugging Face datasets Hub, and organizes into de-facto training, validation, and testing tests. Performs additional tokenization and normalization as well.
Functions
Builds Indexed Datasets from a Dataset Dictionary. |
|
Run basic tokenization and grouping to turn a Hugging Face Dataset (via datasets) into a torch.Dataset. |
|
Run special tokenization and grouping for the Lambada dataset. |