src.corpora.tokenization_utils

Functions

batch_tokenize

Yields batches of tokenized sentences from the given dataset.

batched

Yields batches of the given size from the given iterable.

concatenate_and_group_texts

Concatenates the texts in a batch and groups them together.
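
The behaviour described for batched can be illustrated with a minimal sketch. This is not the module's actual implementation: the signature, the parameter name batch_size, and the choice to yield a shorter final batch are all assumptions made for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(iterable: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` items from `iterable`.

    Assumption: the last batch may be shorter than `batch_size`
    when the iterable does not divide evenly.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(iterable)
    while True:
        # islice consumes at most batch_size items from the shared iterator.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Because the sketch pulls from a single iterator, it works on generators and other streaming sources without loading the whole dataset into memory.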

Classes

SeededShufflerIterDataPipe

Very similar to ShufflerIterDataPipe, but takes a seed and ignores the set_shuffle_settings mechanism.
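
The idea of a seeded, buffer-based shuffle over a stream can be sketched in plain Python. This is a hypothetical illustration only: the function name seeded_buffer_shuffle, its parameters, and the buffer strategy are assumptions, and it does not reproduce the class's real interface or its interaction with PyTorch data pipes.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def seeded_buffer_shuffle(
    items: Iterable[T], buffer_size: int, seed: int
) -> Iterator[T]:
    """Shuffle a stream through a fixed-size buffer, reproducibly.

    Hypothetical helper: a private random.Random(seed) makes the output
    order depend only on the input, the buffer size, and the seed.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


first = list(seeded_buffer_shuffle(range(10), buffer_size=4, seed=0))
second = list(seeded_buffer_shuffle(range(10), buffer_size=4, seed=0))
print(first == second)  # True
```

Using a dedicated random.Random instance rather than the global random state keeps the order independent of any other code that draws random numbers, which is the usual reason to pass a seed explicitly.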