src.corpora.tokenization_utils.concatenate_and_group_texts

concatenate_and_group_texts(encoding: BatchEncoding, seq_len: int, stride: Optional[int] = None, drop_remainder: bool = True, mask_stride_overlap=True) → Iterator[BatchEncoding]

Concatenates the tokenized texts in a batch and groups them into sequences of at most seq_len tokens. Typically, you’ll want to use this with a fairly large set of texts, e.g. 1000 docs.

For test data, set mask_stride_overlap to True and drop_remainder to False.

Args:
    encoding: The batch of texts to concatenate and group.
    seq_len: The max length of sequences to emit.
    stride: The stride to use when grouping texts. If None, then the stride is set to seq_len.
    mask_stride_overlap: Whether to mask out overlapping tokens if we’re using a stride.
    drop_remainder: Whether to drop the last batch if its length is not a multiple of seq_len.

Returns:
    An iterator of tokenized texts, one at a time.
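
The grouping behaviour described above can be illustrated with a minimal sketch. This is not the library's implementation: it works on plain lists of token ids rather than a BatchEncoding, and the names group_token_ids and loss_mask are hypothetical, chosen only for illustration. It also assumes that "masking the stride overlap" means giving a zero mask to tokens that an earlier window already covered.

```python
from itertools import chain
from typing import Dict, Iterator, List, Optional


def group_token_ids(
    docs: List[List[int]],
    seq_len: int,
    stride: Optional[int] = None,
    drop_remainder: bool = True,
    mask_stride_overlap: bool = True,
) -> Iterator[Dict[str, List[int]]]:
    """Hypothetical stand-in: concatenate token lists and emit fixed-size windows."""
    if stride is None:
        stride = seq_len  # no overlap between consecutive windows
    if not 0 < stride <= seq_len:
        raise ValueError("stride must satisfy 0 < stride <= seq_len")

    flat = list(chain.from_iterable(docs))  # one long token stream
    overlap = seq_len - stride  # tokens shared with the previous window

    for start in range(0, len(flat), stride):
        window = flat[start:start + seq_len]
        if len(window) < seq_len and drop_remainder:
            break  # discard the short trailing window

        # Assumption: tokens already emitted by the previous window get mask 0.
        n_seen = overlap if (start > 0 and mask_stride_overlap) else 0
        n_seen = min(n_seen, len(window))
        mask = [0] * n_seen + [1] * (len(window) - n_seen)
        yield {"input_ids": window, "loss_mask": mask}

        if start + seq_len >= len(flat):
            break  # this window reached the end of the stream
```

Under these assumptions, calling the sketch with a stride smaller than seq_len, drop_remainder=False and mask_stride_overlap=True yields overlapping windows in which every token carries a mask of 1 exactly once, which is the property the test-data recommendation above is after.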